Using Mise-en-Scène Visual Features based on MPEG-7
and Deep Learning for Movie Recommendation
Yashar Deldjoo
Politecnico di Milano
Via Ponzio 34/5
20133, Milan, Italy
yashar.deldjoo@polimi.it
Massimo Quadrana
Politecnico di Milano
Via Ponzio 34/5
20133, Milan, Italy
massimo.quadrana@polimi.it
Mehdi Elahi
Politecnico di Milano
Via Ponzio 34/5
20133, Milan, Italy
mehdi.elahi@polimi.it
Paolo Cremonesi
Politecnico di Milano
Via Ponzio 34/5
20133, Milan, Italy
paolo.cremonesi@polimi.it
ABSTRACT
Item features play an important role in movie recommender
systems, where recommendations can be generated by using
explicit or implicit preferences of users on traditional fea-
tures (attributes) such as tag, genre, and cast. Typically,
movie features are human-generated, either editorially (e.g.,
genre and cast) or by leveraging the wisdom of the crowd
(e.g., tag), and as such, they are prone to noise and are ex-
pensive to collect. Moreover, these features are often rare or
absent for new items, making it difficult or even impossible
to provide good quality recommendations.
In this paper, we show that users' preferences on movies
can be better described in terms of Mise-en-Scène features,
i.e., the visual aspects of a movie that characterize its
design, aesthetics and style (e.g., colors, textures). We use
both MPEG-7 visual descriptors and Deep Learning hidden
layers as examples of mise-en-scène features that can visually
describe movies. Interestingly, mise-en-scène features can be
computed automatically from video files or even from trailers,
offering more flexibility in handling new items, avoiding
the need for costly and error-prone human-based tagging,
and providing good scalability.
We have conducted a set of experiments on a large catalogue
of 4K movies. Results show that recommendations based on
mise-en-scène features consistently provide the best performance
compared to richer sets of more traditional features, such as
genre and tag.
Keywords
mpeg-7 features, movie recommendation, visual, deep learning
1. INTRODUCTION
Multimedia recommender systems base their recommen-
dations on human-generated content features which are ei-
ther crowd-sourced (e.g., tag) or editorial-generated (e.g.,
genre, director, cast). The typical approach is to recom-
mend items sharing features with the other items the user
liked in the past.
In the movie domain, information about movies (e.g., tag,
genre, cast) can be exploited to either use Content-Based
Filtering (CBF) or to boost Collaborative Filtering (CF)
with rich side information [30]. A necessary prerequisite for
both CBF and CF with side information is the availability
of a rich set of descriptive features about movies.
An open problem with multimedia recommender systems
is how to enable or improve recommendations when user ratings
and "traditional" human-generated features are nonexistent
or incomplete. This is called the new item problem [16, 29]
and it happens frequently in video-on-demand scenarios,
when new multimedia content is added to the catalog
of available items (as an example, 500 hours of video are
uploaded to YouTube every minute, see
http://www.reelseo.com/hours-minute-uploaded-youtube/).
Movie content features can be classified into three hierar-
chical levels [35].
• At the highest level, we have semantic features that
deal with the conceptual model of a movie. An example
of a semantic feature is the plot of the movie The
Good, the Bad and the Ugly, which revolves around
three gunslingers competing to find a buried cache of
gold during the American Civil War;
• At the intermediate level, we have syntactic features
that deal with objects in a movie and their interactions.
As an example, in the same movie, there
are Clint Eastwood, Lee Van Cleef, Eli Wallach, plus
several horses and guns;
• At the lowest level, we have stylistic features, related to
the Mise-en-Scène of the movie, i.e., the design aspects
that characterize the aesthetics and style of a movie (e.g.,
colors or textures). As an example, in the same movie the
predominant colors are yellow and brown, and camera
shots use extreme close-ups on actors' eyes.
The same plot (semantic level) can be acted by different ac-
tors (syntactic level) and directed in different ways (stylistic
level). In general, there is no direct link between the high-
level concepts and the low-level features. Each combination
of features conveys different communication effects and stimulates
different feelings in the viewers.
Recommender systems in the movie domain mainly focus
on high-level or intermediate-level features – usually pro-
vided by a group of domain experts or by a large commu-
nity of users – such as movie genres (semantic features, high
level), actors (syntactic features, intermediate level) or tags
(semantic and syntactic features, high and intermediate lev-
els) [32, 17, 21]. Movie genres and actors are normally as-
signed by movie experts and tags by communities of users
[33]. Human-generated features present a number of disad-
vantages:
1. features are prone to user biases and errors, therefore
not fully reflecting the characteristics of a movie;
2. new items might lack features as well as ratings;
3. unstructured features such as tags require complex Natural
Language Processing (NLP) in order to account
for stemming, stop-word removal, synonym detection
and other semantic analysis tasks;
4. not all features of an item have the same importance
related to the task at hand; for instance, a background
actor does not have the same importance as a guest
star in defining the characteristics of a movie.
In contrast to human-generated features, the content of
movie streams is itself a rich source of information about
low-level stylistic features that can be used to provide movie
recommendations. Low-level visual features have been shown
to be very representative of the users feelings, according to
the theory of Applied Media Aesthetics [37]. By analyzing a
movie stream content and extracting a set of low-level fea-
tures, a recommender system can make personalized recom-
mendations, tailored to a user’s taste. This is particularly
beneficial in the new item scenario, i.e., when movies with-
out ratings and without user-generated tags are added to
the catalogue.
Moreover, while low-level visual features can be extracted
from full-length movies, they can also be extracted from
shorter versions of the movies (i.e., trailers) in order to obtain
a scalable recommender system. In previous work, we have
shown that mise-en-scène visual features extracted from trailers
can be used to accurately predict the genre of movies [12, 13].
In this paper, we show how to use low-level visual features
extracted automatically from movie files as input to a hybrid
CF+CBF algorithm. We have extracted the low-level visual
features by using two different approaches:
• MPEG-7 visual descriptors [22]
• Pre-trained deep-learning neural networks (DNN) [31]
Based on the discussion above, we articulate the follow-
ing research hypothesis: “a recommender system using low-
level visual features (mise-en-scène) provides better accuracy
compared to the same recommender system using traditional
content features (genre and tag)”.
We articulate the research hypothesis along the following
research questions:
RQ1: do visual low-level features extracted from either
MPEG-7 descriptors or pre-trained deep-learning networks
provide better top-N recommendations than genre
and tag features?
RQ2: do visual low-level features extracted from MPEG-7
descriptors in conjunction with pre-trained deep-learning
networks provide better top-N recommendations than
genre and tag features?
We have performed an extensive evaluation by comparing
low-level visual features against more traditional
features (i.e., genre and tag). For each set of features, we
have used a hybrid CBF+CF algorithm that includes item
features as side information, where item similarity is learned
with a Sparse LInear Method (SLIM) [25].
We have used visual and content features either individ-
ually or in combination, in order to obtain a clear picture
of the real ability of visual features in learning the prefer-
ences of users and effectively generating relevant recommen-
dations.
We have computed different relevance metrics (precision,
recall, and mean average precision) over a dataset of more
than 8M ratings provided by 242K users to 4K movies. In
our experiments, recommendations based on mise-en-scène
visual features consistently provide the best performance.
Overall, this work provides a number of contributions to
the RSs field in the movie domain:
• we propose a novel RS that automatically analyzes the
content of movies and extracts visual features in order
to generate personalized recommendations for users;
• we evaluate recommendations by using a dataset of 4K
movies and compare the results with a state-of-the-art
hybrid CF+CBF algorithm;
• we have extracted mise-en-scène visual features adopting
two different approaches (i.e., MPEG-7 and DNN)
and fed them to the recommendation algorithm, either
individually or in combination, in order to better study
the power of these types of features;
• the dataset, together with the user ratings and the visual
features extracted from the videos (both MPEG-7
and deep-network features), is available for download
at recsys.deib.polimi.it.
The rest of the paper is organized as follows. Section 2
reviews the relevant state of the art, related to content-
based recommender systems and video recommender sys-
tems. This section also introduces some theoretical back-
ground on Media Aesthetics that helps us to motivate our
approach and interpret the results of our study. It describes
the possible relation between the visual features adopted in
our work and the aesthetic variables that are well known
for artists in the domain of movie making. In Section 3 we
describe our method for extracting and representing mise-en-scène
visual features of the movies and provide the details
of our recommendation algorithms. Section 4 introduces the
evaluation method and presents the results of the study and
Section 5 discusses them. Section 6 draws the conclusions
and identifies open issues and directions for future work.
2. RELATED WORK
2.1 Multimedia Recommender Systems
Multimedia recommender systems typically exploit high-level
or intermediate-level features in order to generate movie
recommendations [6, 24]. These types of features express
semantic and syntactic properties of media content that are
obtained from structured sources of meta-information such
as databases, lexicons and ontologies, or from less struc-
tured data such as reviews, news articles, item descriptions
and social tags.
In contrast, in this paper, we propose exploiting low-level
features to provide recommendations. Such features express
stylistic properties of the media content and are extracted
directly from the multimedia content files [12].
While this approach has been already investigated in the
music recommendation domain [3], it has received marginal
attention for movie recommendations. The very few approaches
only consider low-level features to improve the quality
of recommendations based on other types of features. The
work in [36] proposes a video recommender system, called
VideoReach, which incorporates a combination of high-level
and low-level video features (such as textual, visual and
aural) in order to improve the click-through-rate metric. The
work in [39] proposes a multi-task learning algorithm to in-
tegrate multiple ranking lists, generated by using different
sources of data, including visual content.
While low-level features have been marginally explored
in the community of recommender systems, they have been
studied in other fields such as computer vision and content-
based video retrieval. The works in [20, 4] discuss a large
body of low-level features (visual, auditory or textual) that
can be considered for video content analysis. The work
in [27] proposes a practical movie genre classification scheme
based on computable visual cues. [26] discusses a similar
approach by considering also the audio features. Finally, the
work in [40] proposes a framework for automatic classifica-
tion of videos using visual features, based on the intermedi-
ate level of scene representation.
We note that, while using low-level features as additional
side information to hybridize existing recommender systems
is an interesting scenario, this paper addresses a different
one, i.e., the case in which the only available information is
the low-level visual features and the recommender system has
to use them effectively for recommendation generation.
Indeed, this is an extreme case of the new item problem [29],
where traditional recommender systems fail to properly do
their job. It is also worth noting that, while the present
work focuses on exploiting computer vision techniques on
item descriptions of products (i.e., the item-centric aspect),
computer vision techniques are also exploited in studying
users' interaction behavior, for example by studying their eye,
gaze and head movements while navigating a recommender
system (i.e., the user-centric aspect) [1, 7, 10].
2.2 Aesthetic View
The relation of mise-en-scène elements with the reactions
they are able to evoke in viewers is one of the main concerns
of Applied Media Aesthetics [37]. Examples of mise-en-scène
elements that are usually addressed in the literature
on movie design are Lighting and Color [15].
Lighting is the deliberate manipulation of light for a certain
communication purpose; it is used to create viewers'
perception of the environment and to establish an aesthetic
context for their experiences. The two main lighting
alternatives are usually referred to as chiaroscuro and flat
lighting [38]. Figure 1.a and Figure 1.b exemplify these two
alternatives.
Colors can strongly affect our perceptions and emotions
in unsuspected ways. For instance, red light gives the feeling
of warmth, but also the feeling that time moves slowly, while
blue light gives the feeling of cold, but also that time moves
faster. The expressive quality of colors strongly depends on
the lighting, since colors are a property of light [38]. Figure
2.a and Figure 2.b present two examples of using colors in
movies to evoke certain emotions.
Interestingly, most mise-en-scène elements can be computed
from the video data stream as statistical values [27, 5].
We refer to these computable aspects as low-level visual
features [11].
3. METHODOLOGY
The methodology adopted to provide recommendations
based on visual features comprises five steps:
1. Video Segmentation: the goal is to segment each
video into shots and to select a representative key-
frame from each shot;
2. Feature Extraction: the goal is to extract visual fea-
ture vectors from each key-frame. We have considered
two different types of visual features for this purpose:
(i) vectors extracted from MPEG-7 visual descriptors,
and (ii) vectors extracted from pre-trained deep-learn-
ing networks;
3. Feature Aggregation: feature vectors extracted from
the key-frames of a video are aggregated to obtain a feature
vector descriptive of the whole video.
4. Feature Fusion: in this step, features extracted from
the same video but with different methods (e.g., MPEG-
7 descriptors and deep-learning networks) are com-
bined into a fixed-length descriptor;
5. Recommendation: the aggregated (and possibly fused)
feature vectors describing the low-level visual features of
the videos are used to feed a recommendation algorithm.
For this purpose, we have considered Collective SLIM,
a feature-enhanced collaborative filtering (CF) method.
The flowchart of the methodology is shown in Figure 3,
and the steps are elaborated in more detail in the following
subsections.
3.1 Video segmentation
Shots are sequences of consecutive frames captured with-
out interruption by a single camera. The transition between
two successive shots of the video can be abrupt, where one
frame belongs to a shot and the following frame belongs to
the next shot, or gradual, where two shots are combined us-
ing chromatic, spatial or spatial-chromatic video production
effects (e.g., fade in/out, dissolve, wipe), which gradually
replace one shot by another.
The color histogram distance is one of the most stan-
dard descriptors used as a measure of (dis)similarity between
consecutive video frames in applications including content-based
video retrieval, object recognition, and others. A histogram
is computed for each frame of the video, and the histogram
intersection is used as the means of comparing the
local activity, according to Equation 1:
Figure 1: (a) Out of the Past (1947), an example of highly contrasted lighting. (b) The Wizard of Oz (1939), an example of flat lighting.

Figure 2: (a) An image from Django Unchained (2012); the red hue is used to increase the scene's sense of violence. (b) An image from Lincoln (2012); the blue tone is used to produce the sense of coldness and fatigue experienced by the characters.

$$ s(h_t, h_{t+1}) \;=\; \sum_{b} \min\big(h_t(b),\, h_{t+1}(b)\big) \qquad (1) $$

where $h_t$ and $h_{t+1}$ are the histograms of successive frames and $b$ is the index of the histogram bin. By comparing $s$ with a predefined threshold, we segment the videos in our dataset into shots. We set the histogram similarity threshold to 0.75.
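As an illustration of this segmentation step (not the exact implementation used in our experiments), the following sketch detects shot boundaries with the histogram intersection of Equation 1 and keeps the first frame of each shot as its key-frame; the bin count, the per-channel histogram layout and the key-frame choice are assumptions.

```python
import cv2
import numpy as np

def detect_shot_keyframes(video_path, threshold=0.75, bins=8):
    """Segment a video into shots by comparing normalized color histograms
    of consecutive frames (Equation 1) and return one key-frame per shot."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Joint BGR histogram, flattened and normalized so that sum(h) = 1
        hist = cv2.calcHist([frame], [0, 1, 2], None, [bins] * 3,
                            [0, 256] * 3).flatten()
        hist /= hist.sum() + 1e-9
        if prev_hist is None or np.minimum(hist, prev_hist).sum() < threshold:
            # Low intersection => shot boundary; keep this frame as key-frame
            keyframes.append(frame)
        prev_hist = hist
    cap.release()
    return keyframes
```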
3.2 Feature extraction
For each key frame, visual features are extracted by us-
ing either MPEG-7 descriptors or pre-trained deep-learning
networks.
3.2.1 MPEG-7 features
The MPEG-7 standard specifies descriptors that allow
users to measure visual features of images. More specifically,
MPEG-7 specifies 17 descriptors divided into four categories:
color, texture, shape, and motion [22]. In our work we have
focused our attention on the following five color and texture
descriptors, as previous experiments have proven the
expressiveness of color and texture for similarity-based visual
retrieval applications [22, 34] (a rough illustrative sketch of
one such descriptor follows the list):
• Color Descriptors.
  – Scalable Color Descriptor (SCD) is the color histogram of an image in the HSV color space. In our implementation we have used SCD with 256 coefficients (histogram bins).
  – Color Structure Descriptor (CSD) creates a modified version of the SCD histogram that takes into account the physical position of each color inside the image, and thus it can capture both color content and information about the structure of this content. In our implementation, CSD is described by a feature vector of length 256.
  – Color Layout Descriptor (CLD) is a very compact and resolution-invariant representation of color, obtained by applying the DCT transformation on a 2-D array of representative colors in the YUV color space. CLD is described by a feature vector of length 120 in our implementation.
• Texture Descriptors.
  – Edge Histogram Descriptor (EHD) describes the local edge distribution in the frame. The image is divided into 16 non-overlapping blocks (sub-images). Edges within each block are classified into one of five edge categories: vertical, horizontal, left diagonal, right diagonal and non-directional edges. The final local edge descriptor is composed of a histogram with 5 x 16 = 80 histogram bins.
  – Homogeneous Texture Descriptor (HTD) describes homogeneous texture regions within a frame, using a vector of 62 energy values.
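As a rough illustration only (not the MPEG-7 reference implementation), the Scalable Color Descriptor can be approximated by a 256-bin histogram in the HSV color space; the 16 x 4 x 4 bin layout below is an assumption chosen to yield a 256-element vector, and the real SCD additionally applies non-uniform quantization and a Haar transform.

```python
import cv2

def scd_like_descriptor(frame_bgr):
    """Approximate MPEG-7 SCD: a 256-bin HSV histogram (16 H x 4 S x 4 V),
    normalized to unit sum. Illustrative only."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [16, 4, 4],
                        [0, 180, 0, 256, 0, 256]).flatten()
    return hist / (hist.sum() + 1e-9)
```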
Figure 3: Flowchart of the methodology used to create visual features from MPEG-7 descriptors and pre-trained deep-learning networks.

3.2.2 Deep-Learning features
An alternative way to extract visual features from an image
is to use the inner layers of pre-trained deep-learning
networks [19]. We have used the 1024 inner neurons of
GoogLeNet, a 22-layer deep network trained on over
1.2 million images belonging to 1000 categories [31]. Each
key frame is provided as input to the network and the
activation values of the inner neurons are used as visual
features for the frame.
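A minimal sketch of this step, assuming the pre-trained GoogLeNet shipped with PyTorch/torchvision (whose last pooling layer yields the 1024 activations that feed its classifier); the exact weights and image preprocessing used in our experiments may differ.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained GoogLeNet; replacing the classifier with an identity exposes
# the 1024-dimensional activations of the last pooling layer.
net = models.googlenet(pretrained=True)
net.fc = torch.nn.Identity()
net.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def dnn_features(keyframe_path):
    """Return a 1024-element feature vector for one key-frame."""
    img = preprocess(Image.open(keyframe_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return net(img).squeeze(0).numpy()
```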
3.3 Feature aggregation
The previous step extracts a vector of features from each
key-frame of a video. We need to define a function to aggre-
gate all these vectors into a single feature vector descriptive
of the whole video. The MPEG-7 standard defines an exten-
sion of the descriptors to a collection of pictures known as
group of pictures descriptors [23, 2]. The main aggregation
functions are intersection histogram, average and median.
Inspired by this, our proposed aggregation functions consist
of the following:
• intersection histogram: each element of the aggregated
feature vector is the minimum of the corresponding
elements of the feature vectors from each key-frame;
• average: each element of the aggregated feature vector
is the average of the corresponding elements of the
feature vectors from each key-frame;
• median: each element of the aggregated feature vector
is the median of the corresponding elements of the
feature vectors from each key-frame;
• union histogram: each element of the aggregated
feature vector is the maximum of the corresponding
elements of the feature vectors from each key-frame.
In our experiments we have applied each aggregation func-
tion to both MPEG-7 and deep-learning features.
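The following numpy sketch illustrates these element-wise aggregation functions over the matrix of per-key-frame feature vectors (the function names are ours, for illustration):

```python
import numpy as np

# rows = key-frames, columns = feature dimensions
AGGREGATORS = {
    "intersection": lambda V: V.min(axis=0),   # element-wise minimum
    "average":      lambda V: V.mean(axis=0),
    "median":       lambda V: np.median(V, axis=0),
    "union":        lambda V: V.max(axis=0),   # element-wise maximum
}

def aggregate(keyframe_vectors, how="average"):
    """Collapse a (num_keyframes x dim) matrix into one video-level vector."""
    return AGGREGATORS[how](np.asarray(keyframe_vectors, dtype=float))
```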
3.4 Feature Fusion
Motivated by the approach proposed in [14], we employed
a fusion method based on Canonical Correlation Analysis
(CCA), which exploits the low-level correlation between the
two sets of visual features and learns a linear transformation
that maximizes the pairwise correlation between the MPEG-7
and deep-learning network visual features.
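A minimal sketch of such CCA-based fusion using scikit-learn; the number of canonical components and the choice of concatenating the two projected views are assumptions, not necessarily the configuration of [14].

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_fuse(mpeg7_feats, dnn_feats, n_components=64):
    """Project the two views (items x dims) onto maximally correlated
    subspaces and concatenate the projections into one fused descriptor."""
    cca = CCA(n_components=n_components)
    x_proj, y_proj = cca.fit_transform(mpeg7_feats, dnn_feats)
    return np.hstack([x_proj, y_proj])   # fixed-length fused vector per item
```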
3.5 Recommendations
In order to test the effectiveness of low-level visual fea-
tures in video recommendations, we have experimented with
a widely adopted hybrid collaborative-filtering algorithm en-
riched with side information.
We use Collective SLIM (Sparse Linear Method), a widely
adopted sparse CF method that includes item features as
side information to improve quality of recommendations [25].
The item similarity matrix $S$ is learned by minimizing the following optimization problem:

$$ \underset{S}{\operatorname{argmin}} \;\; \alpha \,\| R - RS \| \;+\; (1-\alpha)\,\| F - FS \| \;+\; \gamma \,\| S \| \qquad (2) $$

where $R$ is the user-rating matrix, $F$ is the feature-item matrix, and the parameters $\alpha$ and $\gamma$ are tuned with cross-validation.
The algorithm is trained using Bayesian Personalized Ranking (BPR)
[28].
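To make Equation 2 concrete, the following sketch evaluates the objective for a candidate similarity matrix S, assuming Frobenius norms; the actual model is additionally constrained (e.g., sparsity) and, in our case, trained with BPR rather than by direct minimization of this loss.

```python
import numpy as np

def collective_slim_loss(R, F, S, alpha=0.5, gamma=0.01):
    """Value of Equation 2 for a candidate item-item similarity matrix S.
    R: user x item rating matrix, F: feature x item matrix, S: item x item."""
    rating_term  = np.linalg.norm(R - R @ S, "fro")   # collaborative part
    feature_term = np.linalg.norm(F - F @ S, "fro")   # side-information part
    reg_term     = np.linalg.norm(S, "fro")           # regularization
    return alpha * rating_term + (1 - alpha) * feature_term + gamma * reg_term
```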
4. EVALUATION AND RESULTS
We have used the latest version of the 20M Movielens
dataset [18]. For each movie in the Movielens dataset, the
title has been automatically queried in YouTube to search
for the trailer.
The final dataset contains 8,931,665 ratings and 586,994
tags provided by 242,209 users to 3,964 movies (sparsity
99.06%), classified along 19 genres: action, adventure,
animation, children's, comedy, crime, documentary, drama,
fantasy, film-noir, horror, musical, mystery, romance, sci-fi,
thriller, war, western, and unknown.
For each movie, the corresponding video trailer is avail-
able. Low-level features have been automatically extracted
from the trailers according to the methodology described in
the previous section. The dataset, together with trailers and
low-level features, is available for download at recsys.deib.polimi.it.
In order to evaluate the effectiveness of low-level visual
features, we have used two baseline sets of features: genre
and tag. We have used Latent Semantic Analysis (LSA) to
pre-process the tag-item matrix in order to better exploit
the implicit structure in the association between tags and
items. The technique consists of decomposing the tag-item
matrix into a set of orthogonal factors whose linear combination
approximates the original matrix [8].
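A minimal sketch of this LSA preprocessing via truncated SVD; the number of latent factors and the items-by-tags orientation of the matrix are assumptions.

```python
from sklearn.decomposition import TruncatedSVD

def lsa_tag_features(item_tag_matrix, n_factors=100):
    """Project items from the raw tag space onto n_factors orthogonal
    latent dimensions; rows = items, columns = tags in the input matrix."""
    svd = TruncatedSVD(n_components=n_factors)
    return svd.fit_transform(item_tag_matrix)   # items x n_factors
```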
We evaluate the Top-N recommendation quality by adopting
a procedure similar to the one described in [9].
• We randomly placed 80% of the ratings in the training
set, 10% in the validation set, and 10% in the test set.
Additionally, we performed a 5-fold cross-validation
test to compute confidence intervals.
• For each relevant item i rated by user u in the test
set, we form a list containing the item i and all the
items not rated by user u, which we assume to be
irrelevant to her. Then, we form a recommendation
list by picking the top-N ranked items. Letting r be the
rank of i, we have a hit if r < N; otherwise, we have a
miss.
• We measure the quality of the recommendations in
terms of recall, precision and mean average precision
(MAP) for different cutoff values N = 1, 10, 20 (a minimal
sketch of this protocol is given after the list).
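A minimal sketch of this hit-based protocol, assuming the rank r of each relevant test item within its candidate list has already been computed; under this protocol precision@N equals recall@N divided by N [9].

```python
def recall_precision_at_n(test_item_ranks, N):
    """test_item_ranks: rank r of each relevant test item within its
    candidate list (the item itself plus all items unrated by the user)."""
    hits = sum(1 for r in test_item_ranks if r < N)   # hit if r < N
    recall = hits / len(test_item_ranks)
    precision = recall / N        # as defined for this protocol in [9]
    return recall, precision
```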
We feed the recommendation algorithm with MPEG-7 features,
deep-learning network features, genres and tags (tags
are preprocessed with LSA). We also feed the algorithm with
a combination of MPEG-7 and deep-learning network features.
Table 1 presents the results of the experiments in terms
of precision, recall, and MAP, for different cutoff values, i.e.,
1, 10, and 20. First of all, we note that our initial analysis
has shown that the best recommendation results are obtained
with the intersection aggregation function for MPEG-7 features
and with the average aggregation function for deep-learning
network features. Hence, we report the results for these two
aggregation functions.
As can be seen, in terms of almost all the considered
metrics and all the cutoff values, MPEG-7 + DNN shows
the best results, and MPEG-7 alone shows the second-best
results. The only exceptions are Precision at 10 and MAP
at 10. In the former case, MPEG-7 is the best and Genre is
the second. In the latter case, MPEG-7 is the best and
MPEG-7 + DNN is the second. Unexpectedly, recommendation
based on tags is always the worst method.
These results are very promising and overall demonstrate the
power of recommendation based on MPEG-7 features, used
individually or in combination with DNN features. Indeed,
the results show that recommendations based on MPEG-7
features always outperform genre- and tag-based recommendations,
and that the combination of MPEG-7 and deep-learning
network features significantly improves the quality of
the hybrid CF+CBF algorithm, providing the best recommendation
results overall.
5. DISCUSSION
Our results provide empirical evidence that visual low-
level features extracted from MPEG-7 descriptors provide
better top-N recommendations than genres and tags while
the same does not apply to visual low-level features ex-
tracted from pre-trained deep-learning networks.
Overall, our experiments demonstrate the effectiveness of movie
recommendations based on visual features automatically
extracted from movie trailers. Recommendations based
on deep-learning network visual features provide good
quality, in line with recommendations based
on human-generated attributes such as genres and tags, while
visual features extracted from MPEG-7 descriptors consistently
provide better recommendations. Moreover, the fusion of
deep-learning network and MPEG-7 visual features yields
the best recommendation results. These results suggest
an interesting consideration: users' opinions on movies
are influenced more by style than by content.
Given the small number of mise-en-scène features (e.g.,
the combined MPEG-7 feature vector contains only 774 elements,
compared with half a million tags) and the fact that
we extracted them from movie trailers, we did not expect
this result. In fact, we view it as good news for
practitioners of movie recommender systems, as low-level
features combine multiple advantages. First, mise-en-scène
features have the convenience of being computed automatically
from video files, offering designers more flexibility in
handling new items, without the need to wait for costly editorial
or crowd-based tagging. Moreover, it is also possible
to extract low-level features from movie trailers, without
the need to work on full-length videos [12]. This guarantees
good scalability. Finally, viewers are less consciously aware
of movie style, and we expect that recommendations based
on low-level features could be more attractive in terms of
diversity, novelty and serendipity.
We would like to offer an explanation as to why mise-en-scène
low-level features consistently deliver better top-N
recommendations than a much larger number of high-level
attributes. This may have to do with a limitation of high-level
features, which are binary in nature: movies either have
or do not have a specific attribute. On the contrary, low-level
features are continuous in their values and are present
in all movies, albeit with different weights.
A potential difficulty in exploiting mise-en-scène low-level
visual features is the computational load required to
extract features from full-length movies. However,
we have observed that low-level visual features extracted
from movie trailers are highly correlated with the corresponding
features extracted from full-length movies [12].
Accordingly, this strong correlation indicates that
movie trailers are reliable representatives of the
corresponding full-length movies. Hence, instead of
analyzing the lengthy full movies, the trailers can be
analyzed, which substantially reduces the
computational load of using mise-en-scène low-level visual
features.
6. CONCLUSION AND FUTURE WORK
This work presents a novel approach in the domain of
movie recommendations. The technique is based on the
analysis of movie content and extraction of stylistic low-
level features that are used to generate personalized recommendations for users.
Table 1: Recommendation based on MPEG-7 and DNN features in comparison with the traditional genre and tag features

Features      |        Recall         |       Precision       |          MAP
              |   @1     @10    @20   |   @1     @10    @20   |   @1     @10    @20
MPEG-7        | 0.0337  0.1172  0.1751| 0.1785  0.1354  0.1060| 0.1785  0.1114  0.1001
DNN           | 0.0238  0.0872  0.1381| 0.1346  0.1021  0.0831| 0.1346  0.0785  0.0714
MPEG-7 + DNN  | 0.0383  0.1773  0.2466| 0.2005  0.1063  0.0771| 0.2005  0.1051  0.1040
Genre         | 0.0294  0.0904  0.1381| 0.1596  0.1108  0.0892| 0.1596  0.0892  0.0792
Tag-LSA       | 0.0127  0.0476  0.0684| 0.1001  0.0686  0.0523| 0.1001  0.0493  0.0389
This approach makes it possible to
recommend items to users without relying on any high-level
semantic features (such as genre or tag) that are expensive
to obtain, as they require expert-level knowledge, and may
be missing (e.g., in the new item scenario).
While the results of this study do not diminish the
importance of high-level semantic features, they provide a
strong argument for exploring the potential of low-level
visual features that are automatically extracted from
movie content.
For future work, we consider the design and development
of an online web application in order to conduct online studies
with real users. The goal is to evaluate the effectiveness
of recommendations based on low-level visual features not
only in terms of relevance, but also in terms of novelty and
diversity. Moreover, we will extend the range of extracted
low-level features and also include audio features. We will
also extend the evaluation to user-generated videos (e.g.,
YouTube). Finally, we plan to feed the MPEG-7 features
as input to the initial layer of deep networks and build the
model accordingly; we are interested in investigating
the possible improvement of recommendations based on the
features provided by deep networks trained in this way.
7. ACKNOWLEDGMENTS
This work is supported by Telecom Italia S.p.A., Open In-
novation Department, Joint Open Lab S-Cube, Milan. The
work has been also supported by the Amazon AWS Cloud
Credits for Research program.
8. REFERENCES
[1] X. Bao, S. Fan, A. Varshavsky, K. Li, and
R. Roy Choudhury. Your reactions suggest you liked
the movie: Automatic content rating via reaction
sensing. In Proceedings of the 2013 ACM international
joint conference on Pervasive and ubiquitous
computing, pages 197–206. ACM, 2013.
[2] M. Bastan, H. Cam, U. Gudukbay, and O. Ulusoy.
Bilvideo-7: an mpeg-7-compatible video indexing and
retrieval system. IEEE MultiMedia, 17(3):62–73, 2010.
[3] D. Bogdanov, J. Serrà, N. Wack, P. Herrera, and
X. Serra. Unifying low-level and high-level music
similarity measures. Multimedia, IEEE Transactions
on, 13(4):687–701, 2011.
[4] D. Brezeale and D. J. Cook. Automatic video
classification: A survey of the literature. Systems,
Man, and Cybernetics, Part C: Applications and
Reviews, IEEE Transactions on, 38(3):416–430, 2008.
[5] W. Buckland. What does the statistical style analysis
of film involve? a review of moving into pictures. more
on film history, style, and analysis. Literary and
Linguistic Computing, 23(2):219–230, 2008.
[6] I. Cantador, M. Szomszor, H. Alani, M. Fernández,
and P. Castells. Enriching ontological user profiles
with tagging history for multi-domain
recommendations. 2008.
[7] L. Chen and P. Pu. Eye-tracking study of user
behavior in recommender interfaces. In International
Conference on User Modeling, Adaptation, and
Personalization, pages 375–380. Springer, 2010.
[8] P. Cremonesi, F. Garzotto, S. Negro, A. V.
Papadopoulos, and R. Turrin. Looking for "good"
recommendations: A comparative evaluation of
recommender systems. In Human-Computer
Interaction–INTERACT 2011, pages 152–168.
Springer, 2011.
[9] P. Cremonesi, Y. Koren, and R. Turrin. Performance
of recommender algorithms on top-n recommendation
tasks. In Proceedings of the 2010 ACM Conference on
Recommender Systems, RecSys 2010, Barcelona,
Spain, September 26-30, 2010, pages 39–46, 2010.
[10] Y. Deldjoo and R. E. Atani. A low-cost
infrared-optical head tracking solution for virtual 3d
audio environment using the nintendo wii-remote.
Entertainment Computing, 12:9–27, 2016.
[11] Y. Deldjoo, M. Elahi, P. Cremonesi, F. Garzotto, and
P. Piazzolla. Recommending movies based on
mise-en-scene design. In Proceedings of the 2016 CHI
Conference Extended Abstracts on Human Factors in
Computing Systems, pages 1540–1547. ACM, 2016.
[12] Y. Deldjoo, M. Elahi, P. Cremonesi, F. Garzotto,
P. Piazzolla, and M. Quadrana. Content-based video
recommendation system based on stylistic visual
features. Journal on Data Semantics, pages 1–15,
2016.
[13] Y. Deldjoo, M. Elahi, M. Quadrana, and
P. Cremonesi. Toward building a content-based video
recommendation system based on low-level features.
In E-Commerce and Web Technologies. Springer, 2015.
[14] Y. Deldjoo, M. Elahi, P. Cremonesi, F. B.
Moghaddam, and A. L. E. Caielli. How to combine
visual features with tags to improve movie
recommendation accuracy? In E-Commerce and Web
Technologies: 17th International Conference, EC-Web
2016, Porto, Portugal, September 5-8, 2016, Revised
Selected Papers, volume 278, page 34. Springer, 2017.
[15] C. Dorai and S. Venkatesh. Computational media
aesthetics: Finding meaning beautiful. IEEE
MultiMedia, 8(4):10–12, 2001.
[16] M. Elahi, F. Ricci, and N. Rubens. A survey of active
learning in collaborative filtering recommender
systems. Computer Science Review, 2016.
[17] M. Fleischman and E. Hovy. Recommendations
without user preferences: a natural language
processing approach. In Proceedings of the 8th
international conference on Intelligent user interfaces,
pages 242–244. ACM, 2003.
[18] F. M. Harper and J. A. Konstan. The movielens
datasets: History and context. ACM Transactions on
Interactive Intelligent Systems (TiiS), 5(4):19, 2015.
[19] R. He and J. McAuley. Vbpr: Visual bayesian
personalized ranking from implicit feedback. arXiv
preprint arXiv:1510.01784, 2015.
[20] W. Hu, N. Xie, L. Li, X. Zeng, and S. Maybank. A
survey on visual content-based video indexing and
retrieval. Systems, Man, and Cybernetics, Part C:
Applications and Reviews, IEEE Transactions on,
41(6):797–819, 2011.
[21] N. Jakob, S. H. Weber, M. C. Müller, and
I. Gurevych. Beyond the stars: exploiting free-text
user reviews to improve the accuracy of movie
recommendations. In Proceedings of the 1st
international CIKM workshop on Topic-sentiment
analysis for mass opinion, pages 57–64. ACM, 2009.
[22] B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan, and
A. Yamada. Color and texture descriptors. Circuits
and Systems for Video Technology, IEEE Transactions
on, 11(6):703–715, 2001.
[23] B. S. Manjunath, P. Salembier, and T. Sikora.
Introduction to MPEG-7: multimedia content
description interface, volume 1. John Wiley & Sons,
2002.
[24] C. Musto, F. Narducci, P. Lops, G. Semeraro,
M. de Gemmis, M. Barbieri, J. Korst, V. Pronk, and
R. Clout. Enhanced semantic tv-show representation
for personalized electronic program guides. In User
Modeling, Adaptation, and Personalization, pages
188–199. Springer, 2012.
[25] X. Ning and G. Karypis. Sparse linear methods with
side information for top-n recommendations. In
Proceedings of the sixth ACM conference on
Recommender systems, pages 155–162. ACM, 2012.
[26] Z. Rasheed and M. Shah. Video categorization using
semantics and semiotics. In Video mining, pages
185–217. Springer, 2003.
[27] Z. Rasheed, Y. Sheikh, and M. Shah. On the use of
computable features for film classification. Circuits
and Systems for Video Technology, IEEE Transactions
on, 15(1):52–64, 2005.
[28] S. Rendle, C. Freudenthaler, Z. Gantner, and
L. Schmidt-Thieme. Bpr: Bayesian personalized
ranking from implicit feedback. In Proceedings of the
twenty-fifth conference on uncertainty in artificial
intelligence, pages 452–461. AUAI Press, 2009.
[29] N. Rubens, M. Elahi, M. Sugiyama, and D. Kaplan.
Active learning in recommender systems. In
Recommender systems handbook, pages 809–846.
Springer, 2015.
[30] Y. Shi, M. Larson, and A. Hanjalic. Collaborative
filtering beyond the user-item matrix: A survey of the
state of the art and future challenges. ACM
Computing Surveys (CSUR), 47(1):3, 2014.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. In
Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1–9, 2015.
[32] M. Szomszor, C. Cattuto, H. Alani, K. O'Hara,
A. Baldassarri, V. Loreto, and V. D. Servedio.
Folksonomies, the semantic web, and movie
recommendation. 2007.
[33] J. Vig, S. Sen, and J. Riedl. Tagsplanations:
explaining recommendations using tags. In Proceedings
of the 14th international conference on Intelligent user
interfaces, pages 47–56. ACM, 2009.
[34] X.-Y. Wang, B.-B. Zhang, and H.-Y. Yang.
Content-based image retrieval by integrating color and
texture features. Multimedia tools and applications,
68(3):545–569, 2014.
[35] Y. Wang, C. Xing, and L. Zhou. Video semantic
models: survey and evaluation. Int. J. Comput. Sci.
Netw. Security, 6:10–20, 2006.
[36] B. Yang, T. Mei, X.-S. Hua, L. Yang, S.-Q. Yang, and
M. Li. Online video recommendation based on
multimodal fusion and relevance feedback. In
Proceedings of the 6th ACM international conference
on Image and video retrieval, pages 73–80. ACM, 2007.
[37] H. Zettl. Essentials of applied media aesthetics. In
C. Dorai and S. Venkatesh, editors, Media Computing,
volume 4 of The Springer International Series in
Video Computing, pages 11–38. Springer US, 2002.
[38] H. Zettl. Sight, sound, motion: Applied media
aesthetics. Cengage Learning, 2013.
[39] X. Zhao, G. Li, M. Wang, J. Yuan, Z.-J. Zha, Z. Li,
and T.-S. Chua. Integrating rich information for video
recommendation with multi-task rank aggregation. In
Proceedings of the 19th ACM international conference
on Multimedia, pages 1521–1524. ACM, 2011.
[40] H. Zhou, T. Hermans, A. V. Karandikar, and J. M.
Rehg. Movie genre classification via scene
categorization. In Proceedings of the international
conference on Multimedia, pages 747–750. ACM, 2010.