Using Mise-en-Scène Visual Features based on MPEG-7
and Deep Learning for Movie Recommendation
Yashar Deldjoo
Politecnico di Milano
Via Ponzio 34/5
20133, Milan, Italy
yashar.deldjoo@polimi.it
Massimo Quadrana
Politecnico di Milano
Via Ponzio 34/5
20133, Milan, Italy
massimo.quadrana@polimi.it
Mehdi Elahi
Politecnico di Milano
Via Ponzio 34/5
20133, Milan, Italy
mehdi.elahi@polimi.it
Paolo Cremonesi
Politecnico di Milano
Via Ponzio 34/5
20133, Milan, Italy
paolo.cremonesi@polimi.it
ABSTRACT
Item features play an important role in movie recommender
systems, where recommendations can be generated by using
explicit or implicit preferences of users on traditional fea-
tures (attributes) such as tag, genre, and cast. Typically,
movie features are human-generated, either editorially (e.g.,
genre and cast) or by leveraging the wisdom of the crowd
(e.g., tag), and as such, they are prone to noise and are ex-
pensive to collect. Moreover, these features are often rare or
absent for new items, making it difficult or even impossible
to provide good quality recommendations.
In this paper, we show that users’ preferences on movies
can be better described in terms of the Mise-en-Scène fea-
tures, i.e., the visual aspects of a movie that characterize
design, aesthetics and style (e.g., colors, textures). We use
both MPEG-7 visual descriptors and Deep Learning hidden
layers as examples of mise-en-scène features that can visually
describe movies. Interestingly, mise-en-scène features can be
computed automatically from video files or even from trail-
ers, offering more flexibility in handling new items, avoiding
the need for costly and error-prone human-based tagging,
and providing good scalability.
We have conducted a set of experiments on a large cat-
alogue of 4K movies. Results show that recommendations
based on mise-en-scène features consistently provide the best
performance compared with richer sets of more traditional
features, such as genre and tag.
Keywords
mpeg-7 features, movie recommendation, visual, deep learning
1. INTRODUCTION
Multimedia recommender systems base their recommen-
dations on human-generated content features which are ei-
ther crowd-sourced (e.g., tag) or editorially generated (e.g.,
genre, director, cast). The typical approach is to recom-
mend items sharing features with the other items the user
liked in the past.
In the movie domain, information about movies (e.g., tag,
genre, cast) can be exploited to either use Content-Based
Filtering (CBF) or to boost Collaborative Filtering (CF)
with rich side information [30]. A necessary prerequisite for
both CBF and CF with side information is the availability
of a rich set of descriptive features about movies.
An open problem with multimedia recommender systems
is how to enable or improve recommendations when user rat-
ings and “traditional” human-generated features are nonexis-
tent or incomplete. This is called the new item problem [16,
29] and it happens frequently in video-on-demand scenar-
ios, when new multimedia content is added to the catalog
of available items (as an example, 500 hours of video are
uploaded to YouTube every minute¹).
Movie content features can be classified into three hierar-
chical levels [35].
At the highest level, we have semantic features that
deal with the conceptual model of a movie. An exam-
ple of semantic feature is the plot of the movie The
Good, the Bad and the Ugly, which revolves around
three gunslingers competing to find a buried cache of
gold during the American Civil War;
At the intermediate level, we have syntactic features
that deal with objects in a movie and their interac-
tions. As an example, in the same noted movie, there
are Clint Eastwood, Lee Van Cleef, Eli Wallach, plus
several horses and guns;
At the lowest level, we have stylistic features, related to
the Mise-en-Scène of the movie, i.e., the design aspects
that characterize the aesthetics and style of a movie (e.g.,
colors or textures). As an example, in the same movie the
predominant colors are yellow and brown, and camera
shots use extreme close-ups on actors’ eyes.
The same plot (semantic level) can be acted by different ac-
tors (syntactic level) and directed in different ways (stylistic
level). In general, there is no direct link between the high-
level concepts and the low-level features. Each combination
of features conveys different communication effects and stim-
ulates different feelings in the viewers.
¹ http://www.reelseo.com/hours-minute-uploaded-youtube/
Recommender systems in the movie domain mainly focus
on high-level or intermediate-level features – usually pro-
vided by a group of domain experts or by a large commu-
nity of users – such as movie genres (semantic features, high
level), actors (syntactic features, intermediate level) or tags
(semantic and syntactic features, high and intermediate lev-
els) [32, 17, 21]. Movie genres and actors are normally as-
signed by movie experts and tags by communities of users
[33]. Human-generated features present a number of disad-
vantages:
1. features are prone to user biases and errors, therefore
not fully reflecting the characteristics of a movie;
2. new items might lack features as well as ratings;
3. unstructured features such as tags require complex Nat-
ural Language Processing (NLP) in order to account
for stemming, stop-word removal, synonym detection
and other semantic analysis tasks;
4. not all features of an item have the same importance
related to the task at hand; for instance, a background
actor does not have the same importance as a guest
star in defining the characteristics of a movie.
In contrast to human-generated features, the content of
movie streams is itself a rich source of information about
low-level stylistic features that can be used to provide movie
recommendations. Low-level visual features have been shown
to be very representative of the users’ feelings, according to
the theory of Applied Media Aesthetics [37]. By analyzing a
movie stream content and extracting a set of low-level fea-
tures, a recommender system can make personalized recom-
mendations, tailored to a user’s taste. This is particularly
beneficial in the new item scenario, i.e., when movies with-
out ratings and without user-generated tags are added to
the catalogue.
Moreover, while low-level visual features can be extracted
from full-length movies, they can also be extracted from
shorter versions of the movies (i.e., trailers) in order to have
a scalable recommender system. In previous works, we have
shown that mise-en-scène visual features extracted from trail-
ers can be used to accurately predict the genre of movies [12, 13].
In this paper, we show how to use low-level visual features
extracted automatically from movie files as input to a hybrid
CF+CBF algorithm. We have extracted the low-level visual
features by using two different approaches:
MPEG-7 visual descriptors [22]
Pre-trained deep-learning neural networks (DNN) [31]
Based on the discussion above, we articulate the follow-
ing research hypothesis: “a recommender system using low-
level visual features (mise-en-scène) provides better accuracy
compared to the same recommender system using traditional
content features (genre and tag)”.
We articulate the research hypothesis along the following
research questions:
RQ1: do visual low-level features extracted from any of
MPEG-7 descriptors or pre-trained deep-learning net-
works provide better top-N recommendations than genre
and tag features?
RQ2: do visual low-level features extracted from MPEG-7
descriptors in conjunction with pre-trained deep-learning
networks provide better top-N recommendations than
genre and tag features?
We have performed an exhaustive evaluation by compar-
ing low-level visual features with respect to more traditional
features (i.e., genre and tag). For each set of features, we
have used a hybrid CBF+CF algorithm that includes item
features as side information, where item similarity is learned
with a Sparse LInear Method (SLIM) [25].
We have used visual and content features either individ-
ually or in combination, in order to obtain a clear picture
of the real ability of visual features in learning the prefer-
ences of users and effectively generating relevant recommen-
dations.
We have computed different relevance metrics (precision,
recall, and mean average precision) over a dataset of more
than 8M ratings provided by 242K users to 4K movies. In
our experiments, recommendations based on mise-en-scène
visual features consistently provide the best performance.
Overall, this work provides a number of contributions to
the RSs field in the movie domain:
we propose a novel RS that automatically analyzes the
content of the movies and extracts visual features in or-
der to generate personalized recommendations for users;
we evaluate recommendations by using a dataset of 4K
movies and compare the results with a state-of-the-art
hybrid CF+CBF algorithm;
we have extracted mise-en-scène visual features adopt-
ing two different approaches (i.e., MPEG-7 and DNN)
and fed them to the recommendation algorithm, either
individually or in combination, in order to better study
the power of these types of features;
the dataset, together with the user ratings and the vi-
sual features extracted from the videos (both MPEG-7
and deep-network features), is available for download².
The rest of the paper is organized as follows. Section 2
reviews the relevant state of the art, related to content-
based recommender systems and video recommender sys-
tems. This section also introduces some theoretical back-
ground on Media Aesthetics that helps us to motivate our
approach and interpret the results of our study. It describes
the possible relation between the visual features adopted in
our work and the aesthetic variables that are well known
for artists in the domain of movie making. In Section 3 we
describe our method for extracting and representing mise-
en-scène visual features of the movies and provide the details
of our recommendation algorithms. Section 4 introduces the
evaluation method and presents the results of the study and
Section 5 discusses them. Section 6 draws the conclusions
and identifies open issues and directions for future work.
2. RELATED WORK
² recsys.deib.polimi.it
2.1 Multimedia Recommender Systems
Multimedia recommender systems typically exploit high-
level or intermediate-level features in order to generate movie
recommendations [6, 24]. These features express se-
mantic and syntactic properties of media content that are
obtained from structured sources of meta-information such
as databases, lexicons and ontologies, or from less struc-
tured data such as reviews, news articles, item descriptions
and social tags.
In contrast, in this paper, we propose exploiting low-level
features to provide recommendations. Such features express
stylistic properties of the media content and are extracted
directly from the multimedia content files [12].
While this approach has been already investigated in the
music recommendation domain [3], it has received marginal
attention for movie recommendations. The very few ap-
proaches only consider low-level features to improve the qual-
ity of recommendations based on other type of features. The
work in [36] proposes a video recommender system, called
VideoReach, which incorporates a combination of high-level
and low-level video features (such as textual, visual and au-
ral) in order to improve the click-through-rate metric. The
work in [39] proposes a multi-task learning algorithm to in-
tegrate multiple ranking lists, generated by using different
sources of data, including visual content.
While low-level features have been marginally explored
in the community of recommender systems, they have been
studied in other fields such as computer vision and content-
based video retrieval. The works in [20, 4] discuss a large
body of low-level features (visual, auditory or textual) that
can be considered for video content analysis. The work
in [27] proposes a practical movie genre classification scheme
based on computable visual cues. [26] discusses a similar
approach by considering also the audio features. Finally, the
work in [40] proposes a framework for automatic classifica-
tion of videos using visual features, based on the intermedi-
ate level of scene representation.
We note that, while the scenario of using low-level features
as additional side information to hybridize existing recom-
mender systems is interesting, this paper addresses a different
scenario, i.e., the case in which the only available information
is the low-level visual features and the recommender system
has to use them effectively for recommendation generation.
Indeed, this is an extreme case of the new item problem [29],
where traditional recommender systems fail to do their job
properly. It is worthwhile to note that
while the present work focuses on exploiting computer
vision techniques on item descriptions (i.e., the item-centric
aspect), computer vision techniques have also been exploited
to study users’ interaction behavior, for example through
their eye, gaze and head movements while navigating a
recommender system (i.e., the user-centric aspect) [1, 7, 10].
2.2 Aesthetic View
The relation of mise-en-scène elements to the reactions
they are able to evoke in viewers is one of the main con-
cerns of Applied Media Aesthetics [37]. Examples of mise-en-
scène elements that are usually addressed in the literature
on movie design are Lighting and Color [15].
Lighting is the deliberate manipulation of light for a cer-
tain communication purpose and it is used to create viewers’
perception of the environment, and establish an aesthetic
context for their experiences. The two main lighting al-
ternatives are usually referred to as chiaroscuro and flat
lighting [38]. Figure 1.a and Figure 1.b exemplify these two
alternatives.
Colors can strongly affect our perceptions and emotions
in unsuspected ways. For instance, red light gives the feeling
of warmth, but also the feeling that time moves slowly, while
blue light gives the feeling of cold, but also that time moves
faster. The expressive quality of colors strongly depends on
the lighting, since colors are a property of light [38]. Figure
2.a and Figure 2.b present two examples of using colors in
movies to evoke certain emotions.
Interestingly, most mise-en-scène elements can be com-
puted from the video data stream as statistical values [27,
5]. We call these computable aspects visual low-level fea-
tures [11].
3. METHODOLOGY
The methodology adopted to provide recommendations
based on visual features comprises five steps:
1. Video Segmentation: the goal is to segment each
video into shots and to select a representative key-
frame from each shot;
2. Feature Extraction: the goal is to extract visual fea-
ture vectors from each key-frame. We have considered
two different types of visual features for this purpose:
(i) vectors extracted from MPEG-7 visual descriptors,
and (ii) vectors extracted from pre-trained deep-learn-
ing networks;
3. Feature Aggregation: feature vectors extracted from
the key-frames of a video are aggregated to obtain a fea-
ture vector descriptive of the whole video.
4. Feature Fusion: in this step, features extracted from
the same video but with different methods (e.g., MPEG-
7 descriptors and deep-learning networks) are com-
bined into a fixed-length descriptor;
5. Recommendation: the aggregated (and possibly fused) vec-
tors describing low-level visual features of videos are
used to feed a recommender algorithm. For this pur-
pose, we have considered Collective SLIM, a feature-
enhanced collaborative filtering (CF) method.
The flowchart of the methodology is shown in Figure 3
and the steps are elaborated in more detail in the following
subsections.
3.1 Video segmentation
Shots are sequences of consecutive frames captured with-
out interruption by a single camera. The transition between
two successive shots of the video can be abrupt, where one
frame belongs to a shot and the following frame belongs to
the next shot, or gradual, where two shots are combined us-
ing chromatic, spatial or spatial-chromatic video production
effects (e.g., fade in/out, dissolve, wipe), which gradually
replace one shot by another.
The color histogram distance is one of the most stan-
dard descriptors used as a measure of (dis)similarity between
consecutive video frames in applications including content-
based video retrieval, object recognition, and others. A his-
togram is computed for each frame in the video and the his-
togram intersection is used as the means of comparing the
local activity according to Equation 1,
s(h_t, h_{t+1}) = \sum_{b} \min\big(h_t(b),\, h_{t+1}(b)\big) \qquad (1)
where h_t and h_{t+1} are the histograms of successive frames and
b is the index of the histogram bin. By comparing s with a
predefined threshold, we segment the videos in our dataset
into shots. We set the histogram similarity threshold to 0.75.
[Figure 1: (a) Out of the Past (1947), an example of highly contrasted lighting. (b) The Wizard of Oz (1939), an example of flat lighting.]
[Figure 2: (a) An image from Django Unchained (2012); the red hue is used to increase the scene’s sense of violence. (b) An image from Lincoln (2012); a blue tone is used to produce the sense of coldness and fatigue experienced by the characters.]
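As a minimal illustration of this segmentation step, the sketch below thresholds the histogram intersection of Equation 1 between consecutive frames using OpenCV; it covers only the shot-boundary detection (key-frame selection within each shot is a separate step), and the grayscale histograms and bin count are our simplifying assumptions rather than details taken from the paper.

```python
import cv2
import numpy as np

def detect_shot_boundaries(video_path, threshold=0.75, bins=64):
    """Segment a video into shots by thresholding the normalized histogram
    intersection (Equation 1) between consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
        hist /= hist.sum() + 1e-9                  # normalize so s lies in [0, 1]
        if prev_hist is not None:
            s = np.minimum(prev_hist, hist).sum()  # histogram intersection, Eq. 1
            if s < threshold:                      # low similarity -> shot boundary
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```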
3.2 Feature extraction
For each key frame, visual features are extracted by us-
ing either MPEG-7 descriptors or pre-trained deep-learning
networks.
3.2.1 MPEG-7 features
The MPEG-7 standard specifies descriptors that allow
users to measure visual features of images. More specifically,
MPEG-7 specifies 17 descriptors divided into four categories:
color, texture, shape, and motion [22]. In our work we have
focused our attention on the following five color and tex-
ture descriptors, as previous experiments have proven the
expressiveness of color and texture for similarity-based vi-
sual retrieval applications [22, 34]:
Color Descriptors.
Scalable Color Descriptor (SCD) is the color his-
togram of an image in the HSV color space. In
our implementation we have used SCD with 256
coefficients (histogram bins).
Color Structure Descriptor (CSD) creates a mod-
ified version of the SCD histogram to take into
account the physical position of each color inside
the images, and thus it can capture both color
content and information about the structure of
this content. In our implementation, CSD is de-
scribed by a feature vector of length 256.
Color Layout Descriptor (CLD) is a very compact
and resolution-invariant representation of color ob-
tained by applying the DCT transformation on
a 2-D array of representative colors in the YUV
color space. CLD is described by a feature vector
of length 120 in our implementation.
Texture Descriptors.
Edge Histogram Descriptor (EHD) describes local
edge distribution in the frame. The image is di-
vided into 16 non-overlapping blocks (subimages).
Edges within each block are classified into one of
five edge categories: vertical, horizontal, left di-
agonal, right diagonal and non–directional edges.
The final local edge descriptor is composed of a
histogram with 5 x 16 = 80 histogram bins.
Homogeneous Texture Descriptor (HTD) describes
homogeneous texture regions within a frame, by
using a vector of 62 energy values.
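For illustration, the sketch below computes two rough stand-ins for the color descriptors listed above using OpenCV. It is not an implementation of the normative MPEG-7 descriptors: the bin layout, the use of YCrCb instead of YUV, and the omission of the Haar and zig-zag encodings are our simplifications.

```python
import cv2
import numpy as np

def scd_like(frame_bgr):
    """Simplified stand-in for the Scalable Color Descriptor: a 256-bin
    histogram in HSV space (the normative SCD additionally applies a
    Haar-based encoding, skipped here)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [16, 4, 4],
                        [0, 180, 0, 256, 0, 256]).ravel()  # 16*4*4 = 256 bins
    return hist / (hist.sum() + 1e-9)

def cld_like(frame_bgr):
    """Simplified stand-in for the Color Layout Descriptor: DCT coefficients
    of an 8x8 thumbnail (computed here in YCrCb, without the standard's
    zig-zag coefficient selection)."""
    thumb = cv2.resize(frame_bgr, (8, 8), interpolation=cv2.INTER_AREA)
    ycc = cv2.cvtColor(thumb, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    return np.concatenate([cv2.dct(ycc[:, :, c]).ravel() for c in range(3)])
```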
3.2.2 Deep-Learning features
[Figure 3: Flowchart of the methodology used to create visual features from MPEG-7 descriptors and pre-trained deep-learning networks.]
An alternative way to extract visual features from an im-
age is to use the inner layers of pre-trained deep-learning
networks [19]. We have used the 1024 inner neurons of
GoogLeNet, a 22-layer deep network trained to classify over
1.2 million images into 1000 categories [31]. Each
key frame is provided as input to the network and the ac-
tivation values of inner neurons are used as visual features
for the frame.
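One possible way to realize this step is sketched below with PyTorch/torchvision (the paper does not specify the framework). Reading the 1024-dimensional output of the global average pooling layer is our assumption about which inner layer is used, and the preprocessing constants are the standard ImageNet ones.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained GoogLeNet [31] (torchvision >= 0.13). A forward hook captures
# the 1024 activations that feed the final classifier (output of the global
# average pooling layer) and uses them as the key-frame descriptor.
net = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1).eval()
_feats = {}
net.avgpool.register_forward_hook(
    lambda module, inputs, output: _feats.update(vec=torch.flatten(output, 1)))

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def keyframe_features(image_path):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        net(x)                                   # forward pass fills the hook
    return _feats["vec"].squeeze(0).numpy()      # 1024-d feature vector
```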
3.3 Feature aggregation
The previous step extracts a vector of features from each
key-frame of a video. We need to define a function to aggre-
gate all these vectors into a single feature vector descriptive
of the whole video. The MPEG-7 standard defines an exten-
sion of the descriptors to a collection of pictures known as
group of pictures descriptors [23, 2]. The main aggregation
functions are intersection histogram, average and median.
Inspired by this, our proposed aggregation functions consist
of the following:
intersection histogram: each element of the ag-
gregated feature vector is the minimum of the cor-
responding elements of the feature vectors from each
key-frame;
average: each element of the aggregated feature vec-
tor is the average of the corresponding elements of the
feature vectors from each key-frame;
median: each element of the aggregated feature vec-
tor is the median of the corresponding elements of the
feature vectors from each key-frame;
union histogram: each element of the aggregated
feature vector is the maximum of the corresponding
elements of the feature vectors from each key-frame.
In our experiments we have applied each aggregation func-
tion to both MPEG-7 and deep-learning features.
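The four aggregation functions amount to element-wise reductions over the matrix of per-key-frame descriptors; the sketch below is a straightforward rendering of the definitions above.

```python
import numpy as np

AGGREGATORS = {
    "intersection": lambda V: V.min(axis=0),    # element-wise minimum over key-frames
    "average":      lambda V: V.mean(axis=0),
    "median":       lambda V: np.median(V, axis=0),
    "union":        lambda V: V.max(axis=0),    # element-wise maximum over key-frames
}

def aggregate(keyframe_vectors, how="average"):
    """Collapse a (num_keyframes x feature_dim) matrix of per-key-frame
    descriptors into a single video-level descriptor."""
    V = np.asarray(keyframe_vectors, dtype=float)
    return AGGREGATORS[how](V)
```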
3.4 Feature Fusion
Motivated by the approach proposed in [14], we employed
the fusion method based on Canonical Correlation Analy-
sis (CCA), which exploits the low-level correlation between
two sets of visual features and learns a linear transformation
that maximizes the pairwise correlation between the sets of
MPEG-7 and deep-learning visual features.
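A minimal sketch of such a fusion with scikit-learn is given below. Projecting both views with CCA and concatenating the projections is one common fusion rule; this rule and the number of components are our assumptions rather than values reported in the paper.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_fuse(X_mpeg7, X_dnn, n_components=64):
    """X_mpeg7 and X_dnn are (num_videos x dim) matrices of per-video
    descriptors. CCA learns linear transformations that maximize the
    pairwise correlation between the two views; the projections are then
    concatenated into a fixed-length fused descriptor."""
    cca = CCA(n_components=n_components, max_iter=1000)
    Xc, Yc = cca.fit_transform(X_mpeg7, X_dnn)
    return np.hstack([Xc, Yc])
```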
3.5 Recommendations
In order to test the effectiveness of low-level visual fea-
tures in video recommendations, we have experimented with
a widely adopted hybrid collaborative-filtering algorithm en-
riched with side information.
We use Collective SLIM (Sparse Linear Method), a widely
adopted sparse CF method that includes item features as
side information to improve quality of recommendations [25].
The item similarity matrix S is learned by minimizing the
following optimization problem:
\operatorname{argmin}_{S} \;\; \alpha \lVert R - RS \rVert + (1 - \alpha) \lVert F - FS \rVert + \gamma \lVert S \rVert \qquad (2)
where R is the user-rating matrix, F is the feature-item ma-
trix, and the parameters α and γ are tuned with cross-validation.
The algorithm is trained using Bayesian Personalized Ranking
(BPR) [28].
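To make the objective concrete, the didactic sketch below minimizes Equation 2 for a dense S with plain gradient descent. The actual system uses Collective SLIM [25] trained with BPR [28], which additionally enforces sparsity and non-negativity, so this is an illustration of the loss rather than the algorithm used in the experiments.

```python
import numpy as np

def collective_slim(R, F, alpha=0.5, gamma=0.01, lr=1e-3, epochs=100):
    """Approximately minimize alpha*||R - RS||^2 + (1-alpha)*||F - FS||^2
    + gamma*||S||^2 over the item-item similarity matrix S (Equation 2).
    R: (users x items) rating matrix, F: (features x items) feature matrix."""
    n_items = R.shape[1]
    S = np.zeros((n_items, n_items))
    for _ in range(epochs):
        grad = (alpha * R.T @ (R @ S - R)
                + (1 - alpha) * F.T @ (F @ S - F)
                + gamma * S)
        S -= lr * grad
        np.fill_diagonal(S, 0.0)   # an item must not be used to score itself
    return S

# item scores for every user are then given by R @ S
```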
4. EVALUATION AND RESULTS
We have used the latest version of the 20M Movielens
dataset [18]. For each movie in the Movielens dataset, the
title has been automatically queried in YouTube to search
for the trailer.
The final dataset contains 8’931’665 ratings and 586’994
tags provided by 242’209 users to 3’964 movies (spar-
sity 99.06%) classified along 19 genres: action, adventure,
animation, children’s, comedy, crime, documentary, drama,
fantasy, film-noir, horror, musical, mystery, romance, sci-fi,
thriller, war, western, and unknown.
For each movie, the corresponding video trailer is avail-
able. Low-level features have been automatically extracted
from the trailers according to the methodology described in
the previous section. The dataset, together with trailers and
low-level features, is available for download³.
In order to evaluate the effectiveness of low-level visual
features, we have used two baseline sets of features: genre
and tag. We have used Latent Semantic Analysis (LSA) to
pre-process the tag-item matrix in order to better exploit
the implicit structure in the association between tags and
items. The technique includes decomposing the tag-item
matrix into a set of orthogonal factors whose linear combi-
nation approximates the original matrix [8].
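A possible rendering of this preprocessing step with scikit-learn is shown below; the tag-item matrix is hypothetical and the number of latent factors is our assumption, as the paper does not report it.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical tag-item matrix: rows are tags, columns are movies.
tag_item = np.random.rand(2000, 3964)

# LSA: a truncated SVD of the tag-item matrix; the linear combination of the
# resulting orthogonal factors approximates the original matrix [8].
lsa = TruncatedSVD(n_components=200, random_state=0)
item_factors = lsa.fit_transform(tag_item.T)   # one dense latent vector per movie
```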
We evaluate the Top-N recommendation quality by adopt-
ing a procedure similar to the one described in [9].
We randomly placed 80% of the ratings in the training
set, 10% in the validation set, and 10% in the test set.
Additionally, we performed a 5-fold cross validation
test to compute confidence intervals.
For each relevant item i rated by user u in the test
set, we form a list containing the item i and all the
items not rated by the user u, which we assume to be
irrelevant to her. Then, we form a recommendation
list by picking the top-N ranked items. Being r the
rank of i, we have a hit if r < N, otherwise we have a
miss.
We measure the quality of the recommendations in
terms of recall, precision and mean average preci-
sion (MAP) for different cutoff values N = 1, 10, 20.
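The hit/miss protocol above can be sketched as follows for recall; the helper is our own illustration, with the model scores assumed to be a (users x items) matrix such as R @ S.

```python
import numpy as np

def recall_at_n(scores, test_pairs, rated_items_by_user, N=10):
    """For each relevant (user, item) test pair, rank the test item against
    all items the user has not rated (assumed irrelevant); a hit occurs when
    its rank r satisfies r < N. Recall@N is the fraction of hits."""
    hits = 0
    for u, i in test_pairs:
        candidates = [j for j in range(scores.shape[1])
                      if j == i or j not in rated_items_by_user[u]]
        order = sorted(candidates, key=lambda j: -scores[u, j])
        hits += order.index(i) < N
    return hits / len(test_pairs)
```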
We feed the recommendation algorithm with MPEG-7 fea-
tures, deep-learning network features, genres and tags (tags
are preprocessed with LSA). We also feed the algorithm with
a combination of MPEG-7 and deep-learning network fea-
tures.
Table 1 presents the results of the experiments in terms
of precision, recall, and MAP, for different cutoff values, i.e.,
1, 10, and 20. First of all, we note that our initial analysis
has shown that the best recommendation results are ob-
tained by the intersection aggregation function for MPEG-7
and by average for deep-learning network features. Hence, we
report the results for these two aggregation functions.
As can be seen, in terms of almost all the considered
metrics, and all the cutoff values, MPEG-7 + DNN has
shown the best results, and MPEG-7 alone has shown the
second best results. The only exceptions are Precision at 10
and MAP at 10. In the former case, MPEG-7 is the best
and Genre is the second. In the latter case, MPEG-7 is the
best and MPEG-7 + DNN is the second. Unexpectedly, the
recommendation based on tag is always the worst method.
These results are very promising and overall demonstrate the
power of recommendations based on MPEG-7 features, used
individually or in combination with DNN features. Indeed,
the results show that recommendations based on MPEG-
7 features always outperform genre- and tag-based recom-
mendations, and the combination of MPEG-7 features with
deep-learning networks significantly improves the quality of
the hybrid CF+CBF algorithm and provides the best rec-
ommendation results overall.
³ recsys.deib.polimi.it
5. DISCUSSION
Our results provide empirical evidence that visual low-
level features extracted from MPEG-7 descriptors provide
better top-N recommendations than genres and tags while
the same does not apply to visual low-level features ex-
tracted from pre-trained deep-learning networks.
Overall, our experiments prove the effectiveness of movie
recommendations based on visual features automatically ex-
tracted from trailers of movies. Recommendations based
on deep-learning network visual features can provide good
quality recommendations, in line with recommendations based
on human-generated attributes such as genres and tags, while
visual features extracted from MPEG-7 descriptors consis-
tently provide better recommendations. Moreover, the fusion of
the deep-learning network and MPEG-7 visual features yields
the best recommendation results. These results sug-
gest an interesting consideration: users’ opinions on movies
are influenced more by style than content.
Given the small number of mise-en-scène features (e.g.,
the combined MPEG-7 feature vector contains 774 elements
only, compared with half a million tags) and the fact that
we extracted them from movie trailers, we did not expect
this result. In fact, we would view it as good news for
practitioners of movie recommender systems, as low-level
features combine multiple advantages. First, mise-en-scène
features have the convenience of being computed automat-
ically from video files, offering designers more flexibility in
handling new items, without the need to wait for costly ed-
itorial or crowd-based tagging. Moreover, it is also possible
to extract low-level features from movie trailers, without
the need to work on full-length videos [12]. This guarantees
good scalability. Finally, viewers are less consciously aware
of movie styles and we expect that recommendations based
on low-level features could be more attractive in terms of
diversity, novelty and serendipity.
We would like to offer an explanation as to why mise-
en-scène low-level features consistently deliver better top-N
recommendations than a much larger number of high-level
attributes. This may have to do with a limitation of high-
level features, which are binary in nature: movies either have
or do not have a specific attribute. On the contrary, low-level
features are continuous in their values and they are present
in all movies, but with different weights.
A potential difficulty in exploiting mise-en-scène low-level
visual features is the computational load required for the
extraction of features from full-length movies. However,
we have observed that low-level visual features extracted
from the movie trailers are highly correlated with the cor-
responding features extracted from full-length movies [12].
Accordingly, the observed strong correlation indicates that
the movie trailers are indeed perfect representatives of the
corresponding full-length movies. Hence, instead of ana-
lyzing the lengthy full-length movies, the trailers can be
analyzed, which results in a significant reduction in the
computational load of using mise-en-scène low-level visual
features.
6. CONCLUSION AND FUTURE WORK
This work presents a novel approach in the domain of
movie recommendations. The technique is based on the
analysis of movie content and extraction of stylistic low-
level features that are used to generate personalized recom-
mendations for users.

Table 1: Recommendation based on MPEG-7 and DNN features in comparison with the traditional genre and tag features

Features        Recall@1  Recall@10  Recall@20  Precision@1  Precision@10  Precision@20  MAP@1   MAP@10  MAP@20
MPEG-7          0.0337    0.1172     0.1751     0.1785       0.1354        0.1060        0.1785  0.1114  0.1001
DNN             0.0238    0.0872     0.1381     0.1346       0.1021        0.0831        0.1346  0.0785  0.0714
MPEG-7 + DNN    0.0383    0.1773     0.2466     0.2005       0.1063        0.0771        0.2005  0.1051  0.1040
Genre           0.0294    0.0904     0.1381     0.1596       0.1108        0.0892        0.1596  0.0892  0.0792
Tag-LSA         0.0127    0.0476     0.0684     0.1001       0.0686        0.0523        0.1001  0.0493  0.0389

This approach makes it possible to
recommend items to users without relying on any high-level
semantic features (such as genre or tag) that are expensive
to obtain, as they require expert-level knowledge, and may
be missing (e.g., in the new item scenario).
While the results of this study do not underestimate
the importance of high-level semantic features, they
provide a strong argument for exploring the potential of low-
level visual features that are automatically extracted from
movie content.
For future work, we consider the design and development
of an online web application in order to conduct online stud-
ies with real users. The goal is to evaluate the effectiveness
of recommendations based on low-level visual features not
only in terms of relevance, but also in terms of novelty and
diversity. Moreover, we will extend the range of low-level
features extracted, and also include audio features. We will
also extend the evaluation to user-generated videos (e.g.,
YouTube). Finally, we would feed the MPEG-7 features
as input to the initial layer of deep networks and build the
model accordingly. We are indeed interested in investigating
the possible improvement of recommendations based on the
features provided by deep networks trained in this way.
7. ACKNOWLEDGMENTS
This work is supported by Telecom Italia S.p.A., Open In-
novation Department, Joint Open Lab S-Cube, Milan. The
work has been also supported by the Amazon AWS Cloud
Credits for Research program.
8. REFERENCES
[1] X. Bao, S. Fan, A. Varshavsky, K. Li, and
R. Roy Choudhury. Your reactions suggest you liked
the movie: Automatic content rating via reaction
sensing. In Proceedings of the 2013 ACM international
joint conference on Pervasive and ubiquitous
computing, pages 197–206. ACM, 2013.
[2] M. Bastan, H. Cam, U. Gudukbay, and O. Ulusoy.
Bilvideo-7: an mpeg-7-compatible video indexing and
retrieval system. IEEE MultiMedia, 17(3):62–73, 2010.
[3] D. Bogdanov, J. Serrà, N. Wack, P. Herrera, and
X. Serra. Unifying low-level and high-level music
similarity measures. Multimedia, IEEE Transactions
on, 13(4):687–701, 2011.
[4] D. Brezeale and D. J. Cook. Automatic video
classification: A survey of the literature. Systems,
Man, and Cybernetics, Part C: Applications and
Reviews, IEEE Transactions on, 38(3):416–430, 2008.
[5] W. Buckland. What does the statistical style analysis
of film involve? a review of moving into pictures. more
on film history, style, and analysis. Literary and
Linguistic Computing, 23(2):219–230, 2008.
[6] I. Cantador, M. Szomszor, H. Alani, M. Fernández,
and P. Castells. Enriching ontological user profiles
with tagging history for multi-domain
recommendations. 2008.
[7] L. Chen and P. Pu. Eye-tracking study of user
behavior in recommender interfaces. In International
Conference on User Modeling, Adaptation, and
Personalization, pages 375–380. Springer, 2010.
[8] P. Cremonesi, F. Garzotto, S. Negro, A. V.
Papadopoulos, and R. Turrin. Looking for “good”
recommendations: A comparative evaluation of
recommender systems. In Human-Computer
Interaction–INTERACT 2011, pages 152–168.
Springer, 2011.
[9] P. Cremonesi, Y. Koren, and R. Turrin. Performance
of recommender algorithms on top-n recommendation
tasks. In Proceedings of the 2010 ACM Conference on
Recommender Systems, RecSys 2010, Barcelona,
Spain, September 26-30, 2010, pages 39–46, 2010.
[10] Y. Deldjoo and R. E. Atani. A low-cost
infrared-optical head tracking solution for virtual 3d
audio environment using the nintendo wii-remote.
Entertainment Computing, 12:9–27, 2016.
[11] Y. Deldjoo, M. Elahi, P. Cremonesi, F. Garzotto, and
P. Piazzolla. Recommending movies based on
mise-en-scene design. In Proceedings of the 2016 CHI
Conference Extended Abstracts on Human Factors in
Computing Systems, pages 1540–1547. ACM, 2016.
[12] Y. Deldjoo, M. Elahi, P. Cremonesi, F. Garzotto,
P. Piazzolla, and M. Quadrana. Content-based video
recommendation system based on stylistic visual
features. Journal on Data Semantics, pages 1–15,
2016.
[13] Y. Deldjoo, M. Elahi, M. Quadrana, and
P. Cremonesi. Toward building a content-based video
recommendation system based on low-level features.
In E-Commerce and Web Technologies. Springer, 2015.
[14] Y. Deldjoo, Y. Elahi, P. Cremonesi, F. B.
Moghaddam, and A. L. E. Caielli. How to combine
visual features with tags to improve movie
recommendation accuracy? In E-Commerce and Web
Technologies: 17th International Conference, EC-Web
2016, Porto, Portugal, September 5-8, 2016, Revised
Selected Papers, volume 278, page 34. Springer, 2017.
[15] C. Dorai and S. Venkatesh. Computational media
aesthetics: Finding meaning beautiful. IEEE
MultiMedia, 8(4):10–12, 2001.
[16] M. Elahi, F. Ricci, and N. Rubens. A survey of active
learning in collaborative filtering recommender
systems. Computer Science Review, 2016.
[17] M. Fleischman and E. Hovy. Recommendations
without user preferences: a natural language
processing approach. In Proceedings of the 8th
international conference on Intelligent user interfaces,
pages 242–244. ACM, 2003.
[18] F. M. Harper and J. A. Konstan. The movielens
datasets: History and context. ACM Transactions on
Interactive Intelligent Systems (TiiS), 5(4):19, 2015.
[19] R. He and J. McAuley. Vbpr: Visual bayesian
personalized ranking from implicit feedback. arXiv
preprint arXiv:1510.01784, 2015.
[20] W. Hu, N. Xie, L. Li, X. Zeng, and S. Maybank. A
survey on visual content-based video indexing and
retrieval. Systems, Man, and Cybernetics, Part C:
Applications and Reviews, IEEE Transactions on,
41(6):797–819, 2011.
[21] N. Jakob, S. H. Weber, M. C. Müller, and
I. Gurevych. Beyond the stars: exploiting free-text
user reviews to improve the accuracy of movie
recommendations. In Proceedings of the 1st
international CIKM workshop on Topic-sentiment
analysis for mass opinion, pages 57–64. ACM, 2009.
[22] B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan, and
A. Yamada. Color and texture descriptors. Circuits
and Systems for Video Technology, IEEE Transactions
on, 11(6):703–715, 2001.
[23] B. S. Manjunath, P. Salembier, and T. Sikora.
Introduction to MPEG-7: multimedia content
description interface, volume 1. John Wiley & Sons,
2002.
[24] C. Musto, F. Narducci, P. Lops, G. Semeraro,
M. de Gemmis, M. Barbieri, J. Korst, V. Pronk, and
R. Clout. Enhanced semantic tv-show representation
for personalized electronic program guides. In User
Modeling, Adaptation, and Personalization, pages
188–199. Springer, 2012.
[25] X. Ning and G. Karypis. Sparse linear methods with
side information for top-n recommendations. In
Proceedings of the sixth ACM conference on
Recommender systems, pages 155–162. ACM, 2012.
[26] Z. Rasheed and M. Shah. Video categorization using
semantics and semiotics. In Video mining, pages
185–217. Springer, 2003.
[27] Z. Rasheed, Y. Sheikh, and M. Shah. On the use of
computable features for film classification. Circuits
and Systems for Video Technology, IEEE Transactions
on, 15(1):52–64, 2005.
[28] S. Rendle, C. Freudenthaler, Z. Gantner, and
L. Schmidt-Thieme. Bpr: Bayesian personalized
ranking from implicit feedback. In Proceedings of the
twenty-fifth conference on uncertainty in artificial
intelligence, pages 452–461. AUAI Press, 2009.
[29] N. Rubens, M. Elahi, M. Sugiyama, and D. Kaplan.
Active learning in recommender systems. In
Recommender systems handbook, pages 809–846.
Springer, 2015.
[30] Y. Shi, M. Larson, and A. Hanjalic. Collaborative
filtering beyond the user-item matrix: A survey of the
state of the art and future challenges. ACM
Computing Surveys (CSUR), 47(1):3, 2014.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. In
Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1–9, 2015.
[32] M. Szomszor, C. Cattuto, H. Alani, K. O’Hara,
A. Baldassarri, V. Loreto, and V. D. Servedio.
Folksonomies, the semantic web, and movie
recommendation. 2007.
[33] J. Vig, S. Sen, and J. Riedl. Tagsplanations:
explaining recommendations using tags. In Proceedings
of the 14th international conference on Intelligent user
interfaces, pages 47–56. ACM, 2009.
[34] X.-Y. Wang, B.-B. Zhang, and H.-Y. Yang.
Content-based image retrieval by integrating color and
texture features. Multimedia tools and applications,
68(3):545–569, 2014.
[35] Y. Wang, C. Xing, and L. Zhou. Video semantic
models: survey and evaluation. Int. J. Comput. Sci.
Netw. Security, 6:10–20, 2006.
[36] B. Yang, T. Mei, X.-S. Hua, L. Yang, S.-Q. Yang, and
M. Li. Online video recommendation based on
multimodal fusion and relevance feedback. In
Proceedings of the 6th ACM international conference
on Image and video retrieval, pages 73–80. ACM, 2007.
[37] H. Zettl. Essentials of applied media aesthetics. In
C. Dorai and S. Venkatesh, editors, Media Computing,
volume 4 of The Springer International Series in
Video Computing, pages 11–38. Springer US, 2002.
[38] H. Zettl. Sight, sound, motion: Applied media
aesthetics. Cengage Learning, 2013.
[39] X. Zhao, G. Li, M. Wang, J. Yuan, Z.-J. Zha, Z. Li,
and T.-S. Chua. Integrating rich information for video
recommendation with multi-task rank aggregation. In
Proceedings of the 19th ACM international conference
on Multimedia, pages 1521–1524. ACM, 2011.
[40] H. Zhou, T. Hermans, A. V. Karandikar, and J. M.
Rehg. Movie genre classification via scene
categorization. In Proceedings of the international
conference on Multimedia, pages 747–750. ACM, 2010.