Università degli Studi di Cagliari
Dipartimento di Matematica e Informatica
Doctor of Philosophy in
Computer Science
Similarity and Diversity:
Two Sides of the Same Coin in the Evaluation of
Data Streams
Doctoral Dissertation of
Roberto Saia
PhD Coordinator:
Prof. Gian Michele Pinna
Supervisor:
Prof. Salvatore Carta
Academic Year 2014/15 - XXVIII Cycle - INF/01
Men are only as good as their technical development allows them to be.
George Orwell
Acknowledgements
We gratefully acknowledge the Regione Autonoma della Sardegna for the financial
support to my PhD program under the project Social Glue, through PIA, Pacchetti
Integrati di Agevolazione Industria, Artigianato e Servizi (Annuity 2010), as well
as the Ministero dell’Istruzione, dell’Universit`a e della Ricerca (MIUR) for the
financial support through PRIN 2010-11, under the Security Horizons project.
Moreover, a special thanks to Dr. Ludovico Boratto, highly qualified and
efficient co-supervisor of my thesis work, together with Prof. Salvatore Carta.
Abstract
The Information Systems represent the primary instrument of growth for the com-
panies that operate in the so-called e-commerce environment. The data streams
generated by the users that interact with their websites are the primary source to
define the user behavioral models.
Some main examples of services integrated in these websites are the Recom-
mender Systems, where these models are exploited in order to generate recom-
mendations of items of potential interest to users, the User Segmentation Systems,
where the models are used in order to group the users on the basis of their pref-
erences, and the Fraud Detection Systems, where these models are exploited to
determine the legitimacy of a financial transaction.
Even though in literature diversity and similarity are considered as two sides
of the same coin, almost all the approaches take into account them in a mutually
exclusive manner, rather than jointly. The aim of this thesis is to demonstrate how
the consideration of both sides of this coin is instead essential to overcome some
well-known problems that afflict the state-of-the-art approaches used to implement
these services, improving their performance.
Its contributions are the following: with regard to the recommender systems,
the detection of the diversity in a user profile is used to discard incoherent items,
improving the accuracy, while the exploitation of the similarity of the predicted
items is used to re-rank the recommendations, improving their effectiveness; with
regard to the user segmentation systems, the detection of the diversity overcomes
the problem of the non-reliability of data source, while the exploitation of the
similarity reduces the problems of understandability and triviality of the obtained
segments; lastly, concerning the fraud detection systems, the joint use of both
diversity and similarity in the evaluation of a new transaction overcomes the prob-
lems of the data scarcity, and those of the non-stationary and unbalanced class
distribution.
Contents

Abstract

1 Introduction
1.1 Information Systems and Joint Evaluation of Similarity and Diversity: Motivation
1.1.1 Recommender Systems
1.1.2 User Segmentation Systems
1.1.3 Fraud Detection Systems
1.2 Contributions
1.3 Thesis Structure

I Background and Related Work

2 Recommender Systems
2.1 Introduction
2.2 User Profiling
2.2.1 Explicit, Implicit and Hybrid Strategies
2.2.2 Information Reliability
2.2.3 Magic Barrier Boundary
2.3 Decision Making Process
2.3.1 Non-personalized Models
2.3.2 Latent Factor Models

3 User Segmentation Systems
3.1 Introduction
3.2 Latent Space Discovering
3.2.1 Behavioral Targeting
3.2.2 Reliability of a semantic query analysis
3.2.3 Segment Interpretability and Semantic User Segmentation
3.2.4 Preference Stability
3.2.5 Item Descriptions Analysis

4 Fraud Detection Systems
4.1 Introduction
4.2 A Proactive Approach for the Detection of Fraud Attempts
4.2.1 Supervised and Unsupervised Approaches
4.2.2 Data Unbalance
4.2.3 Detection Models
4.2.4 Differences with the proposed approach

II On the Role of Similarity and Diversity in Recommender Systems

5 User Profiling
5.1 Introduction
5.2 Architecture
5.2.1 A State-of-the-Art Architecture for Content-based Recommender Systems
Limits at the State of the Art and Design Guidelines
5.2.2 Recommender Systems Architecture
Overview
Proposed Solutions
Approach
Conclusions and Future Work
5.3 Implementation
5.3.1 Overview
5.3.2 Proposed Solution
5.3.3 Adopted Notation
5.3.4 Problem Definition
5.3.5 Approach
Data Preprocessing
Semantic Similarity
Dynamic Coherence-Based Modeling
Item Recommendation
5.3.6 Experiments
5.3.7 Real Data
Strategy
Results
5.3.8 Synthetic Data
Experimental Setup
Experimental Results
5.3.9 Conclusions and Future Work

6 Decision Making Process
6.1 Introduction
6.2 Recommender Systems Performance
6.2.1 Overview
6.2.2 Proposed Solutions
6.2.3 Adopted Notation
6.2.4 Problem Definition
6.2.5 Approach
Items Popularity
Text Preprocessing
PBSVD++ Algorithm
6.2.6 Experiments
Experimental Setup
Strategy
Results
6.2.7 Conclusions and Future Work

III On the Role of Similarity and Diversity in User Segmentation Systems

7 Latent Space Discovering
7.1 Introduction
7.2 Overview
7.3 Proposed Solutions
7.4 Adopted Notation
7.5 Problem Definition
7.6 Approach
7.6.1 Text Preprocessing
7.6.2 User Modeling
7.6.3 Semantic Binary Sieve Definition
Primitive class-based SBS Definition
Interclass-based SBS Definition
Superclass-based SBS Definition
Subclass-based SBS Definition
Additional Considerations on the Boolean Classes
7.6.4 Relevance Score Definition
7.6.5 Target Definition
7.7 Experiments
Strategy
7.7.1 Results
Data Overview
Role of the Semantics in the SBS Data Structure
Setting of the ϕ parameter
Analysis of the segments
Performance Analysis
7.8 Conclusions and Future Work

IV On the Role of Similarity and Diversity in Fraud Detection Systems

8 A Proactive Approach for the Detection of Fraud Attempts
8.0.1 Introduction
8.0.2 Overview
8.0.3 Proposed Solutions
8.0.4 Adopted Notation
8.0.5 Problem Definition
8.0.6 Approach
Absolute Variations Calculation
TDF Definition
EBSV Operation
Discretization process
Transaction Evaluation
8.0.7 Experiments
Strategy
Parameters Tuning
Results
8.0.8 Conclusions and Future Work

V Conclusions

VI Publications

Appendices

A Natural Language Processing
A.1 Bag-of-words and Semantic Approaches
A.2 WordNet Environment
A.3 Vector Space Model

B Metrics and Datasets
B.1 Metrics
B.1.1 F1 measure
B.1.2 Elbow Criterion
B.1.3 Cosine Similarity
B.1.4 Levenshtein Distance
B.1.5 Average Precision
B.2 Datasets
B.2.1 Yahoo! Webscope Movie Dataset (R4)
B.2.2 Movielens 10M
B.2.3 Credit Card Dataset

Bibliography
Chapter 1
Introduction
Nowadays, the Information Systems (IS) represent the primary instrument of
growth for the companies that operate in the so-called e-commerce environment.
Indeed, these systems aggregate the data they collected to generate information
provided to both the users (e.g., items recommendations) and to those who run the
business (e.g., segments of users to target, or possible fraudulent transactions).
In order to generate these types of services, an IS has to build predictions on the
usefulness of a particular piece of information (i.e., it has to predict whether or not an
item should be recommended to a user, in which user segment a user should be
placed, or if a transaction can be successfully completed). In this matter, detecting
the similarity and diversity between the behavior of a user and that of the other
users, or with respect to her/his previous behavior, is essential in order to build
accurate predictions. Therefore, the main objective of this thesis is to investigate
the role of similarity and diversity in these classes of IS, and to show why they
represent two sides of the same coin (i.e., why it is necessary to exploit both of
them in order for a system to perform well).
1.1 Information Systems and Joint Evaluation of Similarity and Diversity: Motivation
1.1.1 Recommender Systems
One of the most important classes of IS are the Recommender Systems (RSs),
since their ability to perform accurate predictions on the future user preferences
about items is strongly (and directly) related to the earnings of the commercial
operators. Considering that, typically, a RS produces its results on the basis of
the historic interactions of the users with it, by evaluating the similarity/diversity
between their previous choices and the items not evaluated yet, the ability to define
a user profile able to reflect the users' real tastes represents a crucial task.
In order to face this problem, the joint evaluation of similarity and diversity
can lead toward significant improvements. In fact, the predictive models of a
RS are usually determined through the analysis of the data stream related with
the past activities of the users, and the similarity aspect (i.e., between users or
items, in accordance with the adopted recommendation strategy) represents the
primary criterion to determine their output. In such context, the diversity aspect
is considered as a mere implicit element of the problem, a specular factor of the
similarity, which in many cases is not even taken into account by the involved
strategies.
This happens because almost all of these systems operate in accordance with the as-
sumption that the past choices of the users represent a reliable source of informa-
tion that can be exploited in order to infer their future preferences. For this reason,
they usually generate the recommendations on the basis of the interpretation of all
historic interactions of the users with them, using algorithms primarily based on
the evaluation of the similarity between items (similarity with items not yet evalu-
ated and items already evaluated in the past) or between users (similarity with the
other users who share part of her/his past choices).
Although it may sound correct, such an approach could lead to wrong results due
to several factors, such as a change in user taste over time, the use of her/his ac-
count by third parties, or when the system does not allow its users to express
feedback (or when it is possible, but they do not use this option). A RS that adopts
the previously mentioned criteria of similarity produces non-optimal results. This
happens because its recommendations are based only on the explicit characteris-
tics of the users, which can be trivial, since they present a low level of novelty and
serendipity (i.e., the ability to suggest something interesting to users, without their
having expressly searched for it).
1.1.2 User Segmentation Systems
Another important class of IS are those that perform a segmentation of users with
related interests, in order to target them (behavioral targeting). The set of target
users is detected from a segmentation of the user set, based on their interactions
with the website (pages visited, items purchased, etc.). Recently, in order to im-
prove the segmentation process, the semantics behind the user behavior has been
exploited, by analyzing the queries issued by the users. However, nearly half of
the time users need to reformulate their queries in order to satisfy their informa-
tion need. In this thesis, we tackle the problem of semantic behavioral targeting
considering reliable user preferences, by performing a semantic analysis on the
descriptions of the items positively rated by the users. We also consider widely-
known problems, such as the interpretability of a segment, and the fact that user
preferences are usually stable over time, which could lead to a trivial segmen-
tation. In order to overcome these issues, our approach allows an advertiser to
automatically extract a user segment by specifying the interests that she/he wants
to target, by means of a novel boolean algebra; the segments are composed of
users whose evaluated items are semantically related to these interests. This leads
to interpretable and non-trivial segments, built by using reliable information.
1.1.3 Fraud Detection Systems
Any business that operates on the Internet and accepts payments through debit or
credit cards, also implicitly accepts that some transactions may be fraudulent. The
design of effective strategies to face this problem is challenging, due to factors
such as the heterogeneity and the non-stationary distribution of the data, as well
as the presence of an imbalanced class distribution, and the scarcity of public
datasets. The state-of-the-art strategies are usually based on a unique model, built
by analyzing the past transactions of a user, whose similarity with the current
transaction is analyzed. In order to overcome the aforementioned problems, it
would be advisable to generate a set of models (behavioral patterns) to evaluate
a new transaction, by considering the behavior of the user in different temporal
frames of her/his history. These models can be built by evaluating different forms
of similarity and diversity between the financial transactions of a user.
1.2 Contributions
It should be observed that, despite the fact that similarity and diversity can be
considered as two sides of the same coin, in many contexts these two factors are
taken into account in a mutually exclusive manner, rather than jointly.
For instance, almost all the approaches at the state of the art, used to generate
recommendations, are basically based on metrics of similarity between users/items,
without taking into account any factor of diversity. Conversely, by taking into consid-
eration the diversity between items, they could evaluate the coherence in the past
choices of the users, removing from their profiles the incoherent elements, making
these as close as possible to their real tastes. This thesis shows how by performing
a pre-processing phase (based on the concept of diversity and addressed to remove
the incoherent items from the user profiles), followed by a post-processing phase
(based instead on the concept of similarity and aimed to re-rank the suggestions
generated by a state-of-the-art algorithm of recommendation), we can lead the
recommender system toward better performance.
This thesis also proposes a novel approach to improve the user segmentation
process, by reducing the triviality that characterizes the results of many of the
state-of-the-art approaches. It is based on the evaluation of the semantic similar-
ity/diversity between users, in terms of preference for classes of items, allowing
us to group them in a non-trivial way, increasing the serendipity factor when the
results are exploited to perform a behavioral targeting.
In the fraud detection context, where the previously mentioned problems oc-
cur, this thesis proposes a new strategy to detect frauds. The main idea is to per-
form a joint evaluation of the similarity/diversity between the financial transac-
tion to evaluate and a series of models defined by using different temporal frames
of the user activity (exploiting all transaction fields). It allows us to overcome the
problem of data scarcity (by using multiple models) and data unbalance (it does
not use fraudulent transactions to train the models), operating in a proactive mode.
1.3 Thesis Structure
This thesis is organized as follows: it first presents the background and related
work (Part I) of the main concepts involved in the performed research, contin-
uing by presenting all details (adopted notation, problem definition, used datasets,
involved metrics, adopted strategy, experiments, and conclusions) about the three
areas taken into account in this thesis, respectively related to the role of simi-
larity and diversity in the context of the recommender systems (Part II), in that
of the user segmentation systems (Part III), and in that of the fraud detection
systems (Part IV). It ends with some concluding remarks (Part V).
Part I
Background and Related Work
Chapter 2
Recommender Systems
2.1 Introduction
The context taken in consideration is that of the Recommender Systems (RS) [1],
where the rapid growth of the number of companies that sell goods through the
World Wide Web has generated an enormous amount of valuable information,
which can be exploited to improve the quality and efficiency of the sales crite-
ria [2]. Because of the widely-known information overload problem, it became
necessary to deal with the large amounts of data available on the Web [3]. The rec-
ommender systems represent an effective response to this problem, by filtering the
huge amount of information about their customers in order to get useful elements
to produce suggestions to them [4, 5, 6]. The denomination RS denotes a set of
software tools and techniques providing suggestions for items to a user, where the
term item is used to indicate what the system recommends to users. This research
addresses one of the most important aspects related to the recommender systems,
i.e., how to represent a user profile, so that it only contains accurate information
about a user, and it allows a system to generate effective recommendations.
2.2 User Profiling
When it comes to producing personalized recommendations to users, the first re-
quirement is to understand the needs of the users and, according to them, to build
a user profile that models these needs. User profiles and context information are
the key elements that allow personalized recommendations to be performed, through a wide
range of techniques developed for using profile information to influence different
aspects of the search experience. There are several approaches to build profiles: some
of them focus on short-term user profiles that capture features of the user’s current
search context [7, 8, 9], while others accommodate long-term profiles that capture
the user preferences over a long period of time [10, 11, 12]. As shown in [13],
compared with short-term user profiles, the use of long-term user profiles
generally produces more reliable results, at least when the user preferences are
fairly stable over a long time period. Otherwise, we need a specific strategy able
to manage the changes in the user profile that do not reflect the real taste of the user
and that represent a form of “noise”.
Given this analysis of the literature, the definition of approaches able to detect
the presence of diversity in a user profile represents a novel problem.
Some important concepts related to the so-called Natural Language Process-
ing (NLP) are also presented in Appendix A.
2.2.1 Explicit, Implicit and Hybrid Strategies
There are two main strategies to get useful information to build the user profiles,
i.e., explicit and implicit, as well as a combination of these (hybrid strategies).
Explicit profiling strategies interrogate users directly by requesting different forms
of preference information, from categorical preferences [10, 12] to simple result
ratings [11]. Implicit profiling strategies attempt to infer preference information
by analyzing the user behavior, and without a direct interaction with users while
they perform actions in a website [10, 14, 15]. The hybrid strategies combine the
advantages of the implicit and explicit approaches of user profiling, by considering
both the static and dynamic characteristics of the users, the latter obtained
by retrieving their behavioral information. This approach represents a good
compromise between advantages and disadvantages related with the two main
approaches of user profiling (i.e., explicit and implicit).
An explicit way to build the user profiles is reported in [16], where the user
profiling activity is considered as a process of analyzing the static and inferable
characteristics of the users. Following this approach, their behavior is inferred
by the analysis of the available information about them, usually collected through
the use of on-line forms or other similar methods (e.g., specific surveys). This
approach is classified as static profiling or factual profiling. It should be noted that
such a strategy of data collection presents some problems, such as those related to
the privacy aspect (many users do not like to reveal their information), and those
related to the form filling process (many users do not like to spend their time
on this activity). Regarding this last kind of problem, it has been observed that the accuracy of
a filled form depends on the time needed to fill it.
The same work [16] reports a dynamic user profiling strategy which, instead
of adopting a static approach of data collection based exclusively on the explicit
information of the users, tries to learn more data about them. Such a strategy is also
classified as behavioral profiling, adaptive profiling, or ontological profiling of the
users. It is performed by exploiting several filtering approaches, such as rule-based
filtering, collaborative filtering, and content-based filtering [17].
As previously said, the hybrid strategies represent a good compromise between
the advantages and the disadvantages related to the implicit and explicit ap-
proaches of user profiling. A more sophisticated hybrid approach is reported
in [18], a strategy for learning the user profiles from both static and dynamic
information. In addition to the canonical static information about the users, it
exploits the tags associated with the items rated by the users. The tags taken into
consideration are the user's own tags, but also the so-called social tags, i.e., the tags
used by other users who rated the same items. It should be observed how this way
of proceeding allows us to exploit the different knowledge of the users in the
domain taken into consideration, because the social tags represent a way to extend
the content-based paradigm toward a hybrid content-collaborative paradigm [19].
In order to face the problem related to the non-univocity of the tags (due to
the fact that they are arbitrarily chosen by users), the same work [19] suggests
a semantic approach of disambiguation (i.e., word sense disambiguation)
performed by exploiting a lexical ontology such as WordNet [20, 21]. Another hy-
brid approach of user profiling, where the content-based profiles and the interests
revealed through tagging activities are combined, is reported in [22].
In conclusion, it is possible to state that the strategies of user profiling that
proved to be most effective are the implicit ones, where the preferences of the
users are inferred without any direct interaction with them. These implicit ap-
proaches usually require long-term user profiles, where the information about
the tastes is considered over an extended period of time.
implicit approaches that involve a short-term profiling, related to the particular
context in which the system operates [7].
2.2.2 Information Reliability
Regardless of the type of profiling that is adopted (e.g., long-term or short-term),
there is a common problem that may affect the goodness of the obtained results,
i.e., the capability of the information stored in the user profile to lead toward
reliable recommendations. Unreliable information in a user’s profile can be found
in many cases, e.g. when a private user account is used by other people, when the
user has expressed a wrong preference about an item, and so on. In order to face
the problem of dealing with unreliable information in a user profile, the state of
the art proposes different strategies.
Several approaches, such as [23], take advantage of the Bayesian analysis
of the user provided relevance feedback, in order to detect non-stationary user in-
terests. The work [24] describes an approach to learn the users' preferences in a
dynamic way, a strategy able to work simultaneously in short-term and long-term
domains. Also exploiting the feedback information provided by the users, other
approaches such as [13] make use of a tree-descriptor model to detect shifts in
user interests. Another technique exploits the knowledge captured in an ontol-
ogy [25] to obtain the same result, but in this case it is necessary that the users
express their preferences about items through an explicit rating. There are also
other different strategies that try to improve the accuracy of information in the
user profiles by collecting the implicit feedback of the users during their natural
interactions with the system (reading-time, saving, etc.) [26].
However, it should be pointed out that most of the strategies used in this area
are usually effective only in specific contexts, such as for instance [27], where a
novel approach to automatically model the user profile according to the change of
her/his tastes is designed to operate in the context of article recommendation.
Despite the fact that implicit feedback from users is usually less accurate than
explicitly expressed feedback, in certain contexts this approach leads toward pretty
good results.
2.2.3 Magic Barrier Boundary
It should be noted that there is a common issue, related to the concept of item
incoherence, that afflicts the recommender approaches. This is a problem that in
the literature is identified as magic barrier [28], a term used to define the theoret-
ical boundary for the level of optimization that can be achieved by an algorithm
of recommendation on known transactional data [29].
The inconsistency in the behavior of the users represents a well-known aspect
in the context of recommender systems, a problem that has been investigated since
the study in [30], where the reliability of the user ratings is questioned, as well as in
the work [28], in which the level of noise in the user ratings has been discussed. The
evaluation models assume as a ground truth that the transactions made in the past
by the users, and stored in their profiles, are free of noise.
This is a concept that has been studied in [31, 32], where a study aimed to cap-
ture the noise in a service that operates in a synthetic environment was performed.
It should be noted that this is an aspect that, in the context of the recommender
systems, was mentioned for the first time in 1995, in a work [30] aimed to discuss
the concept of reliability of users in terms of rating coherence, as well as in the
work [28], where the level of noise in the user ratings has been discussed.
The proposed approach differs from the others in the literature, in the sense
that it does not need to focus on a specific type of profile (i.e., short-term or long-
term), it can operate with any type of data that contains a textual description, and it
overcomes the limitation introduced by the magic barrier from a novel perspective,
represented by the semantic analysis of the items.
2.3 Decision Making Process
Content-based recommender systems suggest to users items that are similar to
those they previously evaluated [33, 34]. The early systems used relatively simple
retrieval models, such as the Vector Space Model, with the basic TF-IDF weight-
ing. The Vector Space Model is a spatial representation of text documents, where
each document is represented by a vector in an n-dimensional space (known as a bag
of words), and each dimension is related to a term from the overall vocabulary
of a specific document collection. Examples of systems that employ this type of
content filtering are [35, 36, 37, 38]. Due to the fact that the approach based on
a simple bag of words is not able to perform a semantic disambiguation of the
words in an item description, content-based recommender systems evolved and
started employing external sources of knowledge (e.g., ontologies) and semantic
analysis tools, to improve their accuracy [39, 40, 41].
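To illustrate the basic retrieval model discussed above, the following minimal sketch (in Python, with hypothetical item descriptions that are not taken from any dataset used in this thesis) builds TF-IDF vectors for a small set of item descriptions, averages the vectors of the items a user liked into a profile, and scores an unseen item by cosine similarity; it covers only the plain bag-of-words case, not the semantic extensions mentioned above.

import math
from collections import Counter

def tfidf_vectors(docs):
    # Term frequencies per document and document frequency per term.
    tfs = [Counter(doc.lower().split()) for doc in docs]
    df = Counter(term for tf in tfs for term in tf)
    n = len(docs)
    vectors = []
    for tf in tfs:
        total = sum(tf.values())
        vectors.append({t: (f / total) * math.log(n / df[t]) for t, f in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical item descriptions; the user positively rated items 0 and 1.
items = ["space adventure on a distant planet",
         "alien invasion and space battles",
         "an adventure movie set in paris"]
vectors = tfidf_vectors(items)
profile = Counter()
for liked in (0, 1):                      # user profile = average of the liked item vectors
    for term, weight in vectors[liked].items():
        profile[term] += weight / 2
print(cosine(profile, vectors[2]))        # score of the item not yet evaluated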
Regarding the user profile considered by a recommender system, there is a
common problem that may affect the effectiveness of the obtained results, i.e.,
the capability of the information stored in the user profile to lead toward reliable
recommendations. In order to face the problem of dealing with unreliable in-
formation in a user profile, the state of the art proposes different strategies. Several
approaches, such as [23], take advantage of the Bayesian analysis of the user
provided relevance feedback, in order to detect non-stationary user interests. Also
exploiting the feedback information provided by the users, other approaches such
as [13] make use of a tree-descriptor model to detect shifts in user interests. An-
other technique exploits the knowledge captured in an ontology [25] to obtain the
same result, but in this case it is necessary that the users express their preferences
about items through an explicit rating. In [42, 43, 44], the problem of modeling
semantically correlated items was tackled, but the authors consider a temporal
correlation and not the one between the items and a user profile.
Considering the item incoherence problem, it should be noted that there is
another common issue that afflicts the recommendation approaches. This is a
problem that in the literature is identified as magic barrier. The evaluation models
assume as a ground truth that the transactions made in the past by the users, and
stored in their profiles, are free of noise. This is a concept that has been studied
in [31, 32], where a study aimed to capture the noise in a service that operates in
a synthetic environment was performed.
No approach in the content-based recommendation literature ever studied how
the architecture and the flow of computation might be affected by the item inco-
herence and magic barrier issues. It is then an open problem to be tackled.
2.3.1 Non-personalized Models
The recommender systems based on the so-called non-personalized model [45]
propose to all users the same list of recommendations, without taking into account
their preferences. This static approach is usually based on two algorithms: the first
of them (TopPop) operates by suggesting the most rated items (i.e., those most
popular), while the second (MovieAvg) works by suggesting the highest rated
items (i.e., those most liked).
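As a concrete illustration (a minimal sketch assuming the ratings are available as (user, item, rating) tuples; the variable names are only illustrative), the two baselines can be computed as follows:

from collections import defaultdict

def non_personalized_rankings(ratings):
    """ratings: iterable of (user, item, rating) tuples."""
    counts, sums = defaultdict(int), defaultdict(float)
    for _, item, value in ratings:
        counts[item] += 1
        sums[item] += value
    # TopPop: most rated items first; MovieAvg: highest mean rating first.
    top_pop = sorted(counts, key=lambda i: counts[i], reverse=True)
    movie_avg = sorted(counts, key=lambda i: sums[i] / counts[i], reverse=True)
    return top_pop, movie_avg

ratings = [("u1", "a", 5), ("u2", "a", 2), ("u1", "b", 4), ("u3", "b", 5)]
print(non_personalized_rankings(ratings))   # every user receives the same two lists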
The exclusive use of the non-personalized models leads toward the absence of
two important characteristics that a recommender system should have, i.e., nov-
elty and serendipity [46]. Novelty occurs when a system is able to recommend
unknown items that a user might have autonomously found, while serendipity
happens when it helps the user to find a surprisingly interesting item that she/he
might not have otherwise found, or that is very hard to find.
2.3.2 Latent Factor Models
The type of data with which a recommendation system operates is typically a
sparse matrix where the rows represent the users, and the columns represent the
items. The entries of this matrix are the interactions between users and items, in
the form of ratings or purchases. The aim of a recommender system is to infer,
for each user u, a ranked list of items, and in the literature many approaches are focused
on the rating prediction problem [1]. The most effective strategies in this field
exploit the so-called latent factor models and, especially, the matrix factorization
techniques [47].
Other CF ranking-oriented approaches that extend the matrix factorization
techniques have been recently proposed, and most of them use a ranking-oriented
objective function in order to learn the latent factors of users and items [48].
SVD++, Koren's version of the Singular Value Decomposition (SVD) [49], is
today considered one of the best strategies in terms of accuracy and scalability.
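To make the latent factor idea concrete, the following sketch implements a plain matrix factorization trained by stochastic gradient descent on a toy rating matrix; it is only an illustration of the general technique, not of SVD++ itself, which additionally models biases and implicit feedback.

import random

def factorize(ratings, n_users, n_items, k=2, epochs=200, lr=0.01, reg=0.05):
    """ratings: list of (user, item, rating); returns user factors P and item factors Q."""
    random.seed(0)
    P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)   # gradient step on the user factor
                Q[i][f] += lr * (err * pu - reg * qi)   # gradient step on the item factor
    return P, Q

# Toy 3x3 rating matrix; the entry (user 2, item 2) is unknown and gets predicted.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 4)]
P, Q = factorize(ratings, n_users=3, n_items=3)
print(sum(P[2][f] * Q[2][f] for f in range(2)))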
Chapter 3
User Segmentation Systems
3.1 Introduction
Behavioral targeting addresses ads to a set of users who share common properties.
In order to choose the set of target users that will be advertised with a specific ad,
a segmentation that partitions the users and identifies groups that are meaningful
and different enough is first performed.
3.2 Latent Space Discovering
The user segmentation is a process aimed at partitioning the potential audience
of an advertiser into several classes, according to specific criteria. Almost all the
existing approaches take into account only the explicit preferences of the users,
without considering the hidden semantics embedded in their choices, so the target
definition is affected by widely-known problems. The most important is that eas-
ily understandable segments are not effective for marketing purposes due to their
triviality, whereas more complex segmentations are hard to understand. For this
reason, the definition of a new strategy able to perform a non-trivial grouping of
the users is an open problem.
3.2.1 Behavioral Targeting
A wide variety of behavioral targeting approaches has been designed by the indus-
try and developed as working products. Google's AdWords1 performs different
types of targeting to present ads to users; the closest to our proposal is the "Topic
targeting", in which the system groups and reaches the users interested in a spe-
cific topic. DoubleClick2 is another system employed by Google that exploits
features such as browser information and the monitoring of the browsing ses-
sions. In order to reach segments that contain similar users, Facebook offers Core
Audiences3, a tool that allows advertisers to target users with similar location, de-
mographics, interests, or behaviors; in particular, the interest-based segmentation
allows advertisers to choose a topic and target a segment of users interested in
it. Among its user targeting strategies, Amazon offers the so-called Interest-based
ads policy4, a service that detects and targets segments of users with similar inter-
ests, based on what the users purchased and visited, and by monitoring different forms
1 https://support.google.com/adwords/answer/1704368?hl=en
2 https://www.google.com/doubleclick/
3 https://www.facebook.com/business/news/Core-Audiences
4 http://www.amazon.com/b?node=5160028011
of interaction with the website (e.g., the Amazon Browser Bar). SpecificMedia5
uses anonymous web surfing data in order to compute a user's purchase prediction
score. Yahoo! Behavioral Targeting6 creates a model from the online interactions
of the users, such as searches, page-views, and ad interactions, to predict the set of
users to target. Other commercial systems, such as Almond Net7, Burst8, Phorm9,
and Revenue Science10, include behavioral targeting features.
Research studies, such as the one presented by Yan et al. [50], show that an
accurate monitoring of the click-through log of advertisements collected from
a commercial search engine can help online advertising. Beales [51] collected
data from online advertising networks and showed that a behavioral targeting per-
formed by exploiting prices and conversion rates (i.e., the likelihood of a click
to lead to a sale) is twice as effective as traditional advertising. Chen et
al. [52] presented a scalable approach to behavioral targeting, based on a linear
Poisson regression model that uses granular events (such as individual ad clicks
and search queries) as features. Approaches to exploit the semantics [53, 54] or
the capabilities of a recommender system [4, 5, 6] to improve the effectiveness of
the advertising have been proposed, but none of them generates segments of target
users.
5 http://specificmedia.com/
6 http://advertising.stltoday.com/content/behavioral FAQ.pdf
7 http://www.almondnet.com/
8 http://www.burstmedia.com/
9 http://www.phorm.com/
10 http://www.revenuescience.com/
3.2.2 Reliability of a semantic query analysis
In the literature it has been highlighted that half of the time users need to reformu-
late their queries, in order to satisfy their information need [55, 56, 57]. Therefore,
the semantic analysis of a query is not a reliable source of information, since it
does not contain any information about whether or not a query led to what the
user was really looking for. Moreover, performing a semantic analysis on the
items evaluated by the users in order to perform a filtering on them can increase
the accuracy of a system [53, 54, 58]. Therefore, a possible way to overcome
this issue would be to perform a semantic analysis on the description of the items
a user positively evaluated through an explicitly given rating. However, another
issue arises in cascade.
3.2.3 Segment Interpretability and Semantic User Segmentation
Choosing the right criteria to segment users is a widely studied problem in the
market segmentation literature, and two main classes of approaches exist. On the
one hand, the a priori [59] or commonsense [60] approach is based on a simple
property, like age, which is used to segment the users. Even though the gen-
erated segments are very easy to understand and they can be generated at a very
low cost, the segmentation process is trivial and even a partitioning with the k-
means clustering algorithm has proven to be more effective than this method [61].
On the other hand, post hoc [62] approaches (also known as a posteriori [59] or
data-driven [60]) combine a set of features (which are known as segmentation
base [63]) in order to create the segmentation. Even though these approaches are
more accurate when partitioning the users, the problem of properly understanding
and interpreting results arises [64, 65]. This is mostly due to the lack of guidance
on how to interpret the results of a segmentation [66].
Regarding the literature on behavioral user segmentation, Bian et al. [67] pre-
sented an approach to leverage historical user activity on real-world Web portal
services to build behavior-driven user segmentation. Yao et al. [68] adopted SOM-
Ward clustering (i.e., Self Organizing Maps, combined with Ward clustering), to
segment a set of customers based on their demographic and behavioral character-
istic. Zhou et al. [69] performed a user segmentation based on a mixture of factor
analyzers (MFA) that consider the navigational behavior of the user in a browsing
session. Regarding the semantic approaches to user segmentation, Tu and Lu [70]
and Gong et al. [71] both proposed approaches based on a semantic analysis of
the queries issued by the user through Latent Dirichlet Allocation-based models,
in which users with similar query and click behaviors are grouped together. Simi-
larly, Wu et al. [72] performed a semantic user segmentation by adopting a Prob-
abilistic Latent Semantic Approach on the user queries. As this analysis showed,
none of the behavioral targeting approaches exploits the interactions of the users
with a website in the form of a positive rating given to an item.
3.2.4 Preference Stability
Burke and Ramezani highlighted that some domains are characterized by a sta-
bility of the preferences over time [73]. Preference stability leads also to the fact
that when users get in touch with diverse items, diversity is not valued [74]. On
the one side, users tend to access agreeable information (a phenomenon known
as filter bubble [75]) and this leads to the overspecialization problem [33], while
on the other side they do not want to face diversity. Another well-known prob-
lem is the so called selective exposure, i.e., the tendency of users to make their
choices (goods or services) based only on their usual preferences, which excludes
the possibility for the users to find new items that may be of interest to them [76].
The literature presents several approaches that try to reduce this problem, e.g.,
NewsCube [77] operates by offering to the users several points of view, in order to
stimulate them to make different and unusual choices.
3.2.5 Item Descriptions Analysis
For many years the item descriptions were analyzed with a word vector space
model, where all the words of each item description are processed by TF-IDF [78]
and stored in a weighted vector of words. Due to the fact that this approach based
on a simple bag of words is not able to perform a semantic disambiguation of the
words in an item description, and motivated by the fact that exploiting a taxonomy
for categorization purposes is an approach recognized in the literature [79], we
decided to exploit the functionalities offered by the WordNet environment. More
details about the bag of words and WordNet are reported in Appendix A.
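As a small illustration of the kind of information WordNet exposes (a sketch assuming the NLTK interface to WordNet is available; it is not the exact procedure adopted in this thesis), the words of an item description can be mapped to synsets and compared through the taxonomy:

# Requires the NLTK library and its WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

description = "a thrilling detective story set in a small village"
for word in description.split():
    synsets = wn.synsets(word)
    if synsets:
        # The first synset is a rough sense; a real system would disambiguate by context.
        print(word, "->", synsets[0].name(),
              "| hypernyms:", [h.name() for h in synsets[0].hypernyms()])

# Taxonomy-based similarity between two item-description terms.
print(wn.synsets("detective")[0].path_similarity(wn.synsets("policeman")[0]))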
Chapter 4
Fraud Detection Systems
4.1 Introduction
Nowadays, any business that carries out activities on the Internet and accepts pay-
ments through debit or credit cards, also implicitly accepts all the risks related to
them, such as the possibility that some transactions may be fraudulent. Although these risks can lead to
significant economic losses, nearly all the companies continue to use these pow-
erful instruments of payment, as the benefits derived from them will outweigh the
potential risks involved.
Fraud is one of the major issues related to the use of debit and credit cards,
considering that these instruments of payment are becoming the most popular
way to conclude every financial transaction, both online and in a traditional way.
According to a study conducted some years ago by the American Association of
Fraud Examiners1, frauds related to financial operations represent 10-15% of
all fraud cases. However, this type of fraud accounts for 75-80% of
all involved finances, with an estimated average loss per fraud case of 2 million
dollars in the USA alone. The search for efficient ways to face this problem
has become an increasingly crucial imperative in order to eliminate, or at least
minimize, the related economic losses.
4.2 A Proactive Approach for the Detection of Fraud Attempts
As highlighted in many studies, frauds represent the biggest problem in the E-
commerce environment. The credit card fraud detection represents one of the
most important contexts, where the challenge is the detection of a potential fraud
in a transaction, through the analysis of its features (i.e., description, date, amount,
and so on), exploiting a user model built on the basis of the past transactions of the
user. In [80], the authors show how in the field of automatic fraud detection there
is a lack of publicly available real datasets, indispensable to conduct experiments,
as well as a lack of publications about the related methods and techniques.
The most common causes of this problem are the policies (for instance, com-
petitive and legal) that usually stand behind every E-commerce activity, which
makes it very difficult to obtain real data from businesses. Furthermore, such datasets
composed of real information about user transactions could also reveal the poten-
1http://www.acfe.com
tial vulnerabilities in the related E-commerce infrastructure, with a subsequent
loss of trust.
The literature underlines how the two main issues in this field are represented by
the data unbalance (i.e., the number of fraudulent transactions is typically much
smaller than legitimate ones), and by the overlapping of the classes of expense of a
user (i.e., due to the scarcity of information that characterizes a typical record of a
financial transaction). A novel approach able to face these two aspects then
represents a challenging problem.
4.2.1 Supervised and Unsupervised Approaches
In [81] it is underlined how the unsupervised fraud detection strategies are still a
very big challenge in the field of E-commerce. Bolton and Hand [82] show how
it is possible to face the problem with strategies based both on statistics and on
Artificial Intelligence (AI), two effective approaches in this field able to exploit
powerful instruments (such as the Artificial Neural Networks) in order to get their
results.
In spite of the fact that every supervised strategy in fraud detection needs a reli-
able training set, the work proposed in [82] takes into consideration the possibility
to adopt an unsupervised approach during the fraud detection process, when no
dataset of reference containing an adequate number of transactions (legitimate and
non-legitimate) is available. Another approach based on two data mining strate-
gies (Random Forests and Support Vector Machines) is introduced in [83], where
the effectiveness of these methods in this field is discussed.
4.2.2 Data Unbalance
The unbalance of the transaction data represents one of the most relevant issues
in this context, since almost all of the learning approaches are not able to operate
with this kind of data structure [84], i.e., when an excessive difference between the
instances of each class of data exists. The unbalanced training sets represent one
of the biggest problems in the context of supervised learning, because the pres-
ence of a huge disproportion in the number of instances of the classes generates a
wrong classification of the new cases (i.e., they are assigned to the majority class).
This happens because the canonical learning approaches are not able to perform a
correct classification of the new cases in such contexts, in fact they report a good
accuracy only for the cases that belong to the majority class, reporting unaccept-
able values of accuracy for the other cases that belong to the minority class. In
other words, it means that it is possible that, in the presence of data unbalance, a clas-
sifier predicts all the new cases as belonging to the majority class, ignoring the minority
class.
To face this problem, several techniques of pre-processing have been devel-
oped, aimed to balance the set of data [85], and they can be grouped into three
main categories: sampling based, algorithms based, and feature-selection based.
Sampling based: this is a pre-processing strategy that faces the problem by re-
sampling the set of data. The sampling can be performed in different ways:
by under-sampling the majority class, by over-sampling the minority class, or
by a hybrid sampling that combines these two approaches. The under-sampling
technique randomly removes transactions until the balance has been reached,
while the specular over-sampling technique obtains the balance by adding new
transactions, created through an interpolation of the elements that belong to the same
class [86].
Algorithms based: this strategy is aimed to optimize the performance of the
learning algorithm on unseen data. A single-class learning method is used to
recognize the cases that belong to that class, rejecting the other ones. This is a
strategy that in some contexts (i.e., the multi-dimensional data sets) gives better
performance than the other strategies [87].
Feature-selection based: this strategy operates by selecting a subset of fea-
tures (defined by the user) that allows a classifier to reach the optimal perfor-
mance. In the case of big data sets, some filters are used in order to score each
single feature, on the basis of a rule [87].
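A minimal sketch of the sampling-based strategy described above (random under-sampling and naive over-sampling by duplication; the interpolation-based over-sampling of [86] is omitted for brevity, and the data are purely illustrative) could look as follows:

import random

def balance(majority, minority, mode="under"):
    """majority, minority: lists of transactions (any representation)."""
    random.seed(0)
    if mode == "under":
        # Randomly drop majority-class transactions until the two classes match.
        return random.sample(majority, len(minority)), minority
    # Naive over-sampling: replicate random minority-class transactions.
    extra = [random.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

legitimate = [("legitimate", i) for i in range(1000)]
fraudulent = [("fraudulent", i) for i in range(20)]
maj, mino = balance(legitimate, fraudulent, mode="under")
print(len(maj), len(mino))   # 20 20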
4.2.3 Detection Models
The static approach [88] represents a canonical way to detect fraudulent
events in a stream of transactions. It is based on the initial building of a user
model, which is used for a long period of time before its rebuilding. It is an approach
characterized by a simple learning phase, but not able to follow the changes in
user behavior over time.
In the static approach, the data stream is divided into blocks of the same size,
and the user model is trained by using a certain number of initial and contiguous
blocks of the sequence, which are used to infer the future blocks. In the so-called
updating approach [89], instead, when a new block appears, the user model is
trained by using a certain number of the latest contiguous blocks of the sequence;
the model can then be used to infer the future blocks, or aggregated into a big
model composed of several models. In another strategy, based on the so-called
forgetting approach [90], a user model is defined at each new block, by using a
small number of non-fraudulent transactions, extracted from the last two blocks,
but keeping all previous fraudulent ones. Also in this case, the model can be used
to infer the future blocks, or aggregated into a big model composed of several
models.
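To make the difference between the static and updating strategies concrete, the following sketch (with an arbitrary block size and window length, and without the fraud-preserving logic of the forgetting approach) shows which blocks would be used to train the model that evaluates a given block:

def split_into_blocks(stream, block_size):
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]

def training_blocks(blocks, current, n_train, strategy="static"):
    """Blocks used to train the model that evaluates block number `current`."""
    if strategy == "static":
        return blocks[:n_train]                            # first blocks, never rebuilt
    if strategy == "updating":
        return blocks[max(0, current - n_train):current]   # latest contiguous blocks
    raise ValueError(strategy)

transactions = list(range(20))                 # a toy stream of 20 transactions
blocks = split_into_blocks(transactions, block_size=5)
print(training_blocks(blocks, current=3, n_train=2, strategy="static"))
print(training_blocks(blocks, current=3, n_train=2, strategy="updating"))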
The main disadvantages related to these approaches of user modeling are: the
incapacity to follow the changes in the users' behavior, in the case of the static
approach; the ineffectiveness in operating in the context of small classes, in the
case of the updating approach; the computational complexity in the case of the
forgetting approach.
There are several kinds of approaches that are used in this context, such as
those based on Data Mining [91], Artificial Intelligence [92], Fuzzy Logic [93],
Machine Learning [94], or Genetic Programming [80]. However, regardless of the
used approach, the problem of the non-stationary distribution of the data, as well
as that of the unbalanced class distribution, remains unaltered.
4.2.4 Differences with the proposed approach
The proposed approach introduces a novel strategy that, firstly, takes into account
all elements of a transaction (i.e., numeric and non-numeric), reducing the prob-
lem related to the lack of information, which leads toward an overlapping of
the classes of expense [95]. The introduction of the Transaction Determinant
Field (TDF) set also allows us to give more importance to certain elements of the
transaction during the model building. Secondly, differently from the canonical
approaches at the state of the art, the proposed approach is not based on a unique
model, but instead on multiple user models that involve the entire set of data. This
allows us to evaluate a new transaction by comparing it with a series of behavioral
models related to many parts of the user transaction history.
The main advantage of this strategy is the reduction, or removal, of the issues
related with the stationary distribution of the data, and the unbalancing of the
classes. This because the operative domain is represented by the limited event
blocks, and not by the entire dataset. The discretization of the models, according
to a certain value of d, permit us to adjust their sensitivity to the peculiarities of
the operating environment.
In more details, regarding the analysis of the textual information related to
the transactions, the literature presents several ways to operate, and most of them
work in accord with the bag-of-words model, an approach where the words (for
instance, type and description of the transaction) are processed without taking into
account of the correlation between terms [23, 13].
This trivial way to manage the information does not usually lead toward good results, and for this reason the basic approaches are usually flanked by complementary techniques aimed at improving their effectiveness [54, 79], or they are replaced by more sophisticated alternatives based on the semantic analysis of the text [96], which proved to be effective in many contexts, such as the recommendation one [58].
Considering the nature of the textual data related to a financial transaction, the adoption of semantic techniques could lead toward false alarms, as could a trivial technique based on a simple matching between words. This happens because a conceptual extension of the textual field of a transaction could evaluate as similar two transactions that are instead very different, while a simple matching technique could lead to consider as different some strings of text, due to the existence of slight differences (e.g., plural forms instead of singular, different words with a common root, and so on). For this reason, this work adopts the Levenshtein Distance described in Appendix B, a metric that measures the similarity between two textual fields in terms of the minimal number of insertions, deletions, and replacements needed to transform the content of the first field into the content of the second one.
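As a concrete illustration of this metric, the following minimal sketch (in Java, the language used for the experimental environment of this thesis) computes the Levenshtein Distance between two textual fields with the standard dynamic-programming formulation; the class and method names are illustrative and are not part of the system described above.

```java
// Minimal sketch: Levenshtein Distance between two textual fields
// (e.g., the "description" field of two transactions).
public final class Levenshtein {

    // Returns the minimum number of insertions, deletions and
    // replacements needed to transform s into t.
    public static int distance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i; // deletions only
        for (int j = 0; j <= t.length(); j++) d[0][j] = j; // insertions only
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // replacement
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        // Plural vs. singular form: a small distance, so the two fields
        // can be treated as referring to the same concept.
        System.out.println(distance("restaurant", "restaurants")); // prints 1
    }
}
```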
On the Role of Similarity and Diversity
in Recommender Systems
Preface
The concepts of similarity and diversity are here taken into account in order to improve the reliability of the user profiles (i.e., making them as close as possible to the real tastes of the users), in the context of recommender systems. Subsequently, the same concepts of similarity/diversity have been exploited to improve the decision making process of a recommender system that operates in the e-commerce environment.
Chapter 5
User Profiling
5.1 Introduction
The main motivation behind this work is that most of the solutions regarding user profiling involve the interpretation of the whole set of items previously evaluated by a user, in order to measure their similarity with those that she/he has not considered yet, and recommend the most similar items. Indeed, the recommendation process is usually based on the principle that users' preferences remain unchanged over time; this can be true in many cases, but it is not the norm, due to the existence of temporal dynamics in their preferences. Therefore, as discussed in Chapter 2.2, a static approach to user profiling can lead toward wrong results due to various factors, such as a simple change of tastes over time or the temporary use of an account by other people. Several works have shown that the user ratings can be considered as outliers, due to the fact that the same user may rate the same item with different ratings at different moments in time. This is a well-known problem, which in the literature is defined as the magic barrier (Chapter 2.2.3).
Given that the user profiling context taken into consideration is the one related to recommender systems, where such an activity has a primary role, the proposed approach aims to evaluate the similarity between a single item and the others within the user profile, in order to improve the recommendation process by discarding the items that are highly dissimilar from the rest of the user profile. Considering that the literature does not provide a formalization of a system architecture that implements such a mechanism of profile cleaning, before moving toward the implementation steps, this research defines (first at a high level, then in detail) such an architecture.
5.2 Architecture
5.2.1 A State-of-the-Art Architecture for Content-based Recommender
Systems
This section will present the high-level architecture of a content-based recom-
mender system proposed in [33] and presented in Figure 5.1. In order to highlight
the limits of this architecture and present our proposal, we will explore it by pre-
senting the flow of the computation of a system that employs it.
Figure 5.1: Architecture of a content-based recommender system.
The description of the items usually has no structure (e.g., text), so it is necessary to perform some pre-processing steps to extract information from it. Given an Information Source, represented by the Item Descriptions (e.g., product descriptions, Web pages, news, etc.) that will be processed during the filtering, the first component employed by a system is a Content Analyzer. The component converts each item description into a format processable by the following steps (i.e., keywords, n-grams, concepts, etc.), thanks to the employment of feature extraction tools and techniques. The output generated by this component is a Structured Item Representation, stored in a Represented Items repository.
Out of all the represented items, the system considers the ones evaluated by each active user u_a to whom recommendations have to be provided (User u_a training examples), in order to build a profile that contains the preferences of the user. This task is accomplished by a Profile Learner component, which employs Machine Learning algorithms to combine the structured item representations into a unique model. The output produced by the component is a user profile, stored in a Profiles repository.
The recommendation task is performed by a Filtering Component, which compares the output of the two previous components (i.e., the profile of the active user and a set of items she/he has not evaluated yet). Given a new item representation, the component predicts whether or not the item is suitable for the active user u_a, usually with a value that indicates its relevance with respect to the user profile. The filtered items are ranked by relevance and the top-n items in the ranking represent the output produced by the component, i.e., a List of Recommendations.
The List of Recommendations is proposed to the active user u_a, who either accepts or rejects the recommended items (e.g., by watching a recommended movie, or by buying a recommended item), providing a feedback on them (User u_a feedback), stored in a Feedback repository.
The feedback provided by the active user is then used by the system to update her/his user profile.
Limits at the State of the Art and Design Guidelines
In the previous section, we presented the state-of-the-art architecture of a content-based recommender system. We will now present the possible problems that might occur by employing it, and provide design guidelines on how to improve it. These problems are presented through possible use cases/scenarios.
Scenario 1. The account of the active user is used by another person, who evaluates items that the user would have never evaluated (e.g., she/he buys items that the active user would have never bought). This would lead to the presence of noise in a user profile, since the Structured Item Representation of these items, incoherent with respect to the user profile, would be considered by the Profile Learner component, which would make them part of the user u_a profile, stored as it is in the Profiles repository, and employed in the recommendation process by the Filtering Component. This would generate bad recommendations, and the accuracy of the system would be strongly affected.
Scenario 2. The preferences of the active user change over time, but the oldest items, which do not reflect the current preferences of the user but were positively evaluated by her/him, are still part of the user profile. A form of aging of the items in a user profile would allow the system to ignore such items after some time, but until that moment those items would represent noise. That noise might affect the system for a long time, since the aging process is usually gradual and the items age slowly. Again, this would affect the recommendation accuracy.
Scenario 3. If a mix of the two previous scenarios occurs and these types of problems are iterated over time, the system would reach the so-called magic barrier. As previously highlighted, the problem has been widely studied in the Collaborative Filtering literature, in order to identify and remove the noisy items based on their ratings. No work in the literature has studied the magic barrier from a content-based point of view, so the state-of-the-art architecture previously presented is limited also from that perspective.
The three previously presented scenarios highlight that the architecture of a content-based system should be able to deal with the presence of incoherent items in the user profiling process, in order to avoid the aforementioned problems. Therefore, we will now present design guidelines on how to improve the state-of-the-art architecture of a system.
The first scenario highlighted the need for a system to detect how coherent an item is with the rest of the items that have been evaluated by a user, in order to detect the presence of noise. This could be done by comparing the content of the item (i.e., the structured item representation) with that of the other items evaluated by the user (the user u_a training examples).
Scenario 2 confirms the need for a system to evaluate the temporal correlation of an item with the rest of the items in the user profile. Indeed, if an item is too old and, as previously said, too different with respect to the other items, it should be removed from a user profile.
Both the second and the third scenarios highlighted that the time during which noisy/incoherent items stay in a user profile should be reduced to a very limited amount. In particular, thanks to scenario 3 we know that these items should not be ignored gradually, but the system should be able to perform a one-off removal. This would allow the filtering component to consider only items that are coherent with each other and with the preferences of the users.
In Section 5.2.2 we adopt these design guidelines to present an architecture that overcomes these issues.
5.2.2 Recommender Systems Architecture
Overview
This section proposes my novel architecture. The updated high-level architecture of the system is first presented, followed by the details of the novel component that faces the open problems highlighted in the previous section. This part closes with a brief analysis that shows how this proposal fits the development of a real-world system in the big data era.
Proposed Solutions
This part of the research first analyzes the state-of-the-art architecture of a content-based recommender system, then it explores in detail the possible problems that might occur by employing it. Some design guidelines on how to enrich that architecture are proposed, and a novel architecture, which allows the system to tackle the highlighted problems and improve the effectiveness of the recommendation process, is presented.
Even though we focus on the emerging application domain previously mentioned (i.e., the semantics-aware systems), we also show the usefulness of our proposal on the classic content-based approach.
The scientific contributions coming from this part of the thesis are now summarized:
- it analyzes the state-of-the-art architecture of a content-based recommender system to study, for the first time in the literature, what might happen in the recommendation process if incoherent items are filtered by the system;
- this is the first study in which the magic barrier problem is studied in a content-based recommender system and from the architectural point of view;
- it presents design guidelines and a novel architecture, in order to improve the existing one and overcome the aforementioned issues;
- it analyzes the impact of the components introduced in the proposed architecture from a computational cost point of view.
Approach
High-Level Architecture. Figure 5.2 presents an updated version of the state-of-the-art architecture illustrated in Section 5.2.1. The proposed architecture integrates a novel component, named Profile Cleaner, whose aim is to analyze a profile and remove the incoherent items before storing it in the Profiles repository. In order to solve the previous problems, the component should be able to remove an item if it meets the following two conditions:
1. the coherence/content-based similarity of the item with the rest of the profile is under a Minimum Coherence threshold value;
2. it is located in the first part of the user iteration history. Based on this requirement, an item is considered far from the user's preferences only when it falls in the first part of the iterations (i.e., when the distance from the last evaluated item is higher than a Maximum Temporal Distance threshold).
By removing the incoherent old items, the Filtering Component would consider only the real preferences of the users, and the previously mentioned problems are solved. Indeed, by checking that both conditions are met, the system avoids
removing from a profile the items that are diverse from those she/he previously
considered, but that might be associated to a recent change in the preferences of
the user.
Regarding scenario 1, if among the user u_a training examples there is an incoherent item evaluated by a third party, it would be detected by the component, since it receives it as an input. Regarding scenarios 2 and 3, by checking the temporal correlation of an item with the others in the user profile, the component would be able to remove an item as soon as it becomes old and incoherent, avoiding the problems related to the aging strategies (which might still be employed by the Profile Learner, but are not enough) and to the presence of too many incoherent items that would lead to the magic barrier problem.
Low-level Representation of the Profile Cleaner. Figure 5.3 further inspects the component introduced in this novel architecture, presenting a low-level analysis and the subcomponents it should employ to accomplish its task.
As Figure 5.2 showed, the Profile Cleaner takes as input both an item i that a user has evaluated (i.e., one of the training examples or one of the feedbacks provided by a user) and a user profile.
Figure 5.2: Architecture of a semantics-aware content-based recommender system.
Figure 5.3: Architectural organization of the profile cleaner task.
The Items Coherence Analyzer subcomponent compares the structured representation of an item i with the rest of the user profile, in order to detect the coherence/similarity of the item with the rest of the profile. If the Structured Item Representation involves semantic structures (e.g., WordNet synsets), as modern content-based systems do, several metrics can be employed to evaluate the semantic similarity between two structured representations that involve synsets.
The five state-of-the-art ones are Leacock and Chodorow [97], Jiang and Conrath [98], Resnik [99], Lin [100], and Wu and Palmer [101]. However, any type of similarity/coherence measure might be employed, even if no semantic information is available in the item representation (e.g., TF-IDF). The output produced by the subcomponent is an Item i Coherence value, which is later employed by the Items Removal Analyzer subcomponent to decide whether the item should be removed or not.
In parallel, the Temporal Analyzer subcomponent considers how far the evaluation of the considered item was with respect to that of the other items in the user profile (and especially the last evaluated one). The distance threshold might be defined as a fixed value, or by defining regions based on the chronology with which the items have been evaluated (e.g., remove an item if it was evaluated in the first two quartiles, which contain the oldest items). The output is an Item i Temporal Distance value, which is also employed by the Items Removal Analyzer subcomponent.
The output of the two previous subcomponents is then handled by the Items Removal Analyzer, which also receives as input the Minimum Coherence and Maximum Temporal Distance thresholds, and decides whether the considered item i should be removed from a user profile or not. The output produced by the subcomponent (and by the Profile Cleaner main component) is a cleaned user u_a profile, which does not contain the incoherent and oldest items.
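A minimal sketch of the decision taken by the Items Removal Analyzer is given below; the class and field names are illustrative, and the coherence value, the temporal distance, and the two thresholds are assumed to be already available as plain numeric values.

```java
// Minimal sketch of the removal decision: an item is removed only if it is
// both incoherent with the profile and old enough (outside the recent region).
public final class ItemsRemovalAnalyzer {

    private final double minimumCoherence;         // Minimum Coherence threshold
    private final double maximumTemporalDistance;  // Maximum Temporal Distance threshold

    public ItemsRemovalAnalyzer(double minimumCoherence, double maximumTemporalDistance) {
        this.minimumCoherence = minimumCoherence;
        this.maximumTemporalDistance = maximumTemporalDistance;
    }

    // coherence: the Item i Coherence produced by the Items Coherence Analyzer
    // temporalDistance: the Item i Temporal Distance produced by the Temporal Analyzer
    public boolean shouldRemove(double coherence, double temporalDistance) {
        boolean incoherent = coherence < minimumCoherence;
        boolean tooOld = temporalDistance > maximumTemporalDistance;
        return incoherent && tooOld; // both conditions must hold
    }
}
```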
Developing a System that Employs this Architecture. It is natural to think that the introduction of a Profile Cleaner component, even if useful, might lead to heavy tasks to be computed by the system. Indeed, the component has to deal with a comparison between each item and the rest of the user profile, and this similarity might involve semantic elements and measures, which are usually very heavy to compute. Given the widely-known big data problem that characterizes and affects every system nowadays, this work inspects how to develop this component in real-world scenarios.
Indeed, the computation of the coherence of each new item with the rest of the user profile might be distributed over different computers, by employing large-scale distributed computing models like MapReduce. Moreover, this process can be handled in the background by the system, since when a user evaluates a new item, it would hardly make any instant difference on the computed recommendations. Therefore, if an incoherent item gets removed in a reasonable time and with a distributed approach, the employment of the Profile Cleaner component would be both effective and efficient at the same time.
Moreover, we studied the structure of the Profile Cleaner component to let it run its two subcomponents in parallel, so that also under this perspective the process can be parallelized and efficient.
In conclusion, we believe that even if we are introducing a possibly heavy computational process, the improvements in terms of accuracy and the structure of the component would overcome the complexity limits. Moreover, this complexity would also be efficiently dealt with by the current technologies employed to face the big data problems (e.g., Hadoop's MapReduce).
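As an informal illustration of how the per-item coherence computation could be parallelized, the following sketch uses Java parallel streams in place of a full MapReduce deployment; the generic Item type and the similarity function are placeholders, and the sketch assumes that the profile items are distinct objects.

```java
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.stream.Collectors;

// Minimal sketch: compute the coherence of every profile item in parallel.
// In a real deployment this "map" step could instead be distributed over
// several machines (e.g., with Hadoop's MapReduce).
public final class ParallelCoherence {

    public static <Item> Map<Item, Double> coherence(
            List<Item> profile,
            BiFunction<Item, List<Item>, Double> similarityToRest) {
        return profile.parallelStream()
                .collect(Collectors.toMap(
                        item -> item, // assumes items are distinct objects
                        item -> similarityToRest.apply(
                                item,
                                // the rest of the profile, excluding the current item
                                profile.stream()
                                       .filter(other -> other != item)
                                       .collect(Collectors.toList()))));
    }
}
```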
Conclusions and Future Work
This work deals with the problems that might occur with the current way in which content-based recommender systems are engineered and designed.
Given the high impact that emerging aspects, such as the introduction of semantics in the filtering process and the so-called magic barrier problem, are having in research and in real-world recommender systems, it analyzed the current architecture employed by a content-based recommender system and highlighted its current limits. Indeed, it showed that a form of cleaning of the user profiles is necessary in order to overcome these limitations.
It then proposed an updated architecture, which was analyzed both from a high-level point of view and by inspecting the component that allows a system to clean a profile. Moreover, it studied the application of this proposal in real-world scenarios, which would probably be characterized by the big data problem.
Future work will move from the software engineering perspective of this study to develop real-world efficient implementations of this architecture (e.g., on a grid), in order to study its efficiency and effectiveness in scenarios characterized by big data (e.g., the recommendations performed by an e-commerce website).
5.3 Implementation
5.3.1 Overview
Recommender systems usually produce their results based on the interpretation of the whole interaction history of the users. This canonical approach can sometimes lead to wrong results due to several factors, such as changes in the user's taste over time or the use of her/his account by third parties. This research proposes a novel dynamic coherence-based approach that analyzes the information stored in the user profiles based on its coherence.
The main aim is to identify and remove, from the previously evaluated items, those not adherent to the average preferences, in order to make a user profile as close as possible to the user's real tastes. The conducted experiments show the effectiveness of the proposed approach in removing the incoherent items from a user profile, in order to increase the recommendation accuracy.
5.3.2 Proposed Solution
The coherence of an item with respect to the user profile is usually measured, in the literature, as the variance in the feature space that defines the item, typically based on the ratings given by the users [102]. This is done by employing several metrics, such as the entropy, the mean value, or the standard deviation. Differently from the approaches at the state of the art, this research considers the semantic distance between the concepts expressed by each item in a user profile and the concepts expressed by the other ones. This way of proceeding presents a twofold advantage: firstly, it allows us to evaluate the coherence of an item in a more extensive way (by employing semantic concepts) w.r.t. a limited mathematical approach; secondly, it reduces the cause of the magic barrier problem. This happens because the assumption behind the magic barrier problem is the presence of incoherent items in the user profiles. Considering that this approach removes them, keeping in the user profiles only those items that are coherent with each other, it allows us to consider any observed improvement as real, instead of a mere side-effect (i.e., an overfitting).
To perform the task of removing semantically incoherent items from a user profile, this research introduces the Dynamic Coherence-Based Modeling (DCBM), an algorithm based on the concept of Minimum Global Coherence (MGC), a metric that allows us to measure the semantic similarity between a single item and the others within the user profile. Moreover, the algorithm takes into account two other factors, i.e., the position of each item in the chronology of the user choices, and the distance from the mean value of the global similarity (the term global identifies all the items in a user profile). These metrics allow us to remove in a selective way any item that could make the users' profiles non-adherent to their real tastes. The main idea is that the more the information in the user profile is coherent, the more the recommendations based on this profile will be reliable. Differently from other strategies designed for specific contexts, this approach is able to operate in all scenarios. Through it, the process of evaluation of the items' coherence is moved from a domain based on rigorous mathematical criteria (i.e., the variance of the user's ratings in the feature space) to a new semantic domain, which presents a considerable advantage in terms of evaluation flexibility.
In order to evaluate the capability of this approach to produce accurate user
profiles, the DCBM algorithm is implemented into a state-of-the-art semantic-
based recommender system [39], where the accuracy of the recommendations is
evaluated. Since the task of the recommender system that predicts the interest of
the users for the items relies on the information included in a user profile, more
accurate user profiles lead to an improved accuracy of the whole recommender
system. Experimental results show the capability of this approach to remove the
incoherent items from a user profile, increasing the accuracy of recommendations.
The main contributions of this part of the research can be summarized as follows:
- introduction of a novel algorithm able to remove incoherent items from a user profile, with the aim of improving the recommendation accuracy;
- integration of this algorithm into a state-of-the-art recommender system, in order to improve its effectiveness and validate the proposed approach;
- verification on two datasets: a synthetic one, which allows us to analyze the behavior of the proposed approach under different settings, and a real-world one, used to compare the accuracy of the recommender system with and without the removal of the incoherent items.
5.3.3 Adopted Notation
Definition 5.1 (User preferences) We are given a set of users U = {u_1, ..., u_N}, a set of items I = {i_1, ..., i_M}, and a set V of values used to express the user preferences (e.g., V = [1, 5] or V = {like, dislike}). The set of all possible preferences expressed by the users is a ternary relation P ⊆ U × I × V. We denote as P+ ⊆ P the subset of preferences with a positive value (i.e., P+ = {(u, i, v) ∈ P | v ≥ v̄ ∨ v = like}), where v̄ indicates the mean value (in the previous example, v̄ = 3).
Definition 5.2 (User items) Given the set of positive preferences P+, we denote as I+ = {i ∈ I | ∃(u, i, v) ∈ P+} the set of items for which there is a positive preference, and as I_u = {i ∈ I | ∃(u, i, v) ∈ P+ ∧ u ∈ U} the set of items a user u likes.
Definition 5.3 (Item semantic description) Let BoW = {t_1, ..., t_W} be the bag of words used to describe the items in I; we denote as d_i the binary vector used to describe each item i ∈ I (each vector is such that |d_i| = |BoW|). We define as S = {s_1, ..., s_W} the set of synsets associated to BoW (that is, for each term used to describe an item, we consider its associated synset), and as sd_i the semantic description of i. The set of semantic descriptions is denoted as D = {sd_1, ..., sd_M} (note that we have a semantic description for each item, so |D| = |I|). The approach used to extract sd_i from d_i is described in detail in Section 5.3.5.
Definition 5.4 (Semantic user model) Given the set of items positively evaluated by a user, I_u, we define a semantic user model M_u as the set of synsets in the semantic descriptions of the items in I_u. More formally, M_u = {s_w | s_w ∈ sd_m ∧ i_m ∈ I_u}.
Definition 5.5 (Item coherence) An item i ∈ I_u is coherent with the rest of the items in the user profile I_u if the similarity between the semantic description sd_i of the item and the union of the semantic descriptions of the rest of the items (i.e., M_u \ sd_i) is higher than a threshold value.
5.3.4 Problem Definition
Given a set of items I_u that a user likes, the objective is to extract a set Î_u ⊆ I_u such that each item i ∈ Î_u is coherent with the others.
5.3.5 Approach
As already highlighted during the description of the limits that affect the user profiling activity, individual profiles need to be as adherent as possible to the real tastes of the users, because they are used to predict their future interests. For this reason, this section proposes a novel approach, called Dynamic Coherence-Based Modeling (DCBM), able to find and remove the incoherent items within the user profiles, regardless of the profiling method chosen. The implementation of the DCBM in a recommender system is articulated in the following four steps:
1. Data Preprocessing: preprocessing of the text present in the items that compose a user profile, as well as of the text present in the items not yet considered, in order to remove the useless elements and the items with a user rating lower than the average;
2. Semantic Similarity: WordNet features are used to retrieve, from the preprocessed text, all the possible pairs between the WordNet synsets in the text of the items not evaluated and the synsets in the text of the user profile, keeping as a result only the pairs that have at least an element with the same part-of-speech, for which we measure the semantic similarity according to the Wu and Palmer metric;
3. Dynamic Coherence-Based Modeling: the items dissimilar from the average preferences of a user are identified by measuring the Minimum Global Coherence (MGC). Moreover, in accordance with certain criteria, the items that are more semantically distant from the context of a user's real tastes are removed from the user profile;
4. Item Recommendation: to perform the recommendation process, we sort the items not yet evaluated by their similarity with the user profile, and propose to a user a subset of those with the highest values of similarity.
Note that steps 1, 2, and 4 are those followed by a state-of-the-art recommender system based on semantic similarity [39], in which the novel Dynamic Coherence-Based Modeling (DCBM) algorithm (step 3) is integrated, in order to improve the user profile and increase the accuracy of the recommender system.
How each step works is described in detail in the following.
Data Preprocessing
Before comparing the similarity between the items in a user profile, we need to follow several preprocessing steps.
The first step detects the correct part-of-speech (POS) for each word in the text; in order to perform this task, the Stanford Log-linear Part-Of-Speech Tagger [103] has been used.
The second step removes punctuation marks and stop-words, i.e., the insignificant words (such as adjectives, conjunctions, etc.) that represent noise in the semantic analysis. Several stop-word lists can be found on the Internet; this work used a list of 429 stop-words made available with the Onix Text Retrieval Toolkit¹.
The third step, after the determination of the lemma of each word using the Java API implementation for WordNet Searching (JAWS)², performs the so-called word sense disambiguation, a process where the correct sense of each word is determined, which permits us to evaluate the semantic similarity in a precise way. The best sense of each word in a sentence was found using the Java implementation of the adapted Lesk algorithm provided by the Denmark Technical University (DTU) similarity application [104]. All the collected synsets form the set S = {s_1, ..., s_W} defined in Section 5.3.3.
The output of this step is the semantic disambiguation of the textual description of each item i ∈ I, which is stored in a binary vector sd_i; each element sd_i[w] of the vector is 1 if the corresponding synset appears in the item description, and 0 otherwise.
¹ http://www.lextek.com/manuals/onix/stopwords.html
² http://lyle.smu.edu/tspell/jaws/index.html
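Assuming that the word sense disambiguation step has already produced the set of synsets found in the description of an item, the binary vector sd_i can be built as in the following minimal sketch (synsets are represented as plain strings purely for illustration):

```java
import java.util.List;
import java.util.Set;

// Minimal sketch: build the binary semantic description sd_i of an item.
// "synsetVocabulary" plays the role of the set S = {s_1, ..., s_W}, and
// "itemSynsets" is the set of disambiguated synsets found in the item text.
public final class SemanticDescription {

    public static int[] build(List<String> synsetVocabulary, Set<String> itemSynsets) {
        int[] sd = new int[synsetVocabulary.size()];
        for (int w = 0; w < synsetVocabulary.size(); w++) {
            // sd[w] = 1 if the w-th synset appears in the item description
            sd[w] = itemSynsets.contains(synsetVocabulary.get(w)) ? 1 : 0;
        }
        return sd;
    }
}
```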
Semantic Similarity
Although the most used semantic similarity measures are five, i.e., Leacock and Chodorow [97], Jiang and Conrath [98], Resnik [99], Lin [100], and Wu and Palmer [101], and each of them evaluates the semantic similarity between two WordNet synsets, we calculate the semantic similarity by using the Wu and Palmer measure, a method based on the path lengths between a pair of concepts (WordNet synsets), which in the literature is considered to be the most accurate when generating the similarities [105, 39].
Given a set X of i WordNet synsets x_1, x_2, ..., x_i related to an item description, and a set Y of j WordNet synsets y_1, y_2, ..., y_j related to another item description, a set Q, which contains all the possible pairs between the synsets in X and the synsets in Y, is defined as in Equation 5.1.

Q = {⟨x_1, y_1⟩, ⟨x_1, y_2⟩, ..., ⟨x_i, y_j⟩}, ∀x ∈ X, ∀y ∈ Y   (5.1)

In the next step, a subset Z of the pairs in Q (i.e., Z ⊆ Q) that have at least an element with the same POS is created (Equation 5.2).

Z = {(x_i, y_j) | POS(x_i) = POS(y_j)}   (5.2)

The metric measures the similarity between concepts in an ontology, as shown in Equation 5.3.

simWP(x, y) = (2 · A) / (B + C + 2 · A)   (5.3)

Assuming that the Least Common Subsumer (LCS) of two concepts x and y is the most specific concept that is an ancestor of both x and y, where the concept tree is defined by the is-a relation, in Equation 5.3 we have that A = depth(LCS(x, y)), B = length(x, LCS(x, y)), and C = length(y, LCS(x, y)). We can note that B + C represents the path length between x and y, while A indicates the global depth of the path in the taxonomy.
The similarity between two items is defined as the sum of the similarity scores of all pairs, divided by the cardinality of Z (the subset of WordNet synset pairs with a common part-of-speech), as shown in Equation 5.4.

simWP(X, Y) = ( Σ_{(x,y) ∈ Z} simWP(x, y) ) / |Z|   (5.4)
This similarity metric is employed both by the proposed algorithm to compute
the coherence of an item with the rest of the semantic user model, and by the
recommendation algorithm to select and suggest items similar to those that the
user prefers.
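A minimal sketch of Equations 5.3 and 5.4 is reported below; the Synset interface and its part-of-speech, depth, path-length, and LCS operations are placeholders for the corresponding WordNet facilities, so the sketch only illustrates the computation, not the actual WordNet access.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of Equations 5.3 and 5.4. The Synset interface is a
// placeholder for a WordNet synset and its taxonomy operations.
public final class WuPalmer {

    interface Synset {
        String pos();                       // part-of-speech of the synset
        int depth();                        // depth of the synset in the taxonomy
        int pathLengthTo(Synset ancestor);  // edges from this synset to an ancestor
        Synset lcsWith(Synset other);       // Least Common Subsumer of the two synsets
    }

    // Equation 5.3: similarity between two concepts x and y.
    static double simWP(Synset x, Synset y) {
        Synset lcs = x.lcsWith(y);
        double a = lcs.depth();             // A = depth(LCS(x, y))
        double b = x.pathLengthTo(lcs);     // B = length(x, LCS(x, y))
        double c = y.pathLengthTo(lcs);     // C = length(y, LCS(x, y))
        return (2 * a) / (b + c + 2 * a);
    }

    // Equation 5.4: similarity between two item descriptions X and Y,
    // averaged over the pairs Z that share the same part-of-speech.
    static double simWP(List<Synset> X, List<Synset> Y) {
        List<Double> scores = new ArrayList<>();
        for (Synset x : X)
            for (Synset y : Y)
                if (x.pos().equals(y.pos())) // keep only the pairs in Z
                    scores.add(simWP(x, y));
        return scores.isEmpty() ? 0.0
               : scores.stream().mapToDouble(Double::doubleValue).average().getAsDouble();
    }
}
```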
Dynamic Coherence-Based Modeling
For the purpose of being able to make effective recommendations to users, their profiles need to store only the descriptions of the items that really reflect their tastes.
In order to identify which items positively evaluated by a user (i ∈ I_u) do not reflect the user's taste, in that they represent, for instance, the result of past wrong choices or the use of her/his account by third parties, the Dynamic Coherence-Based Modeling (DCBM) algorithm measures the Minimum Global Coherence (MGC) of each single item description with respect to the set of the other items present in her/his profile. In other words, through the MGC, the most dissimilar item with respect to the other items is identified.
The Wu and Palmer similarity metric previously presented can be used to calculate the MGC, as shown in Equation 5.5 (sd_i denotes the semantic description of an item i, and M_u \ sd_i indicates the semantic user model from which the synsets in sd_i have been removed).

MGC = min_{i ∈ I_u} ( simWP(sd_i, M_u \ sd_i) )   (5.5)

The basic idea is to isolate each individual item i in a user profile, semantically described by sd_i, and then measure its similarity with respect to the remaining items (i.e., the merging of the synsets of the rest of the items), in order to obtain a measure of its coherence within the overall context of the entire profile.
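Under the same assumptions of the previous sketch, the MGC of Equation 5.5 can be computed as follows, where each item is represented by the list of synsets in its semantic description:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of Equation 5.5: the Minimum Global Coherence is the minimum,
// over the items i in I_u, of simWP(sd_i, M_u \ sd_i). It reuses the WuPalmer
// sketch above (same package), with one synset list per profile item.
public final class MinimumGlobalCoherence {

    static double mgc(List<List<WuPalmer.Synset>> semanticDescriptions) {
        double mgc = Double.POSITIVE_INFINITY;
        for (int i = 0; i < semanticDescriptions.size(); i++) {
            // Merge the synsets of all the other items (M_u \ sd_i)
            List<WuPalmer.Synset> rest = new ArrayList<>();
            for (int j = 0; j < semanticDescriptions.size(); j++)
                if (j != i) rest.addAll(semanticDescriptions.get(j));
            double gs = WuPalmer.simWP(semanticDescriptions.get(i), rest);
            mgc = Math.min(mgc, gs);
        }
        return mgc;
    }
}
```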
In other words, in order to identify the most distant element from the general context of the evaluated items, we exploit a basic principle of differential calculus, because the MGC value shown above is nothing other than the maximum negative slope, which is calculated by finding the ratio between the change on the y axis and the change on the x axis. This is demonstrated in Theorem 5.1.
Theorem 5.1 The Minimum Global Coherence coefficient corresponds to the maximum negative slope.
Proof: Placing on the x axis the user iterations in chronological order, and on the y axis the corresponding values of GS (Global Similarity), calculated as simWP(sd_i, M_u \ sd_i), ∀i ∈ I_u, we can trivially calculate the slope value (denoted by the letter m), as shown in Equation 5.6.

m = Δy / Δx = ( f(x + Δx) − f(x) ) / Δx   (5.6)

The mathematics of differential calculus defines the slope of a curve at a point as the slope of the tangent line at that point. Since we are working with a series of points, the slope is calculated not at a single point but between two points. Considering that for each user iteration Δx is always equal to 1 (in fact, for N user iterations we have that 1 − 0 = 1, 2 − 1 = 1, ..., N − (N − 1) = 1), the slope value m will always be equal to f(x + Δx) − f(x). As Equation 5.7 shows, where simWP(I_u) denotes simWP(sd_i, M_u \ sd_i), ∀i ∈ I_u, the maximum negative slope consequently corresponds to the value of MGC.

min( Δy / Δx ) = min( simWP(I_u) / 1 ) = MGC   (5.7)

Figure 5.4: The maximum negative slope corresponds to the value of MGC (x axis: user iterations; y axis: GS; the profile is divided into the regions R1, R2, and R3).

Figure 5.4, which displays the data reported in Table 5.1, illustrates this graphically.
In order to avoid the removal of an item that might correspond to a recent change in the tastes of the user, or of an item not semantically distant enough from the context of the remaining items, the DCBM algorithm removes an item only if it meets the following conditions:
1. it is located in the first part of the user iteration history. Based on this first requirement, an item is considered far from the user's tastes only when it falls in the first part of the iterations. This condition is checked thanks to a parameter r, taken as input by the algorithm, which defines the removal area, i.e., the percentage of a user profile where an item can be removed. Note that 0 ≤ r ≤ 1, so in the example in Figure 5.4, r = 2/3 = 0.66 (i.e., the element related to the MGC value is located in the region R3, so it does not meet this first requirement);
2. the value of MGC must be within a tolerance range, which takes into account the mean value of the global similarity (by global we mean all the items in the user profile).

Table 5.1: User profile sample data

x    y       m
1    0.2884  +0.2884
2    0.2967  +0.0083
3    0.2772  -0.0195
4    0.3202  +0.0430
5    0.2724  -0.0478
6    0.2886  +0.0162
7    0.2708  -0.0178
8    0.3066  +0.0358
9    0.3188  +0.0122
10   0.2691  -0.0497
11   0.2878  +0.0187
Regarding the first requirement, it should be noted that the extension of the regions is strongly related both to the type of items and to the frequency of their fruition, so it depends on the operative scenario.
With respect to the second requirement, we prevent the removal of items when they do not have a significant semantic distance from the remaining items. For this reason, we first calculate the value of the mean similarity in the context of the user profile, then we define a threshold value that determines when an item must be considered incoherent with respect to the current context. Equation 5.8 measures the mean similarity, denoted by GS, obtained by calculating the average of the Global Similarity (GS) values, which are computed as simWP(sd_i, M_u \ sd_i), ∀i ∈ I_u.

GS = (1 / |I_u|) · Σ_{i ∈ I_u} simWP(sd_i, M_u \ sd_i)   (5.8)

where |I_u| represents the total number of items stored in the profile (in the case of the sample data shown in Table 5.1, GS = 0.2906). Once this average value is obtained, we can proceed to define the condition ρ used to decide when an item has to be (1) or not to be (0) removed, based on a tolerance value α applied to the average value GS (as shown in Equation 5.9; in this example we defined the tolerance value as one-eighth of the average, i.e., α = GS/8).

ρ = 1, if MGC < (GS − α);  0, otherwise   (5.9)
Based on the above considerations, we can now define Algorithm 1, used to remove the semantically incoherent items from a user profile. The algorithm requires as input the set I_u (i.e., the user profile), a parameter α used to define the accepted distance of an item from the average, and a removal area r used to define in which part of the profile an item can be removed. In step 3 we extract the set of synsets M_u (Definition 5.4) from the description of the items in the user profile I_u (Definition 5.2). Steps 4-6 compute the similarity between the semantic description of each item and the rest of the user profile. In step 7, the average of the similarities is computed, so that in steps 8-15 we can evaluate whether an item has to be removed from a user profile or not. In particular, once an item m_i is removed from a profile in step 12, its associated similarity s is removed from the list S (step 13), so that the MGC in step 9 can be set as the minimum similarity value after the item removal. In step 16, the algorithm returns the user profile I_u after the removable incoherent items in the first part of the user profile have been removed.
Algorithm 1 DCBM Algorithm
Require: I_u = set of items in the user profile, α = threshold value, r = removal area
1:  procedure Process(I_u, α, r)
2:    N = |I_u|
3:    M_u = GetSynsets(I_u)
4:    for each pair p = (sd_i, M_u \ sd_i) in I_u do
5:      S ← simWP(p)
6:    end for
7:    a = Average(S)
8:    for each s in S do
9:      MGC = Min(S)
10:     i = index(MGC)
11:     if i < r·N AND MGC < (a + α) then
12:       Remove(i)
13:       Remove(s)
14:     end if
15:   end for
16:   return I_u
17: end procedure
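The following minimal sketch renders the control flow of Algorithm 1 in Java, under the assumption that the per-item similarities simWP(sd_i, M_u \ sd_i) have already been computed and are provided in chronological order; α is treated as a signed offset from the average, as in Algorithm 1 and in the experiments described later, and the method returns the indices of the removed items rather than modifying the profile.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the DCBM control flow (Algorithm 1).
// "similarities" holds simWP(sd_i, M_u \ sd_i) for the profile items,
// in chronological order; the result lists the indices of the removed items.
public final class DcbmSketch {

    static List<Integer> removedItems(List<Double> similarities, double alpha, double r) {
        List<Double> s = new ArrayList<>(similarities);
        List<Integer> kept = new ArrayList<>();   // original indices still in the profile
        for (int i = 0; i < s.size(); i++) kept.add(i);

        double average = s.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        int removalArea = (int) Math.floor(r * similarities.size()); // first part of the history
        List<Integer> removed = new ArrayList<>();

        while (!s.isEmpty()) {
            // MGC: the minimum global coherence among the remaining items
            int minPos = 0;
            for (int k = 1; k < s.size(); k++)
                if (s.get(k) < s.get(minPos)) minPos = k;
            double mgc = s.get(minPos);
            int originalIndex = kept.get(minPos);

            // Remove only items in the removal area and far enough from the average
            if (originalIndex < removalArea && mgc < average + alpha) {
                removed.add(originalIndex);
                s.remove(minPos);
                kept.remove(minPos);
            } else {
                break; // the most incoherent remaining item is not removable
            }
        }
        return removed;
    }
}
```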
Item Recommendation
After the user profile has been processed with Algorithm 1, this step computes the semantic similarity between the profile and all the items not yet evaluated, and recommends to a user a subset of those with the highest similarity. As previously said, the amount of items to recommend is related to the operative context. In this study we chose to recommend a set of items equal to those in the test set, imagining a scenario in which the user requests a fixed set of items to consume (e.g., "Recommend me three movies I can watch on a Sunday afternoon").
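A minimal sketch of this ranking step is reported below, assuming that the similarity of each unseen item with the cleaned profile has already been computed:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal sketch of the recommendation step: rank the items not yet evaluated
// by their similarity with the (cleaned) user profile and keep the top n.
public final class TopNRecommender {

    static <Item> List<Item> recommend(Map<Item, Double> similarityToProfile, int n) {
        return similarityToProfile.entrySet().stream()
                .sorted(Map.Entry.<Item, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(n)                    // keep the n most similar items
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```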
5.3.6 Experiments
The experimental environment for this work is based on the Java language, with the support of the Java API implementation for WordNet Searching (JAWS) previously mentioned.
In order to perform the evaluation, two different datasets have been used, one with real data and one with synthetic data. With the real data we estimate the F1 measure increment (or decrement) of the proposed novel DCBM approach, compared with a state-of-the-art recommender system based on semantic similarity [39]. We also used a set of synthetic data, in order to evaluate the proposed approach with different distributions of the incoherent items in the profile.
As highlighted throughout the thesis, the system presented in Section 5.3.5 performs the same steps as the reference one, with the introduction of the DCBM algorithm to remove the incoherent items from the user profile. Since all the steps in common between the two recommender systems are performed with the same algorithms, the comparison of the F1 measure obtained by the two algorithms highlights the capability of DCBM to improve the quality of the user profile and the accuracy of a recommender system.
Regarding the first condition to meet (see Section 5.3.5) in order to remove the items from a user profile, in the experiments we divided the user iteration history into 10 equal parts, considering valid for the item removal only the first 9 parts (i.e., parameter r = 0.9)³.
We have not compared the proposed approach with the classic magic barrier formulations based on rating coherence, because the DCBM algorithm is performed in a content-based recommender system. Since content-based approaches do not employ the ratings during the filtering, it would not be useful to consider a form of coherence based on them.
5.3.7 Real Data
The employed dataset is the Yahoo! Webscope Movie dataset, described in Appendix B. Given the high sparsity that characterizes this dataset, a sample was extracted by removing all the users who evaluated less than 17 items and all the items that have been evaluated by less than 13 users⁴. The final sample consists of 5070 users, 1647 items, and 153461 ratings. Since the algorithm considers only the items with a rating above the average, we selected only the movies with a rating ≥ 3, and randomly extracted 33% of them as a test set. In order to evaluate the performance of the proposed approach with this dataset, we use the performance measures precision and recall, which we combine to calculate the F1 measure (described in Appendix B).
³ The choice to divide the history into 10 parts was made based on the frequency of the ratings given by the users. This analysis is not presented, in order to facilitate the reading of the thesis.
⁴ These values have been chosen in order to have a dataset in which useful information about each user and item was available to make the predictions.
Figure 5.5: Per cent improvement of the F1 measure (x axis: distance from GS; y axis: F1 % increment).
Strategy
For the experiments, it is necessary to set the value of α in Algorithm 1, which controls when an item is too distant from the average value GS. We have tested several values positioned around the average value of the Global Similarity GS (see Equation 5.8). The experimented excursion is half of the GS value on each side (e.g., if GS = 0.4, the excursion of the values is from -0.2 to +0.2, so between 0.2 and 0.6). The interval of values is divided into 10 equal parts, labeled from -5 to 5.
Results
Figure 5.5 shows the per cent increment of the F1 measure of the proposed solution compared with the state-of-the-art recommender system.
From the results shown in the graph of Figure 5.5, we can observe how the average value of coherence (i.e., GS, represented by the 0 on the x axis) represents the borderline between improvement and worsening in terms of quality of the recommendations. We obtain the maximum improvement in correspondence with the -1 value on the x axis, which represents the minimum distance from the mean value of coherence GS. This improvement is progressively reduced as we approach the value of GS, becoming zero almost immediately after it, because in that case we are removing from the user profile items that are coherent with her/his global choices, which are essential to perform reliable recommendations.
To sum up, the graph in Figure 5.5 shows that the F1 measure improvement increases until it becomes stable above certain values and presents no gain below others; this happens because we obtain an improvement only when the exclusion process involves items with a high level of semantic incoherence with respect to the others.
5.3.8 Synthetic Data
The set of synthetic data adopted is designed to simulate the real activity of a user on an online site that sells movies, considering four different types of scenarios:
1. in the first case we simulate a user profile (composed of 10 items) with 2 incoherent items not related to a possible change of tastes (because they are positioned in the oldest part of her/his chronology);
2. the second scenario also presents a profile composed of 10 items with 2 of them incoherent, but one of these two is positioned in the last part of the history, representing a potential change in the user's tastes;
3. in the third case we reproduce a scenario where, in the next to last user iteration, 2 incoherent items were in the last part of the user chronology (so they should not be removed), and in the current iteration the user chooses 2 further incoherent items. The aim of this experiment is to reproduce a scenario where the incoherent items are numerically consistent (4 out of 12 items), and for this reason we have to consider them not as incoherent items but as a clear change in the user's tastes;
4. in the last scenario we test the performance of the proposed approach on a big user profile composed of 50 items (40 coherent items and 10 randomly placed incoherent items). The aim is to check how many of these will be properly identified and removed.
In order to avoid introducing a trivial criterion to discriminate the incoherent items, we suppose that all the items are evaluated by the users with the maximum rating. Regarding the first and second requirements that we need to meet (see Section 5.3.5) in order to remove the items from a user profile, in the experiments we take into consideration several subdivisions of the user iteration history, considering valid for the item removal only the first N-1 parts. We perform the experiments taking into account different distances (the tolerance range α) from the mean value of the global similarity.
Experimental Setup
The distance between the user iterations is an important aspect that we have to take into consideration to define the regions used to subdivide her/his profile. This happens because we consider as potentially incoherent only the items stored in the first N-1 regions, considering the items stored in the last region as a change of the user's tastes.
In order to evaluate the proposed approach, it is necessary to set the value of α in Algorithm 1, which controls when an item is too distant from the average value GS. We have tested several values positioned around the average value of the Global Similarity GS (see Equation 5.8). The experimented excursion is 5 percent of the GS value (e.g., if GS = 0.5, the excursion of the values is from -0.025 to +0.025, centered in GS, so between 0.475 and 0.525). The interval of values is divided into 10 equal parts, labeled from -5 to 5.
Experimental Results
Here we present the results of the performed experiments, where we tested four different scenarios (cases 1, 2, 3, and 4). In the first three cases (1, 2, and 3), the y axis of the graph represents the user profile, and its values are the items (squares inside the graph, in black those removed), progressively numbered (the lowest number denotes the oldest item evaluated by the user). In the last case (4), the values on the y axis of the graph are the numbers of items removed from the user profile. In all cases, the values on the x axis represent the experimented values around the mean value of global coherence, in agreement with the criteria previously exposed.
Case 1. In the first experiment we take into consideration a user profile composed of 10 items, and suppose that they have been evaluated by the user in a temporal frame of one year. In this case it is reasonable to subdivide the items into 5 regions, according to the frequency of the iterations.
We have introduced 2 incoherent items at the second and fourth positions of the user evaluation chronology. As we can observe in Figure 5.6, the items considered as incoherent (2nd and 4th in the chronology) are correctly detected and removed by the DCBM approach when the value on the x axis reaches the average value of global coherence (corresponding to the zero value on the x axis); when we stay away from this value, we either get many false positives (from 1 to 5), i.e., items are incorrectly removed, or the obtained result does not change (from -5 to -1).
It should be noted that in this case both items are located outside the No-remove Region (i.e., the last region in chronological order).
Figure 5.6: Removed items in Case 1 (x axis: distance from GS; y axis: items in the profile; the plot distinguishes removed items, not-removed items, and the No-remove Region).
Case 2. In the second experiment we process the same data of the previous example, but in this configuration we locate one of the 2 incoherent items inside the No-remove Region (in the ninth position of the chronology), and the second item just before it (in the seventh position of the chronology).
As shown in Figure 5.7, only one of the two items considered as incoherent was removed by the DCBM approach (item 7), because the second one (item 9) is evaluated as a change in the user's tastes, and for this reason it can be removed only when it is outside the No-remove Region, as long as its value of coherence remains far from the mean value of global coherence.
Also in this experiment, the correct item removal takes place only when the value on the x axis reaches the average value of global coherence.
Figure 5.7: Removed items in Case 2 (x axis: distance from GS; y axis: items in the profile; the plot distinguishes removed items, not-removed items, and the No-remove Region).
Case 3. As introduced before, in this third case we evaluate a scenario where, in the next to last user iteration, 2 incoherent items were in the last part of the user chronology (No-remove Region), and therefore were not removed, and in the current user iteration the user chooses 2 further incoherent items. To summarize, we have a total of 12 items stored in the user profile, which is divided into 4 regions. At the end of the last user iteration we have a profile with 4 incoherent items stored in the 9th, 10th, 11th and 12th positions of the chronology.
As we can observe in Figure 5.8, in this particular configuration of the profile, none of the items recently evaluated by the user has been removed by the DCBM algorithm, even though they were distant from the value of global coherence previously estimated. This is because their numerical relevance has changed this value. The obtained result is that the only item removed (starting from the value zero on the x axis) is one of the items previously close to the value of global coherence.
What we observed is that the proposed approach is able to align the user profiles with the change in the user's tastes, when these are not related to scattered events, but rather represent a real change in the user's preferences.
Figure 5.8: Removed items in Case 3 (x axis: distance from GS; y axis: items in the profile; the plot distinguishes removed items, not-removed items, and the No-remove Region).
Case 4. In this last case we want to test the performance of the DCBM approach on a big user profile composed of 50 items. Through it, we want to simulate the activity of an assiduous customer that evaluates many items. In this configuration it is reasonable to subdivide her/his profile into 10 regions, each containing 5 items. The test consists of introducing 10 incoherent items in random positions and checking how many of these are properly identified and removed by the proposed algorithm.
In short, we have a profile composed of 40 coherent items and 10 incoherent items placed randomly. The results of the experiment are shown in Figure 5.9, where TP denotes the True Positives (i.e., the items correctly removed) and FP the False Positives (i.e., the items incorrectly removed). Considering that 2 of the 10 randomly placed items were positioned within the No-remove Region (last 5 positions of the profile), we have to consider as the best possible result a number of 8 items removed (this upper limit is denoted by a dashed line in the graph of Figure 5.9).
Every experiment showed that the best value to use as a threshold for the removal of an incoherent item is placed around the mean value of global coherence, because if we move away from it we get many false positives or no improvement.
Figure 5.9: Number of removed items in Case 4 (x axis: distance from GS; y axis: number of items; TP denotes the true positives and FP the false positives).
5.3.9 Conclusions and Future Work
This part of my work proposes a novel approach to improve the quality of the user profiling, by taking into account the items related to a user, with the aim of removing those that do not reflect her/his real tastes. This is useful in many contexts, such as when the system does not allow the users to express their preferences, or when the user decides not to make use of this option. If on the one hand the proposed approach leads toward more accurate recommendations, on the other hand it reduces the number of items in the user profiles, and thus the computational complexity. This last aspect represents a very important result, if we relate it to time-consuming approaches of recommendation, such as the semantic ones.
A further possible expansion might involve the use of large amounts of data related to contexts different from each other, as, for example, in the scenario of sales platforms that give access to very heterogeneous goods, in which we could operate in order to discover and process the semantic interconnections between different classes of items, and define methods to evaluate their semantic coherence during the user profiling activity.
Chapter 6
Decision Making Process
6.1 Introduction
In order to lead the potential buyers toward a number of well-targeted sugges-
tions, related to the large amount of goods or services, a recommender system
plays a determinant role, since it is able to investigate on the user preferences,
suggesting to users the items that could be interesting. In order to identify these
items, it has to predict that an item is worth recommending. Most of the strategies
used to generate the recommendations are based on the so-called Collaborative
Filtering (CF) approach, which is based on the assumption that users have similar
preferences on a item if they have already rated other items in a similar way. As
discussed in Chapter 2.3, the rating prediction has been highlighted in the litera-
ture as the core recommendation task, and recent studies showed its eectiveness
84 Chapter 6. Decision Making Process
also in improving classification tasks.
In recent years, latent factor models have been adopted in CF approaches with the aim of uncovering latent characteristics that explain the observed ratings. Among these approaches, the state of the art is represented by SVD++, which exploits the so-called latent factor model and presents good performance in terms of accuracy and scalability. Although this approach provides excellent performance, it does not take into account the popularity of the items that are recommended, which risks penalizing its performance under certain circumstances. This can happen when the same score is given to multiple items: since the system is not able to discriminate them on the basis of their popularity, there is the risk of recommending the unpopular ones, which are less likely to be preferred by the users.
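The issue can be shown with a toy example (the scores and rating counts below are hypothetical): when the predictor assigns the same score to several items, adding popularity as a secondary ranking key keeps the less popular candidates from being surfaced first.

def rank_with_popularity(items, predicted_score, popularity):
    # Primary key: predicted score; secondary key: popularity, used only
    # to break the ties among items that received the same score.
    return sorted(items,
                  key=lambda i: (predicted_score[i], popularity[i]),
                  reverse=True)

scores = {"a": 4.5, "b": 4.5, "c": 3.0}   # "a" and "b" receive the same score
counts = {"a": 12, "b": 980, "c": 40}     # hypothetical numbers of ratings
print(rank_with_popularity(["a", "b", "c"], scores, counts))  # ['b', 'a', 'c']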
The popularity of the items is an aspect that has been widely studied in the recommender systems literature. While the ability of popularity-based criteria to identify items of potential interest to the users has been recognized, some limitations have been highlighted. The most important of these is that the recommendations made according to popularity criteria are trivial, and do not bring considerable benefits either to the users or to those that offer them goods or services. This happens when the so-called non-personalized model is used: a naive recommendation approach that does not take into account the user preferences, because it always recommends a fixed list with the most popular items, regardless of the target user. On the other hand, recommending less popular items adds novelty (and also serendipity) for the users, but it is usually a more difficult task to perform.
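For reference, a non-personalized model of this kind can be reduced to a few lines (the interaction-log format is an assumption for illustration): the same top-N list is returned for every user, which is exactly why its recommendations tend to be trivial.

from collections import Counter

def most_popular(interactions, n=10):
    # interactions: iterable of (user_id, item_id) pairs; the same fixed
    # top-n list is returned for every user, regardless of the target.
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(n)]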
Another possible limitation that might occur when producing recommenda-
tions considering only the ratings is the fact that these approaches ignore the semantic relations between the words in the item descriptions. Therefore, thanks to the advent of the so-called Semantic Web, other strategies, based on semantic criteria, have also spread. Their main advantage is the capability to interpret the users' preferences in a non-schematic way, helping to understand the concepts connected with a text; these concepts can be used to determine the similarity between items, instead of merely relying on the single terms in their textual descriptions.
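A common way to realize this idea, sketched below under the assumption that each item description has already been mapped to a weighted bag of concepts (e.g., WordNet synsets), is to compare the concept vectors rather than the raw terms.

import math

def concept_similarity(concepts_a, concepts_b):
    # Cosine similarity between two {concept_id: weight} dictionaries, so
    # that different words mapped to the same concept still contribute.
    dot = sum(w * concepts_b.get(c, 0.0) for c, w in concepts_a.items())
    norm_a = math.sqrt(sum(w * w for w in concepts_a.values()))
    norm_b = math.sqrt(sum(w * w for w in concepts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0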
By exploiting both the concepts of similarity and diversity, this part of the research introduces, initially at the architectural level and subsequently at the application level, some novel approaches able to improve the performance of a recommender system.
6.2 Recommender Systems Performance
6.2.1 Overview
This part of my research is focused on the role that the popularity of the items plays in the recommendation process. If on the one hand considering only the most popular items generates trivial recommendations, on the other hand not taking into consideration the item popularity could lead to a non-optimal performance of the system, since it does not differentiate the items, giving them the same weight during the recommendation process. Therefore, there is the risk of excluding from the recommendations some popular items that would have a high probability of
being preferred by the users, suggesting instead others that, despite meeting the selection criteria, have less chance of being preferred.
The proposed strategy aims to introduce into the recommendation process new criteria based on the items' popularity, through two novel metrics. The first metric evaluates the semantic relevance of an item with respect to the user profile, while the second metric measures how much the item is preferred by the users. Through a post-processing approach, these metrics are employed to extend one of the best-performing state-of-the-art recommendation techniques: SVD++. The effectiveness of this hybrid recommendation strategy has been verified through a series of experiments, which show strong improvements in terms of accuracy w.r.t. SVD++.
6.2.2 Proposed Solutions
This part of my work aims instead to improve the recommendations produced by the SVD++ approach, by also considering the semantics behind the items and the items' popularity. This is done by employing two different strategies.
The first strategy involves a balanced use of two indices of item popularity: one based on the positive feedback of the users, and one based on the conceptual similarity of the textual description of the item with the descriptions of the other items positively evaluated in the past.
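A minimal sketch of this balanced use is given below; the two index functions and the mixing weight alpha are placeholders for illustration, not the exact formulation adopted in the following sections.

def combined_popularity(item, feedback_index, semantic_index, alpha=0.5):
    # Convex combination of the feedback-based popularity index and the
    # semantic-similarity index, both assumed to be normalized in [0, 1];
    # alpha controls the balance between the two components.
    return alpha * feedback_index[item] + (1 - alpha) * semantic_index[item]

The resulting value can then be used in the post-processing step described in the previous section, for example to refine the ordering of the list produced by SVD++.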
The second strategy consists in the application of these two metrics within
the boundaries of a recommendation list, generated through a state-of-the-art ap-