
# A robust reputation system using online reviews


## Abstract and Figures

Evaluating sellers in an online marketplace is an important yet nontrivial task. Many online platforms such as eBay and Amazon rely on buyer reviews to estimate the reliability of sellers on their platform. Such reviews are, however, often biased by: (1) intentional attacks from malicious users and (2) conflation between a buyer's perception of seller performance and item satisfaction. Here, we present a novel approach to mitigating these issues by decoupling measures of seller performance and item quality, while reducing the impact of malignant reviews. An extensive simulation study shows that our proposed method can recover seller reputations with high rank correlation even under assumptions of extreme noise.
Computer Science and Information Systems 00(0):0000–0000 https://doi.org/10.2298/CSIS123456789X
Hyun-Kyo Oh¹, Jongbin Jung², Sunju Park³, and Sang-Wook Kim⁴
¹ Samsung Electronics, hyunkyo.oh@samsung.com
² Stanford University, jongbin@stanford.edu
³ Yonsei University, boxenju@yonsei.ac.kr
⁴ Hanyang University, wook@agape.hanyang.ac.kr
Keywords: reputation, reviews, attacks.
1. Introduction
One of the major challenges for online marketplaces such as eBay.com is accurately measuring the reliability of sellers on their platform [3, 10, 14, 17, 25]. The most common implementation of this task is a reputation system in which buyers evaluate their interactions with sellers on some common scale (e.g., a 5-star rating) [15, 26–29]. These ratings are then aggregated and presented to future buyers as a proxy for a seller's quality. While such measures have substantial influence on buyer behavior and overall marketplace dynamics, how reliable and robust they are to bias is still an open question. The impact of this problem only grows as the presence and significance of online platforms in society increase. With the addition of more complex multi-agent systems and multi-faceted online marketplaces that often span global economies, such as Uber or Airbnb, the question of whether a reputation rating system can be robust to corruption and bias, and what the framework for measuring such robustness could be, is becoming more important than ever [2]. Our work aims to address this question by first introducing a novel method for reliably measuring reputation from potentially corrupted ratings, whether maliciously intended or not, and subsequently proposing a general simulation model for quantifying the robustness of various reputation measurement strategies.
*Corresponding author: Sang-Wook Kim (wook@agape.hanyang.ac.kr)
Reputation systems aim to leverage the wisdom of crowds [11, 18], assuming that all participants understand and agree upon the common goal of transparently measuring the quality of a seller. This assumption, however, is flawed, as pointed out by recent studies that identify adversarial behavior in buyer ratings [28]. Various types of cheating behavior have been identified, both in theory and in practice, along with recommendations for how to account for such behavior in aggregating reviews [5, 7, 10, 19, 20, 23, 24].
In addition to the threat of malicious reviews, another, perhaps more subtle, issue for reputation systems is that the common goal of a review may not be immediately obvious to reviewers (e.g., buyers). By evaluating an interaction, the reviewer is necessarily conflating multiple aspects of a transaction, only one of which is the seller's performance. For example, a buyer could be extremely satisfied with an item, but find the seller's competence in communication and execution problematic. In such a case, the buyer's scoring of the transaction, whether high or low, will be at best a biased measure of seller performance and item quality. We address this issue by first modeling a buyer's review as a combination of two factors: seller performance and item quality. Taking advantage of the fact that multiple sellers offer similar items and that one seller often offers multiple items, we propose an iterative method, which we call RATING SEPARATION, for teasing out each of the two factors that confound buyer reviews.
In order to create a comprehensive and robust reputation system, we further propose INTEGRITY WEIGHTING, a novel approach to mitigate the adverse effects of dishonest reviews. As the name suggests, the idea is to estimate the level of trustworthiness of each review, based on theories of buyers' cheating behavior. Each review is subsequently re-weighted according to the estimated trustworthiness. We show that, while existing approaches mainly focus on a subset of possible attacks, INTEGRITY WEIGHTING is robust against a large pool of attack types and patterns that have been identified in the literature.
Finally, we develop a simulation framework for evaluating the efficiency of a reputation system in online marketplaces. Compared to existing marketplace simulations [4, 12, 16], our model allows for the conflation of multiple factors in a buyer's review. Using this framework, we evaluate the proposed reputation system and find it to be more robust and reliable in measuring seller reputation compared to existing methods.
In summary, the contributions of this paper are threefold. First, we propose RATING SEPARATION, a method for disentangling, from a single score given by a buyer, the ratings for a seller and the item sold. Second, we present INTEGRITY WEIGHTING, a scheme for mitigating the risk of malignant agents on the platform. Third, we present a novel and comprehensive simulation approach for evaluating various policies and systems on an online marketplace platform. Using this framework, we are able to evaluate the practical efficacy of complex reputation systems such as the ones we propose here. Additionally, this simulation framework allows us to better investigate existing methods and identify their strengths and potential shortcomings.
2. Related work
Many online platforms have their own reputation systems that evaluate the reliability of products or sellers by aggregating buyer reviews. However, these reputation systems are intrinsically vulnerable to malicious users who intentionally give unjustified reviews to products or sellers. Numerous studies have been conducted to improve the robustness of reputation systems by mitigating the influence of such malicious users.
Online platforms can be largely categorized as either single-agent or multi-agent systems. In a single-agent system, a single provider curates a collection of products (such as movies on IMDb.com) and users are tasked with rating their experience with each product. Since there is a single provider, buyer ratings have a clear mapping to each product, and buyers are often limited to rating each product at most once. Existing studies of such single-agent systems give attention to eliminating anomalous (potentially malicious) ratings based on statistical analysis of the distribution of buyer ratings, or to ranking and grouping users based on their rating patterns in order to derive the weighted mean of ratings given by the users [6, 8, 21, 22].
In a multi-agent system, numerous sellers can provide multiple goods and services—
as on Amazon.com. Unlike single-agent systems, in a multi-agent system, buyers can
evaluate both the seller and their product. And as a consequence of repeated interactions,
it is possible for buyers to provide more than one rating to the same seller, over a variety
of products.
Both single-agent and multi-agent systems share the same goal of diminishing the risk of malicious ratings. However, because the graph structure between buyers and sellers is more complex in the multi-agent case, state-of-the-art strategies that work well for single-agent systems are often insufficient for multi-agent systems. One of the often discussed challenges for multi-agent systems is that of identifying malicious buyers (attackers) who, in coordination with associated sellers, aim to artificially manipulate the reputation of target sellers. This can take the form of either increasing the reputation of partnered sellers, or decreasing the reputation of competing sellers. As online platforms become more commonplace and complex, deceptive rating strategies have also evolved, making it harder to identify attackers. In response, most existing studies focus not only on filtering out statistically insignificant ratings but also on finding suspicious buyer-seller relationships by detecting malicious behavioral patterns in their rating systems [1, 5, 6, 9, 13, 15, 20, 24, 26–30].
A less discussed issue for rating systems in multi-agent systems, however, is that of confounded ratings. As discussed in the previous section, benign users can still corrupt a reputation system by conflating their evaluation of a seller and a product. While previous studies deal with the issue of malicious users, we have yet to find studies that address the subtle, yet important, issue of confounded buyer ratings. In addition to addressing the more traditional concerns of malicious ratings, our approach aims to achieve robustness against the threat of such ambiguous ratings as well.
3. Proposed methods
Overall, we propose RATING SEPARATION AND INTEGRITY WEIGHTING (RS&IW), a system for retrieving reputations that are robust to confounded ratings and adversarial behavior. The first part of the system, RATING SEPARATION, decomposes the possibly confounded rating into individual components of seller and item ratings. The second part of the system, INTEGRITY WEIGHTING, extends the work of Oh et al. [23] to quantify the trustworthiness of each rating. In the following sections, we present the details of each method.
3.1. Decoupling ratings
A common goal for online marketplace providers is to identify the reputation, trustworthiness, and quality of active sellers and items that are traded on their platform. Formally, given a set of sellers $S = \{s_1, s_2, \ldots\}$ and items $M = \{m_1, m_2, \ldots\}$, a platform provider such as eBay.com would like to recover some function $\rho_S : S \to \mathbb{R}$ that ranks sellers and $\rho_M : M \to \mathbb{R}$ that ranks items. Since seller and item rankings are not readily measurable, platform providers will often estimate rankings by asking buyers to rate their interactions via some common scale (e.g., 5-star ratings). In other words, given a set of buyers $B = \{b_1, b_2, \ldots\}$, platform providers observe a set of scores $Y = \{y^{(b)}_{s,m}\}$, where $y^{(b)}_{s,m}$ is recorded when a buyer $b$ rates their interaction with seller $s$ to purchase item $m$. One important, yet subtle, issue in this setting is that the observed scores $y^{(b)}_{s,m}$ do not directly measure values of either the seller ratings ($\rho_S(s)$) or item quality ($\rho_M(m)$), but some function of the two.
To motivate our approach, we first consider $y_{s,m}$, the unobserved true score corresponding to the evaluation of seller $s$ with regard to item $m$. Next, we model this true score as a function of two terms, seller performance ($\rho_S(s)$) and item quality ($\rho_M(m)$):
$$y_{s,m} = f\big(\rho_S(s), \rho_M(m)\big). \tag{1}$$
Note that any single observation $y^{(b)}_{s,m}$ is only a noisy approximation of $y_{s,m}$, since different buyers will combine the two components in a different way.⁵ The first part of our proposed method involves an iterative clustering of the observed $y^{(b)}_{s,m}$ to decouple and estimate each component $\rho_S(s)$ and $\rho_M(m)$.
Initial clustering and estimation of $\rho_S$. First, define the set $B_{s,m} \subseteq B$ as the set of buyers who have rated an interaction of purchasing item $m$ from seller $s$ (i.e., $B_{s,m} = \{b_k \in B \mid \exists\, y^{(b_k)}_{s,m} \in Y\}$). For each seller-item pair $(s, m) \in S \times M$, we initially estimate $y_{s,m}$ via the sample mean
$$\bar{y}_{s,m} = \frac{1}{|B_{s,m}|} \sum_{b \in B_{s,m}} y^{(b)}_{s,m}. \tag{2}$$
Next, we utilize the fact that multiple sellers can, and often do, offer the same items, to cluster sellers and estimate $\rho_S(s)$. For a specific seller $s_i$, let $M_{s_i} = \{m_j \in M \mid \exists\, y^{(b)}_{s_i,m_j} \in Y\}$ be the set of all items for which the seller $s_i$ has been rated. Similarly, for some specific item $m_j$, let $S_{m_j} = \{s_i \in S \mid \exists\, y^{(b)}_{s_i,m_j} \in Y\}$ be the set of all sellers who have been rated with item $m_j$. We initially construct $K = |M|$ clusters of sellers, $\mathcal{C}_S$, by collecting sellers who have ratings for the same item $m$. Formally, we write
$$\mathcal{C}_S(k) = \{s_i \mid s_i \in S_{m_k}\}.$$
⁵ We also note that each buyer rating $y^{(b)}_{s,m}$ could contain an additional bias component that depends on the specific buyer $b$. For example, some buyers may intentionally give higher or lower ratings, independent of seller or item quality, to satisfy their idiosyncratic goals. We specifically address this issue in the second part of our method, INTEGRITY WEIGHTING, which is presented in Section 3.2.
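[Figure 1. Panels: the items offered by each seller ($s_1$: $m_1, m_2, m_3$; $s_2$: $m_1, m_2, m_5$; $s_3$: $m_3, m_4, m_5$; $s_4$: $m_1, m_3, m_4$; $s_5$: $m_2, m_3, m_4$), and the resulting seller clusters $\mathcal{C}_S(1), \ldots, \mathcal{C}_S(5)$, one per item.]
Fig. 1: An example of initial seller clustering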
Further, let $\mathcal{L}_{s_i}$ be the set of indices $k$ such that $\mathcal{C}_S(k)$ contains seller $s_i$ as an element. In other words,
$$\mathcal{L}_{s_i} = \{k \mid s_i \in \mathcal{C}_S(k)\}. \tag{3}$$
An example of this initial clustering is presented in Fig. 1. Fig. 1 represents a platform of five sellers, $\{s_1, s_2, \ldots, s_5\}$, and five items, $\{m_1, m_2, \ldots, m_5\}$. As a result, sellers are initially organized into five clusters based on the items they offer: $\mathcal{C}_S(1) = \{s_1, s_2, s_4\}$, $\mathcal{C}_S(2) = \{s_1, s_2, s_5\}$, $\mathcal{C}_S(3) = \{s_1, s_3, s_4, s_5\}$, $\mathcal{C}_S(4) = \{s_3, s_4, s_5\}$, and $\mathcal{C}_S(5) = \{s_2, s_3\}$. Correspondingly, while not illustrated in Fig. 1, we can write out the sets of clusters that include each seller as $\mathcal{L}_{s_1} = \{1, 2, 3\}$, $\mathcal{L}_{s_2} = \{1, 2, 5\}$, $\mathcal{L}_{s_3} = \{3, 4, 5\}$, $\mathcal{L}_{s_4} = \{1, 3, 4\}$, and $\mathcal{L}_{s_5} = \{2, 3, 4\}$.
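The initial clustering in this example can be reproduced with a short Python sketch. The function name and the dictionary layout are illustrative assumptions, and clusters are keyed by item identifier rather than by the index $k$ used in the text.

```python
from collections import defaultdict

# Offerings from the Fig. 1 example: seller -> items the seller has been rated for (M_s).
seller_items = {
    "s1": {"m1", "m2", "m3"},
    "s2": {"m1", "m2", "m5"},
    "s3": {"m3", "m4", "m5"},
    "s4": {"m1", "m3", "m4"},
    "s5": {"m2", "m3", "m4"},
}


def initial_seller_clusters(seller_items):
    """Return (clusters, membership).

    clusters[m] is the set of sellers rated for item m (the cluster C_S(k) for m_k);
    membership[s] is the set of cluster keys containing seller s (L_s).
    """
    clusters = defaultdict(set)
    for seller, items in seller_items.items():
        for item in items:
            clusters[item].add(seller)
    # With one cluster per item, a seller belongs to exactly the clusters of its items.
    membership = {seller: set(items) for seller, items in seller_items.items()}
    return dict(clusters), membership


clusters, membership = initial_seller_clusters(seller_items)
```

Here `clusters["m3"]` corresponds to $\mathcal{C}_S(3)$ and `membership["s4"]` to $\mathcal{L}_{s_4}$.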
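Next, we estimate the ranking of each seller $\rho_S(s)$ by averaging relative ratings within each cluster. Define $e_k : \mathcal{C}_S(k) \to \mathbb{R}$ as a scoring function for some seller $s_i \in \mathcal{C}_S(k)$ with respect to item $m_k$, relative to all other sellers in $\mathcal{C}_S(k)$. Specifically, we define
$$e_k(s_i) = \bar{y}_{s_i,m_k} - \frac{1}{|\mathcal{C}_S(k)| - 1} \sum_{s_j \in \mathcal{C}_S(k) \setminus s_i} \bar{y}_{s_j,m_k}. \tag{4}$$
Then, for each seller, we subsequently estimate $\rho_S$ by computing
$$\hat{\rho}_S(s) = \frac{1}{|\mathcal{L}_s|} \sum_{k \in \mathcal{L}_s} e_k(s) \quad \forall s \in S. \tag{5}$$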
In other words, a seller's rating, decoupled from item quality, is estimated by taking the average of all the relative scores achieved across clusters. Note that the range of $\hat{\rho}_S$ will vary depending on the range of the original scale implemented in the platform for recording buyer feedback and ratings. However, for the purpose of quantifying rankings amongst a set of sellers, we can always normalize $\hat{\rho}_S$ to be within some desired range. For the following sections and the experiment described in Section 4, we use min-max normalization to restrict the range of $\hat{\rho}_S$ values to be in $[0, 1]$.

[Figure 2. Panels: a $[0, 1]$ seller reputation scale marking $\hat{\rho}^{(0)}_S(s)$ for each of the five sellers; the items offered by each seller, as in Fig. 1; and, for $\nu = 0.1$, the resulting item clusters $\mathcal{C}^{(0)}_M(1), \ldots, \mathcal{C}^{(0)}_M(4)$.]
Fig. 2: An example of item clustering based on seller reputation
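A minimal Python sketch of the initial seller estimate, following (2), (4), and (5) and then applying min-max normalization, is given here. The tuple layout `(buyer, seller, item, score)`, the function names, and the toy ratings are assumptions made for illustration. Skipping items rated for only one seller, where (4) is undefined, is likewise an assumption rather than something the text specifies.

```python
from collections import defaultdict


def sample_means(ratings):
    """Eq. (2): mean observed score for each (seller, item) pair."""
    totals = defaultdict(lambda: [0.0, 0])
    for _buyer, seller, item, score in ratings:
        totals[(seller, item)][0] += score
        totals[(seller, item)][1] += 1
    return {pair: total / n for pair, (total, n) in totals.items()}


def min_max(values):
    """Min-max normalize a dict of scores to [0, 1] (all 0.0 if constant)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {key: 0.0 for key in values}
    return {key: (v - lo) / (hi - lo) for key, v in values.items()}


def initial_seller_scores(ratings):
    """Eqs. (4)-(5): average relative score of each seller across item clusters."""
    by_item = defaultdict(dict)  # item -> {seller: mean score}
    for (seller, item), mean in sample_means(ratings).items():
        by_item[item][seller] = mean
    relative = defaultdict(list)  # seller -> list of e_k(s)
    for scores in by_item.values():
        if len(scores) < 2:
            continue  # assumption: Eq. (4) needs at least one other seller
        total = sum(scores.values())
        for seller, mean in scores.items():
            others = (total - mean) / (len(scores) - 1)
            relative[seller].append(mean - others)
    rho = {seller: sum(e) / len(e) for seller, e in relative.items()}
    return min_max(rho)


# Hypothetical ratings: (buyer, seller, item, score).
toy_ratings = [
    ("b1", "A", "x", 5), ("b2", "A", "x", 3), ("b3", "B", "x", 2),
    ("b1", "A", "y", 4), ("b2", "B", "y", 3),
]
scores = initial_seller_scores(toy_ratings)
```

In the toy data, seller A scores above seller B on both items, so A ends up at the top of the normalized scale and B at the bottom.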
Iterative clustering and estimation of $\rho_M$. Once we have the initial estimates $\hat{\rho}_S$, we can use these values to further cluster items in a similar manner, and subsequently estimate $\rho_M$. To formalize this iterative approach, we first denote an estimate of $\rho_S$ and $\rho_M$ at the $t$th iteration as $\hat{\rho}^{(t)}_S$ and $\hat{\rho}^{(t)}_M$, respectively. Hence, our initial estimate from (5) is denoted $\hat{\rho}^{(0)}_S$, and similarly we let $e^{(0)}_k$ be our initial values of $e_k$ computed in (4). At the $t$th iteration, we create $K \in \mathbb{N}$ clusters of items by first grouping sellers such that seller $s_i$ and seller $s_j$ are in the same group if $|\hat{\rho}^{(t)}_S(s_i) - \hat{\rho}^{(t)}_S(s_j)| < \nu$, where $\nu$ is a parameter for determining the granularity and size of clusters and $K$ is determined as a consequence of the distribution of $\hat{\rho}^{(t)}_S$. We define $\mathcal{P}_k$, the set of sellers in group $k$, such that $|\hat{\rho}^{(t)}_S(s_i) - \hat{\rho}^{(t)}_S(s_j)| < \nu$ for any $s_i \in \mathcal{P}_k$ and $s_j \in \mathcal{P}_k$. Then, items in the $k$th cluster are defined as the items that have been rated for the sellers who are in group $\mathcal{P}_k$. Formally, we write
$$\mathcal{C}^{(t)}_M(k) = \{m_j \mid m_j \in M_{s_i} \text{ for some } s_i \in \mathcal{P}_k\}.$$
Similar to (3), let $\mathcal{L}^{(t)}_m$ be the set of indices $k$ such that $\mathcal{C}^{(t)}_M(k)$ contains item $m$ as a member. In other words,
$$\mathcal{L}^{(t)}_{m_j} = \{k \mid m_j \in \mathcal{C}^{(t)}_M(k)\}.$$
Continuing our illustrative example from the previous section, Fig. 2 presents a numerical example of such item clustering, where $\nu = 0.1$. Based on the numerical values of $\hat{\rho}^{(0)}_S$, presented on the upper-right scale, sellers $s_1$ and $s_3$ are grouped together, while the other sellers form singletons, resulting in $K = 4$ clusters. Without loss of generality, we can arbitrarily assign numbers $1 \le k \le 4$ to each cluster, defining sets $\mathcal{P}_1 = \{s_1, s_3\}$, $\mathcal{P}_2 = \{s_2\}$, $\mathcal{P}_3 = \{s_4\}$, $\mathcal{P}_4 = \{s_5\}$; and clusters $\mathcal{C}^{(0)}_M(1) = \{m_1, m_2, m_3, m_4, m_5\}$, $\mathcal{C}^{(0)}_M(2) = \{m_1, m_2, m_5\}$, $\mathcal{C}^{(0)}_M(3) = \{m_1, m_3, m_4\}$, and $\mathcal{C}^{(0)}_M(4) = \{m_2, m_3, m_4\}$. Correspondingly, the clusters that include each item are stored as $\mathcal{L}^{(0)}_{m_1} = \{1, 2, 3\}$, $\mathcal{L}^{(0)}_{m_2} = \{1, 2, 4\}$, $\mathcal{L}^{(0)}_{m_3} = \{1, 3, 4\}$, $\mathcal{L}^{(0)}_{m_4} = \{1, 3, 4\}$, and $\mathcal{L}^{(0)}_{m_5} = \{1, 2\}$.
Similar to (2), we compute a within-cluster mean $\bar{y}^{(t)}_{m,k}$ for each item $m$ in cluster $k$ by taking the average rating over each buyer and seller within the cluster. In other words,
$$\bar{y}^{(t)}_{m,k} = \frac{\sum_{s \in \mathcal{P}_k} \sum_{b \in B_{s,m}} y^{(b)}_{s,m}}{\sum_{s \in \mathcal{P}_k} |B_{s,m}|}. \tag{6}$$
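Let $z^{(t)}_k : \mathcal{C}^{(t)}_M(k) \to \mathbb{R}$ be a scoring function for some item $m \in \mathcal{C}^{(t)}_M(k)$, relative to all other items in $\mathcal{C}^{(t)}_M(k)$. In particular, we define
$$z^{(t)}_k(m_i) = \bar{y}^{(t)}_{m_i,k} - \frac{1}{|\mathcal{C}^{(t)}_M(k)| - 1} \sum_{m_j \in \mathcal{C}^{(t)}_M(k) \setminus m_i} \bar{y}^{(t)}_{m_j,k}.$$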
Then, for each item, we subsequently estimate $\rho_M$ at iteration $t$ by computing
$$\hat{\rho}^{(t)}_M(m) = \frac{1}{|\mathcal{L}^{(t)}_m|} \sum_{k \in \mathcal{L}^{(t)}_m} z^{(t)}_k(m) \quad \forall m \in M.$$
In other words, the quality of an item, decoupled from a seller's performance rating, is estimated by taking the average of all the relative scores achieved by that item across clusters. As in (5) for $\hat{\rho}_S$, the range of $\hat{\rho}^{(t)}_M$ will vary. For the following sections and the experiment described in Section 4, we use min-max normalization at each iteration $t$ to restrict the range of $\hat{\rho}^{(t)}_M$ values to be in $[0, 1]$.
Iterative clustering and estimation of $\rho_S$. Given values of $\hat{\rho}^{(t)}_M$, we can further improve our estimate of $\rho_S$ by taking additional iterations. An iteration of computing $\hat{\rho}^{(t)}_S$ is very similar to the initial estimation procedure we describe for (5), with the primary difference being in how clusters $\mathcal{C}^{(t)}_S(k)$ are defined for $t > 0$.
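Specifically, at the $t$th iteration for $t > 0$, we create $K \in \mathbb{N}$ clusters of sellers by first grouping items such that item $m_i$ and item $m_j$ are in the same group if $|\hat{\rho}^{(t-1)}_M(m_i) - \hat{\rho}^{(t-1)}_M(m_j)| < \nu$. The set of items in group $k$, $\mathcal{Q}_k$, is defined such that $|\hat{\rho}^{(t-1)}_M(m_i) - \hat{\rho}^{(t-1)}_M(m_j)| < \nu$ for any $m_i \in \mathcal{Q}_k$ and $m_j \in \mathcal{Q}_k$. The set of sellers in the $k$th cluster for $t > 0$ is then defined as
$$\mathcal{C}^{(t)}_S(k) = \{s_i \mid s_i \in S_{m_j} \text{ for some } m_j \in \mathcal{Q}_k\}, \quad t > 0.$$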
[Figure 3. Panels: a $[0, 1]$ scale marking $\hat{\rho}^{(0)}_M(m)$ for each of the five items; the items offered by each seller, as in Fig. 1; and, for $\nu = 0.1$, the resulting seller clusters $\mathcal{C}^{(1)}_S(1), \ldots, \mathcal{C}^{(1)}_S(4)$.]
Fig. 3: An example of seller clustering based on item reputation
The set $\mathcal{L}^{(t)}_s$ is defined analogously to $\mathcal{L}_s$ in (3).
To continue our example, Fig. 3 illustrates such a clustering of sellers with $\nu = 0.1$ at $t = 1$. Based on the numerical values of $\hat{\rho}^{(0)}_M$, presented on the upper-right scale, items $m_2$ and $m_3$ are grouped together, while the other items form singletons, resulting in $K = 4$ clusters. Again, we can assign numbers $1 \le k \le 4$ to each cluster, defining sets $\mathcal{Q}_1 = \{m_1\}$, $\mathcal{Q}_2 = \{m_2, m_3\}$, $\mathcal{Q}_3 = \{m_4\}$, $\mathcal{Q}_4 = \{m_5\}$; and clusters $\mathcal{C}^{(1)}_S(1) = \{s_1, s_2, s_4\}$, $\mathcal{C}^{(1)}_S(2) = \{s_1, s_2, s_3, s_4, s_5\}$, $\mathcal{C}^{(1)}_S(3) = \{s_3, s_4, s_5\}$, and $\mathcal{C}^{(1)}_S(4) = \{s_2, s_3\}$. Correspondingly, the clusters that include each seller are stored as $\mathcal{L}^{(1)}_{s_1} = \{1, 2\}$, $\mathcal{L}^{(1)}_{s_2} = \{1, 2, 4\}$, $\mathcal{L}^{(1)}_{s_3} = \{2, 3, 4\}$, $\mathcal{L}^{(1)}_{s_4} = \{1, 2, 3\}$, and $\mathcal{L}^{(1)}_{s_5} = \{2, 3\}$.
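The within-cluster mean $\bar{y}^{(t)}_{s,k}$ for each seller $s$ in cluster $k$ is computed by taking the average rating over each buyer and item within the cluster. In other words,
$$\bar{y}^{(t)}_{s,k} = \frac{\sum_{m \in \mathcal{Q}_k} \sum_{b \in B_{s,m}} y^{(b)}_{s,m}}{\sum_{m \in \mathcal{Q}_k} |B_{s,m}|}, \quad t > 0. \tag{7}$$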
Finally, $e^{(t)}_k$ and $\hat{\rho}^{(t)}_S$ for $t > 0$ are defined similarly to the initial case of $t = 0$, following (4) and (5), but replacing the initial estimates $\bar{y}_{s,m}$ with the within-cluster average $\bar{y}^{(t)}_{s,k}$.
Complete algorithm for RATING SEPARATION. As a stopping condition of the iterative algorithm, we define a tolerance parameter $\varepsilon$. After completing iteration $t > 0$, and given estimates $\hat{\rho}^{(t)}_S$ and $\hat{\rho}^{(t)}_M$, the algorithm advances to the next iteration $t + 1$ until $|\hat{\rho}^{(t)}_S - \hat{\rho}^{(t-1)}_S| < \varepsilon$ and $|\hat{\rho}^{(t)}_M - \hat{\rho}^{(t-1)}_M| < \varepsilon$. The overall procedure presented in this section is formally summarized in Algorithm 1.
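Algorithm 1: Rating Separation (RS)
Input: set of buyers $B$, set of sellers $S$, set of items $M$, set of ratings $Y$, clustering range $\nu$, convergence tolerance $\varepsilon$
Output: estimated quality scores for each seller and item $(\hat{\rho}_S, \hat{\rho}_M)$
$t \leftarrow 0$;
repeat
    // Seller rating separation
    if $t = 0$ then
        cluster sellers by item to build $\mathcal{C}^{(0)}_S$;
    else
        cluster sellers by item score $\hat{\rho}^{(t-1)}_M$ and $\nu$ to build $\mathcal{C}^{(t)}_S$;
    end
    foreach $\mathcal{C}^{(t)}_S(k) \in \mathcal{C}^{(t)}_S$ do
        foreach $s \in \mathcal{C}^{(t)}_S(k)$ do
            compute $e^{(t)}_k(s)$;
        end
    end
    foreach $s \in S$ do
        compute $\hat{\rho}^{(t)}_S(s)$;
    end
    // Item rating separation
    cluster items by seller score $\hat{\rho}^{(t)}_S$ and $\nu$ to build $\mathcal{C}^{(t)}_M$;
    foreach $\mathcal{C}^{(t)}_M(k) \in \mathcal{C}^{(t)}_M$ do
        foreach $m \in \mathcal{C}^{(t)}_M(k)$ do
            compute $z^{(t)}_k(m)$;
        end
    end
    foreach $m \in M$ do
        compute $\hat{\rho}^{(t)}_M(m)$;
    end
    // Stopping condition, checked on the iteration just completed
    converged $\leftarrow$ $t > 0$ and $|\hat{\rho}^{(t)}_S - \hat{\rho}^{(t-1)}_S| < \varepsilon$ and $|\hat{\rho}^{(t)}_M - \hat{\rho}^{(t-1)}_M| < \varepsilon$;
    $t \leftarrow t + 1$;
until converged;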
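The stopping rule can be illustrated with a small Python helper. The text writes the condition as $|\hat{\rho}^{(t)} - \hat{\rho}^{(t-1)}| < \varepsilon$ without naming a norm, so comparing the largest per-entity change against $\varepsilon$ is an assumption made here; the function name and the sample values are hypothetical.

```python
def has_converged(previous, current, eps):
    """True when every score changed by less than eps between two iterations.

    Assumption: the change is measured entity by entity (a max-norm); the
    source does not state which norm is intended.
    """
    return all(abs(current[key] - previous[key]) < eps for key in current)


# Hypothetical seller estimates from two consecutive iterations.
rho_prev = {"s1": 0.500, "s2": 0.250}
rho_curr = {"s1": 0.5005, "s2": 0.2502}

stable = has_converged(rho_prev, rho_curr, eps=1e-3)
still_moving = has_converged(rho_prev, {"s1": 0.6, "s2": 0.25}, eps=1e-3)
```

In a full implementation the same check would be applied to both the seller and the item estimates, and iteration would stop only when both pass.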
3.2. Mitigating adversarial reviews
A known issue with many buyer rating systems is that malicious actors may negatively affect the accuracy of scores through various cheating behaviors [5, 7, 10, 19, 20, 23, 24]. Here, we mitigate such risk by proposing a method to score each review in terms of an estimated measure of trustworthiness, or integrity, which is then used to weigh each observed rating. Our proposed measure of trustworthiness considers three components: engagement, diversity, and anomaly. We calculate each component for every buyer, based on observed rating behavior across item categories. Formal definitions of each component are presented below.
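Concretely, we define $\mathcal{G}_\ell \subseteq M$, a subset of items in category $\ell$, as a collection of items which satisfy some predetermined criteria.⁶ For example, $\mathcal{G}_1$ might be the collection of all electronics, while $\mathcal{G}_2$ might be all items classified as furniture. Then, let $B_\ell \subseteq B$ be the set of all buyers who have rated items in $\mathcal{G}_\ell$. Similarly, we use $Y_\ell \subseteq Y$ to denote the subset of all ratings that were observed for items in category $\ell$, while $Y^{[b]}_\ell \subseteq Y_\ell$ further denotes the subset of ratings for items in category $\ell$ that were given by user $b$. In other words, we define
$$B_\ell = \{b \in B \mid \exists\, y^{(b)}_{s,m_j} \in Y,\ m_j \in \mathcal{G}_\ell\},$$
$$Y_\ell = \{y^{(b)}_{s,m_j} \in Y \mid m_j \in \mathcal{G}_\ell\},$$
$$Y^{[b]}_\ell = \{y^{(b)}_{s,m_j} \in Y \mid m_j \in \mathcal{G}_\ell,\ b \in B_\ell\}.$$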
Engagement. A common measure for quantifying the trustworthiness of users on a typical platform is user engagement. For example, scores provided by a buyer who is highly engaged in the platform, purchasing items and rating interactions on a regular basis, are considered more reliable than those from a one-time visitor. Here, engagement is operationalized as the relative frequency of ratings given by each buyer $b$ within a category $\ell$:
$$\alpha_{b,\ell} = \frac{|Y^{[b]}_\ell|}{|Y_\ell| / |B_\ell|},$$
where $|Y_\ell| / |B_\ell|$ is the average number of ratings provided by each buyer in category $\ell$. The corresponding user engagement weights $\alpha_{b,\ell}$ are further normalized to be within the range $[0, 1]$, using min-max normalization for each category $\ell$.
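As a minimal sketch, engagement for a single category can be computed in Python as below. The tuple layout `(buyer, seller, item, score)`, the function name, and the toy data are assumptions; returning 1.0 for everyone when all buyers are equally engaged is also an assumption, since min-max normalization is undefined in that case.

```python
from collections import Counter


def engagement_weights(category_ratings):
    """Min-max normalized alpha for each buyer in ONE item category."""
    counts = Counter(buyer for buyer, _s, _m, _y in category_ratings)
    average = len(category_ratings) / len(counts)  # |Y_l| / |B_l|
    alpha = {buyer: n / average for buyer, n in counts.items()}
    lo, hi = min(alpha.values()), max(alpha.values())
    if hi == lo:
        return {buyer: 1.0 for buyer in alpha}  # assumption for the degenerate case
    return {buyer: (a - lo) / (hi - lo) for buyer, a in alpha.items()}


# Hypothetical ratings in one category: b1 rates three times, b3 twice, b2 once.
toy_category = [
    ("b1", "s1", "m1", 5), ("b1", "s2", "m2", 4), ("b1", "s1", "m3", 3),
    ("b2", "s1", "m1", 2),
    ("b3", "s2", "m2", 4), ("b3", "s3", "m1", 5),
]
alpha = engagement_weights(toy_category)
```

With an average of two ratings per buyer, the raw values are 1.5, 0.5, and 1.0, which normalize to 1.0, 0.0, and 0.5.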
Diversity. Another consideration for a buyer's trustworthiness is the concentration of ratings. Conceptually, a buyer is considered more trustworthy if they interact with, and rate, a variety of different sellers, as opposed to repeatedly rating a small number of sellers. Thus, we quantify diversity as the proportion of unique sellers that the buyer has rated over all ratings the buyer has given within that category. Formally, let $S^{[b]}_\ell \subseteq S$ be the subset of sellers who have received a rating from user $b$ for at least one item in $\mathcal{G}_\ell$. In other words,
$$S^{[b]}_\ell = \{s_i \mid \exists\, y^{(b)}_{s_i,m_j} \in Y,\ m_j \in \mathcal{G}_\ell\}.$$
Then, the diversity weight $\beta_{b,\ell}$ for a buyer $b$ corresponding to item category $\ell$ is calculated as
$$\beta_{b,\ell} = \frac{|S^{[b]}_\ell|}{|Y^{[b]}_\ell|}.$$
Note that this quantification of diversity is relative to the total number of ratings made by the buyer. For example, a buyer who provided only one rating ($|Y^{[b]}_\ell| = 1$) is considered to have high diversity ($\beta_{b,\ell} = 1$). As the buyer rates more transactions, $\beta_{b,\ell}$ will decrease whenever the buyer rates a seller that they have already rated previously. Similar to engagement weights, we normalize the corresponding diversity weights $\beta_{b,\ell}$ via min-max normalization within each item category $\mathcal{G}_\ell$.
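The raw diversity ratio can be sketched in Python as follows, again for a single category. The tuple layout, function name, and toy data are illustrative assumptions, and the min-max normalization step is left out to keep the ratio itself visible.

```python
from collections import Counter, defaultdict


def diversity(category_ratings):
    """Raw beta per buyer: unique sellers rated divided by number of ratings."""
    sellers = defaultdict(set)
    counts = Counter()
    for buyer, seller, _item, _score in category_ratings:
        sellers[buyer].add(seller)
        counts[buyer] += 1
    return {buyer: len(sellers[buyer]) / counts[buyer] for buyer in counts}


# Hypothetical ratings: b1 rates seller sA twice, then sB and sC; b2 rates once.
toy_category = [
    ("b1", "sA", "m1", 5), ("b1", "sA", "m2", 4),
    ("b1", "sB", "m1", 3), ("b1", "sC", "m3", 4),
    ("b2", "sA", "m1", 2),
]
beta = diversity(toy_category)
```

Buyer b1 has three unique sellers over four ratings (0.75), while the single-rating buyer b2 has the maximum value of 1.0, as noted in the text.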
⁶ In this study, we use item categories as defined by the lowest-level grouping of items on eBay (https://www.ebay.com/v/allcategories).
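Algorithm 2: Integrity weighting (IW)
Input: set of buyers $B$, set of ratings $Y$, set of item categories $\mathcal{G}$
Output: integrity-adjusted ratings, $\hat{y}^{(b)}_{s,m}$
// Compute integrity weights
foreach item group $\ell$ do
    foreach $b \in B_\ell$ do
        $w_{b,\ell} \leftarrow \alpha_{b,\ell} \times \beta_{b,\ell} \times \gamma_{b,\ell}$;
        foreach $y^{(b)}_{s,m} \in Y^{[b]}_\ell$ do
            $\hat{y}^{(b)}_{s,m} \leftarrow w_{b,\ell} \times y^{(b)}_{s,m}$;
        end
    end
end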
Anomaly. We are also concerned with how much a buyer's rating of an item deviates from or conforms to the general consensus of other buyers, which we refer to as anomaly. To quantify anomaly, we first consider the standardized distance of a buyer's rating for each item from the overall distribution of ratings for that item. For any given item $m$, let $\mu_m$ and $\sigma_m$ denote the average and standard deviation of ratings that the item received across all buyers and sellers. Then, for each rating $y^{(b)}_{s,m}$ given by buyer $b$ for item $m$, we compute the normalized distance from the mean as
$$\delta^{(b)}_{s,m} = \frac{|y^{(b)}_{s,m} - \mu_m|}{\sigma_m},$$
where smaller values of $\delta^{(b)}_{s,m}$ indicate that the rating given by buyer $b$ for item $m$ is similar to, and consistent with, ratings given by other buyers for that same item. Then, $\gamma_{b,\ell}$, the anomaly weight for buyer $b$ in item group $\ell$, is computed by taking the average of $\delta^{(b)}_{s,m}$ over the buyer's ratings for items $m \in \mathcal{G}_\ell$:
$$\gamma_{b,\ell} = \frac{1}{|Y^{[b]}_\ell|} \sum_{y^{(b)}_{s,m} \in Y^{[b]}_\ell} \delta^{(b)}_{s,m}.$$
As with engagement weights and diversity weights, anomaly weights $\gamma_{b,\ell}$ are subsequently normalized via min-max normalization within each item category $\mathcal{G}_\ell$.
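A Python sketch of the raw anomaly term for one category is given below. Several choices are assumptions not fixed by the text: the distance is taken as an absolute value, the population standard deviation is used, and a distance of 0 is assigned when an item's ratings have zero spread. The tuple layout, function name, and toy data are also illustrative.

```python
from collections import defaultdict
from statistics import mean, pstdev


def anomaly(category_ratings):
    """Raw gamma per buyer: mean standardized distance from each item's mean rating."""
    by_item = defaultdict(list)
    for _buyer, _seller, item, score in category_ratings:
        by_item[item].append(score)
    stats = {item: (mean(v), pstdev(v)) for item, v in by_item.items()}

    deltas = defaultdict(list)
    for buyer, _seller, item, score in category_ratings:
        mu, sigma = stats[item]
        # Assumption: with zero spread every rating agrees, so the distance is 0.
        deltas[buyer].append(abs(score - mu) / sigma if sigma > 0 else 0.0)
    return {buyer: mean(d) for buyer, d in deltas.items()}


# Hypothetical ratings of a single item by three buyers.
toy_category = [
    ("b1", "s1", "m1", 5),
    ("b2", "s1", "m1", 3),
    ("b3", "s2", "m1", 1),
]
gamma = anomaly(toy_category)
```

The buyer whose rating equals the item mean gets a raw value of 0, and the two buyers at equal distance on either side get the same positive value. The sketch stops at the raw average and does not apply the min-max normalization described in the text.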
Integrity weighted ratings. Given the normalized weights $\alpha_{b,\ell}$, $\beta_{b,\ell}$, and $\gamma_{b,\ell}$ for engagement, diversity, and anomaly, respectively, we can compute a comprehensive integrity weight for each buyer $b$ within item category $\mathcal{G}_\ell$ as
$$w_{b,\ell} = \alpha_{b,\ell} \times \beta_{b,\ell} \times \gamma_{b,\ell}.$$
Then, for an observed rating $y^{(b)}_{s,m}$ where $m \in \mathcal{G}_\ell$, we can compute an integrity-adjusted rating, where the observed rating is weighted by the estimated integrity of user $b$ within category $\mathcal{G}_\ell$, as
$$\hat{y}^{(b)}_{s,m} = w_{b,\ell} \times y^{(b)}_{s,m}.$$
This procedure is formally summarized in Algorithm 2.
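The weighting step itself is a per-buyer product followed by a rescaling of each rating, as the Python sketch below shows for one category. The function name, the tuple layout, and the component values are hypothetical; the three dictionaries are assumed to already hold normalized values in $[0, 1]$.

```python
def integrity_adjust(category_ratings, alpha, beta, gamma):
    """Scale each rating in one category by the buyer's integrity weight.

    alpha, beta, gamma: dicts of normalized per-buyer components.
    Returns (weights, adjusted_ratings).
    """
    weights = {b: alpha[b] * beta[b] * gamma[b] for b in alpha}
    adjusted = [(b, s, m, weights[b] * y) for b, s, m, y in category_ratings]
    return weights, adjusted


# Hypothetical normalized components for two buyers.
alpha = {"b1": 1.0, "b2": 0.5}
beta = {"b1": 0.5, "b2": 1.0}
gamma = {"b1": 1.0, "b2": 0.5}
toy_category = [("b1", "s1", "m1", 4), ("b2", "s1", "m1", 4)]

weights, adjusted = integrity_adjust(toy_category, alpha, beta, gamma)
```

Because the weight is a product, a buyer who sits at the minimum of any one min-max normalized component receives a weight of 0 under this formulation.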
3.3. A comprehensive reputation score
Finally, we can compute reputation scores for sellers and items that are robust to confounded and adversarial ratings by combining RATING SEPARATION from Section 3.1 and INTEGRITY WEIGHTING from Section 3.2. This is achieved by replacing the raw ratings $y^{(b)}_{s,m}$ with their integrity-weighted counterparts, $\hat{y}^{(b)}_{s,m}$, in (2), (6), and (7) of RATING SEPARATION.
4. Experiment
To evaluate the efficacy of the methods proposed, we further present a novel and comprehensive simulation framework. The contributions of our new approach are twofold. First, in contrast to existing literature, we directly model item-level transactions. This enables our framework to distinguish between a buyer's rating of sellers and satisfaction with a specific item. Second, the proposed framework incorporates a comprehensive model of plausible adversarial behavior, allowing us to test how robust our reputation scoring systems are to numerous realistic attack scenarios.
4.1. A simulation framework for online marketplaces
The simulation framework we propose involves four components: three entities (items, sellers, and buyers) and a model for how the different entities interact (transactions). A major advantage of our approach is that by explicitly modeling items, in addition to buyers and sellers, we can further capture the realistic dynamics that take place in online marketplaces.
An online marketplace is characterized by the number of items, buyers, and sellers on the platform. For our experiment, we consider two parameter regimes: small-size and large-size marketplaces. For the small-size marketplace, we set 1,000 items, 500 sellers, and 5,000 buyers. For the large-size marketplace, we set 2,000 items, 1,000 sellers, and 10,000 buyers. For each setting, we simulate 300 days of marketplace activity, where each buyer is limited to one transaction per day. We describe each component of the simulation in detail below.
Items. An item is parameterized by its quality and categorization. The quality of an item is represented by a continuous score in the range [0,1]. We allow for multiple hierarchical item categories.
For the purpose of our simulations in this study, we limit the hierarchy of item categories to three levels (top, middle, and bottom), which we find sufficient to represent many typical online marketplace categorizations realistically. In our experiments, we set three top-level categories, each of which consists of five mid-level subcategories. Each mid-level category is further classified into six bottom-level subcategories. In total, there are 3 × 5 × 6 = 90 unique item categories. We further assign items uniformly across different categories, so that the number of items available in each category is similar.
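The category hierarchy and item assignment can be sketched in Python as follows. Representing a bottom-level category as a `(top, mid, bottom)` index triple and assigning items round-robin are both assumptions made for illustration; the text only states that items are spread uniformly so that category sizes are similar.

```python
from collections import Counter
from itertools import product


def build_categories(n_top=3, n_mid=5, n_bottom=6):
    """All bottom-level categories as (top, mid, bottom) index triples."""
    return list(product(range(n_top), range(n_mid), range(n_bottom)))


def assign_items(n_items, categories):
    """Round-robin assignment, so category sizes differ by at most one item."""
    return {item: categories[item % len(categories)] for item in range(n_items)}


categories = build_categories()
item_category = assign_items(1000, categories)  # small-size marketplace
sizes = Counter(item_category.values())
```

With 1,000 items and 90 categories, each category ends up with 11 or 12 items.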
Sellers. Sellers are parameterized by their capability and the items that they offer. We assume that seller capabilities follow a truncated normal distribution, bounded in [0,1], with mean 0.5 and standard deviation 0.25. Higher capability scores correspond to faster delivery and better service, while lower capability scores correspond to late delivery and poor service.
An important characteristic of sellers, which is not often captured in existing simulation frameworks, is the variety of items that they offer. For example, while some sellers may focus on selectively offering only a few items in major categories, others may choose to offer a wide selection of items across multiple categories. By modeling items as entities, and parameterizing sellers by the items they offer, the simulation framework we propose is capable of representing this diversity.
In our experiment, we assume that sellers offer items in one major category, along with items from up to three minor categories. To operationalize this assumption, for each seller we first sample one major item category, from which they offer between three and six items. Then, we sample between zero and three minor item categories, from which one to six items are subsequently sampled.
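One way to sample a seller under these assumptions is sketched below in Python. Rejection sampling for the truncated normal, drawing one to six items from each minor category (rather than one to six in total), and the function and variable names are all assumptions; the text does not fix these details.

```python
import random


def sample_capability(rng, mean=0.5, sd=0.25):
    """Truncated normal on [0, 1], drawn by rejection sampling."""
    while True:
        x = rng.gauss(mean, sd)
        if 0.0 <= x <= 1.0:
            return x


def sample_assortment(rng, items_by_category):
    """Pick one major category (3-6 items) and 0-3 minor categories (1-6 items each).

    items_by_category: dict mapping category -> list of item ids.
    Returns (major, minors, offered_items).
    """
    categories = list(items_by_category)
    major = rng.choice(categories)
    others = [c for c in categories if c != major]
    minors = rng.sample(others, rng.randint(0, min(3, len(others))))

    pool = items_by_category[major]
    offered = set(rng.sample(pool, min(len(pool), rng.randint(3, 6))))
    for category in minors:
        pool = items_by_category[category]
        offered.update(rng.sample(pool, min(len(pool), rng.randint(1, 6))))
    return major, minors, offered


rng = random.Random(0)
# Hypothetical catalogue: 10 categories with 12 items each.
catalogue = {c: [f"c{c}-i{i}" for i in range(12)] for c in range(10)}
capability = sample_capability(rng)
major, minors, offered = sample_assortment(rng, catalogue)
```

Passing an explicit `random.Random` instance keeps a simulated marketplace reproducible from a seed.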
Buyers. Buyers are parameterized by item categories of interest and the level of interest for each category. Each buyer is randomly assigned three to six item categories of interest. The level of interest for each item category is assigned a continuous value in the range [0,1]. We assume that buyers are more likely to purchase items in categories for which they have a higher level of interest. After each transaction, buyers will leave a single score rating as a function of item quality and seller capability.
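A buyer of this kind can be sketched in Python as a mapping from categories of interest to interest levels. Drawing interest levels uniformly and choosing a purchase category with probability proportional to interest are assumptions; the text states only that higher interest makes a purchase more likely. Function names and the category list are hypothetical.

```python
import random


def make_buyer(rng, categories):
    """Assign 3-6 categories of interest, each with an interest level in [0, 1)."""
    chosen = rng.sample(categories, rng.randint(3, 6))
    return {category: rng.random() for category in chosen}


def pick_category(rng, interests):
    """Choose a purchase category with probability proportional to interest."""
    categories = list(interests)
    weights = [interests[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


rng = random.Random(1)
all_categories = list(range(90))  # e.g., 90 bottom-level categories
buyer = make_buyer(rng, all_categories)
category = pick_category(rng, buyer)
```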
Transactions. Transactions represent the event in which a buyer purchases an item from a seller and provides a rating. Each buyer is assigned a random purchase cycle, between zero and three days, which represents how often the buyer will participate in a transaction on the marketplace. Buyers are more likely to purchase from sellers who offer items in which they have higher levels of interest. This could result in unrealistically high-frequency transactions between the same buyer-seller pairs. To mitigate this issue, we require a repurchase waiting time of three, five, or ten days before the same buyer-seller pair can have a repeat transaction.
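The repurchase waiting time can be enforced with a small lookup of the last transaction day per buyer-seller pair, as in the sketch below. The helper `can_transact` is hypothetical, and treating a gap of exactly `waiting_days` as sufficient is an assumption.

```python
def can_transact(last_purchase_day, buyer, seller, today, waiting_days):
    """Return True if the pair has never traded or has waited long enough."""
    last = last_purchase_day.get((buyer, seller))
    return last is None or today - last >= waiting_days


last_purchase_day = {}
waiting_days = 5  # the text uses waiting times of three, five, or ten days
transaction_days = []
for day in range(12):
    if can_transact(last_purchase_day, "buyer-1", "seller-1", day, waiting_days):
        last_purchase_day[("buyer-1", "seller-1")] = day
        transaction_days.append(day)
```

In this toy loop the buyer attempts a purchase from the same seller every day, and only the attempts on days 0, 5, and 10 go through.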
4.2. Simulating malicious ratings
To evaluate the robustness of a reputation system in the presence of adversarial buyers,
we model the behavior of malicious ratings. Based on existing literature [5, 15, 20, 24, 27,
29], we categorize adversarial buyers by three behavioral patterns and six attack strategies.
The three behavioral patterns and six attack strategies are presented in Tables 1 and 2,
respectively, along with a short description and relevant literature reference.
Any attacker will adopt one behavioral pattern and one attack strategy, allowing for a total of 18 possible attacker types. Compared to existing work, which considers only a limited subset of these 18 possible pairs, here we investigate the performance of a rating system under all 18 types of attacks. This is achieved by modeling each type of attack behavior and strategy within the simulation framework presented in Section 4.1. In our experiment, we parameterize the intensity of attacks on a platform as the attack rate, defined as the proportion of all ratings that are malicious. We compare results for varying attack rates, from 10% to 90%, in 10% increments.
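The experimental grid implied by this design can be enumerated directly. The sketch below uses the pattern and strategy names from Tables 1 and 2; the variable names are illustrative only.

```python
from itertools import product

BEHAVIOR_PATTERNS = ["basic", "camouflage", "whitewashing"]
ATTACK_STRATEGIES = ["BS", "BM", "BS & BM", "r-high", "r-low", "r-high/low"]
ATTACK_RATES = [pct / 100 for pct in range(10, 100, 10)]  # 0.1, 0.2, ..., 0.9

# Every attacker type is one (behavioral pattern, attack strategy) pair.
attacker_types = list(product(BEHAVIOR_PATTERNS, ATTACK_STRATEGIES))

# One simulation setting per attacker type and attack rate.
experiment_grid = [
    (pattern, strategy, rate)
    for pattern, strategy in attacker_types
    for rate in ATTACK_RATES
]
```

This yields 18 attacker types and, with nine attack rates each, 162 simulation settings.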
Table 1: Three categories of adversarial behavior patterns.

| Pattern | Description | Reference |
|---|---|---|
| Basic | Attackers consistently exhibit adversarial behavior, granting high ratings to conspiring sellers or low ratings to rival sellers. | [5, 7, 15, 20, 24, 26–29] |
| Camouflage | Attackers attempt to camouflage their adversarial intent by strategically mixing justified ratings with malicious ones. Under this behavioral scheme, attackers typically exhibit benign behavior in early interactions, and transition to adversarial activities at later stages. | [29] |
| Whitewashing | Attackers behave under the basic scheme, while subsequently creating multiple accounts on the platform to mitigate detection and create an illusion that their malicious ratings are socially validated. | [29] |
Table 2: Six categories of attack strategy. Each strategy is assigned a number which we use as a reference in the text.

| # | Name | Description | Reference |
|---|---|---|---|
| 1 | Ballot stuffing (BS) | Attempting to boost the reputation of conspiring sellers by giving maximum ratings | [5, 20, 27] |
| 2 | Bad mouthing (BM) | Attempting to hurt the reputation of rival sellers by giving minimum ratings | [5, 20, 27] |
| 3 | BS & BM | Employing a mix of both ballot stuffing and bad mouthing | [5, 20, 27] |
| 4 | r-high shifting | Attempting to boost the reputation of conspiring sellers by giving ratings that are r points higher than the average rating | [15, 20, 24] |
| 5 | r-low shifting | Attempting to hurt the reputation of rival sellers by giving ratings that are r points lower than the average rating | [15, 20, 24] |
| 6 | r-high/low shifting | Employing both r-high shifting and r-low shifting strategies to boost the reputation of conspiring sellers while simultaneously reducing the reputation of rival sellers | [15, 20, 24] |
Note that the last three strategies—r-high, r-low, and r-high/low shifting—require that the
attackers assign some distribution of benign buyer ratings for the target sellers.
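The six strategies in Table 2 can be expressed as a single rating rule, sketched below. The 1 to 5 rating scale, the clipping to that scale, the default r of 1, and the function name `malicious_rating` are assumptions made for this illustration; `benign_mean` stands for the attacker's estimate of the average benign rating of the target seller.

```python
def malicious_rating(strategy, target, benign_mean, r=1.0, lo=1.0, hi=5.0):
    """Return the rating an attacker gives, or None if the strategy ignores the target.

    strategy: one of "BS", "BM", "BS & BM", "r-high", "r-low", "r-high/low"
    target:   "conspirer" (seller to boost) or "rival" (seller to hurt)
    The rating scale [lo, hi] is an assumption, not taken from the paper.
    """
    def clip(x):
        return max(lo, min(hi, x))

    if target == "conspirer":
        if strategy in ("BS", "BS & BM"):
            return hi  # ballot stuffing: maximum rating
        if strategy in ("r-high", "r-high/low"):
            return clip(benign_mean + r)  # r points above the average
    elif target == "rival":
        if strategy in ("BM", "BS & BM"):
            return lo  # bad mouthing: minimum rating
        if strategy in ("r-low", "r-high/low"):
            return clip(benign_mean - r)  # r points below the average
    return None
```

For example, under these assumptions an r-low attacker facing a rival whose benign ratings average 3.2 would give a rating of about 2.2, while a ballot-stuffing attacker has no rule for rival sellers and returns `None`.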
5. Results
We use Spearman's rank correlation coefficient to measure and compare how well a reputation system can recover the true capability rankings of sellers.⁷ Our results are presented in two parts. First, we investigate the efficacy of RATING SEPARATION (RS). To do so, we compare RS with a naive baseline approach of computing a seller's reputation via a simple average of observed ratings. Second, to test whether INTEGRITY WEIGHTING (IW) and the combined approach of RATING SEPARATION AND INTEGRITY WEIGHTING (RS&IW) are truly robust to adversarial rating activity, we compare the performance of each method to existing mitigation techniques.
⁷ Note that here, we focus on seller reputation, but the proposed methods and simulation framework could also be used for estimating item quality via a trivial extension of this work.
5.1. RATING SEPARATION (RS) performance
First, we evaluate the performance of RATING SEPARATION in recovering true seller rankings. Because RATING SEPARATION in itself does not mitigate against adversarial ratings, for this section we focus on a simulated platform that assumes no malicious attacks.⁸ We compare performance under two different assumptions of marketplace parameters, as described in Section 4.1. As a baseline, we compute a naive measure of seller reputation by taking the average of all ratings that a seller received.
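The baseline and the evaluation metric can be sketched as follows. The ratings and capabilities are made-up toy values rather than results from the paper, and the pure-Python `spearman` helper (Pearson correlation of average ranks) is written out here only to keep the example dependency-free.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def baseline_reputation(ratings_by_seller):
    """Naive baseline: the average of all ratings a seller received."""
    return {seller: mean(r) for seller, r in ratings_by_seller.items()}


# Toy data (hypothetical): true capabilities and observed ratings.
true_capability = {"s1": 0.9, "s2": 0.5, "s3": 0.2, "s4": 0.7}
ratings = {"s1": [5, 4, 5], "s2": [3, 4], "s3": [2, 1, 2], "s4": [3, 3, 4]}

estimated = baseline_reputation(ratings)
sellers = sorted(true_capability)
rho = spearman([estimated[s] for s in sellers], [true_capability[s] for s in sellers])
```

In this toy case the baseline swaps the ranks of sellers s2 and s4, which gives a rank correlation of 0.8.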
[Figure 4. Box plots, one column panel per simulation parameter setting. x-axis: Method (Baseline, Rating separation); y-axis: Rank correlation (ticks from 0.80 to 0.95).]
Fig. 4: Simulation results comparing a single iteration of RATING SEPARATION versus the baseline. Each box plot summarizes the results of 10 simulations. The y-axis shows the rank correlation between the estimated seller reputation and true seller capability, for each method. Column panels show the two simulation parameter settings. Overall, RATING SEPARATION consistently recovers the true ranking of seller capability more reliably than the baseline, with less variance across trials.
In Fig. 4, we compare the rank correlation between estimated seller reputation and true seller capabilities for the baseline and RATING SEPARATION using just a single iteration. The box plot represents the distribution of rank correlation performance achieved for each method, across 10 simulations each.
We find that for every simulation trial, RATING SEPARATION achieved higher rank correlation than the baseline. RATING SEPARATION was also more consistent in recovering true seller rankings, as demonstrated by its lower variance in performance across simulations.
⁸ We investigate robustness of our proposed methods to attacks in Section 5.2.
[Figure 5. Column panels: Parameter set 1, Parameter set 2. x-axis: Iteration (1 to 10); y-axis: Rank correlation (ticks from 0.90 to 0.98).]
Fig. 5: Rank correlation between the estimated seller reputation computed via RATING SEPARATION and true reputation as a function of the number of iterations. Column panels show the two simulation parameter sets.
While Fig. 4 shows that just a single iteration of RATING SEPARATION can achieve superior performance compared to the baseline, the iterative nature of RATING SEPARATION allows for further improvement. In Fig. 5, we show that the performance of RATING SEPARATION can be substantially improved by just 3 additional iterations, at which point the rank correlation is close to perfect at about 0.98. This represents a tremendous improvement, considering that the baseline approach, at best, recovers seller rankings with
5.2. Performance with adversarial ratings
Next, we evaluate the efficacy of INTEGRITY WEIGHTING in mitigating the harms of adversarial ratings. We evaluate two approaches: using INTEGRITY WEIGHTING alone, and using both RATING SEPARATION AND INTEGRITY WEIGHTING. We compare performance to three existing methods from previous literature: BRS [27], PA [28], and iCLUB [20]. A baseline approach, which does not explicitly adjust for potential adversarial behavior, is included as well. As mentioned in Section 4.2, we conduct simulations for 18 possible attack types, the unique pairs of three behavior patterns and six attack strategies. For each attack type, we vary the rate of attack between 10% and 90%, in 10% increments. The results are presented in Fig. 6.
From Fig. 6, we ﬁrst note that the methods we propose (IW and RS&IW) consistently
outperform all alternative methods in every setting that we test. Notably, we ﬁnd that
the baseline method, which assigns equal weight to all ratings and does not account for
any malicious behavior, performs better than more sophisticated methods under some
conditions.
Overall, as the rate of attacks increases, the performance of all methods in all settings decreases, albeit to varying degrees. One interesting finding is that iCLUB typically achieved either negative correlation, or the highest performance among the four benchmark approaches. This suggests that while iCLUB can be a high-performing method under specific assumptions of adversarial behavior, it is not generally reliable.
Under either a bad mouthing strategy (Pattern 2) or an r-low shifting strategy (Pattern 5), other methods for mitigating adversarial ratings typically do no better than a naive baseline approach. This indicates that existing methods are tailored to certain types of attack strategies, and do not perform well against a wide range of attacks in general.
[Figure 6. Column panels: basic, camouflage, washing; row panels: Pattern 1 to Pattern 6. x-axis: Rate of attack (10% to 90%); y-axis: Rank correlation. Methods: IW, RS&IW, Baseline, BRS, PA, iCLUB.]
Fig. 6: Comparison of multiple methods for mitigating malicious reviews. The x-axis shows the proportion of attacks that are assumed in each simulation, while the y-axis shows the rank correlation between the estimated seller reputation and true reputation, for each method. Column panels show the three behavioral patterns and row panels show the six attack strategies (Patterns 1 to 6). Overall, the proposed methods (IW and RS&IW) are able to recover the true reputation more reliably than any existing method across all simulated circumstances.
6. Conclusions
In increasingly complex online marketplaces that involve the interaction of multiple agents, evaluating the quality and characteristics of each agent is becoming more important. This paper addresses the issue of confounded buyer ratings, as well as malicious ratings, in reputation systems and proposes RATING SEPARATION AND INTEGRITY WEIGHTING (RS&IW), a system for providing agent reputations that are robust to confounded ratings and various types of cheating behavior. Through extensive experiments, we showed that our reputation system can disentangle scores for sellers from the confounded ratings and is robust to numerous realistic attack scenarios generated by incorporating a comprehensive model of plausible adversarial behaviors.
While, in the interest of clarity and consistency, we have focused our work in this paper on the concrete problem of identifying seller rankings and mitigating malignant buyer behavior, the methods we propose could be extended to a broader family of problems in a more general context of multi-agent platforms. One possible extension would be to apply RATING SEPARATION in matching markets, where participating agents report numerous confounded signals with regard to the quality of other entities. For example, in ride sharing applications, RATING SEPARATION could be applied to ratings to decouple rider satisfaction with driver characteristics (e.g., personal, vehicle) from satisfaction with route characteristics (e.g., traffic conditions, travel time). Or in a three-sided market, such as food delivery services, RATING SEPARATION could be extended to disentangle courier and restaurant ratings from an eater's single score.
Besides seller performance and item quality, item price is one of the main factors influencing a buyer's single score. A buyer may sometimes give a generous score to a seller, even though both the seller's performance and the item quality are unsatisfactory, because the price is the lowest available on the platform. In a further study, we plan to develop a framework that disentangles this price effect from a buyer's single score, to obtain a cleaner measure of seller reputation.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1A2B5B03001960) and by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (NRF-2017M3C4A7069440).
Bibliography
[1] Bidgoly, A., Ladani, B.: Modeling and Quantitative Veriﬁcation of Trust Systems
Against Malicious Attackers. The Computer Journal 59(7), 1005–1027 (2016)
[2] Biega, A.J., Gummadi, K.P., Weikum, G.: Equity of attention: Amortizing individual fairness in rankings. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. pp. 405–414 (2018)
[3] Cabral, L., Hortacsu, A.: The dynamics of seller reputation - theory and evidence
from eBay. The Journal of Industrial Economics LVIII(1), 54–78 (2010)
[4] Chandrasekaran, P., Esfandiari, B.: A model for a testbed for evaluating reputation
systems. In: Proceedings of the 10th International Conference on Trust, Security and
Privacy in Computing and Communications. pp. 296–303. IEEE (2011)
[5] Dellarocas, C.: Immunizing Online Reputation Reporting Systems Against Unfair
Ratings and Discriminatory Behavior. In: Proceedings of the 2nd ACM conference
on Electronic Commerce. pp. 150–157 (2000)
[6] Fan, Z.P., Xi, Y., Liu, Y.: Supporting consumer's purchase decision: a method for ranking products based on online multi-attribute product ratings. Soft Computing 22, 5247–5261 (2018)
[7] Fang, H., Zhang, J., Sensoy, M., Thalmann, N.M.: SARC: Subjectivity Alignment for Reputation Computation (Extended Abstract). In: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). pp. 1365–1366 (2012)
[8] Gao, J., Zhou, T.: Evaluating user reputation in online rating systems via an iterative
group-based ranking method. Physica A: Statistical Mechanics and its Applications
473, 546–560 (2017)
[9] Ghiasi, H., Brojeny, M., Gholamian, M.: A reputation system for e-marketplaces
based on pairwise comparison. Knowledge and Information Systems 56, 613–636
(2018)
[10] Houser, D., Wooders, J.: Reputation in auctions: Theory, and evidence from eBay.
Journal of Economics and Management Strategy 15(2), 353–369 (2006)
[11] Howe, J.: The Rise of Crowdsourcing. Wired Magazine (14) (2006),
http://archive.wired.com/wired/archive/14.06/crowds.html
[12] Irissappane, A.A., Jiang, S., Zhang, J.: Towards a comprehensive testbed to evaluate the robustness of reputation systems against unfair rating attacks. In: User Modeling Adaptation and Personalization Workshops (2012)
[13] Jiang, W, X.Y.G.H.W.C.Z.L.: Multi agent system-based dynamic trust calculation
model and credit management mechanism of online trading. Intelligent Automation
& Soft Computing 22(4), 639–649 (2016)
[14] Jøsang, A., Ismail, R., Boyd, C.: A survey of trust and reputation systems for online
service provision. Decision Support Systems 43(2), 618–644 (2007)
[15] Jøsang, A., Ismail, R.: The Beta Reputation System. Proceedings of the 15th Bled Electronic Commerce Conference 160, 324–337 (2002)
[16] Kerr, R., Cohen, R.: TREET: The Trust and Reputation Experimentation and Evalu-
ation Testbed. Electronic Commerce Research 10(3), 271–290 (2010)
[17] Kramer, M.A.: Self-selection bias in reputation systems. In: Etalle, S., Marsh, S.
(eds.) IFIP International Federation for Information Processing, vol. 238, chap. Trust
Mang, pp. 255–268. Boston: Springer (2007)
[18] Leadbeater, C.: We-think. Proﬁle books (2009)
[19] Lim, E.P., Nguyen, V.A., Jindal, N., Liu, B., Lauw, H.W.: Detecting product review
spammers using rating behaviors. In: Proceedings of the 19th ACM international
conference on Information and knowledge management - CIKM ’10. pp. 939–948
(2010)
[20] Liu, S., Zhang, J., Mao, C., Theng, Y., Kot, A.: iCLUB: an integrated clustering
based approach to improve the robustness of reputation systems. In: Proceedings
of 10th International Conference on Autonomous Agents and Multiagent Systems
(AAMAS). pp. 1151–1152 (2011)
[21] Ma, L., Pei, Q., Xiang, Y., Yao, L., Yu, S.: A reliable reputation computation frame-
work for online items in E-commerce. Journal of Network and Computer Applica-
tions 134, 13–25 (2019)
[22] Oh, H., Kim, S., Park, S., Zhou, M.: Can You Trust Online Ratings? A Mutual
Reinforcement Model for Trustworthy Online Rating Systems. IEEE Transactions
on Systems, Man, and Cybernetics 45(12), 1564–1576 (2015)
[23] Oh, H.K., Kim, S.W., Park, S., Zhou, M.: Trustable aggregation of online ratings.
In: Proceedings of the 22nd ACM international conference on Conference on infor-
mation & knowledge management - CIKM ’13. pp. 1233–1236 (2013)
[24] Regan, K., Poupart, P., Cohen, R.: Bayesian reputation modeling in e-marketplaces
sensitive to subjectivity, deception and change. In: Proceedings of the National Con-
ference on Artiﬁcial Intelligence. vol. 21, p. 1206 (2006)
[25] Resnick, P., Zeckhauser, R., Swanson, J., Lockwood, K.: The value of reputation on
eBay: A controlled experiment. Experimental Economics 9(2), 79–101 (2006)
[26] Teacy, W.T., Patel, J., Jennings, N.R., Luck, M.: TRAVOS: Trust and reputation in
the context of inaccurate information sources. Autonomous Agents and Multi-Agent
Systems 12(2), 183–198 (2006)
[27] Whitby, A., Jøsang, A., Indulska, J.: Filtering Out Unfair Ratings in Bayesian Rep-
utation Systems. Icfain Journal of Management Research 6(2), 106–117 (2005)
[28] Zhang, J., Cohen, R.: Evaluating the trustworthiness of advice about seller agents
in e-marketplaces: A personalized approach. Electronic Commerce Research and
Applications 7(3), 330–340 (2008)
[29] Zhang, L., Jiang, S., Zhang, J., Ng, W.K.: Robustness of Trust Models and Combi-
nations. Trust Management VI 374, 36–51 (2012)
[30] Zhou, X., Murakami, Y., Ishida, T., Liu, X., Huang, G.: ARM: Toward Adaptive
and Robust Model for Reputation Aggregation. IEEE Transactions on Automation
Science and Engineering 17(1), 88–99 (2020)
Hyun-Kyo Oh received his B.S., M.S., and Ph.D. degrees in Electronics and Computer
Engineering from Hanyang University, Seoul, Korea, in 2008, 2010, and 2016, respectively. He visited
the Department of Computer Science of Carnegie Mellon University as a visiting scholar
in 2013. He worked with the Knowledge Computing Group at Microsoft Research Asia
as a research intern from 2014 to 2015. In 2016, he joined Samsung Electronics, where
he currently is a lead data scientist working on creating a new machine learning and deep
learning platform that deals with the health of V-NAND flash memory and the performance of Solid State Disks (SSDs). He is also currently a visiting researcher at the Institute for Software Research at Carnegie Mellon University.
Jongbin Jung received a Ph.D. in Computational Social Science and Decision Analysis
from Stanford University. He is primarily interested in using quantitative methods and
data analytics to help improve and evaluate human decisions. Jongbin studied operations
research (M.S.) and business administration (B.B.A.) at Yonsei University (Seoul, South
Korea).
Sunju Park received her B.S. and M.S. in computer engineering from Seoul National
University, Seoul, Korea, and the Ph.D. degree in computer science and engineering from
the University of Michigan, Ann Arbor, MI, USA. She has served on the faculties of
Management Science and Information Systems, Rutgers University, NJ, USA. She is a
professor of Operations, Decisions and Information at School of Business, Yonsei Uni-
versity, Seoul, Korea. Her current research interests include analysis of online social net-
works, multiagent systems for online businesses, and pricing of network resources. Her
publications include Computers and Industrial Engineering, Electronic Commerce Re-
search, Transportation Research, IIE Transactions, the European Journal of Operational
Research, the Journal of Artiﬁcial Intelligence Research, Interfaces, Autonomous Agents
and Multiagent Systems, and other leading journals.
Sang-Wook Kim received the B.S. degree in computer engineering from Seoul National
University, in 1989, and the M.S. and Ph.D. degrees in computer science from the Ko-
rea Advanced Institute of Science and Technology (KAIST), in 1991 and 1994, respec-
tively. In 2003, he joined Hanyang University, Seoul, Korea, where he currently is a pro-
fessor at the Department of Computer Science and Engineering and the director of the
Brain-Korea21-Plus research program. He is also leading a National Research Lab (NRL)
Project funded by the National Research Foundation since 2015. From 2009 to 2010, he
visited the Computer Science Department, Carnegie Mellon University, as a visiting pro-
fessor. From 1999 to 2000, he worked with the IBM T. J. Watson Research Center, USA,
as a postdoc. He also visited the Computer Science Department at Stanford University
as a visiting researcher in 1991. He is an author of more than 200 papers in refereed
international journals and international conference proceedings. His research interests in-
clude databases, data mining, multimedia information retrieval, social network analysis,
recommendation, and web data analysis. He is a member of the ACM and the IEEE.
Article
Reputation is a crucial factor that governs the importance of a software agent in the agent-mediated e-market. In the e-market, various buyers and service providers are involved in buying and selling the products. A buyer agent (BA) acts on behalf of a buyer to buy the products from a service provider agent (SPA) preferably having a good reputation score (Rep-Score). The conventional customer rating mechanism for online transactions lacks adequate analysis and investigation of customer reviews and hence does not reflect the accurate reputation of the service providers. This research investigates the reputation of a software agent using customer feedback based on product attributes such as product quality, design, price, delivery time, and defects. A knowledge rule-set is formed to establish a link between customer feedback and the repute of a software agent. Further, a simulation-based approach using the Rosetta toolkit and the Fuzzy Control System is applied to quantify and fine-tune the reputation of a software agent. There could be a chance of an unfair relationship between the same buyer-seller pair due to recurrent transactions. The proposed work eliminates any chance of a conspiracy between a service provider and a buyer agent. In case, the buyer agent makes repeated transactions with a particular service provider agent, the value of the weight assigned to the reputation of the service provider agent is significantly diminished for each new transaction, hence decreasing the final value of the Rep-Score. As a result, this method guarantees the correctness of the reputation evaluation of a software agent. A performance analysis is performed to validate the proposed approach using mean squared error and standard deviation.
Chapter
Fair evaluation of users is the basic guarantee for the healthy development of the service ecosystem. However, existing methods do not provide an indicator of when can get fair evaluation and how to reduce the proportion of malicious users from the root. This paper proposes a “user-service” double-side evaluation(USDSE) model to solve the problem above. Firstly, we start with getting the reputation of users by using the evaluation of service. Normal and malicious users are distinguished by their reputation. Secondly, we use the minimum number of normal users as the indicator to show when we can get fair evaluation. Finally, the revenue of employing collusive users has been analyzed to reduce the proportion of collusive users indirectly. The simulation experiments show that USDSE effectively improves the accuracy of identifying malicious users and reduces the revenue of employing collusive users.
Article
Human activities and behaviour in different domains are usually influenced by other people’s actions and opinion. Nowadays, it is evident that there is a growing research interest in sentiment analysis, evaluation and prediction. Content from web sources and social media is frequently used when people want to see others’ opinion about different things. Our research is focused on ML-based sentiment analysis of food services reviews data. The comparison of several regression models with regards to prediction of customer satisfaction of restaurant and food services is presented. The experimental data collected from food serving businesses located in Shanghai Lujiazui Commercial Zone includes keywords extracted from the customers’ written reviews. Additionally, the data are spatially labelled enabling to conduct separate analyses for different geographical regions. As a conclusion, the keywords extracted from the customer’s reviews were suitable for the prediction of three observed satisfaction criteria: food taste, service, and environment.
Article
Full-text available
Implementing a reputation system is an effective strategy to facilitate trust and security in an online environment. In addition to that, reputation systems can help online customers through decision-making process. However, in real-world situations, these systems have to deal with plenty of problems and challenges. This paper aims to solve four problems that are common to reputation systems in e-marketplaces, namely the subjectivity of ratings, inequality of transactions, multi-context reputation and dynamic behavior of users. The proposed model starts with the pairwise comparison, which is a powerful tool for removing bias from ratings. Then, we extend the concept of pairwise comparison to contests between users. A pairwise comparison has only a winner and a loser, but we can associate a score differential with a pairwise comparison when we consider it as a match. This score differential is adjusted in a way that three other problems can be solved. We implemented our model in a multi-agent simulation in which real-world data were also incorporated. We compared our model with some of previous reputation systems. Experiments show that our model outperforms previous ones when faced with real-world challenges.
Article
Full-text available
Online product ratings, as a type of electronic word-of-mouth, play an important role for helping consumers select desirable products, but it is difficult for consumers to read a large number of online ratings on e-commerce Web site. To support consumer’s purchase decision, how to rank the candidate products based on online product ratings and consumer’s preferences is a noteworthy research topic, while the existing studies concerning this issue are still relatively scarce. This paper proposes a method for ranking products based on online multi-attribute product ratings. In the method, a discrete percentage distribution of the evaluation of each candidate product with respect to each attribute based on online ratings is first constructed, and the $$3\sigma$$ criterion is used to eliminate the anomalous ratings. Then, by defining of the stochastic dominance rules and the stochastic dominance degrees on comparing two discrete percentage distributions, the stochastic dominance relation between each pair of products is determined, and the corresponding stochastic dominance degree is calculated. Further, according to the obtained stochastic dominance degrees, the ranking of candidate products can be determined using the PROMETHEE-II method. A case study on selecting the automobile is given to illustrate the use of the proposed method.
Article
Full-text available
Reputation is a valuable asset in online social lives and it has drawn increased attention. How to evaluate user reputation in online rating systems is especially significant due to the existence of spamming attacks. To address this issue, so far, a variety of methods have been proposed, including network-based methods, quality-based methods and group-based ranking method. In this paper, we propose an iterative group-based ranking (IGR) method by introducing an iterative reputation-allocation process into the original group-based ranking (GR) method. More specifically, users with higher reputation have higher weights in dominating the corresponding group sizes. The reputation of users and the corresponding group sizes are iteratively updated until they become stable. Results on two real data sets suggest that the proposed IGR method has better performance and its robustness is considerably improved comparing with the original GR method. Our work highlights the positive role of users' grouping behavior towards a better reputation evaluation.
Conference Paper
Full-text available
Evaluation of the effectiveness and robustness of reputation systems is important for the trust research community. However, existing testbeds are mainly simulation based and not flexible to perform robust-ness evaluation, and none of them is specifically designed to evaluate the robustness of reputation systems against unfair rating attacks. In this pa-per, we propose a novel comprehensive testbed by simulating three types of environments (simulated environments, real environments with simu-lated unfair rating attacks, and real environments with detected unfair ratings). The testbed incorporates sophisticated deception models and unfair rating attack models, and introduces several performance metrics to fully test and compare the effectiveness and robustness of different reputation systems. We also provide two case studies to demonstrate the usage of partial features of our proposed testbed.
Article
In dynamic, open, and service-oriented computing environments, e.g., e-commerce and crowdsourcing, service consumers must choose one of the services or items to complete their tasks. Due to the scale and dynamic characteristics of these environments, service consumers may have little or no experience with the available services. To this end, reputation systems are proposed and have played a crucial role in the success of online service-oriented transactions. In this paper, we study the current reputation systems used in commercial environments. In these rating-based reputation systems, we found they are not only resilient to the changes (time lag) but also vulnerable to unfair ratings. To address the problems in parallel, we propose an adaptive reputation model (ARM). ARM can dynamically adjust its model parameters to adapt the latest changes in a service. To tackle time lag, the proposed model generalizes the fixed sliding window, used in current commercial platforms, into a dynamic sliding window mechanism. Thus, the model can completely mitigate the influence of obsolete ratings. To detect unfair ratings, our model implements a statistical strategy based on hypothesis testing after transforming the ratings in the linear window into residuals. Experiments not only validate the effectiveness of the proposed model but also show that it outperforms the existing reputation system by 45% on average based on five test cases. The results also show that the proposed model can asymptotically converge to the underlying reputation value as ratings begin to accumulate.
Article
Most of online trading platforms allow consumers to give personal ratings to online items. By computing the weighted mean of the ratings, the reputation values of online items can be derived to assist consumers to make purchasing decisions. However, it is never a simple task to derive a reliable reputation value of any given item and existing works fail to achieve this. Thus, in this paper, we propose a reliable reputation computation framework for online items which can be adopted by online trading platforms or run by a third party to provide reputation computation as a service. At first, a fine-grained two-phase detection method is proposed to detect malicious ratings. After filtering out the ratings detected as malicious, the weights of the remaining ratings are determined by computing the degrees to which the users giving these ratings are interested in a target item. Extensive experiments verify that the proposed reliable reputation computation framework is effective to detect different kinds of malicious ratings and determine the interest degrees of users.
Article
Malicious behavior of all kinds now appears in C2C online auctions; lack of trust and credit fraud are especially prominent. Building an effective trust model has therefore become a pressing problem. After analyzing the limitations of existing online transaction trust mechanisms, and in light of the characteristics of the online transaction trust problem (dynamic, anonymous, and virtual), the article proposes a dynamic trust calculation model and reputation management mechanism for online trading based on a multi-agent system. The model consists of three parts. The first part is the trust of the user domain, which emphasizes the influence of recent credibility on current trust in order to motivate users to adopt an agreed cooperative strategy. The second part is the weighted average of reputation feedback scores. The weights mainly consider the credibility of the person giving the feedback score, the value of the transaction (to prevent the "credit squeeze"), temporal discounting (to guard against fluctuations in credibility), and other factors. The third part weights the community contribution: according to the actions a user takes toward other members of the community within a time period, the user's trust is increased or decreased, in order to isolate feedback submissions according to their credibility and punish fraud. The paper builds a fraud limitation mechanism that combines prevention beforehand, coordination during the event, and punishment afterwards. The mechanism makes online transactions safe.
Theoretical proof and experimental verification indicate that the following three problems can be solved effectively: 1) preventing speculative users from accumulating trust through small transactions and then cheating on a large one, which is otherwise difficult to prevent; 2) preventing members from cheating through false trading or impersonation; 3) reducing the arbitration workload of the online business platform.
Article
Nowadays, trust systems (TSs) are widely used for tackling dishonest entities in many modern environments. However, these systems are vulnerable to attacks in which attackers try to deceive the system using sequences of misleading behaviors and dishonest recommendations. A robust TS is expected to function properly even in the presence of such attacks. To the best of our knowledge, simulation has been the main approach for evaluating TSs so far, and there is no notable verification method for this purpose. In this paper, a method for quantitative verification of TSs' robustness against malicious attackers is proposed. The proposed method consists of a formalism, named the TS attack process, for specifying any given trust model, which is cast into the mathematical framework of partially observable Markov decision processes. The proposed method is capable of verifying TSs against both well-known attacks and the worst possible attack scenario. The method could also be used to help adjust the parameters of a given TS. Moreover, a quantitative robustness measure is introduced, which helps to compare the robustness of different TSs. To illustrate the applicability of the proposed method, a number of case studies for analysis and comparison of selected trust models (including Subjective Logic and REGRET) are presented.
Article
The average of customer ratings on a product, which we call a reputation, is one of the key factors in online purchasing decisions. There is, however, no guarantee of the trustworthiness of a reputation since it can be manipulated rather easily. In this paper, we define false reputation as the problem of a reputation being manipulated by unfair ratings and design a general framework that provides trustworthy reputations. For this purpose, we propose TRUE-REPUTATION, an algorithm that iteratively adjusts a reputation based on the confidence of customer ratings. We also show the effectiveness of TRUE-REPUTATION through extensive experiments in comparison with state-of-the-art approaches.
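The general idea of iteratively adjusting a reputation by rating confidence can be sketched as below. This is not the TRUE-REPUTATION algorithm, whose details the abstract does not give. The function name `iterative_reputation`, the confidence formula (one minus the normalized distance from the current reputation), and the 1 to 5 rating scale are all assumptions chosen for illustration.

```python
def iterative_reputation(ratings, min_rating=1.0, max_rating=5.0,
                         tol=1e-9, max_iter=100):
    """Confidence-weighted mean, recomputed until it stops changing.

    Hypothetical scheme: a rating's confidence shrinks linearly with its
    distance from the current reputation estimate.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    span = max_rating - min_rating
    if span <= 0:
        raise ValueError("max_rating must exceed min_rating")

    reputation = sum(ratings) / len(ratings)  # start from the plain average
    for _ in range(max_iter):
        # Confidence in [0, 1]: 1 when the rating equals the estimate
        confidences = [1.0 - abs(r - reputation) / span for r in ratings]
        total = sum(confidences)
        if total == 0:
            break  # no rating is trusted; keep the last estimate
        updated = sum(c * r for c, r in zip(confidences, ratings)) / total
        converged = abs(updated - reputation) < tol
        reputation = updated
        if converged:
            break
    return reputation


# Four consistent 5-star ratings and one outlying 1-star rating
plain_mean = sum([5, 5, 5, 5, 1]) / 5
adjusted = iterative_reputation([5, 5, 5, 5, 1])
```

In this toy scheme the outlying rating loses weight on each pass, so the adjusted value moves from the plain mean toward the majority. A scheme this simple can discount a minority rating almost entirely, which a practical system would have to guard against.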
Conference Paper
The average of the customer ratings on a product, which we call reputation, is one of the key factors in online purchasing decisions. There is, however, no guarantee of the trustworthiness of the reputation since it can be manipulated rather easily. In this paper, we define false reputation as the problem of the reputation being manipulated by unfair ratings, and design a general framework that provides a trustworthy reputation. For this purpose, we propose TRUEREPUTATION, an algorithm that iteratively adjusts the reputation based on the confidence of customer ratings.