PAL: A Position-bias Aware Learning Framework for CTR
Prediction in Live Recommender Systems
Huifeng Guo, Jinkai Yu, Qing Liu, Ruiming Tang*, Yuzhou Zhang
* Corresponding author.
Noah’s Ark Lab, Huawei, China.
{huifeng.guo,yujinkai,liuqing48,tangruiming,zhangyuzhou3}@huawei.com
ABSTRACT
Predicting Click-Through Rate (CTR) accurately is crucial in recommender systems. In general, a CTR model is trained on user feedback collected from traffic logs. However, position-bias exists in such feedback because a user may click on an item not only because she favors it but also because it is displayed in a good position. One common remedy is to model position as a feature in the training data, an approach widely used in industrial applications due to its simplicity. Specifically, a default position value has to be used to predict CTR in online inference, since the actual position information is not available at that time. However, using different default position values may result in completely different recommendation results. As a result, this approach leads to sub-optimal online performance. To address this problem, in this paper we propose a Position-bias Aware Learning framework (PAL) for CTR prediction in a live recommender system. It is able to model the position-bias in offline training and conduct online inference without position information. Extensive online experiments demonstrate that PAL outperforms the baselines by 3% - 35% in terms of CTR and CVR (ConVersion Rate) in a three-week AB test.
ACM Reference Format:
Huifeng Guo, Jinkai Yu, Qing Liu, Ruiming Tang*, Yuzhou Zhang. 2019.
PAL: A Position-bias Aware Learning Framework for CTR Prediction in
Live Recommender Systems. In Thirteenth ACM Conference on Recommender
Systems (RecSys ’19), September 16–20, 2019, Copenhagen, Denmark. ACM,
New York, NY, USA, 5 pages. https://doi.org/10.1145/3298689.3347033
1 INTRODUCTION
Click-Through Rate (CTR) prediction aims to estimate the probability that a user will click on a recommended item under a specific context. It plays a crucial role in recommender systems, especially in the App Store and online advertising industries [2, 5, 7-9, 12, 18].
To maximize revenue and user satisfaction, the recommended items are presented in descending order of scores computed by a function of the predicted CTR and bid [10, 19], i.e., f(CTR, bid), where "bid" is the benefit that the system receives if the item is clicked by a user. Therefore, the accuracy of the predicted CTR directly determines the revenue and user experience [16].
Figure 1: CTRs at different positions
Figure 2: Workflow of Recommendation
The recommender systems in a real production environment usually consist of two important procedures [2, 6, 19]: offline training and online inference, as shown in Figure 2. In offline training, a CTR prediction model is trained based on user-item interaction information (as well as user and item information) collected from the traffic logs of recommender systems [3]. In online inference, the trained model is deployed to the live recommender system to predict the CTR and then make recommendations.
One problem with these procedures is that the user-item interactions are affected by the positions at which the items are displayed. As discussed in [14], the CTR declines rapidly with the display position. Similarly, we also observe such exogenous phenomena in a mainstream App Store. As shown in Figure 1, both for the App Store overall (Figure 1(a)) and for a specific App (Figure 1(b); an app is an item of the App Store), we observe that the normalized CTR drops dramatically with position. These observations imply that a user may click on an item not only because she favors it but also because it is in a good position, so that the training data collected from traffic logs contains positional bias. We refer to it as position-bias throughout this paper. As an important factor in the CTR signal, the position-bias needs to be modeled in the CTR prediction model in offline training [10, 15].
Although various click models have been proposed to model the position-bias in the training data [1, 3], there is limited research on the realistic issue that the position information is unavailable in online inference. One practical approach is inverse propensity weighting [15], in which a user-defined transformation on the position information is applied and the transformed value is then fixed. However, as mentioned in [10], it is hard to design a good transformation manually for the position information, which leads to worse performance than an automatically-learned transformation. Therefore, the authors of [10] propose to model position as a feature in the training data, which is widely used in industrial applications due to its simplicity. Specifically, a default position value has to be used to predict CTR in online inference since the actual position information is not available at that time. Unfortunately, using different default position values may result in completely different recommendation results, which leads to sub-optimal online performance.
In this paper, we propose a Position-bias Aware Learning framework (PAL) to model the position-bias in offline training and to conduct online inference without position information. The idea of PAL is based on the assumption that the probability that an item is clicked by a user depends on two factors: a) the probability that the item is seen by the user, and b) the probability that the user clicks on the item, given that the item has been seen by the user. Each factor is modeled as a module in PAL, and the product of the outputs of these two modules is the probability that an item is clicked by a user. If the two modules are optimized separately, it may lead the overall system to a sub-optimal status, because of the inconsistency between the training objectives of the two modules, as claimed in [18]. To avoid this limitation and facilitate better CTR prediction performance, the two modules in PAL are optimized jointly and simultaneously. Once these two modules are well trained through offline training, the second module, i.e., the probability that the user clicks on the item given that the item has been seen by the user, is deployed to predict CTR in online inference.
We conduct online experiments and use case analysis to demonstrate the superiority of PAL over competitive methods. The results show that PAL improves CTR and CVR over the baseline methods by 3% - 35% across a three-week AB test.
2 METHOD
2.1 Notation
We assume the offline click dataset to be S = {(x_i, pos_i, y_i)}_{i=1}^{N}, where N is the total number of samples, x_i is the feature vector of sample i, which includes user profile, item features and context information, pos_i is the position information of sample i, and y_i is the user feedback (y_i = 1 if the user clicks on this item, y_i = 0 otherwise). We use x, pos and y to denote the feature vector, the position information and the label of a data sample in general.
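For concreteness, one record of S can be represented as a small data structure; the sketch below (field names are illustrative, not taken from the paper) simply mirrors the notation above.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Sample:
    """One record (x_i, pos_i, y_i) from the offline click dataset S."""
    x: Dict[str, float]  # feature vector: user profile, item and context features
    pos: int             # position at which the item was displayed
    y: int               # user feedback: 1 if the item was clicked, 0 otherwise

# e.g. Sample(x={"user_age": 25.0, "app_category": 3.0}, pos=4, y=0)
```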
2.2 Preliminary
There are two directions to model the position-bias in offline training, namely as a feature and as a module.
As a feature: This approach models the position information as a feature. In offline training, the input feature vector of the CTR model is the concatenation of x and pos, i.e., x̂ = [x, pos]. A CTR prediction model is trained based on the concatenated feature vector.
As the position information is modeled as a feature in offline training, a feature representing "position" should also be included in online inference (assigning "null" to the position feature usually results in unreliable inference results), as shown in the right part of Figure 3. However, position information is unavailable when online inference is performed. One way to resolve the problem of lacking position information for inference is to decide, for each position, the most suitable item, sequentially from the top-most position to the bottom-most position. This is a brute-force method with O(l·n·T) time complexity (where l is the length of the ranking list, n is the number of candidate items, and T is the latency of one inference), which is unacceptable in a low-latency online environment.
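A minimal sketch of this brute-force strategy, assuming a generic predict_ctr(x, pos) scorer (a hypothetical interface, not defined in the paper): the list is filled greedily, one position at a time, so the model is invoked on the order of l × n times per request.

```python
def greedy_fill(candidates, predict_ctr, list_length):
    """Pick, for each position from top to bottom, the remaining candidate
    with the highest CTR predicted *at that position* (O(l * n) inferences)."""
    remaining = list(candidates)
    ranking = []
    for pos in range(1, list_length + 1):
        # one model call per remaining candidate at this position
        best = max(remaining, key=lambda x: predict_ctr(x, pos))
        ranking.append(best)
        remaining.remove(best)
    return ranking
```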
To shorten the response latency, an alternative method with O(n·T) time complexity, as presented in [10], is to select one position for all the items as the value of the position feature, i.e., the position value in short. However, different position values may result in completely different recommendation results (significantly different recommendation results are observed in our experiments). So we have to find a proper position value to achieve good online performance. There are two ways to compare the performance of performing inference with different position values: online experiments and offline evaluation. The former is better but very expensive for a live recommender system. Therefore, we have to adopt offline evaluation to select a suitable position value. Moreover, no matter whether online experiments or offline evaluation are used to select the position value, the choice does not generalize well, as the position value for online inference in one application scenario may not be suitable for another scenario.
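For contrast, the cheaper O(n·T) alternative of [10] can be sketched with the same hypothetical predict_ctr(x, pos) scorer: every candidate is scored once under a single default position value, so only n inferences are needed per request, but the chosen default value changes the resulting ranking.

```python
def rank_with_default_position(candidates, predict_ctr, default_pos):
    """Score all candidates with one fixed position value (O(n) inferences)."""
    scored = [(predict_ctr(x, default_pos), x) for x in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored]

# Different default_pos values (e.g., 1, 5 or 9) can yield very different rankings.
```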
As a module: To address the above limitations of taking position information as a feature, we propose a novel framework that takes position information as a module, so that we can model the position-bias in offline training and conduct online inference without position information. We present our framework in the next section.
2.3 Our Framework
Our framework is motivated based on the assumption that an item
is clicked by a user only if it has been seen by her. More specically,
we consider the probability that an item is clicked by a user to be
dependent on two factors: a) the probability that the item is seen
by a user; b) the probability that the user clicks on the item, given
that the item has been seen by the user [14] as shown in Eq. (1).
p(y = 1 | x, pos) = p(seen | x, pos) · p(y = 1 | x, pos, seen).   (1)
Eq. (1) is simplified to Eq. (2) if we further assume: a) the probability that an item has been seen only relates to the probability that the associated position has been observed; b) the probability that an item is clicked is independent of its position if it has been seen.
p(y = 1 | x, pos) = p(seen | pos) · p(y = 1 | x, seen).   (2)
Figure 3: Framework of PAL vs. BASE.
As shown in the left part of Figure 3, our proposed framework PAL is designed based on Eq. (2) and consists of two modules. The first module models the probability p(seen | pos), which we denote as "ProbSeen" in Figure 3, and takes the position information pos as its input. The second module models the probability p(y = 1 | x, seen), which we denote as "pCTR" in Figure 3, meaning the predicted CTR by the model; its input is the feature vector x in the training data. Any CTR prediction models, such as linear models and deep learning models [5, 13, 16], can be applied for these two modules. Then, the learned CTR, denoted as "bCTR" in Figure 3, which accounts for the position-bias in offline training, is the multiplication of the outputs of these two modules. As mentioned in [18], if the two modules are optimized separately, the inconsistency between the different training objectives leads the overall system to a sub-optimal status. To avoid such sub-optimal performance, we optimize these two modules in our framework jointly and simultaneously. To be more specific, the loss function of PAL is defined as:
L(θ_ps, θ_pCTR) = (1/N) Σ_{i=1}^{N} l(y_i, bCTR_i) = (1/N) Σ_{i=1}^{N} l(y_i, ProbSeen_i × pCTR_i),   (3)
where θ_ps and θ_pCTR are the parameters of the ProbSeen and pCTR modules, and l(·) is the cross-entropy loss function. The pCTR module, which is used in the online inference procedure, is not directly optimized. In fact, when the logloss between the label and the predicted bCTR is minimized, the parameters of the ProbSeen and pCTR modules are optimized as in Eq. (4) and Eq. (5) (η is the learning rate) by stochastic gradient descent (SGD), such that the influence of position-bias and user preference is learned implicitly by the respective modules:
θ_ps = θ_ps − η · (1/N) Σ_{i=1}^{N} (bCTR_i − y_i) · pCTR_i · ∂ProbSeen_i/∂θ_ps,   (4)
θ_pCTR = θ_pCTR − η · (1/N) Σ_{i=1}^{N} (bCTR_i − y_i) · ProbSeen_i · ∂pCTR_i/∂θ_pCTR.   (5)
In the offline training procedure, similar to [5, 13, 16] and other related works, an early stopping strategy is used to obtain a well-trained model. Once PAL is well trained, the pCTR module is deployed online for CTR inference. As can be easily observed, position is not required by the pCTR module in PAL, so we do not need to assign position values to the items in online inference as the "as a feature" approach does.
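The following is a minimal PyTorch-style sketch of the framework described above, under simplifying assumptions (the module architectures here are placeholders; the paper uses DeepFM [5] as the pCTR model in its experiments, and positions are treated as 0-indexed ids). ProbSeen depends only on the position, pCTR only on the feature vector, the two are trained jointly through the logloss on their product as in Eq. (3), and only the pCTR branch is used online.

```python
import torch
import torch.nn as nn

class PAL(nn.Module):
    def __init__(self, num_positions, feature_dim, hidden_dim=64):
        super().__init__()
        # ProbSeen module: p(seen | pos), a function of the position only
        self.prob_seen = nn.Sequential(
            nn.Embedding(num_positions, hidden_dim),  # pos as 0-indexed long ids
            nn.Linear(hidden_dim, 1),
        )
        # pCTR module: p(y = 1 | x, seen); any CTR model over the features x
        self.pctr = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x, pos):
        prob_seen = torch.sigmoid(self.prob_seen(pos)).squeeze(-1)
        pctr = torch.sigmoid(self.pctr(x)).squeeze(-1)
        return prob_seen * pctr, pctr   # bCTR for training, pCTR for online use

model = PAL(num_positions=10, feature_dim=128)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
bce = nn.BCELoss()

def train_step(x, pos, y):
    """One SGD step; y is a float tensor of 0/1 labels."""
    bctr, _ = model(x, pos)
    loss = bce(bctr, y)          # logloss between label and bCTR, Eq. (3)
    optimizer.zero_grad()
    loss.backward()              # gradients flow into both modules jointly
    optimizer.step()
    return loss.item()

def infer_online(x):
    # position is unavailable online; only the pCTR module is deployed
    with torch.no_grad():
        return torch.sigmoid(model.pctr(x)).squeeze(-1)
```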
3 ONLINE EXPERIMENTS
We design online experiments in a live recommender system to verify the performance of PAL. Specifically, we conduct a three-week AB test to validate the superiority of PAL over the "as a feature" baseline methods. The AB test is conducted in the game recommendation scenario in the game center of Company X's App Store.
3.1 Datasets
In the production environment of Company X's App Store, we sample around 1 billion records from traffic logs as our offline training dataset. To keep the model up to date, the training dataset is refreshed in a sliding time-window style. The features for training consist of app features (e.g., app identification, category, etc.), user features (e.g., downloaded apps, click history, etc.), and context features (e.g., operation time, etc.).
3.2 Baseline
The baseline framework refers to the "as a feature" strategy in Section 2.2. In fact, this baseline is the method adopted in [10]. As stated above, we need to select a proper position value for online inference. However, due to resource limits, it is impossible to evaluate the baseline framework with all possible positions. Therefore, we conduct an offline experiment to select proper positions.
Settings. To select proper position(s), we collect two datasets from two business scenarios in Company X's App Store. The test dataset is collected from the next day's traffic logs. We apply different position values in the test dataset, ranging from position 1 to position 10. Similar to the related works [5, 11, 13, 16], AUC and LogLoss are adopted as the metrics to evaluate the offline performance of the cases with different assigned position values.
Results and analysis. The results of the offline experiments are presented in Figure 5, where BASE_pk is the baseline framework with position value k assigned to all the items in the test data, and PAL is our proposed framework, for which the test data is collected without position values. From Figure 5, we can see that the AUC and LogLoss values vary as we assign different position values to the test data. In addition, BASE_p9 achieves the highest AUC, BASE_p5 achieves the lowest LogLoss, and BASE_p1 achieves the worst performance in both AUC and LogLoss. We select the two best ones (namely, BASE_p5 and BASE_p9) and the worst one (namely, BASE_p1) as baselines to conduct the online AB test with PAL. It is worth noting that PAL does not achieve the best performance in this offline experiment in terms of either AUC or LogLoss.
3.3 AB test
Settings. For the control group, 2% of users are randomly selected and presented with recommendations generated by the baseline framework. For the experimental group, 2% of users are presented with recommendations generated by PAL. The model used in both the baseline and PAL is DeepFM [5] with the same network structure and the same set of features. Due to resource limits, we do not deploy the three baselines (namely, BASE_p1, BASE_p5 and BASE_p9) online in the same period of time. Instead, they are deployed one by one, each for one week, to compare with PAL. More specifically, we compare PAL vs. BASE_p1, PAL vs. BASE_p5 and PAL vs. BASE_p9 in the first, second, and third week, respectively.
Metrics. We adopt two metrics to compare the online performance of PAL and the baselines, namely the realistic Click-Through Rate, rCTR = #downloads / #impressions, and the realistic ConVersion Rate, rCVR = #downloads / #users, where #downloads, #impressions and #users are the number of downloads, impressions and visiting users in the day of the AB test, respectively. Different from the predicted CTR, i.e., "pCTR" in Figure 3, "rCTR" is the realistic CTR we observe online.
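Both metrics are plain ratios over one day of traffic; the small helper below merely restates them (the function and argument names are illustrative).

```python
def realistic_metrics(downloads, impressions, users):
    """Daily online metrics observed during the AB test."""
    rctr = downloads / impressions   # realistic Click-Through Rate
    rcvr = downloads / users         # realistic ConVersion Rate
    return rctr, rcvr
```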
Figure 4: Results of Online AB Test.
Figure 5: Offline Experimental Results
Results. Figure 4 presents the results of the online experiments. The blue and red histograms show the rCTR and rCVR improvements of PAL over the baselines. Firstly, both rCTR and rCVR are consistently improved across the entire three-week AB test, which validates the superiority of PAL over the baselines. Secondly, we also observe that the average improvement of both rCTR and rCVR (the dashed lines in Figure 4) is the highest in the first week (where the baseline is BASE_p1) and the lowest in the second week (where the baseline is BASE_p5). This phenomenon tells us that the performance of the baseline varies significantly with the assigned position value, because the recommendations may be completely different when different position values are assigned.
3.4 Online Analysis
To fully understand the results of the AB test and verify whether our proposed framework eliminates position-bias in online inference, we analyze the recommendations generated by PAL and the baselines.
The first experiment compares the ranking distance to the ground truth ranking. We define the ground truth ranking as the list of items ranked in descending order of the f(rCTR, bid) value. Spearman's Footrule [4], which is widely used to measure the distance between two rankings, is adopted to measure the displacement of the items in two rankings. We define the distance between the ground truth ranking and a ranking σ_M generated by either PAL or the baselines at top-L as:
D(σ_M, L) = (1/|U|) Σ_{u∈U} Σ_{i=1}^{L} |i − σ_{M,u}(i)|,   (6)
where u is a user in the user group U of size |U|, and σ_{M,u} is the recommendation list presented to user u by model M. The i-th item in the ground truth ranking is at position σ_{M,u}(i) in the recommendation σ_{M,u}.
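A small sketch of Eq. (6), assuming each ranking is given as a per-user list of item ids (function and variable names are hypothetical):

```python
def footrule_distance(ground_truth_rankings, model_rankings, L):
    """Average top-L Spearman's Footrule displacement over a user group (Eq. (6)).

    Both arguments map each user u to an ordered list of item ids;
    positions are 1-based, following the paper's convention.
    """
    total = 0.0
    for u in ground_truth_rankings:
        # position of every item in the ranking produced by model M for user u
        pos_in_model = {item: rank
                        for rank, item in enumerate(model_rankings[u], start=1)}
        for i, item in enumerate(ground_truth_rankings[u][:L], start=1):
            # items absent from the model ranking are placed just past its end
            # (an assumption; the paper does not state how this case is handled)
            total += abs(i - pos_in_model.get(item, len(model_rankings[u]) + 1))
    return total / len(ground_truth_rankings)
```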
We compare D(σ_M, L) for M ∈ {PAL, BASE_p1, BASE_p5, BASE_p9} and L ∈ [1, 20], as displayed in Figure 6(a), where the solid red line is the result of PAL and the other lines are the results of the baselines. We can see that PAL yields the shortest distance to the ground truth ranking, which means that the recommendation generated by PAL is the most similar to the real ranking we observe online. This is achieved by PAL modeling position-bias wisely in offline training and eliminating position-bias in online inference, which explains why PAL outperforms the baselines consistently and significantly.
Figure 6: Online Analysis. (a) Ranking Distance; (b) Results of Personalization.
The second experiment compares the personalization of PAL and the baselines. Personalization@L [6, 17] measures the inter-user diversity, an important property of the recommendation results, of the top-L items in a ranking across different users. Personalization@L is defined in Eq. (7):
Personalization@L = 1 − (1/(|U| × (|U| − 1))) Σ_{a∈U} Σ_{b∈U, b≠a} q_ab(L) / L,   (7)
where |U| is the size of the user group U, and q_ab(L) is the number of common items in the top-L lists of both user a and user b. A higher Personalization@L means more diverse recommended items in the top-L positions across different users.
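Eq. (7) can be sketched in the same style, again assuming per-user lists of item ids:

```python
from itertools import permutations

def personalization_at_L(rankings, L):
    """Inter-user diversity of the top-L recommendations (Eq. (7)).

    `rankings` maps each user to an ordered list of recommended item ids.
    """
    users = list(rankings)
    overlap = 0.0
    for a, b in permutations(users, 2):   # all ordered pairs with a != b
        q_ab = len(set(rankings[a][:L]) & set(rankings[b][:L]))
        overlap += q_ab / L
    n = len(users)
    return 1.0 - overlap / (n * (n - 1))
```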
We compute Personalization@L for PAL and the baselines (namely, BASE_p1, BASE_p5, BASE_p9), respectively. Figure 6(b) presents the personalization at top-5 (L = 5), top-10 (L = 10) and top-20 (L = 20) of the recommendations by the different frameworks. We can see that PAL yields the highest level of personalization, which demonstrates that the top items in the recommendations generated by PAL are more diverse than those generated by the baselines, because PAL is able to better capture the specific interests of different users and recommend items according to users' personal interests after eliminating the position-bias effect.
4 CONCLUSION
In this paper, we propose a framework, PAL, that can model the position-bias in the training data in offline training and predict CTR without position information in online inference. Compared to the baselines, PAL yields better results in a three-week online AB test. Extensive online experimental results verify the effectiveness of our proposed framework.
REFERENCES
[1] Ye Chen and Tak W. Yan. 2012. Position-normalized click prediction in search advertising. In KDD. ACM, 795–803.
[2] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, et al. 2016. Wide & Deep Learning for Recommender Systems. In DLRS@RecSys. ACM, 7–10.
[3] Nick Craswell, Onno Zoeter, Michael J. Taylor, and Bill Ramsey. 2008. An experimental comparison of click position-bias models. In WSDM. 87–94.
[4] Persi Diaconis and Ronald L. Graham. 1977. Spearman's footrule as a measure of disarray. Journal of the Royal Statistical Society. Series B (Methodological) (1977), 262–268.
[5] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In IJCAI. 1725–1731.
[6] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He, and Zhenhua Dong. 2018. DeepFM: An End-to-End Wide & Deep Learning Framework for CTR Prediction. CoRR (2018).
[7] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, and Stuart Bowers. 2014. Practical Lessons from Predicting Clicks on Ads at Facebook. In Eighth International Workshop on Data Mining for Online Advertising. 1–9.
[8] Kuang-Chih Lee, Burkay Orten, Ali Dasdan, and Wentong Li. 2012. Estimating conversion rate in display advertising from past performance data. In SIGKDD. 768–776.
[9] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. arXiv preprint arXiv:1803.05170 (2018).
[10] Xiaoliang Ling, Weiwei Deng, Chen Gu, Hucheng Zhou, Cui Li, and Feng Sun. 2017. Model Ensemble for Click Prediction in Bing Search Ads. In Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, April 3-7, 2017. ACM, 689–698.
[11] Bin Liu, Ruiming Tang, Yingzhi Chen, Jinkai Yu, Huifeng Guo, and Yuzhou Zhang. 2019. Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction. In The World Wide Web Conference, San Francisco, CA, USA, May 13-17. ACM, 1119–1129.
[12] H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. 2013. Ad click prediction: a view from the trenches. In ACM SIGKDD. https://doi.org/10.1145/2487575.2488200
[13] Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng Guo, Yong Yu, and Xiuqiang He. 2019. Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data. ACM Trans. Inf. Syst. 37, 1 (2019), 5:1–5:35.
[14] Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting Clicks: Estimating the Click-through Rate for New Ads. In WWW. 521–530.
[15] Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. 2016. Learning to Rank with Selection Bias in Personal Search. In SIGIR. ACM, 115–124.
[16] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In SIGKDD. ACM, 1059–1068.
[17] Tao Zhou, Zoltán Kuscsik, Jian-Guo Liu, Matúš Medo, Joseph Rushton Wakeling, and Yi-Cheng Zhang. 2010. Solving the apparent diversity-accuracy dilemma of recommender systems. PNAS 107, 10 (2010), 4511–4515.
[18] Han Zhu, Daqing Chang, Ziru Xu, Pengye Zhang, Xiang Li, Jie He, Han Li, Jian Xu, and Kun Gai. 2019. Joint Optimization of Tree-based Index and Deep Model for Recommender Systems. CoRR abs/1902.07565 (2019).
[19] Han Zhu, Junqi Jin, Chang Tan, Fei Pan, Yifan Zeng, Han Li, and Kun Gai. 2017. Optimized Cost per Click in Taobao Display Advertising. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13-17, 2017. ACM, 2191–2200.