Deep Multi-Interest Network for Click-through Rate Prediction
Zhibo Xiao, Luwei Yang, Wen Jiang, Yi Wei, Yi Hu, Hao Wang
Alibaba Group, Hangzhou, China
{xiaozhibo.xzb,luwei.ylw,wen.jiangw}@alibaba-inc.com
ABSTRACT
Click-through rate prediction plays an important role in many fields, such as recommender and advertising systems. It is one of the crucial components for improving user experience and increasing industry revenue. Recently, several deep learning-based models have been successfully applied to this area. Some existing studies further model user representation based on the user's historical behavior sequence, in order to capture dynamic and evolving interests. We observe that users usually have multiple interests at a time, and the latent dominant interest is expressed by behavior. The switch of the latent dominant interest results in behavior changes. Thus, modeling and tracking latent multiple interests would be beneficial. In this paper, we propose a novel method named Deep Multi-Interest Network (DMIN) which models a user's latent multiple interests for the click-through rate prediction task. Specifically, we design a Behavior Refiner Layer using multi-head self-attention to capture better representations of the user's historical items. Then the Multi-Interest Extractor Layer is applied to extract multiple user interests. We evaluate our method on three real-world datasets. Experimental results show that the proposed DMIN outperforms various state-of-the-art baselines on the click-through rate prediction task.
CCS CONCEPTS
• Information systems → Recommender systems; Learning to rank; Personalization.
KEYWORDS
Recommender System; Click-through Rate Prediction; Multi-Interest
ACM Reference Format:
Zhibo Xiao, Luwei Yang, Wen Jiang, Yi Wei, Yi Hu, Hao Wang. 2020. Deep
Multi-Interest Network for Click-through Rate Prediction. In Proceedings
of the 29th ACM International Conference on Information and Knowledge
Management (CIKM ’20), October 19–23, 2020, Virtual Event, Ireland. ACM,
New York, NY, USA, 4 pages. https://doi.org/10.1145/3340531.3412092
1 INTRODUCTION
Click-through rate (CTR) prediction is the task of estimating the likelihood that an item will be clicked by the user. It plays an important role in many fields, such as recommender and advertising systems. For example, there are two main parts in an e-commerce recommender system [4], i.e., matching and ranking. Matching is able
to retrieve several thousands of candidate items from a huge item set with hundreds of millions of items. Ranking is responsible for scoring these candidates in terms of the click-through rate prediction task; sometimes a conversion rate task can be used here. Finally, the scored items are presented to the end user in descending order. Note that additional reranking can be applied in accordance with some rule-based models.
In this paper, we mainly focus on ranking, and treat it as a click-through rate prediction task. Recently, many deep learning-based methods have been proposed that successfully improve the performance of the click-through rate prediction task [2, 5, 7, 10, 11, 13, 14].
Observing that a user's interest is expressed by his/her behavior sequences, several models [4, 9, 16, 17] have focused on extracting user interest from the behavior sequences. In DIN [17], an attention-based method is utilized to capture relative interests from the user behavior sequence with regard to the candidate item. However, it ignores the ordering of the user behavior sequence. DIEN [16] further uses a specially designed GRU structure to capture the evolution of user interest. DSIN [4] introduces a hierarchical view of the behavior sequence by dividing it into sessions. Then it models each session with a self-attention network to capture multiple user interests.
Nonetheless, we observe that users have multiple diverse interests at a time and the latent dominant interest is expressed by behavior. The switch of the latent dominant interest results in behavior changes. Motivated by the above observations, we propose a Deep Multi-Interest Network (DMIN) to improve the CTR task by modeling latent multiple user interests from the user behavior sequence. There are two main components: the Behavior Refiner Layer and the Multi-Interest Extractor Layer. The Behavior Refiner Layer is applied to refine the representation of each item in the user behavior sequence. The Multi-Interest Extractor Layer is able to capture latent multiple interests.
The main contributions of this paper are as follows:
• We highlight the multi-interest phenomenon in the e-commerce field, and focus on modeling latent multiple user interests.
• We propose a novel model with two main components, the Behavior Refiner Layer and the Multi-Interest Extractor Layer.
• We evaluate our proposed method on three real-world datasets in terms of click-through rate prediction. The results show the efficacy of our proposed DMIN.
2 THE PROPOSED METHOD
In this section, we introduce our proposed approach, the Deep Multi-Interest Network (DMIN). The overall architecture is illustrated in Figure 1. DMIN follows the basic Embedding & Multilayer Perceptron (MLP) paradigm [17]. The two main components of DMIN, the Behavior Refiner Layer and the Multi-Interest Extractor Layer, are placed in the middle to better model and extract user interests. DMIN takes as input the user's historical behaviors, user profile features, context features and the target item. It firstly embeds these input features as low-dimensional vectors by an embedding layer.
[Figure 1: The architecture of DMIN. The figure shows the Embedding & MLP backbone with the Behavior Refiner Layer (Multi-head Self-Attention 1 with a negative-sampling auxiliary loss) and the Multi-Interest Extractor Layer (Multi-head Self-Attention 2 with per-head sum pooling and attention units); the right panel details the attention unit, whose inputs from a head, the target item and the position embedding pass through a cross product, concatenation, Dice(36) and Linear(1) layers.]
Then the Behavior Refiner Layer refines each item representation in the user's historical behavior into $\mathbf{z}_t$, with the help of an auxiliary loss and multi-head self-attention. In the Multi-Interest Extractor Layer, another multi-head self-attention layer and a local attention unit are introduced to extract multiple interests. The multiple interests and the embedding vectors of the remaining features are concatenated, and fed into an MLP for the final CTR prediction.
2.1 Embedding Layer
There are four groups of features: User Profile, User Historical Behavior, Context and Target Item. User Profile contains features related to the user, e.g. user id, country and so on. Target Item refers to the candidate item with corresponding features such as item id, category id, statistical ctr and so on. User Historical Behavior is a list of items the user has interacted with by clicking, purchasing or adding to cart. Each item in this list has the same feature fields as Target Item. Context is a group of features including but not limited to time, match type, trigger id and so on.

Each feature can be encoded into a high-dimensional one-hot vector. These features are usually very sparse and should be transformed into low-dimensional dense features by an embedding layer, which is a common operation in deep learning-based ranking methods [2]. For example, the item id can be represented by a matrix $\mathbf{E} \in \mathbb{R}^{K \times d_v}$, where $K$ is the total number of items and $d_v$ is the embedding size with $d_v \ll K$. With the embedding layer, User Profile, User Historical Behavior, Context and Target Item can be represented as $\mathbf{x}_u$, $\mathbf{x}_b$, $\mathbf{x}_c$ and $\mathbf{x}_t$, respectively. In particular, User Historical Behavior contains multiple items and can be represented by $\mathbf{x}_b = \{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_T\} \in \mathbb{R}^{T \times d_{model}}$, where $T$ is the number of the user's historical behaviors and $d_{model}$ is the dimension of the item embedding $\mathbf{e}_t$. Note that items in User Historical Behavior and Target Item share the same embedding matrices. $\mathbf{p}_t \in \mathbb{R}^{d_p}$ is the position encoding for the $t$-th item and $d_p$ is the dimension of $\mathbf{p}_t$.
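For concreteness, the following is a minimal PyTorch sketch of such an embedding layer. It is not the released implementation; the two-field composition of $d_{model}$ (item id plus category id, with $d_v = 18$ so that $d_{model} = 36$) and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BehaviorEmbedding(nn.Module):
    """Maps sparse id features to dense vectors; a sketch, not the authors' code.

    Items in User Historical Behavior and Target Item share the same
    embedding tables, as described in Section 2.1.
    """
    def __init__(self, n_items, n_cats, d_v=18, d_p=2, max_len=80):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d_v)   # E in R^{K x d_v}
        self.cat_emb = nn.Embedding(n_cats, d_v)     # category id embedding
        self.pos_emb = nn.Embedding(max_len, d_p)    # position encodings p_t

    def forward(self, item_ids, cat_ids):
        # item_ids, cat_ids: (batch, T) for behaviors or (batch,) for the target.
        # Concatenating the per-field embeddings yields e_t with
        # d_model = 2 * d_v (36 when d_v = 18, an assumed field layout).
        return torch.cat([self.item_emb(item_ids), self.cat_emb(cat_ids)], dim=-1)
```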
2.2 Behavior Refiner Layer
In this part, we describe how to refine the item representations from the user behavior sequence using multi-head self-attention [3, 12]. One naive method to obtain a simple item representation is to just concatenate each item's feature embeddings. However, in practice, we found that good item representations greatly benefit the downstream task. Thus we employ multi-head self-attention to refine the item representations. Its capability of keeping contextual sequential information and capturing relationships between elements in the sequence makes multi-head self-attention stand out for this task.

Self-attention is a special attention mechanism which has been successfully applied to a variety of tasks [3, 12, 15]. The input of the self-attention module consists of query, key and value, and these three components come from the same place. Multi-head self-attention is a combination of multiple self-attention structures, which can learn relationships in different representation subspaces [12]. To be specific, the output of the $h$-th head is calculated as follows,

$$\mathrm{head}_h = \mathrm{Attention}(\mathbf{x}_b \mathbf{W}_h^Q, \mathbf{x}_b \mathbf{W}_h^K, \mathbf{x}_b \mathbf{W}_h^V) = \mathrm{Softmax}\left(\frac{\mathbf{x}_b \mathbf{W}_h^Q \cdot (\mathbf{x}_b \mathbf{W}_h^K)^\top}{\sqrt{d}}\right) \cdot \mathbf{x}_b \mathbf{W}_h^V, \tag{1}$$
where $\mathbf{W}_h^Q, \mathbf{W}_h^K, \mathbf{W}_h^V \in \mathbb{R}^{d_{model} \times d}$ are the projection matrices of the $h$-th head for query, key and value, respectively. Thus each $\mathrm{head}_h$ represents a latent item representation in a subspace.

Then the vectors of different heads are concatenated to form the refined item representations, which can be defined as follows,

$$\mathbf{Z} = \mathrm{MultiHead}(\mathbf{x}_b) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_{H_R})\,\mathbf{W}^O, \tag{2}$$

where $H_R$ is the number of heads and $\mathbf{W}^O \in \mathbb{R}^{d_{model} \times d_{model}}$ is a linear projection matrix. Moreover, we also apply residual connections [6], dropout [8] and layer normalization [1] in the self-attention.
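A minimal PyTorch sketch of this layer is shown below. Using `nn.MultiheadAttention` (which packs the per-head $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V$ and the output projection $\mathbf{W}^O$) and the exact ordering of the residual, dropout and normalization sub-layers are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class BehaviorRefinerLayer(nn.Module):
    """Refines item representations with multi-head self-attention
    (Eqs. 1-2), plus residual connection, dropout and layer
    normalization. A sketch under the stated assumptions."""
    def __init__(self, d_model=36, n_heads=4, dropout=0.2):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads,
                                         dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x_b):
        # x_b: (batch, T, d_model) embedded behavior sequence.
        # Query, key and value all come from the same place (self-attention).
        z, _ = self.mha(x_b, x_b, x_b)
        return self.norm(x_b + self.dropout(z))  # residual + LayerNorm
```

The defaults `d_model=36` and `n_heads=4` mirror the paper's settings ($d_{model} = 36$, $H_R = 4$).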
Inspired by [16], an auxiliary loss is used to supervise the learning of better item representations. It uses behavior $\mathbf{e}_{t+1}$, the original item embedding of the $(t+1)$-th behavior, to supervise the learnt item representation at time $t$, $\mathbf{z}_t$, which is the $t$-th row vector of $\mathbf{Z}$. The positive example is the real next behavior and the negative example is sampled from the whole item set excluding the clicked items. Mathematically, the auxiliary loss is formulated as,
$$L_{aux} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t}\left(\log \sigma(\langle \mathbf{z}_t^i, \mathbf{e}_{t+1}^i \rangle) + \log\left(1 - \sigma(\langle \mathbf{z}_t^i, \hat{\mathbf{e}}_{t+1}^i \rangle)\right)\right), \tag{3}$$
where $\sigma(\cdot)$ is the sigmoid activation function and $\langle \cdot, \cdot \rangle$ denotes the inner product. $\hat{\mathbf{e}}_{t+1}^i$ is the original embedding of the negative example, and $N$ represents the number of training examples.
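Equation (3) translates directly into code. The sketch below assumes the refined sequence and the positive/negative next-item embeddings have already been aligned into tensors of the same shape; it uses the identity that binary cross-entropy with a target of 1 equals $-\log \sigma(x)$ and with a target of 0 equals $-\log(1 - \sigma(x))$.

```python
import torch
import torch.nn.functional as F

def auxiliary_loss(z, pos_emb, neg_emb):
    """Auxiliary loss of Eq. (3); a sketch with assumed tensor layouts.

    z:       (batch, T-1, d) refined representations z_t for t = 1..T-1
    pos_emb: (batch, T-1, d) original embeddings of the real next items e_{t+1}
    neg_emb: (batch, T-1, d) embeddings of the sampled negatives e_hat_{t+1}
    """
    pos_logit = (z * pos_emb).sum(-1)  # inner product <z_t, e_{t+1}>
    neg_logit = (z * neg_emb).sum(-1)  # inner product <z_t, e_hat_{t+1}>
    # -log sigma(pos) - log(1 - sigma(neg)), averaged over examples
    return (F.binary_cross_entropy_with_logits(pos_logit, torch.ones_like(pos_logit))
            + F.binary_cross_entropy_with_logits(neg_logit, torch.zeros_like(neg_logit)))
```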
2.3 Multi-Interest Extractor Layer
After obtaining the refined item representations, we need to extract multiple interests from this refined sequence. We use another multi-head self-attention, in the same form as in the previous part, to capture multiple interests,

$$\mathrm{head}_h' = \mathrm{Attention}(\mathbf{Z} \mathbf{W}_h'^Q, \mathbf{Z} \mathbf{W}_h'^K, \mathbf{Z} \mathbf{W}_h'^V) = \mathrm{Softmax}\left(\frac{\mathbf{Z} \mathbf{W}_h'^Q \cdot (\mathbf{Z} \mathbf{W}_h'^K)^\top}{\sqrt{d_{model}}}\right) \cdot \mathbf{Z} \mathbf{W}_h'^V, \tag{4}$$
where, similarly, $\mathbf{W}_h'^Q, \mathbf{W}_h'^K, \mathbf{W}_h'^V \in \mathbb{R}^{d_{model} \times d_{model}}$ are the projection matrices of the $h$-th head for query, key and value, respectively. We pack all output head vectors $\{\mathrm{head}_1', \mathrm{head}_2', \ldots, \mathrm{head}_{H_E}'\}$ as a tensor $\mathbf{I} \in \mathbb{R}^{T \times H_E \times d_{model}}$. $H_E$ is the number of heads, which is also equivalent to the number of user interests.
Inspired by [17], we use an attention unit to capture the relevance of each output head with respect to the target item, as shown on the right of Figure 1. Besides, we add a position embedding to incorporate position information. Thus, the $h$-th interest of the user can be formulated as,

$$\mathrm{interest}_h = \sum_{j=1}^{T} a(\mathbf{I}_{jh}, \mathbf{x}_t, \mathbf{p}_j)\,\mathbf{I}_{jh} = \sum_{j=1}^{T} w_{jh}\,\mathbf{I}_{jh}, \tag{5}$$
where $\mathbf{I}_{jh} \in \mathbb{R}^{d_{model}}$ represents the $h$-th head's vector of the $j$-th item, and $\mathbf{p}_j \in \mathbb{R}^{d_p}$ is the position encoding for the $j$-th item. Note that the position of each item in the user behavior sequence follows the reverse order of the timestamps at which the behaviors occurred, i.e. more recently occurring behavior items are put at higher ranks. $a$ denotes the attention unit shown on the right of Figure 1, which tells how relevant $\mathrm{interest}_h$ is to the target item.
The attention unit takes the output of the second multi-head self-attention $\mathbf{I}$, the target item embedding $\mathbf{x}_t$ and the position embedding $\mathbf{p}_j$ as input. The position embedding and the target item vector are concatenated and then transformed by a linear layer into a vector with the same size as the input vector from each head. Then we apply a cross product between this transformed vector and the input from the head. The result, together with the original values, is fed into a fully-connected network. The final output is a scalar which indicates to what extent the interest is relevant to the target item.
Therefore, multiple interests of the user can be obtained. The final output of User Historical Behavior is now represented as $\hat{\mathbf{x}}_b = \{\mathrm{interest}_1, \mathrm{interest}_2, \ldots, \mathrm{interest}_{H_E}\} \in \mathbb{R}^{H_E \times d_{model}}$. Note that the number of user interests directly depends on the number of heads, $H_E$.
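A PyTorch sketch of the attention unit and the per-head interest pooling of Eq. (5) follows. It is our reading of Figure 1, not the released code: the "cross product" is implemented as an element-wise product, the paper's Dice activation is replaced by PReLU for brevity, and all layer sizes beyond $d_{model} = 36$ and $d_p = 2$ are assumptions.

```python
import torch
import torch.nn as nn

class AttentionUnit(nn.Module):
    """Local attention unit (right of Figure 1); a sketch with PReLU
    standing in for the paper's Dice activation."""
    def __init__(self, d_model=36, d_p=2):
        super().__init__()
        # Align concat(target item, position) with the head dimension.
        self.proj = nn.Linear(d_model + d_p, d_model)
        self.mlp = nn.Sequential(
            nn.Linear(3 * d_model, 36), nn.PReLU(), nn.Linear(36, 1))

    def forward(self, head_j, x_t, p_j):
        # head_j: (batch, T, d_model); x_t: (batch, d_model); p_j: (batch, T, d_p)
        q = self.proj(torch.cat(
            [x_t.unsqueeze(1).expand(-1, head_j.size(1), -1), p_j], dim=-1))
        inter = head_j * q  # element-wise "cross product" with the head input
        w = self.mlp(torch.cat([head_j, q, inter], dim=-1))
        return w.squeeze(-1)  # (batch, T) relevance weights w_{jh}

def extract_interest(head_j, x_t, p_j, attn_unit):
    """interest_h = sum_j w_{jh} I_{jh} (Eq. 5); a sketch."""
    w = attn_unit(head_j, x_t, p_j)               # (batch, T)
    return (w.unsqueeze(-1) * head_j).sum(dim=1)  # (batch, d_model)
```

Running `extract_interest` once per head $h = 1, \ldots, H_E$ and stacking the results yields $\hat{\mathbf{x}}_b$.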
2.4 MLP & Loss Function
All the feature vectors, $\mathbf{x}_u$, $\hat{\mathbf{x}}_b$, $\mathbf{x}_c$ and $\mathbf{x}_t$, are concatenated and fed into the MLP layers for the final prediction. Since click-through rate prediction is a binary classification task, the loss function is chosen as the cross-entropy loss, which is defined as:

$$L_{target} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log f(\mathbf{x}_i) + (1 - y_i)\log(1 - f(\mathbf{x}_i))\right), \tag{6}$$
where $\mathbf{x}_i = (\mathbf{x}_u, \mathbf{x}_b, \mathbf{x}_c, \mathbf{x}_t) \in \mathcal{D}$, $\mathcal{D}$ is the training set with size $N$, $y_i \in \{0, 1\}$ is the click label, and $f(\mathbf{x}_i)$ is the prediction output of our network. As we use the auxiliary loss to supervise item representation refining, the final loss is defined as,

$$L_{total} = L_{target} + \lambda L_{aux}, \tag{7}$$

where $\lambda$ is a hyper-parameter to balance the two sub-tasks.
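Combining Eqs. (6) and (7) in code is straightforward; the sketch below assumes the MLP emits a raw logit (the sigmoid is folded into the loss for numerical stability).

```python
import torch
import torch.nn.functional as F

def total_loss(y_pred_logit, y_true, aux, lam=1.0):
    """L_total = L_target + lambda * L_aux (Eqs. 6-7); a sketch.

    y_pred_logit: (batch,) raw MLP output before the sigmoid
    y_true:       (batch,) click labels in {0, 1}
    aux:          scalar auxiliary loss from the Behavior Refiner Layer
    lam:          the balancing hyper-parameter lambda (1 in the experiments)
    """
    l_target = F.binary_cross_entropy_with_logits(y_pred_logit, y_true.float())
    return l_target + lam * aux
```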
3 EXPERIMENTS
In this section, we conduct experiments and compare with several
state-of-the-art methods on three real-world datasets.
3.1 Datasets
We use three real-world datasets for evaluation. Their statistics are summarized in Table 1.
Public Datasets. The Amazon Dataset contains product reviews and metadata from Amazon¹. We conduct experiments on two subsets named Electronics and Books. These datasets collect user behaviors sorted by timestamp. Assuming there are $K$ reviewed products in a user behavior sequence, our aim is to predict whether the user $u$ will write a review for the $K$-th product based on the first $K-1$ reviewed products. We create training, validation and test sets by randomly sampling from the original dataset with a split of 80%, 10% and 10%.
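One plausible way to turn a user's review sequence into such samples is sketched below. The negative-sampling scheme and the input format are our assumptions for illustration; the paper does not spell out these details.

```python
import random

def build_samples(user_seq, all_items, min_len=2):
    """Builds CTR-style samples from one user's time-sorted review sequence:
    the first K-1 items form the history, the K-th item is the positive
    target, and a random non-reviewed item serves as a sampled negative.
    A sketch assuming a simple list-of-item-ids input format."""
    if len(user_seq) < min_len:
        return []
    history, target = user_seq[:-1], user_seq[-1]
    seen = set(user_seq)
    negative = random.choice([i for i in all_items if i not in seen])
    return [(history, target, 1), (history, negative, 0)]
```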
Alibaba.com. We sampled part of the user behavior logs from the alibaba.com online recommendation system² as a new dataset. We use logs of the first 7 days as the training set, logs of the 8th day as the validation set, and logs of the 9th day as the test set.
Table 1: Dataset statistics.
Dataset Users Items Categories Samples
Books 603,668 367,982 1,600 603,668
Electronics 192,403 63,001 801 192,403
Alibaba.com 177,947 4,996,525 5,057 1,200,000
3.2 Compared Models
We compare DMIN³ with some mainstream CTR prediction methods:
• Wide&Deep [2] is a widely applied method in industrial applications. It contains a deep part and a wide part, combining the memorization of a linear model and the generalization of a DNN model.
• PNN [10] uses a product layer to capture high-order feature interactions.
• DIN [17] applies attention between the user's historical sequence and the target item in order to better model user interest with respect to the target item.
¹ http://jmcauley.ucsd.edu/data/amazon/
² https://www.alibaba.com/
³ The source code is available at https://github.com/mengxiaozhibo/DMIN
• DIEN [16] can be considered an improved version of DIN; it uses a GRU with an attentional update gate to model the evolution of user interest.

We use the implementations of the above baselines provided by [16]. Note that, by removing the first multi-head self-attention, the auxiliary loss and the position embedding, and setting $H_E$ to 1, our model is almost equivalent to DIN.
3.3 Experimental Results
The dimension of the item embedding ($d_{model}$) is set to 36. The embedding size of the position encoding is set to 2. The dimension of the user profile embedding is 18. The batch size is 128, the learning rate is 0.001 and $\lambda$ is 1. Different datasets have different maximum lengths of user behavior sequences; we choose each maximum length according to the distribution of user behavior lengths in the dataset. Accordingly, we set the maximum length to 10, 20 and 80 for Electronics, Books and Alibaba.com, respectively. Based on the performance observed in our experiments, $H_R$ is set to 4 and $H_E$ is set to 2 or 4. The dropout rate is set to 0.2.
We use Area Under the ROC Curve (AUC) as the evaluation metric. All experiments are repeated 5 times, and the mean and standard deviation of the results are reported. The experimental results on the three real-world datasets are shown in Table 2. We find that Wide&Deep with manually designed features does not perform well. PNN is able to automatically learn interactions between features, which beats Wide&Deep. DIN represents user interests with regard to the target item, and its results beat Wide&Deep and PNN. DIEN uses a specially designed GRU structure to capture the evolution of user interest, which helps it obtain better interest representations than DIN. DMIN achieves the highest AUC score on all three datasets, which shows the efficacy of modeling and tracking a user's latent multiple interests. As shown in Table 2, the usage of the auxiliary loss and the position embedding brings a further gain.
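The reporting protocol above can be reproduced with a few lines; the sketch below uses scikit-learn's `roc_auc_score` and assumes the per-run predicted click probabilities have been collected.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_runs(run_predictions, labels):
    """Mean and standard deviation of AUC over repeated runs, matching the
    'repeated 5 times' protocol above; a sketch.

    run_predictions: list of arrays of predicted click probabilities, one per run
    labels:          array of ground-truth click labels in {0, 1}
    """
    aucs = [roc_auc_score(labels, p) for p in run_predictions]
    return np.mean(aucs), np.std(aucs)
```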
Figure 2 shows an example of captured multiple interests for a user in the Alibaba.com dataset. The user's historical sequence alternates between pants and shoes. $\mathrm{interest}_1$ and $\mathrm{interest}_2$ successfully capture the user's interests in pants and shoes, respectively. The color represents the output of the attention unit: the darker the color, the more dominant the corresponding interest. We can see that the switch of the latent dominant interest results in behavior changes.
[Figure 2: Example of captured multiple interests for a user in the Alibaba.com dataset. The figure shows the user's click sequence over time t, with the attention-unit outputs for interest_1 and interest_2 shaded by intensity.]
4 CONCLUSIONS
In this paper, we have proposed a novel method named Deep Multi-Interest Network (DMIN) to model a user's latent multiple interests for the click-through rate prediction task. Specifically, we design a Behavior Refiner Layer using multi-head self-attention to capture better representations of the user's historical items. Then the Multi-Interest Extractor Layer is applied to extract multiple user interests.
Table 2: Experimental Results (AUC) on Three Real-world Datasets.
Model Electronics Books Alibaba.com
Wide&Deep 0.7456 ± 0.0018 0.7788 ± 0.0016 0.7769 ± 0.0017
PNN 0.7543 ± 0.0004 0.7824 ± 0.0025 0.7825 ± 0.0015
DIN 0.7589 ± 0.0006 0.7903 ± 0.0013 0.7837 ± 0.0014
DIEN 0.7707 ± 0.0032 0.8445 ± 0.0025 0.7881 ± 0.0015
DIEN-NO-AUXᵃ 0.7604 ± 0.0006 0.7980 ± 0.0019 0.7861 ± 0.0022
DMIN-NO-AUXᵇ 0.7630 ± 0.0005 0.8020 ± 0.0013 0.7872 ± 0.0012
DMIN-NO-PEᶜ 0.7890 ± 0.0009 0.8620 ± 0.0010 0.7930 ± 0.0013
DMIN 0.7918 ± 0.0004 0.8670 ± 0.0011 0.7950 ± 0.0008
ᵃ DIEN without auxiliary loss
ᵇ DMIN without auxiliary loss
ᶜ DMIN without position embedding
Experimental results show that the proposed DMIN outperforms various state-of-the-art baselines on the click-through rate prediction task. In the future, we will explore more methods to model users' latent multiple interests.
REFERENCES
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
[2] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[4] Yufei Feng, Fuyu Lv, Weichen Shen, Menghan Wang, Fei Sun, Yu Zhu, and Keping Yang. 2019. Deep session interest network for click-through rate prediction. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19).
[5] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on CVPR.
[7] Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[8] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012).
[9] Ze Lyu, Yu Dong, Chengfu Huo, and Weijun Ren. 2020. Deep Match to Rank Model for Personalized Click-Through Rate Prediction. In AAAI.
[10] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In IEEE ICDM.
[11] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management.
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.
[13] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. In Proceedings of the ADKDD'17.
[14] Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional factorization machines: Learning the weight of feature interactions via attention networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence.
[15] Shuai Zhang, Yi Tay, Lina Yao, and Aixin Sun. 2018. Next item recommendation with self-attention. arXiv preprint arXiv:1808.06414 (2018).
[16] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence.
[17] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.