Time-Aware Subscription Prediction Model for User Acquisition in Digital
News Media
Heidar Davoudi Morteza Zihayat Aijun An
User acquisition is one of the most challenging problems for online news providers. In fact, due to the availability of many different news media, users have plenty of choices in selecting a news source. To date, most digital news portals have approached the problem indirectly, targeting user satisfaction through recommender systems.
In contrast, we address the problem directly by identifying
valuable visitors who are likely potential subscribers in the
future. First, we suggest that the decision for subscription
is not a sudden, instantaneous action, but an informed decision based on positive experiences with the digital medium.
As such, we propose effective engagement measures and
show that they are effective in building the predictive
model for subscription. We design a model that not only
predicts the potential subscribers but also answers queries
about the subscription occurrence time. The proposed
model can be used to predict the subscription time and
recommend accurately the “potential users” to the current
marketing campaign. We evaluate the proposed model using
a real dataset from The Globe and Mail which is a major
newspaper in Canada. The experimental results show that
the proposed model outperforms the traditional state-of-the-
art approaches significantly.
1 Introduction
Digital media and online news providers are facing the user acquisition challenge more urgently than ever. In fact, from a business point of view, successful user acquisition translates directly into substantial profits and value. However, whilst around 45%
of people pay for a printed newspaper at least once a
week, it has been much harder to persuade readers to
pay for the online news subscription [10].
News recommender systems are widely exploited to
improve the user experience, and consequently user ac-
quisition indirectly. However, such systems mainly fo-
cus on recommending items that coincide with user’s in-
terests (to maximize the user’s satisfaction) and do not
identify potential subscribers and predict the subscrip-
tion time. Identifying potential subscribers and pre-
dicting their subscription time are of paramount impor-
tance for news websites since it allows them to launch
a targeted marketing campaign in advance. To the best of our knowledge, this problem has not been explored directly in the digital news media domain from a data mining/machine learning perspective, but has rather been considered in marketing studies, which require substantial human effort.

Department of Electrical Engineering and Computer Science, York University, Canada, {davoudi, zihayatm, ann}
The problem of identifying potential subscribers for news media from the data mining/machine learning point of view faces several challenges. First, a decision to subscribe is influenced by many factors such as demographic, social, or cultural circumstances. For example, one might decide to subscribe because she/he was referred by a friend (e.g., word of mouth), or based on her/his own good experience. Finding an appropriate set of predictors for identifying and recommending such users (i.e., potential subscribers)
is a challenging problem. Second, domain knowledge
is extremely limited for the “decision to subscribe”
process (i.e., the knowledge acquisition bottleneck). In
other words, domain experts do not have a clear idea
on who subscribes and why/when a subscription occurs.
Third, subscription should be considered in combination with the time dimension. In fact, the predictive model should identify potential subscribers at the right time (i.e., neither too soon nor too late), since targeting a user who is either not yet ready to subscribe, or no longer interested in subscribing (while previously interested), with any marketing campaign results in no subscription.
In this paper, we propose an end-to-end solution to
address the aforementioned challenges in the problem
of identifying potential users prone to subscription in
news portals. First, we argue that the act of subscribing is not a sudden, instantaneous decision, but rather an informed decision based on previous positive experiences. Accordingly, we propose a set of engagement measures as subscription predictors. The engagement measures are quantified in a fully data-driven fashion, so
we do not rely on the domain expert knowledge for their
calculation. Then, we propose a Time-aware Subscrip-
tion Prediction (TASP) model that combines the time
dimension with the suggested predictors. The proposed
model not only identifies and recommends the users who
are very likely to become subscribers but also is able to
predict their subscription time. In the TASP model,
we treat subscription time as a dependent random variable and utilize a generalized linear model to combine all
Copyright © by SIAM
Unauthorized reproduction of this article is prohibited
engagement measures (i.e., independent random vari-
ables). Then, we cast the problem into an optimization
problem aiming to maximize the likelihood of the pro-
posed model. A learning algorithm is designed and the parameters of the model are learned accordingly. Our main
contributions are as follows:
We define the problem of time-aware subscription
prediction for user acquisition in news portals and
design an end-to-end data-driven solution based on the data that are usually available in news portals.
We propose effective user engagement measures as
the main component of the subscription prediction
model and show that they have a good predictive
power to model subscription occurrence/time.
We argue that time is an important factor in user
subscription prediction and develop a probabilis-
tic model to recommend the trustworthy potential
subscribers. The proposed model predicts the po-
tential users prone to subscription before a given
time. Moreover, it can predict when the subscrip-
tion occurs.
The conducted experiments on a real dataset show
the effectiveness of the proposed framework and the
developed model in solving the problem of time-
aware subscription prediction for user acquisition.
The rest of the paper is organized as follows. Section 2 discusses the proposed framework for user acquisition. In particular, we present our Time-aware Subscription Prediction (TASP) model in Section 2.3. We outline the empirical evaluation in Section 3 and discuss related work in Section 4. Section 5 concludes the paper and presents future work.
2 Time-Aware User Acquisition in News Portals
Figure 1 shows an overview of the proposed framework
for user acquisition in news portals. The framework con-
sists of three main components: (1) Data preparation: most news portals (e.g., The Globe and Mail1) use
a data collection platform (e.g., Omniture by Adobe2)
to capture the interactions with users. However, the
captured data need to be preprocessed and aggregated
before applying any learning algorithm (see §2.1). (2)
Learning phase: given the preprocessed data, this component first finds a set of engagement measures (see §2.2) and then uses them to design the Time-aware Subscription Prediction (TASP) model (see §2.3).

Figure 1: The proposed user acquisition framework.

(3) Inference phase: once the parameters of the proposed model are learned, the inference models answer two types of questions: (i) time-aware subscription occurrence prediction (i.e., what is the probability that a user becomes a subscriber by a given time t since the first visit?); (ii) subscription time prediction (i.e., when will a user become a subscriber after the first visit?). The inference
outcomes can be utilized by the marketing campaign to
boost user acquisition.
2.1 Data Preparation In this section we present
two main phases to prepare the data for user acquisition
analysis: (1) Data collection, and (2) Preprocessing and aggregation.
2.1.1 Data collection: Every time a user reads an
article, watches a video or generally takes an action in
a news portal, the interaction is tracked on the portal and recorded as a hit. In data collection frameworks (e.g., Omniture), a hit is simply a record
in the data warehouse which contains rich information
about the visitor and her/his actions. Typically, a hit
contains information like date, time, user id (for a sub-
scribed user), user environment variables (e.g., browser
type, IP address), visited article, special events of inter-
est like subscription, sign-in, etc. Although the clickstream data are composed of billions of hits that tell what visitors have done in their visits, they contain a lot of noisy information that needs to be cleaned properly.
2.1.2 Preprocessing and Aggregation: The data
captured in any data collection platform contains a lot
of low-level interactions (e.g., hits) mixed with a lot of
noises. For example, spending a lot of time in a session
does not necessarily mean that the user spends more time reading the clicked articles, as the user might use a multi-tab browser and be engaged in other activities. As
such, we need to deal with both aggregation, and data
cleansing as a part of data preparation.
Given that data are organized as hits, we roll-up
the data from page view hits to visits and then to
visitors. We refer to a visit as a set of page views in one
“session” (a session is terminated if the data collection
server does not hear from the same user for 30 minutes).
We use cookies and the device's IP information, which are anonymized and encoded in the data warehouse, to detect unique visitors.
Data collection platforms record a timestamp for
each hit, so the difference between two consecutive page
click timestamps can be utilized to calculate the time
the user spent on an article. As usual in web analytics,
the last article in a visit is ignored since we cannot
estimate the time the user spent on it.
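This dwell-time rule can be sketched in a few lines (a minimal sketch; ISO-8601 timestamp strings are our assumption, as the warehouse format is not specified here):

```python
from datetime import datetime

def article_dwell_times(hit_timestamps):
    """Given the ordered page-click timestamps of one visit, return the time
    (in seconds) spent on each article. The last article is dropped because
    no following click bounds its reading time."""
    times = [datetime.fromisoformat(ts) for ts in hit_timestamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# A hypothetical visit with three page clicks yields dwell times
# for the first two articles only.
visit = ["2014-01-05T09:00:00", "2014-01-05T09:03:30", "2014-01-05T09:10:00"]
print(article_dwell_times(visit))  # [210.0, 390.0]
```
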
We filter out unnecessary attributes that are not needed for calculating the user engagement measures defined in the next section (§2.2). We perform data cleaning by
removing the outlier visitors whose engagement mea-
sures deviate more than 3 times the standard deviation
from the mean of respective engagement measures in
the data [5]. This helps us simply remove unreasonable
values for the measures. Finally, all the engagement
measures are normalized based on the z-score method.
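The two cleaning steps above (3-sigma outlier removal, then z-score normalization) can be sketched as follows; this is our own minimal helper, applied per engagement measure across visitors, not the paper's pipeline code:

```python
import statistics

def clean_and_normalize(values, k=3.0):
    """Drop values deviating more than k standard deviations from the mean,
    then z-score normalize the survivors. Assumes the surviving values are
    not all identical (otherwise the second pstdev would be zero)."""
    mu, sd = statistics.mean(values), statistics.pstdev(values)
    kept = [v for v in values if abs(v - mu) <= k * sd]
    mu2, sd2 = statistics.mean(kept), statistics.pstdev(kept)
    return [(v - mu2) / sd2 for v in kept]
```

After this step each retained measure has zero mean and unit variance, which keeps the engagement measures on comparable scales for the linear model of §2.3.
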
2.2 User Engagement Measures As we suggest that user engagement has a close relationship with user acquisition, one important task in the proposed framework is to measure the user engagement. To understand
the rationale behind the relationship, consider the sce-
nario that we want to predict users prone to subscrip-
tion based on the historical data stored as a clickstream
collection. A reasonable assumption is that the user's decision on subscription is based on long-term and short-term positive experiences rather than a sudden, instantaneous impulse. This is exactly related to the
area of “user engagement” modeling. In fact, a well-
known definition of engagement is based on “positive
aspects” of user experience while interacting with an on-
line application [7]. The positive aspects of experience
are different among domains and applications and very
hard to measure (e.g., visiting Twitter more frequently
by a user in comparison to Facebook does not necessarily show that she/he has a better experience with Twitter
due to differences in engagement patterns of these two
social media). Moreover, other engagement measure-
ment approaches such as self-reporting methods [8] (i.e.,
using questionnaires, surveys or interviews) and physio-
logical methods [3] (i.e., utilizing observational methods
such as facial expression or speech analysis) are based
on a small number of users while assuming to be the
representative of the whole population.
Alternatively, as we aim to have a fully data-driven
framework, we propose the following simple but effec-
tive web analytics measures, inspired by [7], to quantify
the user engagement and show that they have predictive
power for subscription prediction in the digital news media domain.
Total Number of Paywalls: In news portals which
provide subscribed services, there is a restriction on the
number of articles that a non-subscriber can read in a
period of time. For example, in The Globe and Mail
this period is one month. That is, as a visitor tries to
read more articles, she/he is directed to a page asking
for subscription (or login). This page is referred to as a
paywall. In our proposed approach, this interaction is
used as an indicator of a user’s interest in subscription.
We calculate the total number of paywalls each user hits
in all of her/his visits.
Average Number of Paywalls per Visit: This mea-
sure is calculated by normalizing the total number of
paywalls by the number of visits.
Total Articles Read: This measure is simply defined as the number of articles read by the user. It differs from the page-visit count: page visits include all pages (e.g., navigational or search pages), whereas this measure counts only article pages, which better reflect a user's interest in content and are closer to real user engagement (consider, e.g., a user who visits many navigational pages while looking for a single article).
Average Number of Articles per Visit: This mea-
sure is the number of articles read by the user normal-
ized by the number of visits.
Average Spent Time per Article: The time a user
spent on each article is calculated based on the method
described in §2.1.2. The average time spent per article
is calculated by dividing the total time that the user spent on articles by the number of articles she/he visited.
This measure roughly shows how much a user is inter-
ested in articles.
Average Spent Time per Visit: This measure is de-
fined as the time that the user spent on visits divided by
the number of visits. Each visit time is calculated based
on the sum of time that the user spent on all articles
during the respective visit.
Total Spent Time: The total spent time is measured
as the sum of time that a visitor spent on each article
during all her/his visits.
Although these measures are indirect proxies of real engagement, our experimental results show their effectiveness for user subscription prediction.
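Under a hypothetical per-visit record layout (a paywall count plus per-article dwell times, as produced in §2.1), the seven measures above could be computed as:

```python
def engagement_measures(visits):
    """Compute the seven engagement measures of Section 2.2 for one visitor.
    `visits` is a list of dicts with a 'paywalls' count and a list of
    per-article dwell times 'article_seconds' (hypothetical record layout)."""
    n_visits = len(visits)
    total_paywalls = sum(v["paywalls"] for v in visits)
    total_articles = sum(len(v["article_seconds"]) for v in visits)
    total_time = sum(sum(v["article_seconds"]) for v in visits)
    return {
        "total_paywalls": total_paywalls,
        "avg_paywalls_per_visit": total_paywalls / n_visits,
        "total_articles_read": total_articles,
        "avg_articles_per_visit": total_articles / n_visits,
        "avg_time_per_article": total_time / total_articles,
        "avg_time_per_visit": total_time / n_visits,
        "total_time": total_time,
    }
```
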
2.3 Time-aware Subscription Prediction Model
(TASP): Given the set of engagement measures, in
this section, we first outline the problem statement;
then, in subsequent sections we describe our proposed
Time-aware Subscription Prediction (TASP) model in
detail. We utilize the generalized linear model as the
building block of the model. By assuming an underlying
distribution for subscription time (i.e., Weibull), we cast
the problem into the maximum likelihood optimization.
Finally, we derive the solution to learn the parameters
of the model.
2.3.1 Problem Statement: Given the processed data for all the users, we refer to the time period of this data set as the "exploration period". We first remove
the users who subscribed before the exploration period.
The remaining users either subscribed during the ex-
ploration period (i.e., subscribers) or never subscribed
either before or during the period (i.e., non-subscribers).
Note that we do not consider the users who subscribed before the exploration period, since we do not have their information from before their subscription, and our targeted problem is to build a model that predicts how and when unsubscribed users become subscribers.
Definition 2.1. (Subscription Occurrence Time): The subscription time \tilde{t}_i is defined as the time that has passed since the first visit of user i until her/his subscription. Thus, given the absolute subscription time t_{0_i} and the first visit time t_{f_i} for user i, \tilde{t}_i is computed as

(2.1) \tilde{t}_i = t_{0_i} - t_{f_i}

The absolute subscription time refers to the timestamp that is recorded for each subscription. In our analysis, all timestamps are at the day scale.
For non-subscribers, we define the possible subscription period as follows:

Definition 2.2. (Possible Subscription Period): We define (\bar{t}_i, \infty) as the possible subscription period for user i, where \bar{t}_i is defined as:

(2.2) \bar{t}_i = t_{l_i} - t_{f_i}

where t_{l_i} is the last visit time in the exploration period for a non-subscriber. Alternatively, \bar{t}_i may be regarded as the time since the first visit of user i after which subscription might occur. Please note that if the subscription occurs we know the exact time of subscription (\tilde{t}_i), whereas if the subscription does not occur, all we know is that the subscription time exceeds \bar{t}_i.
The training set for the subscription time prediction problem is defined as follows:

(2.3) L = \{(X_i, t_i, I_i) \mid i = 1, 2, \ldots, n\}

where X_i = [x_{ji}]_{m \times 1} is the engagement measure vector for user i (x_{ji} is the j'th engagement measure calculated for user i, see §2.2). We calculate the user engagement measures for subscribers based on the visits before the subscription time, and for non-subscribers based on the visits from the first visit until the last visit in the exploration period. For simplicity, the vector X_i is appended with 1 to account for the bias in the linear system. I_i is the indicator function which specifies whether user i subscribed during the exploration period or not:

(2.4) I_i = \begin{cases} 1 & \text{if user } i \text{ is a subscriber} \\ 0 & \text{otherwise} \end{cases}

and t_i is defined as \tilde{t}_i for subscribed users (i.e., I_i = 1) and \bar{t}_i for non-subscribed users (i.e., I_i = 0). We refer to this arrangement in §2.3.4 when we formulate the optimization problem.
Let T be a non-negative continuous random variable representing the waiting time for subscription occurrence since the first visit. We assume f_T(t) is the probability density function (p.d.f.) and F_T(t) = P(T < t) is the cumulative distribution function (c.d.f.) of subscription occurrence by time t.

Now we define the problem of user subscription time prediction as follows. Given the training data L (Eq. 2.3), we want to estimate the cumulative distribution function F(t) = P(T < t) for any subscription time t.
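The construction of the training set L (Eqs. 2.1-2.4) can be sketched as follows, under a hypothetical per-user record layout with day-scale timestamps (the field names are our own, not the paper's):

```python
def build_training_set(users):
    """Build (X, t, I) following Eq. 2.3. Each user is a dict with an
    engagement vector 'X', day-scale 'first_visit' and 'last_visit'
    timestamps, and 'sub_time' (absolute subscription day, or None for
    non-subscribers). Each X is appended with 1 for the bias term."""
    X, t, I = [], [], []
    for u in users:
        X.append(u["X"] + [1.0])              # bias term appended
        if u["sub_time"] is not None:          # subscriber: Eq. 2.1
            t.append(u["sub_time"] - u["first_visit"])
            I.append(1)
        else:                                  # non-subscriber: Eq. 2.2
            t.append(u["last_visit"] - u["first_visit"])
            I.append(0)
    return X, t, I
```
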
2.3.2 Generalized Linear Model: In order to
make a connection between subscription time (i.e., vari-
able of interest) and engagement factors, we first de-
velop a generalized linear model. The generalized linear
model bridges the gap between the probability distri-
bution of subscription time and the engagement factors calculated for each user, and parameterizes our model from
observed data. Once the connection (i.e., a model) is
established, we can predict the subscription time from
the engagement behaviors.
Given vector X_i as the engagement measure vector (i.e., explanatory variables), the subscription time observation for user i is modeled as follows:

(2.5) T_i = B^\top X_i + \epsilon

where \epsilon is a stochastic residual coming from the exponential family.

Figure 2: Weibull Distribution.

The main idea is to model the expectation of subscription time as a function (i.e., link function) of a linear combination of engagement measures. So,
(2.6) E[T_i] = g^{-1}(B^\top X_i)

To ensure the strict positivity of E[T_i], we assume g^{-1} is the exponential function:

(2.7) E[T_i] \propto \exp(B^\top X_i)

This assumption also helps us simplify the objective function introduced in §2.3.4. Please note that if we choose a Gaussian or Bernoulli distribution, the model reduces to linear regression or logistic regression, respectively.
2.3.3 Underlying Distribution for Subscription Time: As our goal is to model the relationship between user engagement and subscription time, we need to find a proper distribution for predicting the subscription time. The Weibull distribution has the flexibility to model right-skewed, left-skewed or even symmetric data; thus, we chose to use it in our model. It has been used in different domains to model the waiting time of an event [6]. The Weibull probability distribution for subscription time is as follows:

(2.8) f_{T_i}(t; \gamma, \alpha_i) = \frac{\gamma}{\alpha_i} \left(\frac{t}{\alpha_i}\right)^{\gamma - 1} \exp\left\{-\left(\frac{t}{\alpha_i}\right)^{\gamma}\right\}
where \alpha_i and \gamma are the scale and shape parameters, respectively. The shape parameter \gamma can be learned to model waiting times where the rate of the event (i.e., the hazard function) decreases (\gamma < 1), increases (\gamma > 1), or is constant (\gamma = 1) with time. Increasing the value of the scale parameter (\alpha_i) while holding the shape parameter (\gamma) constant has the effect of stretching out the probability density function. Figure 2 shows the Weibull distribution for different parameters. The expectation of the Weibull distribution is expressed as:

(2.9) E[T_i] = \alpha_i \, \Gamma\left(1 + \frac{1}{\gamma}\right)

where \Gamma is the Gamma function. Given (Eq. 2.7), we can assume that:

(2.10) \alpha_i = \exp(B^\top X_i)
The cumulative distribution function is written as follows:

(2.11) F_{T_i}(t) = P(T_i \le t) = 1 - \exp\left\{-\left(\frac{t}{\alpha_i}\right)^{\gamma}\right\}

Note that the distributions in (Eq. 2.8) share the same shape parameter \gamma but have different expectations via the parameter \alpha_i. In fact, the basic assumption is that each value of the random variable T_i is drawn from a distribution of the form (Eq. 2.8) whose expectation depends on the data point via (Eqs. 2.9 and 2.10).
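As a numerical sanity check on the Weibull facts used here and later (the expectation of Eq. 2.9 and the median used for time prediction in §2.4.2), the following sketch, with hypothetical parameter values, verifies them by inverse-transform sampling from the c.d.f. of Eq. 2.11:

```python
import math
import random

gamma_shape, alpha = 1.5, 30.0  # hypothetical shape (gamma) and scale (alpha_i), in days

# Closed-form quantities: E[T] = alpha * Gamma(1 + 1/gamma); median = alpha * (ln 2)^(1/gamma).
mean = alpha * math.gamma(1 + 1 / gamma_shape)
median = alpha * math.log(2) ** (1 / gamma_shape)

# Inverse-transform sampling: solving 1 - exp{-(t/alpha)^gamma} = u for t
# gives t = alpha * (-ln(1 - u))^(1/gamma).
random.seed(0)
samples = [alpha * (-math.log(1 - random.random())) ** (1 / gamma_shape)
           for _ in range(200_000)]
empirical_mean = sum(samples) / len(samples)
below_median = sum(1 for s in samples if s <= median) / len(samples)
print(round(mean, 2), round(empirical_mean, 2), round(below_median, 3))
```

The empirical mean matches the closed form, and about half the samples fall below the closed-form median, as expected.
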
2.3.4 Optimization Problem: Assuming that the observations (i.e., data points) are statistically independent and drawn from the distribution (Eq. 2.8), the log-likelihood of the model is formulated as follows:

(2.12) \log \ell = \sum_{i=1}^{n} \left\{ I_i \log f_{T_i}(t_i; \gamma, \alpha_i) + (1 - I_i) \log P(T_i > t_i) \right\}
where t_i is the subscription occurrence time (\tilde{t}_i) for subscriber i (i.e., I_i = 1) and the start of the possible subscription period (\bar{t}_i) for non-subscriber i (i.e., I_i = 0). The basic idea is that subscribers contribute to the log-likelihood through the probability density function f_{T_i}, while non-subscribers contribute through the probability P(T_i > t_i). If we plug the probability density function (Eq. 2.8) and the cumulative distribution function (Eq. 2.11) into the log-likelihood function (Eq. 2.12), we can simplify the log-likelihood of the model in vector form as follows:

(2.13) \log \ell = I^\top\left(\log(\gamma)\mathbf{1} + (\gamma - 1)\log(T_s)\right) - \gamma\, I^\top X B - \mathbf{1}^\top \exp\left\{\gamma\left(\log(T_s) - XB\right)\right\}
where I = [I_i]_{n \times 1} is the indicator vector whose components are defined in (Eq. 2.4), \mathbf{1} = [1]_{n \times 1} is the all-ones vector, T_s = [t_i]_{n \times 1} is the vector of subscription times defined in (Eq. 2.3), X = [X_i]_{n \times m} is the matrix of engagement measures whose rows are the vectors X_i defined in (Eq. 2.3), and \gamma (a scalar) and B = [\beta_j]_{m \times 1} (a vector) are the model parameters.
Algorithm 1 TASP Learning Algorithm
1: Initialize B^{(0)}
2: Initialize \gamma^{(0)} = 1
3: t := 0
4: while not converged and t < max_iterations do
5:   B^{(t+1)} := B^{(t)} + \eta_B \nabla_B[\log \ell(B^{(t)}, \gamma^{(t)})]
6:   \gamma^{(t+1)} := \gamma^{(t)} + \eta_\gamma \nabla_\gamma[\log \ell(B^{(t+1)}, \gamma^{(t)})]
7:   t := t + 1
8: end while
9: return B, \gamma
2.3.5 Learning Algorithm: We use gradient ascent to maximize the log-likelihood and learn the parameters of the proposed model. First, we derive the gradients of the log-likelihood of the model (Eq. 2.13) with respect to \gamma and B. The gradient with respect to B is specified as follows:

(2.14) \nabla_B[\log \ell(B, \gamma)] = -\gamma X^\top I + \gamma X^\top \exp\left\{\gamma\left(\log(T_s) - XB\right)\right\}

and the gradient of the log-likelihood with respect to \gamma is derived as follows:

(2.15) \nabla_\gamma[\log \ell(B, \gamma)] = I^\top\left\{\gamma^{-1}\mathbf{1} + \log(T_s) - XB\right\} - \left(\log(T_s) - XB\right)^\top \exp\left\{\gamma\left(\log(T_s) - XB\right)\right\}
The overall procedure is outlined in Algorithm 1. We use the coordinate ascent method [14] to learn B and \gamma iteratively. In step 5, we update the parameter B based on the gradient derived in (Eq. 2.14); then, in step 6, keeping B fixed, the parameter \gamma is updated according to the gradient in (Eq. 2.15).
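The log-likelihood, its two gradients, and the update loop can be sketched in pure Python as follows. This is our own minimal sketch of Algorithm 1 under the convention \alpha_i = \exp(B^\top X_i), not the authors' implementation; the per-user gradient averaging, the exponent cap, and the clamp keeping \gamma positive are our own stabilizing choices:

```python
import math

def log_likelihood(X, t, I, B, g):
    """Per-user form of Eqs. 2.12/2.13 with alpha_i = exp(B . X_i)."""
    ll = 0.0
    for Xi, ti, Ii in zip(X, t, I):
        bx = sum(b * x for b, x in zip(B, Xi))   # log alpha_i
        r = math.log(ti) - bx                    # log t_i - log alpha_i
        ll += Ii * (math.log(g) + (g - 1) * math.log(ti) - g * bx)
        ll -= math.exp(g * r)                    # tail term shared by both cases
    return ll

def gradients(X, t, I, B, g):
    """Gradients of Eqs. 2.14 and 2.15, accumulated per sample."""
    m = len(B)
    gB, gg = [0.0] * m, 0.0
    for Xi, ti, Ii in zip(X, t, I):
        r = math.log(ti) - sum(b * x for b, x in zip(B, Xi))
        e = math.exp(min(g * r, 50.0))           # capped exponent (our safeguard)
        for j in range(m):
            gB[j] += g * (e - Ii) * Xi[j]
        gg += Ii * (1.0 / g + r) - r * e
    return gB, gg

def fit_tasp(X, t, I, eta=0.001, iters=1000):
    """Coordinate gradient ascent in the spirit of Algorithm 1."""
    n, m = len(X), len(X[0])
    B, g = [0.0] * m, 1.0
    for _ in range(iters):
        gB, _ = gradients(X, t, I, B, g)         # step 5: update B
        B = [b + eta * gb / n for b, gb in zip(B, gB)]
        _, gg = gradients(X, t, I, B, g)         # step 6: update gamma, B fixed
        g = max(g + eta * gg / n, 1e-3)          # keep the shape parameter positive
    return B, g
```

A useful correctness check is to compare the analytic gradients against finite differences of `log_likelihood`; they agree to several decimal places on small inputs.
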
2.4 Inference models: After the parameters of the model (i.e., \gamma and B) are learned, inference with the model is straightforward. In particular, we are interested in answering two types of questions: (1) what is the probability that a user becomes a subscriber by a given time t since the first visit? (time-aware subscriber prediction); (2) when, after the first visit, will a user become a subscriber? (subscription time prediction).
2.4.1 Time-aware Subscription Occurrence Prediction: To find the users who will be subscribers by time t since their first visit, we need to estimate P(T \le t). Given that user \hat{u} has an engagement vector X_{\hat{u}}, we calculate the scale parameter \alpha_{\hat{u}} using (Eq. 2.10):

(2.16) \alpha_{\hat{u}} = \exp(B^\top X_{\hat{u}})

The desired probability is calculated as follows:

(2.17) F_T(t) = P(T \le t) = 1 - \exp\left\{-\left(\frac{t}{\alpha_{\hat{u}}}\right)^{\gamma}\right\}

We consider F_T(t) = P(T \le t) \ge 0.5 as predicting subscription occurrence by time t.

Figure 3: Subscription time prediction performance.
2.4.2 Subscription Time Prediction: For the
subscription time prediction, as the final distribution
can be skewed, we propose to use the median as predic-
tion time. This measure is less susceptible to outliers
and extreme values and empirically performs better in
our experiments. Given user \hat{u} with X_{\hat{u}} as the engagement vector, the predicted subscription time t_{\hat{u}} is the median of the fitted Weibull distribution:

(2.18) t_{\hat{u}} = \alpha_{\hat{u}} (\log 2)^{1/\gamma}

where \alpha_{\hat{u}} is estimated using (Eq. 2.16).
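Both inference queries reduce to a few lines once B and \gamma are learned (a sketch of Eqs. 2.16-2.18; the parameter values below are hypothetical, not learned from the paper's data):

```python
import math

def subscription_probability(x, B, g, t):
    """P(T <= t) for a user with engagement vector x (Eqs. 2.16-2.17)."""
    alpha = math.exp(sum(b * xi for b, xi in zip(B, x)))
    return 1.0 - math.exp(-((t / alpha) ** g))

def predicted_subscription_time(x, B, g):
    """Median of the fitted Weibull (Eq. 2.18): alpha * (ln 2)^(1/gamma)."""
    alpha = math.exp(sum(b * xi for b, xi in zip(B, x)))
    return alpha * math.log(2) ** (1.0 / g)

# Hypothetical learned parameters: two engagement measures plus a bias weight.
B, g = [0.8, -0.3, 2.0], 1.5
x = [1.2, 0.4, 1.0]   # z-scored measures with the bias 1 appended
print(subscription_probability(x, B, g, t=30))
print(predicted_subscription_time(x, B, g))
```

Note that, by construction, the probability of subscribing by the predicted (median) time is exactly 0.5, which is what the >= 0.5 decision rule of §2.4.1 pivots on.
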
3 Empirical Evaluation
In this section, we evaluate our proposed Time-aware
Subscription Prediction (TASP) model and compare it
with the state-of-the-art techniques as the baselines. We
compare our model with Logistic Regression (LR), Ran-
dom Forest (RF), Decision Tree (J48) and Naive Bayes
(NB). We use the Mean Absolute Error (MAE) and F1-measure as performance measures for subscription time prediction and subscription occurrence prediction, respectively. All the experiments in this section are based on
the 10-fold cross validation. All the time values in the
experiments are in day scale. We use The Globe and
Mail dataset in our experiments. We set the learning
rate ηand maximum number of iterations (i.e., max it-
erations) in Algorithm 1 to 0.01 and 1000 respectively.
3.1 Dataset The Globe and Mail is a major newspaper in Canada. In this news portal, interactions with users are captured using the Omniture data collection platform. The original data repository contains about 2 billion hits (see §2.1.2), ranging from article reading behavior to video watching. We use the data from 2014-01 to 2014-08 as the exploration period in our experiments.

Figure 4: Subscription occurrence prediction performance (all time values are in days).

Since the data contains a lot of irrelevant
information, the original data is processed to keep only
the necessary information needed to calculate the pro-
posed engagement measures. Then, we aggregate and
preprocess the data based on the proposed steps de-
scribed in (§2.1.2). The dataset used in the experiments
contains 17,009 subscribers and 71,639 non-subscribers.
Note again that the subscribers are the ones who sub-
scribed during the exploration period.
3.2 Subscription Time Prediction: Figure 3
shows the results of subscription time prediction for the
proposed model (TASP) and Average Time (AVG) as
the baseline. Each point in the figure shows the MAE
between the predicted subscription time and actual sub-
scription time for users who subscribe before time t(all
time values are in days). For the AVG model we calcu-
late the average subscription time of visitors who sub-
scribed before time tin the training set. Then, the MAE
is calculated based on the difference between the actual
subscription time of users in the test set and the re-
spective average time value. As observed, MAE for the
proposed method is much less than the AVG method for
different t. In particular, for small values of t the proposed model performs better than for larger time values, meaning that the proposed method works better in short-term subscription time prediction than in longer-term prediction, although it outperforms the AVG method in both the short term and the long term.
3.3 Subscription Occurrence Prediction: Fig-
ure 4 shows the performance of TASP compared to the
other baselines for different values of t. Each figure
shows the performance of the different models in pre-
dicting the subscription occurrence before time t. Note
that the proposed model (TASP) considers the time in
the training stage and answers the queries about sub-
scription with respect to the time (i.e., probability of
subscription before given time t). Figure 4 shows that
the proposed model outperforms the baselines for dif-
ferent tvalues. Moreover, it can be seen that the TASP
model performs better in short-time subscription pre-
diction. Among the baselines, tree-based models (i.e.,
J48 and Random Forest) perform the best and the Lo-
gistic Regression has the worst performance.
3.4 Imbalanced Sensitivity Analysis: In this section, we study the sensitivity of the proposed model's performance to the ratio of non-subscribers to subscribers in the training data. To that end, we vary this ratio by down-sampling the non-subscribers. Figure 5 shows the performance of the proposed model (TASP) as well as the baselines for different ratios of non-subscribers to subscribers (n_ns/n_s), where n_ns and n_s are the numbers of non-subscribers and subscribers, respectively, in the
training set. The performance of the proposed model in predicting subscribers is better when the dataset is balanced, and it is consistently better than the baselines across different ratios.

Figure 5: The subscription occurrence prediction performance sensitivity with respect to the ratio of non-subscribers to subscribers (n_ns/n_s).

Since our model performs better when the dataset is balanced, we aim to embed a mechanism in the model to deal with imbalanced data as future work. Figure 6 shows the performance sensitivity of the proposed model in predicting the subscription time. As can be seen, the MAE has only a small sensitivity to the ratio of non-subscribers to subscribers in the training set.
4 Related Work
User acquisition has traditionally been studied in the area of Customer Relationship Management (CRM) [12], where the main goal is to understand customer behavior and maximize the customer's long-term value to the organization. However, to date, most efforts have focused on user attraction, retention, and churn management rather than user acquisition.
Ng et al. [11] made one of the first attempts at using data mining techniques for user retention, in an imaginary telecommunication domain (the real domain was kept anonymous due to privacy issues). They identified the
objective indicators and used a decision tree induction
method for the prediction purpose. In [1], the authors proposed a rule-based evolutionary algorithm and applied it to churn prediction in a telecommunication domain. They argued that interpretability is important in this problem, and that their rule-based method addresses the issue by uncovering interpretable churn patterns. In another work, Mozer et al. [9] considered the
problem of churn prediction for a major carrier com-
pany. They utilized features (overall 134 variables) such
as call details records, billing information, application
for service to predict the users churns. Three classes
of predictive models: (i.e., decision tree, logit regres-
sion, and non linear neural network) were exploited and
compared for the user churn prediction.
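The churn-prediction setup in these telecom studies can be sketched with a minimal logistic-regression baseline trained by gradient descent. The feature names and data below are invented for illustration, and this toy implementation merely stands in for the modeling tools used in the cited works.

```python
import math

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression churn model sigmoid(w.x + b) by
    plain stochastic gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            g = p - yi  # gradient of the log-loss w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, xi):
    """Predicted churn probability for a single example."""
    return 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))

# Hypothetical usage features: [normalized monthly minutes, billing disputes]
X = [[0.1, 3], [0.2, 4], [0.9, 0], [0.8, 1]]
y = [1, 1, 0, 0]  # 1 = churned
w, b = train_logistic(X, y)
churn_prob = predict(w, b, [0.15, 3])  # low usage, several disputes
```

As in [9], such a probabilistic model lets the carrier rank subscribers by churn risk and target retention offers at the most at-risk ones.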
Kim et al. [4] conducted a study to measure the attractiveness (click values) of individual words for users. Assuming that some words induce significantly more clicks than others, they proposed a generative model which jointly models headlines, the contents of news articles, and the clickstream data. The model extends Latent Dirichlet Allocation (LDA) [2], with the topic-specific click values of each word and the clicked words modeled using beta and binomial distributions, respectively.
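The beta–binomial building block of that model can be illustrated with a small posterior-mean computation: with a Beta(α, β) prior on a word's click probability and binomially distributed clicks out of impressions, the posterior mean gives a smoothed click value. This is only the basic conjugate-update idea, not the full topic model of [4]; the counts and prior below are invented.

```python
def click_value(clicks, impressions, alpha=1.0, beta=1.0):
    """Posterior mean of a word's click probability under a
    Beta(alpha, beta) prior with binomial click observations:
    (clicks + alpha) / (impressions + alpha + beta)."""
    return (clicks + alpha) / (impressions + alpha + beta)

# A word shown in 50 headlines and clicked 10 times...
common = click_value(10, 50)  # 11/52, close to the raw rate 0.2
# ...versus a word seen only once: smoothed heavily toward the prior
rare = click_value(1, 1)      # 2/3 rather than the raw rate 1.0
```

The smoothing prevents rarely seen words from receiving extreme click values, which matters when ranking candidate headline words.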
Customer Lifetime Value (LTV) analysis is another area related to user acquisition. Customer LTV is usually defined as the total net income that a company expects from a customer. For example, Rosset et al. [13] calculated the current customer LTV based on three factors: customer value over time, length of service, and a discounting factor. However, they estimated the effects of retention campaigns on lifetime value and did not investigate how a visitor (e.g., a non-subscriber) becomes a customer (a subscriber).
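A three-factor LTV calculation of the kind described in [13] can be sketched as a discounted sum of per-period value, with a geometric survival term standing in for length of service. This is a simplified illustration, not the segment-based model of [13]; the retention and discount rates below are hypothetical.

```python
def lifetime_value(value_per_period, retention_rate, discount_rate, horizon):
    """Discounted expected customer value: in period t the customer is
    still active with probability retention_rate**t, and income in
    period t is discounted by 1 / (1 + discount_rate)**t."""
    return sum(
        value_per_period * retention_rate**t / (1.0 + discount_rate)**t
        for t in range(horizon)
    )

# A $20/month subscriber with 90% monthly retention and a
# 1% monthly discount rate, over a 36-month horizon
ltv = lifetime_value(20.0, 0.90, 0.01, horizon=36)
```

Comparing such an LTV estimate against acquisition cost is what makes identifying likely subscribers, rather than all visitors, economically attractive.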
In this paper, we consider the problem of user
acquisition in the digital news portal domain. To the best of our knowledge, this is the first work that considers this problem in the digital news media domain and provides an end-to-end solution for it. In particular, our proposed predictive model treats the subscription time as a main component in both the learning and inference processes, which has not been done before.
Figure 6: The subscription time prediction performance sensitivity with respect to the ratio of non-subscribers to subscribers (nns/ns).
5 Conclusion and Future Work
User acquisition is one of the most pressing issues for digital news portals, as users are exposed to many available news sources. In this paper, we addressed the problem by predicting which users are likely to subscribe within a given period of time. One important challenge is to define measures with enough power to predict subscription, since subscribing is a complex decision that depends on many factors. We showed that engagement measures have good predictive power for subscription; the intuition is that engagement, as a positive experience, has a direct impact on the decision to subscribe. We proposed a time-aware prediction model that predicts not only whether a user will subscribe within a given period of time, but also when the subscription will occur. An empirical study on a real dataset showed that the proposed model performs well compared to the baseline models. In the future, we plan to embed a mechanism in the model to deal with imbalanced data (for situations where the ratio of subscribers to non-subscribers is very low). We will also investigate the applicability of the proposed model in other domains.
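Since the subscription-time component draws on the Weibull distribution [6], the expected time-to-subscription under fitted Weibull parameters can be sketched as follows. The shape and scale values are placeholders for illustration, not parameters learned by the proposed model.

```python
import math

def weibull_expected_time(shape_k, scale_lam):
    """Mean of a Weibull(k, lambda) time-to-event distribution:
    E[T] = lambda * Gamma(1 + 1/k)."""
    return scale_lam * math.gamma(1.0 + 1.0 / shape_k)

def weibull_survival(t, shape_k, scale_lam):
    """Probability that the subscription event has not occurred by
    time t: S(t) = exp(-(t / lambda)**k)."""
    return math.exp(-((t / scale_lam) ** shape_k))

# Placeholder parameters: scale of 30 days, increasing hazard (k > 1)
expected_days = weibull_expected_time(1.5, 30.0)
still_unsubscribed = weibull_survival(60.0, 1.5, 30.0)
```

With k > 1 the hazard of subscribing increases over time, which matches the intuition that accumulated positive engagement makes subscription more likely.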
Acknowledgments
This work is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), The Globe
and Mail, and the Big Data Research, Analytics, and
Information Network (BRAIN) Alliance established by
the Ontario Research Fund - Research Excellence Pro-
gram (ORF-RE). We would like to thank The Globe and
Mail for providing the dataset used in this research. In
particular, we thank Gordon Edall and Shengqing Wu of
The Globe and Mail for their insights and collaboration
in our joint project.
References
[1] W.-H. Au, K. C. Chan, and X. Yao. A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation, 7(6):532–545, 2003.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[3] C. Jennett, A. L. Cox, P. Cairns, S. Dhoparee, A. Epps, T. Tijs, and A. Walton. Measuring and defining the experience of immersion in games. International Journal of Human-Computer Studies, 66(9):641–661, 2008.
[4] J. H. Kim, A. Mantrach, A. Jaimes, and A. Oh. How to compete online for news audience: Modeling words that attract clicks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
[5] H.-P. Kriegel, P. Kröger, and A. Zimek. Outlier detection techniques. In Tutorial at the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.
[6] C.-D. Lai, D. Murthy, and M. Xie. Weibull distributions and their applications. In Springer Handbook of Engineering Statistics, pages 63–78. Springer, 2006.
[7] M. Lalmas, H. O'Brien, and E. Yom-Tov. Measuring user engagement. Synthesis Lectures on Information Concepts, Retrieval, and Services, 6(4):1–132, 2014.
[8] I. Lopatovska and I. Arapakis. Theories, methods and current research on emotions in library and information science, information retrieval and human–computer interaction. Information Processing & Management, 47(4):575–592, 2011.
[9] M. C. Mozer, R. Wolniewicz, D. B. Grimes, E. Johnson, and H. Kaushansky. Predicting subscriber dissatisfaction and improving retention in the wireless telecommunications industry. IEEE Transactions on Neural Networks, 11(3):690–696, 2000.
[10] N. Newman, D. A. Levy, and R. K. Nielsen. Reuters Institute digital news report 2016. Available at SSRN 2619576, 2016.
[11] K. Ng and H. Liu. Customer retention via data mining. Artificial Intelligence Review, 14(6):569–590, 2000.
[12] E. W. Ngai, L. Xiu, and D. C. Chau. Application of data mining techniques in customer relationship management: A literature review and classification. Expert Systems with Applications, 36(2):2592–2602, 2009.
[13] S. Rosset, E. Neumann, U. Eick, and N. Vatnik. Customer lifetime value models for decision support. Data Mining and Knowledge Discovery, 7(3):321–339, 2003.
[14] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.