HYBRID MACHINE LEARNING APPROACH TO POPULARITY PREDICTION OF NEWLY RELEASED CONTENTS FOR ONLINE VIDEO STREAMING SERVICE
A PREPRINT
Hongjun Jeon
TEAMLAB
Gachon University
Seongnam, Republic of Korea
jeonhongjun730@gmail.com
Wonchul Seo
Graduate School of Management of Technology
Pukyong National University
Busan, Republic of Korea
wcseo@pknu.ac.kr
Eunjeong Lucy Park
Papago
NAVER
Seongnam, Republic of Korea
lucy.park@navercorp.com
Sungchul Choi
Department of Industrial and Management Engineering
TEAMLAB at Gachon University
Seongnam, Republic of Korea
sc82.choi@gachon.ac.kr
February 12, 2019
ABSTRACT
In the industry of video content providers such as VOD and IPTV, predicting the popularity of video contents in advance is critical not only for marketing but also for network usage. By successfully predicting user preferences for contents, large files can be deployed efficiently on the optimal servers, ultimately leading to network cost reduction. Many previous studies have conducted view-count prediction research for this purpose. However, those studies make predictions based on historical view-count data from users; in such cases, the contents have already been published to users and already sent to the service server. These approaches make it possible to efficiently deploy content that is already published, but they cannot handle the cold start of new content. To address this problem, this research proposes a hybrid machine learning approach to a classification model for the popularity prediction of new video contents that have not yet been published.
In this paper, we create a new variable based on the related contents of a specific content and divide the entire dataset by the characteristics of the contents. Next, the prediction is performed using XGBoost- and deep-neural-net-based models according to the data characteristics of each cluster. Our model uses content metadata for prediction, so we use categorical embedding techniques to overcome the sparsity of categorical variables and let the deep neural net model learn them efficiently. In addition, we use the FTRL-proximal algorithm to address the view-count volatility of video contents. The results show overall better performance than the previous standalone methods on a dataset from one of the top streaming service companies.
Keywords Streaming Service · Popularity Prediction · Embeddings · Deep Learning · Boosting Decision Tree
1 Introduction
Along with the rise of the online video streaming service market, the way in which video contents are consumed is changing drastically. Consumers are moving rapidly from traditional terrestrial broadcasting to IPTV and online streaming services, for example, YouTube and Netflix². Recently, a new report from the Video Advertising Bureau³ shows that the number of households that use only streaming services has tripled since 2013⁴. According to the Global Internet Phenomena Report⁵, massive online streaming platforms currently account for 50.31% of North American Internet traffic⁶.
They are both co-corresponding authors. Tel: +82-31-750-5804, Email: teamlab.gachon@gmail.com, Web: theteamlab.io
These changes in the video content market have increased service traffic at a rapid pace. In addition, the rapid development of High-Definition (HD), Full-HD, 4K-or-higher, and 3D video has enlarged data sizes and caused additional service traffic issues. As a result, streaming service companies face an emerging need to actively respond to the changing market with a system architecture for well-organized services. Due to the characteristics of streaming services, a small number of popular contents monopolize traffic, and this concentration of traffic is a major issue for streaming service companies⁷. Accordingly, the development of Content Delivery Networks (CDNs) that can actively address the problem has been increasingly in demand [1].
To provide network quality and efficiency in streaming services, the service provider needs technology that identifies in advance the popular contents that will be heavily consumed, while storing them on the fastest available storage. This process improves the capability of CDNs to react to the huge consumer demand for highly popular contents. To address network efficiency problems, an online streaming platform arranges contents on storage by utilizing view-count records based on log data accumulated for a certain period after new content has been published. Because this method requires one to two weeks of view-count records to capture time-series characteristics, it is difficult to configure an efficient content arrangement for traffic management during that window. When new content is published, the content manager could instead arrange it through a heuristic approach, which is impractical given the immense number of contents published every week.
The purpose of our research is to propose a classification model of the degree of popularity of new content in advance
without the historical view-count logs of the content. The proposed model aims to provide efficient network management
for service companies by classifying popular contents. We expect that companies can benefit from cost savings through
network optimization by utilizing the proposed model.
In this paper, we present two contributions to research on content popularity prediction. First, a hybrid approach is proposed to build a machine learning model that predicts the popularity of newly released contents. Generally, video contents can be separated into two types: series and standalone works. For example, a seasonal drama or weekly talk show will have previous related video contents, whereas movies or standalone documentaries often do not have any related previous works. In the former case, the popularity of the previous content is an important feature for predicting the popularity of new content. Second, we adopt a batch FTRL-proximal optimizer algorithm [2] to address the volatility problem when training on view-count log data. Basically, the demand for video streaming contents has higher volatility than general social data: demand is high at the beginning of a release but declines rapidly after a certain period of time. This view-count volatility affects the gradient-descent-based training process and distorts the performance of the model. Taking into consideration the characteristics of newly released contents, we propose a hybrid approach based on XGBoost [3] and a neural net with the batch FTRL-proximal optimizer algorithm [2], demonstrating overall better performance than previous standalone methods.
The rest of this paper is organized as follows. In Chapter 2, we review related works on users' content preference prediction and machine-learning-based popularity prediction of media content. In Chapter 3, we describe the proposed content prediction model and algorithms for newly published contents without history log data. In Chapter 4, we compare the performance of the proposed model and discuss our main contributions. Finally, we discuss conclusions and future works in Chapter 5.
2 Related Works
Research on content popularity has been widely conducted in various areas such as news, advertising, TV shows, streaming videos, and movie markets. These studies require user information, historical usage data, and content metadata. For video contents, many recent studies utilize time-series-based log data together with content metadata, or external information such as text written on social network applications.
²Here's a guide to every live streaming service on the market or coming soon - https://read.bi/2VgcTb1
³https://www.thevab.com/about-vab/
⁴The number of cord-cutters has tripled in the last 5 years, and it's starting to hurt the TV channels - https://bit.ly/2EQacax
⁵https://www.sandvine.com/blog
⁶Half of All Internet Traffic Goes to Netflix and YouTube - https://bit.ly/2Q6r52A
⁷https://pc.video.dmkt-sp.jp
Initially, many researchers focused on time-series analysis based on the metadata of contents and QoE data. At Nielsen, Sereday and Cui used XGBoost [3] to predict the audience ratings of TV programs in the United States over the next year [4]. More recently, researchers have further applied clustering techniques to the data before view-count prediction. Zhu et al. used Chinese online streaming data to predict the popularity of programs using random forests, after clustering trends with dynamic time warping (DTW) and K-medoids clustering [5]. For newly imported programs, they used GBM to allocate them to existing program trends. A similar study of program prediction using clustering is the estimation of TV popularity ratings using incremental k-means clustering, proposed by D. Anand et al. [6]. Incremental k-means clustering is a technique that registers new content as it enters the database and groups it with the content clusters that already exist. It differs from the earlier k-medoids clustering [5] in how it clusters new programs entering in the middle of the process.
Other research has used text data about the contents to perform view-count prediction. Fukushima et al. proposed a model focused on variables for actors, staff, and directors, rather than on the program's metadata or viewing records, to predict the popularity of contents [7]. They also used as predictors how often an actor was mentioned in social media and how many awards the actor had received. The prediction model was built using support vector regression (SVR). In the movie domain, which is similar to streaming contents, Mestyán et al. predicted a movie's box-office success based on the number of editors contributing to the movie's Wikipedia article and its audience numbers [8]. In short, to predict viewing counts, it is vital to quantify various external variables, in addition to the primary data about the contents, to improve performance.
However, these studies differ from our proposed model, which performs popularity prediction even before the content is published. Their approach is possible because they use time-series models based on the past view counts of specific contents. The methods of previous studies can therefore profoundly improve broadcast-program popularity estimation when the time series of the content is available. However, before these time-series data are accumulated, it is hard to predict the popularity of newly released contents, and a storage optimizer cannot perform properly for them. Consequently, research focusing on predicting the popularity of newly released contents, for which no previous view-count history data exist, remains scarce.
3 Framework
3.1 Problem Formulation
In this research, we propose a model to predict the popularity of new content by using information from previously released contents. We can define the problem as follows. First, $c^t$ is an individual content, an instance of data at the present time $t$. We define the content $\{c^t_{new} \mid c^t_{new} \notin C^t\}$ that has not yet been published and the content $\{c^t_{old} \mid c^t_{old} \in C^t\}$ that has already been released, drawn from the set of all contents $C^t$ observed from time $t-r$ to the present time $t$. Next, we build a model predicting the popularity of content $c^t_{new}$, $P_{c^t_{new}}(c^t_{new}, C^t)$. In $P_{c^t_{new}}(c^t_{new}, C^t)$, $C^t$ is the view-count history log data set, including the contents' metadata, at the present time $t$, and $c^t_{new}$ is the set of metadata of the new content. After its popularity is predicted, the set of $c_{new}$ is included in $C^t$ at the next sequence $t+1$. In the end, the better the prediction, the closer the predicted probability $P_{c_{new}}(t-r, t)$, on a scale of $[0, 1]$, is to the actual value $P_{c_{old}}(t-r+1, t+1)$, which is a binary value of 1 or 0: 1 denotes highly popular content and 0 denotes unpopular content. In this paper, we designate them as hot content and cold content, respectively.
3.2 Method Overview
Our model is divided into three stages. First, we collect historical log data and content metadata from the streaming service database. In this stage, the set of contents is separated into two types: contents having successive previous works and contents not having them. In this paper, we call them type A and type B, respectively. The proposed hybrid model uses different features for each type. For type A, because the view-count data of previous contents is available, we compose a structured dataset including the data of the previous contents. However, for type B, because such related data is not available, we utilize additional metadata in the form of unstructured data. The metadata of video contents includes many kinds of text data, for example, actor names and keywords for the content. By adding this information, we compensate for the shortage of information on type B contents. The overall utilized features are described in Table 1.
The second stage is training the dataset with two different models. The type A dataset has only structured features, while the type B dataset has both structured and unstructured features. Because of the characteristics of each dataset, we apply a tree-based model and a neural-net-based model to the two datasets individually. The XGBoost model is used for the type A dataset. XGBoost is a tree-based model utilizing a technique called boosting. Boosting is a technique for improving prediction accuracy by combining weak classifiers into an ensemble, and XGBoost is an intuitive model characterized by rapid learning
Table 1: Descriptions of Attributes
ID  Feature       Description            Example
1   Payment       Payment method         Pay, Free
2   Type          Program type           Drama, Movie, Ani
3   Genre         Genre details          Mystery, Romance
4   Playtime      Running time (s)       3072, 5047
5   Episode       Episode count          1, 25, 45
6   Age           Age limit              19, 17, 15
7   Channel       Program channel        TBS, NTV
8   Actor         Actor, voice actor     Sung Dong Il, Seo Yu ri
9   Title         Episode title          Detective Conan 7th
10  Keyword       Episode keyword        Japan, famous actor
11  Release       Publish date           20180716
12  Related view  Related works' views   40728, 6587
and classification based on parallel processing. Recently, this model has been widely used and has proved, through data competitions, to be among the most accurate and useful techniques for structured datasets. Next, because a tree-based model is not suitable for unstructured data like text, we utilize a neural network with categorical embedding techniques for type B. Deep learning, which is another name for neural networks, is currently the best-suited approach for unstructured data such as text. A summary of the program types by method is given in Table 2. We use embedding techniques to learn a compressed representation in the latent space of the data. By composing the hybrid model, we achieve overall better performance than a standalone model. The performance of the proposed model is shown in Section 4 below.
Table 2: Description of the popularity contents prediction method by program type.
             Type A                         Type B
Description  Drama, Animation, News, etc.  Movie, Independent film, etc.
Method       XGBoost                       Neural Net (with CE)
The central purpose of the proposed model is to indicate whether the popularity of a content will be high or not, to support deploying large content files to CDNs for efficient network management. The demand for streaming services typically shows a long-tail distribution, which means a few popular contents use most of the network resources. As shown in Fig 1, only a few contents, referred to as hot contents, have high view counts, and the others, referred to as cold contents, have very low view counts. The distribution of recent contents determines the criterion for the popularity label. We set the boundary at the top 20% to separate hot and cold contents, from a network optimization point of view for storage distribution. Based on this boundary, we train the model to predict the hot/cold classification of contents that are not yet released. To compare the performance of the model, we use recall, precision, and F1-score as evaluation metrics. Because the dataset is imbalanced, we focus on accurate prediction for the classification of hot contents. Also, since some durations contain no hot contents among the new releases without past view counts, the performance value for such a duration can be 0; thus, we calculate the overall average of the classification results over all periods.
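As an illustration of the labeling and evaluation scheme above, the following sketch (our own, not from the paper; the view-count series and function names are hypothetical) assigns hot/cold labels at the top-20% view-count boundary and computes the metrics with scikit-learn:

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

def label_hot_cold(view_counts: pd.Series, top_ratio: float = 0.2) -> pd.Series:
    """Label the top `top_ratio` of contents by view count as hot (1), the rest cold (0)."""
    boundary = view_counts.quantile(1.0 - top_ratio)  # top-20% view-count cutoff
    return (view_counts >= boundary).astype(int)

def evaluate(y_true, y_pred):
    """Precision, recall, and F1-score, focusing on the imbalanced hot class."""
    return {"precision": precision_score(y_true, y_pred),
            "recall": recall_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred)}
```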
The overall training algorithm of the framework is depicted in Algorithm 1.
3.3 Deep Neural Network Model using categorical embedding
The type A dataset is typically structured data, often called tabular data, which is well fitted to gradient-based boosting algorithms. However, the type B dataset includes text data, which is unstructured and not fitted to the boosting algorithm. Therefore, we adopt a deep neural net model with an embedding technique to handle the type B dataset.
The categorical data type is one obstacle to applying the neural net model, because a categorical variable is generally represented as a one-hot encoded vector. This approach enlarges the dimensions of the input data, and the learning speed can also be lowered. In addition, a one-hot encoding cannot represent the relationship and semantic distance of each category.
Figure 1: Relationship between content count and view count.
Algorithm 1 Training Algorithm
Input:
  Structured dataset: {S_1, S_2, ..., S_n}
  Unstructured dataset: {T_1, T_2, ..., T_n}
  Length of the used period for characteristic A: r_A
  Length of the used period for characteristic B: r_B
Output:
  Ensemble model with learned parameters
1: Initialize the training instance set U_0
2: if Program Type is characteristic A then
3:   for all available time intervals t (1 ≤ t ≤ r_A) do
4:     Put (S_t) into U_A
5:   end for
6: else
7:   for all available time intervals t (1 ≤ t ≤ r_B) do
8:     H_t = Concat(S_t, T_t)
9:     Put (H_t) into U_B
10:  end for
11: end if
12: Initialize all the learnable parameters
13: repeat
14:   Minimize the objective function within U_A as Boosting
15:   Minimize the objective function within U_B as NN
16: until convergence criterion met
17: return learned Ensemble model
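For concreteness, a rough Python rendering of Algorithm 1 follows (our own sketch, not the authors' code; the `type_a` column and the helper arguments are hypothetical):

```python
import xgboost as xgb

def train_hybrid(contents, labels, structured_cols, nn_inputs_fn, build_nn):
    """Route type A contents to the boosting model and type B contents to the
    embedding neural net, mirroring Algorithm 1."""
    a_mask = contents["type_a"].astype(bool)  # assumed flag: related previous works exist

    # Lines 2-5 / 14: the structured set U_A is fitted with the boosting model
    boost = xgb.XGBClassifier()
    boost.fit(contents.loc[a_mask, structured_cols], labels[a_mask])

    # Lines 6-11 / 15: U_B concatenates structured and text features for the NN;
    # batch size and epochs follow Section 4.2.2
    nn = build_nn()
    nn.fit(nn_inputs_fn(contents.loc[~a_mask]), labels[~a_mask],
           batch_size=64, epochs=50)
    return boost, nn
```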
To address this problem, recent neural net research utilizes embedding techniques that learn a representation of the data in a latent space. For example, the day of the week is an explanatory categorical variable in the view-count dataset. With a proper embedding representation, the variable can be encoded as follows.
Sunday: [0.8, 0.2, 0.1, 0.1]
Monday: [0.1, 0.2, 0.9, 0.9]
Tuesday: [0.2, 0.1, 0.9, 0.8]
Monday and Tuesday are quite similar, but both are quite different from Sunday. The proposed neural network learns the best representation for each category while it is training, and each dimension or direction, which does not necessarily line up with ordinal dimensions, can carry multiple meanings. In this research, we use Continuous Bag-of-Words (CBOW), introduced in the representative embedding technique Word2Vec [9], to build the embedding space for categorical variables.
Figure 2: Overview of the deep neural net with categorical embedding.
To accomplish this, the following process is performed. Given a one-hot encoded input $x_i$, the output of the layer of linear neurons is defined as

$$X_i = \sum_{\alpha} \omega_{\alpha\beta}\, \delta_{x_i \alpha} \quad (1)$$

$X_i$ is the value converted from the discrete variable to a vector. If $m_i$ is the number of values for the categorical variable $x_i$, then $\delta_{x_i \alpha}$ is a vector of length $m_i$ whose elements are non-zero only when $\alpha = x_i$. $\omega_{\alpha\beta}$ is the weight connecting the one-hot encoding layer to the embedding layer, and $\beta$ is the index of the embedding layer. The possible values for $\alpha$ are the same as those of $x_i$.
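To make Eq. (1) concrete, the short sketch below (our own illustration, not from the paper) verifies that multiplying a one-hot vector by the weight matrix $\omega$ is identical to looking up one row of an embedding table:

```python
import numpy as np

m_i, d = 7, 4                        # m_i category values, d embedding dimensions
omega = np.random.randn(m_i, d)      # weights from one-hot layer to embedding layer

x_i = 2                              # e.g. Tuesday encoded as category index 2
delta = np.zeros(m_i)
delta[x_i] = 1.0                     # one-hot vector, non-zero only at alpha = x_i

X_i = delta @ omega                  # Eq. (1): sum over alpha of omega * delta
assert np.allclose(X_i, omega[x_i])  # identical to a simple row lookup
```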
We stack a deep neural net model using the embedding vectors converted from the categorical variables, together with the continuous variables. The proposed model also uses text data, learning both structured and unstructured data through the same layers: all the categorical, numerical, and text data are learned in one model, as shown in Fig 2. First, each categorical variable is converted into a one-hot encoding and then entered independently as an input of the neural net. As can be seen in Fig 2, each encoded input enters the neural net independently, and the values obtained after passing through the embedding and dense layers are concatenated. Based on seven categorical variables, three numeric variables, and two text features, we train the model through the neural net considering all these features. We use a sigmoid activation function for the output layer. Because there is no established method for determining the number of embedding dimensions for a categorical variable, we find suitable hyperparameters through empirical methods.
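A minimal Keras sketch of the architecture in Fig 2 follows (our own reconstruction: the hidden width, vocabulary size, and sequence length are assumptions, while the dropout rate, text dimension, activation, and learning rate follow Section 4.2.2 and Table 3; Keras's built-in Ftrl optimizer stands in for the batch FTRL-Proximal of Section 3.4):

```python
from tensorflow.keras import Model, layers, optimizers

def build_nn(cat_cardinalities, n_numeric=3, text_vocab=20000, text_len=30):
    inputs, branches = [], []

    # One embedding branch per categorical variable (7 in the paper);
    # embedding sizes are chosen between 2 and 30 depending on cardinality
    for m in cat_cardinalities:
        cat_in = layers.Input(shape=(1,))
        emb = layers.Embedding(input_dim=m, output_dim=min(30, max(2, m // 2)))(cat_in)
        inputs.append(cat_in)
        branches.append(layers.Flatten()(emb))

    # Numeric variables enter the shared layers directly
    num_in = layers.Input(shape=(n_numeric,))
    inputs.append(num_in)
    branches.append(num_in)

    # Text features (e.g. title, keywords) embedded into 50 dimensions, then pooled
    txt_in = layers.Input(shape=(text_len,))
    txt = layers.Embedding(text_vocab, 50)(txt_in)
    inputs.append(txt_in)
    branches.append(layers.GlobalAveragePooling1D()(txt))

    # Concatenate all branches and classify hot/cold with a sigmoid output
    h = layers.Concatenate()(branches)
    h = layers.Dense(128, activation="hard_sigmoid")(h)
    h = layers.Dropout(0.05)(h)
    out = layers.Dense(1, activation="sigmoid")(h)

    model = Model(inputs, out)
    model.compile(optimizer=optimizers.Ftrl(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```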
3.4 Optimizer
Because the view counts of online streaming contents have high volatility and sparsity, there are difficulties in adopting gradient-based learning algorithms on a massive-scale dataset of online streaming contents. In this research, the FTRL-Proximal algorithm [2] is employed to handle the sparsity and volatility, and it shows better performance. The FTRL-Proximal algorithm was proposed by H. Brendan McMahan et al. and published by Google, where it provides fast, near-real-time analytical results for search-engine-based ad prediction. As in Google's setting, search-engine history reflects the popularity of certain words, which can be used effectively in view-count prediction; the popularity of certain words in a search engine also has high volatility.
Although the original algorithm learns by online gradient descent, we change it to learn in batches for accuracy. The data we use are not as vast as Google's search history data, and the volatility of our data is relatively smaller than Google's. Moreover, our study requires batch learning rather than online gradient descent because the purpose of our research demands accuracy over learning speed. Therefore, our study approaches the optimization slightly differently from real-time learning methods such as online gradient descent [10]. The modified FTRL updates the weights in the following manner:

$$w_{t+1} = \arg\min_{w}\left( g_{1:t}\cdot w + \frac{1}{2}\sum_{s=1}^{t}\sigma_s \lVert w - w_s\rVert_2^2 + \lambda_1 \lVert w\rVert_1 \right) \quad (2)$$
This algorithm defines $\sigma_s$ in terms of the learning-rate schedule such that $\sigma_{1:t} = \frac{1}{\eta_t}$, where $t$ is the current training batch and $g_t$ is the vector of gradients of the loss function, written with the compressed notation $g_{1:t} = \sum_{s=1}^{t} g_s$. Recall that $g_{s,i}$ is the $i$-th coordinate of the gradient $g_s = \nabla \ell_s(w_s)$ on batch $s$. The learning rate $\eta_t$ is then given, as proposed in [2], by

$$\eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{s=1}^{t} g_{s,i}^2}} \quad (3)$$

where $\alpha$ is set as twice the maximum allowed magnitude for $w_i$ to give the best possible regret bound, and $\beta = 1$ is usually good enough; this simply ensures that early learning rates are not too high [2].
The first term $g_{1:t}\cdot w$ expresses FTL (Follow The Leader): it takes the leader that has the smallest loss up to step $t$ to obtain an approximation of the loss. The second term $\frac{1}{2}\sum_{s=1}^{t}\sigma_s \lVert w - w_s\rVert_2^2$ is the proximal part, which keeps the new weights from fluctuating far from the previously used weights and increases the stability of the algorithm by adding strong convexity. The third term $\lambda_1 \lVert w\rVert_1$ is the L1 regularization part.
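A minimal per-coordinate sketch of this update follows (our own implementation of the closed-form solution of Eq. (2) from McMahan et al. [2], applied to batch gradients as described above; the paper's exact code may differ):

```python
import numpy as np

class BatchFTRLProximal:
    """Per-coordinate FTRL-Proximal following McMahan et al. [2], with each
    step fed the gradient of the loss on one training batch."""
    def __init__(self, dim, alpha=0.1, beta=1.0, lam1=1.0, lam2=0.0):
        self.alpha, self.beta, self.lam1, self.lam2 = alpha, beta, lam1, lam2
        self.z = np.zeros(dim)   # accumulates g_{1:t} - sum_s sigma_s * w_s
        self.n = np.zeros(dim)   # accumulates squared gradients sum_s g_{s,i}^2
        self.w = np.zeros(dim)

    def step(self, g):
        """g: gradient of the loss on the current batch; returns updated weights."""
        # sigma_t = 1/eta_t - 1/eta_{t-1}, with eta from Eq. (3)
        sigma = (np.sqrt(self.n + g**2) - np.sqrt(self.n)) / self.alpha
        self.z += g - sigma * self.w
        self.n += g**2
        # Closed-form argmin of Eq. (2): soft-thresholding by lambda_1 induces sparsity
        mask = np.abs(self.z) > self.lam1
        self.w = np.where(
            mask,
            -(self.z - np.sign(self.z) * self.lam1)
              / ((self.beta + np.sqrt(self.n)) / self.alpha + self.lam2),
            0.0,
        )
        return self.w
```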
If there is no distortion or bias, the more data there is in machine learning, the more accurate and robust the prediction model becomes. However, the media market continually features new types of content, and new actors and new program material are frequently registered. These factors continually inject volatility into the keyword data, and it is difficult for the model to absorb all of this volatility as it learns. We effectively overcome this problem by using the FTRL-Proximal algorithm. The experimental results are shown in Section 4.
3.5 Classification Definition of Pre-released Contents
Since we divide the dataset into two categories and then classify the contents as popular or not, we need to establish the interface definitions between the two models so that we can present a final classification probability for popular contents. First, we predict the popularity probability $P^{A}_{c_n}$ with the boosting model on the characteristic-A dataset, which accounts for more than 70% of $c_n (= c_{new})$. Next, we calculate $P^{B}_{c_n}$ through the deep neural network on the remaining characteristic-B dataset. Indeed, popular contents are concentrated in the characteristic-A dataset, and its data ratio is much higher than that of the characteristic-B dataset. In addition, the performance of our model, described later, exceeds 90%, so if we predict $P^{B}_{c_n}$ after predicting $P^{A}_{c_n}$, we almost converge to the popular-contents boundary that we defined earlier. Of course, we could rank the view-count values of the two datasets by regression at the same time to perform storage allocation. However, predicting unpublished contents through regression is very limited in terms of performance, because the lack of historical data makes it impossible to capture the time series. Finally, we use Eq. 4 to calculate the final predicted probability of content $c_{n,i}$ ($i$ is the current content index) at time $t$ using the information (contents $c_{old}$ already released) available until $t-r$. As described above, since whether or not text data is used depends on the characteristics of the dataset, we set the period $r$ differently, considering the volatility of the text data.

$$\hat{P}_{c_{n,i}}(t-r,\, t) = \begin{cases} \hat{P}_{c_{n,i}}(t - r_A,\, t), & \text{if } i \in A\text{-set} \\ \hat{P}_{c_{n,i}}(t - r_B,\, t), & \text{if } i \in B\text{-set} \end{cases} \quad (4)$$
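Operationally, Eq. (4) is a simple dispatch between the two trained models; a sketch (reusing the hypothetical helpers from the Algorithm 1 sketch above) looks like:

```python
def predict_popularity(content, boost, nn, structured_cols, nn_inputs_fn):
    """Score one new content with the model trained for its type (Eq. 4)."""
    if content["type_a"]:  # A-set: related previous works exist, period r_A
        return boost.predict_proba(content[structured_cols].to_frame().T)[0, 1]
    return float(nn.predict(nn_inputs_fn(content))[0, 0])  # B-set: period r_B
```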
4 Experiments
4.1 Dataset
The dataset in the experiment is taken from one of the top Korean online streaming video service companies. This company produces more than 50 million data logs every month. We collect a dataset including the metadata and historical view counts of the contents from the service. Content that existed before and whose rating can be tracked
is not subject to our analysis; only newly incoming content at the present time T is our analysis target. The features provided by the service company in advance comprise 10 variables, excluding the sequel episode count and the related-works view count. The dataset does not contain any feature based on view-count records as log data; it is the content metadata available before the content is released. Although we use the dataset containing the view-count log to label whether a content is hot or cold when we train the model, we do not have any view-count log when predicting newly released content.
We also add the view count of related episode contents as a variable. This variable is the only view-count-derived data we have before release, and later experiments show that it is the most important variable. This process of extracting as much historical data as possible, from the time-series perspective, is critical for the view-count prediction experiment. However, for content first published on the platform, or for movie contents, extracting variables from related works is limited. Therefore, we first divide the data into two clusters based on the availability of this historical data. The cluster that has access to historical data uses the XGBoost technique, and the cluster that does not uses the neural net with embeddings, which also exploits the unstructured data, such as text, to compensate. Finally, the models categorize hot and cold contents.
4.2 Experimental setting
4.2.1 Experiment environment
In this study, we use Python-based machine learning frameworks to achieve the results. We use Scikit-Learn⁸, fast.ai⁹, and Keras [11], an open-source neural network library. The machine we used has 64 gigabytes of memory and a GTX 1080 GPU with 8 gigabytes of GPU memory.
4.2.2 Parameters Setting
Keras is used to build our neural net model. The batch size is set to 64 and the epoch is set 50. We train the model
with the hard-sigmoid activation function. The vector size for categorical data embeddings is arbitrarily set to 2 to 30,
depending on the number of values of categories per variable. Additionally, The used our model parameters can be
found in Table 3.
Table 3: Parameters of the used Model
XGB Parameter Value NN Parameter Value
Learning rate 0.4 Learning rate 0.001
n estimators 1000 Dropout rate 0.05
Max depth 8 Batch Size 64
Subsample 0.8 Epochs 50
Colsample bytree 0.8 Text dimension 50
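For instance, the XGBoost side of the hybrid model can be instantiated directly from the left column of Table 3 (a sketch; the parameter names follow the xgboost Python API):

```python
import xgboost as xgb

xgb_model = xgb.XGBClassifier(
    learning_rate=0.4,     # Learning rate
    n_estimators=1000,     # n estimators
    max_depth=8,           # Max depth
    subsample=0.8,         # Subsample
    colsample_bytree=0.8,  # Colsample bytree
)
```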
4.3 Prediction Results
Our experimental results are evaluated by comparing the predictions of our model and the actual classification of the
content. Our current data does not have data from previous years prior to January, so there are not enough contents to
qualify new contents in January and February. Therefore, data from March to December 2017 are used in the experiment.
We evaluate the performance with precision, recall, and f1-score and compare it with the techniques previously utilized
to predict the classification of contents popularity. The parameters of the model are empirically found by optimizing the
results of the validation set.
Table 4 shows the results of the comparison with the Random Forest, XGBoost, SVR [12], and neural net models used in similar studies. The performance of our model is higher than that of the other existing models on all performance metrics, over 15,376 cumulative newly published contents for nine months.
Table 5 shows the performance comparison between using categorical embeddings and one-hot encoding for the categorical variables. All models achieve higher performance with the categorical embedding technique, and the highest performance improvement achieved by categorical embedding is for the neural network (MLP).
⁸https://scikit-learn.org/
⁹https://docs.fast.ai/
Table 4: Comparison of performance for RF, XGBoost, SVR, MLP, and our model
Numbers of Programs 15,376
Method Precision Recall F1-Score
RF 0.951 0.737 0.830
XGB 0.942 0.743 0.842
SVR 0.892 0.529 0.738
MLP 0.889 0.772 0.826
Our Model 0.952 0.852 0.895
Table 5: Performance comparison by model with or without Categorical Embedding.
Method F1-Score F1-Score(using categorical embedding)
RF 0.822 0.830
XGB 0.821 0.842
SVR 0.682 0.738
MLP 0.774 0.826
Our Model 0.851 0.895
Fig 3 shows the convergence results on the media streaming dataset. The results show that the FTRL-Proximal optimizer outperforms the other optimization methods, such as RMSprop, Adam [13], and FOBOS [14], by around 5.6% on average. The three baseline algorithms are almost indistinguishable in their trade-off curves on our dataset, but FTRL-Proximal learned faster and showed better performance.
Fig 4 shows the F1-score values produced with different content observation periods r used for the training data. As mentioned in Section 3 above, we designated different observation periods depending on the characteristics of the two datasets. For type A, performance does not rise once the observation period exceeds 10 days; likewise, type B no longer improves after 20 days. We therefore set $r_A = 10$ and $r_B = 20$, as defined in Eq. 4, achieving the best performance and efficient computation time in the experiment.
Figure 3: Comparison of training loss over epochs (0-50) for various optimizer methods: Ftrl-Proximal, Adam, RMSprop, and FOBOS.
Figure 4: Comparison of F1-score as a function of the observation period (0-40 days) for the type A and type B models.
5 Conclusion & Future work
In this paper, we propose a hybrid model for the popularity prediction of newly released video contents. We divide contents into two distinct groups, namely series and standalone contents. To compensate for the shortage of features for standalone contents, unstructured data such as text is utilized together with structured data, and embedding techniques are adopted to train both kinds of data together. To account for the view-count volatility of streaming video contents, we apply the FTRL-Proximal optimizer to the gradient-based learning process of the proposed models. Finally, we test the hybrid model on a real dataset from one of the top streaming companies in Korea, achieving overall better performance compared to the existing standalone models.
Although this research contributes to the early stages of research focusing on the popularity prediction of new, unpublished contents, there are numerous challenges that must be addressed to improve the prediction performance. Before a program is added to the online streaming service, there are important external data that are not considered in our model, such as the number of viewers and audience for each program and movie, movie ratings, the reputation of directors and actors, and sentiment analysis data from the public. A text-based analysis method based on viewers' reputation for content can be used to improve the accuracy of the model in future work. In addition, a well-organized benchmarking dataset of streaming video contents is needed. Because of privacy issues, it is very hard to open a company's dataset to the public; however, discussion in the academic community with commercial organizations is needed.
References
[1] L. Chen, Y. Zhou, and D. M. Chiu. Smart streaming for online video services. IEEE Transactions on Multimedia, 17(4):485–497, April 2015.
[2] H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pages 1222–1230, New York, NY, USA, 2013. ACM.
[3] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
[4] Scott Sereday and Jingsong Cui. Using machine learning to predict future TV ratings. Nielsen Journal, 1(3), February 2017.
[5] C. Zhu, G. Cheng, and K. Wang. Big data analytics for program popularity prediction in broadcast TV industries. IEEE Access, 5:24593–24601, 2017.
[6] D. Anand, A. V. Satyavani, B. Raveena, and M. Poojitha. Analysis and prediction of television show popularity rating using incremental k-means algorithm. International Journal of Mechanical Engineering Technology (IJMET), 9(1):482–489, January 2018.
[7] Y. Fukushima, T. Yamasaki, and K. Aizawa. Audience ratings prediction of TV dramas based on the cast and their popularity. In 2016 IEEE Second International Conference on Multimedia Big Data (BigMM), pages 279–286, April 2016.
[8] Márton Mestyán, Taha Yasseri, and János Kertész. Early prediction of movie box office success based on Wikipedia activity big data. PLOS ONE, 8(8):1–8, August 2013.
[9] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[10] Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS'07, pages 65–72, USA, 2007. Curran Associates Inc.
[11] François Chollet et al. Keras: The Python deep learning library. Astrophysics Source Code Library, 2018.
[12] Johan A. K. Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec):2899–2934, 2009.
    Use of socially generated "big data" to access information about collective states of the minds in human societies has become a new paradigm in the emerging field of computational social science. A natural application of this would be the prediction of the society's reaction to a new product in the sense of popularity and adoption rate. However, bridging the gap between "real time monitoring" and "early predicting" remains a big challenge. Here we report on an endeavor to build a minimalistic predictive model for the financial success of movies based on collective activity data of online users. We show that the popularity of a movie can be predicted much before its release by measuring and analyzing the activity level of editors and viewers of the corresponding entry to the movie in Wikipedia, the well-known online encyclopedia.