Conference PaperPDF Available

Diving Deep into Clickbaits: Who Use Them to What Extents in Which Topics with What Effects?


Abstract and Figures

The use of alluring headlines (clickbait) to tempt the readers has become a growing practice nowadays. For the sake of existence in the highly competitive media industry, most of the on-line media including the mainstream ones, have started following this practice. Although the wide-spread practice of clickbait makes the reader's reliability on media vulnerable, a large scale analysis to reveal this fact is still absent. In this paper, we analyze 1.67 million Facebook posts created by 153 media organizations to understand the extent of clickbait practice, its impact and user engagement by using our own developed clickbait detection model. The model uses distributed sub-word embeddings learned from a large corpus. The accuracy of the model is 98.3%. Powered with this model, we further study the distribution of topics in clickbait and non-clickbait contents.
Content may be subject to copyright.
Diving Deep into Clickbaits: Who Use Them to
What Extents in Which Topics with What Effects?
Md Main Uddin Rony §, Naeemul Hassan §, Mohammad Yousuf \
§Department of Computer and Information Science, \Gaylord College of Journalism and Mass Communication
§The University of Mississippi, \The University of Oklahoma
Abstract—The use of alluring headlines (clickbait) to tempt
the readers has become a growing practice nowadays. For the
sake of existence in the highly competitive media industry, most
of the on-line media including the mainstream ones, have started
following this practice. Although the wide-spread practice of
clickbait makes the reader’s reliability on media vulnerable, a
large scale analysis to reveal this fact is still absent. In this paper,
we analyze 1.67 million Facebook posts created by 153 media
organizations to understand the extent of clickbait practice,
its impact and user engagement by using our own developed
clickbait detection model. The model uses distributed sub-word
embeddings learned from a large corpus. The accuracy of the
model is 98.3%. Powered with this model, we further study the
distribution of topics in clickbait and non-clickbait contents.
The term clickbait refers to a form of web content that
employs writing formulas and linguistic techniques in head-
lines to trick readers into clicking links [1], [2], but does not
deliver on promises 1. Media scholars and pundits consistently
show clickbait content in a bad light, but the industry based
on this type of content has been rapidly growing and reaching
more and more people across the world [3], [4]. Taboola,
one of the key providers of clickbait content, claims 2to
have doubled its monthly reach from 500 million unique
users to 1billion in a single year from March 2015. The
growth of clickbait industry appears to have clear impact on
the media ecosystem, as many traditional media organizations
have started to use clickbait techniques to attract readers and
generate revenue. However, media analysts suggest that news
media risk losing readers’ trust and depleting brand value by
using clickbait techniques that may boost advertising revenue
only temporarily. According to a study performed by Facebook
3,80% users “preferred headlines that helped them decide if
they wanted to read the full article before they had to click
through”. [5] shows that clickbait headlines lead to negative
reactions among media users.
Compared to the reach of clickbait content and its impact
on the online media ecosystem, the amount of research done
on this topic is very small. No large scale study has been
conducted to examine the extent to which different types of
media use clickbait techniques. Little is known about the
extent to which clickbait headlines contribute to user engage-
ment on social networking platforms – major distributors of
web content. This study seeks to fill this gap by examining
uses of clickbait techniques in headlines by mainstream and
unreliable media organizations on the social network. Some
of the questions we answer in this paper are– (i) to what
extent, mainstream and unreliable media organizations use
clickbait? (ii) does the topic distribution of the contents vary in
clickbaity contents? (iii) which type of headlines – clickbait or
non-clickbait – generates more user engagement (e.g., shares,
comments, reactions)?
We first create a set of supervised clickbait classification
models to identify clickbait headlines. Instead of following
the traditional bag-of-words and hand-crafted feature set ap-
proaches, we take a more recent deep learning path that
does not require feature engineering. Specifically, we use
distributed subword embedding technique [6], [7] to transform
the words in the corpus to 300 dimensional embeddings. These
embeddings are used to map sentences to a vector space
over which a softmax function is applied as a classifier. Our
best performing model achieves 98.3% accuracy on a labeled
dataset. We use this model to analyze a larger dataset which
is a collection of approximately 1.67 million Facebook posts
created during 2014–2016 by 68 mainstream media and 85
unreliable media organizations. In addition to identifying the
clickbait headlines in the corpus, we also use the embeddings
to measure the distance between the headline and the first
paragraph, known as intro, of a news article. We use a
word co-occurrence based topic model that learns topics by
modeling word-word co-occurrences patterns (e.g., bi-terms)
to understand the distribution of topics in the clickbait and
non-clickbait contents of each media. Finally, using the data
on Facebook reactions, comments, and shares, we analyzed
the role clickbaits play in user engagement and information
spread. The main contributions of this paper are–
We collect a large data corpus of 1.67 million Facebook
posts by over 150 U.S. based media organizations. Details of
the corpus is explained in Section II. We have made the corpus
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than the author(s) must be honored. Abstracting
with credit is permitted. To copy otherwise, or republish, to post on servers
or to redistribute to lists, requires prior specific permission and/or a fee.
Request permissions from
ASONAM ’17, July 31 - August 03, 2017, Sydney, Australia
2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-4993-2/17/07?/$15.00
2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
available to use for research purpose 4.
We prepare distributed subword based embeddings for the
words present in the corpus. In Section III, we provide a
comparison between these word embeddings and the word2vec
[8], [9] embeddings created from Google News dataset with
respect to clickbait detection.
We perform detailed analysis of the clickbait practice
in the social network from multiple perspectives. Section
IV presents qualitative, quantitative and impact analysis of
clickbait and non-clickbait contents.
We use two datasets in this paper. Below, we provide
description of the datasets and explain the collection process.
Headlines: This dataset is curated by Chakraborty et al. [2].
It contains 32,000 headlines of news articles which appeared
in ‘WikiNews’, ‘New York Times’, ‘The Guardian’, ‘The
Hindu’, ‘BuzzFeed’, ‘Upworthy’, ‘ViralNova’, ‘Thatscoop’,
‘Scoopwhoop’, and ‘ViralStories’. 5Each of these headlines
is manually labeled either as a clickbait or a non-clickbait by
at least three volunteers. There are 15,999 clickbait headlines
and 16,001 non-clickbait headlines in this dataset. We used
this labeled dataset to develop an automatic clickbait classi-
fication model (details in Section III). An earlier version of
this dataset was used in [2], [10]. It had 15,000 manually
labeled headlines with an even distribution of 7,500 clickbait
and 7,500 non-clickbait headlines.
Media Corpus: For large scale analysis, using Facebook
Graph API 6, we accumulated all the Facebook posts created
by a set of mainstream and unreliable media within January
1st, 2014 – December 31st , 2016. The mainstream set consists
of the 25 most circulated print media 7and the 43 most-
watched broadcast media 8(according to Nielson rating [11]).
The unreliable set is a collection of 85 conspiracy, clickbait,
satire and junk science based media organizations. The cate-
gory of each unreliable media is cross-checked by two sources
[12], [13]. Figure 1 shows the number of media organizations
in each category in the dataset along with the percentage.
Overall, we collected more than 2million Facebook posts.
A Facebook post may contain a photo or a video or a link
to an external source. In this paper, we limit ourselves to the
link and video type posts only. This reduces the corpus size
to 1.67 million. For each post, we collect the headline (title of
a video or headline of an article) and the status message. For
a collection of 191,540 link type posts, we also collected the
bodies of the corresponding news articles. All these contents
(headlines, messages, bodies) were used to train a domain
specific word embeddings (details in Section III). We also
gather the Facebook reaction (Like, Love, Haha, Wow, Sad,
Angry) statistics of each post. Table I shows distribution of
the corpus.
7 of newspapers in the United States
Media Category Link Video Total
Mainstream Broadcast 324028 32924 356952
Print 516713 14129 530842
Clickbait 371834 4099 375933
Conspiracy 309122 5841 314963
Junk Science 51923 649 52572
Satire 41046 151 41197
Total 1614666 57793 1672459
Fig. 1. Category distribution of the Media Corpus
The key purpose of this study is to systematically quantify
the extents to which traditional print and broadcast media as
well as “alternative” media – often portrayed as unreliable –
use clickbait properties in contents published on the web. The
first step towards that goal is to identify clickbait and non-
clickbait headlines.
A. Problem Definition
We define the clickbait identification task as a supervised
binary classification problem where the set of classes, C=
{clickbait, non clickbait}. Formally, given X, a set of all
sentences, and a training set Sof labeled sentences hs, ci,
where hs, ci X × C, we want to learn a function γ
such that γ:X → C, in other words, it maps sentences
to {clickbait, non clickbait}. In the following sections, we
describe modeling of the problem and compare performances
of multiple learning techniques.
B. Problem Modeling
In text classification, a traditional approach is to use bag-of-
words (BOW) model to transform text into feature vectors be-
fore applying learning algorithms. [2] followed this approach
and used BOW model along with a collection of hand-crafted
rules to prepare the feature set. However, inspired by the recent
success of deep learning methods in text classification, we
use distributed subword embeddings as features instead of
applying BOW model. Specifically, we use an extension of
the continuous skip-gram model [8], which takes into account
2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
subword (substring of a word) information [6]. We call
this model as Skip-Gramsw. Below, we explain how Skip-
Gramswis used to generate word embeddings.
1) Skip-Gramsw:Given a large corpus W, represented as
a sequence of words, W=w1, . . . , wT, the objective of the
skip-gram model is to maximize the log-likelihood
log p(wc|wt)(1)
where the context Ctis the set of indices of words sur-
rounding wt. In other words, given a word wt, the model
wants to maximize the correct prediction of its context wc.
The probability of observing a context word wcgiven wt
is parametrized using the word vectors. The output of the
model is an embedding for each word which captures semantic
and contextual information of the word. Skip-Gramswworks
in a slightly different way. Rather than treating each word
as a unit, it breaks down words into subwords and wants to
correctly predict the context subwords of a given subword.
This extension allows sharing the representations across words,
thus allowing to learn reliable representations for rare words.
Consider the following example.
Fig. 2. The Skip-Gramsw model architecture. The training objective is to
learn subword vector representations that are good at predicting the nearby
Example 1. “the quick brown fox jumped over the lazy
dog”- take the word “quick” as an example. Assuming sub-
word length as three, the subwords are- {qui, uic, ick}. Skip-
Gramswmodel learns to predict qui,ick in the context given
uic as the input. Assuming “sick” a rare word (absent in the
vocabulary), the embeddings of ick and sic are used to learn
a reliable representation of it.
Figure 2 shows the architecture of the Skip-Gramsw model.
Using neural network, the model learns the mapping between
the output and the input. The weights to the hidden layer form
the vector representations of the subwords. The embedding of
a word is formed by the sum of the vector representations
of its subwords. Formally, given a word wand its set of
subwords SW w, we can calculate the embedding of wusing
the following equation-
sw∈SW w
vsw (2)
where uwis the embedding of wand vsw is the vector
representation of sw. Further details of the Skip-Gramsw model
can be found in [6].
2) Pre-trained Vectors: Note that Skip-Gramsw does not
require Cto learn the embeddings of words in corpus W. It
means that one can use the model on any large corpus of text to
learn the word embeddings irrespective of whether the corpus
is labeled or not. This technique of learning from large text
corpus helps having richer word embeddings which capture
a lot of semantic, conceptual and contextual information. We
use the texts (headlines, messages, bodies) from Media Corpus
to learn word embeddings using this model. In Section III-C,
we present comparison between our pre-trained vectors and
word vectors which were trained on about 100 billion words
[9] from the Google News dataset.
3) Classification: For a labeled sentence hs, ci, we average
the embeddings of words present in sto form the hidden
representation of s. These sentence representations are used
to train a linear classifier. Specifically, we use the softmax
function to compute the probability distribution over the
classes in C. [7] describes the classification process in detail.
C. Evaluation
We use the Headlines dataset to evaluate our classification
model. Section II provides the description of the dataset. We
perform 10-fold cross-validation to evaluate various methods
with respect to accuracy, precision, recall, f-measure, area
under the ROC curve (ROC-AUC) and Cohen’s κ. Table II
shows performances of the methods. To avoid randomness
effect, we perform each experiment 5times and present the
average. There are in total seven methods. We categorize them
based on the use of pre-trained vectors. Note that we report
performances of Chakroborty et al. [2] and Anand et al.
[10] in the table. We keep Anand et al. with the methods
which use pre-trained vectors. Because Anand et al. used
word embeddings trained on about 100 billion words from
the Google News dataset using the Continuous Bag of Words
architecture [9]. Each word embedding has 300 dimensions.
Both of these works [2], [10] used a smaller and earlier
version of the Headlines dataset. Moreover, the training and
test sets of the earlier dataset are not available. So, we could
not compare our methods with them using the same test bed.
The Skip-Gramswmodel, even without pre-trained vectors,
significantly outperforms the BOW based Chakroborty et al.
It achieves a f-measure score of 0.975 (2.5% higher than
Chakroborty et al.) and a κscore of 0.952. Powered with the
pre-trained vectors, Skip-Gramswperformed even better. We
used the same word embeddings provided by [9] as well as
our own Media Corpus. Regarding the later, we experimented
with three combinations- pre-trained vectors learned from the
content headlines only, from headlines and messages, and from
headline, bodies and messages. We set embedding size to 300
2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
dimensions while learning from these combinations. For the
methods which were applied on the full Headlines dataset,
we highlight the top performance in each column. Skip-
Gramswalong with pre-trained vectors from headlines, bodies
and messages performed the best among all the variations.
We realize that the differences of the measure values among
the methods are small. However, we understand that making
a small improvement while working above the 0.95 range, is
Media Corpus has 477,236 unique embeddings where
Google News dataset provided 100 billion embeddings. One
interesting observation is, even though the size of our Media
Corpus is significantly smaller than the Google News dataset,
it contributes more to the clickbait classification task. It can be
rationalized as, the embeddings from Media Corpus have more
domain specific knowledge than the Google News dataset. We
release these pre-trained embeddings for research purpose 9.
With this powerful clickbait classification model [Skip-
Gramsw+(Headline+Body+Message)], we move forward and
perform large scale study on the clickbait practice by a range
of media on social network (Facebook). A small random
sample of shared headlines and their corresponding labels is
presented for the readers 10.
We analyze the clickbait practice in Facebook using the
Media Corpus from three perspectives.
A. Quantitative Analysis
To understand the extent of clickbait practice by different
media and their categories, we applied the clickbait detection
model on their contents; particularly on the headline/title of the
link/video type posts. From now onward, we will use the term
headline to denote both the headline of a link content (article)
and the title of a video content. Table III shows amounts of
clickbaits and non-clickbaits in the headlines of mainstream
and unreliable media. Out of 887,794 posts by mainstream
media, 297,774 (33.54%) have clickbait headlines. In unreli-
able media, the ratio is 39.26% (308,095 clickbait headlines
out of 784,665). Based on these statistics, the percentage
appears to be surprisingly high for the mainstream. We zoom
into the categories of these two media to analyze the primary
proponents of the clickbait practice. We find that between the
two categories of mainstream media, broadcast uses clickbait
47.56% of the times whereas print only uses 24.12%. We
further zoom in to understand the high percentage in the broad-
cast category. The Media Corpus has 43 broadcast media. We
manually categorize them into news oriented broadcast media
(e.g. CNN,NBC, etc.) and non-news (lifestyle, entertainment,
sports, etc.) broadcast media (e.g. HGTV,E!, etc.). There are
6news oriented broadcast media and 37 non-news broadcast
media. We find that the ratio of clickbait and non-clickbait
is 61.64% in non-news type broadcast media whereas it is
only 22.32% (close to print media) in news oriented media.
Figure 4 shows kernel density estimation of the clickbait
percentage both for news and non-news broadcast media. It
clearly shows the difference in clickbait practice in these
two sub-categories. Most of the news type broadcast media
has about 25% clickbait contents. On the other hand, the
percentage of clickbait for non-news type broadcast media has
a wider range with peak at about 60%. In case of unreliable
media, unsurprisingly all the categories have high percentage
of clickbaits in their headlines. In Figure 3, we show the
percentage of clickbait in video and link type posts for each
of the media categories. Satire is leading in both link and
video type posts. Print and conspiracy media have the lowest
clickbait practice among all the media categories in link and
video type posts, respectively. Table V shows the top-5
clickbait proponents in each media category.
Fig. 3. Percentage of clickbaits in
link and video headlines.
Fig. 4. Broadcast (News) vs.
Broadcast (Non-news).
Fig. 5. Frequency of link re-post by different media.
B. Qualitative
Topic distribution: To understand the topics in the clickbait
and non-clickbait contents, we applied topic modeling on all
the headlines of each category. One concern about applying
the traditional topic modeling algorithms (e.g. Latent Dirichlet
Allocation, Latent Semantic Analysis) on our corpus is, they
focus on document-level word co-occurrence patterns to dis-
cover the topics of a document. So, they may struggle with
the high word co-occurrence patterns sparsity which becomes
a dominant factor in case of shorter context. That is why we
use Biterm Topic Modeling (BTM) [14] which generates the
topics by directly modeling the aggregated word co-occurrence
patterns of a short document.
Table IV shows 5topics in clickbait and non-clickbait
contents for each media category. Each topic is represented
2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
Method Precision Recall F-measure Accuracy Cohen’s κROC-AUC
Without Pre-trained Vectors
*Chakroborty et al. [2] 0.95 0.90 0.93 0.93 NA 0.97
Skip-Gramsw 0.976 0.975 0.975 0.976 0.952 0.976
With Pre-trained Vectors
*Anand et al. [10] 0.984 0.978 0.982 0.982 NA 0.998
Skip-Gramsw+Google word2vec 0.977 0.977 0.977 0.976 0.951 0.976
Skip-Gramsw+ (Headline) 0.981 0.981 0.981 0.981 0.962 0.981
Skip-Gramsw+(Headline + Message) 0.982 0.982 0.982 0.982 0.964 0.982
Skip-Gramsw+(Headline + Body + Message) 0.983 0.983 0.983 0.983 0.965 0.983
* Their experiments were performed on a smaller and earlier version of the Headlines dataset.
Media Category Clickbait Non-clickbait Clickbait (%)
Broadcast 169752 187200 47.56
Print 128022 402820 24.12
Clickbait 172271 203662 45.82
Conspiracy90389 224574 28.7
Junk Science 23637 28935 44.96
Satire 21798 19399 52.91
by a set of 10 words. The words are ordered by their signifi-
cance in the corresponding topic. The modeling indicates that
clickbait headlines in print and broadcast media vary in tones
and subject matters from their non-clickbait headlines to a
great extent. Clickbait headlines in these media represent more
personalized, sensationalized and entertaining topics, while
non-clickbait headlines highlight topics of collective problems
such as public policies and civic affairs. But this variation
is not much evident in unreliable media that use clickbait
headlines indiscriminately across all topics.
The model highlights some differences in clickbait topics
between print and broadcast media. Most clickbait topics
in print media, three out of five, are about U.S. President
Donald Trumps views on women. Each of these three topics
include all of these four words: Trump, woman, make, new.
A manual search shows that print news media often used
clickbait techniques (e.g., question based headline) in stories
about Trump and women. For instance, “Did Donald Trump
really say those things?” was the headline of a Washington
Post article dated July 25, 2016. The headline of a New York
Times story from May 14, 2016, reads; “Crossing the Line:
How Donald Trump Behaved With Women in Private.”
Most clickbait topics in broadcast media are about enter-
tainment (e.g., Kylie Jenners new music video; Kardashians
new baby) and lifestyle (e.g., food and health). Two topics
appeared to touch Donald Trump and his opponent Hillary
Clinton. Clickbait topics in unreliable media, however, range
from politics to lifestyle. At least three topics appeared to be
about politics in which key words include, Trump, Hillary,
Obama, Muslim, Cop, and Woman. One topic is about food
and health while another is unclear.
Non-clickbait topics remain similar across all three media
types, which primarily focus on law and order, and U.S.
presidential election campaign. Twelve out of 15 topics – all
five in print, three in broadcast, and four in unreliable – are
about these two areas. One broadcast topic appears to be about
sports and one is unclear. One unreliable topic is about food
and health.
Clearly, the topic modeling is dominated by the 2016’s posts
as we see a lot of words related to the U.S. election. To get
the original picture of the topics in clickbait and non-clickbait
contents, we apply topic modeling on the headlines of each
year. The result can be observed here 11 . We see that topics
related to politics are not prevalent in 2014 and 2015 as they
are in 2016. Clickbait contents of broadcast media in 2014 are
mostly non-political. Their primary focus was entertainment
which later shifted towards politics afterwards.
Headline-Body similarity: One limitation of Skip-
Gramswis, it only considers the headline to determine whether
it is a clickbait or not. The body of the news, is not considered
as a factor in defining the headline. An attractive headline can
be highly relevant to the content/body of a news or it can be
very loosely related to the news. Our model is not capable
of making the distinction. A metric is required to measure the
similarity between the headline and the content to determine if
the headline fairly represents the content. In future, we want
to systematically incorporate the headline-body similarity in
defining the clickbaitiness. Nonetheless, here we measure how
similar the clickbait and non-clickbait headlines are to the
corresponding bodies using a simple approach. We assume
that the first para of an article represents the summary of the
whole news [15] and use cosine similarity to measure the
similarity between the headline and the sentences in the first
para. We use bag-of-words model to transform the sentences
into vectors before applying cosine similarity. In future, we
plan to use our word embeddings to create the vectors instead.
Figure 6 shows the kernel density estimation of the headline-
body similarity in clickbait and non-clickbait contents posted
by different media. One observation is, in print media non-
clickbait headlines are closer to their summary than clickbait
headlines. In broadcast media, the difference is less clear and
in unreliable media the difference is almost absent.
2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
Media Clickbait Non-Clickbait
T1:new, trump, woman, get, man, say, chicago, star, show, make T1:man, police, la, vega, killed, woman, found, new, dead, shooting
T2:best, thing, day, new, cleveland, get, know, la, 2015, 2016 T2:new, cleveland, police, chicago, state, say, get, win, chronicle, game
T3:new, best, thing, day, year, get, star, make, know, kid T3:new, trump, us, say, state, city, bay, get, million, could
T4:trump, new, donald, woman, get, make, 2016, say, time, thing T4:new, news, trump, say, york, us, hawaii, clinton, man, police
T5:boston, trump, donald, clinton, say, new, woman, get, make, thing T5:boston, trump, say, new, clinton, sox, us, win, state, red
T1:new, movie, make, video, get, star, time, jenner, music, like T1:police, new, say, shooting, man, us, killed, death, found, woman
T2:new, baby, get, star, first, kardashian, see, show, jenner, look T2:trump, clinton, say, obama, new, us, news, gop, police, campaign
T3:trump, photo, day, donald, get, best, dog, clinton, make, national T3:national, new, found, photo, us, world, geographic, may, year, day
T4:history, know, thing, woman, episode, new, full, man, may, trump T4:win, game, say, team, player, first, sport, get, new, nba
T5: day, way, food, like, recipe, best, thing, holiday, home, make T5: new, show, first, get, star, news, dead, baby, day, say
T1:trump, hillary, donald, clinton, chick, obama, say, get, make, one T1:trump, republican, say, new, obama, gop, clinton, get, sander, police
T2:woman, man, cop, black, get, people, muslim, white, girl, say T2:human, us, new, cancer, health, world, vaccine, found, food, study
T3:video, american, eagle, obama, us, people, america, hillary, trump, world T3:us, obama, video, police, war, new, military, isis, newsredflag, state
T4:thing, people, make, woman, like, get, way, new, man, know T4:trump, hillary, clinton, obama, new, email, say, news, us, make
T5:day, reunionfather, human, food, way, thing, health, life, reason, people T5:obama, eagle, muslim, say, gun, us, christian, american, isis, america
Media Name Clickbait Non-clickbait Clickbait (%)
VH1 13760 1339 91.13
AmplifyingGlass 692 71 90.69
MTV 42313 4492 90.4
ClickHole 8250 930 89.87
Reductress 3984 484 89.17
VH1 13760 1339 91.13
MTV 42313 4492 90.4
Bravo TV 8263 1242 86.93
Food Network 2990 492 85.87
OWN 474 118 80.07
Washington Post 13905 15158 47.84
New York Post 11977 13910 46.27
Dallas Morning News 3982 8232 32.6
USA Today 8538 20282 29.63
Houston Chronicle 8481 21618 28.18
AmplifyingGlass 692 71 90.69
ClickHole 8250 930 89.87
Reductress 3984 484 89.17
Food Babe 2387 638 78.91
Chicks on the Right 14185 4977 74.03
C. Impact
To measure the reachability and user engagement of click-
bait and non-clickbait contents, we use Facebook reactions,
comments and shares as metrices. Figure 7 shows number
of comments, shares and reactions (summation of like, haha,
wow, sad, angry, happy, love) of an average clickbait and non-
clickbait post in each media category. Blue areas indicate that
Media Category Clickbait Status Non-clickbait Link Clickbait Status (%)
Mainstream Broadcast 84192 176177 32.34
Print 164669 379504 30.26
Unreliable Clickbait 91747 157886 36.75
Conspiracy46851 190477 19.74
Junk Science 12764 28349 31.05
Satire 7425 14453 33.94
Fig. 6. Headline-Body similarity in clickbait and non-clickbait contents.
on average, a clickbait post (link or video) receives more at-
tention (reactions/shares/comments) than a non-clickbait post.
Green areas indicate the opposite. Clickbait contents receive
more attention and reach to more users in general. One
exception is the broadcast media.
We also analyze how often a news article is re-posted in
Facebook. Figure 5 shows number of times a link is re-
posted by a media. Each bar represents a news link. The height
indicates how many times this link was posted in Facebook
by the colored media category. We only consider the links
which were re-posted at least 20 times. Compare to others,
conspiracy media organizations repeat the same link more.
This is observed both for clickbait and non-clickbait. Clickbait
media seem to repeatedly posting same clickbait links more
than others.
Other than headlines, the media organizations also practice
using clickbait in the Facebook status message itself. Table
VI shows use of clickbait status for non-clickbait articles by
different media. A general observation is, the practice is there
to allure the readers by giving clickbaity message posts even
for non-clickbaity news contents. Unsurprisingly, the clickbait
media category is leading in this practice.
Even though clickbait is a relatively nascent term, its
traces can be found in several journalistic concepts such as
tabloidization and content trivialization. The linguistic tech-
niques and presentation styles, employed typically in clickbait
headlines and articles, derived from the tabloid press that baits
2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
Fig. 7. Top: Print media, Middle: Broadcast media, Bottom: Unreliable media. Blue areas indicate that on average, a clickbait post (link or video) receives
more attention (reactions/shares/comments) than a non-clickbait post. Green areas indicate the opposite.
readers with sensational language and appealing topics such
as celebrity gossip, humor, fear and sex [1]. Clickbait articles
are also similar to tabloid press articles in terms of story
focus, which puts emphasis on the entertaining elements of an
event rather than the informative elements. The Internet, and
especially social media, have made it easier for clickbait
practitioners to create and publish at a larger scale and reach
a broader audience faster than before [16]. In the last several
years, academics and media organizations have studied this
phenomenon from several perspectives.
Clickbait– Properties, Practice and Effects: There have
been a small number of studies, some conducted by academic
researchers and others by media firms, which examined
correlations between headline attributes and degree of user
engagement with content. Some media market analysts and
commentators [17] discussed various aspects of this practice.
However, we have found no research that gauges the
extent of clickbait practice by mainstream and alternative
media outlets on the web. Nor have we found any study that
examined whether clickbait techniques help increase user
engagement on social media.
A journalism professor [1] manually examined content of
four online sections of the Spanish newspaper El Pais, which
apparently used clickbait features to capture attention. The
corpus included only 151 articles published in June 2015.
The articles in the corpus appeared to emphasize anecdotal
aspects, or issues with little value, and curiosities. The study
identified various linguistic techniques used in headlines of
these articles such as orality markers and interaction (e.g.,
direct appeal to the reader), vocabulary and word games (e.g.,
informal language, generic or buzzwords), and morphosyntax
(e.g., simple structures). Researchers at the University of
Texas’s Engaging News Project [5] conducted an experiment
on 2,057 U.S. adults to examine their reactions to clickbait
(e.g., question-based headlines) and traditional news headlines
in political articles. They found that clickbait headlines led
to more negative reactions among users than non-clickbait
headlines. Interestingly, the same users were slightly more
engaged with non-traditional media that tend to use clickbait
techniques more often. This finding questions the conventional
belief that user reactions may predict user engagement, and
warrants large-scale investigations.
Chartbeat, an analytics firm that provides market intelligence
to media organizations, tested 10,000 headlines from
over 100 websites for their effectiveness in engaging users
with content [18]. The study examined 12 ‘common tropes’
in headlines – a majority of them are considered clickbait
techniques – and found that some of these tropes are more effective
than others. Some media pundits interpreted the findings of
this study as clickbait being detrimental to traditional news.
HubSpot and Outbrain, two content marketing platforms
that distribute clickbait contents across the web, examined
millions of headlines to identify attributes that contribute
to traffic growth, increased engagement, and conversion of
readers into subscribers. The study suggested that clickbait
techniques may increase temporary engagement [19], but an
article must deliver on the promises made in its headline for
users to return and convert.
Automated Clickbait Detection: [2], [10], [20], [21] study
automated detection of clickbait headlines using natural
language processing and machine learning. [21] collects 10,000
headlines from Buzzfeed, Clickhole, and The New York Times
(NYT) and uses Logistic Regression to create a supervised
clickbait detection model. It assumes all Buzzfeed and Clickhole
headlines are clickbait and all NYT headlines are non-clickbait.
We would argue that this makes the model susceptible
to personal bias, as it overlooks the fact that many
Buzzfeed contents are original and non-clickbaity and that
there is clickbait practice in NYT [22]. Moreover, Buzzfeed
and NYT usually write on very different topics; the model
might have been trained merely as a topic classifier. [20]
attempts to detect clickbaity tweets on Twitter by using common
words occurring in clickbaits, and by extracting some tweet-specific
features. [2] uses a dataset of 15,000 manually labeled
headlines to train several supervised models for clickbait
detection. These methods heavily depend on a rich set of
hand-crafted features which take a good amount of time to
engineer and are sometimes specific to the domain (for example,
tweet-related features are specific to Twitter data and inapplicable
to other domains). [10] presents a clickbait detection model
which uses word embeddings and a Recurrent Neural Network
(RNN). These works consider the structure and semantics of a
headline to determine whether it is a clickbait or not. However,
one important aspect, the body of the news, is not considered
as a factor in these works at all. We would argue that
the headline alone does not fully represent whether an
article is a clickbait or not. If a headline represents the body
fairly, it should not be considered a clickbait. Consider the
title 'The Top 10 Mistakes Of Entrepreneurs' as an example.
It is as clickbait of a headline as it can be. However, the body
actually contains reasonably decent material, which might be
interesting to many people.
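The point that headline–body congruence matters can be illustrated with a simple proxy. This is a hedged sketch, not a method from the paper or the cited works: it scores the bag-of-words cosine similarity between a headline and its body, on the intuition that a headline which fairly represents the body shares more of its vocabulary:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

headline = "The Top 10 Mistakes Of Entrepreneurs"
body = "Entrepreneurs often repeat the same top mistakes when starting out."
# A headline that fairly represents its body scores higher than one that does not.
print(round(cosine_similarity(headline, body), 2))
```

A real system would replace raw word overlap with embedding-based similarity, but even this crude measure separates headlines that summarize their body from those that merely tease it.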
Clickbait Generation: [23]–[25] present automated clickbait
generation tools. [23] trains an RNN model using 2
million headlines collected from Buzzfeed, Gawker, Jezebel,
Huffington Post, and Upworthy. The model is then used to
produce new clickbait headlines.
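As a much simpler stand-in for the RNN generators cited above, a bigram Markov chain over headline words already captures the flavor of template-driven clickbait generation. This sketch is illustrative only, swaps the RNN for a Markov chain, and uses toy training headlines rather than the 2-million-headline corpus:

```python
import random
from collections import defaultdict

# Toy training headlines (illustrative; the cited work trains on ~2M real ones).
headlines = [
    "You Won't Believe What Happened Next",
    "You Won't Believe These 10 Facts",
    "These 10 Facts Will Shock You",
]

# Build bigram transitions: word -> list of observed next words.
transitions = defaultdict(list)
for h in headlines:
    words = ["<s>"] + h.split() + ["</s>"]
    for cur, nxt in zip(words, words[1:]):
        transitions[cur].append(nxt)

def generate(seed: int = 0, max_len: int = 12) -> str:
    """Random-walk the bigram chain from <s> until </s> or max_len words."""
    rng = random.Random(seed)
    word, out = "<s>", []
    while len(out) < max_len:
        word = rng.choice(transitions[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate())
```

An RNN generalizes this idea by conditioning each next word on the full prefix (via its hidden state) rather than on the previous word alone, which is why it can produce novel headlines instead of stitched fragments.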
In this paper, we introduced a word-embedding based clickbait
detection system built on our own collected corpus of news
headlines and contents. We showed that our model performs
better than embeddings based on the Google News dataset.
Our analysis also reveals that mainstream media are increasingly
engaging in clickbait practice. Close scrutiny of the social
media posts reveals that broadcast media use clickbait at a
higher rate than print media, with non-news broadcast media
contributing most of it. Our study also confirms the expected
fact that unreliable media use clickbait at a still higher rate.
Moreover, results from our topic modeling indicate that clickbait
practice is prevalent in personalized and entertainment-oriented
areas. In the future, we want to incorporate the content of the
news in defining the clickbaitiness of a headline. We believe
such a system would help social networking platforms curb the
problem of clickbait and provide a better user experience.
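The sub-word idea behind the detection model can be sketched compactly. This is a minimal stand-in, not the paper's fastText-based system: it featurizes headlines as character trigrams (the core of sub-word embeddings) and trains a tiny perceptron on toy labeled headlines, all of which are assumptions for illustration:

```python
from collections import defaultdict

def char_ngrams(text: str, n: int = 3) -> list:
    """Character trigrams with boundary markers, fastText-style."""
    padded = f"<{text.lower()}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Toy labeled headlines (1 = clickbait, 0 = non-clickbait; illustrative only).
train = [
    ("You Won't Believe What She Did Next", 1),
    ("10 Tricks Doctors Don't Want You To Know", 1),
    ("Senate Passes Appropriations Bill", 0),
    ("Central Bank Raises Interest Rates", 0),
]

# A minimal perceptron over sub-word features: a crude stand-in for
# learned sub-word embeddings followed by a linear classifier.
weights = defaultdict(float)
for _ in range(20):
    for text, label in train:
        feats = char_ngrams(text)
        pred = 1 if sum(weights[f] for f in feats) > 0 else 0
        if pred != label:
            for f in feats:
                weights[f] += 1.0 if label == 1 else -1.0

def predict(text: str) -> int:
    return 1 if sum(weights[f] for f in char_ngrams(text)) > 0 else 0

print([predict(t) for t, _ in train])
```

Because features are character n-grams rather than whole words, the classifier can score headlines containing words it never saw in training, which is the practical advantage of sub-word representations over word-level ones.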
REFERENCES
[1] D. Palau-Sampio, "Reference press metamorphosis in the digital context: clickbait and tabloid strategies in," Communication & Society, vol. 29, no. 2, 2016.
[2] A. Chakraborty, B. Paranjape, S. Kakarla, and N. Ganguly, "Stop clickbait: Detecting and preventing clickbaits in online news media," in Advances in Social Networks Analysis and Mining (ASONAM), 2016 IEEE/ACM International Conference on. IEEE, 2016, pp. 9–16.
[3] M. S. Luckie, "Adele and the death of clickbait," http://, 2015.
[4] C. Sutcliffe, "Can publishers step away from the brink of peak content?" step-away-from-the-brink-of-peak-content, 2016.
[5] J. M. Scacco and A. Muddiman, "Investigating the influence of 'clickbait' news headlines," content/uploads/2016/08/ENP-Investigating-the-Influence-of-Clickbait-News-Headlines.pdf, 2016.
[6] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," arXiv preprint arXiv:1607.04606, 2016.
[7] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of tricks for efficient text classification," arXiv preprint arXiv:1607.01759, 2016.
[8] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[9] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[10] A. Anand, T. Chakraborty, and N. Park, "We used neural networks to detect clickbaits: You won't believe what happened next!" arXiv preprint arXiv:1612.01340, 2016.
[11] J. Eggerton, "FCC: Nielsen DMAs still best definition of TV market," Broadcasting & Cable, 2016.
[12] "Unreliable/fake news sites & sources," C1dixJSIYqQUaXVZBqsJs, 2016.
[13] M. Zimdars, "My 'fake news list' went viral. But made-up stories are only part of the problem," up-stories-are-only-part-of-the-problem, 2016.
[14] X. Yan, J. Guo, Y. Lan, and X. Cheng, "A biterm topic model for short texts," in Proceedings of the 22nd International Conference on World Wide Web. ACM, 2013, pp. 1445–1456.
[15] R. S. Izard, H. M. Culbertson, and D. A. Lambert, Fundamentals of News Reporting. Kendall/Hunt Publishing Company, 1994.
[16] M. Ingram, "The internet didn't invent viral content or clickbait journalism – there's just more of it now, and it happens faster," https://didnt-invent-viral-content-or-clickbait-journalism-theres-just-more-of-it-now-and-it-happens-faster.
[17] F. Filloux, "Clickbait is devouring journalism but there are ways out," but-there-are-ways-out/, 2016.
[18] C. Breaux, "You'll never guess how Chartbeat's data scientists came up with the single greatest headline," how-chartbeats-data-scientists-came-up-with-the-single-greatest-headline.
[19] HubSpot and Outbrain, "Data driven strategies for writing effective titles & headlines," pdf/Data Driven Strategies For Writing Effective Titles and.
[20] M. Potthast, S. Köpsel, B. Stein, and M. Hagen, "Clickbait detection," in European Conference on Information Retrieval. Springer, 2016.
[21] A. Thakur, "Identifying clickbaits using machine learning," learning-abhishek-thakur, 2016.
[22] N. Hurst, "To clickbait or not to clickbait? An examination of clickbait headline effects on source credibility," Ph.D. dissertation, University of Missouri–Columbia, 2016.
[23] L. Eidnes, "Auto-generating clickbait with recurrent neural networks," clickbait-with-recurrent-neural-networks, 2015.
[24] C. Cha, "Clickbait generator," 2016.
[25] "Linkbait title generator."