All content in this area was uploaded by BK Sarthak Das on Mar 20, 2015
Analyzing Advertisements on Twitter during Valentine’s Month
Ashwin Satyanarayan
Bk Sarthak Das
Divya Krishnan
Contents
1. Background
2. Methods
a. Data Collection
b. Data Cleaning
c. Codebook
3. Ethical Analysis
4. Limitations
5. Results
6. Conclusion
7. Interactive Visualization
References
1. Background
Twitter is one of the most influential social networking platforms, with over 284 million
active users as of December 2014 (Keach, 2014). Unlike Facebook, Twitter follows a
follower/following model in which users subscribe to other “handles” (users, organizations,
communities, etc.) to view their activity and receive their tweets. A study conducted by
the market research firm Pear Analytics analyzed 2000 tweets over a period of 2 weeks
(Kelly, 2009). The study separated these tweets into six categories, of which Self-Promotion
comprised 6% of the 2000 tweets. According to Valentine’s Day 2015: Shopping Factoids
Inspired by Cupid (2015), over 50% of people in the United States would be celebrating
Valentine's Day, people were estimated to spend about $20 billion on it, and an estimated
25% of people would shop online for it. Valentine's Day is therefore a good opportunity for
businesses and individuals to sell their products. Building on these ideas, we believe that
Twitter can provide valuable insights for our research question:
RQ: “What type of advertisements have greater propagation in Twitter during
Valentine’s Month?”
To answer this question, we classify advertisements (ads) as direct or indirect to
understand which type fares better in terms of propagation. We measure propagation by the
retweet count of a tweet. Through our analyses, following data collection, we expect to be
in a better position to describe how organizations, communities, groups, individuals, etc.
promote their brands, products and services through tweets during Valentine’s Month
(3rd February, 2015 through 28th February, 2015). The following sections explain the tools
used for data collection, the data cleaning process, the reasoning behind our codebook and
coding strategy, the ethical analysis of the overall research, and the limitations and
anomalies encountered over the course of the research. Finally, we draw inferences about
the propagation of different types of advertisements using quantitative and qualitative
analysis.
2. Methods
a. Data Collection
We used DiscoverText as our tool for collecting Twitter data, through its built-in
functionality for connecting to the Twitter Search API. This feature required us to enter
Twitter login credentials in order to be authorized to acquire the live Twitter data feed.
We referred to the Twitter Search API documentation to understand how to use it to collect
data. To limit collection to tweets relevant to Valentine’s Day, we used the keyword
“Valentine’s Day” in the Search API. Based on Twitter’s Search API, DiscoverText collected
tweets that matched the keywords “Valentine’s” or “Day”. DiscoverText enabled us to
schedule our feeds, allowing us to set our collection dates between 3rd February and 28th
February, the period we believed would give us rich data for our analysis. We chose to
start data collection about 10 days before Valentine’s Day, on 3rd February, based on our
assumption that organizations or groups would begin advertising on Twitter well before an
event such as Valentine’s Day. We also felt that the residual period after the event would
still contain relevant advertisements. Moreover, as we are looking at propagation, we
thought it would be a good idea to consider the entire month of February for analyzing the
advertisements. This allowed us to observe advertisements that had an extended life cycle
on Twitter, and thereby estimate their propagation through retweet count. By the end of
data collection, we had gathered 1.6 million tweets in our archive.
b. Data Cleaning
DiscoverText offers a range of features that facilitated our data cleaning process. The
1.6 million tweets that we collected in our archive over the period of 26 days include
every tweet that contains a reference to “Valentine’s” or “Day”. That is, the resulting
archive contains tweets that can be classified as advertisements, as per our codebook,
among many other non-relevant tweets. Additionally, many tweets in our archive were
duplicates with identical text content.
Our first step towards cleaning the archive entailed the deduplication of tweets.
Since we had to primarily analyze text data, deduplication was vital in order to produce
better results and effective data analysis (Iris, 2014). The deduplication process in
DiscoverText allows us to create clusters of identical tweets and segregate these
clusters from single items. Through deduplication, we gathered 62542 clusters and
633349 single items, where each cluster had identical tweets and hence qualified as a
single tweet – in terms of content. Since we are interested in determining the
propagation of advertisements in Twitter, we have eliminated the 633349 single items
after deduplication. The reason for this elimination is that the single items have not
been retweeted even once. Although we understand that some of these single item
tweets might be advertisements, we are not interested in those tweets since the
retweet count is zero. From the perspective of our research question, these
advertisements did not manage to propagate during the period of data collection.
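The deduplication itself was performed inside DiscoverText. As a rough illustration of the
idea only, the following Python sketch groups tweets with identical text into clusters and
separates out the single items. The function name `split_clusters_and_singles`, the
exact-match rule and the sample tweets are assumptions made for this sketch, not a
description of DiscoverText's actual behavior.

```python
from collections import defaultdict


def split_clusters_and_singles(tweets):
    """Group tweets with identical text; return (clusters, singles).

    clusters: dict mapping text -> number of copies (2 or more)
    singles:  list of texts that occur exactly once
    """
    counts = defaultdict(int)
    for text in tweets:
        # Assumption: two tweets are duplicates when their text matches
        # exactly after trimming surrounding whitespace.
        counts[text.strip()] += 1

    clusters = {text: n for text, n in counts.items() if n > 1}
    singles = [text for text, n in counts.items() if n == 1]
    return clusters, singles


# Hypothetical sample data, not taken from the study's archive.
sample = [
    "RT @shop: Valentine's Day sale!",
    "RT @shop: Valentine's Day sale!",
    "Happy Valentine's Day to myself",
]
clusters, singles = split_clusters_and_singles(sample)
```

In this toy input the repeated retweet forms one cluster of size two, while the tweet that
appears only once ends up among the single items, which the cleaning step discards.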
We created a bucket to store the clusters resulting from deduplication in DiscoverText.
We then employed DiscoverText’s random sampling mechanism to create a dataset comprising
10% of the 62542 clustered tweets in the bucket. The following section describes our
coding strategy and the steps and rationale behind it.
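As a quick check on the sample size, 10% of 62542 is 6254.2, which rounds up to 6255 items.
The sketch below shows one way such a simple random sample could be drawn in Python. How
DiscoverText's own sampling mechanism works is not described here, so the rounding rule,
the seed and the helper name `sample_fraction` are assumptions.

```python
import math
import random


def sample_fraction(items, fraction, seed=None):
    """Return a simple random sample containing `fraction` of `items`."""
    # Assumption: the sample size is rounded up to a whole item.
    k = math.ceil(len(items) * fraction)
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return rng.sample(items, k)


# Stand-in identifiers for the 62542 clusters in the bucket.
cluster_ids = list(range(62542))
dataset = sample_fraction(cluster_ids, 0.10, seed=0)
```

Sampling without replacement, as `random.Random.sample` does, ensures that no cluster is
drawn twice.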
c. Codebook
We created a codebook to classify the 6255 tweets in our dataset under one of three
categories: direct advertisement, indirect advertisement, or not relevant. We first
observed advertisements on our Twitter feeds to devise the keywords for our codebook.
Through observation and brainstorming, we listed keywords under each of the three
categories. This codebook guided us in classifying the tweets.
Direct advertisement, as per our codebook, is a tweet that prompts or urges users to buy
a product. The following are the keywords we listed in our codebook for direct
advertising: buy, order, preorder, orders, sale, store, etsy. One example of a tweet
under direct advertising is “Shop local for sweet Valentine's Day gifts
http://bit.ly/1CCfFe9”.
Indirect advertisement, as defined by our codebook, can be recognized as tweets
that try to promote a brand without making a sales pitch. The listed keywords for
indirect advertising on our codebook are: discount, contest, contests, win, visit, free,
available, coupon, giveaway, giving away, enter, hamper, bestseller, draw, sweeps,
sweepstakes, $, [0-9][0-9]% off, $[0-9], RSVP, deal, shop, limited editions. A good
example for indirect advertising is – “Hallmark knows how 2 celebrate Valentine's Day
#Win a Doggone Sweet Prize Pk @hallmark_canada @pawsitiveliving
http://goo.gl/ezkJx0 02/14”
Non-relevant tweets make up the major portion of the tweets collected. Following the
previous definitions, we consider a tweet non-relevant if none of the keywords recorded
under direct or indirect advertisement appear in its content. In addition, through the
patterns observed while coding, we concluded that a tweet that contains “myself” cannot be
a direct or indirect advertisement. Further, we
categorized tweets containing keywords such as single, options, fuck, shit, don’t, 50
under non-relevant. We were able to make these conclusions by running a search on
these terms on DiscoverText. Searches for single usually suggested non-relevant tweets,
fuck is highly unlikely to appear on an advertisement, and 50 was highly used since
February 13, 2015 marked the release of “50 Shades of Grey”.
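The keyword lists above can be read as a simple rule set. The sketch below is a
hypothetical Python rendering of that rule set, not the procedure followed in the study,
where coding was manual and then machine-assisted through DiscoverText. The whole-word,
case-insensitive matching and the order in which the categories are checked are
assumptions, and only a subset of the non-relevant markers is included.

```python
import re

# Keywords copied from the codebook; the matching rules are assumptions.
DIRECT = ["buy", "order", "preorder", "orders", "sale", "store", "etsy"]
INDIRECT = [
    "discount", "contest", "contests", "win", "visit", "free", "available",
    "coupon", "giveaway", "giving away", "enter", "hamper", "bestseller",
    "draw", "sweeps", "sweepstakes", "RSVP", "deal", "shop",
    "limited editions",
]
# Abbreviated subset of the non-relevant markers listed in the codebook.
NOT_RELEVANT = ["myself", "single", "options", "don't"]

# Symbol patterns from the indirect list: a dollar sign (which also covers
# "$[0-9]") and a two-digit percentage followed by "off".
INDIRECT_PATTERN = re.compile(r"\$|[0-9][0-9]% off", re.IGNORECASE)


def has_keyword(text, keywords):
    """True if any keyword occurs in text as a whole word or phrase."""
    return any(
        re.search(r"(?<!\w)" + re.escape(k) + r"(?!\w)", text, re.IGNORECASE)
        for k in keywords
    )


def classify(text):
    """Return 'direct', 'indirect' or 'not relevant' for a tweet text."""
    # Assumed precedence: non-relevant markers, then direct, then indirect.
    if has_keyword(text, NOT_RELEVANT):
        return "not relevant"
    if has_keyword(text, DIRECT):
        return "direct"
    if has_keyword(text, INDIRECT) or INDIRECT_PATTERN.search(text):
        return "indirect"
    return "not relevant"
```

Keyword rules of this kind are only a first pass. The “Shop local” tweet, for example,
contains the word shop, which sits in the indirect list, yet it is given above as an
example of direct advertising.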
Our coding strategy began with manually coding the tweets in the dataset that we created
through random sampling. We examined the tweets to contextualize them and to look for
keywords appearing in their text. We also visited the URLs, if any, embedded in the tweets
to understand what kind of advertisement each could be classified as. After manually coding
about 10% of the dataset, we ran machine classification on approximately 500 tweets to
observe how the machine was learning to categorize. This was done using Sifter, the text
classification functionality of DiscoverText. We then observed how the machine had
classified the tweets and reconciled any inconsistencies. This iteratively improved the
machine’s classification capability and eventually allowed us to run machine classification
on our entire dataset.
Figure 1: The pie-chart above illustrates the classification done on the training set. Number of units in dataset: 6255. Number of
units coded: 1997 (31.93%). Number of codes: 3. Number of coders: 2.
3. Ethical Analysis
Analyzing deleted data
We gathered data into DiscoverText in real time. Later, while coding, we randomly sampled
some tweets to check their validity and noticed that some had since been deleted. This
raised the question of whether we should take those tweets into consideration, given that
the users who had posted them later removed them for unknown reasons. We did take those
tweets into consideration for our research, because in DiscoverText we were not able to
crawl through each tweet to determine its validity. Also, maintaining a Utilitarian
viewpoint, we will not release the data to the public at any point, and the data will be
used only within the purview of this class. As researchers, we deemed it right to retain
the deleted tweets in our final dataset.
Interactive data visualization
The interactive visualizations for this project were created using Tableau Desktop
Professional Edition. Tableau can display the underlying data, which raises another
ethical concern. In general, we would like the person studying the interactive
visualization to be able to drill down and view the data. However, the data we are dealing
with here includes personal tweets of people who have not given us explicit permission.
That said, by agreeing to Twitter’s Terms of Service, users agree that their “... display
will be able to be viewed by other users of the Services and through third party services
and websites (go to the account settings (http://twitter.com/settings/security) page to
control who sees your Content). You should only provide Content that you are comfortable
sharing with others under these Terms” (Twitter Terms of Service, 2014).
Both of these problems stem from the fact that users are usually unaware of how their
data is used. Although we are using this data for purely scholarly purposes, there can be
other cases where people want to use such data for monetary advantage or personal
insights. As ethical professionals, we restrict access to the underlying information to
academic and research purposes only.
4. Limitations
Search API Collection
Our data collection was done through Twitter’s Search API against the keywords
“Valentine’s” and “day”. A limitation of this search is that we might not have collected
tweets that contained only related keywords such as “#valentines”, “valentine”, “vday”,
etc. The data collected for final analysis therefore contains tweets matching our search
keywords and does not capture other forms of expression such as shorthand or ASCII
characters.
Language Barrier
We have considered only tweets written in the English language. As our team is proficient
only in English, we have not taken into account tweets that are written in languages other
than English or that have been transliterated from languages such as Spanish or Mandarin.
Exclusion of Single Tweets
Our data cleaning process relied on DiscoverText’s deduplication and clustering
functionality. The clusters, each of which contained identical tweets, were stored in a
bucket to create a dataset for coding. The single tweets, however, were added to the
exclusion list on the assumption that they do not have high propagation. We understand
that we might have missed classifying some advertisements because of this. Nonetheless,
these advertisements are not of great interest to our research question, since their
degree of propagation is low.
Processing power for Tableau
We used Tableau Student Edition to generate our interactive visualizations, which
prevented us from using certain features available to enterprise users. Additionally,
Tableau’s capability is contingent upon the processing capacity of our machines, such as
RAM and clock speed (Tableau, 2014).
Data collection using DiscoverText
Since we used DiscoverText to collect our data, we gathered tweets through the REST API.
This does not give us real-time data collection, but rather a partially curated form of
the real-time data. Additionally, the metadata received through DiscoverText does not
contain all of the metadata available through Twitter’s REST API (Twitter REST APIs, n.d.).
5. Results
The manual coding of 1997 tweets in the dataset showed that the majority, about 89% of
the tweets, were ‘Not relevant’. About 7% were indirect advertisements and about 3% were
direct advertisements. We observed a similar breakdown when the machine was used to
classify the dataset of 6255 tweets: ‘Not relevant’ (85%), Indirect Advertisement (9%)
and Direct Advertisement (6%). Further, we individually investigated the machine-classified
tweets to determine whether the categorization was accurate. After exporting the dataset
from DiscoverText, we focused on Direct and Indirect Advertisement.
Our observations after plotting the values acquired through our analysis suggest
that indirect advertisements are more likely to propagate through the Twitter space
than direct advertisements.
Figure 2: This graph is plotted with Number of Retweets on the Y-axis and Tweet Creation Date on the X-Axis.
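To make this comparison concrete, the sketch below summarizes retweet counts per category
in Python. The input is limited to the example tweets quoted later in this report, which
are hand-picked top tweets rather than the full coded dataset, so the numbers are
illustrative only. The helper name `summarize` is hypothetical.

```python
from statistics import mean, median

# (category, retweet count) pairs for the example tweets quoted in this report.
# These are hand-picked top tweets, not the full coded dataset.
examples = [
    ("direct", 114), ("direct", 175), ("direct", 188), ("direct", 114),
    ("indirect", 5543), ("indirect", 1644), ("indirect", 491),
]


def summarize(rows):
    """Return {category: {"n", "mean", "median"}} of retweet counts."""
    by_category = {}
    for category, retweets in rows:
        by_category.setdefault(category, []).append(retweets)
    return {
        category: {"n": len(v), "mean": mean(v), "median": median(v)}
        for category, v in by_category.items()
    }


summary = summarize(examples)
```

Reporting the median alongside the mean is a deliberate choice here, because a single
heavily retweeted tweet can dominate the mean of a small group.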
Relationship between propagation and popularity of user account (Retweet count Vs
Follower count)
We wanted to examine whether the propagation of an advertisement is related to the
popularity of the user who posts it. This was measured by analyzing the relationship
between ‘retweet count’ and ‘follower count’.
Quantitative Method
Pearson’s r is a measure of the linear relationship between two variables, where -1
indicates a perfect negative linear correlation and +1 indicates a perfect positive
linear correlation (Pearson's Correlation, n.d.). A value close to zero indicates that
there is no linear relationship between the variables.
We calculated Pearson’s r to check for correlation between ‘retweet count’ and
‘follower count’. The value showed that there is almost no linear correlation between
retweet count and follower count (r = -0.01966). We have assumed that follower count is a
measure of popularity. Since there is no correlation between the retweet count and the
follower count of the user who retweeted, we infer that the propagation of advertisements
is not based on the popularity of the user account. We also calculated the correlation
between ‘retweet count’ and ‘follower count’ for direct and indirect advertisements
separately. Pearson’s r is -0.03917 for direct advertisements and -0.01882 for indirect
advertisements. Both values are very close to zero, so we can say with more confidence
that the popularity of a user account that retweeted an advertisement is not related to
the propagation of the advertisement.
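For reference, Pearson's r can be computed directly from its definition, as the covariance
of the two variables divided by the product of their standard deviations. The following is
a minimal Python sketch. The function name `pearson_r` is hypothetical, and the five
(retweet count, follower count) pairs are the example tweets quoted in the qualitative
section rather than the full dataset, so the result will not match the values reported
above.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Retweet and follower counts of the five top tweets quoted in this report.
retweets = [114, 175, 188, 5543, 1644]
followers = [499, 1281, 3039, 349, 400]
r = pearson_r(retweets, followers)
```

Because r only captures linear association, a value near zero rules out a straight-line
relationship but not every other kind of dependence between the two counts.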
Qualitative Method
We use qualitative methods to identify evidence that might help us answer the question
at hand. Even though we did not conduct focus groups or user interviews to get at the
sentiment of the users, we make educated assumptions about some of the top outlying
examples that we collected during this study. Qualitative analysis is open-ended by
nature; we have tried to establish relationships and individual experiences, which are
the driving factors of qualitative methods (Qualitative Research Methods: A Data
Collector’s Field Guide, n.d.).
These are some of the top retweeted advertisements –
RT @ChillDrake: @Ericdress_com "Valentine's Day Big Sale" 80% OFF & Extra 10% OFF
Over $119 >>http://t.co/f18RVTvy6b http://t.co/pGCCBH0wAq (RT: 114; Followers: 499;
Direct)
Observation: The original advertisement was posted by a fan account of Ariana Grande, who
is a Grammy Award nominee for Best Pop Duo/Group Performance (2014). Interestingly,
although this is only a fan account, it has 109K followers. Hence this tweet shows that
although the retweeting user account does not have many followers, the original user
account is highly popular. The user uses a highly popular celebrity (Ariana Grande in this
case) to build a follower base for propagating tweets.
RT @womenhealthh: Gifts To Buy Your Boyfriend For Valentine's Day -
http://t.co/E92fJTNVYk http://t.co/df0Itbk5RN (RT: 175; Followers: 1281; Direct)
Observation: This tweet originated from the @womenhealthh Twitter account. Although the
account name resembles that of the popular magazine Women’s Health (@WomensHealthMag), it
is not the magazine’s official Twitter account. The tweet leads to a webpage
(https://luufy.com/site/post/441?m=1) that contains gifting ideas. This tweet was
classified as direct due to the presence of the word “buy”, but it does not mention any
specific product or brand to buy.
RT @Mr_Mike_Clarke: MODERN DAY VALENTINE'S DAY CARD: http://t.co/XGYffTDf4l
(RT: 188; Followers: 3039 ; Direct but deleted RT)
Observation: Some deleted retweets had a good amount of propagation. Hence the
propagation of advertisements is affected when retweets are deleted by users. It may also
be that these tweets were tracked using 3rd party analytics programs like HootSuite and
were removed deliberately after reaching their propagation timeline or target.
RT @iansomerhalder: Not sure link posted!Signed Valentine's Day cards are available at
http://t.co/qtvSk9pk These things;) http://t.co/vAH3…(RT : 5543, Followers: 349, Indirect
)
Observation: The original tweet is by Ian Somerhalder, an actor on the popular television
series The Vampire Diaries. Here we observe that the original advertisement was tweeted by
a user who is highly popular (5.72 M followers).
RT @DisneylandToday: Celebrate Valentine's Day with a custom made pal from Build-A-
Bear @DisneylandDTD such as the Huggable Hearts Kitty! … (RT : 1644; Followers 400 ;
Indirect)
Observation: This tweet is a quintessential example of an indirect advertisement. The
user has retweeted a Disneyland tweet featuring the very popular plush toy brand
‘Build-A-Bear’, and the tweet is related to Valentine’s Day in terms of context as well.
Above is a scatter plot of Number of Retweets versus Number of Followers. As discussed
above through quantitative methods and qualitative methods, the graph illustrates that
there is no evident correlation between propagation of advertisements and the
popularity of a user, which is indicated by the follower count of the user.
Language Structuring of Advertisements
We have done an elementary analysis of the language used in direct and indirect
advertisements. Below are some of the case studies for direct and indirect
advertisements.
Direct Advertisements
Eric Fashion - This user account offers the latest fashion garments at cheap prices. One
of the retweets containing an advertisement for Eric Fashion received 114 retweets. The
text was as follows:
“RT @ChillDrake: @Ericdress_com "Valentine's Day Big Sale" 80% OFF & Extra 10% OFF
Over $119 >>http://t.co/f18RVTvy6b http://t.co/pGCCBH0wAq”
The above case study shows that the advertisement informs its audience about a sale. One
reason for the good propagation of this advertisement could be the monetary benefit to its
audience. Hence we see that an advertisement that provides discount or sale information
can propagate well through social media sites such as Twitter.
Unique Reflections - This is the user account of a small business called ‘Nina’s Unique
Reflections’, which sells liquor bottle candles (Welcome to Nina's Unique Reflections,
n.d.). One of the retweets of an advertisement posted by this account promotes candles
that can be bought on Etsy. Etsy is a marketplace where people around the world connect,
both online and offline, to make, sell and buy unique goods (Etsy, n.d.). The retweet
count of the advertisement was 114. The text was as follows:
“RT @UReflections: Need a great #Valentine's Day gift idea? Our candles are perfect!
http://t.co/VgTuH0Snmd #etsymntt #giftideas http://t.co…”
From the above case study we observe that small businesses are able to use the power of
social media to advertise their products. Small businesses do not have large capital to
invest in branding or promoting their products, and hence social media sites like Twitter
give these businesses a platform to advertise without any capital.
Indirect Advertisements
Dave Lackie - He is the “editor-in-chief and founder of BEAUTY the guide, luxury
beauty’s premiere digital magazine” (Dave Lackie, 2014). One of the retweets of an
advertisement posted by him persuades the audience to propagate the advertisement in
order to win a bottle of Lacoste perfume. The text was as follows:
“RT @davelackie: The perfect scent for Valentine's Day! I'm giving away this bottle on
twitter! To enter, follow @davelackie & RT http://t.c…”
The above case study shows how prominent individuals use Twitter to promote products by
running contests. Dave Lackie has about 82.7K followers and is able to use his popularity
to market beauty products and run advertisement campaigns that engage his audience.
MTV USA - MTV is one of the television channels owned by Viacom Media Networks, and it
mainly targets American youth (MTV, n.d.). Vivarelli (2014) states that MTV is using the
power of social media to engage with its audience in order to maximize advertisement
opportunities. We observed that one of the retweets of an advertisement posted by MTV
urges the audience to retweet the advertisement to win a COVERGIRL makeup bundle. The
retweet count of the advertisement is 491. The text was as follows:
“RT @MTV: RT for a chance to win a Valentine's Day COVERGIRL makeup bundle for the
perfect #covermoment! Rules: http://t.co/odLHKcdzTa”
The above case study illustrates that a popular brand like MTV is able to capitalize on
its brand to engage with its target audience, which is primarily the millennial
generation. Since most of its target audience are active members of social media sites, it
is able to utilize Twitter for advertisements.
Tag Cloud of Advertisements
The above case studies show that direct advertisements use more forceful language by
incorporating words such as shop, order, sale. Indirect advertisements make use of
persuasive language by including words such as giveaway, enter, win. The tag clouds below
reinforce this textual analysis.
Tag Cloud for Direct Advertisements:
Tag Cloud for Indirect Advertisements:
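A tag cloud is a visual rendering of term frequencies. The sketch below shows, under
simplifying assumptions, how such frequencies could be counted in Python. The stop-word
list is an assumed, non-exhaustive one, the helper name `term_frequencies` is
hypothetical, and the input is limited to the two indirect example tweets quoted above.

```python
import re
from collections import Counter

# Assumed, non-exhaustive stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {
    "a", "an", "the", "for", "to", "on", "of", "rt", "this",
    "i", "m", "s", "day", "valentine",
}


def term_frequencies(tweets):
    """Count lower-cased word tokens, skipping URLs, handles and stop words."""
    counts = Counter()
    for text in tweets:
        # Drop URLs and @handles so they do not dominate the cloud.
        text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
        tokens = re.findall(r"[a-z]+", text)
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


# The two indirect example tweets quoted in this report.
indirect_examples = [
    "RT @MTV: RT for a chance to win a Valentine's Day COVERGIRL makeup "
    "bundle for the perfect #covermoment! Rules: http://t.co/odLHKcdzTa",
    "RT @davelackie: The perfect scent for Valentine's Day! I'm giving away "
    "this bottle on twitter! To enter, follow @davelackie & RT http://t.c…",
]
freq = term_frequencies(indirect_examples)
```

Calling `freq.most_common(10)` would list the terms that a tag cloud would draw largest.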
Incentivization in Indirect Advertisement
Most of the indirect advertisements ask their audience to retweet in return for some
form of incentive. Usually the incentives take the form of followers or giveaways gained
by participating in a contest or lottery. To win the contest or lottery, users have to
retweet the advertisement, thereby propagating it. The incentive might be a big
contributing factor behind the user’s motivation to engage with (retweet) the original
tweet. The content of direct ads is more related to discounts or savings, whereas
advertisements in the indirect category are more appealing to users because they promise
something for free through giveaways, or more followers, without any extra effort by the
user.
6. Conclusion
The quantitative measure of retweet count indicates that indirect advertisements have
greater propagation than direct advertisements. We also understand that retweet count
alone will not provide a complete understanding of the intricacies behind the research
question.
By conducting both quantitative and qualitative analysis on our dataset, we discovered a
few discernible cases that stood out from the bulk of the data. We also attempted to
explain these cases in the results section above by conducting an elementary analysis of
the language used in the advertisements. This, however, might not be the absolute answer.
It does provide some insight into how businesses sell and promote their products through
Twitter. Direct advertisement case studies show how small businesses are trying to
persuade users to buy their specific products. Indirect advertisement case studies show
how even big brands are utilizing the power of social media to engage with their consumers
by running advertisement campaigns on Twitter. These advertisement campaigns use
incentives such as free followers and giveaways to encourage propagation of tweets. We did
not find any statistical correlation between retweets and the follower count of the people
who retweeted. A qualitative analysis of the top direct and indirect advertisements showed
that some user accounts utilized existing brands or celebrity names to gain followers.
There are unseen factors such as contextual relevance, brand power and user interest which
play a very important role in the propagation of advertisements. Our analysis provides
cursory insights into the research question, but an extensive analysis of these
contributing factors would yield a more complete picture.
7. Interactive Visualization
Interactive Data Visualization URL = https://public.tableau.com/profile/bksarthak/
(go to the URL and browse through the tabs at the top to cycle through 3 interactive
data visualizations)
References
Keach, S. (2014, December 12). Instagram now has more users than Twitter. Retrieved
March 17, 2015, from http://www.trustedreviews.com/news/instagram-now-has-more-users-than-twitter
Kelly, R. (Ed.). (2009, August 12). Twitter Study – August 2009. San Antonio, Texas: Pear
Analytics. Retrieved March 17, 2015, from Internet Archive.
Iris, R. (2014, July 25). The importance of deduplicating. Retrieved March 18, 2015, from
https://texifter.zendesk.com/hc/en-us/articles/202520774-The-importance-of-deduplicating
Valentine’s Day 2015: Shopping Factoids Inspired by Cupid. (2015, February 5). Retrieved
March 18, 2015, from
http://www.trueship.com/blog/2015/02/05/valentines-day-2015-shopping-factoids-inspired-by-cupid/#.VQoZII7F9bU
The Search API. (n.d.). Retrieved March 19, 2015, from
https://dev.twitter.com/rest/public/search
Maximum Limit for Rows or Columns of Data. (2014, April 4). Retrieved March 19, 2015, from
http://kb.tableau.com/articles/howto/maximum-limit-for-rows-or-columns-of-data
Twitter Terms of Service. (2014, September 8). Retrieved March 19, 2015, from
https://twitter.com/tos?lang=en
REST APIs. (n.d.). Retrieved March 19, 2015, from https://dev.twitter.com/rest/public
Qualitative Research Methods Overview. (n.d.). In Qualitative Research Methods: A Data
Collector's Field Guide. Family Health International.
MTV. (n.d.). Retrieved March 19, 2015, from
http://www.mtv.com/sitewide/legal/frequently_asked_questions.jhtml
Pearson's Correlation. (n.d.). Retrieved March 19, 2015, from
http://www.statstutor.ac.uk/resources/uploaded/pearsons.pdf
Vivarelli, N. (2014, August 29). MTV Uses Social Media for Global Audience Involvement.
Retrieved March 19, 2015, from
http://variety.com/2014/tv/global/mtv-international-uses-social-media-for-total-event-audience-involvement-1201292480/
Bio. (n.d.). Retrieved March 19, 2015, from http://davelackie.com/about/bio/
Welcome to Nina's Unique Reflections! (n.d.). Retrieved March 19, 2015, from
http://www.ninasuniquereflections.com/
About Etsy. (n.d.). Retrieved March 19, 2015, from https://www.etsy.com/about/?ref=ftr