“We (don’t) know how you feel” - a comparative study of automated
vs. manual analysis of social media conversations
Yuvraj Padmanabhan and Ana Isabel Canhoto
This paper presents the results of a study comparing the output of sentiment analysis of
social media data, done manually vs. using automated analysis tools. The results show
low levels of agreement between the manual analysis and the automated one, as well
as between the automated tools. This is a concerning result given the popularity of
automated sentiment analysis as a way of extracting insight from social media data.
Sentiment analysis is valuable in understanding consumers’ emotions and these, in
turn, help marketers explain and anticipate consumer behaviour. Social media can
provide real time insight into these emotions, in a non-obtrusive manner. However,
emotions may be expressed in many different ways, sometimes only implicitly. This
means that the analysis of sentiment is a complex and time-consuming task, and it
may be very difficult to identify the underlying emotion. Automated tools are
effective at processing large volumes of data but lack the sophistication and
contextual awareness of a human analyst. Researchers need to assess the accuracy
of automated analysis tools before deciding to utilise them in the analysis of social
media data.
Keywords: Consumer Behaviour, Emotions, Sentiment Analysis, Social Media, Data
1. Introduction
Social media (SM) are deemed to offer new opportunities to obtain qualitative insight
on consumer behaviour for practitioners (Casteleyn, Mottart, & Rutten, 2009) and
social scientists (Baker, 2009). The advantages of using SM include the ability to
collect data in real time and in a non-intrusive manner, and to access customers’
unprompted views, without having to rely on their ability to recall past
behaviour (Zaltman, 2003). SM may provide highly relevant customer insight in a
more cost-effective way than traditional approaches (Christiansen, 2011), which is
very appealing to firms conscious of their investments and keen to reduce costs in the
continuing economic recession (Simkin & Dibb, 2012). SM, it seems, is an essential
mechanism for marketers studying the behaviour of people across places and over
time.
increasingly, data collection and analysis is done without human intervention (Nunan
& Domenico, 2013). Accordingly, there is now an abundance of commercial software
tools that mine textual data and produce reports of expressed opinions and sentiment
(Sterne, 2010). Despite their popularity, these tools have limited ability to analyse and
interpret the emotions expressed in segments of text (Cambria & Hussain, 2012),
though it is difficult to assess how limited the tools are, or what aspects of sentiment
analysis are particularly vulnerable, given that the companies behind such commercial
applications do not reveal their algorithms (Beer & Burrows, 2013). That is,
practitioners and academics alike are increasingly acting on the output of analytical
processes that they do not control and do not fully understand. Given that SM is
increasingly used as a source of data in marketing, it is imperative to investigate the
quality of popular tools employed in the analysis of that data (Davis & O’Flaherty,
2012). This paper presents a critical assessment of the quality of automated sentiment
analysis of SM data.
The next section briefly considers the role of software in analysing qualitative data,
before focusing on the specific case of sentiment analysis. To investigate the accuracy
of alternative approaches to classifying SM data, 200 Twitter posts were analysed
manually and using two popular sentiment analysis tools. The results revealed low
levels of inter-coder agreement, not just in terms of manual vs. automated approaches,
but also between the two automated tools considered. The implications of these
findings for research on consumer behaviour are discussed, and directions for further
research are identified.
2. Analysing social media data
Qualitative research is a valuable research methodology in marketing (Shabbir, Reast,
& Palihawadana, 2009), which allows the researcher to move away from a description
of what is, to an explanation of why that is the case (Durgee, 1986). Cohen (1999) has
likened the role of the qualitative researcher to that of a detective who goes beyond
‘just the facts’ (p. 287), generating new insights into consumer behaviour. The
analysis of qualitative data is a time-consuming task, often progressing alongside data
collection in a dynamic and interactive manner (Basit, 2003). The researcher is likely
to have an extensive and poorly ordered data set to work with, containing rich and
varied inputs (Saunders, Lewis, & Thornhill, 2007). The issue of variety and volume is
particularly relevant when collecting data on SM – e.g., Twitter users post an average
of 58 million messages per day (StatisticBrain, 2014). Moreover, SM users not only
use text, but they also share and comment on links, pictures, videos and audio files
(Kietzmann, Hermkens, McCarthy, & Silvestre, 2011).
The researcher engages in a process of data reduction (Miles & Huberman, 1994)
selecting relevant segments of data and identifying pertinent themes. This may be
done manually or using software, including specialist programs which are commonly
described as computer aided qualitative data analysis software (CAQDAS). CAQDAS
help with the generation of codes and the manipulation of big datasets, which may
encourage the researcher to probe the data in depth (Lage & Godoy, 2008). CAQDAS
may improve the overall credibility of a study, even if it does not change the rigour of
the work or the outcome of the analysis (Ryan, 2009). While CAQDAS assists with
data manipulation, ultimately it is the researcher who has to ‘create the categories
(…) and decide what to retrieve and collate’ (Basit, 2003, p. 145). The researcher also
needs to assess the suitability of the automated tool for the study. Automated tools
may not discern nuances in meaning, leading to the partial retrieval of information
(Brown, Taylor, Baldy, Edwards, & Oppenheimer, 1990). This latter weakness is
particularly worrying when processing SM data as users tend to abbreviate words, and
employ colloquialisms and emoticons (Canhoto & Clark, 2013).
Sentiment analysis is used to study the attitude that someone may have towards an
entity, an issue or a brand. It is a variant of qualitative content analysis, where the
codes relate to sentiments. There are a number of specialist software products
available to mine documents and identify words or phrases that denote a sentiment
(Yi & Niblack, 2005). Using techniques such as latent semantic analysis, the
algorithm scans textual data, using keywords deemed to reflect sentiment (Thet, Na,
& Khoo, 2010) – e.g., ‘great’ for a positive emotion, or ‘revolting’ for a negative one.
These segments are classified by sentiment polarity – positive or negative - reflecting
the attitude of the person that expressed that opinion (Thelwall, Buckley, & Paltoglu, 2011).
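The keyword-based polarity classification described above can be sketched minimally as follows. The word lists here are illustrative placeholders, not any vendor’s actual lexicon, and real tools combine far larger lexicons with linguistic rules:

```python
# Illustrative lexicons only; commercial tools use much larger, curated lists.
POSITIVE = {"great", "yummy", "love", "delicious"}
NEGATIVE = {"revolting", "sucks", "awful", "bitter"}

def classify_polarity(text: str) -> str:
    """Label a text segment 'positive', 'negative' or 'neutral' by keyword counts."""
    tokens = [t.strip(".,!?:;") for t in text.lower().split()]
    score = (sum(t in POSITIVE for t in tokens)
             - sum(t in NEGATIVE for t in tokens))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A sketch of this kind makes the tools’ key limitation visible: any segment containing none of the listed keywords defaults to ‘neutral’, however rich in implied meaning it may be.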
Sentiment analysis is extremely topical in the marketing literature – particularly,
consumer behaviour – given the abundance of SM content that is directly or indirectly
related to preferences, behaviour and motivations. For instance, an analysis of Twitter
posts by Jansen and colleagues (2009) found that one in every five tweets mentioned
a specific brand, with one-fifth of those expressing sentiment. Unable to cope with the
volume of SM data, and the technical complexity of monitoring diverse SM
platforms, many firms have turned to third parties for automated tracking and analysis
of SM data relevant to their products or brands (Davis & O’Flaherty, 2012). Despite
its popularity, the automated analysis of sentiment is not without challenges. For
instance, the expression of sentiment varies with cultures (Nasukawa & Yi, 2003),
and over time. Moreover, different sentiments may be expressed in one segment
(Nasukawa & Yi, 2003). Also, sentiment may be expressed through subtle elements
such as the use of clauses (Kim & Hovy, 2006) or the choice of words and their
placement (Davis & O’Flaherty, 2012). The analysis of the meaning of SM data is a
highly complex task that needs to take into account contextual information (Kozinets,
2002), but current automated tools have limited capacity to analyse data in context
(Cambria & Hussain, 2012, p. 139), particularly when handling short text segments.
Given the increasing popularity of sentiment analysis of SM data, in both academic
and practitioner circles, and in particular the increasing reliance on automated tools, it
is critical to investigate how accurate these tools are in studying consumer behaviour.
The promotional literature of the providers of automated sentiment analysis tools
typically reports an accuracy rate of between 70% and 80% (Davis & O’Flaherty, 2012).
While these percentages suggest a high level of accuracy by some benchmarks (see
Gwet, 2012), it is not clear how the coefficients were calculated. Moreover, it is not
possible to independently verify the companies’ claims, as there is no information on
the input for those calculations, e.g., the sample used. Also, it is not clear what the
reported accuracy is judged against. In academic research, accuracy is assessed by the
extent to which different researchers agree on the classification of a particular data
object (Gwet, 2012); or, in the case of an automated tool, the extent to which the
classification produced by the system matches that of human coders. Yet, some
providers may be reporting accuracy against other automated tools. These weaknesses
and uncertainties mean that an increasing number of academics and practitioners are
acting on the findings of automated analysis whose mechanics they do not fully
understand, and which may be flawed. The present paper addresses this risk by
investigating the research question: To what extent do automated and manual analysis
of sentiment in social media data match; and why?
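The agreement-based notion of accuracy used in academic research can be illustrated with a short sketch computing observed (percent) agreement and Cohen’s kappa, which corrects for chance agreement, for two coders’ label sequences (the labels below are illustrative; Gwet (2012) discusses alternative coefficients):

```python
from collections import Counter

def agreement_stats(coder_a, coder_b):
    """Observed agreement and Cohen's kappa for two coders' label sequences."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: sum over labels of the product of marginal proportions.
    pa, pb = Counter(coder_a), Counter(coder_b)
    chance = sum((pa[l] / n) * (pb[l] / n) for l in set(pa) | set(pb))
    kappa = (observed - chance) / (1 - chance) if chance < 1 else 1.0
    return observed, kappa
```

The gap between observed agreement and kappa is exactly why an unqualified ‘70–80% accuracy’ claim is hard to interpret: without knowing whether and how chance agreement was handled, the figure cannot be compared across tools.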
3. The empirical study
The study adopted a netnographic approach. Netnography is a naturalistic,
unobtrusive observation technique (Kozinets, 2002), ideal to study naturally occurring
online behaviour (Hine, 2000). Moreover, it is the approach traditionally followed by
the commercial providers of sentiment analysis, which increases the relevance of this
paper’s findings for marketing practice. As the topic of food and beverages is the one
most widely discussed on social media (Forsyth, 2011), it was chosen as the focus for
data collection. This was narrowed down to conversations about coffee as this broadly
popular beverage is ‘charged with a wide range of cultural meanings’ (Grinshpun,
forthcoming), and has been the subject of other netnographic studies (e.g., Kozinets, 2002).
The community selected for observation was Twitter, as this platform is often used as
a source of qualitative data by both practitioners and academics (Williams, Terras, &
Warwick, 2013). A total of 200 tweets were collected over a period of one month, using the
search term ‘coffee’ and its variants ‘latte’, ‘mocha’, ‘cappuccino’, ‘espresso’ and
‘Americano’, as well as the terms ‘flavour’, ‘aroma’ and ‘caffeine’. Care was taken to
include multiple users and to exclude tweets by manufacturers and retailers.
Disagreement between coders may occur because they followed different procedures
or interpreted the input differently (Gwet, 2012). Hence, data analysis proceeded
deductively. First, data were analysed manually and with the popular automated
sentiment analysis tools, Theysay and Lymbix, using a coding scheme that reflected
polarity of emotion (positive vs. negative). As advised by Koppel and Schler (2006),
comments that related to the product but did not express an emotion were not
excluded from the sample; instead, they were given the code ‘neutral’. Then, data
were analysed by type of emotion, because different emotions produce different
behavioural consequences (Laros & Steenkamp, 2005). This was done manually and
with the tools Lymbix, rule-based sentic computing and M-C-based sentic computing,
applying Plutchik’s (2001) coding scheme.
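The comparison of the three codings of each tweet can be sketched as follows. The function and category names are hypothetical, introduced only to show how each tweet’s triple of labels (manual, system 1, system 2) is sorted into agreement categories:

```python
def compare_codings(manual, system1, system2):
    """Tally how three codings of the same tweets relate: all three agree,
    the two systems agree but differ from the manual code, all three differ,
    or a mixed partial agreement ('other')."""
    tally = {"all_agree": 0, "systems_vs_manual": 0, "all_differ": 0, "other": 0}
    for m, s1, s2 in zip(manual, system1, system2):
        if m == s1 == s2:
            tally["all_agree"] += 1
        elif s1 == s2:  # systems match each other but not the manual code
            tally["systems_vs_manual"] += 1
        elif m != s1 and m != s2:  # no two codings match
            tally["all_differ"] += 1
        else:
            tally["other"] += 1
    return tally
```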
4. Findings
There were significant differences between the outcomes of each approach to
sentiment analysis. Overall, manual analysis of tweets was most likely to result in a
‘positive’ score. One of the systems was most likely to produce negative scores, while
the other tended to result in neutral ones. As per Figure 1, the three approaches only
delivered the same score in 32% of the messages, and those messages were most
likely to reflect positive emotions. In 11% of the cases, each approach delivered a
different score. In 7% of the cases both systems agreed on the outcome, but differed
from the manual analysis; these tended to be neutral scores (systems) vs. positive ones
(manual), and occurred mostly where the segments for analysis were very short, such
as this one: ‘In uni. I think without this cup of coffee I would hulk out’ (entry 69). In
addition, there were significant differences in terms of the specific emotion identified,
particularly around the label ‘surprise’, which was often confused with other emotions
such as ‘joy’ or ‘sadness’ (Figure 2).
Figure 2. Overview of basic emotions
Further analysis focused on the 68% of the comments where the outcomes differed, in
order to investigate the reasons for the differences. One reason identified was the
expression of both positive and negative sentiments in one segment of text, such as
‘The early shift sucks. Oh well at least my latte is yummy :)’ (entry 19). Here, the
overall sentiment is negative, but the consumer expresses positive sentiment towards
the drink. The systems struggled to detect this nuance. A related factor concerns the
target of the emotion. For instance, the segment ‘This coffee shop needs to change
there music up every once and a while. Or maybe I should go home’ (entry 12) is
generally negative, yet the consumer is not expressing negative emotion towards
coffee itself.
Differences also arose when the sentiment was implied rather than explicit. That is,
when the text did not contain any of the keywords associated with emotion but was
nonetheless rich in meaning. For instance, this segment depicts coffee positively, as a
reward: ‘Wyndes - 100 copies of Ghosts sold overnight means a definite Starbucks run
this morning. Possibly coffee out twice this week! Maybe even sushi!!’ (entry 46). One
of the systems attributed a negative score to this segment, and the other a neutral one.
A related issue is where the negativity emerges from the absence of coffee. In such
cases, the polarity of the text segment is negative, but it actually expresses a positive
attitude towards coffee, as in this case: “how the heck am I supposed to be able to
sleep well without coffee in my system? fucking snow” (entry 31).
Another cause of divergent classifications was the use of abbreviations and slang –
e.g., ‘QT’ instead of the phrase ‘Quality Time’. This abbreviation expresses a positive
emotion, as in the example “Having coffee with my grandma before work right now.
QT” (entry 25), though one system gave it a neutral score.
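One possible mitigation for this class of divergence, sketched here under the assumption of a hand-maintained abbreviation dictionary (the SLANG table below is a hypothetical placeholder, not part of any of the tools studied), is to expand known slang before sentiment scoring:

```python
# Hypothetical abbreviation table; real preprocessing would need a much
# larger, continually maintained dictionary of Internet slang.
SLANG = {"qt": "quality time", "gr8": "great", "luv": "love"}

def expand_slang(text: str) -> str:
    """Replace known abbreviations with their expansions before scoring.
    Tokens are matched case-insensitively, ignoring trailing punctuation."""
    return " ".join(SLANG.get(tok.lower().strip(".,!?"), tok)
                    for tok in text.split())
```

Even with such a table, coverage is necessarily partial, since SM slang evolves faster than any dictionary can be maintained.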
5. Discussion and Conclusions
Emotions are key to both explaining and anticipating consumer behaviour. Sentiment
analysis offers marketers a way to measure and summarise these emotions. It is not a
simple or straightforward process, though, unlike the quantitative analysis of data,
where there are well-established standards for both processing and analysing data –
e.g., the correlation between two variables. Many words can be associated with
different emotions, some can be associated with both positive and negative polarity
and, as our study showed, it is also possible to communicate emotion without using
emotionally
charged words. This means that the analysis of sentiment – both its measurement and
its classification into different types of emotion – is a complex task, and that it may be
very difficult for analysts to identify the underlying emotion, or agree on a code. To
improve accuracy, sentiment analysis needs to take into consideration the social
context within which the conversation takes place (Cooke & Buckley, 2008), for
instance looking at tweets before or after the one being coded, or referring to wider
patterns (e.g., more negative tweets on Mondays). Social media offers a window into
consumers’ minds, and there is a degree of contextual information that helps
researchers map sentiment. However, the task is made difficult by the use of short
sentences, colloquialisms and Internet slang. SM should not be seen as a shortcut for
collecting consumer data. Rather, researchers need to invest considerable time
familiarising themselves with the technical and pragmatic aspects of communication
in the SM environment.
In this study, there were low levels of inter-rater agreement, not only between manual
analysis and automated systems, but also between the two systems used. While it had
been noted that automated tools are blunt instruments to study sentiment (e.g.,
Cambria & Hussain, 2012), it was still surprising to see that the two systems only
agreed with each other 39% of the time. This result is very concerning given the
popularity of automated sentiment analysis. It is concerning for academics,
particularly the novice user, who may be too reliant on these tools to analyse large
volumes of consumer data. One of the reasons why using CAQDAS may improve the
credibility of a qualitative study is that the software enables researchers to make
visible their data coding and data analysis processes (Rademaker, Grace, & Curda,
2012). This is not the case with most automated sentiment analysis tools, given that
the coding and analysis process is performed by algorithms strongly guarded by the
commercial organisations that sell these applications (Beer & Burrows, 2013). It is
even more concerning for practitioners who are in search of speedy and inexpensive
customer insight and unlikely to use two systems plus manual analysis, as we did, to
assess the robustness of the tools. Automated tools are effective at processing large
volumes of data but lack the sophistication and contextual awareness of human
analysts. Researchers need to carefully assess their accuracy before using them.
Baker, S. (2009, 21 May). Learning, and Profiting, from Online Friendships.
Bloomberg Businessweek Magazine.
Basit, T. N. (2003). Manual or electronic? The role of coding in qualitative data
analysis. Educational Research, 45(2), 143-154.
Beer, D., & Burrows, R. (2013). Popular Culture, Digital Archives and the New
Social Life of Data. Theory, Culture & Society, 30(4), 47-71.
Brown, D., Taylor, C., Baldy, R., Edwards, G., & Oppenheimer, E. (1990).
Computers and QDA - can they help it? A report on a qualitative data analysis
programme. The Sociological Review, 38(1), 134-150.
Cambria, E., & Hussain, A. (2012). Sentic Computing: Techniques, Tools, and
Applications. Dordrecht: Springer.
Canhoto, A. I., & Clark, M. (2013). Customer service 140 characters at a time – the
users’ perspective. Journal of Marketing Management, 29(5-6), 522-544.
Casteleyn, J., Mottart, A., & Rutten, K. (2009). How to use Facebook in your market
research. International Journal of Market Research, 51(4), 439-447.
Christiansen, L. (2011). Personal privacy and Internet marketing: An impossible
conflict or a marriage made in heaven? Business Horizons, 54(6), 509-514.
Cohen, R. J. (1999). Qualitative Research and Marketing Mysteries: An Introduction
to the Special Issue. Psychology & Marketing, 16(4), 287-289.
Cooke, M., & Buckley, N. (2008). Web 2.0, Social Networks and the Future of
Market Research. International Journal of Market Research, 50(2), 267-292.
Davis, J. J., & O’Flaherty, S. (2012). Assessing the Accuracy of Automated Twitter
Sentiment Coding. Academy of Marketing Studies Journal, 16(Supplement),
Durgee, J. F. (1986). Richer Findings from Qualitative Research. Journal of
Advertising Research, 26(4), 36-44.
Forsyth, J. (2011). Coffee - UK (pp. 169): Mintel.
Grinshpun, H. (forthcoming). Deconstructing a global commodity: Coffee, culture,
and consumption in Japan. Journal of Consumer Culture, forthcoming.
Gwet, K. L. (2012). Handbook of Inter-Rater Reliability. Gaithersburg, MD: StatAxis
Hine, C. M. (2000). Virtual Ethnography. London: Sage.
Jansen, B. J., Zhang, M., Sobel, K., & Chowdury, A. (2009). Twitter Power: Tweets
as Electronic Word of Mouth. Journal of the American Society for Information
Science and Technology, 60(11), 2169-2188.
Kietzmann, J. H., Hermkens, K., McCarthy, I. P., & Silvestre, B. S. (2011). Social
media? Get serious! Understanding the functional building blocks of social
media. Business Horizons, 54(3), 241-251.
Kim, S.-M., & Hovy, E. (2006, July). Automatic Identification of Pro and Con
Reasons in Online Reviews. Paper presented at the COLING/ACL, Sydney.
Kozinets, R. V. (2002). The Field Behind the Screen: Using Netnography for
Marketing Research in Online Communities. Journal of Marketing Research,
Lage, M. C., & Godoy, A. S. (2008). Computer-aided qualitative data analysis:
emerging questions. RAM. Revista de Administração Mackenzie, 9(4), 75-98.
Laros, F. J. M., & Steenkamp, J.-B. E. M. (2005). Emotions in consumer behavior: a
hierarchical approach. Journal of Business Research, 58(10), 1437-1445.
Miles, M. B., & Huberman, A. M. (1994). Qualitative Data Analysis. Thousand Oaks,
Koppel, M., & Schler, J. (2006). The Importance of Neutral Examples for
Learning Sentiment. Computational Intelligence, 22(2), 100-109.
Nasukawa, T., & Yi, J. (2003). Sentiment analysis: capturing favorability using
natural language processing. Paper presented at the 2nd international
conference on Knowledge capture.
Nunan, D., & Domenico, M. D. (2013). Market research and the ethics of big data.
International Journal of Market Research, 55(4), 505-520.
Plutchik, R. (2001). The nature of emotions. American Scientist, 89(4), 344-350.
Rademaker, L., Grace, E., & Curda, S. (2012). Using computer-assisted qualitative
data analysis software (CAQDAS) to re-examine traditionally analyzed data:
Expanding our understanding of the data and of ourselves as scholars.
Qualitative Report, 17(43), 1-11.
Ryan, M. E. (2009). Making visible the coding process: Using qualitative data
software in a post-structural study. Issues in Educational Research, 19(2),
Saunders, M., Lewis, P., & Thornhill, A. (2007). Research Methods for Business
Students (4th ed.). London: Prentice Hall.
Shabbir, H. A., Reast, J., & Palihawadana, D. (2009). 25 Years of Psychology &
Marketing: A Multi-dimensional Review. Psychology & Marketing, 26(12),
Simkin, L., & Dibb, S. (2012). Leadership teams rediscover market analysis in
seeking competitive advantage and growth during economic uncertainty.
Journal of Strategic Marketing, 20(1), 45-54.
StatisticBrain. (2014). Twitter Statistics. Retrieved 20 January 2014, from
Sterne, J. (2010). Social Media Analytics: Effective Tools for Building, Interpreting,
and Using Metrics. London: Wiley.
Thelwall, M., Buckley, K., & Paltoglu, G. (2011). Sentiment in Twitter events.
Journal of the American Society for Information Science and Technology,
Thet, T. T., Na, J.-C., & Khoo, C. S. G. (2010). Aspect-based sentiment analysis of
movie reviews on discussion boards. Journal of Information Science, 36(6),
Williams, S. A., Terras, M., & Warwick, C. (2013). What do people study when they
study Twitter? Classifying Twitter related academic papers. Journal of
Documentation, 69(3), 384-410.
Yi, J., & Niblack, W. (2005, April). Sentiment Mining in WebFountain. Paper
presented at the 21st International Conference on Data Engineering.
Zaltman, G. (2003). How Customers Think: Essential Insights into the Mind of the
Market. Cambridge (MA): Harvard Business School Press.