Full Length Article
Comparing automated text classification methods
Jochen Hartmann, Juliana Huppertz, Christina Schamp, Mark Heitmann
Marketing & Customer Insight, University of Hamburg, Moorweidenstraße 18, 20148 Hamburg, Germany
Article info
Article history: First received on August 16, 2017 and was under review for 5 months. Available online 24 October 2018.
Senior Editor: Michael Haenlein

Abstract
Online social media drive the growth of unstructured text data. Many marketing applications require structuring this data at scales inaccessible to human coding, e.g., to detect communication shifts in sentiment or other researcher-defined content categories. Several methods have been proposed to automatically classify unstructured text. This paper compares the performance of ten such approaches (five lexicon-based, five machine learning algorithms) across 41 social media datasets covering major social media platforms, various sample sizes, and languages. So far, marketing research relies predominantly on support vector machines (SVM) and Linguistic Inquiry and Word Count (LIWC). Across all tasks we study, either random forest (RF) or naive Bayes (NB) performs best in terms of correctly uncovering human intuition. In particular, RF exhibits consistently high performance for three-class sentiment, NB for small sample sizes. SVM never outperform the remaining methods. All lexicon-based approaches, LIWC in particular, perform poorly compared with machine learning. In some applications, accuracies only slightly exceed chance. Since additional considerations of text classification choice are also in favor of NB and RF, our results suggest that marketing research can benefit from considering these alternatives.
© 2018 Elsevier B.V. All rights reserved.
Keywords:
Text classification
Social media
Machine learning
User-generated content
Sentiment analysis
Natural language processing
1. Introduction
Online social networks, consumer reviews, and user-generated blog content facilitate personal communication among consumers as well as between firms and consumers (Hewett, Rand, Rust, & van Heerde, 2016). This provides marketing research and practice with additional consumer information that can complement traditional market research (Netzer, Feldman, Goldenberg, & Fresko, 2012). Among other things, social media accelerate public opinion building processes. Accordingly, continuously tracking potential communication shifts in terms of sentiment or other predefined categories becomes increasingly important to enable timely responses. Similarly, marketing researchers are increasingly interested in classifying large volumes of unstructured text data to study how sentiment and theoretically meaningful content classes co-evolve with marketing-relevant outcomes (e.g., Berger & Milkman, 2012; Ghose, Ipeirotis, & Li, 2012).
Whenever dictionaries exist, researchers can apply lexicon-based methods such as Linguistic Inquiry and Word Count (LIWC) to relate word choice to content categories of interest (e.g., Cavanaugh, Bettman, & Luce, 2015; Ordenes, Ludwig, Grewal, & Wetzels, 2017). Alternatively, human coding (e.g., in terms of positive vs. negative sentiment) can be used to train supervised machine learning algorithms to automatically classify any additional data (e.g., Hennig-Thurau, Wiertz, & Feldhaus, 2015; Lee, Hosanagar, & Nair, 2018). All of these approaches attempt to structure unstructured text data by assigning categories to individual text documents. This is particularly relevant whenever comprehensive human coding is not feasible due to the amount of data or because immediate classification information is required.
International Journal of Research in Marketing 36 (2019) 20–38
Corresponding author. E-mail address: jochen.hartmann@uni-hamburg.de (J. Hartmann).
https://doi.org/10.1016/j.ijresmar.2018.09.009
0167-8116/© 2018 Elsevier B.V. All rights reserved.
Industry reports estimate the market volume of such automated text analysis to reach 8.8 billion USD by 2022, with annual
growth up to 17.2% (Markets & Markets, 2018). However, according to a survey among 3300 C-level executives, the selection
of adequate methods for specific application contexts is regarded as one of the main challenges that currently prohibits further
machine learning proliferation (McKinsey Global Institute, 2017).
Balancing different objectives when choosing text classification approaches can be complex. Objectives exclusively related to maximizing classification accuracy suggest comprehensively testing all available approaches to identify the best solution for each individual task. On the other hand, taking comparability and scarce resources into account, employing the same approach across applications can be reasonable. Choosing between both extremes requires knowledge about the size of the potential accuracy trade-offs and their monetary consequences.
Prior research provides little guidance on these issues. While method comparisons of text classification are scarce in marketing, several such comparisons exist in computer science. However, these publications have different objectives. Even small improvement increments are of interest, e.g., to understand promising avenues for further developments. For these reasons, method comparisons are often limited to a few new method candidates and reference methods. Accordingly, the empirical evidence on predictive accuracy is scattered across publications with different types of data and implementations. These comparisons also often lack statistical significance tests and investigations of the economic or practical relevance of the observed performance differences (e.g., Bermingham & Smeaton, 2010; Pang, Lee, & Vaithyanathan, 2002). In terms of data, this literature studies diverse datasets, many of which are only of peripheral interest to marketing, such as classifying political blogs (e.g., Melville, Gryc, & Lawrence, 2009) or medical abstracts from bibliographic databases (e.g., Joachims, 1998). A particularly often mentioned conclusion is the no free lunch theorem, i.e., no single method works best across all applications and each application requires an exhaustive method comparison to find the optimal approach for the task at hand (e.g., Fernández-Delgado, Cernadas, Barro, & Amorim, 2014; Wolpert, 1996).
Social media marketing covers a smaller and likely more homogeneous set of text classification problems. In addition, marketing is particularly interested in economic relevance, interpretability of results, and implementational costs. Among other things, empirical observations are relevant for theory building, which necessitates comparable results across studies in terms of similar methodological approaches, parameters, and their interpretations. Moreover, text classification often provides only a few variables as part of more comprehensive econometric models (e.g., Hewett et al., 2016). This makes implementational costs relevant and favors repeated application of well-established approaches. Against this background, it is not clear that the no free lunch theorem advocated in computer science is similarly reasonable advice for marketing.
The particular concern of marketing research with efficiency in application and comparability of results is evidenced by the fact that marketing research has gravitated towards two main classification approaches used across applications: support vector machines (SVM) and LIWC. With very few exceptions (e.g., Netzer, Lemaire, & Herzenstein, 2016), marketing research does not conduct the types of exhaustive method comparisons computer science would suggest. Repeated applications of easily implementable and interpretable methods such as LIWC can appear reasonable considering the aforementioned text classification objectives. However, the trade-offs in terms of accuracy require further research. In particular, it is not clear whether an individual approach exists that performs consistently well and within a reasonable boundary compared with the top performing approaches. It is also not clear under which circumstances which methods are most likely suitable.
This research attempts to fill this gap. We compare the performance of SVM and LIWC to other approaches tested particularly often for text classification outside of marketing. This includes artificial neural networks (ANN), k-nearest neighbors (kNN), naive Bayes (NB), and random forest (RF) as well as four additional lexicon-based methods. We study how well these automated text classification methods represent human intuition across a collection of 41 social media datasets, covering different sample sizes, languages, major social media and e-commerce platforms as well as corporate blogs (e.g., Facebook, Twitter, IMDb, and YouTube).
To the best of our knowledge, only one study is similar in spirit to our investigation. Kübler, Colicev, and Pauwels (2017) also investigate method performance for applied marketing problems. They focus on SVM and LIWC as the prevailing methods in marketing. Their investigation is based on a single social network and studies whether an individual firm should rely on these specific classification methods for sentiment extraction and consumer mindset prediction. Our research complements their work by including additional methods (i.e., ANN, kNN, NB, and RF as well as four additional lexicon-based methods) and tests how SVM and LIWC perform relative to these and which classification method is best suited given the specific tasks.
We have found no other comparative study investigating a similar scope of social media datasets and methods. This allows us to explore a potential middle ground between the no free lunch theorem and treating all classification problems with the same (simple) approach. In particular, we can study the variance of method performance across methods and datasets to understand what drives accuracy and under which conditions which methods perform best. This allows researchers to make more informed method choices without requiring full method comparisons for each application.
2. Related research
2.1. The use of automated text classification in marketing research
We identified marketing publications applying automated text classification by searching relevant marketing journals (i.e., JM, JMR, Mrkt. Sci., JCR, IJRM, Mgnt. Sci., JAMS) for papers that mention at least one of the methods we study in their titles, abstracts, or keywords or explicitly state the application of automated text classification. We also conducted a keyword search regarding these methods as well as text classification and screened the websites of the authors we identified. Note that we may still have missed individual publications since text classification sometimes provides only a single variable of a more comprehensive empirical analysis and is consequently only briefly mentioned in some articles. However, to the best of our knowledge we cover a representative majority of relevant publications (see Web Appendix A for a detailed list).
These studies are typically geared towards substantive contributions, with comprehensive method comparisons beyond their scope. Automated text classification is used in marketing research across a wide range of very different research objectives, e.g., to predict defaults based on online loan requests (Netzer et al., 2016), to elicit customer preferences (Huang & Luo, 2016), to optimize search personalization (Yoganarasimhan, 2018), to forecast movie demand (Hennig-Thurau et al., 2015) and customer engagement (Lee et al., 2018), to understand the link between consumer sentiment and business outcomes (Hewett et al., 2016), or to model stock market returns based on review sentiment (Tirunillai & Tellis, 2012). Interestingly and despite these diverse research objectives, more than 70% of the publications rely on either SVM or LIWC, with LIWC being twice as popular as SVM. While such implicit conventions might benefit comparability of results, more than a third of the studies mention no rationale for their method choice. Many of the remaining ones refer to successful applications in previous publications as a rationale, in particular when applying SVM or LIWC.
Notably, many important factors such as the classification task, data source, or average text length vary fundamentally across studies. This strong reliance on such a limited set of methods suggests that computer science research on text classification performance has not been guiding text classification choices. Marketing rather appears to follow the implicit assumption that an individual method is similarly effective across applications. Regarding the dictionary LIWC in particular, it is conceptually not clear whether simple word counts can deal with complex figures of speech, e.g., litotes such as "... is really not bad", or with differences in meaning across domains (e.g., high product quality being positive, high blood pressure negative). Consequently, classification accuracy need not coincide with the popularity of LIWC in marketing research.
Computer science has taken a different direction. Here, lexicon-based approaches are less popular while additional machine learning approaches play a stronger role. We will review this research next.
2.2. Evidence from comparative studies in computer science
Computer science publications collectively cover a more diverse set of classification approaches. Due to their focus on methodological advancements such as algorithm improvements (e.g., Melville et al., 2009; Ye, Chow, Chen, & Zheng, 2009) or feature engineering (e.g., Bermingham & Smeaton, 2010; Neethu & Rajasree, 2013), individual publications compare only subsets of the relevant methodology and focus on a limited set of reference methods. For similar reasons, these publications typically demonstrate the effectiveness of novel approaches for a single or only a few datasets (e.g., Fang & Zhan, 2015).
Despite these limitations, several conclusions emerge from this work (see Web Appendix B for a detailed list): First, recent computer science studies also apply SVM, but in addition rather suggest ANN and NB than LIWC as the best possible option for text classification. Approaches such as RF, which are suggested in this literature, have not been applied in the marketing publications we have been able to identify. Few compare these classifiers to lexicon-based methods, which are frequently used in marketing. In addition, some methods such as ANN, RF, or kNN are less often compared with other text classifiers, perhaps because ANN and RF have more recently been introduced for text classification. The empirical evidence does suggest that, under certain conditions, ANN and RF can be particularly effective. We include ANN and RF in our analysis also because these methods often achieve superior performance outside text classification.
Second, the results across different studies vary in terms of the top performing method, suggesting that a single method such as SVM is unlikely to work best across application contexts. For example, Annett and Kondrak (2008) find that NB performs better than SVM, whereas Pang et al. (2002) arrive at the opposite conclusion. Although several authors propose that method performance might depend on specific properties of the dataset, such as its length (e.g., Bermingham & Smeaton, 2010) or the sample size (e.g., Dumais, Platt, Heckerman, & Sahami, 1998; Ye et al., 2009), empirical tests of these conjectures are scarce and multivariate estimates across diverse application contexts do not exist.
Third, prior research has focused on either sentiment or content classification tasks. In applied marketing settings, content classes are often identifiable by strong single signal words, while sentiment is most frequently expressed in more subtle and complex ways (e.g., involving irony or sarcasm), which can require a deeper understanding of the social media text. Pang et al. (2002) argue that content classification might be simpler than sentiment classification, although they do not test any content classification tasks themselves. To reveal the potential consequences of these differences, we include both sentiment and content classification tasks in our analysis.
3. Automated text classification methods
3.1. Conceptual overview of text classification
We distinguish between sentiment and content classification, which are both of particular interest for marketing applications. The former involves predicting the emotion of an unlabeled text document such as the following exemplary movie review, drawn from one of our datasets: "All in all, a great disappointment." In sentiment classification, the goal of the text classification methods would be to detect the emotion conveyed through this text and to correctly classify it as negative. Lexicon-based and supervised machine learning methods are the two major approaches to accomplish this task (see Appendix A for an overview of text classification problems and approaches).
Supervised machine learning methods learn either sentiment or custom content categories based on manually labeled text data and inductively construct classifiers based on observed patterns without requiring manual coding of classification rules (Dumais et al., 1998). This makes them flexible in capturing grammatical constructions specific to certain domains. In comparison, lexicon-based methods (e.g., NRC or LIWC) require expert-crafted dictionaries, consisting of elaborate word lists and associated labels, to classify text documents (Mohammad & Turney, 2010; Pennebaker, Boyd, Jordan, & Blackburn, 2015). These are often generic across domains but can be extended by custom word lists. If no suitable dictionary is available, researchers must create their own (e.g., Hansen, Kupfer, & Hennig-Thurau, 2018). As the creation of such dictionaries is cumbersome, lexical methods such as LIWC are most commonly used for two-class sentiment classification because several off-the-shelf dictionaries exist (e.g., Hennig-Thurau et al., 2015; Ordenes et al., 2017). While these methods are quick and easy to employ, they also come with drawbacks. For example, LIWC may struggle to correctly predict the negative sentiment of a post like the previous example, as the individual words "great" and "disappointment" point in two opposite emotional directions unless such phrases are included in the dictionary. In contrast, machine learning methods can learn that the word pair "great disappointment" indicates negative sentiment without the need to curate a dictionary a priori.
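To illustrate the underlying mechanism, the following short sketch (our own illustration, not code from the paper's appendices) shows how bigram tokenization in R turns the two-word phrase into a single feature a supervised learner can associate with the negative class:

review  <- c("all", "in", "all", "a", "great", "disappointment")  # tokenized review
bigrams <- paste(head(review, -1), tail(review, -1))               # adjacent word pairs
bigrams
# "all in" "in all" "all a" "a great" "great disappointment"

Once "great disappointment" is represented as one feature, its link to negative sentiment can be learned directly from labeled training examples.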
In comparison, content classification refers to the task of assigning custom category labels to new text documents, e.g., to automatically detect that YouTube comments such as "Subscribe to my channel" have a commercial rather than a user-interest background. Such tasks are potentially easier than the extraction of emotion, which often requires higher context understanding, e.g., with irony playing a larger role (Das & Chen, 2007).
In contrast to all other methods we study, latent Dirichlet allocation (LDA, Blei, Andrew, & Jordan, 2003) is an unsupervised machine learning method originally developed and applied for knowledge discovery purposes (Humphreys & Wang, 2017). In marketing research, LDA is most commonly used for explorative topic modeling or latent topic identification (e.g., Puranam, Narayan, & Kadiyali, 2017; Zhang, Moe, & Schweidel, 2017). Such types of analyses have different objectives, are conceptually and empirically not comparable to the remaining methods in terms of accurately recovering researcher-defined class labels, and are therefore beyond the scope of this investigation.
3.2. Algorithmic approaches and characteristics of text classification methods
In total, we test a set of ten text classification methods, selected for their conceptually different algorithmic approaches, their use and relevance for marketing research, and their proven performance in other disciplines. This includes five machine learning methods, i.e., ANN, kNN, NB, RF, and SVM, as well as five lexicon-based methods, i.e., AFINN (Nielsen, 2011), BING (Hu & Liu, 2004), LIWC (Pennebaker et al., 2015), NRC Emotion Lexicon (Mohammad & Turney, 2010), and Valence Aware Dictionary for Sentiment Reasoning (VADER, Hutto & Gilbert, 2014). While ANN, RF, and SVM are discriminative classifiers, NB is a generative, probabilistic classifier. In contrast, kNN is a non-parametric classifier, belonging to the family of proximity-based algorithms. We explain each of these approaches in more detail next.
ANN are the most highly parametrized method in our comparison. Neurons, which are connected to the input layer, inductively learn patterns from training data to allow predictions on test data (Efron & Hastie, 2016). The simplest form of ANN consists of only one input and output layer (perceptrons). The number of units in the output layer corresponds to the number of possible classes. Current computational capabilities enable the inclusion of multiple hidden layers in between (e.g., LeCun, Bengio, & Hinton, 2015; Sebastiani, 2002). The number of nodes in the hidden layer is linked to the complexity of the classification task (Detienne, Detienne, & Joshi, 2003). As common text classification problems represent linearly separable class structures in high-dimensional space (Aggarwal & Zhai, 2012), single-layer ANN with a non-linear activation function are most frequently applied for text classification (e.g., Moraes, Valiati, & Neto, 2013).
Due to their flexible structure, ANN can be considered particularly versatile, performing well across different classification tasks, which is likely relevant when handling noisy social media data. Moreover, ANN can learn subtle text patterns. This can be important for sentiment problems, where the link between individual word features and the class may be more complex compared with content classification tasks. However, this ability to adapt to even contradictory data and potentially better recognition of higher context tends to negatively affect the computational costs of ANN (Sebastiani, 2002). The more complex the network topology, the higher the computational time in both the training and prediction phase. While RF can be easily parallelized, ANN are more difficult to multi-thread, posing a larger optimization problem. Moreover, given their large number of parameters and complex structure, ANN are difficult to interpret intuitively and require expert knowledge for parameter tuning.
kNN is a lazy learning algorithm with no offline training phase (Yang, 1999). All training documents are stored and computation is deferred to the prediction phase (Sebastiani, 2002). For each test document, kNN ranks the nearest neighbors among the labeled examples from the training set and uses the categories of the highest-ranked neighbors to derive a class assignment. The more near neighbors with the same category, the higher the confidence in that prediction (Yang & Liu, 1999).
Computing the respective distances between all test and training documents makes kNN computationally costly when applied to high-dimensional, sparse text data (Aggarwal & Zhai, 2012), especially if the training set is large (Sebastiani, 2002). Moreover, as a non-parametric method, kNN suffers from the curse of dimensionality (Bellmann, 1961), requiring an exponentially larger number of training examples to generalize well for many features. This makes kNN prone to overfit in-sample and predict poorly out-of-sample. Thus, the relative performance of kNN is likely lower for longer texts with many features and, in turn, more favorable relative to all other methods for shorter texts.
NB is one of the simplest probabilistic classifier models (Yang, 1999). The classifier estimates a class-conditional document distribution P(d|c) from the training documents and applies Bayes' rule to estimate P(c|d) for test documents, where the documents are modeled using their terms. To efficiently compute the conditional probabilities, NB assumes all features to be independent. This naïve assumption can provide a reasonable trade-off between performance and computational costs. Domingos and Pazzani (1997) find that NB can also perform well when features are interdependent. In addition, Netzer et al. (2016) argue that the resulting generative model is easy to interpret and explain. Moreover, NB as a generative classifier may be recommended for smaller sample sizes due to its inherent regularization, making it less likely to overfit compared with discriminative classifiers (e.g., Domingos, 2012; Ng & Jordan, 2002). However, NB is not capable of modeling interaction effects among features. Thus, we expect it to perform relatively well for problems with strong individual signal words and straightforward relationships between the text features and the respective classes, e.g., for simple forms of promotion content detection (Yang, Nie, Xu, & Guo, 2006) and two-class sentiment classification exhibiting strong polarity.
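Formally (a textbook formulation added here for clarity, not quoted from the paper), a test document d represented by its terms t_1, ..., t_n is assigned the class

\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(t_i \mid c),

which follows from Bayes' rule, P(c \mid d) \propto P(c)\, P(d \mid c), combined with the conditional independence assumption P(d \mid c) = \prod_{i} P(t_i \mid c).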
RF is an ensemble learning method that grows a multitude of randomized, uncorrelated decision trees (Breiman, 2001). Each decision tree casts a vote for the class of the test example. The most popular class determines the final prediction of the RF classifier. This procedure is called bagging (Breiman, 1996). The larger the number of predictors, the more trees need to be grown for good performance. There are different ways to introduce randomness and decorrelate the individual decision trees, e.g., through random feature selection and randomly chosen data subsets (Breiman, 2001). While individual decision trees are prone to overfitting due to their high flexibility (Domingos, 2012; Sebastiani, 2002), RF overcomes this issue by combining a multitude of decision trees on heterogeneous, randomly drawn subsets of variables.
As RF is more robust to noise and outliers (Breiman, 2001), we expect consistently high performance across all social media datasets. Moreover, given their hierarchical structure, RF can learn complex interactions between features, perform automatic feature selection, and model highly non-linear data. This leads us to believe that RF can deal well with both content and more complex sentiment classification, where higher context understanding is required, as signals are subtly embedded in the text and spread across features. Lastly, the training time of RF increases linearly with the number of decision trees in the ensemble. As each tree is grown individually, processing can be easily parallelized. This makes RF scalable and computationally efficient, enabling quick training of classifiers.
SVM are discriminative classifiers, fitting a margin-maximizing hyperplane between classes. They were initially developed as binary linear classifiers (Cortes & Vapnik, 1995), but can be extended to non-linear problems of higher dimensionality through the use of kernels that can accommodate any functional form (Scholkopf & Smola, 2001). Unlike other classifiers with higher capacity to fit the training data, SVM are less likely to overfit and generalize better (Bennett & Campbell, 2000). Following research convention, we study linear kernels since they represent the most common application in text mining (e.g., Boiy, Hens, Deschacht, & Moens, 2007; Pang et al., 2002; Xia, Zong, & Li, 2011). The margin-maximizing hyperplane is determined solely by the support vectors (Sebastiani, 2002). Beyond determining the position of the discriminant plane, the support vectors carry only limited information (Bennett & Campbell, 2000). Computing the parameters of the margin-maximizing hyperplane poses a convex optimization problem (Moraes et al., 2013), a task that can be computationally costly depending on the sample size and number of features.
SVM have been shown to be effective for certain text problems such as news article categorization and sentiment prediction (e.g., Joachims, 1998; Pang et al., 2002), as they can deal well with high dimensionality (Bermingham & Smeaton, 2010; Wu et al., 2008). However, their limited representation may result in a lack of ability to model nuanced patterns in the training data (Domingos, 2012). At the same time, SVM have been argued to be less prone to overfitting (Joachims, 1998). Therefore, we expect SVM to perform similarly to a simple method like NB, but worse than more flexible methods like ANN and RF.
In addition to the supervised machine learning methods, we investigate the performance of five lexicon-based methods for sentiment classification. First, LIWC counts words belonging to a linguistic category (Pennebaker et al., 2015). For this task, LIWC uses manually created dictionaries that identify words in texts and assigns labels based on word frequencies per document. Typically, simple ratios (e.g., share of words with positive or negative emotion) or count scores (e.g., number of words) are computed based on this. The exemplary user expression "I loved it, it was really scary", drawn from a movie review dataset, has a total word count of seven words. Thereof, one is counted as positive, i.e., "loved", and one is counted as negative, i.e., "scary". In two separate columns, LIWC would report 1/7 = 14.3% for both the positive and the negative word ratio.
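A minimal R sketch of this counting logic (using toy word lists of our own; the actual LIWC dictionaries are proprietary and far larger) looks as follows:

pos_words <- c("loved", "great", "good")                    # toy positive list
neg_words <- c("scary", "bad", "disappointment")            # toy negative list
text   <- tolower("I loved it, it was really scary")
tokens <- strsplit(gsub("[[:punct:]]", "", text), "\\s+")[[1]]
pos_ratio <- sum(tokens %in% pos_words) / length(tokens)    # 1/7 = 14.3%
neg_ratio <- sum(tokens %in% neg_words) / length(tokens)    # 1/7 = 14.3%
round(100 * c(positive = pos_ratio, negative = neg_ratio), 1)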
Second, we analyze the performance of NRC, which includes both unigrams and bigrams and has seen a few recent applications in marketing research (e.g., Felbermayr & Nanopoulos, 2016). Third, we test VADER, a dictionary specialized in microblog content, incorporating special characters such as emoticons, emojis, and informal language (Hutto & Gilbert, 2014). Fourth, we include AFINN, a dictionary developed by Nielsen (2011), dedicated to analyzing microblog texts with an emphasis on acronyms such as "LOL" and "WTF". Lastly, we analyze BING, a labeled list of opinion adjectives constructed to detect the emotional orientation of customer reviews (Hu & Liu, 2004). The dictionary can cope with misspellings and morphological variations through fuzzy matching.
All dictionaries are simple to employ and will likely provide the best results for texts following a stringent train of thought with strong emotion-laden signal words. However, for noisy social media texts with a high degree of informality and netspeak, we expect LIWC to perform relatively poorly compared with dictionaries such as VADER and BING, which are specialized in informal texts and include larger lexica. Moreover, all lexicon-based methods' classification accuracies may suffer from shorter texts, as this reduces the probability of matching words from texts to off-the-shelf dictionaries. Additionally, reviews that point in different emotional directions, like the example above, pose a challenge for all dictionary methods.
Table 1
Dataset descriptions.

Classification task | Social media type | Source (publicly available at) | ID | Authors | Language | Avg. words/document | Max. sample size¹ | # features | Majority class share | # classes (DV)
Sentiment | Product review titles | Amazon (McAuley et al., 2015 for EN) | AMT | UGC | DE/EN | 3/5 | 3000 | 161/239 | 0.50 | 2 (pos, neg)
Sentiment | Product reviews | Amazon (McAuley et al., 2015 for EN) | AMR | UGC | DE/EN | 92/82 | 3000 | 3117/3374 | 0.50 | 2 (pos, neg)
Sentiment | Movie reviews | IMDb (Kotziats, Denil, De Freitas, & Smyth, 2015) | IMD | UGC | EN | 15 | 1000 | 557 | 0.50 | 2 (pos, neg)
Sentiment | Restaurant reviews | Yelp (Kotziats et al., 2015) | YEL | UGC | EN | 11 | 1000 | 480 | 0.50 | 2 (pos, neg)
Sentiment | Social network comments | Facebook | FBK | UGC | DE | 13 | 3000 | 549 | 0.33 | 3 (pos, neg, neu)
Sentiment | Corporate blog comments | Fortune 500 blogs | CBC | UGC & firm | DE | 36 | 2942 | 1274 | 0.58 | 3 (pos, neg, neu)
Sentiment | Microblog posts | Twitter | TWS | UGC & firm | EN/ES/DE | 10/11/9 | 3000 | 349/339/330 | 0.58 | 3 (pos, neg, neu)
Content | Social network comments | YouTube (Alberto, Lochter, & Almeida, 2015) | YTU | UGC | EN | 17 | 1000 | 624 | 0.50 | 2 (promotion, user communication)
Content | Text messages | Telecom provider (Almeida, Gómez Hidalgo, & Yamaki, 2011) | SMS | UGC | EN | 19 | 1000 | 861 | 0.50 | 2 (promotion, user communication)
Content | Movie reviews | Rotten Tom. & IMDb (Pang & Lee, 2004) | ROT | UGC | EN | 22 | 3000 | 855 | 0.50 | 2 (subjective, objective)
Content | Corporate blog posts | Corporate blogs | CBP | Firm | EN | 344 | 1000 | 10,170 | 0.54 | 3 (high, med, low storytelling score)
Content | Microblog posts | Twitter | TWC | UGC & firm | EN | 10 | 3000 | 358 | 0.55 | 3 (emotion, information, combination)

Note: # features for N = 1000. ¹ The maximum TWS sample sizes for ES and DE are 1000.
Given their methodological diversity, we expect relatively low inter-method correlations in terms of accuracy. This would be in line with the no free lunch theorem (Wolpert, 1996). Having said this, research on other applications beyond text classification suggests ANN and RF to be among the top performing methods given their versatile structures. NB is expected to perform well for smaller sample sizes. For larger sample sizes, we expect better performances of machine learning methods, as more data has a substantial impact on the ability to identify the different types of expressions contained in a particular dataset more comprehensively. In contrast, lexicon-based methods by definition do not benefit from additional data.
For content classification, accuracies are likely to be higher than for sentiment classification, as there is a clearer link between the word features and the respective class. Overall, we expect the highest performance for two-class content classification, as this poses the conceptually simplest task. Obviously, as the number of classes increases, classification becomes more challenging. In addition, all methods are likely to suffer from noisy data, e.g., in terms of netspeak. In contrast, data carrying strong signals, e.g., adjectives, are likely to produce better results.
4. Research design and methodology
4.1. Data collection
To understand whether these conceptual differences materialize in applied social media settings, we compare the text classification methods on 41 different social media datasets covering different sample sizes, languages, and platforms. Specifically, we have obtained three sample sizes for nine social media types (500; 1000; 3000) and analyze seven social media types in two sample sizes (see Table 1). A relevant driver of predictive accuracy is the amount of available training data, which requires human coding. We chose these sample sizes for two reasons. First, they pose a reasonable effort in terms of manual annotation. Second, similar ranges have been used in previous comparative studies (for example, Pang et al., 2002 with a dataset of 700 positive and negative reviews). Alternatively, some researchers suggest transfer learning, i.e., using labeled data from other domains (e.g., Kübler et al., 2017). However, this did not produce better results in our applications (see Web Appendix C).
The text classification problems we study represent a large variety of real-world marketing tasks in social media. Specifically, we work with short posts from microblogs (Twitter) and social networks (Facebook), short text messages, extended discussion posts from 14 different Fortune 500 blogs, product reviews and their titles from an online shop (Amazon) as well as restaurant (Yelp) and movie reviews (IMDb, Rotten Tomatoes). Our data contains both firm- and user-generated communication in three different languages, representing both colloquial and formal language.
Review titles (AMT) with an average of three to five words per document contribute the shortest texts, whereas corporate blog posts (CBP) contain the longest texts with an average of 344 words. The number of features per dataset, which is proportional to the number of unique words, varies substantially (i.e., 161 compared with more than 10,000 for AMT and CBP, respectively). For the sake of consistency and better comparability, we report the number of features for N = 1000 for all social media types although maximum sample sizes differ depending on the data source. All datasets but CBP contain user-generated content, representing typical social media application settings (e.g., understanding public sentiment).
The English microblog dataset is used for two classification tasks. Specifically, we manually code sentiment (positive, negative, neutral) and content (emotion, information, combination of both) following previous research in marketing (e.g., Akpinar & Berger, 2017). In addition, we analyze comments from Facebook, which exhibit similar text characteristics. For both, we include the neutral class, which in most real-world problems cannot be neglected (Go, Bhayani, & Huang, 2009). While one might expect higher accuracies for texts from corporate blogs compared with microblog posts, such long posts can be challenging due to high levels of information and low noise density (Bermingham & Smeaton, 2010) and therefore can be difficult to annotate even for experienced human coders (Melville et al., 2009).
Of our 12 social media types, seven have been used in prior publications and are publicly available, including four from the UCI repository and one from Pang and Lee (2004). The English Amazon reviews, i.e., AMT and AMR, have been sampled out of more than 142 million reviews from 1996 to 2014 and originate from McAuley, Targett, Shi, and Van Den Hengel (2015). To understand whether methods are capable of inferring quality assessments from unstructured texts, we transform star ratings into two classes, combining reviews with fewer than three stars into a negative class and reviews with more than three stars into a positive class, excluding reviews with three stars following prior research (e.g., Moraes et al., 2013). Web Appendix D lists representative text examples for all datasets.
4.2. Preprocessing, document representation, and method specification
Text classification methods are typically applied to preprocessed datasets, since raw text data can contain high levels of noise such as typographical errors, as frequently observed on social media (e.g., Aggarwal & Zhai, 2012). The most frequently applied steps include tokenization, case transformation, stop-word removal, term weighting, stemming, and building n-grams (Yang, 1999). The goal is to eliminate all non-informative features, which do not contribute to the underlying text classification task. This not only yields better generalization accuracy but also reduces issues of overfitting (Joachims, 1998). All detailed preprocessing steps are summarized in Web Appendix E.
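As an illustration only (the authors' exact pipeline is documented in Web Appendix E, which is not reproduced here), a typical R implementation of these steps with the tm package might look as follows:

library(tm)         # text mining framework
library(SnowballC)  # required for stemming
texts <- c("All in all, a great disappointment.",
           "Damn good steak!",
           "Subscribe to my channel")
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))       # case transformation
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stop-word removal
corpus <- tm_map(corpus, stemDocument)                       # stemming
dtm <- DocumentTermMatrix(corpus,
                          control = list(weighting = weightTfIdf))  # term weighting
inspect(dtm)

The resulting document-term matrix (optionally extended with n-gram features) serves as input to the machine learning classifiers.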
4.3. Performance assessment
We evaluate the performance of each text classification method based on its ability to develop a similar understanding as a human reader, since many marketing applications intend to capture the information communicated to other professional or non-professional users. As our primary performance measure, we compare prediction accuracies to understand how well each method represents human intuition. The accuracy on dataset i is defined as the sum of all correct predictions on the hold-out test set divided by the sum of all predictions.
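Written out (a standard definition added here for clarity), the measure is

\text{Accuracy}_i = \frac{1}{\lvert T_i \rvert} \sum_{d \in T_i} \mathbf{1}\{\hat{y}_d = y_d\},

where T_i is the hold-out test set of dataset i, y_d the human-coded label of document d, and \hat{y}_d the predicted label.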
To obtain an unbiased estimate of out-of-sample accuracy, we split each dataset into a training set (80% of the data) and a hold-out test set (20% of the data). All accuracies reported in this paper are based on predictions on the hold-out test set (see Ordenes et al., 2018 for a similar approach). Importantly, none of the methods could learn from this data during the training phase, producing unbiased performance estimates for classifier effectiveness (Domingos, 2012). If a method overfitted on the training data, it is expected to generalize poorly on the hold-out test data, producing worse accuracies than other methods with better regularization.
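A minimal R sketch of this evaluation logic (with simulated stand-in data; the paper's actual script is described in Web Appendix F) is:

library(caret)          # data splitting and tuning utilities
library(randomForest)   # RF classifier
set.seed(42)
# Toy stand-ins for a document-term matrix X and human-coded labels y
X <- as.data.frame(matrix(rpois(200 * 20, lambda = 1), nrow = 200))
y <- factor(sample(c("positive", "negative"), 200, replace = TRUE))
idx   <- createDataPartition(y, p = 0.8, list = FALSE)  # stratified 80/20 split
rf    <- randomForest(x = X[idx, ], y = y[idx])         # train on the 80% split
preds <- predict(rf, X[-idx, ])                         # predict the 20% hold-out
mean(preds == y[-idx])                                  # hold-out accuracy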
Using five-fold cross-validation, we tune the most important parameters for each method on the training set (see Web Appendix E for further details). This means that each training set is partitioned into five equal training and validation subsets. The goal of this grid search procedure is to test a reasonable set of parameter values to identify the best model for a given task based on validation accuracy. This approach is computationally more complex than simply using the default parameter values of each method across all classification scenarios. However, parameter tuning is necessary because default values may not be appropriate for all individual applications and this can vary across methods. The parameter values producing the best average accuracy across all five folds are used to fit each model on the entire training set. Lastly, we use the tuned models to make predictions on the hold-out test set and report those accuracies for each dataset.
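Continuing the sketch above (with an illustrative parameter grid, not the authors' actual tuning values), the grid search could be implemented with caret as follows:

ctrl  <- trainControl(method = "cv", number = 5)      # five-fold CV on the training set
grid  <- expand.grid(mtry = c(2, 5, 10))              # candidate values for RF's mtry
tuned <- train(x = X[idx, ], y = y[idx], method = "rf",
               trControl = ctrl, tuneGrid = grid)
tuned$bestTune                                        # parameter value with best CV accuracy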
To obtain standard errors, we run five-times repeated ten-fold cross-validation for each tuned model on the entire data, producing 50 accuracies for each method. This allows us to conduct two-tailed t-tests on the mean paired accuracy differences (see Moraes et al., 2013 for a similar approach). Web Appendix E presents the R packages we have used. Web Appendix F describes an exemplary R script containing all steps for running the machine learning methods.
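In the same illustrative setup, the error estimates and paired comparisons could be obtained as follows (again a sketch, not the script from Web Appendix F):

folds <- createMultiFolds(y, k = 10, times = 5)       # identical folds for all models
ctrl  <- trainControl(method = "repeatedcv", number = 10,
                      repeats = 5, index = folds)
fit_rf  <- train(x = X, y = y, method = "rf",
                 trControl = ctrl, tuneGrid = tuned$bestTune)
fit_knn <- train(x = X, y = y, method = "knn", trControl = ctrl)
rs <- resamples(list(RF = fit_rf, kNN = fit_knn))     # 50 accuracies per method
summary(diff(rs))                                     # paired t-tests on the differences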
5. Results
5.1. Comparison of method performances across all datasets and classification tasks
To facilitate interpretation and in the interest of parsimony, we group all similar datasets and compare similar sample sizes, resulting in 12 distinct types of social media text data. Specifically, we aggregate across languages since we do not detect a significant impact of language on classification performance for any method. Fig. 1 summarizes the resulting accuracies for all methods.
The performance of the lexicon-based methods is reported for all two-class sentiment problems. Only two out of the five dictionaries perform slightly better than the weakest machine learning algorithm (kNN), and that occurs only in three instances. However, these few instances where one of the dictionary approaches exceeds a machine learning method are due to the poor performance of kNN on these datasets, with more than 15% lower accuracy than the best performing approach. Since none of the dictionaries achieve a performance close to the winning approaches, we summarize them as the average performance in Fig. 1 and provide all details in Appendix B.
Fig. 1 suggests several conclusions. First, there are large differences in maximum accuracies between the easiest task, i.e., promotion detection in short text messages (SMS) at 94.5%, and the most difficult task, i.e., sentiment prediction of user-generated comments on corporate blog posts (CBC) at 63.5%. In line with conventional wisdom, this implies that some dependent variables are easier to predict than others. Second, the performance spread across the five machine learning methods varies across the different data sources. While the difference between the best and worst method for sentiment classification of Amazon review titles (AMT) is only 4 percentage points between ANN and kNN, the difference increases to more than 21 percentage points between the two methods for Amazon review texts (AMR). Nevertheless, across all different contexts, ANN, NB, and RF consistently achieve the highest performances.
Comparing the absolute average performance across all datasets, the results reveal that the winning methods produce the highest accuracies for two-class content classification. The top three social media types, i.e., SMS, YTU, and ROT, all belong to this kind of classification task. Evidently and due to chance alone, two-class classification has a higher likelihood of correct classifications than three-class classification. In addition, for content classification, individual words tend to be more predictive of the correct class compared with sentiment classification, where the signals are often more subtly embedded in the text. The five lowest accuracies are produced for three-class problems, including both sentiment and content classification, i.e., FBK, CBP, TWS, TWC, and CBC. All but CBP represent user-generated content; among those are two from Twitter and one from Facebook, exhibiting the highest degree of noise. This likely exerts a deteriorating effect on all methods' performances.
Comparing the relative performance across classifiers for significant differences (p < .05), RF and NB are among the best performing methods for 11 out of 12 social media types (see Fig. 1). This is consistent with conceptual conjectures. Breiman (2001), for example, argues that RF is particularly robust to noise and outliers. This may explain its relatively good performance even on the noisiest text corpora from Twitter, i.e., TWS and TWC. Interestingly, the relatively simple approach of NB is equally often among the top performing methods as RF (7 out of 12 cases). However, its variance in performance is larger and it can perform poorly relative to other methods, in particular when datasets are larger (see also Section 5.2).
SVM are considered less prone to overfitting (Joachims, 1998). However, they may miss important features by reducing all available training data down to the support vectors. Due to their limited representation (Domingos, 2012), they are not as flexible to model feature combinations, which RF can represent through hierarchical trees. Overall, SVM exhibit a significantly lower performance across all social media types compared with the winning methods, except for Amazon reviews (AMR). Here, they rank first, although not significantly different from ANN (p > .05), which is similarly versatile as RF and also has little variance in terms of relative performance. However, in the case of ANN, this comes with higher trade-offs in terms of implementational costs and interpretability, which are in turn similar to SVM.
kNN do not produce competitive accuracies across all tasks except for Amazon review titles (AMT) and comments from corporate blogs (CBC). As a non-parametric method, kNN suffer from the curse of dimensionality (Bellmann, 1961). Specifically, kNN typically overfit in-sample and, in turn, generalize and predict poorly out-of-sample. The number of training examples needed to maintain good generalizability for kNN grows exponentially with the number of features (Domingos, 2012), as kNN cannot extract a concise summary from the data like the other machine learning methods. In contrast, when there are only few features, kNN may perform reasonably well. AMT are by far the shortest text documents across all datasets with a mean of only four words, potentially explaining why kNN compare more favorably with the other methods' performances for review titles. CBC is among the most challenging classification tasks, resulting in all methods performing poorly and only slightly better than random chance. This also results in small performance differences across methods.

Fig. 1. Accuracies of automated text classification in reflecting human intuition across 12 social media types. Note: ° indicates insignificant differences between the best methods (p > .05). DICT is the average of five lexicon-based methods, i.e., LIWC, NRC, AFINN, BING, and VADER (see Appendix B for details).
Lexicon-based classifiers such as LIWC cannot learn custom classes automatically from training data. Instead, they require expert-crafted dictionaries for such purposes (e.g., Kuhnen & Niessen, 2012). Hence, LIWC, for example, is most often used for sentiment classification in marketing research (e.g., Berger & Milkman, 2012; Hennig-Thurau et al., 2015; Hewett et al., 2016). Following this research, we apply all five dictionaries, i.e., LIWC, NRC, VADER, BING, and AFINN, to all two-class sentiment classification tasks.
On average, the lexical methods lag behind all supervised machine learning methods. For example, LIWC achieves its highest average accuracy on IMD (61.5%), still worse than the weakest performing kNN. As Appendix B reveals, VADER and BING with the highest number of words perform best. As expected, dictionary size and specialization appear to indeed allow for higher levels of accuracy. Within the dictionary group, LIWC performance is average and never exceeds the best performing dictionary but also never falls below the weakest one.
For short review titles (AMT), the performance of the majority of dictionaries (LIWC, NRC, AFINN) does not exceed random chance. Similarly, VADER and BING also achieve accuracies of only 54.0% and 52.1%, barely above chance alone. Here, the overall average accuracy of all five dictionaries is 43.0%, compared with 77.6% for ANN, the best performing machine learning approach. For the entire review text (AMR), the dictionaries perform almost 13.3 percentage points better compared with AMT but still significantly worse than all other methods. This is intuitive, as the probability of finding a positive or negative word from the dictionary within a short title of only three to five words on average is much lower compared with review texts that are on average 82 to 92 words long (see Table 1). LIWC, for example, contains 620 positive and 744 negative words (Pennebaker et al., 2015). For reviews that follow a stringent logic, contain little noise, and carry emotion-laden words, dictionaries may perform well.
However, in many real-world application scenarios the sentiment of a text tends to be conveyed in less obvious, subtler ways, especially when dealing with reviews. For example, the following review clearly conveys a negative emotion and is properly classified by all five machine learning methods. Due to the lack of negative signal words, LIWC, in contrast, cannot infer the correct sentiment: "I received a completely different product with packaging that say made in Hong Kong! And the banana was only yellow. Not white and yellow like the photo." Additionally, dictionaries may struggle with phrases such as in "Damn good steak." Again, all machine learning methods predict the correct class. LIWC, however, requires manual coding to correctly classify the bigram "Damn good" and consequently assigns a score of 33.3% to both the positive ("good") and negative ("Damn") word count, resulting in an ambiguous classification.
Web Appendix G reports the inter-method correlations in terms of classification accuracy. The values range from 0.35 between ANN and kNN to 0.64 between ANN and RF. These values again suggest that individual methods arrive at different accuracies depending on the task at hand. Given their consistently high levels of flexibility and high overall performance, it is not surprising that ANN and RF exhibit the highest correlation of 0.64. kNN reveal the lowest correlations with all other methods (between 0.35 with ANN and 0.47 with RF). This finding is in line with the fundamentally different learning approach of kNN, i.e., not building a model on the training data like the other methods, but instead comparing all features of each classification document with all features of the training instances.
5.2. Multivariate analysis of the drivers of text classification performance
The previous descriptive findings and statistical tests suggest certain plausible explanations for the differences in accuracies. However, the data we investigate are taken from diverse social media settings, and the different potential explanations for the observed accuracy differences cannot be disentangled (e.g., amount of available training data, languages, text length, number of classes, or type of classification task). To investigate whether the conceptually plausible drivers of performance have an impact over and above the remaining factors, we run logistic regression models across all datasets with accuracy of predicting human coding (correct vs. incorrect) as the dependent variable, text and data characteristics as independent variables, and social media types as random intercepts to control for unobserved heterogeneity. To ensure each social media type is represented equally, we randomly sample 300 observations from all hold-out test sets. We include the interaction between the number of classes and the type of classification task since an additional sentiment class (neutral sentiment) is conceptually different from an additional content class. We also test interactions between the number of classes and the remaining variables but find no significant effects (p > .05). Table 2 reports the odds ratios of this analysis.
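As a sketch of how such a model can be estimated in R (the variable and data frame names below are hypothetical, not taken from the paper), lme4's glmer fits a random-intercept logistic regression of this kind:

library(lme4)
# preds_long: one row per hold-out prediction (hypothetical data frame)
m <- glmer(correct ~ n_classes * task_sentiment + n_words + chars_per_word +
             adjectives + clout + netspeak + language + sample_size +
             (1 | media_type),              # random intercept per social media type
           data = preds_long, family = binomial)
exp(fixef(m))                               # fixed effects expressed as odds ratios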
These results reveal that the number of classes (three vs. two) and the type of classification task (sentiment vs. content) as well as their interaction exert strong effects on accuracy for all methods. Specifically, in line with conventional wisdom, content classification tasks with three compared with two classes yield lower accuracies (OR = 0.174–0.433, p < .001 to p < .05). Across all methods except kNN, for two classes, sentiment classification is more challenging than content classification (e.g., OR = 0.372–0.481, p < .001 to p < .05). This is conceptually plausible since sentiment classification often necessitates higher context understanding (Das & Chen, 2007). For example, reviews can reflect "thwarted expectations" (Pang et al., 2002), containing more negative than positive words but overall conveying a positive sentiment.
However, a strong positive interaction between the number of classes and classification task (OR = 2.359–3.618, p < .05 to p < .001) for all methods except kNN suggests that these differences are attenuated for more than two classes. Across all methods, the interaction is highest for RF. This is due to a relatively similar performance for three-class content and three-class sentiment classification but a much better performance for two-class content than two-class sentiment classification. Put differently, the number of classes influences the accuracy of the respective methods asymmetrically: Relative to content classification, sentiment classification suffers less from increasing the number of classes. In other words, the marginal difficulty of adding an additional class is higher for content compared with sentiment classification. This finding may be due to the fact that sentiment in social media is often neutral. In such cases, assignment of binary values, i.e., positive vs. negative, can result in arbitrary conclusions. While an additional class by definition increases classification difficulty, better assignability of three-class sentiment appears to partially compensate for this. For content classification, similar effects are less likely and additional classes make classification more complex.
Several additional effects and differences between methods deserve attention. Specifically, ANN and kNN appear to suffer from longer texts (OR_ANN = 0.904, p < .05; OR_kNN = 0.859, p < .001). The average number of words per document is proportional to the total number of features underlying the document-term matrix. This increased complexity seems to particularly undermine the performance of these two methods. Suffering from too little training data relative to the number of features, kNN are likely to overfit, resulting in poor generalization on the out-of-sample test data. Recall that kNN perform relatively poorly for longer Amazon review texts (AMR) compared with short Amazon review titles (AMT), as can be seen from Fig. 1. In terms of average word length, we find no significant effects, except for NB, where longer words seem to slightly hurt performance (OR_NB = 0.924, p < .05).
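To see why longer documents inflate the feature space that distance-based learners such as kNN must cope with, the minimal sketch below builds document-term matrices with scikit-learn (an implementation choice of this illustration, not prescribed by the paper) and inspects their dimensionality.

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = ["great phone", "battery died fast"]
reviews = ["great phone, the screen is bright, the battery lasts two days and the camera is sharp",
           "battery died fast, support was slow, shipping took weeks and the case cracked"]

for name, docs in [("titles", titles), ("reviews", reviews)]:
    dtm = CountVectorizer().fit_transform(docs)   # rows = documents, columns = vocabulary terms
    print(name, dtm.shape)                        # longer texts -> larger vocabulary -> more features
```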
We control for specific text characteristics that may send particularly strong signals or induce noise. Regarding the former, all methods but RF benefit from the presence of adjectives, which are likely to contain information facilitating classification tasks. Given the strong performance of RF across all classification tasks, the absence of this effect may also indicate that RF is less dependent on such strong signal words, but instead can also detect and interpret more subtle features or feature combinations (Humphreys & Wang, 2017). Clout, a variable signaling expertise and confidence of the sender, exhibits a small positive effect for ANN and NB (OR_ANN = 1.004, p < .05; OR_NB = 1.005, p < .05). This is intuitive as a low score represents a more tentative and insecure writing style (Pennebaker et al., 2015), likely to be correlated with less precise and structured communication, which in turn may be more difficult to classify correctly. In contrast, netspeak (e.g., "4ever", "lol", and "b4") and nonfluencies (e.g., "hm", "umm", and "uh") do not show significant effects. For netspeak, however, all coefficients are negative, which directionally suggests that noisier social media data may lead to worse performance.
We also control for language and sample size when estimating these models. According to these results, none of the assessed methods are sensitive to the language of the text corpora (all p > .05), suggesting that text classification performance is comparable across Germanic (e.g., German) and Romanic languages (e.g., Spanish) as well as English. Regarding sample size, three methods exhibit a significant learning curve. Among those, ANN and RF benefit the most (OR_ANN = 1.259, p < .05; OR_RF = 1.262, p < .05) and SVM the least from adding additional training data (OR_SVM = 1.225, p < .05). In contrast, NB and kNN appear not to significantly improve predictive performance for larger sample sizes or, conversely, will perform relatively better for smaller datasets (p > .05).

Table 2
Random-intercept logistic regression on method performance as a function of task and text characteristics.
Dependent variable: hold-out accuracy

                                                ANN       kNN       NB        RF        SVM
Task characteristics
  # classes (1 = 3 classes, 0 = 2 classes)      0.204***  0.433*    0.244***  0.179***  0.174***
  Task (1 = sentiment, 0 = content)             0.481*    0.643     0.469***  0.372***  0.441***
  Interaction # classes × Task                  2.861*    2.056     2.359**   3.618***  3.010***
Text characteristics
  # words (in 100)                              0.904*    0.859***  0.942     0.948     0.994
  # characters per word                         0.974     1.012     0.924*    0.969     1.011
  Language (1 = non-English, 0 = English)       1.090     0.782     0.994     0.967     1.127
Text signals
  Adjectives (e.g., happy, free)                1.013*    1.015**   1.021***  1.010     1.011*
  Clout (signaling expertise)                   1.004*    1.002     1.005*    1.002     1.003
Text noise
  Netspeak (e.g., lol, 4ever)                   0.992     0.991     0.992     0.994     0.994
  Nonfluencies (e.g., hmm, uh)                  0.989     1.003     0.983     1.011     1.048
Sample size (1 = N > 1,000, 0 = N ≤ 1,000)      1.259*    1.058     1.033     1.262*    1.225*

Note: N = 3,600; *p < .05, **p < .01, ***p < .001 (two-tailed). All effects reported as odds ratios. Text signals and noise variables are operationalized with LIWC.
5.3. Post-estimation analysis
Based on the logistic regression results, we perform a more detailed post-estimation analysis based on the two most critical drivers of method performance: the number of classes and the type of classification task. Fig. 2 summarizes average accuracies across all 41 datasets per method.
Overall, the depicted patterns illustrate how the number of classes moderates the effect of classification task on classification accuracy. As indicated earlier, for the two-class scenarios, classifying sentiment tends to be more challenging than classifying content for all methods. For classification problems with three classes, the differences in performance between content and sentiment classification are less pronounced for all methods but kNN. Regarding ANN, RF, and NB, the differences between three-class sentiment and content are not statistically significant (p > .05). Conversely, and from the perspective of the number of classes, means
drop by more than 20 percentage points across all methods when increasing the number of classes for content classification (from 85.8% to 64.7%). However, they drop by only slightly more than 10 percentage points for sentiment classification (from 74.2% to 63.1%). Since additional classes contain additional information and social media communication is not exclusively negative or positive, three-class sentiment classification appears a more reasonable choice than two-class sentiment. This provides a more nuanced sentiment understanding while still allowing for high levels of classification accuracy. For example, ANN results in 77.5% classification accuracy for a 3,000-document dataset with three sentiment classes.
Most important for method selection, there are noteworthy performance differences of the methods across the four scenarios that inherently reflect the functioning of the respective learning methods. Overall, RF shows robustly high performance across all four contexts. This is consistent with findings from other classification domains and suggests that RF is a particularly versatile approach also for social media text classification (e.g., Caruana & Niculescu-Mizil, 2006; Fernández-Delgado et al., 2014).
NB, on the other hand, despite lacking the ability to interact features and making the naïve assumption that all features are independent, performs unexpectedly well for a large variety of classification tasks. For the two-class condition of content classification, its performance is not significantly different from that of RF at p < .05. Also for two-class sentiment and three-class content classification it compares favorably to more flexible methods like ANN and RF. For all three tasks, texts tend to contain strong signals indicating the correct class, especially for content classification. However, also for the binary sentiment classification problems, NB performs surprisingly well, perhaps because binary sentiment classification requires less complex feature combinations than higher-order classification tasks. In contrast, when adding the neutral class, more nuanced context understanding may be required to distinguish neutral comments from positive and negative comments. For this task, RF performs significantly better than all remaining methods on average. ANN exhibit a similar performance to NB, except for two-class content where they perform slightly worse.
Recall that SVM is the most frequently used supervised machine learning approach in marketing and text classification. However, SVM ranks second to last across all four classification scenarios and performs particularly poorly when the number of classes increases. While SVM appear to work reasonably well for some specific problems such as the classification of journalistic and medical articles, which typically contain stringent communication and low degrees of informality (e.g., Joachims, 1998), they may miss important text signals relative to other methods by reducing all training data down to the support vectors. This can reduce their ability to capture more subtle information, e.g., when multiple features need to be interacted to produce the correct prediction. In line with this, SVM lags behind by the largest difference for the three-class sentiment condition (60.5% for SVM vs. 66.6% for RF), where slight text nuances may determine whether a microblog post is positive, negative, or neutral.

Fig. 2. Accuracies for different numbers of classes (three vs. two) and classification tasks (sentiment vs. content). Note: Superscripts indicate statistically insignificant differences (p > .05).
6. Predictions based on actual consumer behavior and economic consequences of suboptimal method choices
So far, we have compared the predictive performance of ten text classification methods against the benchmark of human judgment. This approach mirrors the current focus of marketing research, which mainly uses text classification techniques to explore the sentiment of different social media activities (e.g., Hennig-Thurau et al., 2015; Homburg, Ehm, & Artz, 2015; Tirunillai & Tellis, 2012) or classifies communication content based on theoretically meaningful categories (e.g., Ghose et al., 2012; Ordenes et al., 2018). In marketing practice, machine learning techniques are employed for cost reduction purposes as well as revenue growth (e.g., reducing the costs of market research or providing more tailored user experiences). We investigate whether our previous results generalize to such settings. This also allows us to understand the economic consequences of suboptimal method choices. Specifically, we study three exemplary application scenarios: (1) cost reductions by automating customer service classification tasks for an online travel agency, (2) the impact of social media communication on online demand, and (3) the impact on website visits. These three content classification tasks vary in their number of classes, i.e., binary classes for application scenarios 2 and 3 vs. five classes for application scenario 1. As we work with custom classes, lexical classifiers are not applicable to these settings. Fig. 3 presents the methods' classification accuracies for all three scenarios. Overall, the relative performance mimics our previous results, i.e., across all scenarios either RF or NB performs best.
6.1. Application scenario 1: cost reduction for an online travel agency
The first application evaluates how textual classification can enable faster and more efficient processing of customer requests. Our data covers a sample of 3,000 incoming customer emails to an online travel agency that offers leisure activities, tours, as well as tickets for local sights and attractions to tourists. Their customer service department of 150 service representatives is structured in five teams mirroring typical customer queries (i.e., questions about activities and the booking process, questions about an existing booking, cancellations, booking amendments, and complaints). Currently, the platform receives an average of 17,000 emails per week, and due to the company's fast organic growth over the last two years, emails are still opened unguided by any available service representative, who skims through the query before forwarding the customer inquiry to the responsible service team. We analyze how well the machine learning methods perform on this content classification task and observe substantial performance differences between the methods (see Fig. 3). Consistent with our previous findings on multi-class content classification, RF outperforms all other methods with an accuracy of 70.7%, surpassing kNN as the worst method by about 20 percentage points. Note, all methods surpass random chance of 20% to a considerable degree.
Considering the 17,000 customer emails the platform receives each week and assuming an average of 30 seconds for the right allocation of a query, a work week of 40 hours, and 46 work weeks per year, the yearly number of hours for manual inquiry handling equals 6,517 (or an equivalent of more than 3.5 full-time service representatives). For a gross salary of €43,000, this amounts to a theoretical maximum of €152,292 in cost savings for a classifier with perfect predictions. Applied to the accuracies we observe, the best performing method (i.e., RF with 70.7%) can save more than 3,300 hours of yearly classification work compared with the random-chance baseline of 20%, equaling an overall annual cost reduction of €77,273. Compared with kNN, the higher precision of RF materializes in over €30,000 savings per year, i.e., about €1,500 per percentage point increase in accuracy. Hence, suboptimal method choice can result in relevant cost consequences even for small- to medium-sized companies.
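For transparency, the short sketch below reproduces this back-of-the-envelope calculation. All figures come from the scenario description above, except the kNN accuracy, which is approximated as roughly 20 percentage points below RF, and the currency, which is assumed to be euros.

```python
emails_per_week, seconds_per_email = 17_000, 30
work_weeks, hours_per_week = 46, 40
gross_salary_eur = 43_000

annual_hours = emails_per_week * work_weeks * seconds_per_email / 3600    # ~6,517 hours of manual routing
fte = annual_hours / (work_weeks * hours_per_week)                        # ~3.5 full-time equivalents
cost_per_hour = gross_salary_eur / (work_weeks * hours_per_week)          # ~23.4 EUR per hour
max_savings = annual_hours * cost_per_hour                                # ~152,000 EUR for a perfect classifier

acc_rf, acc_knn, chance = 0.707, 0.507, 0.20                              # kNN accuracy approximated
savings_rf = (acc_rf - chance) * annual_hours * cost_per_hour             # ~77,000 EUR per year vs. chance
savings_rf_vs_knn = (acc_rf - acc_knn) * max_savings                      # ~30,000 EUR per year vs. kNN
print(round(annual_hours), round(fte, 1), round(max_savings), round(savings_rf), round(savings_rf_vs_knn))
```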
Fig. 3. Accuracies across three field application scenarios. Note: RF exhibits significantly better accuracies than all other methods at p < .05, except for application scenario 2, where the difference between NB and RF is not significant.
6.2. Application scenario 2: demand prediction in an online music network
Many firms are interested in predicting the demand for their products. Social media data have frequently been used in marketing research as a basis to make such predictions through "social listening" (e.g., Barasch & Berger, 2014; Schweidel & Moe, 2014). The underlying assumption is that people say what they think and that this is closely related to their future behavior (e.g., Hewett et al., 2016; Kannan & Li, 2017).
To investigate the consequences of suboptimal method choice, we study a social network geared towards newcomer musicians and music fans. Music fans who seek entertainment and information visit the profiles of artists, can connect with them, and download their songs. The artists, in turn, use the platform to establish a fan base and audience for their songs, to promote upcoming concerts and new song releases, and to create electronic word-of-mouth. Moreover, they can engage with their fans and increase visibility by writing public comments on their fans' profile pages.
The number of song downloads of an artist in the subsequent period serves as the primary marketing outcome variable of interest. Individual song demand follows a highly skewed distribution with a significant long tail of weakly demanded songs. We median-split the number of monthly song downloads to differentiate between successful and less successful songs and study whether the monthly communication in the artist social network is predictive of success in the subsequent month. Our data contains 691 communication texts for 441 music artists. Note that the communication text is Swiss German, a colloquial German dialect, which sometimes lacks orthographic conventions, suggesting increased classification difficulty.
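As a minimal illustration of the median split described above (the data frame and column names are hypothetical, not the authors' actual data):

```python
import pandas as pd

# Hypothetical artist-month data; "downloads_next_month" stands in for the platform's demand measure.
df = pd.DataFrame({"artist_id": [1, 2, 3, 4],
                   "downloads_next_month": [12, 480, 35, 9_100]})
median = df["downloads_next_month"].median()
df["success"] = (df["downloads_next_month"] > median).astype(int)  # 1 = above-median ("successful") songs
print(df)
```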
Despite this, all methods clearly perform better than random chance in predicting the song success of an artist based on social media communication (see Fig. 3). NB and RF are the winning methods with around 75% accuracy, whereas kNN shows the worst performance at 66.7%. This mirrors our findings on two-class content classification (Fig. 2).
Based on the mean downloads for the high and low class as well as the respective method accuracies, we can quantify the forecast error for each method (further details in Web Appendix H). For this two-class example, two types of errors are possible. Specifically, the classifiers can assign low-class labels to the above-median class (false positive, FP) and vice versa (false negative, FN), using the above-median class as the reference category. We assume equally high costs of over- and underestimating future demand and therefore study the absolute deviation from the true means (false positives and false negatives contribute equally).
Recall, the distribution of song downloads is highly skewed, resulting in a ratio of slightly below eleven between the means of the below- and above-median class, i.e., 3,094 and 33,770.¹ This corresponds to a potential forecast error (difference between both classes) of 30,676 downloads, or a forecast error of 307 downloads per percentage point difference in predictive accuracy. Assuming text classification accuracy is comparable across music networks and further assuming an average price of €0.99 for a song on typical platforms such as iTunes, the differences in predictive accuracy between the best method (i.e., NB with 75.4%) and the worst method (i.e., kNN with 66.7%) can result in a significant forecast error of economic demand of 2,669 song downloads per month, or €31,705 in annual revenues. Note, this is likely a conservative estimate based on the aforementioned download amounts. Higher average downloads would result in higher consequences.
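The sketch below traces this forecast-error calculation; the class means and accuracies are taken from the text, and the euro figure assumes the €0.99 price per song mentioned above.

```python
mean_low, mean_high = 3_094, 33_770               # (scaled) mean downloads of below-/above-median songs
error_range = mean_high - mean_low                # 30,676 downloads of potential forecast error
error_per_point = error_range / 100               # ~307 downloads per percentage point of accuracy

acc_nb, acc_knn = 0.754, 0.667                    # best vs. worst method in this scenario
monthly_gap = (acc_nb - acc_knn) * error_range    # ~2,669 downloads per month
annual_revenue_gap = monthly_gap * 12 * 0.99      # ~31,700 EUR per year at 0.99 EUR per download
print(round(error_per_point), round(monthly_gap), round(annual_revenue_gap))
```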
6.3. Application scenario 3: forecast of corporate blog reach
Many companies attempt to operate proprietary social media such as corporate blogs with the objective of reaching as large an audience as possible. We analyze the communication data of 14 Fortune 500 blogs to evaluate the impact of corporate social media activity on future reach. Specifically, we analyze how the content of firm-generated blog posts drives the number of future visitors returning to subsequent posts. As in the previous case, we median-split the number of unique visitors of a given post, resulting in a balanced dataset with two classes. The high- and low-reach posts attract mean visits of 672 and 32, respectively. Again, we assume both types of forecast errors are equally costly (see Application scenario 2), i.e., the total potential forecast error of erroneous classification is 640 visitors.
According to Fig. 3, the average absolute accuracy is slightly lower compared with the previous scenario, presumably because of the weaker relationship between the blog content and the dependent variable. Nevertheless, RF and ANN achieve accuracies above 65% (i.e., 66.7% and 65.1%, respectively), whereas kNN again performs worst and only slightly above random chance (53.1%). This suggests that firms can obtain meaningful assessments of the likely impact of a post using automated text analysis and thereby optimize their activities across employees.
Following the same computation as in application scenario two, we quantify the forecast error to translate the performance differences into economic consequences. Assuming 150 blog posts annually per company, choosing RF instead of kNN for predicting future reach reduces the forecast error by more than 13,000 visitors. Also, assuming a conservative value of €2 per visit due to the benefits of earned media and organic traffic, i.e., reduced marketing costs (e.g., search engine advertising), suboptimal classifier choice translates to an economic impact of more than €26,000 when comparing the best to the worst method.
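Applying the same logic as in the previous sketch (scenario figures from the text; the €2 per visit is the conservative assumption stated above):

```python
mean_low, mean_high = 32, 672                       # mean unique visitors of low- vs. high-reach posts
error_per_post = mean_high - mean_low               # 640 visitors of potential forecast error per post
acc_rf, acc_knn = 0.667, 0.531
posts_per_year = 150
visitor_gap = (acc_rf - acc_knn) * error_per_post * posts_per_year   # ~13,000 visitors per year
value_gap_eur = visitor_gap * 2                                      # ~26,000 EUR at 2 EUR per visit
print(round(visitor_gap), round(value_gap_eur))
```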
¹ Note, the means are scaled by a constant factor, as we are not permitted to publish absolute demand information.
7. Discussion
Given the constantly growing stream of social media data, automatic classification of unstructured text data into sentiment classes or other theoretically or practically relevant researcher-defined content categories is likely to continue attracting attention. Research in marketing has gravitated towards SVM and lexicon-based methods such as LIWC for text classification. Marketing research is typically interested in how communication content appears to and affects human readers. Moreover, when choosing a text classification method, trade-offs between interpretability of results, economic relevance of differences, and implementational costs are of particular concern.
The results of our analysis both confirm the conjecture of computer science that no single method performs equally well across all application settings and also suggest simple heuristics for making practically meaningful method choices without requiring extensive method comparisons for each application. In particular, RF, which is underrepresented in text classification research both within and outside of marketing so far, is versatile and performs well across most application contexts, especially for three-class sentiment classification, which is a relevant application for marketing research (Fig. 2). Despite its conceptual simplicity, NB has also provided high accuracies in recovering human intuition. In our regression, NB and kNN are the only methods where smaller sample sizes do not result in reductions of performance (Table 2). To illustrate the implications of this, consider application scenario 2, which contained the least amount of training data and resulted in the best performance of NB out of all methods, also slightly better than RF (Fig. 3). Similarly, NB significantly outperforms all other methods in classifying the two-class sentiment of movie reviews (IMD), where the maximum sample size available to us is limited to 1,000 observations. Focusing on NB and RF for text classification would have identified the best approach for all three practical applications as well as for 11 out of 12 social media types.
AMR is the only exception to this. It is the only example where SVM is among the best performing methods, together with ANN. However, ANN is clearly more versatile and among the best performing methods for 7 out of 12 social media types. When implementational costs and interpretability are of lower concern, ANN is therefore an additional promising candidate marketing research may wish to consider. Recall, according to the multivariate regression, ANN and RF benefit most from larger sample sizes. Therefore, ANN appears particularly relevant for large datasets.
Overall, our findings are in contrast to the emphasis on lexical classifiers in marketing research. These perform consistently and considerably worse compared with the best algorithms in recovering human intuition. All five dictionaries also show strongly inferior accuracies compared with similarly simple and intuitive approaches such as NB.
Moreover, our results confirm that social media marketing applications also require context-specific method choices to find optimal solutions. However, focusing on NB and RF appears to be a reasonable trade-off between the objectives of interpretability, implementational costs, and economic relevance. If the former objectives are a smaller concern relative to accuracy, a more exhaustive method comparison can be of interest. A few recent marketing publications such as Netzer et al. (2016) have followed this approach. For their datasets, they arrive at similar conclusions as Fig. 2 suggests, i.e., they find that NB performs best for two-class content classification.
Another intuitively appealing solution, also followed by Netzer et al. (2016), is applying method ensembles, which entails training multiple classifiers. Majority votes, i.e., choosing the class the majority of methods predicts, are a simple way of accomplishing this (e.g., Xia et al., 2011). Despite their conceptual appeal, such approaches are both particularly complex and time-consuming to estimate as well as challenging to interpret and implement. In particular, they require parameter tuning for each individual method, selecting appropriate methods to include, and additionally choosing an appropriate form of aggregation. This is a combinatorial search problem in discrete space that is many times more complex than optimizing parameters for a single method. Even if parallel processing is used for each method, computation time is at least as slow as the slowest method in the ensemble, and identifying appropriate ensemble-level choices takes additional training time. Similarly, interpretation is many times more challenging than for the most complex method of the ensemble. We have experimented with majority-vote ensembles of the approaches we have covered but did not achieve better results in the applications we study. This is in line with prior research on text classification both within and outside of marketing, which also relies on a single approach and applies ensembles only in few exceptions (e.g., Lee et al., 2018; Neethu & Rajasree, 2013). Considering all marketing objectives, reliance on RF, NB, and potentially ANN is likely to result in an acceptable trade-off between efficiency and predictive performance.
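As a brief illustration of the majority-vote idea discussed above, the following sketch combines three of the studied learners using scikit-learn's hard-voting ensemble; the tiny toy corpus and all parameter values are purely illustrative and do not reflect the paper's configuration.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

texts = ["great product, love it", "terrible support, never again",
         "arrived on time, works fine", "broken on arrival, very disappointing"]
labels = ["positive", "negative", "positive", "negative"]

ensemble = make_pipeline(
    CountVectorizer(),
    VotingClassifier(
        estimators=[("rf", RandomForestClassifier(n_estimators=200)),
                    ("nb", MultinomialNB()),
                    ("ann", MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000))],
        voting="hard"))  # hard voting = each classifier casts one vote; the majority class wins
ensemble.fit(texts, labels)
print(ensemble.predict(["the support team was great"]))
```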
In terms of the computational costs of individual methods, researchers and practitioners may wish to consider how efficiently the individual methods can be implemented and parallelized. Whereas ANN and SVM involve more complex optimizations, a higher sample size and a larger number of word features drive computation time. In contrast, RF can be easily parallelized (Breiman, 2001). Although a few parallel implementations of SVM exist, SVM typically face scalability issues both in terms of the memory and processing time required (e.g., Chang et al., 2008). In our applications, RF trains on average about four times faster than SVM across all our datasets. For our real-world classification problem from the online travel agency, SVM trains more than 30 times slower than RF and even slower than ANN. These differences can quickly amount to hours if not days of training time in actual application settings. Given the overall poor performance of SVM, longer computational times do not necessarily generate better results. Consequently, approaches such as NB or RF do not require many trade-offs and are appealing both in terms of training time and predictive performance.
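The snippet below illustrates this kind of training-time comparison on a public text dataset; the dataset, feature settings, and resulting timings are illustrative and not the ones used in the paper, and relative speed differences depend heavily on data size.

```python
import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

data = fetch_20newsgroups(subset="train", categories=["rec.autos", "sci.med"])
X = TfidfVectorizer(max_features=5000).fit_transform(data.data)
y = data.target

for name, clf in [("RF (all cores)", RandomForestClassifier(n_jobs=-1)),
                  ("SVM", SVC(kernel="linear"))]:
    start = time.time()
    clf.fit(X, y)
    print(f"{name}: {time.time() - start:.1f} s")   # timings are machine- and data-dependent
```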
In addition to this, the costs associated with interpretation and communication of results as well as the number of parameters requiring tuning drive application costs. In that respect, the conditional probabilities of NB further favor its use and allow for an intuitive interpretation and explanation. In contrast, SVM and ANN can be considered "black box" methods due to their complex structures, making interpretations costlier (see Table 3 for a summary of performance, implementational costs, and interpretability consequences).

Table 3
Objectives and characteristics of individual methods of automated text classification.

Performance (accuracy, versatility)
  ANN: High, esp. for large sample sizes
  kNN: Low, esp. for growing number of features (curse of dimensionality)
  NB: High, esp. for small sample sizes and content classification
  RF: High, esp. for all three-class sentiment problems
  SVM: Medium, relatively low for three-class problems
  DICT: Low, esp. for high degree of noise, e.g., colloquial language

Implementational costs (expert knowledge, computational requirements)
  ANN: High, esp. for complex network topology (e.g., number of neurons, hidden layers)
  kNN: Medium, costs are deferred to the slow prediction phase (lazy learner)
  NB: Low due to naïve assumption of independent features
  RF: Low as conceptually accessible, and easy parallelization of individual decision trees
  SVM: Medium to high for set-up, but training difficult to scale and parallelize
  DICT: Low costs for initialization, no training time when using off-the-shelf dictionaries

Interpretability (comprehensibility, comparability)
  ANN: Low due to being highly parameterized, difficult to tune and interpret parameters
  kNN: High due to low number of parameters, but difficult to interpret in high-dimensional space
  NB: High due to low number of parameters (i.e., Laplace factor) and accessible interpretation of conditional probabilities
  RF: High due to few core parameters, intuitive for individual trees, intuitive feature importance
  SVM: Medium due to few core parameters to tune, but support vectors conceptually difficult to interpret
  DICT: High due to intuitive word counts, no parameters to tune
There are of course limitations to this research. First, while we have analyzed an exhaustive set of the major social media platforms and real-world problems, results may differ for other types of text sources, languages, and classification objectives. Second, given the large number of analyses we run, we focus on the most important parameters for tuning. This follows the few prior method comparisons that have considered tuning (e.g., Joachims, 1998). Still, an even more extensive optimization may produce different results. Third, we also follow prior comparisons by applying standard procedures in terms of preprocessing and document representation. There are many ways of tailoring these steps to the specific task at hand, which can improve performance. Overall, our results can be viewed as a lower performance boundary, which further emphasizes the potential of automated text classification. In addition, there are algorithms and dictionaries beyond the ones we study. However, we believe we cover a representative set of both machine learning and lexical methods.
Moreover, commercial alternatives have recently appeared that marketing researchers and practitioners can use to generate insights through automated natural language processing of unstructured texts, e.g., from Microsoft, Amazon, or Google. This research has been limited to the types of methods applied in prior publications. Conceptually, these commercial solutions require fewer or no training examples and less technical expertise (e.g., to implement and calibrate the machine learning methods). However, they are also less specialized for the task and text domain at hand. Consequently, they may not be optimally suited for the specific questions marketing researchers may pose, e.g., when dealing with special types of texts or custom classes. For example, the Google Cloud Natural Language API can classify a generic set of several hundred content classes, which will often not match the more specific interests of marketing research. However, the field is developing rapidly and we strongly encourage readers to monitor the development of such commercial services.
In many ways, the results of this research represent a middle ground between marketing research and computer science. While the latter emphasizes exhaustive method comparisons for each application, the former, applied marketing perspective must also consider application efficiency, standardization, and comparability. In extension to the computer science literature, our results suggest that inferior method choices can indeed result in important economic consequences. At the same time and with very few exceptions, performance differences between RF, ANN, and NB are small and likely subordinate for most research applications. According to our results, choosing between RF and NB based on sentiment vs. content classification as well as the number of classes appears a reasonable trade-off between efficiency, comparability, and accuracy. We hope these findings make sound automated text classification more approachable to marketing researchers and encourage future research to integrate social media communication as a standard component in econometric marketing models.
Funding
This work was funded by the German Research Foundation (DFG) research unit 1452, "How Social Media is Changing Marketing", HE 6703/1-2.
Acknowledgements
The authors thank Chris Biemann, Brett Lantz, Julius Nagel, Robin Katzenstein, Christian Siebert, and Ann-Kristin Kupfer for
their valuable feedback and suggestions.
Appendix A. Approaches for automated text analysis
Appendix B. Dictionary characteristics of five lexicon-based methods and classification accuracies

Method   Author(s)                   Type       Positive words   Negative words   AMR    YEL    AMT    IMD
AFINN    Nielsen (2011)              Strength   878              1,598            56.8   51.8   46.1   62.0
BING     Hu & Liu (2004)             Polarity   2,006            4,783            58.2   62.3   52.1   68.2
LIWC     Pennebaker et al. (2015)    Polarity   620              744              54.0   53.0   39.2   61.5
NRC      Mohammad & Turney (2010)    Polarity   1,070            814              48.1   45.0   23.8   53.0
VADER    Hutto & Gilbert (2014)      Strength   3,344            4,173            64.4   63.5   54.0   67.8

Note: The AMR, YEL, AMT, and IMD columns report classification accuracies in %. We test VADER in Python and AFINN, BING, and NRC as implemented in the syuzhet package in R. For NRC and LIWC we evaluate all Amazon datasets in both English and German. To make the results of all dictionaries comparable, we convert the sentiment strength scales of AFINN and VADER to binary polarities.
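As a hedged illustration of the conversion mentioned in the note, the sketch below maps VADER's compound strength score to a binary polarity; the zero cutoff is an assumption made for this illustration, since the exact threshold used is not stated here.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
compound = analyzer.polarity_scores("The battery life is amazing, totally worth it!")["compound"]
polarity = "positive" if compound >= 0 else "negative"   # collapse the strength scale into a binary polarity
print(compound, polarity)
```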
Appendix C. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.ijresmar.2018.09.009.
References
Aggarwal, C. C., & Zhai, C. (2012). A survey of text classification algorithms. Mining text data (pp. 163–222). Boston, MA: Springer.
Akpinar, E., & Berger, J. (2017). Valuable virality. Journal of Marketing Research, 54(2), 318–330.
Alberto, T. C., Lochter, J. V., & Almeida, T. A. (2015). Comment spam filtering on YouTube. Proceedings of the 14th IEEE international conference on machine learning and applications.
Almeida, T. A., Gómez Hidalgo, J. M., & Yamaki, A. (2011). Contributions to the study of SMS spam filtering: New collection and results. Proceedings of the 2011 ACM symposium on document engineering.
Annett, M., & Kondrak, G. (2008). A comparison of sentiment analysis techniques: Polarizing movie blogs. Conference of the Canadian Society for Computational Studies of Intelligence. Berlin, Heidelberg: Springer.
Barasch, A., & Berger, J. (2014). Broadcasting and narrowcasting: How audience size affects what people share. Journal of Marketing Research, 51(3), 286–299.
Bellmann, R. E. (1961). Adaptive control processes: A guided tour. Princeton, NJ: Princeton University Press.
Bennett, K. P., & Campbell, C. (2000). Support vector machines: Hype or hallelujah? ACM SIGKDD Explorations Newsletter, 2(2), 1–13.
Berger, J., & Milkman, K. L. (2012). What makes online content viral? Journal of Marketing Research, 49(2), 192–205.
Bermingham, A., & Smeaton, A. F. (2010). Classifying sentiment in microblogs: Is brevity an advantage? Proceedings of the 19th ACM international conference on information and knowledge management.
Blei, D. M., Andrew, N. G., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(1), 993–1022.
Boiy, E., Hens, P., Deschacht, K., & Moens, M. F. (2007). Automatic sentiment analysis in online text. ELPUB (pp. 349–360).
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Caruana, R., & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. Proceedings of the 23rd international conference on machine learning.
Cavanaugh, L. A., Bettman, J. R., & Luce, M. F. (2015). Feeling love and doing more for distant others: Specific positive emotions differentially affect prosocial consumption. Journal of Marketing Research, 52(5), 657–673.
Chang, E. Y., Zhu, K., Wang, H., Bai, H., Li, J., Qiu, Z., & Cui, H. (2008). Parallelizing support vector machines on distributed computers. Advances in Neural Information Processing Systems, 257–264.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Das, S. R., & Chen, M. Y. (2007). Yahoo! for Amazon: Sentiment extraction from small talk on the web. Management Science, 53(9), 1375–1388.
Detienne, K. B., Detienne, D. H., & Joshi, S. A. (2003). Neural networks as statistical tools for business researchers. Organizational Research Methods, 6(2), 236–265.
Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78–87.
Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2), 103–130.
Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. Proceedings of the seventh international conference on information and knowledge management.
Efron, B., & Hastie, T. (2016). Computer age statistical inference. Cambridge, MA: Cambridge University Press.
Fang, X., & Zhan, J. (2015). Sentiment analysis using product review data. Journal of Big Data, 2(5), 1–14.
Felbermayr, A., & Nanopoulos, A. (2016). The role of emotions for the perceived usefulness in online customer reviews. Journal of Interactive Marketing, 36, 60–76.
Fernández-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1), 3133–3181.
Ghose, A., Ipeirotis, P. G., & Li, B. (2012). Designing ranking systems for hotels on travel search engines by mining user-generated and crowdsourced content. Marketing Science, 31(3), 493–520.
Go, A., Bhayani, R., & Huang, L. (2009). Twitter sentiment classification using distant supervision. CS224N project report, 1(12), 1–6. Stanford.
Hansen, N., Kupfer, A. K., & Hennig-Thurau, T. (2018). Brand crisis in the digital age: The short- and long-term effects of social media firestorms on consumers and brands. International Journal of Research in Marketing, 1–51 (forthcoming). https://www.sciencedirect.com/science/article/abs/pii/S0167811618300351
Hennig-Thurau, T., Wiertz, C., & Feldhaus, F. (2015). Does Twitter matter? The impact of microblogging word of mouth on consumers' adoption of new movies. Journal of the Academy of Marketing Science, 43(3), 375–394.
Hewett, K., Rand, W., Rust, R. T., & van Heerde, H. J. (2016). Brand buzz in the echoverse. Journal of Marketing, 80(3), 1–24.
Homburg, C., Ehm, L., & Artz, M. (2015). Measuring and managing consumer sentiment in an online community environment. Journal of Marketing Research, 52(5), 629–641.
Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining.
Huang, D., & Luo, L. (2016). Consumer preference elicitation of complex products using fuzzy support vector machine active learning. Marketing Science, 35(3), 445–464.
Humphreys, A., & Wang, R. J.-H. (2017). Automated text analysis for consumer research. Journal of Consumer Research, 44(6), 1274–1306.
Hutto, E., & Gilbert, C. J. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. Eighth international conference on weblogs and social media (ICWSM-14).
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98, 137–142.
Kannan, P. K., & Li, H. A. (2017). Digital marketing: A framework, review and research agenda. International Journal of Research in Marketing, 34(1), 22–45.
Kotziats, D., Denil, M., De Freitas, N., & Smyth, P. (2015). From group to individual labels using deep features. KDD (pp. 1–10).
Kübler, R. V., Colicev, A., & Pauwels, K. (2017). Social media's mindset: When to use which sentiment extraction tool? Marketing Science Institute working paper series, 17(122), 1–99.
Kuhnen, C. M., & Niessen, A. (2012). Public opinion and executive compensation. Management Science, 58(7), 1249–1272.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
Lee, D., Hosanagar, K., & Nair, H. (2018). Advertising content and consumer engagement on social media: Evidence from Facebook. Management Science, 1–27 (forthcoming).
Markets and Markets (2018). Text analytics market by component. Accessed September 21, 2018, from https://www.marketsandmarkets.com/PressReleases/text-analytics.asp
McAuley, J., Targett, C., Shi, Q., & Van Den Hengel, A. (2015). Image-based recommendations on styles and substitutes. Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval.
McKinsey Global Institute (2017). Artificial intelligence: The next digital frontier?, 1–80.
Melville, P., Gryc, W., & Lawrence, R. D. (2009). Sentiment analysis of blogs by combining lexical knowledge with text classification. Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining.
Mohammad, S. M., & Turney, P. D. (2010). Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. Proceedings of the NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text.
Moraes, R., Valiati, J. F., & Neto, W. P. G. (2013). Document-level sentiment classification: An empirical comparison between SVM and ANN. Expert Systems with Applications, 40(2), 621–633.
Neethu, M. S., & Rajasree, R. (2013). Sentiment analysis in Twitter using machine learning techniques. Computing, communications and networking technologies (ICCCNT) (pp. 1–5).
Netzer, O., Feldman, R., Goldenberg, J., & Fresko, M. (2012). Mine your own business: Market-structure surveillance through text mining. Marketing Science, 31(3), 521–543.
Netzer, O., Lemaire, A., & Herzenstein, M. (2016). When words sweat: Identifying signals for loan default in the text of loan applications. (Working paper).
Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 841–848.
Nielsen, F. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Proceedings of the ESWC2011 workshop on "Making sense of microposts": Big things come in small packages.
Ordenes, F. V., Grewal, D., Ludwig, S., Ruyter, K. D., Mahr, D., Wetzels, M., & Kopalle, P. (2018). Cutting through content clutter: How speech and image acts drive consumer sharing of social media brand messages. Journal of Consumer Research, 1–65 (forthcoming).
Ordenes, F. V., Ludwig, S., Grewal, D., & Wetzels, M. (2017). Unveiling what is written in the stars: Analyzing explicit, implicit, and discourse patterns of sentiment in social media. Journal of Consumer Research, 43(6), 875–894.
Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. Proceedings of the 42nd annual meeting of the Association for Computational Linguistics.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the ACL-02 conference on empirical methods in natural language processing, 10 (pp. 79–86).
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The development and psychometric properties of LIWC2015. Austin, TX: University of Texas at Austin.
Puranam, D., Narayan, V., & Kadiyali, V. (2017). The effect of calorie posting regulation on consumer opinion: A flexible latent Dirichlet allocation model with informative priors. Marketing Science, 36(5), 726–746.
Scholkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
Schweidel, D. A., & Moe, W. W. (2014). Listening in on social media: A joint model of sentiment and venue format choice. Journal of Marketing Research, 51(4), 387–402.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Tirunillai, S., & Tellis, G. J. (2012). Does chatter really matter? Dynamics of user-generated content and stock performance. Marketing Science, 31(2), 198–215.
Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341–1390.
Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., ... Zhou, Z. H. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37.
Xia, R., Zong, C., & Li, S. (2011). Ensemble of feature sets and classification algorithms for sentiment classification. Information Sciences, 181(6), 1138–1152.
Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1–2), 69–90.
Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval.
Yang, Z., Nie, X., Xu, W., & Guo, J. (2006). An approach to spam detection by naive Bayes ensemble based on decision induction. Intelligent systems design and applications, ISDA'06, Sixth international conference, 2 (pp. 861–866).
Ye, J., Chow, J. H., Chen, J., & Zheng, Z. (2009). Stochastic gradient boosted distributed decision trees. Proceedings of the 18th ACM conference on information and knowledge management.
Yoganarasimhan, H. (2018). Search personalization using machine learning. Management Science, 1–52 (forthcoming).
Zhang, Y., Moe, W. W., & Schweidel, D. A. (2017). Modeling the role of message content and influencers in social media rebroadcasting. International Journal of Research in Marketing, 34(1), 100–119.
... This also requires an accurate initial content analysis by humans. Whereas a large number of studies compare the performance of different kinds of AI (e.g., Lorena et al., 2011;Hartmann et al., 2019), different configurations of parameters have rarely been investigated (e.g., Probst et al., 2019). These hyperparameters have to be chosen before the learning process of AI begins; they are normally not optimized during the learning process (Probst et al., 2019). ...
... Furthermore, most performance studies do not analyze how accurately AI interprets the texts of students for learning purposes. Previous studies analyze textual data such as product reviews on Amazon, social media comments on Facebook or user generated content on Twitter (Hartmann et al., 2019;Saura et al., 2022). As a consequence, there is a clear research gap as there is no empirical evidence how well AI can be used for the analysis of textual data generated in educational settings. ...
Article
Full-text available
Learning analytics represent a promising approach for fostering personalized learning processes. Most applications of this technology currently do not use textual data for providing information on learning, or for deriving recommendations for further development. This paper presents the results of three studies aiming to make textual information usable. In the first study, the iota concept is introduced as a new content analysis measure to evaluate inter-coder reliability. The main advantage of this new concept is that it provides a reliability estimation for every single category, allowing deeper insight into the quality of textual analysis. The second study simulates the process of content analysis, comparing the new iota concept with well-established measures (e.g., Krippendorff's Alpha, percentage agreement). The results show that the new concept covers the true reliability of a coding scheme, and is not affected by the number of coders or categories, the sample size, or the distribution of data. Furthermore, cutoff values are derived for judging the quality of the analysis. The third study employs the new concept, as it analyzes the performance of different artificial intelligence (AI) approaches for interpreting textual data based on 90 different constructs. The texts used here were either created by apprentices, students, and pupils, or were taken from vocational textbooks. The paper shows that AI can reliably interpret textual information for learning purposes, and also provides recommendations for optimal AI configuration.
... Studies [33][34] presented a feature model for languages workbenches and classified them using the model in the Language Workbench Challenge 2013. The authors supplied 10 workbenches with a consistent challenge (i.e., a DSL for surveys). ...
... LIWC also allows researchers to create custom dictionaries to measure other constructs (Humphreys & Wang, 2018). And although scholars have found more precise ways to measure some constructs like sentiment (Hartmann et al., 2019), LIWC remains a good place to start (www. liwc. ...
Article
Full-text available
Language is an integral part of marketing. Consumers share word of mouth, salespeople pitch services, and advertisements try to persuade. Further, small differences in wording can have a big impact. But while it is clear that language is both frequent and important, how can we extract insight from this new form of data? This paper provides an introduction to the main approaches to automated textual analysis and how researchers can use them to extract marketing insight. We provide a brief summary of dictionaries, topic modeling, and embeddings, some examples of how each approach can be used, and some advantages and limitations inherent to each method. Further, we outline how these approaches can be used both in empirical analysis of field data as well as experiments. Finally, an appendix provides links to relevant tools and readings to help interested readers learn more. By introducing more researchers to these valuable and accessible tools, we hope to encourage their adoption in a wide variety of areas of research.
... Concerning suspicious includes any content that offends religious sensibilities; stokes antigovernment sentiments; incites terrorism; encourages illegal activities such as phishing, SMS, and pharming; or instigates a community without a legitimate reason [8][9][10][11]. As examples, social media was employed as a means of communication in the Boston Marathon bombing and during the Egyptian revolution [12]. The questionable content can be delivered in a variety of formats, including video, audio, pictures, graphics, and plain text. ...
Article
Full-text available
The concentration of this paper is on detecting trolls among reviewers and users in online discussions and link distribution on social news aggregators such as Reddit. Trolls, a subset of suspicious reviewers, have been the focus of our attention. A troll reviewer is distinguished from an ordinary reviewer by the use of sentiment analysis and deep learning techniques to identify the sentiment of their troll posts. Machine learning and lexicon-based approaches can also be used for sentiment analysis. The novelty of the proposed system is that it applies a convolutional neural network integrated with a bidirectional long short-term memory (CNN–BiLSTM) model to detect troll reviewers in online discussions using a standard troll online reviewer dataset collected from the Reddit social media platform. Two experiments were carried out in our work: the first one was based on text data (sentiment analysis), and the second one was based on numerical data (10 attributes) extracted from the dataset. The CNN-BiLSTM model achieved 97% accuracy using text data and 100% accuracy using numerical data. While analyzing the results of our model, we observed that it provided better results than the compared methods.
... Thus, marketing research and application have great opportunities for deploying innovative and advanced state-of-the-art methods that may generate superior insights that are applied effectively in many other subject domains (Hartmann et al. 2019;Salminen et al. 2019). From an academic perspective, marketing as a research domain has been largely dependent on legacy methods that are simply incapable of dealing with the volume and complexity of data toward generating meaningful insights (Liu, Burns, and Hou 2017;Mustak et al. 2021). ...
Article
Full-text available
Artificial intelligence, particularly machine learning, carries high potential to automatically detect customers’ pain points, which is a particular concern the customer expresses that the company can address. However, unstructured data scattered across social media make detection a nontrivial task. Thus, to help firms gain deeper insights into customers’ pain points, the authors experiment with and evaluate the performance of various machine learning models to automatically detect pain points and pain point types for enhanced customer insights. The data consist of 4.2 million user-generated tweets targeting 20 global brands from five separate industries. Among the models they train, neural networks show the best performance at overall pain point detection, with an accuracy of 85% (F1 score = .80). The best model for detecting five specific pain points was RoBERTa 100 samples using SYNONYM augmentation. This study adds another foundational building block of machine learning research in marketing academia through the application and comparative evaluation of machine learning models for natural language–based content identification and classification. In addition, the authors suggest that firms use pain point profiling, a technique for applying subclasses to the identified pain point messages to gain a deeper understanding of their customers’ concerns.
... Random Forest builds multiple uncorrelated decision trees, where each tree represents a vote toward the final class predictions (Breiman 2001). It can deal with complex interactions between attributes, perform feature selection automatically, and non-linear model data (Hartmann et al. 2019). ...
Article
Understanding consumer emotions arising from robot-customers encounters and shared through online reviews is critical for forecasting consumers’ intention to adopt service robots. Qualitative analysis has the advantage of generating rich insights from data, but it requires intensive manual work. Scholars have emphasized the benefits of using algorithms for recognizing and differentiating among emotions. This study critically addresses the advantages and disadvantages of qualitative analysis and machine learning methods by adopting a hybrid machine-human intelligence approach. We extracted a sample of 9707 customers reviews from two major social media platforms (Ctrip and TripAdvisor), encompassing 412 hotels in 8 countries. The results show that the customer experience with service robots is overwhelmingly positive, revealing that interacting with robots triggers emotions of joy, love, surprise, interest, and excitement. Discontent is mainly expressed when customers cannot use service robots due to malfunctioning. Service robots trigger more emotions when they move. The findings further reveal the potential moderation effect of culture on customer emotional reactions to service robots. The study highlights that the hybrid approach can take advantage of the scalability and efficiency of machine learning algorithms while overcoming its shortcomings, such as poor interpretative capacity and limited emotion categories.
... Works in Tuv et al. (2009) address the problem of defining the appropriate number of features to be selected. The choice of the best set of features is a key factor for successful and effective text classification (Hartmann et al., 2019). In general, redundant and irrelevant features cannot improve the performance of the learning model rather they lead to additional mistakes in the learning process of the model. ...
Article
Full-text available
Text classification is the process of categorizing documents based on their content into a predefined set of categories. Text classification algorithms typically represent documents as collections of words and it deals with a large number of features. The selection of appropriate features becomes important when the initial feature set is quite large. In this paper, we present a hybrid of document frequency (DF) and genetic algorithm (GA)-based feature selection method for Amharic text classification. We evaluate this feature selection method on Amharic news documents obtained from the Ethiopian News Agency (ENA). The number of categories used in this study is 13. Our experimental results showed that the proposed feature selection method outperformed other feature selection methods utilized for Amharic news document classification. Combining the proposed feature selection method with Extra Tree Classifier (ETC) improves classification accuracy. It improves classification accuracy up to 1% higher than the hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), 2.47% greater than GA and 3.86% greater than a hybrid of DF, IG, and CHI.
Chapter
Smartphones have become indispensable part of day-to-day human life. These devices provide rapid access to digital calendars enabling users to schedule their personal and professional activities with short titles referred as event titles. Event titles provide valuable information for personalization of various services. However, very nature of the event titles to be short with only few words, pose a challenge to identify language and exact event the user is scheduling. Deployment of robust machine learning pipelines that can continuously learn from data on the server side is not feasible as the event titles represent private user data and raise significant concerns. To tackle this challenge, we propose a privacy preserving on-device solution namely Calendar Event Classifier (CEC) to classify calendar titles into a set of 22 event types grouped into 3 categories using the fastText library. Our language detection models with accuracies of 96%, outperform existing language detection tools by 20% and our event classifiers achieved 92%, 94%, 87% and 90% accuracies across, English, Korean and German, French respectively. Currently tested CEC module architecture delivers the fastest (4 ms/event) predictions with <8 MB memory footprint and cater multiple personalization services. Taken together, we present the need for customization of machine learning models for language detection and information extraction from extremely short text documents such as calendar titles.
Article
Full-text available
The amount of digital text available for analysis by consumer researchers has risen dramatically. Consumer discussions on the internet, product reviews, and digital archives of news articles and press releases are just a few potential sources for insights about consumer attitudes, interaction, and culture. Drawing from linguistic theory and methods, this article presents an overview of automated text analysis, providing integration of linguistic theory with constructs commonly used in consumer research, guidance for choosing amongst methods, and advice for resolving sampling and statistical issues unique to text analysis. We argue that although automated text analysis cannot be used to study all phenomena, it is a useful tool for examining patterns in text that neither researchers nor consumers can detect unaided. Text analysis can be used to examine psychological and sociological constructs in consumerproduced digital text by enabling discovery or by providing ecological validity. © The Author 2017. Published by Oxford University Press on behalf of Journal of Consumer Research, Inc. All rights reserved.
Article
We describe the effect of social media advertising content on customer engagement using data from Facebook. We content-code 106,316 Facebook messages across 782 companies, using a combination of Amazon Mechanical Turk and natural language processing algorithms. We use this data set to study the association of various kinds of social media marketing content with user engagement—defined as Likes, comments, shares, and click-throughs—with the messages. We find that inclusion of widely used content related to brand personality—like humor and emotion—is associated with higher levels of consumer engagement (Likes, comments, shares) with a message. We find that directly informative content—like mentions of price and deals—is associated with lower levels of engagement when included in messages in isolation, but higher engagement levels when provided in combination with brand personality–related attributes. Also, certain directly informative content, such as deals and promotions, drive consumers’ path to conversion (click-throughs). These results persist after incorporating corrections for the nonrandom targeting of Facebook’s EdgeRank (News Feed) algorithm and so reflect more closely user reaction to content than Facebook’s behavioral targeting. Our results suggest that there are benefits to content engineering that combines informative characteristics that help in obtaining immediate leads (via improved click-throughs) with brand personality–related content that helps in maintaining future reach and branding on the social media site (via improved engagement). These results inform content design strategies. Separately, the methodology we apply to content-code text is useful for future studies utilizing unstructured data such as advertising content or product reviews.
Article
Sentiment analysis, or opinion mining, is one of the major tasks of natural language processing (NLP) and has gained much attention in recent years. In this paper, we aim to tackle the problem of sentiment polarity categorization, one of the fundamental problems of sentiment analysis. A general process for sentiment polarity categorization is proposed with detailed process descriptions. The data used in this study are online product reviews collected from Amazon.com. Experiments for both sentence-level and review-level categorization are performed with promising outcomes. Finally, we give insight into our future work on sentiment analysis.
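A minimal sketch of a review-level polarity classifier of the kind discussed above, assuming lists of review texts and polarity labels; the TF-IDF plus naive Bayes pipeline and the toy data are illustrative, not the cited paper's exact procedure.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder reviews and labels; in practice these would be thousands of labelled reviews.
reviews = ["great battery life, works perfectly",
           "arrived broken and support never replied",
           "does exactly what the description promises"]
labels = ["positive", "negative", "positive"]

# Vectorize with word uni- and bigrams, then fit a multinomial naive Bayes classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(reviews, labels)

print(clf.predict(["totally worth the price"]))  # e.g. ['positive']
```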
Article
Social media firestorms imply the sudden occurrence of many, predominantly negative social media expressions against a brand. Do such firestorms leave a mark on consumers and their brand judgments—in the short term but also over time—to a degree that deserves managerial attention? What kind of firestorms have the strongest destructive potential? This manuscript treats firestorms as a digital form of brand crisis and proposes a conceptual framework to identify which firestorms harm short- and long-term brand perceptions and become part of consumers' long-term memory. A unique data set combines secondary data about 78 real-life firestorms with daily brand perceptions obtained from the YouGov panel and survey data from 997 consumers. The results indicate that of all affected brands, 58% suffer from a decrease in short-term brand perceptions, and 40% suffer long-term negative effects, suggesting that social media firestorms can indeed harm businesses but also show that strong variations exist. Contingency analyses of the conceptual framework with regressions and generalized estimating equations indicate that social media firestorms are most impactful in terms of negative brand association changes and/or memory effects when they are initiated by a vivid trigger (e.g., video in the first firestorm tweet), linked to a product/service or social failure, characterized by a large volume of social media messages, and when they last longer.
Article
Consumer-to-consumer brand message sharing is pivotal for effective social media marketing. Even as companies join social media conversations and generate millions of brand messages, it remains unclear what, how, and when brand messages stand out and prompt sharing by consumers. With a conceptual extension of speech act theory, this study offers a granular assessment of brands’ message intentions (i.e., assertive, expressive, or directive) and the effects on consumer sharing. A text mining study of more than two years of Facebook posts and Twitter tweets by well-known consumer brands empirically demonstrates the impacts of distinct message intentions on consumers’ message sharing. Specifically, the use of rhetorical styles (alliteration and repetitions) and cross-message compositions enhance consumer message sharing. As a further extension, an image-based study demonstrates that the presence of visuals, or so-called image acts, increases the ability to account for message sharing. The findings explicate brand message sharing by consumers and thus offer guidance to content managers for developing more effective conversational strategies in social media marketing.
Article
In 2008, New York City mandated that all chain restaurants post calorie information on their menus. For managers of chain and standalone restaurants, as well as for policy makers, a pertinent goal might be to monitor the impact of this regulation on consumer conversations. We propose a scalable Bayesian topic model to measure and understand changes in consumer opinion about health (and other topics). We calibrate the model on 761,962 online reviews of restaurants posted over eight years. Our model allows managers to specify prior topics of interest, such as “health” for a calorie posting regulation. It also allows the distribution of topic proportions within a review to be affected by its length, valence, and the experience level of its author. Using a difference-in-differences estimation approach, we isolate the potentially causal effect of the regulation on consumer opinion. Following the regulation, there was a small but statistically significant increase in the proportion of discussion of the health topic. This increase can be attributed largely to authors who did not post reviews before the regulation, suggesting that the regulation prompted several consumers to discuss health in online restaurant reviews.
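A minimal sketch of the difference-in-differences logic applied downstream of such a topic model, assuming a review-level data frame with an estimated health-topic share; the variable names and toy numbers are hypothetical, not the cited paper's actual specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Each row is one review: estimated share of the "health" topic, whether the restaurant
# is a regulated chain (treated), and whether the review was posted after the regulation.
df = pd.DataFrame({
    "health_share": [0.02, 0.03, 0.02, 0.05, 0.02, 0.03, 0.02, 0.06],
    "treated":      [0, 0, 1, 1, 0, 0, 1, 1],
    "post":         [0, 1, 0, 1, 0, 1, 0, 1],
})

# The coefficient on treated:post is the difference-in-differences estimate of the
# regulation's effect on health-topic discussion.
did = smf.ols("health_share ~ treated + post + treated:post", data=df).fit()
print(did.params)
```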
Article
Query-based search is commonly used by many businesses to help consumers find information/products on their websites. Examples include search engines (Google, Bing), online retailers (Amazon, Macy's), and entertainment sites (Hulu, YouTube). Nevertheless, a significant portion of search sessions are unsuccessful, i.e., they do not provide the information the user was looking for. We present a machine learning framework that improves the quality of search results through automated personalization based on a user's search history. Our framework consists of three modules: (a) feature generation, (b) an NDCG-based LambdaMART algorithm, and (c) a feature selection wrapper. We estimate our framework on large-scale data from a leading search engine using Amazon EC2 servers. We show that our framework offers a significant improvement in search quality compared to non-personalized results. We also show that the returns to personalization are monotonically, but concavely, increasing with the length of user history. Next, we find that personalization based on short-term history or "within-session" behavior is less valuable than long-term or "across-session" personalization. We also derive the value of different feature sets: user-specific features contribute over 50% of the improvement and click-specific features over 28%. Finally, we demonstrate scalability to big data and derive the set of optimal features that maximize accuracy while minimizing computing time.
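A minimal sketch of an NDCG-oriented LambdaMART ranker, using LightGBM's lambdarank objective as a stand-in for the framework described above; the random data, group sizes, and hyperparameters are illustrative assumptions, not the cited paper's implementation.

```python
import numpy as np
import lightgbm as lgb

# X: one row per (query, result) candidate with personalization features;
# y: graded relevance derived from clicks (0-2); groups: results per query, in row order.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.integers(0, 3, size=100)
groups = [10] * 10  # ten queries with ten candidate results each

# LambdaMART = gradient-boosted trees trained with the lambdarank objective on NDCG.
ranker = lgb.LGBMRanker(objective="lambdarank", metric="ndcg", n_estimators=50)
ranker.fit(X, y, group=groups)

# Re-rank one query's candidates by predicted relevance (highest score first).
scores = ranker.predict(X[:10])
reranked = np.argsort(-scores)
```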