Measuring Emotions in the COVID-19 Real World Worry Dataset

Abstract

The COVID-19 pandemic is having a dramatic impact on societies and economies around the world. With various measures of lockdowns and social distancing in place, it becomes important to understand emotional responses on a large scale. In this paper, we present the first ground truth dataset of emotional responses to COVID-19. We asked participants to indicate their emotions and express these in text and created the Real World Worry Dataset of 5,000 texts (2,500 short + 2,500 long texts). Our analyses suggest that emotional responses correlated with linguistic measures. Topic modeling further revealed that people in the UK worry about their family and the economic situation. Tweet-sized texts functioned as a call for solidarity, while longer texts shed light on worries and concerns. Using predictive modeling approaches, we were able to approximate the emotional responses of participants from text within 14% of their actual value. We encourage others to use the dataset and improve how we can use automated methods to learn about emotional responses and worries about an urgent problem.
Bennett Kleinberg (1,2), Isabelle van der Vegt (1), Maximilian Mozes (1,2,3)
(1) Department of Security and Crime Science
(2) Dawes Centre for Future Crime
(3) Department of Computer Science
University College London
{bennett.kleinberg, isabelle.vandervegt, maximilian.mozes}
1 Introduction
The outbreak of the SARS-CoV-2 virus in late 2019 and the subsequent spread of the COVID-19 disease have affected the world on an enormous scale. While hospitals are at the forefront of trying to mitigate the life-threatening consequences of the disease, practically all levels of society are dealing directly or indirectly with an unprecedented situation. At the time of writing, most countries are in various stages of a lockdown. Schools and universities are closed or operate online only, and only essential shops are kept open.
At the same time, lockdown measures such as social distancing (e.g., keeping a distance of at least 1.5 meters from one another and only socializing with two people at most) might have a direct impact on people’s mental health. With an uncertain outlook on the development of the COVID-19 situation and its preventative measures, it is of vital importance to understand how governments, NGOs, and social organizations can help those who are most affected by the situation. That implies, as a first step, understanding the emotions, worries, and concerns that people have and the coping strategies they may use. Since a majority of online communication is recorded in the form of text data, measuring the emotions and worries around COVID-19 will be a central part of understanding and addressing the impacts of the COVID-19 situation on people. This is where computational linguistics can play a crucial role.
In this paper, we present and make publicly available a high-quality, ground truth text dataset of emotions and worries around COVID-19. We report initial findings on linguistic correlates of emotions, topic models, and prediction experiments.
1.1 Ground truth emotions datasets
Tasks like emotion detection (Seyeditabari et al., 2018) and sentiment analysis (Liu, 2015) typically rely on labeled data in one of two forms. Either a corpus is annotated at the document level (i.e., judging a document as positive, neutral, or negative), or individual n-grams are judged on their polarity (i.e., assigning a score to each n-gram on a word list). These annotations are done (semi-)automatically (e.g., by exploiting emotion-word hashtags) (Mohammad and Kiritchenko, 2015) or manually by third persons (Mohammad and Turney, 2010). While these approaches are common practice and have accelerated progress in the field, they are limited in that they propagate a pseudo ground truth. This is problematic because, as we argue, the core aim of emotion detection is to make an inference about the author’s emotional or cognitive state. The text, as the product of an emotional state, functions as a proxy for the latter. For example, rather than wanting to know whether a Tweet is written in a pessimistic tone, we are interested in learning whether the author of the text actually felt pessimistic.
arXiv:2004.04225v1 [cs.CL] 8 Apr 2020
The limitation inherent to third-person annotations, then, is that they might not be adequate measurements of the emotional state of interest. The solution, albeit a costly one, lies in ground truth datasets. Whereas real ground truth would require, in its strictest sense, a random assignment of people to experimental conditions (e.g., a group that is given a positive product experience vs. the opposite), variations that rely on self-reported emotions can also mitigate the problem. One dataset that relies on self-reports is the International Survey on Emotion Antecedents and Reactions (ISEAR), which asked participants to recall from memory situations that evoked a set of emotions. The COVID-19 situation is unique and calls for datasets that capture people’s affective responses to it while it is happening.
1.2 Current COVID-19 datasets
Several datasets mapping how the public responds to the pandemic have been made available. For example, tweets relating to the Coronavirus have been collected since March 11, 2020, yielding about 4.4 million tweets a day (Banda et al., 2020). Tweets were collected through the Twitter stream API, using keywords such as ’coronavirus’ and ’COVID-19’. Another Twitter dataset of Coronavirus tweets has been collected since January 22, 2020, in several languages, including English, Spanish, and Indonesian (Chen et al., 2020). Further efforts include the ongoing Pandemic Project, which has people write about the effect of the coronavirus outbreak on their everyday lives.
1.3 The COVID-19 Real World Worry Dataset
This paper reports initial findings for the Real World Worry Dataset (RWWD), which captured the emotions and worries of UK residents at a point in time when the COVID-19 situation affected the lives of all individuals in the UK. The data were collected on the 6th and 7th of April 2020, a time at which the UK was under lockdown (ITV news, 2020) and death tolls were increasing. On April 6, 5,373 people in the UK had died of the virus, and 51,608 had tested positive (Walker et al., 2020). On the day before data collection, the Queen addressed the nation via a television broadcast (The Guardian, 2020). Furthermore, it was also announced that Prime Minister Boris Johnson had been admitted to intensive care in a hospital for COVID-19 symptoms (Lyons, 2020).
The RWWD is a ground truth dataset that used a direct survey method and obtained written accounts of people alongside data on their emotions and worries. As such, the dataset does not rely on third-person annotation but on direct self-reported emotions. We present two versions of the RWWD, each consisting of 2,500 English texts representing the participants’ genuine worries about the Corona situation in the UK. The Long RWWD consists of texts that were open-ended in length and asked the participants to express their feelings as they wished. The Short RWWD asked the same people to also express their feelings in Tweet-sized texts. The latter was chosen to facilitate the use of this dataset for Twitter data research.
The dataset is publicly available.
2 Data
We collected the data of 2,500 participants (94.46% native English speakers) via the crowdsourcing platform Prolific. Every participant provided consent in line with the local IRB. The sample requirements were that the participants were resident in the UK and Twitter users. In the task, all participants were asked to indicate how they felt about the current COVID-19 situation using 9-point scales (1 = not at all, 5 = moderately, 9 = very much). Specifically, each participant rated how worried they were about the Corona/COVID-19 situation and how much anger, anxiety, desire, disgust, fear, happiness, relaxation, and sadness (Harmon-Jones et al., 2016) they felt about their situation at this moment. They also had to choose which of the eight emotions (excluding worry) best represented their feeling at this moment.
All participants were then asked to write two texts. First, we instructed them to “write in a few sentences how you feel about the Corona situation at this very moment. This text should express your feelings at this moment” (min. 500 characters). The second part asked them to express their feelings in Tweet form (max. 240 characters). Finally, the participants indicated how well they felt they could express their feelings (in general/in the long text/in the Tweet-length text), how often they used Twitter, and whether English was their native language. The overall corpus size of the dataset was 2,500 long texts (320,372 tokens) and 2,500 short texts (69,171 tokens).
2.1 Excerpts
Below are two excerpts from the dataset:
Long text: I am 6 months pregnant, so I feel worried about the impact that getting the virus would have on me and the baby. My husband also has asthma so that is a concern too. I am worried about the impact that the lockdown will have on my ability to access the healthcare I will need when having the baby, and also about the exposure to the virus [...] There is just so much uncertainty about the future and what the coming weeks and months will hold for me and the people I care about.
Tweet-sized text: Proud of our NHS and keyworkers who are working on the frontline at the moment. I’m optimistic about the future, IF unite as a country, by social distancing and stay in.
2.2 Descriptive statistics
We excluded nine participants who padded the long text with punctuation or excessive letter repetitions. The dominant feelings of participants were anxiety/worry, sadness, and fear (see Table 1). For all emotions, the participants’ self-ratings ranged across the whole spectrum (from “not at all” to “very much”). The participants’ self-reported ability to express their feelings in the long text (M = 7.12, SD = 1.78) was larger than that for short texts (M = 5.91, SD = 2.12), Bayes factor > 1e+96.
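The exclusion rule above (texts padded with punctuation or excessive letter repetitions) could be screened for with a simple heuristic. The following is only a sketch under stated assumptions: the function name `looks_padded` and both thresholds are hypothetical, and the paper does not describe how padded texts were actually identified.

```python
import re

def looks_padded(text, max_run=4, max_punct_ratio=0.3):
    """Heuristically flag texts padded with repeated characters or punctuation.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # The same non-space character repeated more than max_run times in a row,
    # e.g. "baaaaaad" or "........"
    has_long_run = re.search(r"(\S)\1{%d,}" % max_run, text) is not None
    non_space = [ch for ch in text if not ch.isspace()]
    punct = [ch for ch in non_space if not ch.isalnum()]
    # Share of non-space characters that are neither letters nor digits
    punct_ratio = len(punct) / len(non_space) if non_space else 1.0
    return has_long_run or punct_ratio > max_punct_ratio
```

A text flagged this way would still need manual inspection, since legitimate texts can contain ellipses or emphatic spellings.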
Variable                  Mean      SD
Corpus descriptives
Tokens (long text)        127.75    39.67
Tokens (short text)       27.70     15.98
Chars. (long text)        632.54    197.75
Chars. (short text)       137.21    78.40

Worry                     6.55^a    1.76
Anger^1 (4.33%)           3.91^b    2.24
Anxiety (55.36%)          6.49^a    2.28
Desire (1.09%)            2.97^b    2.04
Disgust (0.69%)           3.23^b    2.13
Fear (9.22%)              5.67^a    2.27
Happiness (1.58%)         3.62^b    1.89
Relaxation (13.38%)       3.95^b    2.13
Sadness (14.36%)          5.59^a    2.31
Table 1: Descriptive statistics of text data and emotion ratings. ^1 Brackets indicate how often the emotion was chosen as the best fit for the current feeling about COVID-19. ^a The value is larger than the neutral midpoint with Bayes factors > 1e+32. ^b The value is smaller than the neutral midpoint with BF > 1e+115.
3 Findings and experiments
3.1 Correlations of emotions with LIWC
We correlated the self-reported emotions with matching categories of the LIWC2015 lexicon (Pennebaker et al., 2015). The overall matching rate was high (92.36% and 90.11% for short and long texts, respectively). Across all correlations, we see that the extent to which the linguistic variables explain variance in the emotion ratings (indicated by the R-squared) is larger in long texts than in Tweet-sized short texts (see Table 2). There are significant positive correlations for all affective LIWC variables with their corresponding self-reported emotions (i.e., higher LIWC scores accompanied higher emotion scores, and vice versa). These correlations imply that the linguistic variables explain up to 10% and 3% of the variance in the emotion ratings for long and short texts, respectively.
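The quantities reported for each correlate (Pearson’s r, a confidence interval, and the explained variance as the squared correlation) can be sketched as follows. The Fisher z-transformation for the interval and all function names are assumptions made for illustration; the text does not state how the intervals were obtained.

```python
import math
from statistics import NormalDist, mean

def pearson_r(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, level=0.99):
    """Confidence interval for r via the Fisher z-transformation (assumed method).

    Requires -1 < r < 1 and n > 3.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

def r_squared_percent(r):
    """Share of variance explained by a single linear correlate, in percent."""
    return 100 * r ** 2
```

For a single predictor, the explained variance is simply the square of r, which is why a correlation of about 0.3 corresponds to roughly 10% of the variance.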
The LIWC also contains categories intended to capture people’s concerns, which we correlated with the self-reported worry. Positive (negative) correlations would suggest that the higher the worry score of the participants, the larger (smaller) their score on the respective LIWC category. We found no correlation between worry and the categories “work”, “money”, and “death”, suggesting that worry was not associated with these categories. Significant positive correlations emerged for long texts for “family” and “friend”: the more people were worried, the more they spoke about family and, to a lesser degree, friends.
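To illustrate what a word-list category score is, the sketch below computes the percentage of tokens in a text that fall into a category. The word lists are hypothetical stand-ins (the LIWC2015 dictionary is proprietary and is not reproduced here), and the tokenizer is a deliberately simple assumption.

```python
import re

# Hypothetical mini word lists for illustration; NOT the LIWC2015 dictionary.
CATEGORIES = {
    "family": {"family", "husband", "wife", "baby", "children", "mum", "dad"},
    "friend": {"friend", "friends", "mate", "mates"},
}

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def category_scores(text, categories=CATEGORIES):
    """Percentage of tokens belonging to each category's word list."""
    tokens = tokenize(text)
    if not tokens:
        return {name: 0.0 for name in categories}
    return {
        name: 100 * sum(tok in words for tok in tokens) / len(tokens)
        for name, words in categories.items()
    }
```

Scores of this kind, one per text and category, are what would then be correlated with the self-reported worry ratings.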
3.2 Topic models of people’s worries
We constructed topic models for the long and short texts separately using the stm package in R (Roberts et al., 2014a). The text data were lowercased; punctuation, stopwords, and numbers were removed; and all words were stemmed. For the long texts, we chose a topic model with 20 topics as determined by semantic coherence and exclusivity values for the model (Mimno et al., 2011; Roberts et al., 2014b,a). Table 3 shows the five most prevalent topics with ten associated frequent terms for each topic (see online supplement for all 20 topics). The most prevalent topic seems to relate to following the rules related to the lockdown. In contrast, the second most prevalent topic appears to relate to worries about employment and the economy. For the Tweet-sized texts, we selected a model with 15 topics. The most common topic bears a resemblance to the government slogan “Stay at home, protect the NHS, save lives.” The second most prevalent topic seems to relate to calls for others to adhere to social distancing rules.
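The preprocessing steps named above (lowercasing, removing punctuation, stopwords, and numbers, then stemming) can be sketched in a few lines. This is not the stm pipeline itself: the stopword list is a tiny illustrative sample, and `crude_stem` is a hypothetical stand-in for a proper stemmer such as Porter’s.

```python
import re

# Tiny illustrative stopword list; a real analysis would use a full list.
STOPWORDS = {"a", "about", "am", "and", "i", "is", "my", "of", "the", "to"}
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Very rough suffix stripper, standing in for a real stemming algorithm."""
    # Turn a final "y" after a consonant into "i" (family -> famili)
    if len(word) > 3 and word.endswith("y") and word[-2] not in "aeiou":
        word = word[:-1] + "i"
    # Strip the first matching suffix, keeping at least three characters
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop punctuation and digits, remove stopwords, then stem."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # removes punctuation and numbers
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return [crude_stem(tok) for tok in tokens]
```

Stemming is the reason the terms in Table 3 appear in truncated forms such as “worri” and “famili”.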
3.3 Predicting emotions about COVID-19
We used linear regression models to predict the reported emotional values (i.e., anxiety, fear, sadness, worry) based on text properties. Specifically, we applied regularised ridge regression models using TFIDF and part-of-speech (POS) features extracted from long and short texts separately. TFIDF features were computed based on the 1,000 most frequent words in the vocabulary of each corpus; POS features were extracted using a predefined scheme of 53 POS tags in spaCy.
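As a rough illustration of the TFIDF features, the sketch below restricts the vocabulary to the most frequent words and weights relative term frequency by the log inverse document frequency. The exact weighting variant used in the experiments is not stated, so the formula, the function name, and the default of 1,000 terms as a parameter are assumptions.

```python
import math
from collections import Counter

def tfidf_features(docs, vocab_size=1000):
    """Build TF-IDF vectors over the vocab_size most frequent terms.

    docs is a list of token lists. Returns (vocab, matrix) with one row per
    document. Uses tf * log(N / df), one common variant among several.
    """
    term_counts = Counter(tok for doc in docs for tok in doc)
    vocab = [term for term, _ in term_counts.most_common(vocab_size)]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocab}
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        matrix.append([counts[term] / total * idf[term] for term in vocab])
    return vocab, matrix
```

Under this variant, a term that occurs in every document receives a weight of zero, since it carries no information for telling texts apart.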
We process the resulting feature representations using principal component analysis and assess the performance using the mean absolute error (MAE) and the coefficient of determination (R-squared). Each experiment is conducted using five-fold cross-validation, and the arithmetic means across all five folds are reported as the final performance results.
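The evaluation protocol (MAE and R-squared averaged over five folds) can be sketched without any modeling library. The code below is an illustration under assumptions: folds are contiguous and unshuffled, the ratings are made-up numbers, and a mean-predicting baseline stands in for the ridge regression.

```python
from statistics import mean

def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination; negative when worse than the mean."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds (no shuffling, for simplicity)."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]

def cross_validate(y, fit_predict, k=5):
    """Average MAE and R-squared over k folds.

    fit_predict(train_idx, test_idx) must return predictions for test_idx.
    """
    maes, r2s = [], []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict(train_idx, test_idx)
        truth = [y[i] for i in test_idx]
        maes.append(mae(truth, preds))
        r2s.append(r2(truth, preds))
    return mean(maes), mean(r2s)

# Made-up ratings on a 9-point scale, purely for demonstration
ratings = [7, 5, 9, 6, 8, 4, 7, 6, 5, 8]

def mean_baseline(train_idx, test_idx):
    """Predict the training-set mean for every held-out item."""
    m = mean(ratings[i] for i in train_idx)
    return [m] * len(test_idx)

avg_mae, avg_r2 = cross_validate(ratings, mean_baseline)
```

Because R-squared is computed on held-out data, a model that predicts no better than a constant yields values at or below zero, which is one way to read the slightly negative R-squared values for short texts in Table 4.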
Table 4 shows the performance results in both long and short texts. We observe MAEs ranging between 1.26 (worry with TFIDF) and 1.88 (sadness with POS) for the long texts, and between 1.37 (worry with POS) and 1.91 (sadness with POS) for the short texts. We furthermore observe that the models perform best in predicting the worry scores for both long and short texts. The models explain up to 16% of the variance for the emotional response variables on the long texts, but only up to 1% on Tweet-sized texts.
4 Discussion
This paper introduced the RWWD, a ground truth dataset, as a resource to measure emotional responses to the Corona pandemic. We reported initial findings on the linguistic correlates of emotional states, used topic modeling to understand what people in the UK are concerned about, and ran prediction experiments to infer emotional states from text using machine learning. These analyses provided several core findings: (1) some emotional states correlated with word lists made to measure these constructs, (2) longer texts were more useful for identifying patterns in language that relate to emotions than shorter texts, (3) Tweet-sized texts served as a means to call for solidarity during lockdown measures, while longer texts revealed people’s worries, and (4) preliminary regression experiments indicate that we can infer the emotional responses from the texts with a mean absolute error of 1.26 on a 9-point scale (14%).
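The 14% figure appears to express the best mean absolute error relative to the nine points of the rating scale. This reading is an assumption, since the denominator is not spelled out in the text, but it is the one that reproduces the reported value:

```latex
\[
\frac{\mathrm{MAE}}{\text{number of scale points}} = \frac{1.26}{9} = 0.14 = 14\%
\]
```

Dividing by the scale range instead (9 - 1 = 8) would give about 15.8%, which does not match the reported figure.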
4.1 Linguistic correlates of emotions and worries
Affective reactions to the Coronavirus were obtained through self-reported scores. When we used psycholinguistic word lists that measure these emotions, we found weak, positive correlations. The lexicon approach was best at measuring anger, anxiety, and worry and did so better for longer texts than for Tweet-sized texts. In behavioral and cognitive research, small effects (here: a maximum of 10.63% of explained variance) are the rule rather than the exception (Gelman, 2017; Yarkoni and Westfall, 2017). It is essential, however, to interpret them as such. If 10% of the variance in the anxiety score is explained through a linguistic measurement, 90% is not. An explanation for the imperfect correlations, aside from random measurement error, might lie in the inadequate expression of someone’s felt emotion in the form of written text. The latter is partly corroborated by even smaller effects for shorter texts, which may have been too short to allow for the expression of one’s emotion.
Correlates                  Long texts                       Short texts
Affective processes
Anger - LIWC anger          0.28 [0.23; 0.32] (7.56%)        0.09 [0.04; 0.15] (0.88%)
Sadness - LIWC sad          0.21 [0.16; 0.26] (4.35%)        0.13 [0.07; 0.18] (1.58%)
Anxiety - LIWC anx          0.33 [0.28; 0.37] (10.63%)       0.18 [0.13; 0.23] (3.38%)
Worry - LIWC anx            0.30 [0.26; 0.35] (9.27%)        0.18 [0.13; 0.23] (3.30%)
Happiness - LIWC posemo     0.22 [0.17; 0.26] (4.64%)        0.13 [0.07; 0.18] (1.56%)

Worry - LIWC work           -0.03 [-0.08; 0.02] (0.01%)      -0.03 [-0.08; 0.02] (0.10%)
Worry - LIWC money          0.00 [-0.05; 0.05] (0.00%)       -0.01 [-0.06; 0.04] (0.00%)
Worry - LIWC death          0.05 [-0.01; 0.10] (0.26%)       0.05 [0.00; 0.10] (0.29%)
Worry - LIWC family         0.18 [0.13; 0.23] (3.12%)        0.06 [0.01; 0.11] (0.40%)
Worry - LIWC friend         0.07 [0.01; 0.12] (0.42%)        -0.01 [-0.06; 0.05] (0.00%)
Table 2: Correlations (Pearson’s r, 99% CI, R-squared in %) between LIWC variables and emotions.
Docs     Terms
Long texts
9.52     people, take, think, rule, stay, serious, follow, virus, mani, will
8.35     will, worri, job, long, also, economy, concern, impact, famili, situat
7.59     feel, time, situat, relax, quit, moment, sad, thing, like, also
6.87     feel, will, anxious, know, also, famili, worri, friend, like, sad
5.69     work, home, worri, famili, friend, abl, time, miss, school, children
Short texts
10.70    stay, home, safe, live, pleas, insid, save, protect, nhs, everyone
8.27     people, need, rule, dont, stop, selfish, social, die, distance, spread
7.96     get, can, just, back, wish, normal, listen, lockdown, follow, sooner
7.34     famili, anxious, worri, scare, friend, see, want, miss, concern, covid
6.81     feel, situat, current, anxious, frustrat, help, also, away, may, extrem
Table 3: The five most prevalent topics for long and short texts.
Model               Long                  Short
                    MAE     R-squared     MAE     R-squared
Anxiety - TFIDF     1.65    0.16          1.82    -0.01
Anxiety - POS       1.79    0.04          1.84    0.00
Fear - TFIDF        1.71    0.15          1.85    0.00
Fear - POS          1.83    0.05          1.87    0.01
Sadness - TFIDF     1.75    0.12          1.90    -0.02
Sadness - POS       1.88    0.02          1.91    -0.01
Worry - TFIDF       1.26    0.16          1.38    -0.03
Worry - POS         1.35    0.03          1.37    0.01
Table 4: Results for regression modeling for long and short texts.
4.2 Topics of people’s worries
Prevalent topics in our corpus showed that people worry about their jobs and the economy, as well as their friends and family, the latter of which is also corroborated by the LIWC analysis. For example, people discussed the potential impact of the situation on their family, as well as their children missing school. Participants also discussed the lockdown and social distancing measures. In the Tweet-sized texts, in particular, people encouraged others to stay at home and adhere to lockdown rules in order to slow the spread, save lives and/or protect the NHS. Thus, people used the shorter texts as a means to call for solidarity, while longer texts offered insights into their actual worries.
While there are various ways to select the ideal number of topics, we have relied on assessing semantic coherence of topics and exclusivity of topic words. Since there does not seem to be a consensus as to the best practice for selecting topic numbers, we encourage others to examine other approaches or models with varying numbers of topics.
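For readers unfamiliar with semantic coherence, the sketch below follows the co-document-frequency score of Mimno et al. (2011): for a topic’s top terms, it sums the log of how often each pair of terms co-occurs in documents relative to how often the higher-ranked term occurs at all. The function name and the toy corpus in the usage note are hypothetical, and exclusivity is not covered here.

```python
import math

def semantic_coherence(top_terms, docs):
    """Coherence of one topic, in the style of Mimno et al. (2011).

    top_terms lists the topic's terms, most probable first; docs is a list
    of token lists. Every top term must occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*terms):
        # Number of documents that contain all of the given terms
        return sum(all(term in doc for term in terms) for doc in doc_sets)

    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            co_occurrences = doc_freq(top_terms[m], top_terms[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_terms[l]))
    return score
```

Scores closer to zero indicate that a topic’s top terms tend to appear in the same documents. On a toy corpus such as `[["job", "worri"], ["job", "worri"], ["famili", "friend"]]`, the pair `["job", "worri"]` scores higher than `["job", "famili"]`, whose terms never co-occur.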
4.3 Predicting emotional responses
Prediction experiments reveal that ridge regression models can be used to approximate emotional responses to COVID-19 based on encodings of the textual features extracted from the participants’ statements. Similar to the correlational and topic modeling findings, there is a stark difference between the long and short texts: the regression models are more accurate and explain more variance for longer than for shorter texts. Additional experiments are required to investigate further the expressiveness of the collected textual statements for the prediction of emotional values.
4.4 Suggestions for future research
The current analysis leaves several research questions untouched. First, to mitigate the limitations of lexicon approaches, future work on inferring emotions and worries around COVID-19 could expand on the prediction approach (e.g., using binary classification and different feature sets and models). Carefully validated models could help to provide the basis for large-scale, real-time measurements of emotional responses. Of particular importance is a solution to the problem hinted at in the current paper: the shorter, Tweet-sized texts contained much less information, had a different function, and were less suitable for predictive modeling. With much of today’s stream of text data coming in the form of (very) short messages, it is important to understand the limitations of using that data and worthwhile examining how we can better use that information.
Second, with a lot of research attention paid to readily available Twitter data, we hope that future studies also focus on non-Twitter data to capture emotional responses of those who are underrepresented (or non-represented) on social media but are at heightened risk.
Third, future research may focus on manually annotating topics to precisely map out what people worry about with regard to COVID-19. Several raters could assess frequent terms for each topic and then assign a label. Then, through discussion or majority votes, final topic labels can be assigned to obtain a model of COVID-19 real-world worries.
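The majority-vote step of such an annotation procedure could look like the sketch below. The function name, the topic identifiers, and the example labels are hypothetical; topics without a strict majority are returned as `None` so that they can be resolved through discussion.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by a strict majority of raters, else None.

    votes must be a non-empty list of labels, one per rater.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

# Hypothetical rater labels for two topics
topic_votes = {
    "topic_1": ["lockdown rules", "lockdown rules", "solidarity"],
    "topic_2": ["economy", "employment", "family"],
}
final_labels = {topic: majority_label(v) for topic, v in topic_votes.items()}
```

Here the first topic receives the label “lockdown rules”, while the second has no majority and would be passed on for discussion among the raters.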
5 Conclusions
This paper introduced the first ground truth dataset of textual emotional responses to COVID-19. Our findings highlight the potential of inferring concerns and worries from text data but also show some of the pitfalls, in particular, when using concise texts as data. We encourage the research community to use the dataset so we can better understand the impact of the pandemic on people’s lives.
Acknowledgments
This research was supported by the Dawes Centre for Future Crime at UCL.
References
Juan M. Banda, Ramya Tekumalla, Guanyu Wang, Jingyuan Yu, Tuo Liu, Yuning Ding, and Gerardo Chowell. 2020. A Twitter Dataset of 150+ million tweets related to COVID-19 for open research.
Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. #COVID-19: The First Public Coronavirus Twitter Dataset.
Andrew Gelman. 2017. The piranha problem in social psychology / behavioral economics: The ”take a pill” model of science eats itself. Statistical Modeling, Causal Inference, and Social Science.
The Guardian. 2020. Coronavirus latest: 5 April at a glance. The Guardian.
Cindy Harmon-Jones, Brock Bastian, and Eddie Harmon-Jones. 2016. The Discrete Emotions Questionnaire: A New Tool for Measuring State Self-Reported Emotions. PLOS ONE, 11(8):e0159915.
Bing Liu. 2015. Sentiment analysis: mining opinions, sentiments, and emotions. Cambridge University Press, New York, NY.
Kate Lyons. 2020. Coronavirus latest: at a glance. The Guardian.
David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing Semantic Coherence in Topic Models. page 11.
Saif Mohammad and Peter Turney. 2010. Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 26–34, Los Angeles, CA. Association for Computational Linguistics.
Saif M. Mohammad and Svetlana Kiritchenko. 2015. Using Hashtags to Capture Fine Emotion Categories from Tweets. Computational Intelligence.
ITV news. 2020. Police can issue ’unlimited fines’ to those flouting coronavirus social distancing rules, says Health Secretary.
James W. Pennebaker, Ryan L. Boyd, Kayla Jordan, and Kate Blackburn. 2015. The development and psychometric properties of LIWC2015. Technical report.
Margaret E. Roberts, Brandon M. Stewart, and Dustin Tingley. 2014a. stm: R Package for Structural Topic Models. Journal of Statistical Software, page 41.
Margaret E. Roberts, Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. 2014b. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4):1064–1082.
Armin Seyeditabari, Narges Tabari, and Wlodek Zadrozny. 2018. Emotion Detection in Text: a Review. arXiv:1806.00674 [cs].
Amy Walker, Matthew Weaver, Steven Morris, Jamie Grierson, Mark Brown, and Pete Pattisson. 2020. UK coronavirus live: Boris Johnson remains in hospital ’for observation’ after ’comfortable night’. The Guardian.
Tal Yarkoni and Jacob Westfall. 2017. Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning. Perspectives on Psychological Science, 12(6):1100–1122.