Moral Foundations Twitter Corpus: A collection of 35k tweets annotated for moral
sentiment
Joe Hoover1, Gwenyth Portillo-Wightman1*, Leigh Yeh1*, Shreya Havaldar1, Aida
Mostafazadeh Davani1, Ying Lin2, Brendan Kennedy1, Mohammad Atari1, Zahra Kamel1,
Madelyn Mendlen1, Gabriela Moreno1, Christina Park1, Tingyee E. Chang1, Jenna
Chin1, Christian Leong1, Jun Yen Leung1, Arineh Mirinjian1, Morteza Dehghani1
1University of Southern California
2Rensselaer Polytechnic Institute
Author Note
*Contributed equally.
This work has been funded in part by NSF IBSS #1520031, NSF CAREER
BCS-1846531, and the Army Research Lab. Correspondence regarding this article should
be addressed to Morteza Dehghani, mdehghan@usc.edu, 3620 S. McClintock Ave, Los
Angeles, CA 90089-1061.
Abstract
Research has shown that accounting for moral sentiment in natural language can yield
insight into a variety of on- and off-line phenomena, such as message diffusion, protest
dynamics, and social distancing. However, measuring moral sentiment in natural language
is challenging and the difficulty of this task is exacerbated by the limited availability of
annotated data. To address this issue, we introduce the Moral Foundations Twitter
Corpus, a collection of 35,108 tweets that have been curated from seven distinct domains of
discourse — including natural disasters, politics, and contemporary social issues — and
hand-annotated by at least three trained annotators for 10 categories of moral sentiment.
We discuss the structure of the corpus and our annotation procedures, as well as present
baseline moral sentiment classification results. By making the Moral Foundations Twitter
Corpus publicly available, our goal is to facilitate advances in applied and methodological
research at the intersection of psychology and Natural Language Processing.
Keywords: sentiment; NLP; text analysis; morality; Moral Foundations Theory
Moral Foundations Twitter Corpus: A collection of 35k tweets annotated for moral
sentiment
In this work, we introduce the Moral Foundations Twitter Corpus, a collection of
35,108 tweets that have been hand annotated for 10 categories of moral sentiment. The
motivation behind this corpus is to facilitate research at the intersection of psychology and
natural language processing, an area that has received increasingly widespread attention in
recent years. However, while a large portion of such research has focused on the task of
inferring latent person-level traits and states (Iliev, Dehghani, & Sagi, 2014; Kern et al.,
2016), such as personality (Azucar, Marengo, & Settanni, 2018; Garcia & Sikström, 2014;
Park, Schwartz, & Eichstaedt, 2014), values (Boyd et al., 2015), and depression (Eichstaedt
et al., 2018; Resnik, Garron, & Resnik, 2013; Zhou et al., 2015), this corpus addresses a
different task: measuring psychologically relevant constructs at the document level.
This task shares many similarities with standard sentiment classification tasks, such
as valence detection. However, it also introduces notable challenges, such as the fact that
moral sentiment categories co-occur, moral sentiment is often only implicitly signaled, and
ground-truth is, by definition, subjective. Despite these difficulties, research suggests that
accounting for expressions of moral sentiment can afford insight into important downstream
phenomena (Hoover, Dehghani, Johnson, Iliev, & Graham, 2017; Sagi & Dehghani, 2014),
such as violent protest (Mooijman, Hoover, Lin, Ji, & Dehghani, 2018), charitable donation
(Hoover, Johnson, Boghrati, Graham, & Dehghani, 2018), social avoidance (Dehghani et
al., 2016), diffusion (Brady, Wills, Jost, Tucker, & Van Bavel, 2017), and political discourse
(Dehghani, Sagae, Sachdeva, & Gratch, 2014; Johnson & Goldwasser, 2018).
However, aside from the computational challenges of measuring moral sentiment in
natural language, a major obstacle for both theoretical and methodological research has
been the difficulty of obtaining sufficient data. In our experience, all categories of moral
sentiment have low base rates, which complicates assembling a suitable corpus for
annotation. Further, compared to sentiment domains like positive and negative valence or
the basic emotions, annotating expressions of moral sentiment requires considerable
domain expertise and training. Accordingly, conducting either theoretical or
methodological research in this area has required substantial initial costs.
To address this issue, we have assembled a collection of 35,108 tweets drawn from
corpora focused around seven distinct, socially relevant discourse topics: All Lives Matter,
Black Lives Matter, the Baltimore protests, the 2016 Presidential election, hate speech &
offensive language (Davidson, Warmsley, Macy, & Weber, 2017), Hurricane Sandy, and
#MeToo. Already, portions of this corpus have facilitated advances in both theoretical and
methodological research. For example, Hoover et al. (2018) relies on the Hurricane Sandy
annotations to investigate the relationship between charitable donation and moral framing
and Mooijman et al. (2018) uses the Baltimore Protest annotations to predict violent
protest from online moral rhetoric. These annotation sets have also been used for recent
work advancing dictionary-based approaches to sentiment analysis (Garten, Boghrati,
Hoover, Johnson, & Dehghani, 2016; Garten et al., 2018) and feature enrichment via
background knowledge for a novel neural network architecture (Lin et al., 2018).
Our hope is that making these resources available for the research community will
facilitate both theoretical and methodological advances by lowering the cost of conducting
research in this area. Researchers can use these annotated tweets to evaluate new methods
and train models for downstream application, as well as work on current problems in
natural language processing (NLP), such as domain transfer and multitask learning. To
this end, we next provide a detailed description of the corpus, our annotation procedures,
and a set of baseline classification results from Word Count, Distributed Dictionary
Representation, and Long Short-Term Memory neural network models.
Corpus Overview
As noted above, the Moral Foundations Twitter Corpus (MFTC) consists of 35,108
tweets drawn from seven different discourse domains. These domains were chosen for
several reasons. Because moral sentiment generally only occurs in morally relevant
discourse contexts, it was necessary to focus on domains relevant to moral values. Further,
while many domains may seem to satisfy this constraint, it was also necessary to select
domains with sufficient popularity among Twitter users.
Given these constraints, we strove to select a set of domains (1) that were relevant to
current problems in the social sciences (e.g., prejudice, political polarization, natural
disaster dynamics) and (2) that we expected a priori to contain a wide variety of moral
concerns. We pursued the latter aim by selecting domains that were a priori associated
with the political Left (e.g., BLM), with the political Right (e.g., ALM), with both
ideological poles (e.g., the Presidential election), or with neither ideological
group (e.g., Hurricane Sandy). Through these considerations, our goal was to maximize the
variance in expressions of moral sentiment in the annotation corpus. This is particularly
important, as the content of moral sentiment expressions can vary substantially with
discourse context. For example, the moral sentiment contained in the Black Lives Matter
corpus is substantively distinct from the moral sentiment expressed in the Hurricane Sandy
corpus, as these corpora focus on largely distinct issues. This heterogeneity makes
out-of-domain prediction particularly difficult, because expressions of moral sentiment in
one domain will not necessarily generalize well to data drawn from a different domain.
Accordingly, to help address this issue, we provide moral sentiment annotations for Tweets
drawn from multiple, heterogeneous contexts.
Annotation
Each tweet in the MFTC was labeled by at least three trained annotators for 10
categories of moral sentiment as outlined in the Moral Foundations Coding Guide (Hoover,
Johnson-Grey, Dehghani, & Graham, 2017). These categories are drawn from Moral
Foundations Theory (MFT; Graham et al., 2013; Graham, Haidt, & Nosek, 2009), which
proposes a five-factor taxonomy of human morality. In this model, each factor is bipolar,
with one pole representing a virtue, a prescriptive moral concern, and the other a vice, a
prohibitive moral concern. The proposed factors (Virtues/Vices) are:
Care/Harm. Prescriptive concerns related to caring for others and prohibitive
concerns related to not harming others.
Fairness/Cheating. Prescriptive concerns related to fairness and equality and
prohibitive concerns related to not cheating or exploiting others.
Loyalty/Betrayal. Prescriptive concerns related to prioritizing one’s ingroup and
prohibitive concerns related to not betraying or abandoning one’s ingroup.
Authority/Subversion. Prescriptive concerns related to submitting to authority
and tradition and prohibitive concerns related to not subverting authority or
tradition.
Purity/Degradation. Prescriptive concerns related to maintaining the purity of
sacred entities, such as the body or a relic, and prohibitive concerns focused on the
contamination of such entities.
While researchers often do not discriminate between the virtues and vices of a given
foundation, their expressions in natural language are typically distinct and often
independent. For example, an utterance focused on a Harm violation (e.g., hurting
someone emotionally or physically) is not necessarily also going to express Care concerns.
Accordingly, to account for the semantic independence between virtues and vices, each
tweet in the corpus has been annotated for both.
Annotators, who were all undergraduate Research Assistants (authors 8-16, and
others), participated in repeated training sessions during which they developed expert-level
familiarity with the Moral Foundations Taxonomy. In early annotation stages, annotator
disagreement was also addressed through discussion and, if necessary, subsequent label
modification. However, moral sentiment is, in our view, qualitatively different from some
other, more conventional, sentiment domains. In many cases, it is difficult to make a final
determination of whether or not a document expresses moral sentiment, or, for that
matter, which moral sentiment it expresses, as such judgments are, ultimately, subjective
(Hoover, Johnson-Grey, et al., 2017).
Accordingly, while uniform annotator training is important, we believe that excessive
focus on maximizing annotator agreement risks artificially inflating agreement at the cost
of suppressing the natural variability of moral sentiment. Thus, while annotators were
instructed to strive for consistency, they were also encouraged to avoid heuristics that
might increase agreement with other annotators, but would also lead them to neglect their
own judgments.
Relying on this training, annotators were independently assigned to label each tweet
from a subset of tweets sampled from a corpus associated with one of our seven discourse
domains. The annotators used an annotation tool developed for Mooijman et al.'s (2018)
project, which is available at https://github.com/limteng-rpi/moral_annotation_tool.
Specifically, each tweet was assigned a label indicating the absence or presence of
each Virtue and Vice, or a label indicating that the tweet was non-moral. This yielded a
set of 11 labels for each tweet.
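To make the label structure concrete, the following is a minimal Python sketch of how one annotator's judgment might be encoded as 11 binary indicators; the category names follow the coding guide, but the field names and data layout are illustrative assumptions rather than the corpus's actual schema.

```python
# Illustrative encoding of the 11-label annotation scheme described above.
# The category names follow the Moral Foundations coding scheme; the exact
# column names and file format of the released corpus may differ.
MFT_LABELS = [
    "care", "harm", "fairness", "cheating", "loyalty", "betrayal",
    "authority", "subversion", "purity", "degradation", "non-moral",
]

def encode_annotation(selected):
    """Map one annotator's selected categories to 11 binary indicators."""
    return {label: int(label in selected) for label in MFT_LABELS}

# Example: an annotator marks a tweet as expressing both Harm and Cheating.
print(encode_annotation({"harm", "cheating"}))
```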
Sampling Procedure
General Sampling Procedure. To assemble the MFTC, we sampled tweets from
larger corpora associated with each of the seven discourse domains. While, as noted above,
these domains were selected to maximize the base rates of moral sentiment, the proportion
of tweets containing moral sentiment within each domain was still too low to use fully
randomized sampling. Accordingly, our general sampling procedure relied on a combination
of random sampling and semi-supervised selection, as in Garten et al. (2018) and Hoover
et al. (2018).
Specifically, for each discourse domain, we used Distributed Dictionary
Representation (DDR; Garten et al., 2018) to calculate moral loadings for each tweet for
each of the 10 virtues and vices. Then, for each virtue and vice, the n tweets with the
highest loadings were selected for annotation. Finally, an additional n tweets were sampled
from the subset of tweets with loadings that were ±1 SD from 0.
This procedure yielded approximately n×11 tweets per discourse domain. However,
because virtues and vices regularly co-occur, some duplication is expected under this
sampling procedure. Accordingly, as duplicates are removed, the final sampled N is less
than the upper bound of 11n.
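A rough sketch of this sampling step is given below. It assumes the per-tweet DDR loadings have already been computed, reads the "±1 SD from 0" criterion as "within one standard deviation of zero," and uses function and variable names of our own, so it should be taken as an approximation of the procedure rather than the authors' code.

```python
import numpy as np

def sample_for_annotation(tweet_ids, loadings, n, rng=None):
    """Approximate the semi-supervised sampling step described above.

    loadings: array of shape (num_tweets, 10) holding precomputed DDR
    loadings for the 10 virtues/vices. For each category we keep the n
    highest-loading tweets, then add n tweets drawn from those whose
    loadings all stay within 1 SD of 0, for an upper bound of 11n tweets.
    """
    rng = rng or np.random.default_rng()
    loadings = np.asarray(loadings)
    selected = set()
    for j in range(loadings.shape[1]):
        top = np.argsort(loadings[:, j])[::-1][:n]
        selected.update(int(i) for i in top)
    near_zero = np.where(np.all(np.abs(loadings) <= loadings.std(axis=0), axis=1))[0]
    if len(near_zero):
        extra = rng.choice(near_zero, size=min(n, len(near_zero)), replace=False)
        selected.update(int(i) for i in extra)
    # Duplicates across categories collapse, so the final N is below 11n.
    return [tweet_ids[i] for i in sorted(selected)]
```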
All Lives Matter. Tweets that include the #BlueLivesMatter and #AllLivesMatter
hashtags and were posted between 2015 and 2016. These tweets were purchased from a
third-party vendor.
Baltimore Protests. Tweets posted during the 2015 Baltimore protests (4/12/2015
to 5/8/2015) from cities where protests related to the death of Freddie Gray occurred
(Mooijman et al., 2018). They were purchased from Gnip.com.
Black Lives Matter. Tweets about the Black Lives Matter movement posted between
2015 and 2016, compiled using the hashtags #BLM and #BlackLivesMatter. The tweets
were purchased from a third-party vendor.
2016 Presidential Election. Tweets scraped during the 2016 Presidential election season
from the followers of @HillaryClinton, @realDonaldTrump, @NYTimes, @washingtonpost,
& @WSJ.
Davidson. Tweets taken from Davidson et al.'s (2017) corpus of hate speech and
offensive language. The original corpus is available at
https://github.com/t-davidson/hate-speech-and-offensive-language/tree/master/data;
note that the IDs in this corpus are not tweet IDs, but they can be used to find the
tweet texts in the original Davidson et al. corpus.
Hurricane Sandy. The tweets in this corpus were posted before, during, and
immediately after Hurricane Sandy (10/16/2012–11/05/2012). They were selected based on
the inclusion of Hurricane Sandy-related hashtags and purchased from Gnip.com.
#MeToo. The tweets in this corpus were purchased from a third-party vendor, and
contain data from 200 individuals involved in the #MeToo movement.
Annotation Results
Overall, this annotation and sampling procedure yielded 4000-6000 annotated tweets
for each discourse domain (See Table 1). Notably, the base rates of each of the virtues and
vices vary substantially across domains. For example, the ALM data (Total = 4,424)
contain only 443 tweets labeled as Degradation, which is approximately one-third of the
1,515 Degradation tweets in the Hurricane Sandy Corpus (Total = 4,591).
To evaluate inter-annotator agreement, we calculated both Fleiss’ Kappa for multiple
annotators (Fleiss, 1971) as well as prevalence- and bias-adjusted Fleiss' Kappa (PABAK;
Sim & Wright, 2005) to account for the low base rates of moral value expressions. As
expected due to the sparsity of moral content across all corpora, all Kappas were relatively
low. However, adjusting for prevalence and bias suggests that inter-annotator agreement
for each virtue and vice is reasonably high across discourse domains.
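For reference, the sketch below computes Fleiss' Kappa and PABAK for a single binary category from raw annotation counts. It is a minimal implementation of the published formulas (with PABAK fixing chance agreement at 1/k), not the analysis code used to produce Table 2.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa (Fleiss, 1971).

    counts: array of shape (num_items, num_categories); each row gives how
    many annotators assigned that item to each category, and every row is
    assumed to sum to the same number of annotators.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                   # annotators per item
    p_j = counts.sum(axis=0) / counts.sum()     # overall category proportions
    P_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.sum(p_j ** 2)
    return (P_bar - P_e) / (1 - P_e)

def pabak(counts):
    """Prevalence- and bias-adjusted kappa: chance agreement fixed at 1/k."""
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(axis=1)[0], counts.shape[1]
    P_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))
    return (k * P_i.mean() - 1) / (k - 1)

# Example: three annotators labeling four tweets for one binary category
# (columns: [category absent, category present]).
ratings = [[3, 0], [2, 1], [0, 3], [3, 0]]
print(round(fleiss_kappa(ratings), 3), round(pabak(ratings), 3))
```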
Moral Sentiment Prediction
In addition to developing the MFTC, we have also trained classifiers to predict moral
sentiment using a range of model architectures and linguistic features.
Table 1
Number of tweets labeled with each moral sentiment category, by discourse domain
Foundation ALM Baltimore BLM Election Davidson Sandy #MeToo
Subversion 392 1,700 701 484 142 76 2,285
Authority 620 666 606 527 1,563 1,196 1,454
Cheating 1,220 1,423 1,558 1,053 401 1,072 1,466
Fairness 1,235 700 1,349 1,037 194 1,044 1,022
Harm 1,777 1,040 2,094 1,161 477 562 1,074
Care 1,294 610 1,142 1,048 103 1,585 675
Betrayal 409 1,612 569 481 301 1,572 1,344
Loyalty 788 1,142 918 791 310 647 894
Purity 322 228 509 1,171 56 396 521
Degradation 443 267 630 490 574 1,515 2,208
Non-moral 3,037 4,079 3,753 4,372 4,117 1,421 2,391
Total 4,424 5,593 5,257 5,358 4,994 4,591 4,891
Our goal for these predictive models is to provide initial performance baselines for this
corpus, as well as to document the performance variance across multiple approaches to
sentiment analysis. Specifically, we report classification results from four different
approaches to moral sentiment prediction. For each, we attempt to predict the
document-level presence of moral
sentiment for each of the five Moral Foundations. We focus on prediction at the foundation
level, rather than at the level of the virtues and vices, because of the heterogeneous
sparsity of the virtues and vices across discourse domains (See Table 1).
Table 2
PABAK and KAPPA scores for all datasets and foundations
All ALM Baltimore BLM Election Davidson #MeToo Sandy
All Foundations KAPPA 0.3147 0.1556 0.3708 0.376 0.2907 0.159 0.2062 0.3156
PABAK 0.3656 0.2057 0.4693 0.4128 0.3983 0.4565 0.2259 0.3436
Subversion KAPPA 0.2732 0.1921 0.0546 0.5311 0.2258 0.071 0.1683 -0.1611
PABAK 0.8034 0.8829 0.6084 0.8849 0.9002 0.9502 0.4607 0.9745
Authority KAPPA 0.1177 0.3087 0.0066 0.542 0.1855 -0.2505 0.1882 0.2458
PABAK 0.7501 0.8218 0.8494 0.8969 0.8902 0.3984 0.6589 0.6693
Cheating KAPPA 0.3748 0.2555 0.2816 0.4947 0.4117 0.1626 0.3627 0.2836
PABAK 0.7317 0.6486 0.6894 0.7341 0.7951 0.868 0.6794 0.7153
Fairness KAPPA 0.414 0.313 0.1841 0.5284 0.4391 0.0074 0.3332 0.3769
PABAK 0.798 0.6623 0.8434 0.8009 0.8048 0.9332 0.7711 0.7684
Harm KAPPA 0.3596 0.1993 0.1872 0.3912 0.2952 0.3744 0.3538 0.285
PABAK 0.7319 0.4885 0.7625 0.6155 0.7597 0.8699 0.7683 0.8856
Care KAPPA 0.3853 0.2503 0.3446 0.4257 0.3234 0.1819 0.2948 0.4745
PABAK 0.8036 0.6357 0.8762 0.8204 0.7969 0.967 0.8501 0.7079
Betrayal KAPPA 0.2911 0.0667 0.2446 0.3639 0.1741 0.1872 0.1682 0.4445
PABAK 0.7912 0.8761 0.6417 0.9037 0.901 0.9035 0.6868 0.7207
Loyalty KAPPA 0.3687 0.2335 0.3216 0.6423 0.223 0.1427 0.3298 0.0909
PABAK 0.8232 0.7724 0.7602 0.8703 0.8455 0.8971 0.7996 0.9044
Purity KAPPA 0.2818 0.214 0.2427 0.2784 0.1926 0.124 0.2981 0.07
PABAK 0.895 0.9056 0.9518 0.9079 0.7593 0.9816 0.88 0.9548
Degradation KAPPA 0.2969 0.198 0.122 0.2713 0.2192 0.0596 0.3 0.3294
PABAK 0.7939 0.8668 0.9406 0.8783 0.9014 0.7974 0.5172 0.6193
Non-Moral KAPPA 0.3879 -0.0027 0.5841 0.3177 0.2881 0.2281 0.3904 0.0345
PABAK 0.4204 0.1575 0.5912 0.4206 0.2883 0.3102 0.5304 0.6929
Methodology
In order to provide a full-spectrum performance baseline for this corpus, we selected
methodologies from across a range of performance expectations. Specifically, we report
results from four approaches:
Model Set 1. In the first of these approaches, we use the Moral Foundations
Dictionary (MFD; Graham et al., 2009), available at
https://www.moralfoundations.org/othermaterials, to obtain message-level frequencies for
words associated with each virtue and vice. These word counts were then used to train
separate linear Support Vector Machine (SVM) models with ridge (L2) regularization to
predict the binary presence of each Moral Foundation, collapsing across virtues and vices.
Each SVM was trained with C = 1.
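A minimal sketch of this word-count pipeline is shown below. It assumes the dictionary has already been loaded as a mapping from category names to word sets (wildcard stems are ignored for simplicity) and uses scikit-learn's LinearSVC, whose L2-penalized objective corresponds to the ridge-style regularization described above; the authors' exact implementation may differ.

```python
import numpy as np
from sklearn.svm import LinearSVC

def dictionary_counts(docs, dictionary):
    """Count, per document, how many tokens match each dictionary category.

    dictionary: dict mapping a category name (e.g., "care.virtue") to a set
    of words, as might be loaded from the Moral Foundations Dictionary.
    """
    categories = sorted(dictionary)
    features = np.zeros((len(docs), len(categories)))
    for i, doc in enumerate(docs):
        tokens = doc.lower().split()
        for j, cat in enumerate(categories):
            features[i, j] = sum(tok in dictionary[cat] for tok in tokens)
    return features

# Hypothetical usage: one binary classifier per foundation, trained with C = 1.
# X = dictionary_counts(tweets, mfd)              # `tweets` and `mfd` assumed loaded
# clf = LinearSVC(C=1.0).fit(X, care_harm_labels) # labels from the majority vote
```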
Model Set 2. For the second model set, we replaced the Moral Foundations
Dictionary with the Moral Foundations Dictionary 2 (MFD2; Frimer, Boghrati, Haidt,
Graham, & Dehghani, 2015), an updated lexicon of words associated with each Virtue and
Vice, available at http://www.jeremyfrimer.com/uploads/2/1/2/7/21278832/mfd2.0.dic.
Using word frequencies based on the MFD2, we generated predictions of moral
sentiment using linear SVMs with the same implementation as for Model Set 1.
Model Set 3. For the third model set, we again trained linear SVMs to predict
moral sentiment; however, rather than relying on word-counts, we used DDR to calculate
moral loadings for each message (Garten et al., 2018). We used the same seed-words for
DDR as the ones used in the second study of Garten et al. (2018). These loadings
represent the estimated similarity between a given message and latent semantic
representations of each foundation. These loadings were then used as features to train a
third set of linear SVMs.
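The core DDR computation, as described by Garten et al. (2018), represents both the concept dictionary (here, a set of seed words) and the document as the average of their words' embeddings and takes the cosine similarity between the two. The sketch below is our own minimal rendering of that idea; the embedding lookup and the 300-dimensional vector size are assumptions.

```python
import numpy as np

def mean_vector(words, embeddings, dim=300):
    """Average the embeddings of the words found in the vocabulary."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def ddr_loading(doc, seed_words, embeddings, dim=300):
    """Cosine similarity between a document's mean embedding and the mean
    embedding of a foundation's seed words (the DDR loading)."""
    d = mean_vector(doc.lower().split(), embeddings, dim)
    s = mean_vector(seed_words, embeddings, dim)
    denom = np.linalg.norm(d) * np.linalg.norm(s)
    return float(d @ s / denom) if denom else 0.0

# Hypothetical usage: `embeddings` maps words to 300-d vectors (e.g., word2vec);
# the resulting loadings are then used as SVM features.
# x = [ddr_loading(tweet, seeds[f], embeddings) for f in foundations]
```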
Model Set 4. For the fourth model, we implemented and trained a multi-task
Long Short-Term Memory (LSTM) neural network (Collobert & Weston, 2008; Luong, Le,
Sutskever, Vinyals, & Kaiser, 2015) to predict moral sentiment. LSTMs are particularly
effective for document-level classification tasks, as they rely on a recurrent structure that
yields latent representations of documents that encode long-term dependencies among
words. Here, we use a multitask architecture, which involves training a model to predict
labels for multiple outcomes. Specifically, for each discourse domain, we trained a
multi-task model to predict the document-level presence of each Moral Foundation.
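As a rough illustration of such an architecture (not the authors' implementation), the PyTorch module below shares a single LSTM encoder across tasks and attaches one binary classification head per Moral Foundation; the embedding and hidden sizes are arbitrary placeholder values.

```python
import torch
import torch.nn as nn

class MultiTaskLSTM(nn.Module):
    """Shared LSTM encoder with one binary head per Moral Foundation."""

    def __init__(self, vocab_size, num_tasks=5, embed_dim=200, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_tasks)]
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) tensor of word indices
        embedded = self.embed(token_ids)
        _, (hidden, _) = self.encoder(embedded)   # final hidden state
        doc_repr = hidden[-1]                     # (batch, hidden_dim)
        # One logit per foundation; train with BCEWithLogitsLoss per head.
        return torch.cat([head(doc_repr) for head in self.heads], dim=1)

# Hypothetical usage:
# model = MultiTaskLSTM(vocab_size=20_000)
# logits = model(batch_token_ids)   # shape (batch, 5)
```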
To establish performance baselines, we first collapsed each tweet's annotations by taking
the majority vote for each Foundation. We then trained each model type separately on
each discourse domain to predict each Moral Foundation. Then, using the entire corpus, we
trained each model type to predict each Moral Foundation (i.e., the 'All' corpus). Finally,
we also collapsed across Moral Foundations and trained each model type, on each discourse
domain and on the entire corpus, to predict whether documents were moral or non-moral.
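A minimal sketch of this collapsing step follows, under our own assumed data layout: a foundation is marked present when a strict majority of a tweet's annotators selected either its virtue or its vice.

```python
# Illustrative majority-vote collapse from virtue/vice annotations to
# foundation-level binary labels; the input format is an assumption.
FOUNDATIONS = {
    "care_harm": ("care", "harm"),
    "fairness_cheating": ("fairness", "cheating"),
    "loyalty_betrayal": ("loyalty", "betrayal"),
    "authority_subversion": ("authority", "subversion"),
    "purity_degradation": ("purity", "degradation"),
}

def majority_vote(annotations):
    """Collapse per-annotator category sets into per-foundation labels.

    annotations: one set of selected categories per annotator, e.g.
    [{"harm"}, {"harm", "cheating"}, {"non-moral"}].
    """
    n = len(annotations)
    return {
        name: int(sum(bool(set(pair) & a) for a in annotations) > n / 2)
        for name, pair in FOUNDATIONS.items()
    }

print(majority_vote([{"harm"}, {"harm", "cheating"}, {"non-moral"}]))
# care_harm is positive (2 of 3 annotators); all other foundations are 0.
```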
Results
As expected, performance varied substantially across methodology, discourse domain,
and prediction task. Further, our results suggest that in the context of different domains
and prediction tasks, each methodology showed different strengths and weaknesses. For
example, while predictions derived from the LSTM models almost always outperformed
predictions derived from the other models in terms of F1 and Precision, DDR generally
yielded higher recall compared to both the LSTM and dictionary-based approaches (See
Tables 3, 4, 5, 6, 7, 8). Notably, the results from DDR and LSTM models trained to
predict only the presence of general moral sentiment, as opposed to a specific foundation,
also suggest that poor performance may be a function of sparsity. That is, when all moral
sentiment labels are collapsed into a single class, and there are thus more positive training
observations, performance improves and stabilizes across discourse domains.
In some cases, the dictionary-based approaches also substantially outperformed
DDR in terms of precision. Finally, our results suggest that while, on average, the MFD and
MFD2 dictionaries yield comparable performance in terms of F1, performance differences,
again, depend on discourse domain and Foundation. Further, across discourse domains and
Foundations, the MFD2 appears to offer higher precision, compared to the original MFD.
In contrast, the original MFD appears to offer generally better recall, compared to the
MFD2.
Together, our classification results demonstrate the viability of measuring moral
sentiment in natural language using a range of methodologies; however, they also highlight
the difficulty of this task. Regardless of methodology, considerable performance variation
was observed across both discourse domain and Foundation. In our view, this raises
multiple important goals for future research, such as working toward a better understanding
of the causes of this variation and developing methodological approaches that minimize it.
Table 3
Model F1, Precision, and Recall scores for Moral (vs. Non-Moral) classification
Model Metric All ALM Baltimore BLM Election Davidson #MeToo Sandy
SVM-MFD F1 0.54 (0.02) 0.53 (0.03) 0.33 (0.03) 0.66 (0.03) 0.56 (0.03) 0.02 (0.02) 0.62 (0.04) 0.22 (0.03)
Precision 0.68 (0.02) 0.81 (0.03) 0.70 (0.06) 0.91 (0.02) 0.88 (0.03) 0.01 (0.01) 0.66 (0.05) 0.13 (0.02)
Recall 0.45 (0.02) 0.40 (0.03) 0.22 (0.02) 0.52 (0.03) 0.41 (0.03) 0.08 (0.11) 0.59 (0.03) 0.90 (0.04)
SVM-MFD2 F1 0.58 (0.01) 0.66 (0.02) 0.59 (0.02) 0.59 (0.02) 0.62 (0.02) 0.14 (0.02) 0.63 (0.03) 0.64 (0.02)
Precision 0.57 (0.02) 0.57 (0.03) 0.56 (0.03) 0.56 (0.03) 0.52 (0.02) 0.75 (0.08) 0.67 (0.04) 0.50 (0.02)
Recall 0.60 (0.01) 0.79 (0.02) 0.63 (0.03) 0.63 (0.03) 0.75 (0.03) 0.08 (0.01) 0.59 (0.03) 0.88 (0.03)
SVM-DDR F1 0.66 (0.01) 0.67 (0.03) 0.64 (0.02) 0.81 (0.02) 0.71 (0.02) 0.13 (0.03) 0.65 (0.04) 0.69 (0.02)
Precision 0.60 (0.02) 0.76 (0.04) 0.58 (0.03) 0.88 (0.02) 0.71 (0.03) 0.46 (0.07) 0.55 (0.04) 0.87 (0.03)
Recall 0.74 (0.01) 0.60 (0.03) 0.72 (0.04) 0.74 (0.02) 0.71 (0.02) 0.08 (0.02) 0.78 (0.05) 0.57 (0.03)
LSTM F1 0.74 (0.01) 0.76 (0.03) 0.74 (0.02) 0.90 (0.01) 0.77 (0.02) 0.18 (0.04) 0.65 (0.03) 0.90 (0.01)
Precision 0.73 (0.03) 0.72 (0.03) 0.86 (0.03) 0.96 (0.01) 0.78 (0.02) 0.49 (0.14) 0.77 (0.05) 1.00 (0.00)
Recall 0.77 (0.01) 0.80 (0.03) 0.65 (0.03) 0.85 (0.02) 0.76 (0.03) 0.11 (0.03) 0.56 (0.05) 0.82 (0.02)
Table 4
Model F1, Precision, and Recall scores for Care/Harm
Model Metric All ALM Baltimore BLM Election Davidson #MeToo Sandy
SVM-MFD F1 0.40 (0.02) 0.55 (0.04) 0.16 (0.04) 0.53 (0.03) 0.47 (0.05) 0.06 (0.02) 0.43 (0.05) 0.22 (0.03)
Precision 0.46 (0.03) 0.40 (0.04) 0.15 (0.05) 0.38 (0.03) 0.60 (0.07) 1.00 (0.00) 0.33 (0.04) 0.13 (0.02)
Recall 0.35 (0.03) 0.89 (0.02) 0.16 (0.04) 0.86 (0.02) 0.40 (0.04) 0.03 (0.01) 0.61 (0.08) 0.90 (0.04)
SVM-MFD2 F1 0.45 (0.02) 0.62 (0.04) 0.28 (0.04) 0.28 (0.04) 0.53 (0.04) 0.05 (0.02) 0.40 (0.07) 0.53 (0.04)
Precision 0.65 (0.02) 0.68 (0.05) 0.56 (0.07) 0.56 (0.07) 0.64 (0.04) 0.78 (0.31) 0.74 (0.13) 0.52 (0.04)
Recall 0.35 (0.02) 0.58 (0.05) 0.18 (0.03) 0.18 (0.03) 0.46 (0.05) 0.03 (0.01) 0.27 (0.05) 0.54 (0.05)
SVM-DDR F1 0.42 (0.02) 0.57 (0.46) 0.25 (0.04) 0.64 (0.04) 0.51 (0.03) 0.05 (0.03) 0.29 (0.04) 0.55 (0.03)
Precision 0.30 (0.01) 0.48 (0.05) 0.15 (0.03) 0.55 (0.05) 0.38 (0.03) 0.39 (0.14) 0.18 (0.03) 0.44 (0.03)
Recall 0.75 (0.02) 0.69 (0.06) 0.68 (0.07) 0.76 (0.04) 0.74 (0.04) 0.03 (0.01) 0.80 (0.08) 0.73 (0.06)
LSTM F1 0.62 (0.02) 0.67 (0.04) 0.33 (0.06) 0.74 (0.03) 0.64 (0.03) 0.07 (0.03) 0.32 (0.04) 0.49 (0.04)
Precision 0.65 (0.04) 0.78 (0.04) 0.73 (0.07) 0.66 (0.02) 0.71 (0.04) 0.45 (0.21) 0.81 (0.08) 0.34 (0.04)
Recall 0.59 (0.04) 0.59 (0.06) 0.21 (0.04) 0.85 (0.05) 0.59 (0.04) 0.04 (0.02) 0.20 (0.04) 0.88 (0.07)
Discussion
By understanding and measuring the expression of moral sentiment in natural
language, researchers can gain insight into a variety of important digital- and real-world
phenomena (Hoover, Johnson-Grey, et al., 2017; Sagi & Dehghani, 2014). However, in
practice, it can be quite costly to take advantage of these opportunities. In our view, a
major driver of this cost has been the difficulty of obtaining annotated data, which is
necessary for evaluating method performance and training supervised language models. To
address this issue, we have developed the Moral Foundations Twitter Corpus, a collection
of 35,108 Tweets drawn from seven different domains and annotated for 10 types of moral
sentiment. This corpus can be used, for example, to help verify new collections of
annotated data, train models for predicting moral sentiment in new data, and evaluate new
methods for measuring moral sentiment in text.
Table 5
Model F1, Precision, and Recall scores for Fairness
Model Metric All ALM Baltimore BLM Election Davidson #MeToo Sandy
SVM-MFD F1 0.64 (0.02) 0.68 (0.04) 0.25 (0.05) 0.72 (0.05) 0.72 (0.03) 0.03 (0.06) 0.57 (0.07) 0.22 (0.03)
Precision 0.68 (0.02) 0.81 (0.04) 0.31 (0.06) 0.82 (0.05) 0.89 (0.03) 0.02 (0.05) 0.61 (0.07) 0.13 (0.02)
Recall 0.60 (0.02) 0.59 (0.05) 0.21 (0.04) 0.64 (0.06) 0.61 (0.04) 0.04 (0.09) 0.53 (0.08) 0.90 (0.04)
SVM-MFD2 F1 0.61 (0.01) 0.68 (0.03) 0.40 (0.05) 0.40 (0.05) 0.72 (0.03) 0.03 (0.04) 0.52 (0.07) 0.53 (0.06)
Precision 0.70 (0.03) 0.71 (0.03) 0.60 (0.03) 0.60 (0.03) 0.74 (0.05) 0.21 (0.27) 0.59 (0.07) 0.58 (0.06)
Recall 0.54 (0.02) 0.65 (0.04) 0.30 (0.05) 0.30 (0.05) 0.71 (0.04) 0.02 (0.03) 0.47 (0.08) 0.49 (0.07)
SVM-DDR F1 0.59 (0.01) 0.72 (0.04) 0.41 (0.05) 0.81 (0.03) 0.71 (0.02) 0.03 (0.02) 0.44 (0.04) 0.50 (0.04)
Precision 0.46 (0.02) 0.65 (0.05) 0.28 (0.04) 0.77 (0.04) 0.61 (0.03) 0.48 (0.19) 0.32 (0.04) 0.38 (0.03)
Recall 0.82 (0.02) 0.79 (0.04) 0.75 (0.04) 0.86 (0.04) 0.85 (0.02) 0.02 (0.01) 0.70 (0.06) 0.75 (0.06)
LSTM F1 0.73 (0.02) 0.76 (0.03) 0.47 (0.04) 0.87 (0.03) 0.79 (0.03) 0.05 (0.03) 0.46 (0.06) 0.58 (0.03)
Precision 0.80 (0.04) 0.82 (0.04) 0.83 (0.05) 0.89 (0.36) 0.84 (0.04) 0.47 (0.17) 0.73 (0.07) 0.72 (0.04)
Recall 0.68 (0.03) 0.70 (0.04) 0.33 (0.04) 0.86 (0.04) 0.74 (0.04) 0.03 (0.02) 0.33 (0.06) 0.49 (0.04)
In our view, open data standards regarding annotated text corpora are a key element
in the emerging field of computational social science. They afford greater research
transparency and can help facilitate scientific progress via the free dissemination of
materials that are costly to assemble. Through the Moral Foundations Twitter Corpus, we
hope to contribute to this culture of openness and thereby help facilitate both applied and
methodological advances in the computational social sciences.
Table 6
Model F1, Precision, and Recall scores for Loyalty
Model Metric All ALM Baltimore BLM Election Davidson #MeToo Sandy
SVM-MFD F1 0.46 (0.04) 0.63 (0.09) 0.10 (0.06) 0.64 (0.06) 0.41 (0.05) 0.01 (0.01) 0.34 (0.05) 0.22 (0.03)
Precision 0.42 (0.04) 0.55 (0.10) 0.10 (0.06) 0.63 (0.07) 0.32 (0.05) 0.90 (0.32) 0.29 (0.05) 0.13 (0.02)
Recall 0.50 (0.04) 0.75 (0.11) 0.11 (0.06) 0.66 (0.07) 0.60 (0.09) 0.01 (0.00) 0.41 (0.07) 0.90 (0.04)
SVM-MFD2 F1 0.38 (0.04) 0.48 (0.05) 0.20 (0.04) 0.20 (0.04) 0.39 (0.05) 0.02 (0.01) 0.32 (0.02) 0.23 (0.05)
Precision 0.69 (0.03) 0.81 (0.06) 0.51 (0.07) 0.51 (0.07) 0.71 (0.08) 0.75 (0.35) 0.60 (0.05) 0.67 (0.07)
Recall 0.26 (0.03) 0.35 (0.05) 0.13 (0.03) 0.13 (0.03) 0.27 (0.04) 0.01 (0.00) 0.22 (0.02) 0.14 (0.04)
SVM-DDR F1 0.36 (0.03) 0.57 (0.06) 0.21 (0.05) 0.75 (0.05) 0.32 (0.03) 0.01 (0.01) 0.33 (0.03) 0.33 (0.05)
Precision 0.24 (0.03) 0.43 (0.06) 0.13 (0.03) 0.63 (0.06) 0.20 (0.02) 0.01 (0.01) 0.21 (0.02) 0.21 (0.04)
Recall 0.79 (0.04) 0.88 (0.04) 0.66 (0.11) 0.93 (0.03) 0.78 (0.06) 0.32 (0.30) 0.76 (0.08) 0.78 (0.08)
LSTM F1 0.41 (0.04) 0.62 (0.04) 0.25 (0.03) 0.83 (0.03) 0.41 (0.07) 0.01 (0.01) 0.35 (0.04) 0.44 (0.06)
Precision 0.84 (0.03) 0.85 (0.09) 0.81 (0.11) 0.93 (0.04) 0.79 (0.09) 0.31 (0.33) 0.68 (0.08) 0.77 (0.06)
Recall 0.27 (0.03) 0.50 (0.05) 0.15 (0.02) 0.74 (0.05) 0.28 (0.06) 0.01 (0.01) 0.24 (0.03) 0.31 (0.05)
Table 7
Model F1, Precision, and Recall scores for Authority
Model Metric All ALM Baltimore BLM Election Davidson #MeToo Sandy
SVM-MFD F1 0.46 (0.04) 0.63 (0.09) 0.10 (0.06) 0.64 (0.06) 0.41 (0.05) 0.01 (0.01) 0.34 (0.05) 0.22 (0.03)
Precision 0.42 (0.04) 0.55 (0.10) 0.10 (0.06) 0.63 (0.07) 0.32 (0.05) 0.90 (0.32) 0.29 (0.05) 0.13 (0.02)
Recall 0.50 (0.04) 0.75 (0.11) 0.11 (0.06) 0.66 (0.07) 0.60 (0.09) 0.01 (0.00) 0.41 (0.07) 0.90 (0.04)
SVM-MFD2 F1 0.38 (0.04) 0.48 (0.05) 0.20 (0.04) 0.20 (0.04) 0.39 (0.05) 0.02 (0.01) 0.32 (0.02) 0.23 (0.05)
Precision 0.69 (0.03) 0.81 (0.06) 0.51 (0.07) 0.51 (0.07) 0.71 (0.08) 0.75 (0.35) 0.60 (0.05) 0.67 (0.07)
Recall 0.26 (0.03) 0.35 (0.05) 0.13 (0.03) 0.13 (0.03) 0.27 (0.04) 0.01 (0.00) 0.22 (0.02) 0.14 (0.04)
SVM-DDR F1 0.36 (0.03) 0.57 (0.06) 0.21 (0.05) 0.75 (0.05) 0.32 (0.03) 0.01 (0.01) 0.33 (0.03) 0.33 (0.05)
Precision 0.24 (0.03) 0.43 (0.06) 0.13 (0.03) 0.63 (0.06) 0.20 (0.02) 0.01 (0.01) 0.21 (0.02) 0.21 (0.04)
Recall 0.79 (0.04) 0.88 (0.04) 0.66 (0.11) 0.93 (0.03) 0.78 (0.06) 0.32 (0.30) 0.76 (0.08) 0.78 (0.08)
LSTM F1 0.41 (0.04) 0.62 (0.04) 0.25 (0.03) 0.83 (0.03) 0.41 (0.07) 0.01 (0.01) 0.35 (0.04) 0.44 (0.06)
Precision 0.84 (0.03) 0.85 (0.09) 0.81 (0.11) 0.93 (0.04) 0.79 (0.09) 0.31 (0.33) 0.68 (0.08) 0.77 (0.06)
Recall 0.27 (0.03) 0.50 (0.05) 0.15 (0.02) 0.74 (0.05) 0.28 (0.06) 0.01 (0.01) 0.24 (0.03) 0.31 (0.05)
Table 8
Model F1, Precision, and Recall scores for Purity
Model Metric All ALM Baltimore BLM Election Davidson #MeToo Sandy
SVM-MFD F1 0.13 (0.05) 0.28 (0.13) 0.04 (0.03) 0.21 (0.03) 0.23 (0.03) 0.00 (0.01) 0.26 (0.12) 0.22 (0.03)
Precision 0.11 (0.16) 0.41 (0.30) 0.02 (0.02) 0.12 (0.02) 0.14 (0.02) 0.09 (0.28) 0.18 (0.10) 0.13 (0.02)
Recall 0.79 (0.22) 0.52 (0.35) 0.11 (0.12) 0.95 (0.03) 0.89 (0.06) 0.00 (0.00) 0.50 (0.18) 0.90 (0.04)
SVM-MFD2 F1 0.29 (0.02) 0.34 (0.08) 0.15 (0.05) 0.15 (0.05) 0.46 (0.05) 0.03 (0.02) 0.34 (0.11) 0.45 (0.03)
Precision 0.71 (0.04) 0.75 (0.15) 0.55 (0.14) 0.55 (0.14) 0.69 (0.06) 0.29 (0.22) 0.75 (0.08) 0.83 (0.05)
Recall 0.19 (0.01) 0.22 (0.06) 0.08 (0.03) 0.08 (0.03) 0.35 (0.05) 0.02 (0.01) 0.23 (0.09) 0.31 (0.03)
SVM-DDR F1 0.24 (0.02) 0.26 (0.06) 0.13 (0.04) 0.40 (0.06) 0.38 (0.05) 0.03 (0.02) 0.32 (0.13) 0.46 (0.04)
Precision 0.14 (0.01) 0.16 (0.04) 0.07 (0.02) 0.27 (0.05) 0.25 (0.04) 0.49 (0.25) 0.20 (0.09) 0.36 (0.04)
Recall 0.77 (0.03) 0.77 (0.13) 0.79 (0.14) 0.86 (0.08) 0.79 (0.08) 0.02 (0.01) 0.85 (0.13) 0.66 (0.04)
LSTM F1 0.41 (0.29) 0.39 (0.06) 0.11 (0.04) 0.57 (0.07) 0.49 (0.05) 0.05 (0.02) 0.35 (0.05) 0.56 (0.04)
Precision 0.74 (0.04) 0.75 (0.07) 0.73 (0.14) 0.83 (0.07) 0.80 (0.05) 0.45 (0.22) 0.81 (0.09) 0.75 (0.03)
Recall 0.29 (0.03) 0.26 (0.05) 0.06 (0.02) 0.43 (0.06) 0.36 (0.05) 0.02 (0.01) 0.22 (0.04) 0.44 (0.05)
References
Azucar, D., Marengo, D., & Settanni, M. (2018, April). Predicting the big 5 personality
traits from digital footprints on social media: A meta-analysis. Personality and
individual differences,124 , 150–159.
Boyd, R. L., Wilson, S. R., Pennebaker, J. W., Kosinski, M., Stillwell, D. J., & Mihalcea,
R. (2015). Values in words: Using language to evaluate and understand personal
values. In Ninth international AAAI conference on web and social media.
Brady, W. J., Wills, J. A., Jost, J. T., Tucker, J. A., & Van Bavel, J. J. (2017, July).
Emotion shapes the diffusion of moralized content in social networks. Proceedings of
the National Academy of Sciences of the United States of America,114 (28),
7313–7318.
Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing:
Deep neural networks with multitask learning. In Proceedings of the 25th
international conference on machine learning (pp. 160–167).
Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated hate speech
detection and the problem of offensive language. In Eleventh international aaai
conference on web and social media.
Dehghani, M., Johnson, K., Hoover, J., Sagi, E., Garten, J., Parmar, N. J., . . . Graham, J.
(2016, January). Purity homophily in social networks. Journal of experimental
psychology. General.
Dehghani, M., Sagae, K., Sachdeva, S., & Gratch, J. (2014). Analyzing political rhetoric in
conservative and liberal weblogs related to the construction of the “ground zero
mosque”. Journal of Information Technology & Politics,11 (1), 1–14.
Eichstaedt, J. C., Smith, R. J., Merchant, R. M., Ungar, L. H., Crutchley, P.,
Preoţiuc-Pietro, D., . . . Schwartz, H. A. (2018, October). Facebook language predicts
depression in medical records. Proceedings of the National Academy of Sciences of the
United States of America,115 (44), 11203–11208.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological
bulletin,76 (5), 378.
Frimer, J., Boghrati, R., Haidt, J., Graham, J., & Dehghani, M. (2015). Moral foundations
dictionary 2.0.
Garcia, D., & Sikström, S. (2014, September). The dark side of facebook: Semantic
representations of status updates predict the dark triad of personality. Personality
and individual differences,67 , 92–96.
Garten, J., Boghrati, R., Hoover, J., Johnson, K. M., & Dehghani, M. (2016). Morality
between the lines: Detecting moral sentiment in text. In Proceedings of the IJCAI 2016
workshop on computational modeling of attitudes.
Garten, J., Hoover, J., Johnson, K. M., Boghrati, R., Iskiwitch, C., & Dehghani, M. (2018,
February). Dictionaries and distributions: Combining expert knowledge and large
scale textual data content analysis: Distributed dictionary representation. Behavior
research methods,50 (1), 344–361.
Graham, J., Haidt, J., Koleva, S., Motyl, M., Iyer, R., Wojcik, S. P., & Ditto, P. H. (2013).
Moral foundations theory: The pragmatic validity of moral pluralism. In Advances in
experimental social psychology (Vol. 47, pp. 55–130). Elsevier.
Graham, J., Haidt, J., & Nosek, B. A. (2009). Liberals and conservatives rely on different
sets of moral foundations. Journal of personality and social psychology,96 (5),
1029–1046.
Hoover, J., Dehghani, M., Johnson, K., Iliev, R., & Graham, J. (2017). Into the wild: Big
data analytics in moral psychology. In J. Graham & K. Gray (Eds.), The atlas of
moral psychology. Guilford Press.
Hoover, J., Johnson, K., Boghrati, R., Graham, J., & Dehghani, M. (2018, April). Moral
framing and charitable donation: Integrating exploratory social media analyses and
confirmatory experimentation. Collabra: Psychology,4(1), 9.
Hoover, J., Johnson-Grey, K., Dehghani, M., & Graham, J. (2017). Moral values coding
guide.
Iliev, R., Dehghani, M., & Sagi, E. (2014, July). Automated text analysis in psychology:
methods, applications, and future developments. Language and cognition, 1–26.
Johnson, K., & Goldwasser, D. (2018). Classification of moral foundations in microblog
political discourse. In Proceedings of the 56th annual meeting of the association for
computational linguistics (volume 1: Long papers) (Vol. 1, pp. 720–730).
Kern, M. L., Park, G., Eichstaedt, J. C., Schwartz, H. A., Sap, M., Smith, L. K., & Ungar,
L. H. (2016, December). Gaining insights from social media language: Methodologies
and challenges. Psychological methods,21 (4), 507–525.
Lin, Y., Hoover, J., Portillo-Wightman, G., Park, C., Dehghani, M., & Ji, H. (2018).
Acquiring background knowledge to improve moral value prediction. In The 2018
IEEE/ACM international conference on advances in social networks analysis and
mining (ASONAM2018).
Luong, M.-T., Le, Q. V., Sutskever, I., Vinyals, O., & Kaiser, L. (2015). Multi-task
sequence to sequence learning. arXiv preprint arXiv:1511.06114.
Mooijman, M., Hoover, J., Lin, Y., Ji, H., & Dehghani, M. (2018, June). Moralization in
social networks and the emergence of violence during protests. Nature Human
Behaviour,2(6), 389–396.
Park, G., Schwartz, H., & Eichstaedt, J. (2014). Automatic personality assessment through
social media language. Journal of personality and social psychology.
Resnik, P., Garron, A., & Resnik, R. (2013). Using topic modeling to improve prediction of
neuroticism and depression in college students. In Proceedings of the 2013 conference
on empirical methods in natural language processing (pp. 1348–1353).
Sagi, E., & Dehghani, M. (2014, April). Measuring moral rhetoric in text. Social science
computer review,32 (2), 132–144.
Sim, J., & Wright, C. C. (2005). The kappa statistic in reliability studies: Use,
interpretation, and sample size requirements. Physical Therapy.
Zhou, L., Baughman, A. W., Lei, V. J., Lai, K. H., Navathe, A. S., Chang, F., . . . Rocha,
R. A. (2015). Identifying patients with depression using free-text clinical documents.
Studies in health technology and informatics,216 , 629–633.