Distant Reading of Religious Online Communities:
A Case Study for Three Religious Forums on Reddit
Thomas Schmidt, Florian Kaindl and Christian Wolff
Media Informatics Group, University of Regensburg, Germany
Abstract. We present results of a project examining the application of computational
text analysis and distant reading in the context of comparative religious studies,
sociology, and online communication. As a source for our corpus, we use the popular
platform Reddit and three of the largest religious subreddits: the subreddits
Christianity, Islam and Occult. We have acquired all posts along with metadata for an
entire year, resulting in over 700,000 comments and around 50 million tokens. We
explore the corpus and compare the different online communities via measures like word
frequencies, bigrams, collocations and sentiment and emotion analysis to analyze if
there are differences in the language used, the topics that are talked about and the
sentiments and emotions expressed. Furthermore, we explore approaches to diachronic
analysis and visualization. We conclude with a discussion about the limitations but
also the benefits of distant reading methods in religious studies.
Keywords: Religious Studies, Distant Reading, Reddit, Sentiment Analysis,
Computational Social Science, Collocation
Please cite as:
Schmidt, T., Kaindl, F. & Wolff, C. (2020). Distant Reading of Religious Online
Communities: A Case Study for Three Religious Forums on Reddit. In Proceedings of
the Digital Humanities in the Nordic Countries 5th Conference (DHN 2020)
(pp. 157-172). Riga, Latvia.
1 Introduction
With the concept of distant reading, Moretti [17] has argued for the application of
statistical and computational methods, primarily in literary studies and linguistics.
The general idea of distant reading is to explore large quantities of text via methods of
computational text analysis and text visualization, thus enabling findings that would
not be possible by qualitative or hermeneutical work alone. Some of the most popular
methods in this field are stylometry, topic modeling and sentiment and emotion
analysis [10]. Distant reading is also explored outside of literary studies: similar
concepts for other media types can be found in film studies ([1]: distant viewing) or
digital musicology [5], and within textual analysis the approach has been applied to
other text-oriented domains (e.g. [21]). In a similar way, we want to explore the
application of distant reading in the context of religious studies and sociology by
analyzing the communication of different religious groups on the online platform Reddit.
With the rise of social media, the research area of computational social science is
gaining popularity [12]. However, most text analysis and visualization approaches are
focused on areas like politics (e.g. [11]). In the context of religious content and
religious studies, one can find research about extremist groups like ISIS [2] or the
application of distant reading techniques to famous religious texts (e.g. the Bible)
[14, 25]. However, the exploration of the online communication of “ordinary” or
“moderate” religious and spiritual groups on social media channels is rather rare,
although one can assume that, considering the importance of social media for young
adults, a lot of religious discussions take place on those platforms. Pfahler et al. [21]
show the benefit of applying distant reading to a Muslim forum by exploring the topics
discussed via topic modeling.
To gather more insights about the subjects and language of religious discussions on
social media across diverse religious creeds, we present the results of a project
examining and comparing three subforums on Reddit: a Christian, a Muslim and an occult
forum. We explore different techniques of distant reading and computational text
analysis. As methods, we employ the analysis of most frequent words and bigrams,
collocation analysis and visualization as well as sentiment and emotion analysis. Our
research goals are (1) to identify differences and specific features concerning
language usage as well as content discussed among those groups and (2) to reflect upon
the benefits and limitations of the different computational techniques used.
2 Corpus
In the following, we describe how we gathered and constructed the corpus. If not
mentioned otherwise, we made use of Python and the popular library NLTK for all
2.1 Corpus Acquisition
As a source for our corpus, we have chosen the platform Reddit. Reddit is a news
aggregation website founded in 2005 and is ranked among the top 20 most visited
websites in the world. In recent years, the platform has evolved from its primary use,
which was to share links and images. Nowadays, it is a collection of subforums for
various topics. Users can subscribe to a subforum, and via a voting system more popular
posts are placed more prominently on the platform. A subforum, also called a subreddit
on Reddit, consists of submissions (which are equivalent to threads in general forums)
and corresponding comments. Usually, the majority of entries consists of comments.
Due to its huge popularity, Reddit has been used for various research projects using
text mining methods [6, 7, 8]. Furthermore, Reddit’s open source roots and various
open source libraries for gathering data add to its popularity in research.
As subreddits we have chosen three of the most popular religious subreddits. The
first subreddit, /r/Christianity, is focused on discussions about Christian belief and
practice. It is the biggest of the subreddits with 202,242 subscribers. The Muslim
subreddit /r/Islam is smaller (82,404 subscribers), possibly due to the fact that Reddit
is more popular in English speaking countries. Next to those monotheistic religions,
we also look at an esoteric forum: /r/Occult. The subreddit describes itself as
“centered around discussion of the occult, mysticism, esoterica, metaphysics, and other
related topics” (149,379 subscribers; all subscription counts are as of September 2,
2019). Spiritual directions as discussed in this forum have become more popular,
especially in the Western world and among the youth. Therefore, it is not surprising
that this subreddit is larger than those of many world religions (e.g. Islam, Judaism,
Buddhism). Please note that we do not want to explore the religious convictions in
these online communities (which might be an interesting topic for religious studies)
but rather want to explore the possibilities of corpus analysis and distant reading. For
this purpose, the chosen subreddits are (1) large enough and (2) varied enough to
investigate the subreddits on their own but also to compare them with each other.
However, one limitation to keep in mind is the difference in size, with /r/Christianity
being much larger. We will focus on normalized results to avoid problems because
of these size differences.
To gather submissions, comments and metadata for a specific subreddit, we use the
Python Reddit API Wrapper (PRAW) library and save the data in JSON format in
a MongoDB database. All submissions and comments have been collected for a
timeframe of one year (from the 1st of July 2018 to the 1st of July 2019). It is
important to cover at least one year since religious communities and their
communication behavior might be influenced by specific holidays over the course of
the year.
2.2 Corpus Description
We have collected 115,556 submissions for all three subreddits. However, we
filtered out 74,162 submissions consisting of links only or lacking author information.
Posts with no author information (e.g. deleted authors) are not visible anymore on the
platform. Thus, 41,394 submissions remain. After extracting the comments of these
submissions, 759,992 comments remain, comprising more than 50 million tokens and
over 3.5 million sentences. Table 1 summarizes some of the general metrics of the
overall corpus and the specific subreddits after filtering noise.
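The filtering step described above can be sketched roughly as follows. This is only an illustration under assumptions: the paper does not show its filtering code, and the dictionary keys author and selftext as well as the "[deleted]" marker are hypothetical stand-ins for whatever fields the stored JSON documents actually contain.

```python
from typing import Dict, Iterable, List


def keep_submission(sub: Dict) -> bool:
    """Keep only text submissions that still have author information.

    The keys "author" and "selftext" are assumed field names, not taken
    from the paper.
    """
    has_author = sub.get("author") not in (None, "", "[deleted]")
    has_text = bool((sub.get("selftext") or "").strip())
    return has_author and has_text


def filter_submissions(subs: Iterable[Dict]) -> List[Dict]:
    """Drop link-only submissions and submissions without an author."""
    return [s for s in subs if keep_submission(s)]


# Hypothetical sample documents
sample = [
    {"id": "a1", "author": "user_one", "selftext": "A question about prayer."},
    {"id": "a2", "author": None, "selftext": "Text by a deleted account."},
    {"id": "a3", "author": "user_two", "selftext": ""},  # link-only post
]
kept = filter_submissions(sample)
print([s["id"] for s in kept])  # prints ['a1']
```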
Table 1. Corpus metrics
Table 2 illustrates some statistics about the lengths of submissions and comments.
Table 2. Comparison of post lengths (rows: sentences per submission, tokens per
sentence in submissions, sentences per comment, tokens per sentence in comments,
comments per submission)
Considering the post lengths, there are no notable differences between the subreddits.
The only striking difference concerns the Christian forum, which is much larger than
the other ones; its submissions also have a higher number of comments. However, the
overall sentence length hardly differs, since sentences are only slightly longer in the
Christian forum.
3 Analysis
In the following, we present results for various statistical parameters, starting with
word frequencies, followed by bigram frequencies, results for significant collocations
and sentiment and emotion analysis.
3.1 Word Frequencies
To gain insights about the subjects discussed and the overall language, we analyze the
most frequent words used in the subreddits. For the preprocessing, we have eliminated
stop words and lemmatized the tokens using the WordNet-Lemmatizer, which is a
general purpose solution for lemmatization often used for social media content [19,
20]. The following figures illustrate the top 10 most frequent words (MFWs) per
subreddit (Figures 1 to 3).
Fig. 1. MFWs in /r/Christianity
Fig. 2. MFWs in /r/Islam
Fig. 3. MFWs in /r/Occult
The results show that the word “god” is an important term in all three subreddits. It is
also notable that the word is used with a higher relative frequency in /r/Christianity
compared to the other forums. In /r/Islam it is notable that the words “muslim” and
“islam” are much more common than their equivalents “christian” and “christianity”
in /r/Christianity, suggesting these words are used differently in their respective
domain, or that more meta discussion takes place in /r/Islam. The lack of the word
“Mohammed” as one of the most frequent words in the Muslim forum is due to the
numerous different spellings of this name (which have not been unified in this study).
As a last observation on the top words, /r/Christianity is the only subreddit with a
word for an emotion, “love”, in the top ten words, while /r/Occult’s top ten words
uniquely feature two words relating to the senses, namely “feel” and “experience”.
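To make the idea of comparing relative rather than absolute word frequencies concrete, here is a minimal sketch. It is not the pipeline used in the study: the tiny stop word list, the regular-expression tokenizer and the normalization per 1,000 tokens are assumptions chosen for illustration, and the lemmatization step is left out to keep the example free of external resources.

```python
import re
from collections import Counter

# Toy stop word list (assumption; the study used a full stop word list)
STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to", "we"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(text, stop_words=STOP_WORDS, per=1000):
    """Return occurrences per `per` tokens for every non-stop word."""
    tokens = tokenize(text)
    if not tokens:
        return {}
    counts = Counter(t for t in tokens if t not in stop_words)
    return {word: count * per / len(tokens) for word, count in counts.items()}


freqs = relative_frequencies("God is love and the love of God is patient")
print(freqs)  # {'god': 200.0, 'love': 200.0, 'patient': 100.0}
```

Normalizing by corpus size in this way is what allows a word count from the much larger /r/Christianity sub-corpus to be compared with counts from the smaller forums.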
3.2 Bigram Frequencies
A bigram is defined as two tokens appearing next to each other. More than unigrams,
bigrams can give insights into the usage of more complex concepts. Figures 4 to 6
illustrate the 10 most frequent bigrams for each subreddit.
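As a small illustration of the definition above, bigrams can be counted by pairing every token with its successor. The token list is a made-up example, and the sketch ignores details such as sentence boundaries and stop word removal, which the paper does not specify for this step.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count all pairs of directly adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical, already preprocessed token sequence
tokens = "jesus christ said holy spirit and jesus christ".split()
counts = bigram_counts(tokens)
print(counts.most_common(1))  # [(('jesus', 'christ'), 2)]
```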
Fig. 4. Most Frequent Bigrams in /r/Christianity (bar labels: jesus christ, holy spirit,
god bless, catholic church, eternal life, king james, lord jesus, god created, christ
jesus, gay people)
Fig. 5. Most Frequent Bigrams in /r/Islam (bar labels: prophet muhammad, allah bless,
middle east, abu bakr, prophet peace, muslim countries, yasir qadhi, saudi arabia,
muslim community, allah guide)
Fig. 6. Most Frequent Bigrams in /r/Occult (bar labels: golden dawn, chaos magick,
chaos magic, astral projection, black magic, sleep paralysis, left hand, hand path,
lucid dreaming, tarot cards)
Both in the Christian and in the Muslim forum, specific named entities are the
most common bigrams, e.g. “Jesus Christ”, “holy spirit”, “lord Jesus” and “prophet
Muhammad”. One of the most frequent bigrams in the Christian subreddit is “gay
people”. In comparison, bigrams containing the word “gay” are rather rare in the other
forums. “Gay people” is ranked 44 for /r/Islam and there are no similar bigrams
found for /r/Occult, showing that this topic is not of interest for this specific
community. In the Christian subreddit, a specific edition of the bible is often referred
to (the King James Bible, the most important bible edition in the English speaking
world). For the Muslim forum, geographical and political concepts are dominant, e.g.
“middle east”, “muslim countries”, “saudi arabia”, as well as spiritual authorities
(“Yasir Qadhi”, “Abu Bakr”, “Ibn Taymiyyah”). Those findings are indeed in line with
results of topic modeling on a similar corpus [21]. /r/Occult’s top bigrams refer mostly
to esoteric concepts and practices, which is interesting since religious practices are
rarely discussed in the other forums.
3.3 Collocations
To gain a better understanding of some of the religious key concepts, we look at
the collocations for words representing those concepts. As a text window for
collocation analysis we choose five, meaning words can be a maximum of five positions
away to be regarded as collocations. The collocation strength was measured as
Pointwise Mutual Information (PMI), which scores the collocations based on their
actual co-occurrence in the corpus in proportion to their expected co-occurrence if
they were independent [4]. Because this can lead to high values for very
low-frequency collocations, a minimum frequency threshold was set for each
measurement. We visualize the collocations similar to [3]. The key word is placed in
the center while the surrounding words are those that are frequent enough in the
surroundings of the word according to the threshold. The lengths of the edges decrease
with higher PMI values, thus words that are more strongly associated with the key word
are placed closer to it. We focus our analysis on various important religious words like
god, death, life, love, experience or religion. In the following, we show the collocation
usage for the words god and death.
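The PMI-based scoring described above can be sketched as follows. This is a simplified illustration, not the implementation used in the study: probabilities are estimated as plain counts divided by the number of tokens, without any correction for the window size, and the function name and parameters (window, min_count) are chosen here for illustration. The score is PMI(x, y) = log2(P(x, y) / (P(x) P(y))).

```python
import math
from collections import Counter


def pmi_collocates(tokens, node, window=5, min_count=2):
    """Score words co-occurring with `node` within +/- `window` tokens.

    Collocates seen fewer than `min_count` times near the node word are
    dropped, because PMI overrates very rare co-occurrences.
    """
    n = len(tokens)
    freq = Counter(tokens)
    cooc = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start, end = max(0, i - window), min(n, i + window + 1)
        for j in range(start, end):
            if j != i:
                cooc[tokens[j]] += 1
    scores = {}
    for word, c in cooc.items():
        if c >= min_count:
            # P(x, y) = c / n, P(x) = freq[node] / n, P(y) = freq[word] / n
            scores[word] = math.log2(c * n / (freq[node] * freq[word]))
    return scores


# Hypothetical token sequence; a window of 1 keeps the arithmetic easy to follow
tokens = "god loves us god forgives us god loves all the sun rises".split()
scores = pmi_collocates(tokens, "god", window=1, min_count=2)
print(scores)  # {'loves': 2.0, 'us': 2.0}
```

In the toy example, "forgives" appears next to "god" only once and is therefore removed by the threshold, which mirrors the motivation for the minimum threshold given in the text.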
Fig. 7. Collocation visualizations for the word “god” in /r/Christianity
Fig. 8. Collocation visualizations for the word “god” in /r/Islam
Fig. 9. Collocation visualizations for the word “god” in /r/Occult
The collocations for the word “god” in /r/Christianity (see Figures 7–9) show some
archaic verb forms pointing to bible quotes (“giveth”, “commendeth”). In line with the
upcoming results of the sentiment analysis, positive characterizations are more frequent
(“forgives”, “loves”) than negative ones (“hates”, “punishing”). The same holds true for
the Muslim forum with words like “forgive”. Those positive collocations become
even more apparent when analyzing the word “Allah” instead of “god” (which is not
shown here). It is striking that the existence of god seems to be discussed much more
in the Muslim forum (“existence”, “exists”). Furthermore, the word “god” is probably
(also) used in the Muslim forum to refer to a specific Christian or Jewish god (“son”,
“Abraham”). For the occult forum, the multiple perspectives on god become very
clear. The word “god” is mostly surrounded by other words clarifying which god is
being discussed (“horned”, “Abrahamic”, “Egyptian”, “Christian”, “sun”). It is also
the only forum showing a rather negative perspective on god via the collocation
with “damn”. This might point to atheist or agnostic views.
The collocations for the concept death highlight the differences between the groups
even more clearly (see Figures 10–12).
Fig. 10. Collocation visualizations for the word “death” in /r/Christianity
Fig. 11. Collocation visualizations for the word “death” in /r/Islam
Fig. 12. Collocation visualizations for the word “death” in /r/Occult
In /r/Islam as well as /r/Christianity, strong collocations with the term “penalty” are
found. Death is much more frequently discussed in the Christian forum, thus more
collocations are identified. However, the collocations also point to the fact that death
plays a much more important role in the life and narration of Jesus, since we find a lot
of collocations in this context (“ascension”, “resurrection”, “jesus”). The collocations
with “angel” and “taste” in the Muslim subreddit refer to specific Quran passages. For
the occult forum, the esoteric and spiritual content becomes clear, since death is
strongly connected to words like “rebirth” and “ego”, pointing also to spiritual
concepts well-known in Buddhism.
3.4 Sentiment Analysis
Sentiment analysis means using computational methods for the analysis and prediction
of sentiments, mostly in written text [13]. Most of the time, the prediction goal
is whether the overall connotation of a text is negative, positive or neutral. This
concept is also often referred to as polarity. Typical areas for sentiment analysis are
product reviews but also social media [27]. In recent years, sentiment analysis has also
gained a lot of interest in Digital Humanities [15, 18, 22, 23, 24].
To explore sentiment analysis in our specific corpus, we use Vader, an open source
sentiment analysis library for Python. Vader outputs a polarity score for each
sentence, which allows for the classification of each sentence as positive, neutral or
negative. Although Vader employs lexicon-based methods for sentiment analysis, it has
been specifically developed for social media and shows very good evaluation results
on this type of content [9]. Table 3 shows the percentage of sentences classified with
a specific polarity class per subreddit.
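The mapping from a per-sentence polarity score to one of the three classes, and from there to the class ratios reported per subreddit, might look like the sketch below. The cut-off of ±0.05 on the compound score is the convention commonly used with Vader and is an assumption here, since the paper does not state its thresholds. The scores in the example are made-up values; in practice they would come from Vader, for instance via SentimentIntensityAnalyzer().polarity_scores(sentence)["compound"] in the vaderSentiment package.

```python
from collections import Counter

LABELS = ("positive", "neutral", "negative")


def polarity_class(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a polarity class.

    The threshold of 0.05 is an assumed convention, not taken from the paper.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def class_ratios(compounds):
    """Return the share of sentences per polarity class."""
    if not compounds:
        return {label: 0.0 for label in LABELS}
    counts = Counter(polarity_class(c) for c in compounds)
    return {label: counts[label] / len(compounds) for label in LABELS}


# Hypothetical compound scores for four sentences
sentence_scores = [0.64, 0.0, -0.42, 0.31]
ratios = class_ratios(sentence_scores)
print(ratios)  # {'positive': 0.5, 'neutral': 0.25, 'negative': 0.25}
```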
Table 3. Ratio of Sentences Classified with a Polarity Class
While the sentiments expressed are rather similar, it is noticeable that /r/Christianity
has the lowest ratio of neutrality and is thus more polarized than the other forums.
/r/Occult has the lowest ratio of negative sentences, which might be because there are
fewer negative topics like “sin” and “hell” discussed in this subreddit. Overall, it is
rather striking that positivity dominates all subreddits. Please note, however, that our
findings are purely descriptive at the moment and we apply no significance tests.
The computational method of emotion analysis is closely related to sentiment analysis.
The goal, however, is to analyze and predict more complex emotions instead of
the simple polarity of a text. For our analysis, we use the NRC Emotion Lexicon [16],
a general purpose sentiment and emotion lexicon. It consists of around 14,000 words
and their associations with a set of emotions (anger, anticipation, disgust, fear, joy,
sadness, surprise, trust) but also with a polarity category (positive, negative). Words
can be associated with one or more of those emotions and polarity categories. By
counting the number of words associated with emotions, one can investigate the
emotionalization of the language used. However, please note that this lexicon, unlike
Vader, is not optimized for social media language and is also not as sophisticated, as
Vader also accounts for negations and valence shifters. The following graph
illustrates the percentages of every category for each subreddit (see Figure 13):
Fig. 13. Percentage of words associated with emotions per subreddit
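The counting approach behind these percentages can be illustrated with a short sketch. The lexicon below is a toy excerpt restricted to a few word-category associations mentioned in this paper; it is not the actual NRC Emotion Lexicon file, and taking the percentage relative to all tokens is an assumption about how the shares were computed.

```python
from collections import Counter

# Toy excerpt (assumption): only associations named in the text of this paper
TOY_LEXICON = {
    "god": {"trust", "joy", "fear", "positive"},
    "love": {"joy"},
    "pray": {"trust", "joy", "fear"},
    "sin": {"fear"},
}


def emotion_percentages(tokens, lexicon):
    """Return, per category, the percentage of tokens associated with it."""
    if not tokens:
        return {}
    counts = Counter()
    for tok in tokens:
        for category in lexicon.get(tok, ()):
            counts[category] += 1
    return {cat: 100.0 * c / len(tokens) for cat, c in counts.items()}


# Hypothetical ten-token sentence
tokens = "god is love and love casts out fear of sin".split()
shares = emotion_percentages(tokens, TOY_LEXICON)
print(sorted(shares.items()))
# [('fear', 20.0), ('joy', 30.0), ('positive', 10.0), ('trust', 10.0)]
```

Note that the token "fear" itself contributes nothing in this example because it is not part of the toy excerpt, which shows how strongly lexicon coverage shapes the result.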
Like the results of the sentiment analysis, most emotions are much more frequent in
the Christian forum than in the others. This especially applies to the categories
anticipation, fear, joy and trust. Similar to the results concerning Vader, we could
also identify this effect for the two polarity categories positive and negative. This
suggests that the discussions in /r/Christianity are more emotionally charged. We also
investigated which specific words of the NRC emotion lexicon lead to these results:
Top words in the /r/Christianity subreddit with a trust connotation include “god”,
“church”, “faith” and “pray”, words that were found to be especially frequent in this
sub-corpus. All of these, as well as “love”, are furthermore associated with joy.
However, vocabulary from Abrahamic religions is also often associated with negative
emotions. “God”, for example, is also associated with fear (its polarity is positive,
however), as are “sin”, “pray” and “worship”, words most frequently found in /r/Islam
and /r/Christianity. Concerning negativity and negative emotions, /r/Occult is much
closer to the other forums. The reason for this is the negatively connoted and frequent
words “occult”, “demon”, “black” and “chaos” (as commonly appearing in “black
magic” and “chaos magic”).
4 Discussion
Via various methods of computational text analysis, we were able to gather some
interesting insights concerning the topics that are talked about and the sentiments and
emotions expressed. However, in the following we want to reflect upon the benefits
and the limitations of the methods chosen:
N-gram frequencies give a compact and easy-to-understand overview of the key
concepts and topics that are discussed in the forums. The bigrams were more insightful
than the unigrams, showing some more general differences like the focus on politics
and authorities in the Muslim forum and the focus on practices for the occult forum.
The analysis of word frequencies also proved to be very helpful for the interpretation
of more advanced methods like the collocation and sentiment/emotion analysis.
However, comparisons are limited with this method, since similar concepts are often
referred to differently (e.g. “God” vs. “Allah”) and depend on the specific vocabulary
of a group. We also want to pursue methods to identify keywords that are specific
for a sub-corpus, e.g. using tf-idf weighting or comparative ranked lists.
The collocation analysis and visualizations proved to be of the most interest for
us. By focusing on specific words that represent important concepts, we were able to
find interesting differences in the contexts of those words. However, to correctly
interpret the data, in-depth knowledge about the religions is necessary, e.g. to
identify quotes from the scriptures. Comparisons are easier, since different words for
the same concepts can be easily identified in the surroundings of a centered word. We
recommend investigating collocation analysis for similar future work. We also plan to
explore the possibility of constructing a word embeddings model using our corpus to
analyze word associations.
The sentiment and emotion analysis illustrates some interesting results concerning
higher levels of emotional language for the Christian and Muslim forum. Although
these findings are of interest, they should be validated by more in-depth analysis,
since for now we can only speculate about the reasons for this result. We plan to
analyze the most extreme manifestations of comments concerning the emotional values
to gain more insights. Furthermore, we also want to precisely evaluate the performance
of the sentiment analysis approaches, since they have proven rather problematic in
other areas of Digital Humanities [22]. One obvious problem is the lack of an emotion
lexicon which is specifically designed for the language used on Reddit or other social
media platforms.
Finally, there are several limitations of our study one should keep in mind when
interpreting the data. As already mentioned, the subreddits differ considerably in
size. We focused on the analysis of normalized data to avoid skewed results because
of these size differences. The reason for this disproportion might very well be the
English language. /r/Islam is very likely primarily used by Muslims living in Europe
and America, who are of course a minority compared to Christians in those countries.
Furthermore, research has shown that Reddit is predominantly used by young American
males. Therefore, we want to point out that we cannot make any statements about the
religious communities in general but only about this limited user group of Reddit, and
also just for the specific year we regarded. Nevertheless, we plan to explore distant
reading methods to analyze religious groups on social media and improve our
research by increasing the corpora and investigating other social media channels. We
also want to examine other methods like stylometry, topic modeling, and named entity
recognition to evaluate how religious studies and sociology can benefit from those
methods.
References
1. Arnold, T., Tilton, L.: Distant viewing: analyzing large visual corpora. Digital Scholarship
in the Humanities 34(Supplement 1), i3–i16 (2019)
2. Badawy, A., Ferrara, E.: The rise of jihadist propaganda on social networks. Journal of
Computational Social Science 1(2), 453–470 (2018)
3. Brezina, V., McEnery, T., Wattam, S.: Collocations in context: A new perspective on
collocation networks. International Journal of Corpus Linguistics 20(2), 139–173 (2015)
4. Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography.
Computational Linguistics 16(1), 22–29 (1990)
5. Cook, N.: Beyond the score: Music as performance. Oxford University Press (2013)
6. De Choudhury, M., De, S.: Mental health discourse on reddit: Self-disclosure, social
support, and anonymity. In: Eighth international AAAI conference on weblogs and social
media (2014)
7. Grover, T., Mark, G.: Detecting potential warning behaviors of ideological radicalization
in an alt-right subreddit. In: Proceedings of the International AAAI Conference on Web
and Social Media. vol. 13, pp. 193–204 (2019)
8. Guimaraes, A., Balalau, O., Terolli, E., Weikum, G.: Analyzing the traits and anomalies of
political discussions on reddit. In: Proceedings of the International AAAI Conference on
Web and Social Media. vol. 13, pp. 205–213 (2019)
9. Hutto, C.J., Gilbert, E.: Vader: A parsimonious rule-based model for sentiment analysis of
social media text. In: Eighth international AAAI conference on weblogs and social media
(2014)
10. Jänicke, S., Franzini, G., Cheema, M.F., Scheuermann, G.: On close and distant reading in
digital humanities: A survey and future challenges. In: EuroVis (STARs). pp. 83–103
(2015)
11. Karami, A., Bennett, L.S., He, X.: Mining public opinion about economic issues: Twitter
and the us presidential election. International Journal of Strategic Decision Sciences
(IJSDS) 9(1), 18–28 (2018)
12. Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A.L., Brewer, D., Christakis, N.,
Contractor, N., Fowler, J., Gutmann, M., et al.: Computational social science. Science
323(5915), 721–723 (2009)
13. Liu, B.: Sentiment analysis: Mining opinions, sentiments, and emotions. Cambridge
University Press (2016)
14. McDonald, D.: A text mining analysis of religious texts. The Journal of Business Inquiry
13(1), 27–47 (2014)
15. Mohammad, S.: From once upon a time to happily ever after: Tracking emotions in novels
and fairy tales. In: Proceedings of the 5th ACL-HLT Workshop on Language Technology
for Cultural Heritage, Social Sciences, and Humanities. pp. 105–114. Association for
Computational Linguistics (2011)
16. Mohammad, S.M., Turney, P.D.: Crowdsourcing a word–emotion association lexicon.
Computational Intelligence 29(3), 436–465 (2013)
17. Moretti, F.: Conjectures on world literature. New Left Review pp. 54–68 (2000)
18. Nalisnick, E.T., Baird, H.S.: Character-to-character sentiment analysis in Shakespeare's
plays. In: Proceedings of the 51st Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers). vol. 2, pp. 479–483 (2013)
19. Oyebode, O., Orji, R.: Social media and sentiment analysis: The nigeria presidential
election 2019. In: 2019 IEEE 10th Annual Information Technology, Electronics and Mobile
Communication Conference (IEMCON). pp. 0140–0146. IEEE (2019)
20. Pandarachalil, R., Sendhilkumar, S., Mahalakshmi, G.: Twitter sentiment analysis for
large-scale data: an unsupervised approach. Cognitive Computation 7(2), 254–262 (2015)
21. Pfahler, L., Elwert, F., Tabti, S., Morik, K., Krech, V.: What do you do with 5 million
posts? versuche zum distant reading religiöser online-foren. In: Vogeler, G. (ed.) Book of
Abstracts, DHd 2018. pp. 335–338. Cologne, Germany (2018)
22. Schmidt, T., Burghardt, M.: An evaluation of lexicon-based sentiment analysis techniques
for the plays of Gotthold Ephraim Lessing. In: Proceedings of the Second Joint
SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences,
Humanities and Literature. pp. 139–149. Association for Computational Linguistics
(2018)
23. Schmidt, T., Burghardt, M., Dennerlein, K., Wolff, C.: Sentiment annotation in Lessing's
plays: Towards a language resource for sentiment analysis on german literary texts. In: 2nd
Conference on Language, Data and Knowledge (LDK 2019) (2019)
24. Schmidt, T., Burghardt, M., Wolff, C.: Toward multimodal sentiment analysis of historic
plays: A case study with text and audio for Lessing's Emilia Galotti. In: Proceedings of the
Digital Humanities in the Nordic Countries Conference 2019 (DHN 2019). pp. 405–414
(2019)
25. Slingerland, E., Nichols, R., Neilbo, K., Logan, C.: The distant reading of religious texts:
A “big data” approach to mind-body concepts in early china. Journal of the American
Academy of Religion 85(4), 985–1016 (2017)
26. Verma, M.: Lexical analysis of religious texts using text mining and machine learning
tools. International Journal of Computer Applications 168
27. Vinodhini, G., Chandrasekaran, R.: Sentiment analysis and opinion mining: a survey.
International Journal 2(6), 282–292 (2012)
... 2 Related Work First, we shortly describe the main text mining methods we apply in this study: Sentiment analysis is the computational method to analyze the sentiment expressed towards entities, mostly in written text (Liu, 2016). The main application area of sentiment analysis is user generated content on the web like social media (Hutto and Gilbert, 2014;Schmidt et al., 2020b) or movie reviews (Kennedy and Inkpen, 2006). Next to sophisticated machine learning approaches, there are also rule-based approaches working with lexical resources and simple rules (Taboada et al., 2011;Schmidt and Burghardt, 2018). ...
... Sentiment analysis was conducted by using VADER (Hutto and Gilbert, 2014). VADER is a lexiconbased sentiment analysis tool that is specifically attuned to sentiments expressed in social media and is therefore also used for Reddit (Schmidt et al., 2020b). VADER shows good evaluation results for thie context of social media (Hutto and Gilbert, 2014). ...
Conference Paper
Full-text available
We present a study employing various techniques of text mining to explore and compare two different online forums focusing on depression: (1) the subreddit r/depression (over 60 million tokens), a large, open social media platform and (2) Beyond Blue (almost 5 million tokens), a professionally curated and moderated depression forum from Australia. We are interested in how the language and the content on these platforms differ from each other. We scrape both forums for a specific period. Next to general methods of computational text analysis, we focus on sentiment analysis, topic modeling and the distribution of word categories to analyze these forums. Our results indicate that Beyond Blue is generally more positive and that the users are more supportive to each other. Topic modeling shows that Beyond Blue's users talk more about adult topics like finance and work while topics shaped by school or college terms are more prevalent on r/depression. Based on our findings we hypothesize that the professional curation and moderation of a depression forum is beneficial for the discussion in it.
... Its neighbouring field, sentiment analysis, primarily focuses on the prediction of the overall polarity (or valence) of a text, that is, whether it is rather positive or negative (Mäntylä et al., 2018). Both methods have been explored in DH and CLS to analyze emotion/sentiment distributions and progressions in social media (Schmidt et al., 2020b) or literary texts like plays (Nalisnick and Baird, 2013; Schmidt et al., 2019b; Schmidt, 2019), novels (Zehe et al., 2016; Reagan et al., 2016) and fairy tales (Alm and Sproat, 2005; Mohammad, 2011) (see Kim and Klinger (2019) for an in-depth review of this research area). However, as the review of Kim and Klinger (2019) and recent tool developments in DH show (Schmidt et al., 2021a), the application of rather basic lexicon-based methods is frequent although these methods are usually outperformed by more modern approaches in sentiment and emotion classification (Cao et al., 2020; Dang et al., 2020; Cortiz, 2021; González-Carvajal and Garrido-Merchán, 2021) and are especially problematic for literary texts (Fehle et al., 2021). ...
Conference Paper
We present results of a project on emotion classification on historical German plays of Enlightenment, Storm and Stress, and German Classicism. We have developed a hierarchical annotation scheme consisting of 13 sub-emotions like suffering, love and joy that sum up to 6 main and 2 polarity classes (positive/negative). We have conducted textual annotations on 11 German plays and have acquired over 13,000 emotion annotations by two annotators per play. We have evaluated multiple traditional machine learning approaches as well as transformer-based models pretrained on historical and contemporary language for a single-label text sequence emotion classification for the different emotion categories. The evaluation is carried out on three different instances of the corpus: (1) taking all annotations, (2) filtering overlapping annotations by annotators, (3) applying a heuristic for speech-based analysis. Best results are achieved on the filtered corpus with the best models being large transformer-based models pretrained on contemporary German language. For the polarity classification, accuracies of up to 90% are achieved. The accuracies become lower for settings with a higher number of classes, achieving 66% for 13 sub-emotions. Further pretraining of a historical model with a corpus of dramatic texts led to no improvements.
... Emotion recognition is a sub-field of affective computing (cf. Halbhuber et al. 2019; Hartl et al. 2019; Ortloff et al. 2019; Schmidt et al. 2020c) and is often applied in DH to predict sentiment and emotions from written text (Moßburger et al. 2020; Schmidt / Burghardt 2018; Schmidt, 2019; Schmidt et al. 2019a; Schmidt et al. 2020b). We focus on the image channel of movies and for the emotion prediction we use the Python module FER 2 (Goodfellow et al. 2013). ...
Conference Paper
We present an exploratory study in the context of digital film analysis inspecting and comparing five canonical movies by applying methods of computer vision. We extract one frame per second of each movie which we regard as our sample. As computer vision methods we explore image-based object detection, emotion recognition, gender and age detection with state-of-the-art models. We were able to identify significant differences between the movies for all methods. We present our results and discuss the limitations and benefits of each method. We close by formulating future research questions we plan to answer by applying and optimizing the methods.
... Online media and content have gained a lot of interest in Digital Humanities (DH) in recent years (e.g. Moßburger et al. 2020; Schmidt et al. 2020a; Schmidt et al. 2020c). In the context of literary studies, the analysis of online creative writing platforms has gained more and more ...
Conference Paper
We report upon a digital humanities project on the acquisition and analysis of a corpus of German online writings. We have implemented a scraper to gather the German language material as well as corresponding metadata of the popular online writing platform Archive of Our Own (AO3), which is a platform primarily focused on fan fiction as a text type. The corpus consists of 9,640 writings resulting in over 39 million tokens and 3.6 million sentences. The texts have varying lengths with a median of around 2,500 tokens per story. We present results on the analysis of metadata and general text statistics like the most frequent words. While we can support previous findings of literary and media studies like the dominance of male-male romantic and erotic narratives, we can also identify attributes that are very specific and unique to German culture as well as differences from results of research on English online writings. We outline how we plan to further expand and analyze the corpus in future work to support research in digital humanities as well as German literary and fan studies.
... Sentiment and emotion analysis have been explored in the context of DH. Researchers explore sentiment analysis in various literary genres like plays [27,30,39,38,40,52], novels [20,23,36], fairy tales [1,2] and fan fictions [21,35] but also in the social media context [44,46]. While the focus of research is predominantly on text, esp. ...
Movies in Digital Humanities are often enriched with information by annotating the text, e.g. via subtitles. However, we hypothesize that the missing presentation of the multimedia content is disadvantageous for certain annotation types like sentiment annotation. We claim that performing the annotation live during the viewing of the movie is beneficial for the annotation process. We present and evaluate the first version of a novel approach and prototype to perform live sentiment annotation of movies while watching them. The prototype consists of an Arduino microcontroller and a potentiometer which is paired with a slider. We perform an annotation study for five movies receiving sentiment annotations from three annotators each, once via live annotation and once via traditional subtitle annotation to compare the approaches. While the agreement among annotators increases slightly by using live sentiment annotation, the overall experience and subjective effort measured by quantitative post questionnaires improve significantly. The qualitative analysis of post annotation interviews validates these findings.
... Sentiment analysis (or opinion mining) is a term used to describe computational methods for predicting and analyzing sentiment, predominantly in written text (Liu, 2016, p. 1). Sentiment analysis is especially popular for social media content (Moßburger et al., 2020;Schmidt, Hartl, Ramsauer, Fischer, Hilzenthaler, & Wolff, 2020;Schmidt, Kaindl, & Wolff, 2020) and any other form of user generated content (cf. Mäntylä et al., 2018). ...
Conference Paper
We present SentText, a web-based tool to perform and explore lexicon-based sentiment analysis on texts, specifically developed for the Digital Humanities (DH) community. The tool was developed integrating ideas of the user-centered design process and we gathered requirements via semi-structured interviews. The tool offers the functionality to perform sentiment analysis with predefined sentiment lexicons or self-adjusted lexicons. Users can explore results of sentiment analysis via various visualizations like bar or pie charts and word clouds. It is also possible to analyze and compare collections of documents. Furthermore, we have added a close reading function enabling researchers to examine the applicability of sentiment lexicons for specific text sorts. We report upon the first usability tests with positive results. We argue that the tool is beneficial to explore lexicon-based sentiment analysis in the DH but can also be integrated in DH-teaching.
... Both methods have gained a lot of interest in Digital Humanities (DH) and Computational Literary Studies (CLS) (cf. [9]) and are applied to analyze emotions and sentiment in historical plays [12,17,23,25,26,27,29,40], novels [6,12,21], fairy tales [1,12], political texts [38], or online forums [14,35]. DH projects also explore more modern literary genres like fan fictions [8,7], original creative works on the web [19], subtitles of movies [5,42] or song lyrics [24]. ...
Conference Paper
In this paper, we present first work-in-progress annotation results of a project investigating computational methods of emotion analysis for historical German plays around 1800. We report on the development of an annotation scheme focussing on the annotation of emotions that are important from a literary studies perspective for this time span as well as on the annotation process we have developed. We annotate emotions expressed or attributed by characters of the plays in the written texts. The scheme consists of 13 hierarchically structured emotion concepts as well as the source (who experiences or attributes the emotion) and target (who or what is the emotion directed towards). We have conducted the annotation of five example plays of our corpus with two annotators per play and report on annotation distributions and agreement statistics. We were able to collect over 6,500 emotion annotations and identified a fair agreement for most concepts around a κ-value of 0.4. We discuss how we plan to improve annotator consistency and continue our work. The results also have implications for similar projects in the context of Digital Humanities.
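The agreement figure in the abstract above (a κ-value of around 0.4) can be made concrete with a small worked sketch. The abstract does not state which κ variant was used; Cohen's kappa for two annotators is assumed here, and the function name `cohens_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four text passages by two annotators.
annotator_1 = ["joy", "joy", "love", "love"]
annotator_2 = ["joy", "love", "love", "love"]
print(cohens_kappa(annotator_1, annotator_2))
```

In this toy case the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and κ = (0.75 − 0.5) / (1 − 0.5) = 0.5, which shows why κ is lower than the raw agreement rate.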
Conference Paper
We present first results of an ongoing research project on sentiment annotation of historical plays by German playwright G. E. Lessing (1729-1781). For a subset of speeches from six of his most famous plays, we gathered sentiment annotations by two independent annotators for each play. The annotators were nine students from a Master's program of German Literature. Overall, we gathered annotations for 1,183 speeches. We report sentiment distributions and agreement metrics and put the results in the context of current research. A preliminary version of the annotated corpus of speeches is publicly available online and can be used for further investigations, evaluations and computational sentiment analysis approaches.
Conference Paper
Social media has become an inevitable tool in many sectors including politics. On February 23, Africa's largest economy and most populous country, Nigeria, conducts its presidential elections. Many Nigerians used social media to express their opinion in favour of or against the various presidential candidates. Research has shown that their shared sentiments can influence the opinions of others and hence who eventually wins the presidential election. This paper therefore aims to identify and analyze public sentiments towards two popular candidates with the aim of determining their chances of being elected into the highest position of authority in Nigeria based on social media comments. First, we perform sentiment analysis on election-related posts from Nairaland (a social network targeted at Nigerians) using lexicon-based and supervised machine learning (ML) techniques with the aim of detecting their sentiment polarity (i.e. negative or positive). We collected 118,421 posts between January 1 and February 22, 2019. Second, we implemented and compared the performance of three lexicon-based classifiers and five ML-based classifiers. The best performing classifier is then used in determining the sentiment polarity of posts. Third, we conducted thematic analysis on both positive and negative posts to further understand and reveal public opinions about each candidate. Finally, we discuss our analytical findings and the possibility of a candidate receiving more votes than the other. Our findings relate considerably to the actual election results released by the Independent National Electoral Commission (INEC).
Conference Paper
We present results from a project on sentiment analysis of drama texts, more concretely the plays of Gotthold Ephraim Lessing. We conducted an annotation study to create a gold standard for a systematic evaluation. The gold standard consists of 200 speeches of Lessing's plays and was manually annotated with sentiment information by five annotators. We use the gold standard data to evaluate the performance of different German sentiment lexicons and processing configurations like lemmatization, the extension of lexicons with historical linguistic variants, and stop word elimination, to explore the influence of these parameters and to find best practices for our domain of application. The best performing configuration accomplishes an accuracy of 70%. We discuss the problems and challenges for sentiment analysis in this area and describe our next steps toward further research.
Conference Paper
We present a case study as part of a work-in-progress project about multimodal sentiment analysis on historic German plays, taking Emilia Galotti by G. E. Lessing as our initial use case. We analyze the textual version and an audio version (audiobook). We focus on ready-to-use sentiment analysis methods: For the textual component, we implement a naive lexicon-based approach and another approach that enhances the lexicon by means of several NLP methods. For the audio analysis, we use the free version of the Vokaturi tool. We compare the results of all approaches and evaluate them against the annotations of a human expert, which serves as a gold standard. For our use case, we can show that audio and text sentiment analysis behave very differently: textual sentiment analysis tends to predict sentiment as rather negative and audio sentiment as rather positive. Compared to the gold standard, the textual sentiment analysis achieves accuracies of 56% while the accuracy for audio sentiment analysis is only 32%. We discuss possible reasons for these mediocre results and give an outlook on further steps we want to pursue in the context of multimodal sentiment analysis on historic plays.
Opinion polls have been the bridge between public opinion and politicians in elections. However, developing surveys to disclose people's feedback with respect to economic issues is limited, expensive, and time-consuming. In recent years, social media such as Twitter has enabled people to share their opinions regarding elections. Social media has provided a platform for collecting a large amount of social media data. This paper proposes a computational public opinion mining approach to explore the discussion of economic issues in social media during an election. Current related studies use text mining methods independently for election analysis and election prediction; this research combines two text mining methods: sentiment analysis and topic modeling. The proposed approach has effectively been deployed on millions of tweets to analyze economic concerns of people during the 2012 US presidential election.
This paper presents a text mining approach to compare and to explore the similarities and the differences between various religious texts using POS Tagging and Term Document Matrix. Automated text mining and machine learning tools have been used for lexical analysis of ten world-famous religious texts: the Holy Bible, the Dhammapada, the Tao Te Ching, the Bhagwad Gita, the Guru Granth Sahib, the Agama, the Quran, the Rig Veda, the Sarbachan and the Torah. The extracted noun categories were used as features to explore some interesting relationships between these religions and ideas that have emerged in different religions from different geographic regions.
In this article we establish a methodological and theoretical framework for the study of large collections of visual materials. Our framework, distant viewing, is distinguished from other approaches by making explicit the interpretive nature of extracting semantic metadata from images. In other words, one must ‘view’ visual materials before studying them. We illustrate the need for the interpretive process of viewing by simultaneously drawing on theories of visual semiotics, photography, and computer vision. Two illustrative applications of the distant viewing framework to our own research are drawn upon to explicate the potential and breadth of the approach. A study of television series shows how facial detection is used to compare the role of actors within the narrative arcs across two competing series. An analysis of the Farm Security Administration–Office of War Information corpus of documentary photography is used to establish how photographic style compared and differed amongst those photographers involved with the collection. We then aim to show how our framework engages with current methodological and theoretical conversations occurring within the digital humanities.
This article focuses on the debate about mind-body concepts in early China to demonstrate the usefulness of large-scale, automated textual analysis techniques for scholars of religion. As previous scholarship has argued, traditional, "close" textual reading, as well as more recent, human coder-based analyses, of early Chinese texts have called into question the "strong" holist position, or the claim that the early Chinese made no qualitative distinction between mind and body. In a series of follow-up studies, we show how three different machine-based techniques - word collocation, hierarchical clustering, and topic modeling analysis - provide convergent evidence that the authors of early Chinese texts viewed the mind-body relationship as unique or problematic. We conclude with reflections on the advantages of adding "distant reading" techniques to the methodological arsenal of scholars of religion, as a supplement and aid to traditional, close reading.
Using a dataset of over 1.9 million messages posted on Twitter by about 25,000 ISIS members, we explore how ISIS makes use of social media to spread its propaganda and to recruit militants from the Arab world and across the globe. By distinguishing between violence-driven, theological, and sectarian content, we trace the connection between online rhetoric and key events on the ground. To the best of our knowledge, ours is one of the first studies to focus on Arabic content, while most literature focuses on English content. Our findings yield important new insights about how social media is used by radical militant groups to target the Arabic-speaking world, and reveal important patterns in their propaganda efforts.