Psychological Text Analysis in the Digital Humanities
Ryan L. Boyd
Abstract In the digital humanities, it has been particularly difficult to establish
the psychological properties of a person or group of people in an objective,
reliable manner. Traditionally, the attempt to understand an author’s psychological
makeup has been primarily (if not exclusively) accomplished through subjective
interpretation, qualitative analysis, and speculation. In the world of empirical psy-
chological research, however, the past two decades have witnessed an explosion of
computerized language analysis techniques that objectively measure psychological
features of the individual. Indeed, by using modern text analysis methods, it is now
possible to quickly and accurately extract information about people—personalities,
individual differences, social processes, and even their mental health—all through
the words that people write and speak. This chapter serves as a primer for
researchers interested in learning about how language can provide powerful insights
into the minds of others via well-established and easy-to-use psychometric methods.
First, this chapter provides a general background on language analysis in the field
of psychology, followed by an introduction to modern methods and developments
within the field of psychological text analysis. Finally, a solid foundation to
psychological text analysis is provided in the form of an overview of research
spanning hundreds of studies from labs all over the world.
Psychological Text Analysis in the Digital Humanities
The digital revolution has transformed the way that we interact with virtually
everything. Televisions have built-in Wi-Fi cards to stream media from sources all
over the world. New refrigerator models can take pictures of their own contents with
a digital camera and send them to your mobile phone. You can learn about world
events, connect with your friends, and buy groceries with a few lazy swipes of the
finger before you even get out of bed. For many of us, the technological shift has
happened so quickly that we are still diligently attempting to understand just how
R.L. Boyd ()
Department of Psychology, The University of Texas at Austin, 108 E. Dean Keeton St., Stop
A8000, Austin, TX 78712-1043, USA
© Springer International Publishing AG 2017
S. Hai-Jew (ed.), Data Analytics in Digital Humanities, Multimedia Systems
and Applications, DOI 10.1007/978-3-319-54499-1_7
drastically and diversely our world has changed in such a short amount of time. For
many of the science fiction fans out there, these changes have not happened nearly
fast enough (“Where are those hoverboards, anyway?”).
One of the most obvious ways that advances in technology have impacted our
daily lives is that we interact with information in completely new, unprecedented
ways. This is particularly clear when considering how we both create and consume
information. We are awash in human-created data 24/7, which can be seen as a great
liberation of information or a scourge to quiet minds, depending on your point of
view. People on both sides of the fence, however, generally agree that maximizing
the benefit of our new abundance of information requires new ways of optimizing,
managing, and processing data.
A New World of Research
Rather than create a great divide between fields of inquiry, the modern technological
boom has instead helped to bring disparate areas of study closer together. The
introduction of computational techniques has been particularly useful in the social
sciences during this era of information. In the field of Psychology, we are primarily
interested in the fundamentals of observable human thought, behaviors, and experi-
ence. Traditionally, psychology researchers have had to invite people into laboratory
settings and prod them with stimuli and experimental conditions, hoping to coax
real-world psychological processes out of them in a highly tedious, controlled
manner. Alternatively, we could engage in painstaking field research by grabbing
our pencils and paper, venturing outside, and carefully documenting what we could
observe and survey from the population. These techniques are still incredibly useful
for (and, in fact, central to) the careful, empirical study of psychological phenomena.
However, with the proliferation of the internet and other digital technologies, we are
finding not only new opportunities to study ourselves, but new methods to do so.
One of the most fascinating psychological methods to arise from the merging
of disciplines has been the automated analysis of language. To many, this phrase
may conjure thoughts of intricate, computer-generated linguistic diagrams or highly
complex speech recognition programs. These are certainly triumphs of technology
and deserve praise in their own right. Potentially even more fascinating to readers
of this volume, however, are the scientific studies that have found that the common
words a person writes, speaks, and types can be deeply revealing of their underlying
psychology. Imagine a world where a few clicks on a computer can provide
profound insights into whether a person is being authentic, how someone orients
themselves socially, or even if a person is suicidal. You might be surprised to
find that you are already living in this world. Decades of rigorous, empirical
psychological research have found profound links between both the content and
style of a person’s language and how they think, behave, and feel. The merging of
psychology and computer science has been a driving force behind such discoveries.
The goal of this chapter is to introduce to you three broad ideas from psy-
chological research: (1) language analysis in the field of psychology, (2) two
well-established techniques of automated psychological language analysis, and (3)
an overview of psychological research that has successfully used automated text
analysis. Some of the research that is introduced in this chapter has already been
explored in the context of the Digital Humanities; however, the opportunities for
future work are essentially limitless.
Importantly, the psychological analysis of text is conceptually quite different
from approaches often referred to as “distant reading” (e.g., Moretti 2013;
Ross 2014) and other natural language processing methodologies.
The empirical methods and research discussed in this chapter are not necessarily
about understanding texts themselves, modeling language structure, or capturing
relationships between written works. Instead, these methods are emphatically
psychological in their nature and applications. The analysis of language from
this perspective allows us to understand the individual behind a given text—their
motivations, preoccupations, emotional states, and other facets of their mental
world.
Language Analysis in Psychology
Language has long been a source of intrigue in the field of Psychology. Dewey
(1910) broadly philosophized about language as a vehicle for thought, and Freud
(e.g., Freud 1891) theorized that people’s words could provide insights into their
emotions and hidden, deep-seated motives. While many of the specific theories
from the psychodynamic movement have been modified or altogether abandoned,
the overarching ideas about links between language and psychology took hold in
empirical psychological research. In the middle of the twentieth century, researchers
began codifying word clusters into “dictionaries” that could be used to assess
people’s needs for things like affiliation, achievement, and power (e.g., Lasswell
et al. 1952). For example, people who used relatively high numbers of words like
“win,” “success,” and “goal” were thought to be generally motivated by a striving
for accomplishment. In fact, several lines of research often found such predictions
to be true (e.g., McClelland et al. 1953).
The dictionary-based approach to measuring meaningful psychological pro-
cesses continued to develop as mainframe computers were introduced in the 1960s.
The focus of psychological text analysis swayed between studying culture and mass
media versus individual motivations (e.g., Stone et al. 1966), but eventually returned
to a more idiographic approach that focused on clinically relevant topics such as
anxiety, schizophrenia, and coping in therapy (e.g., Martindale 1975; Mergenthaler
1996; Weintraub 1989).
A considerable drawback to most of the automated text analysis systems
designed in the latter half of the twentieth century is that they were often wholly
opaque. Many systems involved complicated rulesets that were difficult to interpret,
idiosyncratic systems that prioritized certain words over others, or dictionaries that
were rooted in highly specific theories of psychology that did not extend well outside
of specific use-cases. A system of analyzing texts for a specific neo-Freudian process
had little use when researching extraversion, for example, and an obtuse system
that involved weighting various words made it difficult to evaluate the accuracy
of a language-based measure for something like self-esteem from a psychological
perspective. In effect, the ambitiousness of such systems proved to be a major
hindrance and point of fault rather than providing accessible techniques that could
be used to study human psychology broadly.
Linguistic Inquiry and Word Count
The impact of realizing that language can be quantified to reveal clues about a
person’s underlying psychology is difficult to overstate. The more that a person
used words from certain, verified categories of language (such as words about anger
or family), the more these concepts appeared to be a central dimension of that
person’s psychology.1 However, in the late 1990s, psychologists were starting to
discover that there was more to language than its content. Up to that point, virtually
every text analysis technique in psychology focused on what we call content words
(also referred to as “open class” words because they can take many different forms,
such as “run,” “running,” and “runs”). Content words convey some kind of meaning
(e.g., who, what, where, etc.) and constitute the majority of the words in a person’s
vocabulary. Consider the sentence:
Natalie drove to the store and bought some coffee.
The content words in this sentence give us the information of who (Natalie), what
(drove, bought, coffee), and where (store). By looking only at the content words in
the above sentence, we can get a fairly good sense of what is being said.
But what about the other words, like “to,” “the,” “and,” and “some”? These have
traditionally been treated as garbage words that convey no useful information—such
words were thought to exist merely as an artifact of language that joins together the
“important” words. If we look just at the grammatical words in the above sentence
(“to the and some”), we get no clues as to what the sentence is about, what is being
conveyed, etc.—virtually no useful information at all.
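The content/function split in the example sentence can be illustrated with a few lines of code (a toy sketch only; the function-word list here is hand-picked for this one sentence, not a real stoplist):

```python
# Toy illustration: separating content words from function words in the
# example sentence. FUNCTION is hand-picked for this sentence only.
FUNCTION = {"to", "the", "and", "some"}

sentence = "Natalie drove to the store and bought some coffee"
tokens = sentence.lower().split()

content_words = [w for w in tokens if w not in FUNCTION]   # who/what/where
function_words = [w for w in tokens if w in FUNCTION]      # grammatical glue

print(content_words)
print(function_words)
```

Reading only the first list conveys the gist of the sentence; reading only the second conveys essentially nothing about its topic.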
However, researchers such as George Miller (1995) were beginning to find that
these small, meaningless words tend to be processed in different areas of the brain
than content words and appear to serve other psychological purposes. This type
1 Note that while it may seem trivial to create new categories of words to measure a psychological
process, it can be an unspeakably difficult task in practice. Determining how specific words are
often used out in the “real world,” establishing whether certain words are related to psychologically
meaningful processes, and creating dictionaries that possess adequate statistical properties is
deceptively tricky. Many researchers have spent years working on word dictionaries that have
ultimately proven to be meaningless.
of work hinted that there may be more to these function words (also called “closed
class” words) than what we had initially thought. Indeed, researchers began noticing
that the way that people used function words often signaled a person’s psychological
state as much as, and in some cases more than, their content words.
In essence, while we knew that what a person says can be important for inferring
psychological patterns, we were beginning to learn that how a person says something
can be just as telling. Additionally, because function words are generated in an
essentially automatic manner by the brain in order to link meaningful words
together, the way that function words are used is often unconscious. In other words,
we seldom stop to think about which function words to use when speaking or
writing—they are quietly generated from a very deep level of the mind.
At the same time as these realizations were being made, early versions of desktop
applications were being developed to automatically analyze text in a more user-
friendly manner. One of the first mainstream examples of these applications was
named Linguistic Inquiry and Word Count (LIWC, pronounced the same as “Luke”;
Pennebaker and Francis 1999). LIWC has gone through several iterations since its
creation, with the latest version being LIWC2015 (Pennebaker et al. 2015). At its
core, LIWC consists of two parts. The heart of the application is its dictionary that,
in many ways, is similar to the dictionaries of older text analysis paradigms.
The LIWC dictionary contains word-to-category mappings for around 80 cate-
gories of words, including both common content words (e.g., words about family,
emotions, biological processes) and function words (e.g., pronouns, conjunctions,
articles, etc.). For example, the “cognitive processes” category contains words
like “think,” “understand,” and “analyze,” and the “articles” category contains the
words “a,” “an,” and “the.” Additionally, similar techniques have been used to
translate various versions of the LIWC dictionary to multiple languages over the
years, allowing researchers to conduct parallel work across different languages and
cultures.2
A principal strength of the LIWC dictionary is that it was carefully developed
using established, standard psychometric approaches, such as validation against
external psychological data, as well as techniques that ensure high internal
reliability from a statistical perspective (see Pennebaker et al. 2015).
If the dictionary is the heart of LIWC, the program itself is the brain of the
whole operation. The LIWC2015 software is, in a way, deceptively simple in how
it works. The program operates by reading in batches of text files (e.g., plain text
“.txt” files, MS Office “.docx” files, etc.), then counts the percentage of words that
belong to each of the LIWC dictionary categories. For example, if 2 out of every
10 words in a text are “social” words, and 1 out of every 10 is a “cognitive
process” word, the text will be scored as 20% social and 10% cognitive in nature.
2 At the time of this writing, various translations of the LIWC dictionary exist in Spanish,
Dutch, German, Italian, French, Korean, Chinese, Portuguese, and Turkish, among others. These
translations are typically available from their respective translators rather than the LIWC creators,
and most have accompanying peer-reviewed publications that evaluate the psychometric properties
of the translated dictionaries.
Fig. 1 Example of output provided by the LIWC2015 application. Output is provided in a table
that shows filenames, summary measures (e.g., word count, average words per sentence), and
percentage of words that belong to each category (e.g., percent of words reflecting anxiety,
cognitive processing, tactile perception, etc.). LIWC output can be exported as one of many several
standard file formats (CSV, Excel spreadsheets, etc.) for use in statistical packages such as R or
This task is done for each of the 80 categories in the LIWC dictionary (some
other summary measures are calculated as well, such as word count, punctuation
use, and so on), and output is produced in the form of a spreadsheet (see Fig. 1).
Additionally, the LIWC2015 application is fairly open-ended—users can create and
use their own custom dictionaries with the software to code for categories of words
that are relevant to their own research interests.
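The core counting logic described above can be sketched in a few lines (a simplified illustration, not the actual LIWC2015 implementation; the three-category mini-dictionary below is hypothetical, whereas the real dictionary maps words to roughly 80 validated categories):

```python
# Minimal sketch of LIWC-style word counting. The mini-dictionary is
# hypothetical and exists only for illustration.
import re

DICTIONARY = {
    "social": {"friend", "family", "talk", "people"},
    "cogproc": {"think", "understand", "analyze", "because"},
    "article": {"a", "an", "the"},
}

def score_text(text):
    """Return the percentage of words that fall into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    scores = {"word_count": total}
    for category, vocab in DICTIONARY.items():
        hits = sum(1 for word in words if word in vocab)
        scores[category] = 100.0 * hits / total if total else 0.0
    return scores

print(score_text("I think my friend and the family understand."))
```

The example sentence contains 8 words, of which 2 are “social” (25%), 2 are “cognitive process” (25%), and 1 is an article (12.5%), mirroring the percentage-based scoring described above.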
The very basic “word counting” approach has been found to work extremely
well across time and contexts for two primary reasons. First, as discussed earlier,
we repeatedly find that when a person uses higher rates of certain topics/categories
of language, they tend to show psychological patterns that correspond to their
language. Second, the “word counting” approach of LIWC2015 takes advantage
of the probabilistic properties of how people naturally use language—that is, people
tend to use language in meaningful clusters. An angry person typically does not
just use one or two anger words, they use constellations of words that express
(or hint at) their anger. Someone who is actively working through a problem tends
to use multiple cognitive words together as they talk or write relative to someone
who is not working through a problem, and so on.
The word counting approach does occasionally miss or misclassify a word here
and there; however, a well-made dictionary will correctly categorize words most
of the time. If a person uses the word “sad” once, it is hard to determine with a
computer program the precise meaning of the word in a single instance. Was the
person being sarcastic? Were they relaying the fact that someone else looks sad?
Were they talking about being not sad?3 In extremely short texts (e.g., one or two
sentences), these are very real concerns that should not be dismissed when using any
form of automated text analysis. However, when aggregating across large amounts
of text by the same person (hundreds, thousands, or tens of thousands of words),
we almost always find that these issues are rendered moot, despite the occasional
misclassified word (see the section in Appendix “Preparing Texts for Analysis” for
additional considerations on text preparation/word detection). Generally speaking,
then, a word counting approach that ignores things like context and intended
audience can still provide surprisingly clear insights into a person’s psychology.
Meaning Extraction: A Complementary Approach
Over the years, an incredible amount of research from labs all over the world has
demonstrated that each of the LIWC word categories exhibits distinct and often
surprising relationships with specific social and psychological processes (for a fairly
contemporary review, see Tausczik and Pennebaker 2010). Indeed, most of the
research that is introduced later in this chapter relies directly on the LIWC dictionary
for understanding how people think, feel, and behave. The LIWC dictionary does
have one commonly cited drawback, however: it requires pre-defined sets of words
to make psychological inferences. In the past decade, additional language analysis
techniques have been created that are useful for inferring psychological processes in
a more inductive, or “bottom-up,” manner. In psychology, one of the most accessible
techniques to have emerged is the Meaning Extraction Method (MEM;
Chung and Pennebaker 2008).
The MEM is a way of automatically inferring what words are being used
together, essentially resulting in a dictionary of word-to-category mappings from a
collection of texts. In simple terms, this is achieved by finding words that naturally
clump into “themes” using some basic statistical techniques. In practice, the MEM
can be considered a series of steps that a researcher takes—this procedure begins
with a collection of texts and results in psychologically meaningful word clusters:
1. Identify words in a corpus of texts that are relatively common.
2. Create a table that shows which texts use which common words.
3. Statistically identify words that tend to occur together throughout the corpus.
Conceptually, the MEM is fairly simple and straightforward. However, in
practice, there are several sub-steps that have to be completed in order to get from
3 Interestingly, the negation of an emotion (e.g., “not sad”) appears to be psychologically different
from expressing the opposite of an emotion (“happy”). Research has found that people who think
along a “sadness” dimension, even if they are repeatedly saying that they are “not sad” or talking
about someone else’s sadness, are psychologically different from those who are thinking along the
lines of a different emotion altogether (Pennebaker et al. 1997).
the beginning of the process to the end. Recently, our lab has released free software
called the Meaning Extraction Helper (MEH; Boyd 2016) that can automate the
majority of the MEM process, making this approach readily accessible and easy to
perform with minimal effort.
1. Identify words in a corpus of texts that are relatively common. Unlike a
dictionary-based method that considers each document in isolation, the MEM
considers a whole collection of texts at once. Like other topic modeling methods,
the MEM works this way in order to determine how words are typically used within
a given sample. As a first step, we need to identify those words that appear in
enough different texts in order to understand not just idiosyncratic word use by
one individual, but broader patterns across different people and texts. Moreover,
when performing the MEM, we tend to care less about specific variations of words
than the concepts reflected by each of the words—to account for this, we can do
something called “lemmatization,” which converts words to their most basic form
(e.g., “drive,” “driving,” and “drove” are all converted to “drive”). The MEH
software can perform lemmatization automatically for most English words (in
addition to roughly a dozen other languages).
Once words are lemmatized, we create a list of all words and determine what
percentage of texts use each word. Once all of this information has been collected,
we decide which words we would like to omit based on what percentage of texts
include each word, typically using a minimum of somewhere around 5%. This
prevents any single text from unduly influencing our results: a single text may
contain 500 instances of the word “enantiodromia,”4 but this is hardly useful in
understanding the word’s broad psychological properties across different people. Lastly, we
typically omit function words from our master list—because function words are
so common, they tend to co-occur with all other words, making them difficult to
statistically separate into meaningful themes (see Table 1 for an example of the
frequency list generated by MEH).
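Step 1 can be sketched as a simple document-frequency filter (a minimal illustration; the 5% threshold and the tiny function-word list below are assumptions for demonstration, and a real analysis would also lemmatize first, as the MEH does):

```python
# Sketch of MEM Step 1: keep words that appear in enough different texts,
# dropping function words. The stoplist here is a tiny stand-in.
from collections import Counter

FUNCTION_WORDS = {"the", "a", "an", "and", "to", "some", "of"}

def common_content_words(texts, min_doc_pct=5.0):
    """Return content words appearing in at least min_doc_pct of texts."""
    doc_freq = Counter()
    for text in texts:
        # set() so each text counts a word at most once (document frequency)
        doc_freq.update(set(text.lower().split()))
    n = len(texts)
    return sorted(
        word for word, df in doc_freq.items()
        if 100.0 * df / n >= min_doc_pct and word not in FUNCTION_WORDS
    )
```

Because document frequency (not raw frequency) drives the filter, a word used 500 times in one text but nowhere else is still excluded, exactly as in the “enantiodromia” example above.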
2. Create a table that shows which texts use which common words. Once the
commonly used words have been selected, we need to create a dataset with this
information—this will allow us to extract common themes using some basic statis-
tical procedures. The simplest way of doing this is to score each text using a “binary”
method—that is, each text either does or does not use each word (see Table 2).
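The binary scoring in Step 2 amounts to building a document-by-word table of 1s and 0s, as in this sketch:

```python
# Sketch of MEM Step 2: a binary document-term table
# (1 = the text contains the word, 0 = it does not).
def binary_table(texts, vocabulary):
    """One row per text, one column per vocabulary word."""
    rows = []
    for text in texts:
        present = set(text.lower().split())
        rows.append([1 if word in present else 0 for word in vocabulary])
    return rows
```

For particularly long texts, the same loop could instead record the percentage of each text made up of each word (the “verbose” scoring mentioned in the note to Table 2).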
Before discussing the statistical particulars, take a close look at Table 2. By
simply looking at the 1’s in each row, one can get a decent sense of what each story is
about. Story 1 clearly contains elements about school and social concepts, possibly
a story about a school-related memory involving close relationships. Stories 3
and 5 appear to both have strong elements of time—an instinctive guess would
suggest that the authors are focused on (or attentive to) time, order of events, or
temporal sequences. The next step, listed below, is comprised of running a statistical
4 To spare the curious reader from having to seek out a dictionary, “enantiodromia” refers to a
tendency for something to convert into its opposite form.
Table 1 Example of a frequency list generated by MEH
Word            Raw word frequency    Percentage of texts
                across all texts      that contain word
the             600                   100.00
a               542                   98.12
life            320                   67.89
school          187                   53.45
time            133                   35.33
people          98                    21.66
year            91                    8.99
day             82                    6.72
family          72                    5.51
friend          63                    5.30
book            31                    1.36
plant           15                    1.21
enantiodromia   500                   0.10
Note that uncommon words (even if used in high amounts by a single person) are excluded from
subsequent steps. The function words at the top of the table (“the,” “a”) will be removed because,
as the right-most column shows, they co-occur with virtually all other words in the sample (e.g.,
every single text includes the word “the,” so it co-occurs with literally all other words). The words
at the bottom of the table (“book,” “plant,” “enantiodromia”) will be excluded because they appear
in only a small percentage of texts, even if they are highly common in terms of their raw frequency.
Table 2 An example of Step 2 in the Meaning Extraction Method: a binary table that reflects
which texts used common words in a corpus
Text     Life  School  Time  People  Year  Day  Family  Friend
Story 1 1 1 1 1 1
Story 2 1 1
Story 3 1 1 1
Story 4 1 1 1
Story 5 1 1 1 1
The left-most column shows which text is being considered, and each row highlights which words
were present in each text. Note that for particularly long texts, other scoring methods may be
preferable (e.g., percentage of text comprised of each word, known as a “verbose” score)
procedure that does a rather good job of extracting themes in a way that coincides
with our intuitive judgments, albeit in a fairly broad manner.
3. Statistically identify words that tend to occur together. Conceptually, one can
think of a Principal Components Analysis (PCA) as a method for finding groups
of correlations—in essence, finding groups of words that tend to be used together.
Fortunately, one does not have to be statistically savvy to run this analysis, as most
statistical packages (such as SPSS and R) have ways to run a PCA rather easily.
For our example above, a PCA would likely find that the words “time,” “year,” and
“day” form a broader, meaningful theme about “time,” and the concepts of “school,”
“people,” and “friend” appear to be part of a broader social theme (see Table 3).
Table 3 Example results from a Principal Components Analysis of the data in Table 2
Factor 1 (“time”) Factor 2 (“social”)
Day 0.67 0.09
Year 0.58 0.02
Time 0.42 0.11
People 0.03 0.88
School 0.04 0.75
Friend 0.12 0.70
Life 0.16 0.00
Family 0.09 0.01
Technically, all words will have a ‘loading’ onto
each of the themes extracted from a corpus of texts.
Usually, a manually chosen cutoff is used (typically
around 0.10 for very large samples and 0.25 for
relatively small samples) to determine which words
belong to which themes. In this table, shaded values
indicate the theme onto which each word best loads
using a conservative cutoff value. Words may also
commonly load onto multiple themes.
What the PCA has done for us, then, is to identify two clusters of words that can be
thought of as new categories of language to measure for psychological purposes. If
we look back to Step 2, we find that these word clusters converge with our earlier
intuitive judgments.
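Under the hood, Step 3 amounts to a PCA of the binary table. Below is a minimal, unrotated sketch using NumPy's eigendecomposition of the word-by-word correlation matrix (an illustration only; published MEM work typically runs the PCA in a statistical package such as SPSS or R, often with varimax rotation):

```python
# Sketch of MEM Step 3: unrotated PCA loadings from a binary
# document-by-word table, via the word-by-word correlation matrix.
import numpy as np

def pca_loadings(X, n_components=2):
    """Rows of the result = words; columns = extracted themes."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)            # ascending order
    top = np.argsort(eigvals)[::-1][:n_components]  # largest components
    # A loading is the eigenvector entry scaled by sqrt of its eigenvalue
    return eigvecs[:, top] * np.sqrt(eigvals[top])
```

Words whose absolute loading on a component exceeds a chosen cutoff (e.g., 0.25 for small samples, as in the note to Table 3) would then define that theme's word list.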
What Do We Do with these New Themes?
Ultimately, the steps that comprise the MEM are designed to help identify themes
across our texts that can then be measured using the LIWC-style word counting
approach. In some cases, identifying which common themes exist is telling in its
own right—perhaps we are looking to discover common themes across international
treaties to better understand the topics that they deal with, or what themes are most
prevalent in famous jazz lyrics. However, as we have learned, these themes are often
psychologically meaningful as well. By using a LIWC-style approach to quantifying
each theme across texts, we can search for psychological patterns within or between
individuals (e.g., did an author objectively fixate more on nature in earlier versus
later works? How do time and culture influence the narrative of war?; see the section
in Appendix “Working with the MEM and MEM Results” for additional notes on
quantifying MEM themes in texts). In the coming sections, studies that have used
the MEM to better understand psychological profiles of individuals across several
types of research questions will be described.
The Statistical Analysis of LIWC and MEM Output
Once you have performed the above methods, you are left staring at spreadsheets full
of rows of text names and hundreds of columns containing numeric representations
of psychological dimensions. What does one do with all of those numbers? All of the
methods described above are inherently quantitative in some form or another. While
the quantified output of these procedures can still be used to facilitate qualitative
psychological insights, they are particularly well-suited to statistical modeling and
analyses. Given that each column of the LIWC and MEM datasets is a numeric
vector, language-based psychological measures can fit into virtually any statistical
algorithm that someone would use for any other quantified metric. How language-
based measures of psychology are put to best use will vary widely depending on
your research, goals, and interests. Like other types of measures and analyses, it is
often important to understand the properties of your variables so that they may be
optimally beneficial to your own work.
In terms of ensuring that your numerical data is of adequate quality, the primary
statistical considerations in psychological text analyses typically revolve around the
size of each text that you have processed—beyond this, most word counting methods
are rather forgiving. Because most word counting methods deal with percentages,
a minimum word count of 25 or 50 words per text is typically recommended.5
For example, the sentence “That was a good doughnut” contains positive emotion
words (good) and ingestion words (doughnut) both at the same rate: 1 out of 5,
or 20%. This is an extremely high number for both categories: positive emotion
words are typically in the 2–8% range, and ingestion words typically occur far
less frequently (less than 1% of words in most cases). In larger bodies of text,
the behaviors of language patterns tend to smooth out, resulting in more accurate
and useful distributions of psychological categories. Omitting texts with too few
words is equally important when running MEM analyses: if texts contain only 5
or 10 words on average, and half of these are function words, it becomes difficult
to discover meaningful co-occurrence patterns unless you have an extremely large
number of observations (e.g., a few hundred thousand texts).
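The percentage logic and minimum-word-count screening described above can be sketched in a few lines. The two mini-dictionaries and the exact 25-word cutoff below are purely illustrative; real LIWC categories contain hundreds of words each.

```python
import re

# Toy word-category "dictionaries" (invented for illustration).
CATEGORIES = {
    "posemo": {"good", "happy", "love"},
    "ingest": {"doughnut", "eat", "dinner"},
}

MIN_WORDS = 25  # a common rule-of-thumb minimum for word-counting methods

def category_percentages(text):
    """Return each category's share of total words, plus the word count."""
    words = re.findall(r"[a-z']+", text.lower())
    pcts = {
        cat: 100.0 * sum(w in vocab for w in words) / len(words)
        for cat, vocab in CATEGORIES.items()
    }
    return pcts, len(words)

pcts, n = category_percentages("That was a good doughnut")
# Both categories match 1 word out of 5 (20%), far above typical base
# rates; with n < MIN_WORDS this text would normally be excluded.
```

In practice, one would apply such a filter to the whole spreadsheet of texts before any statistical modeling, dropping every row whose word count falls below the chosen threshold.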
Once language has been quantified and texts containing too few words have
been dropped, an appropriate statistical model must be selected. The selection and
execution of statistical models is usually informed by the nature of one’s data and
research design, and a full consideration of model selection is far beyond the scope
of this chapter. However, it is important to note that the most actionable insights
from psychological text analysis usually require transparent statistical models. In
other words, if a statistical model is relatively easy to interpret (e.g., a simple
Pearson correlation or a between-groups ANOVA), understanding the meaningful
5 The general rule of thumb is that more data is almost always better. Data quantity is the foundation
of both reliability and accuracy when quantifying psychological processes, which can be incredibly
difficult to assess using any research method. The same holds true when using language to extract
psychological information.
psychological patterns is far easier to come by than when using powerhouse
algorithms that are largely impenetrable (e.g., a support vector machine with radial
basis function kernels, deep-layer convolutional neural networks).
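As a concrete illustration of the kind of transparent model favored above, a plain Pearson correlation between a language measure and an outcome can be computed directly; the five data points below are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented toy data: % first-person-singular pronouns per text, paired
# with a self-reported outcome score for the same five authors.
i_words = [4.2, 6.1, 5.0, 7.3, 3.8]
outcome = [10, 14, 12, 16, 9]
r = pearson_r(i_words, outcome)
# A single signed coefficient is easy to interpret: its sign and size
# say how the language measure moves with the outcome.
```

The appeal of such a model is exactly the transparency discussed above: the result is one number whose meaning any reader can inspect, rather than an opaque prediction engine.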
The use of powerful but opaque statistical techniques has rapidly become a serious
problem for a number of studies in the computational social sciences,
particularly when using language to model human psychology. Many published
studies are technically impressive from a computer science or information sciences
perspective, but are ultimately useless to social scientists who seek to understand
the nature of findings (e.g., “Did X cause Y?”; “How does X vary as a function of
Y?”). In other words, being able to predict a phenomenon with a statistical model
is conceptually quite different from being able to understand it and its implications.
For example, virtually all of the research described in the next section has been made
possible by the use of simple statistical models that are easily interpreted, resulting
in valuable theoretical insights into the human mind.
The predict-versus-understand tradeoff is a major one that delineates most
psychological text analysis techniques from those more frequently used in
computational linguistics, the information sciences, and related fields. Ultimately,
statistical decisions most often come down to properly understanding the intended
goals of a research project and field-specific standards/traditions. A mismatch
between the stated goals of a study and the domain expertise required to understand
what types of decisions should be made during statistical analyses can too often
result in dazzling statistical models that offer no insights beyond “X plus 50 other
variables can predict Y”.6
Psychological Research Using LIWC and MEM
In the past two decades, an incredible amount of research using both LIWC and
MEM has been done that finds robust links between language use and psychological
patterns—far too much to cover in a single chapter. What is presented below should
not be taken as a comprehensive overview of this body of work. Instead, the goal
of this section is to introduce you to some of the broader areas of psychological
research that have been studied using automated language analysis, providing
specific studies and experiments as examples of this work.
6 These comments are in no way intended to disparage or discourage the use of heavy-hitting
statistical algorithms and machine learning procedures. In fact, the author of this chapter uses
these analytic techniques with absolute regularity in his own work, and he enjoys few things in life
more than the intricacy of well-crafted, complex models (why yes, he is a huge hit at parties—how
did you know?). However, the importance of considering tradeoffs between prediction power and
being able to describe/understand one’s model in practical terms cannot be overstated.
Language and Individual Differences
When we read a text, we are often trying to get a sense of who the author is as a
person. How do they think? Are they young or old? What does their day-to-day life
look like? Are they nice, independent, quick to anger, or deviously calculating?
When asking these questions, we often have to rely upon our own subjective
experiences and feelings to make evaluations of other people; this is especially
true when all that we have to go on is a person’s language. The central question,
then, is “do different types of people talk in different ways?” Can we reliably infer
characteristics of a person based solely on the words they use? As is discussed in
this section, the answer turns out to be a resounding “yes.”
These types of evaluations fall under the broader umbrella of individual differ-
ences, a large area of research that tries to understand the various ways in which
people are psychologically different from one another, yet relatively consistent over
time. Individual differences not only include personality evaluations like neuroti-
cism and extraversion, but also include things such as gender, age, life experiences,
and so on—things that differ between individuals but are not necessarily caused by
internal psychological processes.
Language and Traditional Self-Report Personality Measures
In psychology, we generally refer to personality as collections of thoughts, behav-
iors, and emotions that are indicative of a person across time and situations.
When we say that an individual is neurotic, for example, we really mean to say
that they exhibit a cluster of specific, stable psychological patterns over time. A
neurotic person tends to have difficulty regulating their emotions, they typically
interpret ambiguous information as negative, and they consistently experience
anxiety at above-average levels on a day-to-day basis. Personality descriptors like
“neuroticism,” then, do not refer to “real” things, but instead act as a sort of
shorthand term that we use to refer to these types of stable psychological patterns
and clusters.
Early work on the consistency of language found that a person’s language, like
their personality, showed reliable patterns between multiple time points (Pennebaker
and King 1999). Just as a person who is extroverted today is likely to be extroverted
next week, next month, and next year, so too does a person’s word use remain
relatively constant over time. If someone uses high rates of conjunctions relative
to other people today, they will most likely use high rates of conjunctions across
different times and situations. This proved to be an invaluable finding for language
research as it demonstrated that language use could be used to detect those stable
psychological patterns that broadly characterize a person. In practice, what this
means is that when we scientifically establish a link between a certain measure
of language (e.g., use of negations such as “isn’t” and “aren’t”) and personality
(e.g., impulsivity), we can infer certain stable aspects of a person indirectly based
exclusively on how they write or talk. That is to say, by measuring the rate at which
someone uses negations, we can statistically estimate the degree to which they
are also impulsive. Much recent research has found that it is possible to estimate
a person’s personality using LIWC- and MEM-based measures, albeit with some error.
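A minimal sketch of the indirect estimation just described, assuming a previously fitted linear model linking negation rates to impulsivity; the intercept and slope here are entirely hypothetical placeholders, not values from any published study.

```python
# Hypothetical coefficients standing in for a previously established
# language-trait regression (invented for illustration).
INTERCEPT = 2.0
SLOPE = 0.8  # change in estimated impulsivity per 1% increase in negations

def estimate_impulsivity(negation_pct):
    """Apply the (hypothetical) fitted model to score a new author."""
    return INTERCEPT + SLOPE * negation_pct

# An author whose texts use negations at 1.5% of total words:
score = estimate_impulsivity(1.5)
```

The key point is the workflow, not the numbers: a language-trait link is established once on data where both language and trait are measured, and the resulting equation can then score authors for whom only text is available.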
An important point about studying personality is that psychologists typically
assess an individual’s personality using self-report questionnaires. In a standard
social science research setting, self-report questionnaires can be an ideal personality
assessment tool: they are easy to fill out, rather fast to complete, and require no fuss
from the test-taker or the researcher.7 Unfortunately, the self-report paradigm is of
limited usefulness when conducting several other types of work. Until backwards
time travel is invented (presumably this will come shortly after the aforementioned
hoverboards), we cannot present questionnaires to historical figures such as Francis
Beaumont or Franklin Pierce and politely ask them to “fill in the bubbles” at their
convenience. Other sources of personality information can be problematic: observer
reports of historical figures tend to be skewed, riddled with myth, and spotty,
particularly as we look further back in history. Fortunately, the very foundations
of personality (stable trends in thought, feeling, and behaviors) tend to be encoded
in human language, allowing us to establish personality—language links and,
subsequently, make scientifically grounded estimates of an individual’s general
psychological makeup.
When discussing personality, the most commonly used theoretical framework is
known as the “Big 5” of personality (neuroticism, extroversion, openness to expe-
rience, conscientiousness, and agreeableness; see John et al. 2008). Most language
research on personality that uses self-reports attempts to correlate language use with
people’s responses to Big 5 questionnaires, and the majority of these studies use the
psychological measures generated by LIWC. Pennebaker and King (1999) initially
found correlations between Big 5 personality measures and the language of students
writing different types of stories and narratives. For example, people scoring high
on self-reported neuroticism use fewer positive emotion words and higher rates of
negative emotion words, as we would expect, and participants who score higher on
extroversion tend to use more social words. Since then, several other researchers
have found that LIWC measures of language can be used to estimate someone’s
personality (Yarkoni 2010; Komisin and Guinn 2012), and some research has even
been extended to automatically assessing the personality of fictional characters
based on their language (e.g., Flekova and Gurevych 2015; Liberman 2015). Most
research has found that LIWC-based approaches to estimating personality perform
well above average, even when simple statistical models are used (e.g., Mairesse
et al. 2007).
7 Note, however, that self-report measures of personality are not without their own drawbacks and
imperfections. In order to accurately answer a self-report question about yourself, you must have
both accurate information about yourself in a given domain and a willingness/ability to make
accurate self-reports. The literature on self-report biases and pitfalls is rather extensive but beyond
the scope of this chapter.
Other research using the MEM has also found patterns of word use that
can be used to understand personality as well. In a study that extracted themes
from individual writing samples, Chung and Pennebaker (2008) found common
social themes such as sociability (comprised of words like “independent,” “fun,”
and “confident”), maturity (“mature,” “successful,” “caring”), and psychological
stability (“satisfied,” “healthy,” “positive”). When quantifying these themes in
people’s writing, the researchers found that the degree to which each person invoked
these themes was related to how they scored on self-reports of the Big 5. For
example, people who used more words from the sociability theme tended to score
higher on agreeableness, and people who wrote using words from the maturity
theme tended to score higher on conscientiousness. This work revealed that even
the common themes people talk about (and the way that people talk about them)
can be telling of a person’s personality, sometimes counterintuitively. Because the
MEM extracts themes based on common content words, this approach can also be
used to establish relevant lists of words to watch for when manually trying to infer
personality based on an author’s texts.
It is important to note that, while a considerable amount of research has been
done attempting to link self-report questionnaires of personality to language use,
this approach often results in weaker findings than studies that use other types of
individual differences data. In the remaining portion of this section, the studies
presented will generally focus on language use and its link to other forms of data,
such as behavioral data (performance-based measures, such as test performance,
and reports of specific actions in which people engage) and objective self-reports
that are typically unbiased (e.g., age).
Language across the Lifespan
Pennebaker and Stone (2003) conducted two studies examining the effect of age
on how people write. The researchers analyzed the writings of authors who wrote
multiple works throughout their lifetime (e.g., Alcott, Austen, Shakespeare, Yeats)
and found several consistent patterns. As authors aged, they tended to use more
positive emotion words and future tense words. Aging was also found to be related
to the use of fewer negative emotion words, past-tense verbs, and self-references.
A second study of non-professional writing by psychology study participants found
the same patterns, and these results converge with other research on age-related
psychological changes (e.g., Watson 2000), as well as other studies of age-related
shifts in language use (e.g., Schler et al. 2006).
Language and Core Values
Recently, Boyd et al. (2015) conducted a large-scale analysis of how a person’s
values are expressed in the language that they use. Using the MEM, they first asked
participants to write about their core values (those principles that guide their lives,
decisions, and behaviors), as well as their common behaviors. In their work, they
found that they were able to extract word clusters that corresponded to distinct
values, such as religion, empathy, and personal growth. Importantly, they found
that the value-relevant themes that people used were strongly predictive of people’s
actual behaviors.
After establishing word-to-value mappings, the researchers conducted a large-
scale analysis of Facebook users to explore the degree to which value-relevant
language was present in unprompted writings by the general public. The results from
their second study supported those of their first—greater use of value-related themes
was broadly predictive of what types of behaviors people engaged in (and reported
engaging in) online. People who invoked the religion core value theme were more
likely to attend church, and individuals who used words from the empathy theme
were more likely to share messages in support of cancer awareness and friendship.
These types of language-based value—behavior links pave the way for assessing
the values that a person holds central in their lives from their writings and verbal
speech. Because the researchers established value-relevant themes using the MEM,
these word lists can be applied to new domains in novel ways. The fact that a
person’s casual, innocuous words can be used to predict specific, concrete day-to-
day behaviors may strike some readers as rather unnerving. Indeed, such findings
highlight the power of psychological text analysis methods.
Language and Thinking Styles
In judgment and decision-making research, decades of work have found that different
individuals tend to solve problems and make decisions in two broad ways: (1) slowly
and deliberately, or (2) quickly and intuitively (see Kahneman 2011; also sometimes
referred to as “rational” vs. “experiential” decision-making, respectively). These
“decision-making styles” tend to be fairly static over time; someone who is an
“intuitive” decision-maker today will tend to make decisions in the same way across
time and contexts (Fetterman and Robinson 2013). Rational decision-makers tend
to be highly analytic, do well in school, and can sometimes be perceived as socially
cold or distant (a relatively common television/movie cliché that, as it turns out,
has some basis in reality). On the other hand, experiential decision-makers tend to
rely more on their emotions, social events, and narratives to make decisions, and are
typically perceived as being more socially open and warm than rational decision-makers.
By analyzing function words with LIWC, Pennebaker et al. (2014) were able
to develop a language-based measure of thinking styles named the Categorical-
Dynamic Index (CDI) that corresponds to how people make decisions and mentally
engage with the world. At the high end of the CDI are people who primarily use high
rates of articles and prepositions while, conversely, people on the other end of the
CDI tend to use lots of pronouns, negations, adverbs, and other types of function
words. Using the CDI, Pennebaker et al. (2014) were able to predict measures
pertaining to college performance, such as student GPA and pre-college SAT scores,
as a function of students’ admission essays. Using the same measure, Jordan and
Pennebaker (2016) found striking differences between the 2016 Democratic and
Republican US presidential candidates during debates, with Democratic candidates
generally scoring on the “rational” side of the CDI and Republican candidates
scoring on the “experiential” side.
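Based on the description above (articles and prepositions at one pole; pronouns, negations, adverbs, and similar function words at the other), a CDI-style score can be sketched from LIWC percentage output. The weighting follows the formulation reported by Pennebaker et al. (2014); the input values below are invented for illustration.

```python
# Sketch of the Categorical-Dynamic Index (CDI) computed from LIWC
# category percentages, following Pennebaker et al.'s (2014) weighting.
def cdi(liwc):
    return (30
            + liwc["article"] + liwc["prep"]      # categorical pole
            - liwc["ppron"] - liwc["ipron"]       # dynamic pole...
            - liwc["auxverb"] - liwc["conj"]
            - liwc["adverb"] - liwc["negate"])

# Invented LIWC output (each value is a % of total words) for one essay:
essay = {"article": 9.0, "prep": 14.0, "ppron": 6.0, "ipron": 4.0,
         "auxverb": 7.0, "conj": 5.5, "adverb": 4.0, "negate": 1.0}
score = cdi(essay)  # higher = more categorical/analytic thinking style
```

Because every term is just a category percentage already produced by LIWC, the index can be computed for any text in a spreadsheet of LIWC output with a single pass.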
Authorship Attribution Using Psychological Analyses
The LIWC application and dictionary are used regularly to help understand authors
precisely because of the stable nature of language use by individuals. By building
on the findings showing that a person’s linguistic patterns are markers of their
underlying psychology, and that both exhibit relatively static trends over time, it
is theoretically possible to aggregate several language-based measures of a person’s
psychology into a “psychological fingerprint” that is unique to them. For example,
Petrie et al. (2008) used LIWC measures of the Beatles’ lyrics to identify the unique
psychology of each Beatles member, and also found which songwriter’s style was
predominant in Beatles songs that were written collaboratively.8
What do we do when the authorship of a work is unknown or disputed? Over
the years, several methods of automated stylometry have been developed to try
to establish authorship using word distributions (e.g., Juola 2006; Koppel et al.
2008). Many of these methods are rather effective; however, they can conceptually
oversimplify the individual by reducing them to single words rather than treating
them as human beings with complex mental lives. By combining several of the measures outlined
in this chapter (e.g., LIWC-based measures of personality, the CDI, MEM-derived
themes), Boyd and Pennebaker (2015) were able to statistically establish the
likelihood that Shakespeare authored the long-disputed play Double Falsehood.
In their study, Boyd and Pennebaker (2015) found that three possible authors—
William Shakespeare, John Fletcher, and Lewis Theobald—all had highly distinct
psychological patterns of language use across their respective solo works. In fact,
each author’s psychological fingerprint was so unique that all three authors could be
reliably distinguished with near-100% accuracy using exclusively language-based
psychological measures. After establishing each author’s unique psychological
attributes, the researchers were able to automatically extract a new psychological
composite from the play Double Falsehood, then statistically compare the disputed
play to each author’s unique psychological fingerprint.
8 Another interesting analysis of the Beatles using LIWC was performed by Kasser (2013), who
explored the interpretation of the song Lucy in the Sky with Diamonds from a psychological
perspective. While the song is often cited as being overtly about drug use, Kasser (2013) found
that the psychological fingerprint of the song was generally quite similar to other lyrics authored
by John Lennon. Kasser did find linguistic markers consistent with drug experience descriptions;
however, Lucy in the Sky with Diamonds also scored relatively high on language measures that
pertain to distancing oneself from painful experiences, such as a lack of emotional content and
very few markers of “here and now” thinking (sometimes called “psychological distancing”).
Their findings overwhelmingly supported the notion that Shakespeare was the primary author of the original
play, with modest contributions from John Fletcher and likely later editing on
the part of Lewis Theobald. Additionally, because the language-based measures
were principally revealing of the authors’ respective personalities, the researchers
were able to compare the psychological fingerprints to historical accounts of each
author. Boyd and Pennebaker’s (2015) analysis largely converged with historical
and observer reports, such as Fletcher having been a highly social individual and
Theobald having been highly intelligent but cold and confrontational.
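The fingerprint-matching logic above can be illustrated with a deliberately simplified nearest-profile comparison. All feature values below are invented, and the actual study used a far richer set of measures and statistical tests; this sketch only shows the shape of the idea.

```python
import math

# Each author is represented by a vector of language-based psychological
# measures (all values invented for illustration).
authors = {
    "Shakespeare": [4.1, 2.0, 15.2],
    "Fletcher":    [6.3, 1.1, 12.8],
    "Theobald":    [3.2, 3.4, 17.0],
}

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_author(disputed):
    """Assign a disputed text to the author with the nearest fingerprint."""
    return min(authors, key=lambda name: distance(authors[name], disputed))

# Hypothetical fingerprint extracted from a disputed text:
guess = closest_author([4.0, 2.1, 15.0])
```

In a real analysis, the features would be standardized, validated against each author's solo works, and compared with a proper statistical model rather than raw distances; the point here is simply that "psychological fingerprints" are ordinary numeric vectors that can be compared directly.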
Language and Mental Health
Given that much of the early psychological interest in language dealt with disorders
and mental health issues, it is perhaps not surprising that much of the modern work
using automated text analysis also began in this area. By analyzing the language of
an individual, it is possible to uncover areas that are particularly problematic in their
lives. Depression, emotional upheavals, and childhood traumas can leave lasting
marks on the ways in which a person communicates that are nearly impossible to
detect with the naked eye. However, by using automated text analysis, the subtle
differences in function word use and thematic relationships can be quite striking
and powerfully telling of a person’s current and past emotional difficulties.
Depression and Suicide
One of the best-explored areas of mental health and language includes research
on depression and suicide. For example, Stirman and Pennebaker (2001) analyzed
300 poems by 18 poets (9 of whom committed suicide) with LIWC in order to
test for language patterns indicative of risk for depression/suicide. The researchers
found that the poets who committed suicide used language patterns consistent with
social withdrawal, such as increased use of first person singular pronouns (I, me, my) and
decreased first person plural pronouns (we, us, our). These patterns were particularly
pronounced when poems were written close to the time of an author’s suicide. The
findings of this study revealed that those poets who committed suicide increasingly
suffered from an inability to extract themselves from their own mental worlds,
instead exhibiting an elevated focus on the self and decreased attention to their social worlds.
Other studies have found the same language patterns leading up to an author’s
suicide. An analysis of Marilyn Monroe’s personal writings published in Fragments
found similar first person pronoun patterns leading up to her death (Fernández-
Cabana et al. 2013). Additionally, various cognitive and emotional patterns indica-
tive of suicide and depression, such as decreases in cognitive complexity, were
found in Fragments. While such patterns of language use are not proof of suicide,
they strongly suggest suicidal perturbations in Monroe’s psychological functioning.
Similar language patterns of self-focus and social isolation were found in the written
works of Henry Hellyer, a nineteenth-century surveyor and explorer whose death was
ruled a suicide (Baddeley et al. 2011).
The work on links between language and depression/suicide has been extended
to other areas as well. Such work includes research on comparisons between
individuals who committed suicide versus spree killings (Egnoto and Griffin 2016),
women suffering from post-partum depression (De Choudhury et al. 2013), and even
depression support forums (Ramirez-Esparza et al. 2008).
The Impact of Traumatic Events on Psychology and Language Use
A sizeable portion of psychological text analysis research has examined the links
between language and how people process and come to understand traumatic events.
Indeed, the original version of LIWC was initially created as a way to automatically
analyze narratives written by people coping with trauma in their own lives (known
as “expressive writing”; see Pennebaker and Evans 2014).
In a study of psychological changes in response to traumatic events, Gortner and
Pennebaker (2003) conducted an archival study of newspaper articles surrounding
the 1999 Aggie Bonfire collapse, an event that caused the death of 12 people
at Texas A&M University. Using a LIWC analysis of local newspaper articles
spanning a timeline of 1 month before to approximately a year after the event,
the researchers found that several psychological patterns mined from the articles’ language followed a “social
stages of coping” model that appeared to capture broader psychological trends in the
community. For example, social words and cognitive words showed an immediate
increase after the incident, suggesting a general increase in the need for social and
mental engagement with the world. Additionally, the linguistic shifts seen in the
newspaper articles were predictive of rates of illness in the affected community,
acutely reflecting the effects of stress on individual physical health (Sapolsky 1994).
Cohn et al. (2004) conducted a study of online journals kept by individuals living
in the USA by analyzing entries in the 2 months prior to September 11, 2001 and the
2 months following; they found strikingly similar results regarding pre- and post-
trauma language patterns.9
Recently, psychological research on language and coping has begun to adopt
more of the “bottom-up” language analysis techniques such as the MEM. Stanton
et al. (2015) conducted an analysis of texts written by women who were survivors of
childhood sexual abuse (CSA) using the MEM. In their analysis, word patterns and
themes were identified that differentiated these individuals from women who had
experienced no sexual abuse (NSA) during their childhood. Stanton et al. found that
women who had experienced some form of sexual abuse during childhood showed
9 Interestingly, this study and others suggest that higher use of both positive and negative emotion
words may generally reflect greater immersion in a given writing topic (e.g., Holmes et al. 2007;
Tausczik and Pennebaker 2010).
different thematic patterns when talking about sex than those women who had not.
For example, women who were CSA survivors tended to use less language related
to virginity and relationships and more language related to physical attraction than
their NSA counterparts; these results converge with other psychology research on
the relationship between a history of sexual abuse, physical attraction, and sexual
risk-taking behaviors (e.g., Senn et al. 2006).
In a follow-up study, Pulverman et al. (2016) found that as CSA women
underwent cognitive therapy for their experiences of abuse, their conceptualization
of sex (measured via their language) came to resemble those of women who had
never experienced sexual abuse. Similar studies of alcohol abuse (Lowe et al.
2013) and inpatient psychotherapy aftercare (Wolf et al. 2010) have found that the
themes invoked by individuals around certain topics allow us to both understand the
common ways that people think about their situations, but also help to infer whether
someone has experienced certain types of upheavals in their life.
Language and Social Processes
It may be something of an understatement, but language is, at its core, inherently
social. Up to this point, the research highlighted in this chapter shows an unspoken
awareness of the social nature of language, yet seldom touches upon this fact
directly. Nevertheless, the past few years have seen a growth in the automated psy-
chological analysis of social interactions as a way to extract and better understand
social processes. Multiple empirical studies by Molly Ireland and colleagues have
found that people not only coordinate what they talk about as conversations unfold
(i.e., content words), but how they talk as well (function word use).10
Ireland et al. (2011) explored the degree to which two people’s function words
synchronize in the context of romantic relationships. In a study of introductory
conversations in speed daters, the researchers found that those couples who engaged
in higher language style matching (LSM) showed increased mutual attraction to
each other, suggesting that they were keenly focused on each other to the point
where they even began to speak similarly. The researchers’ second study looked at
instant messages between partners in romantic relationships, finding that couples
with higher LSM tended to exhibit greater relationship stability. In fact, Ireland
et al. (2011) found that by calculating LSM in these brief conversations between
romantic partners, they were able to statistically predict relationship longevity more
accurately than the people who were actually in the relationships.
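LSM is typically computed per function-word category and then averaged. The sketch below follows the commonly reported formula (one minus the absolute difference in two speakers' usage rates, divided by their sum); the usage rates are invented for illustration.

```python
# Sketch of language style matching (LSM) between two speakers.
# Inputs are each speaker's function-word usage rates (% of total words).
def lsm(person1, person2):
    scores = []
    for cat in person1:
        p1, p2 = person1[cat], person2[cat]
        # Small constant guards against division by zero for absent categories.
        scores.append(1 - abs(p1 - p2) / (p1 + p2 + 0.0001))
    return sum(scores) / len(scores)

# Invented usage rates for two conversation partners:
a = {"pronoun": 12.0, "article": 6.5, "prep": 13.0, "negate": 1.2}
b = {"pronoun": 11.0, "article": 7.0, "prep": 12.0, "negate": 1.5}
similarity = lsm(a, b)  # 1.0 = identical styles; lower = less matched
```

Because the score is bounded and symmetric, it can be computed for any pair of texts or speakers and tracked over time, which is how the relationship trajectories described below were charted.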
In another study, Ireland and Pennebaker (2010) conducted LSM analyses of
three pairs of famous writers across their relationships by analyzing function
10 The degree to which people synchronize their function words is often not directly perceptible;
however, higher language style matching among individuals can foster perceptions of social
connectedness and support (Rains 2015).
[Figure 2: three panels (Freud and Jung; The Brownings; Plath and Hughes) plotting mean LSM over time.]
Fig. 2 Language style matching across time for famous writers and their relationships with
close others (reprinted from Ireland and Pennebaker 2010; originally published by the American
Psychological Association)
word synchrony in poetry and personal letters using LIWC. In their analysis, the
researchers found that function word synchrony in the personal correspondence
between Sigmund Freud and Carl Jung, one of Freud’s students, mirrored their
relationship trajectory. Freud and Jung had a notoriously contentious relationship
as Jung gradually began to break away from the teachings of Freud, eventually
accusing Freud of arrogance and stubbornness. LSM successfully captured the
evolving (and later deteriorating) stages of the pair’s relationship over a 7-year
period (see Fig. 2).
Elizabeth Barrett Browning and Robert Browning’s relationship is also quite
infamous for its ups and downs. Points of interpersonal synchrony and asynchrony
are revealed in the couple’s poetry, with function word use being most similar during
times of harmony in their marriage. Similarly, the works of Sylvia Plath and Ted
Hughes, two influential poets who were married in 1956, exhibited LSM peaks
during times of harmony and a gradual decline up to the time of Plath’s suicide.
While still a relatively new measure, LSM has begun to see use in various forms
of psychological research that reveal much about the importance of synchrony
in understanding human interactions. Borelli et al. (2016) have found that higher
LSM occurs in strong mother–child attachments, with securely attached children
exhibiting higher function word similarity to their mothers. Other research finds
that crime suspects are more likely to confess to a crime when their language style
coordinates with that of their interrogator (Richardson et al. 2014), and LSM during
US presidential debates is even related to public perceptions of US presidential
candidates (Romero et al. 2015).
Psychological text analysis methods can be used to better understand the com-
plicated mental universes of people in nearly any research domain. Did articles
published in The London Times show greater psychological immediacy following
the proliferation of the electric telegraph? What are the psychological underpinnings
of different types of humor across cultures? Perhaps most importantly, was that
Facebook friend who sent you an amorous personal message at 3 am really “just
kidding,” or were they being authentic? Answers to these questions and others are
now just a few short clicks away. Having read this chapter, you possess a robust
and rather advanced knowledge set that allows you to perform psychological text
analysis in your own research.
Generally speaking, perhaps the strongest advantages of psychological text
analysis techniques include their wide availability, their extensive validation, and
their ease of use. If desired, you can begin using these techniques within minutes
of finishing this chapter. Texts that have been closely pored over for decades,
and even centuries, have hidden psychological information embedded within them
that is waiting to be revealed, and you could very well be the first person to make
previously unimagined discoveries.
Each of the methods introduced in this chapter has its own strengths: LIWC is
extremely easy to use “out of the box,” whereas MEM (and the MEH software)
allows for more fine-tuned analyses of text collections, albeit with a bit more
effort/analysis required. Both types of methods, and particularly LIWC, have seen
extensive use in psychological research and have been adopted by other fields as
well, including computer/information sciences, sociology, business, and medicine,
to name a few. Such methods are adaptable to virtually any field of interest that
involves humans, and new discoveries are being made almost every day.
As technology pulls disparate fields closer together, unprecedented insights will
result from combining techniques and information hailing from different disciplines.
New techniques like automated psychological text analysis will not wholly replace
subjective readings of texts but, rather, these two approaches will strengthen and
complement each other. Various methods, including historical reports, psychologi-
cal text analysis, subjective interpretation, and automated methods of understanding
within- and between-text relationships, can all inform one another to create benefits
that no single method in isolation could ever hope to achieve. As new techniques
emerge, they too will be integrated into systems of understanding the human
condition that are currently difficult to imagine. The prospects for newer, faster,
and more accurate understandings of human experience are expanding every day. It
is a very exciting time for innovation.
Acknowledgments Preparation of this chapter was aided by grants from the National Institutes
of Health (5R01GM112697-02), John Templeton Foundation (#48503), and the National Science
Foundation (IIS-1344257). The views, opinions, and findings contained in this chapter are those
of the author and should not be construed as position, policy, or decision of the aforementioned
agencies, unless so designated by other documents. The author would like to thank Elisavet
Makridis, Natalie M. Peluso, James W. Pennebaker, and the anonymous reviewers for their helpful
feedback on earlier versions of this chapter.
When preparing and processing texts using psychological text analysis, there are
some widely accepted (but often unspoken) guidelines that are quite commonplace.
These guidelines serve to ensure accurate insights both while processing texts and
in subsequent statistical analyses. This appendix can be thought of as a basic
“need to know” reference and primer for technical considerations when performing
psychological text analysis. Feel free to treat this appendix as a “cheat sheet” for the
methods covered in this chapter—one that should help to give you a head start in
the world of research with language.

Preparing Texts for Analysis

One of the most common questions that people ask during their first romp into
the world of psychological text analysis is this: “How should I prepare my texts?”
Ultimately, there is no answer to this question that will apply in all cases, as
guidelines will vary as a function of text source, research questions, and goals.
However, there are some basic considerations that apply to virtually all cases. As
a general rule when it comes to the psychological analysis of text, “good enough”
really is “good enough” for most purposes. One could literally spend years preparing
a collection of text files so that they are 100% perfect for analysis; however, the
conceptual (and, more importantly, statistical) gains from doing so are often nil.

Spelling and Spelling Variations

It is tempting to worry about correcting texts so that all words that could potentially
be captured by a dictionary are successfully recognized. Note, however, that word
distributions follow what is known as a Zipf distribution (see Piantadosi 2014),
wherein a relatively small number of words constitute the majority of words actually
seen in texts, verbalizations, and so on. Translated into practical terms, what this
means is that unless a very common word is misspelled with high frequency, it is
unlikely to have a measurable impact on LIWC-based and MEM-based measures
of psychological processes. For example, if a single text has the misspelling “teh”
two times, yet contains 750 uses of other articles in the whole text, the measured
percentage of articles in a text will differ from the actual occurrence of articles by
such a small amount as to be negligible.
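A quick back-of-the-envelope calculation, using the invented counts from the example above, shows just how little a rare misspelling moves the measured percentage:

```python
# Sketch: how much does a rare misspelling shift a LIWC-style percentage?
# Counts mirror the example in the text: 750 recognized articles and 2
# unrecognized instances of "teh" in a hypothetical 10,000-word document.

total_words = 10_000
articles_counted = 750   # "the", "a", "an" recognized by the dictionary
missed = 2               # two occurrences of "teh" the dictionary misses

measured = 100 * articles_counted / total_words
actual = 100 * (articles_counted + missed) / total_words

print(measured, actual)  # 7.5 vs 7.52, a negligible difference
```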
Texts with several high-frequency misspellings, however, may benefit from
correction. Multiple software programs exist that allow users to automatically
batch-correct text files to avoid the tedious job of manual spelling correction
(e.g., GNU Aspell). While most of these applications are useful
only for the technically savvy, other options exist that allow users to find specific
misspellings and replace them with corrections, such as Find and Replace (FAR). Relatedly, regional spelling variants may
benefit from standardization, depending on the nature of one’s research question.
For example, given that the MEM looks for co-occurrences of words, we might
expect the words “bloody,” “neighbour,” and “colour” to co-occur more often than
“bloody,” “neighbor,” and “color.” Unless we are interested in identifying culture-
specific word co-occurrences, we would want to standardize regional variants
to have parallel spellings across all texts, ensuring more accurate psychological measurement.
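A minimal sketch of this kind of standardization, using a tiny illustrative variant map (a real one would be much larger):

```python
# Sketch: standardizing regional spelling variants before an MEM analysis so
# that "colour" and "color" are counted as the same token. The variant map
# below is an illustrative subset, not an exhaustive list.
import re

UK_TO_US = {"colour": "color", "neighbour": "neighbor", "behaviour": "behavior"}

def standardize(text):
    # \b word boundaries keep us from rewriting substrings of longer words
    pattern = re.compile(r"\b(" + "|".join(UK_TO_US) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: UK_TO_US[m.group(0).lower()], text)

print(standardize("the colour of my neighbour"))
```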
Finally, certain special circumstances arise when analyzing transcribed verbal
exchanges, particularly when using software such as LIWC. Several categories of
“utterances” (such as nonfluency words like “uh” and “um,” and filler words like
“like” and “y’know”) are psychologically meaningful (Laserna et al. 2014), but are
often transcribed idiosyncratically according to certain traditions. The word “like” is
particularly problematic given its various meanings as a result of homography (e.g.,
expressing evaluation—“I like him”—or filling spaces—“I, like, love sandwiches”).
Primarily in the case of verbal transcriptions, many utterances must be converted to
software-specific tokens that are recognized explicitly as filler words to improve text
analysis accuracy (LIWC, for example, uses “rrlike” for filler word recognition).
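As a rough illustration, a comma-delimited heuristic can flag many filler uses of “like”; this simple rule is an assumption on my part, and real transcripts generally need manual checking given the word’s ambiguity:

```python
# Sketch: converting transcribed filler words to LIWC-recognized tokens.
# "rrlike" is the token mentioned in the text for filler "like"; the
# comma-delimited heuristic below is only an assumption, since filler uses
# often need human judgment to identify reliably.
import re

def mark_fillers(transcript):
    # A "like" set off by commas is usually a filler, not a verb or preposition
    return re.sub(r",\s*like\s*,", ", rrlike,", transcript, flags=re.IGNORECASE)

print(mark_fillers("I, like, love sandwiches"))  # filler converted
print(mark_fillers("I like him"))                # evaluative use untouched
```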

Working with the MEM and MEM Results

When performing any type of topic modeling procedure, including the MEM,
several decisions must be made by the researcher performing the analysis. These
typically include answering questions such as “What is the correct number of
themes to extract?” and “How do I know what label to give each of these themes?”
Topic modeling results are occasionally dismissed as purely subjective; however,
this is seldom the case. Indeed, while some steps occasionally require some
(arguably) arbitrary decisions, such decisions are typically made by relying on
domain expertise. The author’s recommendation is that, when in doubt, the best
course of action is to consult with experts familiar with the type of research being conducted.

How to Extract/Understand Themes Using MEM

While the heart of the MEM is a statistical procedure known as a Principal
Components Analysis (PCA), most of the typical guidelines that are recommended
for a PCA do not extend to the modeling of language patterns. Instead, it may be
more useful to think of the MEM as co-opting a statistical procedure to reach an
end-goal, rather than the PCA being the goal itself.
When extracting themes using a PCA, a researcher must specify a k parameter,
with k being the number of themes for which a PCA must solve. What is the ideal
k parameter? In other words, how many themes should be extracted? Is it 10? 15?
50? This is a contentious issue in virtually every field that uses some form of topic
modeling. The best answer at this time comes in the form of another question: “What
makes sense given your research question?”
When attempting to perform a psychological analysis of text, domain expertise
on psychological constructs is immensely helpful. The primary recommendation
for determining the optimal k parameter is to test multiple k’s, settling on the k
parameter that appears to best represent the problem space. The ideal k parameter is
also directly influenced by the amount of data that you are processing. If you have
an extremely large number of wordy texts, you will be able to extract many, many
more themes than if you are analyzing 100 posts made to Twitter.
Even with extremely large datasets, however, the smallest number of themes that
can be coherently extracted is typically optimal, particularly in cases of
psychological research. Whereas it is not uncommon to see sophisticated studies
that extract hundreds (or even thousands) of topics from a corpus, 95% of these
topics will end up being an uninterpretable and highly inter-correlated grab-bag
of words that does not represent anything in particular. Extracting large numbers
of topics may still be highly useful in predictive modeling; however, this practice
is often problematic from a social sciences/digital humanities perspective and also
leads to serious concerns about the replicability of one’s findings.
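In code, trying several k values is simple. The sketch below runs the MEM’s core PCA step (here implemented via SVD) on a toy binary document-by-word matrix; a real analysis would use hundreds of texts and a properly prepared content-word vocabulary.

```python
# Sketch of the MEM's core step: a PCA over a binary document-by-word matrix,
# trying several candidate values of k. The corpus here is a toy example.
import numpy as np

docs = [
    "red blue dress", "green shirt blue", "red green brown",
    "dress shirt jeans", "shorts jeans shirt", "blue brown shorts",
]
vocab = sorted({w for d in docs for w in d.split()})
# Binary occurrence matrix: 1 if word j appears in document i, else 0
X = np.array([[1.0 if w in d.split() else 0.0 for w in vocab] for d in docs])

Xc = X - X.mean(axis=0)                  # center columns, as a PCA requires
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

for k in (2, 3, 4):                      # inspect several candidate k's
    retained = (S[:k] ** 2).sum() / (S ** 2).sum()
    print(f"k={k}: {retained:.0%} of variance retained")
    # Word loadings for each of the k themes are the rows of Vt[:k]
```

In practice one would inspect the word loadings for each k and retain the smallest solution whose themes remain interpretable.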
Finally, when it comes to labeling themes that have been extracted using MEM
or other topic models, there are no hard and fast rules. Labels assigned to word
clusters should primarily be treated as a shorthand for referring to those clusters and
should not be treated as objective operationalizations during research in most cases.
For example, one researcher may see the words “happy,” “cry,” and “shout” cluster
together in a MEM analysis and label this as a broad “emotion” theme. Another
researcher may see the same word clusters and see a “joyous” theme, believing that
the words “cry” and “shout” are being used in a positive manner in conjunction with
happiness. Most of the time, the labels are fairly simple to agree upon (e.g., a word
cluster comprised of “college,” “study,” “test,” and “class” is unlikely to be anything
other than a broader “school” theme). However, when in doubt, one of the best ways
to understand or interpret a questionable theme is to look for texts that score high
for that theme, then read them closely to see how the theme-related words are used.

How to Score Texts for MEM Themes

We will typically want to score each text in our corpus for each theme that
has been extracted. Once themes have been extracted using the MEM, there
are two primary ways to make this information useful for subsequent statistical
modeling. In the first method, “factor scores” can be calculated using a statistical
approach, such as regression scoring. In this approach, each word is weighted
for its “representativeness” of a given theme—these weights correspond to the
“factor loadings” of each word to each theme. For example, if a theme is extracted
pertaining to animals, the word “cat” may have a factor loading of 0.5, the word
“bird” has a loading of 0.25, and so on. These scores can be used in a multiple
linear regression model:
y = (a × x) + (b × z) + … etc., or
Animal Theme = (0.5 × cat) + (0.25 × bird) + … etc.
This procedure is used to score a text for the “animal” theme, and so on for all
other themes extracted during the MEM. Some statistical software, such as IBM’s
SPSS, have options that allow users to perform this scoring method automatically.
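A simplified sketch of this weighting logic follows; real factor scoring operates on standardized variables and a full loading matrix, while the loadings below simply echo the hypothetical “animal” example.

```python
# Sketch of weight-based theme scoring: a text's theme score is the
# loading-weighted sum of its word frequencies. The loadings echo the
# "animal" example from the text; "dog" and the frequencies are invented.

loadings = {"animal": {"cat": 0.5, "bird": 0.25, "dog": 0.45}}

def theme_score(word_freqs, theme):
    """word_freqs maps each word to its relative frequency in one text."""
    return sum(weight * word_freqs.get(word, 0.0)
               for word, weight in loadings[theme].items())

text_freqs = {"cat": 0.02, "bird": 0.01, "the": 0.07}
print(theme_score(text_freqs, "animal"))  # 0.5*0.02 + 0.25*0.01 = 0.0125
```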
If using the regression approach to scoring texts for MEM themes, it is extremely
important to note that the MEM typically requires that the PCA be performed
with something called a VARIMAX rotation. Without going too far into the
details, an orthogonal axis rotation such as VARIMAX ensures that all themes are
mathematically 100% independent of each other. In practical terms, this means that
regression scores for themes will be perfectly uncorrelated, which is often not an
accurate reflection of psychological processes. This method is occasionally used in
the literature but is not recommended for most purposes unless you have a well-
articulated reason for doing so.
The second method for scoring MEM themes in texts is conceptually much sim-
pler and strongly recommended. Essentially, this alternative method uses the MEM
as a means of theme discovery or, in other words, simply determining what common
themes exist in a corpus, as well as which words are indicative of which themes.
Following this, a return to the word counting approach used by LIWC is used to
score texts for each theme. For example, imagine that your use of the MEM uncovers
two broad themes in your texts: a color theme (including the words “red,” “blue,”
“green,” and “brown”) and a clothing theme (including the words “dress,” “shirt,”
“shorts,” and “jeans”). The next step is to create a custom dictionary that places the
“color” and “clothing” words into separate categories. You would then rescan your
texts with software like LIWC to calculate the percentage of words pertaining to
color and clothing, respectively. Unlike with the regression method of scoring texts
(which would force a perfect zero correlation between the two themes), one would
likely find that use of color words and clothing words shows modest bivariate corre-
lation, which makes sense from both an intuitive and a psychological perspective.
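This dictionary-based scoring can be sketched in a few lines; the themes and words below are the hypothetical “color” and “clothing” clusters from the example.

```python
# Sketch of the recommended dictionary approach: once the MEM reveals which
# words mark each theme, score texts as the percentage of words falling in
# each category (the same counting logic LIWC uses).

themes = {
    "color":    {"red", "blue", "green", "brown"},
    "clothing": {"dress", "shirt", "shorts", "jeans"},
}

def score_text(text):
    words = text.lower().split()
    return {theme: 100 * sum(w in vocab for w in words) / len(words)
            for theme, vocab in themes.items()}

print(score_text("She wore a blue dress with red shorts"))
```

Unlike regression scores from an orthogonal rotation, percentages produced this way are free to correlate across themes.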
References

J.L. Baddeley, G.R. Daniel, J.W. Pennebaker, How Henry Hellyer’s use of language foretold his
suicide. Crisis 32(5), 288–292 (2011)
J.L. Borelli, K.A. Ramsook, P. Smiley, D. Kyle Bond, J.L. West, K.H. Buttitta, Language
matching among mother-child dyads: associations with child attachment and emotion reactiv-
ity. Soc. Dev. (2016). doi:10.1111/sode.12200
R.L. Boyd, MEH: Meaning Extraction Helper (Version 1.4.13) [Software]. Available from http:// (2016)
R.L. Boyd, J.W. Pennebaker, Did Shakespeare write double falsehood? Identifying individuals by
creating psychological signatures with text analysis. Psychol. Sci. 26(5), 570–582 (2015)
R.L. Boyd, S.R. Wilson, J.W. Pennebaker, M. Kosinski, D.J. Stillwell, R. Mihalcea, Values in
words: using language to evaluate and understand personal values, in Proceedings of the Ninth
International AAAI Conference on Web and Social Media (2015), pp. 31–40
C.K. Chung, J.W. Pennebaker, Revealing dimensions of thinking in open-ended self-descriptions:
an automated meaning extraction method for natural language. J. Res. Pers. 42(1), 96–132 (2008)
M.A. Cohn, M.R. Mehl, J.W. Pennebaker, Linguistic markers of psychological change surrounding
September 11, 2001. Psychol. Sci. 15(10), 687–693 (2004)
M. De Choudhury, M. Gamon, S. Counts, E. Horvitz, Predicting depression via social media, in
Annual Proceedings of the 2013 AAAI Conference on Web and Social Media (ICWSM) (2013)
J. Dewey, How we think (D.C. Heath, Boston, 1910)
M.J. Egnoto, D.J. Griffin, Analyzing language in suicide notes and legacy tokens: investigating
clues to harm of self and harm to others in writing. Crisis 37(2), 140–147 (2016)
M. Fernández-Cabana, A. García-Caballero, M.T. Alves-Pérez, M.J. García-García, R. Mateos,
Suicidal traits in Marilyn Monroe’s fragments. Crisis 34(2), 124–130 (2013)
A.K. Fetterman, M.D. Robinson, Do you use your head or follow your heart? Self-location predicts
personality, emotion, decision making, and performance. J. Pers. Soc. Psychol. 105, 316–334 (2013)
L. Flekova, I. Gurevych, Personality profiling of fictional characters using sense-level links
between lexical resources, in Proceedings of the 2015 Conference on Empirical Methods in
Natural Language Processing (EMNLP) (2015)
S. Freud, On Aphasia (International Universities Press, London, 1891)
E. Gortner, J.W. Pennebaker, The archival anatomy of a disaster: media coverage and community-
wide health effects of the Texas A&M bonfire tragedy. J. Soc. Clin. Psychol. 22(5), 580–603 (2003)
D. Holmes, G.W. Alpers, T. Ismailji, C. Classen, T. Wales, V. Cheasty, A. Miller, C. Koopman,
Cognitive and emotional processing in narratives of women abused by intimate partners.
Violence Against Women 13(11), 1192–1205 (2007)
M.E. Ireland, J.W. Pennebaker, Language style matching in writing: synchrony in essays,
correspondence, and poetry. J. Pers. Soc. Psychol. 99(3), 549–571 (2010)
M.E. Ireland, R.B. Slatcher, P.W. Eastwick, L.E. Scissors, E.J. Finkel, J.W. Pennebaker, Language
style matching predicts relationship initiation and stability. Psychol. Sci. 22(1), 39–44 (2011)
O.P. John, L.P. Naumann, C.J. Soto, Paradigm shift to the integrative big-five trait taxonomy:
history, measurement, and conceptual issues, in Handbook of Personality: Theory and Research,
ed. by O.P. John, R.W. Robins, L.A. Pervin (Guilford Press, New York, 2008), pp. 114–158
K. Jordan, J.W. Pennebaker, How the candidates are thinking: analytic versus narrative thinking
styles. Retrieved January 21, 2016, from
how-the-candidates-are-thinking-analytic-versus-narrative-thinking-styles/ (2016)
P. Juola, Authorship attribution. Found. Trends Inf. Retr. 1(3), 233 (2006)
D. Kahneman, Thinking, Fast and Slow (Farrar, Straus and Giroux, New York, 2011)
T. Kasser, Lucy in the Mind of Lennon (Oxford University Press, New York, 2013)
M. Komisin, C. Guinn, Identifying personality types using document classification methods, in
Proceedings of the Twenty-Fifth International Florida Artificial Intelligence Research Society
Conference (2012)
M. Koppel, J. Schler, S. Argamon, Computational methods in authorship attribution. J. Am. Soc.
Inf. Sci. Technol. 60(1), 9–26 (2008)
C.M. Laserna, Y. Seih, J.W. Pennebaker, Um ... who like says you know: filler word use as a
function of age and gender. J. Lang. Soc. Psychol. 33(3), 328–338 (2014)
H.D. Lasswell, D. Lerner, I. De Sola Pool, The Comparative Study of Symbols: An Introduction
(Stanford University Press, Stanford, 1952)
M. Liberman, Linguistic dominance in House of Cards. Retrieved March 12, 2015, from http://
R.D. Lowe, D. Heim, C.K. Chung, J.C. Duffy, J.B. Davies, J.W. Pennebaker, In verbis, vinum?
Relating themes in an open-ended writing task to alcohol behaviors. Appetite 68, 8–13 (2013)
F. Mairesse, M.A. Walker, M.R. Mehl, R.K. Moore, Using linguistic cues for the automatic
recognition of personality and conversation in text. J. Artif. Intell. Res. 30(1), 457–500 (2007)
C. Martindale, The grammar of altered states of consciousness: a semiotic reinterpretation of
aspects of psychoanalytic theory. Psychoanal. Contemp. Thought 4, 331–354 (1975)
D.C. McClelland, J.W. Atkinson, R.A. Clark, E.L. Lowell, The Achievement Motive (Irvington,
Oxford, 1953)
E. Mergenthaler, Emotion-abstraction patterns in verbatim protocols: a new way of describing
psychotherapeutic processes. J. Consult. Clin. Psychol. 64(6), 1306–1315 (1996)
G.A. Miller, The Science of Words (Scientific American Library, New York, 1995)
F. Moretti, Distant Reading (Verso, London, 2013)
J.W. Pennebaker, J.F. Evans, Expressive Writing: Words that Heal (Idyll Arbor, Enumclaw, 2014)
J.W. Pennebaker, M.E. Francis, Linguistic Inquiry and Word Count (LIWC): A Computer-Based
Text Analysis Program (Erlbaum, Mahwah, NJ, 1999)
J.W. Pennebaker, L.A. King, Linguistic styles: language use as an individual difference. J. Pers.
Soc. Psychol. 77(6), 1296–1312 (1999)
J.W. Pennebaker, L.D. Stone, Words of wisdom: language use over the life span. J. Pers. Soc.
Psychol. 85(2), 291–301 (2003)
J.W. Pennebaker, T.J. Mayne, M.E. Francis, Linguistic predictors of adaptive bereavement. J. Pers.
Soc. Psychol. 72, 863–871 (1997)
J.W. Pennebaker, C.K. Chung, J. Frazee, G.M. Lavergne, D.I. Beaver, When small words foretell
academic success: the case of college admissions essays. PLoS One 9(12), e115844 (2014)
J.W. Pennebaker, R.L. Boyd, K. Jordan, K. Blackburn, The Development and Psychometric
Properties of LIWC2015 (University of Texas, Austin, TX, 2015a)
J.W. Pennebaker, R.J. Booth, R.L. Boyd, M.E. Francis, Linguistic Inquiry and Word Count:
LIWC2015 (Pennebaker Conglomerates, Austin, TX, 2015b)
K.J. Petrie, J.W. Pennebaker, B. Sivertsen, Things we said today: a linguistic analysis of the
Beatles. Psychol. Aesthet. Creat. Arts 2(4), 197–202 (2008)
S.T. Piantadosi, Zipf’s word frequency law in natural language: a critical review and future
directions. Psychon. Bull. Rev. 21(5), 1112–1130 (2014)
C.S. Pulverman, R.L. Boyd, A.M. Stanton, C.M. Meston, Changes in the sexual self-schema
of women with a history of childhood sexual abuse following expressive writing treatment.
Psychol. Trauma. 9(2), 181–188 (2016). doi:10.1037/tra0000163
S.A. Rains, Language style matching as a predictor of perceived social support in computer-
mediated interaction among individuals coping with illness. Commun. Res. 43(5), 694–712 (2016)
N. Ramirez-Esparza, C.K. Chung, E. Kacewicz, J.W. Pennebaker, The psychology of word use in
depression forums in English and in Spanish: testing two text analytic approaches, in Annual
Proceedings of the 2008 AAAI Conference on Web and Social Media (ICWSM) (2008)
B.H. Richardson, P.J. Taylor, B. Snook, S.M. Conchi, C. Bennell, Language style matching and
police interrogation outcomes. Law Hum. Behav. 38(4), 357–366 (2014)
D.M. Romero, R.I. Swaab, B. Uzzi, A.D. Galinsky, Mimicry is presidential: Linguistic style
matching in presidential debates and improved polling numbers. Personal. Soc. Psychol. Bull.
41(10), 1311–1319 (2015)
S. Ross, In praise of overstating the case: a review of Franco Moretti, distant reading. Dig. Humanit.
Q. 8(1), 1 (2014)
R.M. Sapolsky, Why Zebras Don’t Get Ulcers: A Guide To Stress, Stress Related Diseases, and
Coping (W.H. Freeman, New York, 1994)
J. Schler, M. Koppel, S. Argamon, J.W. Pennebaker, Effects of age and gender on blogging, in
Proceedings of the 2005 AAAI Spring Symposium: Computational Approaches to Analyzing
Weblogs (2006)
T.E. Senn, M.P. Carey, P.A. Vanable, Childhood sexual abuse and sexual risk behavior among men
and women attending a sexually transmitted disease clinic. J. Consult. Clin. Psychol. 74(4),
720–731 (2006)
A.M. Stanton, R.L. Boyd, C.S. Pulverman, C.M. Meston, Determining women’s sexual self-
schemas through advanced computerized text analysis. Child Abuse Negl. 46, 78–88 (2015)
S.W. Stirman, J.W. Pennebaker, Word use in the poetry of suicidal and nonsuicidal poets.
Psychosom. Med. 63, 517–522 (2001)
P.J. Stone, D.C. Dunphy, M.S. Smith, D.M. Ogilvie, The General Inquirer: A Computer Approach
to Content Analysis (MIT, Cambridge, 1966)
Y.R. Tausczik, J.W. Pennebaker, The psychological meaning of words: LIWC and computerized
text analysis methods. J. Lang. Soc. Psychol. 29(1), 24–54 (2010)
D. Watson, Mood and Temperament (Guilford Press, New York, 2000)
W. Weintraub, Verbal Behavior in Everyday Life (Springer, New York, 1989)
M. Wolf, C.K. Chung, H. Kordy, Inpatient treatment to online aftercare: e-mailing themes as a
function of therapeutic outcomes. Psychother. Res. 20(1), 71–85 (2010)
T. Yarkoni, Personality in 100,000 words: a large-scale analysis of personality and word use among
bloggers. J. Res. Pers. 44(3), 363–373 (2010)
... The transversal dimensions, or components, are latent structures that can be considered to have a general meaning underlying the terms they are composed of. For this reason, after carefully reinterpreting the emerging components and the words affecting them, a qualitative exploration of them was conducted in order to identify cross-dimensional meanings that could capture the meaning of what emerged through statistical analysis [48]. This bottom-up technique is widely applied in social psychology [48,49] and all those areas in which the aim is to analyse content from large-scale language data. ...
... For this reason, after carefully reinterpreting the emerging components and the words affecting them, a qualitative exploration of them was conducted in order to identify cross-dimensional meanings that could capture the meaning of what emerged through statistical analysis [48]. This bottom-up technique is widely applied in social psychology [48,49] and all those areas in which the aim is to analyse content from large-scale language data. ...
... For a cleaner and more coherent analysis, stop words were deleted from the final dataset of definitions. emerging components and the words affecting them, a qualitative exploration of them was conducted in order to identify cross-dimensional meanings that could capture th meaning of what emerged through statistical analysis [48]. This bottom-up technique widely applied in social psychology [48,49] and all those areas in which the aim is to ana lyse content from large-scale language data. ...
Full-text available
Nowadays, digital platforms are applying some game-linked elements in their services with the aim to attract, retain and shape human interaction. Therefore, it is essential to investigate gamification with regard to its social sustainability. Gamification strategies are strategical in the agri-food industry to reach social and marketing goals. Despite the presence of several literature reviews on gamification, no study proposes a reflection on the meaning and purposes of gamification in the agri-food industry. This paper aims to identify the core dimensions underlying the concept of gamification, analysing its definitions and purposes through a systematic literature review, content analysis and principal component analysis. Eight core dimensions were detected leading to the conception of a new definition of gamification in the agri-food industry. Our results highlighted the potential of gamification to impact society, leaving points of reflection on how it can be made more inclusive and effective. Theoretical and practical implications were provided for academics, researchers, agri-food companies, policy makers, management engineers, technology makers, marketers and practitioners. The novelty of this study lies in the application of a social psychology methodology to give meaning to the words, overcoming the limits derived from qualitative research based on, only, content analysis.
... We also removed short entries, as these texts did not contain enough content words to contribute to stable themes and so can be considered partial or missing data. A 25-50-word minimum word count is recommended for texts submitted to these types of automated analyses, especially when the total number of texts is relatively small (as ours is; for discussion, see Boyd, 2017). Applying a 25-word minimum resulted in the removal of an additional 32.10% of descriptions, for a new total of 3,960 event descriptions (M = 79, SD = 31, min = 38, max = 164 per participant), ranging in length between 25 and 509 words (M = 131, SD = 86). ...
... Inspection of the scree plot for the PCA on each set of unigrams suggested approximately 10 components at the 'elbow' with eigenvalues above 1 (Cattell, 1966). The scree test is an established method of selecting the number of themes to extract in the MEM (Rodríguez-Arauz et al., 2017), where the goal is to retain the smallest number that are interpretable or internally coherent (Boyd, 2017). We extracted the elbow ± 5 (i.e., 5, 10, and 15 components) so that we could also assess the impact of this parameter on our results. ...
Unlabelled: Emotional granularity is the ability to create differentiated and nuanced emotional experiences and is associated with positive health outcomes. Individual differences in granularity are hypothesized to reflect differences in emotion concepts, which are informed by prior experience and impact current and future experience. Greater variation in experience, then, should be related to the rich and diverse emotion concepts that support higher granularity. Using natural language processing methods, we analyzed descriptions of everyday events to estimate the diversity of contexts and activities encountered by participants. Across three studies varying in language (English, Dutch) and modality (written, spoken), we found that participants who referred to a more varied and balanced set of contexts and activities reported more differentiated and nuanced negative emotions. Experiential diversity was not consistently associated with granularity for positive emotions. We discuss the contents of daily life as a potential source and outcome of individual differences in emotion. Supplementary information: The online version contains supplementary material available at 10.1007/s42761-023-00185-2.
... El análisis del vocabulario emocional activo mediante respuestas abiertas ofrece una ventana a la experiencia emocional, especialmente a la forma en que la persona atiende y reflexiona sobre su experiencia (Boyd, 2017;Boyd y Schwartz, 2021;Vine et al. 2020). El enfoque constructivista, además, enfatiza que la influencia percepciónlenguaje es bidireccional: la posesión de determinado vocabulario también influye en la forma en que el/la sujeto construye su experiencia emocional (Brooks et al. 2017;Feldman-Barrett, 2017;Feldman-Barrett et al., 2007). ...
Full-text available
Introducción. La experiencia emocional se ha propuesto como un fenómeno que puede ser tratado en la enseñanza de la filosofía. En este artículo, tras una breve revisión teórica, se establece una distinción conceptual relevante y se reporta un estudio al respecto. Metodología. Se utilizó un diseño exploratorio, utilizando una encuesta cualitativa donde docentes de filosofía (n= 45) escribieron respuestas abiertas sobre autores/as, temas y estrategias/metodologías para abordar la experiencia y el vocabulario emocional. Para las estrategias/metodologías las respuestas se codificaron inductivamente, mientras que los/as autores/as y temas se analizaron con categorías pre-estructuradas. Resultados/Discusión. La metodología utilizada es útil para explorar la diversidad. Esta es muy amplia en las respuestas sobre autores/as que pueden utilizarse para abordar la experiencia y el vocabulario emocional en el aula (117). Aunque el universo de “temas filosóficos-emocionales” que pueden ser tratados también es diverso, se observa una cierta convergencia en temas típicamente filosóficos, como “felicidad”, “amor”, “miedo” “libertad”, “angustia”, etc. Al respecto de las metodologías y estrategias, la diversidad es menor, presentándose el diálogo como el recurso central, en conjunto con la lectoescritura, el uso de recursos artísticos y metodologías activas/corporizadas. Conclusión. La enseñanza de la filosofía aparece como un territorio fértil para el abordaje de la experiencia emocional; sin embargo, definir su deseabilidad implica tanto más investigación como la discusión filosófica acerca de cuáles son los objetivos de la enseñanza de la filosofía.
... sg., measuring the frequency of first person singular pronouns, was included because previous research pervasively links it to mental distress [84]. Only posts with at least 25 words were included in the analyses because LIWC estimates for shorter texts are less reliable [60,85]. ...
Full-text available
Background Mental health (MH) peer online forums offer robust support where internet access is common, but healthcare is not, e.g., in countries with under-resourced MH support, rural areas, and during pandemics. Despite their widespread use, little is known about who posts in such forums, and in what mood states. The discussion platform Reddit is ideally suited to study this as it hosts forums (subreddits) for MH and non-MH topics. In bipolar disorder (BD), where extreme mood states are core defining features, mood influences are particularly relevant. Objectives This exploratory study investigated posting patterns of Reddit users with a self-reported BD diagnosis and the associations between posting and emotions, specifically: 1) What proportion of the identified users posts in MH versus non-MH subreddits? 2) What differences exist in the emotions that they express in MH or non-MH subreddit posts? 3) How does mood differ between those users who post in MH subreddits compared to those who only post in non-MH subreddits? Methods Reddit users were automatically identified via self-reported BD diagnosis statements and all their 2005–2019 posts were downloaded. First, the percentages of users who posted only in MH (non-MH) subreddits were calculated. Second, affective vocabulary use was compared in MH versus non-MH subreddits by measuring the frequency of words associated with positive emotions, anxiety, sadness, anger, and first-person singular pronouns via the LIWC text analysis tool. Third, a logistic regression distinguished users who did versus did not post in MH subreddits, using the same LIWC variables (measured from users’ non-MH subreddit posts) as predictors, controlling for age, gender, active days, and mean posts/day. Results 1) Two thirds of the identified 19,685 users with a self-reported BD diagnosis posted in both MH and non-MH subreddits. 
2) Users who posted in both MH and non-MH subreddits exhibited less positive emotion but more anxiety and sadness and used more first-person singular pronouns in their MH subreddit posts. 3) Feminine gender, higher positive emotion, anxiety, and sadness were significantly associated with posting in MH subreddits. Conclusions Many Reddit users who disclose a BD diagnosis use a single account to discuss MH and other concerns. Future work should determine whether users exhibit more anxiety and sadness in their MH subreddit posts because they more readily post in MH subreddits when experiencing lower mood or because they feel more able to express negative emotions in these spaces. MH forums may reflect the views of people who experience more extreme mood (outside of MH subreddits) compared to people who do not post in MH subreddits. These findings can be useful for MH professionals to discuss online forums with their clients. For example, they may caution them that forums may underrepresent people living well with BD.
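The dictionary-based counting that LIWC performs in studies like this can be sketched in a few lines. The category word lists below are toy stand-ins (the real LIWC dictionaries are proprietary and far larger), so only the general counting logic is illustrated, not the study's actual measurement.

```python
import re

# Toy word lists standing in for LIWC categories; illustrative only
CATEGORIES = {
    "posemo": {"happy", "good", "love"},
    "anx": {"worried", "nervous", "afraid"},
    "sad": {"sad", "hopeless", "cry"},
    "i": {"i", "me", "my", "mine"},
}

def category_rates(text):
    """Return each category's hit rate as a percentage of total words."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {cat: 100 * sum(w in vocab for w in words) / total
            for cat, vocab in CATEGORIES.items()}

rates = category_rates("I am worried and sad, but my dog makes me happy.")
print(rates)
```

Per-user rates of this kind (computed over non-MH posts) are what would then enter the logistic regression as predictors.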
... Participants' essays were corrected for common misspellings (e.g., "boyfreind" instead of "boyfriend") and elongations (e.g., "sooo unhappy"). All written responses containing fewer than 50 words were removed from the dataset, to ensure validity of measurement and reliable scores (see, e.g., Boyd, 2017;Cutler et al., 2021;Pennebaker & Ireland, 2011). On average (after removing texts with < 50 words), participants wrote 211.60 words (SD = 186.22). ...
Borderline personality disorder (BPD) is characterized by severe interpersonal dysfunction, yet the underlying nature of such dysfunction remains poorly understood. The present study adopted a behavioral approach to more objectively describe the social-cognitive contributors to interpersonal dysfunction in BPD. Participants (N = 530) completed an online survey comprising validated measures of BPD features and other problematic interpersonal traits (e.g., narcissism), as well as a writing prompt where they were asked to share their personal thoughts about relationships. Computerized language analysis methods were used to quantify various psychosocial dimensions of participants’ writing, which were incorporated into a principal component analysis. Analyses revealed four core social dimensions of thought: (1) Connectedness/Intimacy; (2) Immediacy; (3) Social Rumination; (4) Negative Affect. All four dimensions correlated with BPD features in intuitive ways, some of which were specific to BPD. This study highlights the value of natural language analysis to explore fundamental dimensions of personality disorder.
... Participants' written essays were corrected for common misspellings and idiosyncrasies prior to analysis, and all texts containing fewer than 50 words were excluded from subsequent analysis to ensure reliability of language analysis and validity of measurement (see, e.g., Boyd, 2017;Cutler et al., 2021). Participants wrote an average of 211.60 words (SD = 186.22) ...
Full-text available
Background Emotion dysregulation is a characteristic central to borderline personality disorder (BPD). Valuably, verbal behaviour can provide a unique perspective for studying emotion dysregulation in BPD, with recent research suggesting that the varieties of emotion words one actively uses (i.e., active emotion vocabularies [EVs]) reflect habitual experience and potential dysregulation therein. Accordingly, the present research examined associations between BPD and active EVs across two studies. Methods Study 1 (N = 530) comprised a large non-clinical sample recruited from online forums, whereby BPD traits were measured via self-report. Study 2 (N = 64 couples) consisted of mixed-gender romantic couples in which the woman had a BPD diagnosis, as well as a control group of couples. In both studies, participants’ verbal behaviours were analysed to calculate their active EVs. Results Results from both studies revealed BPD to be associated with larger negative EV (i.e., using a broad variation of unique negative emotion words), which remained robust when controlling for general vocabulary size and negative affect word frequency in Study 2. The association between BPD and negative EV was insensitive to context. Limitations Limitations of this research include: 1) the absence of a clinical control group; 2) typical constraints surrounding word-counting approaches; and 3) the cross-sectional design (causality cannot be inferred). Conclusions Our findings contribute to BPD theory as well as the broader language and emotion literature. Importantly, these findings provide new insight into how individuals manifesting BPD attend to and represent their emotional experiences, which could be used to inform clinical practice.
... In the context of applying computers to analyze Shakespeare's plays, substantial efforts have been made to verify the authenticity of plays attributed to Shakespeare [Rizvi, 2019; Barber, 2020]. Boyd and Pennebaker [2015] used quantitative analysis techniques [Boyd, 2017] to determine whether "Double Falsehood", published after Shakespeare's death, was likely written by Shakespeare. The analysis attempted to identify psychological signatures of three authors by examining the distribution of words they use, their grammar, and the meaning of the words, and then compared those signatures to the text of the unknown play. ...
Full-text available
The availability of quantitative text analysis methods has provided new ways of analyzing literature in a manner that was not available in the pre-information era. Here we apply comprehensive machine learning analysis to the work of William Shakespeare. The analysis shows clear changes in the style of writing over time, with the most significant changes in the sentence length, frequency of adjectives and adverbs, and the sentiments expressed in the text. Applying machine learning to make a stylometric prediction of the year of the play shows a Pearson correlation of 0.71 between the actual and predicted year, indicating that Shakespeare's writing style as reflected by the quantitative measurements changed over time. Additionally, it shows that the stylometrics of some of the plays is more similar to plays written either before or after the year they were written. For instance, Romeo and Juliet is dated 1596, but is more similar in stylometrics to plays written by Shakespeare after 1600. The source code for the analysis is available for free download.
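The year-prediction analysis described in that abstract can be mimicked in miniature: derive stylometric features per play, regress year on them, and correlate predicted with actual years. Everything below is simulated (the feature values, trends, and play count are invented), so it only illustrates the shape of the computation, not Shakespeare's actual stylometrics.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: 30 "plays" with years and three stylometric features
# (mean sentence length, adjective rate, adverb rate), with an invented
# drift of sentence length and adjective rate over time
years = rng.integers(1590, 1614, size=30).astype(float)
features = np.column_stack([
    20 - 0.2 * (years - 1590) + rng.normal(0, 1, 30),
    0.05 + 0.001 * (years - 1590) + rng.normal(0, 0.005, 30),
    rng.normal(0.04, 0.005, 30),
])

# Least-squares "year prediction" from the stylometric features
X = np.column_stack([features, np.ones(len(years))])
coef, *_ = np.linalg.lstsq(X, years, rcond=None)
predicted = X @ coef

# Pearson correlation between actual and predicted year
r = np.corrcoef(years, predicted)[0, 1]
print(round(r, 2))
```

With a real corpus, the features would be computed from the play texts and the prediction evaluated out-of-sample; here the in-sample fit simply shows how a correlation between actual and predicted year arises.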
... Saifuddin Ahmed et al. studied whether online protest activities have the same emotional underpinnings as offline protest actions for sustaining and nourishing a social movement, and how these emotions change across different stages of the movement [9]. Ryan L. Boyd highlighted how language may reveal profound insights into the minds of others using well-established and straightforward psychometric approaches [10]. ...
Conference Paper
Full-text available
Recently, the use of computerized text analysis tools to assess an individual's linguistic, emotional, and psychological characteristics has exploded in the field of empirical psychology. As a result, information about what people convey through their words can be swiftly and reliably extracted and analyzed. The key purpose of this research work is to analyze text data to assess linguistic and emotional characteristics with the help of computer-assisted text analysis tools. The analysis employed two widely available text and sentiment analysis tools, Empath and LIWC. The text data consisted of children's storybook reviews, written by children for children. The Empath and LIWC tools were used to measure the reviewers' sentiment, analytical ability, and cognition level. Finally, by calculating the Pearson correlation coefficient for the selected variables, it is inferred that Empath shares a high correlation with LIWC.
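A Pearson correlation of the kind used to compare Empath and LIWC scores can be computed directly. The two score vectors below are invented for illustration and do not come from the study; only the coefficient's definition is shown.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-review scores from two tools over the same five texts
empath_scores = [0.12, 0.30, 0.25, 0.40, 0.05]
liwc_scores = [1.1, 2.9, 2.4, 4.2, 0.6]
r = pearson(empath_scores, liwc_scores)
print(round(r, 3))
```

A value of r near 1 indicates that the two tools rank the texts in nearly the same way, which is the sense in which Empath can be said to "share a high correlation" with LIWC.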
... The resulting thematic analysis entailed both induction and deduction, with the conceptual background pointing to categories and keywords for the legislative data assessment (Vourvachis & Woodward, 2015). The content analysis extended to legislative, semi-legislative, and policy documents (Krippendorff, 2013) and entailed both quantitative aspects, represented by automatic word counts (frequency and concentration of pre-set keywords), and qualitative aspects (Kuckartz, 2014), revealed by combining manual scoring and glossing with automatic processing by artificial intelligence such as Linguistic Inquiry and Word Count (LIWC) (Boyd, 2017; Tausczik & Pennebaker, 2010), in particular LIWC-22, exploring 1,000 words. Namely, LIWC is suggested as the gold standard in software for analysing word use; it is suitable for the assessment of legislative, semi-legislative, and policy documents, and it allows different assessments based on the type of document (such as personal writing or formal writing). ...
Full-text available
Objective: The objective of the article is to explore and assess whether the SFDR legal framework creates a legitimate, effective, and efficient mechanism that supports a genuinely sustainable investment and eliminates greenwashing and other trade-offs. It targets the Regulation (EU) 2019/2088 on sustainability-related disclosures in the financial service sector aka SFDR, which sets a law duty on financial market participants and advisers concerning information about sustainability (Art. 1). Corporate Social Responsibility (CSR) parasitic practices, such as greenwashing, are to be eliminated in disclosures, communications, and internet pages (Art. 9 – Art. 13) by appropriate information (Art. 1(17)) and the principle of doing no significant harm (Art. 2a). Research Design & Methods: A deep holistic five-step chronological contextual analysis of key legislative and semi-legislative instruments with LIWC assessment was performed. It was supported with a comparative and teleological interpretation and refreshed with Socratic questions. Findings: The research led to four rather unexpected propositions: (i) the endorsement of SFDR by EU institutions varies, (ii) key instruments are expressed neutrally and technically but their authenticity varies, (iii) morality appears to be avoided, and (iv) the interpretation litigates against an artificial disassociation of concepts linked to sustainability, CSR, and shared values. Implications & Recommendations: Since the performed analysis was instantaneous and textual and led to rather unexpected propositions, it should be juxtaposed and extended by adding the longitudinal dimension, the applied dimension, and the outside perspective along with empirical field observation. Contribution & Value Added: This is a pioneering study regarding the wording assessment of the EU law on sustainability.
Considering the critical importance of a legitimate, effective, and efficient legal framework in this area and the pre-existing academic vacuum regarding such an exploration of SFDR and related instruments, this contribution is a valuable first step.
It has long been suggested that films have changed the way we listen, but cinema’s contribution to broader listening cultures has only recently started to receive serious academic attention. Taking this issue as its central topic, The Oxford Handbook of Cinematic Listening explores—from philosophical, archival, empirical, and analytical perspectives—the genealogies of cinema’s audiovisual practices, the relationship between film aesthetics and listening protocols, and the extension of cinematic modes of listening into other media and everyday situations.
Full-text available
Authorship attribution, the science of inferring characteristics of the author from the characteristics of documents written by that author, is a problem with a long history and a wide range of application. Recent work in “non-traditional” authorship attribution demonstrates the practicality of automatically analyzing documents based on authorial style, but the state of the art is confusing. Analyses are difficult to apply, little is known about type or rate of errors, and few “best practices” are available. In part because of this confusion, the field has perhaps had less uptake and general acceptance than is its due. This review surveys the history and present state of the discipline, presenting some comparative results when available. It shows, first, that the discipline is quite successful, even in difficult cases involving small documents in unfamiliar and less studied languages; it further analyzes the types of analysis and features used and tries to determine characteristics of well-performing systems, finally formulating these in a set of recommendations for best practices.
Full-text available
Objective: Sexual self-schemas are cognitive generalizations about the sexual self that influence the processing of sexually pertinent information and guide sexual behavior. Until recently sexual self-schemas were exclusively assessed with self-report instruments. Recent research using the meaning extraction method, an inductive method of topic modeling, identified 7 unique themes of sexual self-schemas: family and development, virginity, abuse, relationship, sexual activity, attraction, and existentialism from essays of 239 women (Stanton, Boyd, Pulverman, & Meston, 2015). In the current study, these themes were used to examine changes in theme prominence after an expressive writing treatment. Method: Women (n = 138) with a history of childhood sexual abuse completed a 5-session expressive writing treatment, and essays on sexual self-schemas written at pretreatment and posttreatment were examined for changes in themes. Results: Women showed a reduction in the prominence of the abuse, family and development, virginity, and attraction themes, and an increase in the existentialism theme. Conclusions: This study supports the validity of the 7 themes identified by Stanton and colleagues (2015) and suggests that expressive writing may aid women with a history of sexual abuse to process their abuse history such that it becomes a less salient aspect of their sexual self-schemas.
Links between mother-infant affective matching and attachment security are well-documented, but research on other types of behavioral matching and attachment security are lacking, as are studies that examine these constructs later in children's development. We examine language style matching (LSM) between mothers and their school-aged children (N = 68), using interviews with each dyad member. As predicted, regressions revealed that higher mother-child relational LSM was associated with greater child attachment security (operationalized as high security, low dismissal), and that higher LSM predicted smaller increases in children's electrodermal response to a relational probe 1.5 years later. Further, mother-child relational LSM was a mediator in the indirect path between children's attachment security and children's reactivity. We discuss the potential utility of LSM as a measure of relationship quality and future studies that could refine our understanding of parent-child language matching.
Background: Identifying precursors that will aid in the discovery of individuals who may harm themselves or others has long been a focus of scholarly research. Aim: This work set out to determine if it is possible to use the legacy tokens of active shooters and notes left by individuals who completed suicide to uncover signals that foreshadow their behavior. Method: A total of 25 suicide notes and 21 legacy tokens were compared with a sample of over 20,000 student writings for a preliminary computer-assisted text analysis to determine what differences can be coded with existing computer software to better identify students who may commit self-harm or harm to others. Results: The results support that text analysis techniques with the Linguistic Inquiry and Word Count (LIWC) tool are effective for identifying suicidal or homicidal writings as distinct from each other and from a variety of student writings in an automated fashion. Conclusion: Findings indicate support for automated identification of writings that were associated with harm to self, harm to others, and various other student writing products. This work begins to uncover the viability of larger-scale, low-cost methods of automatic detection for individuals suffering from harmful ideation.