Expert-Informed Topic Models for Document Set Discovery


Communication Methods and Measures
To cite this article: Eike Mark Rinke, Timo Dobbrick, Charlotte Löb, Cäcilia Zirn & Hartmut Wessler
(2021): Expert-Informed Topic Models for Document Set Discovery, Communication Methods and
Measures, DOI: 10.1080/19312458.2021.1920008
© 2021 The Author(s). Published with license by Taylor & Francis Group, LLC.
Published online: 30 Jun 2021.
Expert-Informed Topic Models for Document Set Discovery
Eike Mark Rinke, Timo Dobbrick, Charlotte Löb, Cäcilia Zirn, and Hartmut Wessler

School of Politics and International Studies, University of Leeds, Leeds, UK; Mannheim Centre for European Social Research (MZES), University of Mannheim, Mannheim, Germany; Institute for Media and Communication Studies, University of Mannheim, Mannheim, Germany; Data and Web Science Group, University of Mannheim, Mannheim, Germany
The rst step in many text-as-data studies is to nd documents that address
a specic topic within a larger document set. Researchers often rely on
simple keyword searches to do this, even though this may introduce con-
siderable selection bias. Such bias may be even greater when researchers
lack the domain knowledge required to make informed search decisions, for
example, in cross-national research or research on unfamiliar social contexts.
We propose expert-informed topic modeling (EITM) as a hybrid approach to
tackle this problem. EITM combines the validity of external domain knowl-
edge captured through expert surveys with probabilistic topic models to
help researchers identify subsets of documents that cover initially unknown
domain-specic topics, such as specic events and debates, that belong to
a researcher-dened master topic. EITM is a exible and ecient approach to
the thematic selection of documents from large text corpora for further
study. We benchmark and validate the method by discovering blog posts
that address the public role of religion within large corpora of Australian,
Swiss, and Turkish blog posts and provide researchers with a complete work-
ow to guide the application of EITM in their own work.
Reproduction Materials: The validation data, code, and any additional materials required to reproduce
all analyses in this article are available in an online appendix, at:
Few recent developments have transformed social science as much as the exploding availability of
large amounts of text. Text-as-data research has become one of the most vital and expanding areas of
activity in the social sciences, with many research projects applying new methods of collecting (e.g.,
Landers et al., 2016; Mahdavi, 2019) and analyzing text data (e.g., Brier & Hopp, 2010; van Atteveldt &
Peng, 2018).
The first step of many text-as-data studies is to find in a larger corpus of texts those documents that
speak to a predefined theme – or master topic – of interest. Simply put, researchers are often not
interested in all text documents available to them but need to find thematically defined subsets of
documents within larger corpora. They may, for example, want to generate manageable text datasets
for further, more refined analysis. Or they may want to identify the population of all documents
belonging to a specific class of topics to draw a sample from for in-depth manual content analysis.
However, researchers often neglect the step of identifying document subsets that speak to a
prespecified theme even though this step can have huge implications for the validity of results coming
from further analysis of these subsets. Different subset selection processes generate different subsets,
which, in turn, can lead to dramatically different statistical inferences and substantive conclusions
prone to unknown biases.
The current state-of-the-art approach to subset discovery from unstructured text corpora is to use
computer-assisted Boolean keyword searches, in which an algorithm proposes suitable keywords and
creates Boolean search queries based on the text corpus at hand (King et al., 2017). But while this
selection approach constitutes a significant advance over the bias-prone ad-hoc selection of keywords
for search queries, it becomes problematic when researchers lack the knowledge required to make
informed decisions about which of the proposed keywords to select. Unfortunately, such lack of
requisite domain knowledge often obtains, especially in research involving text data from unfamiliar
social contexts. In this article, we propose a hybrid approach, expert-informed topic models (EITM),
that combines external domain knowledge captured through expert surveys with fully automated
clustering methods to help researchers identify those subsets of documents that cover domain-specific
topics within a researcher-defined master topic. Our approach uses inputs from expert surveys to
enhance document classification based on probabilistic topic models. We provide researchers with
a complete workflow to guide their own applications and illustrate the approach in a case study
identifying articles that discuss the public role of religion from a large corpus of news articles including
three languages (English, German, and Turkish).
We begin by describing the applied classification problem addressed in this article, outlining the
state of the art of existing approaches to it and highlighting the issues these approaches cannot solve
and which we tackle with the EITM approach. We then discuss the utility of an expert-informed
approach to the discovery of document subsets from large corpora of unstructured text, especially for
cross-national comparative research designs, and explain the seven-step EITM approach to combining
expert input and document classification based on LDA topic models.
Given that model validation is a key practice in text-as-data research (Grimmer & Stewart, 2013),
our approach involves a two-step validation process: First, we validate different models in terms of
how well their estimated topics cohere and capture expert knowledge, using text data from five
countries (Australia, Germany, Switzerland, the USA, and Turkey) and three media types (printed
newspapers, news websites, and political blogs). Second, we take the EITM that is optimally coherent
and representative of expert inputs and evaluate its performance as a thematic classifier using text data
collected from political blogs based in three of these countries (Australia, Switzerland, and Turkey)
and including all three languages. Detailed information about all data-collection, topic-modeling, and
document-classification processes as well as any data, code, and additional materials required to adapt
and implement EITM are available in a comprehensive online appendix,
to help researchers reproduce the method and adapt it to their research projects.
The applied classication problem: Why bring expert domain knowledge to topic
Assume the following situation: A researcher has access to a large collection of news articles from
several countries and wants to compare how minority rights are reported on in these countries.
However, the articles are unlabeled and she knows that only a small – perhaps tiny proportion of
articles, which could still be huge in absolute numbers, will talk about minority rights. Even worse, the
researcher is not familiar with the public debates that happened during the time in which the articles
were collected in some of the countries she wants to study. This means that she will likely be unaware
of important debates relevant to minority rights that took place during data collection and, impor-
tantly, many such articles will not include simple keywords such as “minority” or “minority rights”,
which she would be forced to use in a naïve search query approach to finding the subset of relevant
documents. This means that such a naïve approach is likely to introduce considerable, yet unknown
and unknowable, bias into her study, threatening the validity of her conclusions. Simply put, our
researcher will just miss any debate and story related to minority rights she was not aware of and did
not have a fitting keyword for in her search query. This is a common research situation in comparative
text-as-data studies, and one which in the past has been ignored at the expense of the validity of this
area of study. It is also a research situation that is bound to become more common as ever greater
amounts of textual data from all around the world become available to researchers. In this study we
tackle this problem head-on.
More formally, our problem is one of document classification for very low-prevalence classes: We
have a large unlabeled corpus and want to classify each document in that corpus as either thematically
relevant or irrelevant, with the expectation that the proportion of relevant documents will be very
small. So, we are not interested in the full corpus, just in the potentially small fraction of documents
belonging to an unknown set of relevant categories, which we want to fold into a binary relevant/
irrelevant distinction. That distinction is based not on a specific, narrow topic or event but on
a nonspecific, broad thematic area of interest (i.e., a “master topic”). In the fictive example at the
beginning of this section, the master topic is minority rights; in the use case we present below it is the
public role of religion. The broadness of any given master topic means that we need to assume
unknown latent sub-topics. Further, the involved researchers do not have sufficient domain knowl-
edge about the social context to correctly classify the documents without additional input. Our
solution is to bring expert-provided domain-specific knowledge to bear on unsupervised thematic
document discovery. We exploit such knowledge to improve the efficiency and validity of both the
initial discovery of themes and debates from unsupervised text modeling and the subsequent discovery
of relevant document sets based on it. Since in this scenario the data is highly unbalanced, it is virtually
impossible to find enough positive cases to produce sufficient training data for traditional supervised
learning approaches to document classification and discovery. We mitigate the “missing knowledge”
problem by incorporating expert domain knowledge about the latent sub-topics of our master topic
(i.e., in our fictive example: individual events, debates or issues that are related to minority rights) into
an LDA-based classification process. Incorporating expert knowledge is difficult to do in classical
classification approaches as they are more closed than LDA-based topic models: Topic models include
an inductive step of estimating topics directly from the analyzed document set that is missing from
“classical” supervised approaches. In the EITM process, we exploit this step to input expert-provided
domain information into the document discovery process. While LDA-based approaches have been
rarely used for classification tasks, we thus harness their relative openness to address this “closedness
problem” of established supervised machine learning approaches.
The most common strategy for researchers to discover documents covering a pre-specified master
topic has been to run keyword-based Boolean search queries on the universe of documents available to
them (Barberá et al., 2021). Therefore, we present a comparison of EITM’s document discovery
performance with the common simple search-query approach below.
While simple search queries are most commonly found in applied research, the state-of-the-art
methodological work in the area of document set discovery is a keyword-search-query approach
proposed by King et al. (2017; KLR). The goals of their approach and EITM are the same: to find the
document sub-corpus of thematic interest to the researcher. In other words, the analytical goal is to
discover, within a given document set, the subset of thematically relevant documents. This corre-
sponds to what King et al. describe as selecting a target set from a search set of documents (King et al.,
2017, pp. 975–976) and Barberá et al. as selecting the document population of interest from the
universe of available documents (Barberá et al., 2021, p. 21).
EITM shares some important commonalities with the KLR approach. Both are computer-
assisted approaches that draw on the unique ability of humans to recognize relevant keywords
for keyword and document set discovery. Both also use these human inputs in interaction with the
outputs of advanced automated text classification methods. Finally, both approaches involve the
same broad stages of document identification: seeding, generalizing, validating, and discovering.
Seeding refers to the input of human-generated information (“seeds”) into the computational
process; generalizing refers to the formulation of general concepts (e.g., keywords or topics) from
specific examples; validating refers to the human correction of intermediate process outputs; and
discovering refers to the final selection of the thematically relevant documents (i.e., the sub-corpus
of thematic interest – the target set in King et al.’s or population in Barberá et al.’s terms) from the
larger document set.
But the EITM approach also differs from the KLR approach in important ways: First, in the nature
of the human inputs it leverages, which EITM formalizes as coming from domain experts rather than
the researchers themselves.
Second, it differs in the steps involved and functions of the main stages in
the document discovery process. Table 1 gives an overview of the main stages as implemented in the
KLR and EITM approaches and summarizes the differences in how the two approaches tackle the
problem of thematic document set discovery (a more detailed breakdown of the practical steps
involved in EITM is given in the next sections and in Figure 1).
With their computer-assisted approach to keyword and document set discovery, the KLR approach
frees human coders from the burden of creating keyword lists and queries on their own. But the
approach still relies on preexisting “detailed contextual knowledge” (King et al., 2017, p. 979) of
human coders, which is required to distinguish relevant from irrelevant keywords proposed by an
algorithm. And as Barberá et al. (2021, p. 20) note, keyword-generation and -expansion
methods must start with a human selection of keywords to seed the algorithm. This means that
researchers need to be careful in thinking about which seed keywords are both relevant and repre-
sentative in order to avoid biasing path dependencies.
The EITM approach lowers the need for detailed contextual knowledge on part of the researcher by
using expert inputs to highlight relevant keywords and facilitate human-coder learning about relevant
Table 1. Main Stages of the King, Lam, and Roberts Keyword Algorithm (KLR) and Expert-Informed Topic Models (EITM) for Document Set Discovery.

Stage            KLR                                                 EITM
1 Seeding        of keywords for producing training document set     of keywords for producing expert inputs (Step 1)
2 Generalizing   of training document set to suggested keywords      of expert inputs to ranked estimated topics (Steps 2–4)
3 Validating     of suggested keywords for document discovery        of ranked estimated topics for document discovery (Step 5)
4 Discovering    of target document subset via search queries        of target document subset via model-based probability estimation (Steps 6–7)
The table displays a simplified overview of both approaches. Each stage listed is composed of multiple steps in practice (individual
EITM steps as shown in Figure 1 in parentheses). Both approaches can be adapted, for example, by choosing different document
clustering approaches or inserting more points of human input or iterating the process.
Figure 1. Overview of the EITM process.
contexts. As a result, applied researchers depend less on an assumption of preexisting contextual
knowledge in human coders. Of course, experts also need to come up with relevant keywords, which is
why it is important to (a) survey experts that can plausibly be expected to have good knowledge of the
subject domain of interest and (b) recruit a reasonable number of experts to ensure that the coverage
(or “recall” in machine learning terms) of keywords indicating relevant topics (i.e., those belonging to
the researcher-specified master topic of interest) is sufficiently high. This is crucial given that in King
et al.’s (2017) study the authors found that most keywords were named by only one of 43 non-expert
subjects tasked with providing relevant keywords for a search query. Proper thematic coverage thus
requires a proper number of keyword-providing experts.
A sufficiently large group of external experts also solves a second problem. If keyword candidates
are computer-generated and must be evaluated by humans, as King et al. (2017) propose, and if the
primary researchers do not have sufficient relevant domain and context knowledge, it is preferable to
have external experts propose the keywords in the first place, not only to improve document set
discovery but also to support researcher learning about the contexts studied. Expert inputs support not
only a more valid discovery of target document sets, but also a more locally informed, “emic”
understanding of studied contexts. As such, bringing in experts also reorients the process of knowing
about the cultures we study in large-scale text analysis. It makes good on a demand often placed on
cross-cultural research in modern comparative methodology by bestowing emic (“insider”) validity
upon an etic (“outsider”) comparative perspective (Whitaker, 2017).
Another key advantage of our topic modeling approach over keyword search approaches like the
KLR algorithm is that it generates a continuous distribution of relevance scores rather than a sharp
binary classification. In this context, it is important to recognize that (semi-)automated pre-
classification will never be perfect. However, the approach we propose is more flexible than keyword
queries in that it allows the researcher to adjust the pre-classification parameters to their research
situation and the acceptable level of tolerance for classification error. For example, if a researcher has
few resources at their hands for conducting follow-up human validation of automated document
classifications, they may want to select only documents in the uppermost part of the estimated
relevance distribution. If they have larger resources for final human classification at their disposal
but want to implement their project in a more efficient manner, they can work their way down from
the top of the estimated document relevance distribution in a manual follow-up classification step
until they have reached their desired sample size. We provide an illustration of the efficiency gains
achieved through this process below. In sum, the EITM approach provides a classification framework
that is more flexible, contextually valid and less reliant on assumptions about researcher domain
knowledge than the available approaches based on Boolean keyword search queries.
Expert-informed topic models
Figure 1 provides an overview of the seven-step EITM approach and shows how it combines expert,
computational, and researcher inputs to produce valid scalable solutions to topic-based document
subset discovery. When applied to a document discovery problem in which a researcher aims to
discover a subset of documents that talk about specific topics (i.e., issues or debates), the approach asks
domain experts to name the topics and keywords that appeared in relevant thematic discussions
during the period for which data was collected (Step 1). While the task of recruiting and surveying
knowledge domain experts requires additional effort compared to established document set discovery
approaches, it will often require fewer resources than making the substantial investment of resources
required for researchers to acquire comprehensive expert knowledge themselves. This will be true
especially in situations where studied social contexts are highly unfamiliar and language barriers to the
studied texts are high for the researcher. Depending on research interest, experts may be substantive
specialists in specific subject domains or area specialists in specific geographical regions.
After standard preprocessing of the full document corpus (Step 2) and estimating an LDA topic
model on it (Step 3), the EITM approach uses expert inputs to rank the generated topics by their
estimated relevance (Step 4). The topic ranking then serves as a basis for a manually validated binary
classification of estimated topics as either belonging or not belonging to the researcher-specified
master topic (Step 5). Importantly, the manual binary validation of estimated topic relevance in Step 5
is again informed by the comprehensive expert-provided list of relevant topics. After the top-ranked or
all (if resources allow) estimated topics are human-validated as relevant or irrelevant, we estimate, for
each document, its cumulative probability of containing one or more of the relevant topics, which can
then be used for classification (“discovery”) decisions about individual documents based on
a researcher-specified relevance probability cutoff (Step 6).
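The Step 6 discovery decision can be sketched as follows. This is a minimal illustration, not the paper's implementation: the topic distributions, topic IDs, and cutoff value below are invented placeholders.

```python
# Score each document by the summed probability of all validated relevant
# topics in its estimated topic distribution, then keep documents whose
# relevance score reaches a researcher-chosen cutoff.

def discover_documents(doc_topic_dists, relevant_topics, cutoff=0.3):
    """doc_topic_dists: {doc_id: {topic_id: probability}};
    relevant_topics: set of topic IDs validated as relevant in Step 5."""
    relevance = {
        doc_id: sum(p for t, p in dist.items() if t in relevant_topics)
        for doc_id, dist in doc_topic_dists.items()
    }
    discovered = {d for d, score in relevance.items() if score >= cutoff}
    return relevance, discovered

# Toy example: three documents, two validated relevant topics (7 and 12)
dists = {
    "doc_a": {7: 0.25, 12: 0.15, 3: 0.60},  # relevance 0.40 -> kept
    "doc_b": {3: 0.90, 7: 0.05},            # relevance 0.05 -> discarded
    "doc_c": {12: 0.35, 5: 0.65},           # relevance 0.35 -> kept
}
relevance, discovered = discover_documents(dists, {7, 12}, cutoff=0.3)
```

Lowering the cutoff trades precision for recall, which is how the approach lets researchers tune the classifier to their tolerance for classification error.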
Finally, in keeping with the principle of validating any given implementation in automated text
analysis (Grimmer & Stewart, 2013, p. 271), we evaluate the resulting classifier using the receiver
operating characteristic (ROC) curve, classifier precision, recall, and F1 scores as established classification performance metrics (Step 7). We provide a detailed description of each step below.
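The Step 7 metrics can be computed in a few lines of pure Python; the gold labels and predictions below are made up for illustration (in practice one might use scikit-learn's metrics module, which also provides ROC curves).

```python
# Precision, recall, and F1 for a binary relevant/irrelevant classifier,
# computed from gold labels and model predictions (1 = relevant).

def precision_recall_f1(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [1, 1, 0, 0, 1, 0]   # hypothetical human-coded relevance labels
pred = [1, 0, 0, 1, 1, 0]   # hypothetical classifier decisions
p, r, f1 = precision_recall_f1(gold, pred)
```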
The key concept of the EITM approach is the cumulative match probability (CMP), which is
the central expert-informed output generated by the process. The CMP formalizes qualitative
expert input into a continuous “relevance probability metric” for each topic estimated in the
unsupervised topic modeling process. This metric allows researchers to infer the thematic rele-
vance of any given estimated topic based on a prespecified probability cutoff value or, alterna-
tively, supports human coders in a manual decision task to validate the thematic relevance or
irrelevance of individual estimated topics, informed by the expert-provided insights about relevant topics.
The EITM process integrates three types of input. Expert input fulfills the dual function of
increasing both the efficiency with which researchers can prescreen estimated topics for thematic
relevance and the validity with which they can identify previously unknown relevant topics.
Computational input in the form of inductive, unsupervised Latent Dirichlet Allocation (LDA) topic
modeling (Blei et al., 2003; see also Guo et al., 2016; Jacobi et al., 2016; Maier et al., 2018) combines
with expert and researcher inputs to identify relevant topics and documents while building on state-of-
the-art guidelines for applying such models in the social sciences (Trilling & Jonkman, 2018; van
Atteveldt & Peng, 2018). Researcher input into the EITM approach serves to enhance the selection of
relevant topics over a fully automated solution and keep with the crucial principle of validation in
automated text analysis (Grimmer & Stewart, 2013, p. 271).
In sum, EITM ensures that not only documents containing known relevant topics, but also
those containing previously unknown relevant topics will, on average, be assigned higher rele-
vance scores based on topic model estimates. These higher, and more valid, relevance scores
assigned to documents, in turn, make it more likely that they will correctly be identified as
thematically relevant by the research team. In essence, the EITM approach represents a scalable
solution to increase the recall, or true positive rate, in complex, large-scale content-based
document set discovery tasks. We note that, of course, this solution will not, and cannot, be
perfect. Classification errors will still be made. The degree to which such errors are deemed
acceptable to researchers will depend on the specific research situation. For example, in situations
where document set discovery is a step in the preparation of a fully automated follow-up text
analysis process, classification errors may be more tolerable than in situations where document
set discovery is used to identify the universe of documents to draw a representative random
sample for further in-depth human coding. In the latter case, EITM can help researchers conduct
a pre-selection process that is more efficient and substantively valid than the standard approach
based on researcher-generated keyword searches (see Barberá et al., 2021, pp. 21–26). The outputs
of the EITM process may then be subjected to follow-up manual screening by human coders to
ensure that random sampled documents are indeed thematically relevant. This manual process is
not strictly part of the EITM approach, but it will again be informed and improved to the extent
that human coders learn about relevant topics from the expert inputs acquired in the EITM process.
The EITM workflow
The following sections describe each step of the EITM process in detail. We illustrate its application
and demonstrate its validity and classification performance using the Mannheim International News
Discourse Data Set (MIND, Rinke et al., 2019). Our validation is based on a complete collection of all
news items published by 94 different news websites, printed newspapers, and political blogs from five
countries (Australia, Germany, Switzerland, Turkey, and the United States) in the full year ranging
from August 1, 2015 until July 31, 2016. Our use case is a document set discovery task in which we
want to identify, from the full corpus of all articles published by these media outlets in a full year, those
articles that belong to the master topic “public role of religion in societal life”. Appendix A provides
detailed information about MIND and the specific data underlying our use case.
Step 1: Conduct expert survey
In the first step of the EITM process, expert inputs are generated to later inform the classification of
estimated topics as thematically relevant or irrelevant. In our case, we tapped the knowledge of domain
experts in each of the countries from which text material was collected using semi-standardized
surveys that followed standards for expert selection aimed at mitigating bias used in past cross-
national expert surveys (e.g., Kopecký et al., 2016). The survey asked selected experts to name topics
and debates that occurred during the time period studied. Experts were also asked to provide keywords
associated with these topics and debates.
We then grouped similar topics provided by different
experts into a smaller number of meta-topics using the online qualitative content analysis tool
QCAmap (Mayring, 2014). Keywords associated with such topics were pooled on the meta-topic
level. Expert-provided topics were assigned to one or several meta-topics leading to a consolidated
array of non-disjunct meta-topics (see Appendix B2). The meta-topics were used in Step 3 to evaluate
the quality of estimated topic models by comparing the keywords in each meta-topic to the top
keywords of each LDA-generated topic.
Overall, we identified a pool of 614 experts who were contacted to take the survey. The survey was
fielded from August 15 until November 18, 2016 and generated a total of 96 completed questionnaires
(Australia: n = 15, Germany: n = 19, Switzerland: n = 21, Turkey: n = 20, USA: n = 21). Appendix B1
provides a detailed description of expert recruitment procedures, participant demographics, and
response rates.
Step 2: Preprocess corpus
We used standard text preprocessing steps to prepare the corpus for analysis, including cleaning,
lemmatization, term frequency–inverse document frequency (tf–idf) calculation, and corpus feature
selection (see Manning et al., 2008, Chapters 2.2, 6.2). Importantly, researchers should carefully
consider each such step as preprocessing choices may influence the classification results (Denny &
Spirling, 2018). Appendix C1 provides details to help readers preprocess their own text corpora.
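A stripped-down sketch of such a preprocessing pipeline follows. The stopword list and documents are illustrative only, and a real pipeline would add lemmatization (e.g., via spaCy) before vocabulary filtering, as described in the appendix.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "and", "to", "is"}  # illustrative only

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]

def tfidf(docs):
    """Plain tf-idf per document, usable for corpus feature selection
    (dropping uninformative, near-ubiquitous terms)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))  # document freq.
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                       for t in tf})
    return scores

docs = ["The role of religion in public life",
        "Public debates about religion",
        "A story about sports"]
scores = tfidf(docs)
```

Terms that occur in many documents (here, "religion") receive lower weights than terms concentrated in few documents, which is the basis for pruning the feature space before topic modeling.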
Step 3: Perform topic modeling
Following text preprocessing, we trained the topic model using the LDA model implementation in
Gensim, a freely available Python library for scalable statistical semantics (Řehůřek & Sojka, 2010). We
used the default Gensim settings for prior parameters α and η, and optimized the topic model by
tuning the parameter specifying the number of estimated topics (see Maier et al., 2018, p. 99). In order
to do so, we generated topic models with 100, 500, and 1000 topics for each country with the goal of
identifying the best-performing of the three models. In a next step we used three metrics to evaluate
the topic model solutions: the human-rated coherence of estimated topics (Newman et al., 2011, pp. 501–502; 2010, p. 106) as well as measures to assess the precision (BLEU measure, Papineni et al., 2002)
and recall (ROUGE measure, Lin, 2004) with which the estimated topic models covered the expert-
provided relevant topics. BLEU and ROUGE are measures developed in the computational translation
literature, which we repurposed to assess the degree to which a given topic model constitutes
a comprehensive and accurate representation of expert knowledge in the substantive domain of
interest for the classification task. We identified the “best” model solution using a researcher-driven
reading of the three metrics (human-rated coherence, precision/BLEU, and recall/ROUGE) wherein
the 500-topic solution provided the best tradeoff among these criteria. The results show that the main
tradeoff was between the degree to which our model captured the entirety of expert inputs (i.e., recall/
ROUGE) and the degree to which the output of the model (i.e., the number of topics to be validated as
well as their intelligibility and interpretability) remained manageable. Appendix C2 provides further
detail on the metrics and the procedures applied for topic model selection. It also provides all Python
scripts used for topic modeling, which users can adjust to their own research situations. All following
steps in the EITM process are based on the 500-topic model solution.
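The repurposed coverage metrics can be illustrated with a simplified unigram sketch. This is a hypothetical analogue, not the actual BLEU/ROUGE computations (which additionally use n-gram statistics and length penalties); the topic words and expert keywords below are toy stand-ins:

```python
# Simplified unigram sketch of the two coverage metrics: how well the
# topic model's top words cover expert-provided keywords (recall,
# ROUGE-1-like) and how specific the top words are to them (precision,
# BLEU-1-like). Inputs are illustrative stand-ins.
def unigram_precision_recall(topic_top_words, expert_keywords):
    model_vocab = set(w for topic in topic_top_words for w in topic)
    experts = set(expert_keywords)
    overlap = model_vocab & experts
    precision = len(overlap) / len(model_vocab)  # BLEU-1-style
    recall = len(overlap) / len(experts)         # ROUGE-1-style
    return precision, recall

topics = [["religion", "state", "school"], ["church", "tax", "secular"]]
experts = ["religion", "church", "secular", "headscarf"]
p, r = unigram_precision_recall(topics, experts)
print(round(p, 2), round(r, 2))  # 0.5 0.75
```

A model with many topics tends to raise recall (more expert keywords covered) at the cost of precision and reviewability, which is the tradeoff described above.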
Step 4: Rank topics
The output of an LDA topic model is a set of m topics t_1 to t_m, each of which corresponds to
a probability distribution of words within that topic. This means that a topic t_i consists of
a dictionary set of n words w_1 to w_n, each with a certain probability p(t_i, w_1) to p(t_i, w_n)
of occurring in that topic (see Figure 2). We
divided the set of words in each topic into subsets of relevant (marked green in Figure 2) and non-
relevant words, depending on whether they appear in the keyword list provided by the experts.
We then
calculated the CMP of each topic by summing the topic-specific probabilities of all expert-provided
relevant keywords per topic. The CMP of each topic can be interpreted as its overall probability of
containing relevant keywords and as such is an estimate of the topic’s thematic relevance. The researcher
can then create a list of all topics estimated by the topic model, ranked by their CMP.
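A minimal sketch of the CMP computation and ranking, using toy word probabilities and expert keywords (the real inputs are the fitted LDA word distributions and the keyword lists from the expert survey):

```python
# CMP of a topic = sum of its word probabilities p(t, w) over the
# expert-provided keywords. All values below are illustrative stand-ins.
def cumulative_match_probability(topic_word_probs, expert_keywords):
    experts = set(expert_keywords)
    return sum(p for w, p in topic_word_probs.items() if w in experts)

topic = {"religion": 0.12, "state": 0.05, "school": 0.04, "sport": 0.02}
experts = ["religion", "state", "secular"]
print(round(cumulative_match_probability(topic, experts), 2))  # 0.17

# Rank all topics by CMP, highest first (the list reviewed in Step 5)
topics = {"t1": topic, "t2": {"football": 0.20, "league": 0.10}}
ranking = sorted(topics, reverse=True,
                 key=lambda t: cumulative_match_probability(topics[t], experts))
print(ranking)  # ['t1', 't2']
```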
Step 5: Select relevant topics
In Step 5, human coders manually review the estimated topics and make a binary yes/no judgment
about the thematic relevance of each topic. If the number of estimated topics is too large for
a researcher to perform a review of all topics given available resources, the EITM process is flexible
enough to allow the researcher to minimize the resulting loss of classification performance. In such
situations, coders may perform this manual review step only on the upper end of the estimated
relevance probability distribution, while the topics on the lower end of the estimated relevance
distribution are discarded as irrelevant. In our use case, we selected the top 100 (or top 20%) of topics
estimated by each topic model for further manual review.
Human coders were presented with the CMP score of each topic and the top-100 keywords with the
greatest estimated probability of belonging to that topic. Keywords with a match in the list of expert-
provided keywords were color-highlighted to guide the coders’ review process. Color-highlighting
keywords for manual topic review helps coders in two ways: First, it prevents them from overlooking
relevant keywords and associated topics they already know to be relevant. Second, and more impor-
tantly, it indicates to coders keywords they may be either entirely unfamiliar with or which have an
unknown association with a known relevant topic. It also signals to coders which keywords influenced
the CMP, which may aid the interpretation of topics estimated to have a high probability of being
relevant. Taken together, color highlighting as a feature of the EITM process works to improve the
selection of relevant topics for subsequent discovery of the relevant document subset and supports
coder and researcher learning about previously unknown relevant topics in the text data. Figure 3
shows a screenshot of a topic list as presented to the coders.
Figure 2. Calculation of cumulative match probability (CMP) based on topic model.
Note. Each topic t consists of a set of words w with assigned probabilities p.
In our application, three coders were instructed to rate each topic as relevant or irrelevant
following a coding protocol provided to them. Coders were also instructed to rate how certain
they were about whether the given topic was thematically relevant for the purpose of the
research project or not (i.e., whether it belonged to the master topic of the public role of
religion in societal life). Each topic was then declared relevant or irrelevant following the
majority decision of coders (see Appendix C3 for the coding protocol as well as all data and
analysis scripts used in this step).
In sum, the EITM process improves the topic selection in two ways: First, by relating the
topics estimated via topic modeling to expert-provided keywords, thus expanding the list of
relevant expert-provided keywords with frequently co-occurring keywords. Second, by enabling
human coders, based on the CMP ranking, to make more valid and efficient decisions in
deselecting topics likely to be thematically irrelevant from the subsequent document discovery step.
Step 6: Score document relevance
Next, the topic model is used to estimate, for each document in the corpus, the probability that it
contains at least one thematically relevant topic (Figure 4). For a corpus consisting of o documents,
each document d_x with x ∈ {1, ..., o} can be represented by a distribution of topics t_1 to t_m with
associated probabilities p(d_x, t_1) to p(d_x, t_m) that the document contains these topics. For each
document, we then summed these probabilities over the topics selected as relevant in Step 5. The
resulting sum can be considered the individual relevance score of document d_x. The higher this
score, the higher the probability that the respective document contains at least one of the topics of
interest. This enables the
researcher to classify documents as relevant or irrelevant based on their corresponding relevance score
(see Appendix D).
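A minimal sketch of this scoring step, with toy topic distributions and a hypothetical cutoff (the actual distributions come from the fitted LDA model, and the cutoff is chosen by the researcher):

```python
# A document's relevance score = sum of its topic probabilities p(d, t)
# over the topics judged relevant in Step 5. Values are illustrative.
def relevance_score(doc_topic_probs, relevant_topics):
    relevant = set(relevant_topics)
    return sum(p for t, p in doc_topic_probs.items() if t in relevant)

docs = {
    "d1": {"t1": 0.6, "t2": 0.3, "t3": 0.1},
    "d2": {"t1": 0.05, "t2": 0.05, "t3": 0.9},
}
relevant = ["t1", "t2"]  # topics coders marked as relevant
scores = {d: round(relevance_score(t, relevant), 2) for d, t in docs.items()}
print(scores)  # {'d1': 0.9, 'd2': 0.1}

# Classify by a researcher-chosen cutoff on the relevance score
cutoff = 0.5
selected = [d for d, s in scores.items() if s >= cutoff]
print(selected)  # ['d1']
```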
Figure 3. Results of Topic Ranking for Australia with 500 Topics (Screenshot)
Note. Screenshot as presented to human topic-relevance coders, presenting the top 20 topics and top 10 terms for each topic ranked
by estimated topic relevance and term-topic membership probabilities. Green cells include terms that match with a keyword
provided by at least one participant in the expert survey. CMP corresponds to the summed topic-level probability of all terms
mentioned by at least one participant in the expert survey to belong to the respective topic.
Step 7: Evaluate classifier
Validation is a crucial step in the development of any automated text-as-data technique (Grimmer &
Stewart, 2013). It is therefore important in any application of the EITM approach to evaluate the
quality of the classification solution it provides. The evaluation process we suggest consists of several
steps. In a first validation step, we evaluate the estimated topic models in terms of their (a) coherence
of estimated topics and (b) their coverage of topics and keywords provided by the domain experts.
Appendix C2 provides a detailed description of this first step.
In a second validation step, we evaluate the classification performance of our EITM against
a human-coded gold standard. The gold-standard ratings are then compared to the relevance scores
predicted by the classifier. The assumption here is that such human validation of automated pre-
classification will be effective even in the absence of coder domain knowledge. In an ideal research
situation, researchers using the EITM process would draw on experts to validate their classification
solutions. However, in our use case we employed student coders without expert domain knowledge as
a “second-best” option for validation. This second-best approach is still useful given that the key
problem addressed by EITM is that researchers cannot pre-define keywords for valid thematic
document discovery. Our assumption when using non-expert coders in this step here is that it will
be possible even for humans without prior domain knowledge to recognize a thematically relevant
article when they see it.
We evaluated the performance of the EITM classifier using receiver operating characteristic (ROC)
curves as well as three established metrics: precision, recall, and the F1-score (see Japkowicz & Shah,
2014). This process assesses the degree to which the predicted relevance score works as an effective
thematic classifier.
In our use case, human coders first coded the relevance of the entire collected text material from all
political blogs in three of our countries: Australia, Switzerland, and Turkey. We then compared the
results to the scores predicted by the EITM approach. Blogs are a “hard case” for testing classifier
performance because blog posts will, on average, be more difficult to classify than more standardized
categories of text such as news reports, which routinely rely on recurring professional formal and
content features, such as the use of certain words and phrases, due to journalistic writing conventions.
We therefore consider this evaluation to be a conservative test of classifier performance.
Gold-standard data to evaluate the EITM classifier was produced by 13 student coders who
identified any documents from the full corpus of Australian, Swiss, and Turkish blog posts included
in the MIND dataset that deal with the master topic of our use case, “the public role of religion in
societal life” (N
= 9, N
= 10, N
= 10 blogs and N
= 7,503, N
= 1,635, N
= 3,565 blog
posts, see Appendix A1 for an overview of collected data). Coders underwent intensive, multi-wave
training in applying a detailed protocol for identifying documents as thematically relevant or irrele-
vant (see Appendix D1 for the selection protocol).
Several measures were taken to ensure a high quality of the human-coded benchmark data. All
coders received the same training and instructions and were native speakers or fluent in at least two of
the three source languages. Following coder training, we established pilot coding reliabilities before
starting the main coding process. These reliability tests involved 11 of the 13 coders who accounted for
92% of all 13,750 coding acts performed as part of the main coding process (not included were C09,
who accounted for 251, or 1.8%, and C28, who accounted for 850, or 6.2%, of coding acts in the main
coding process). In order to assess pilot coding reliability, coders rated 150 German and Swiss
newspaper website articles and blog posts as either relevant or irrelevant. Coding reliability was
assessed using percent agreement as well as two chance-corrected measures of interrater reliability,
the Brennan and Prediger (1981) coefficient and Krippendorff’s alpha (Krippendorff, 2018). Overall,
pilot reliability was excellent, with percent agreement = .92, kappa = .85, and alpha = .83 (all data
and analytical code required to reproduce these analyses are available in Appendix D2).
Figure 4. Calculation of document relevance scores (rs) based on topic model.
Note. Each document d consists of a distribution over topics t_1 to t_m, with a probability p(d, t) that the document belongs to each topic.
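Both agreement indices are straightforward to compute. A minimal sketch with toy binary ratings; the Brennan–Prediger coefficient corrects observed agreement by a fixed chance rate of 1/k for k categories (Krippendorff's alpha, which additionally accounts for observed category distributions, is omitted here for brevity):

```python
# Percent agreement and the Brennan–Prediger chance-corrected
# coefficient for two coders. Ratings are illustrative stand-ins for
# relevant (1) / irrelevant (0) decisions.
def percent_agreement(r1, r2):
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def brennan_prediger(r1, r2, k=2):
    p_o = percent_agreement(r1, r2)
    p_e = 1 / k  # uniform chance agreement over k categories
    return (p_o - p_e) / (1 - p_e)

coder1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
coder2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
print(percent_agreement(coder1, coder2))          # 0.9
print(round(brennan_prediger(coder1, coder2), 2))  # 0.8
```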
In order to produce the human-coded gold-standard data, we implemented an online coding
process that allowed coders to code all 12,703 Australian, Swiss, and Turkish blog posts included in
the MIND dataset as thematically relevant or irrelevant. In order to eliminate any remaining coding
error, the online coding process consisted of a multi-step validation procedure: any document for
which the first coder indicated thematic relevance and/or uncertainty in their coding decision was
coded by a second coder and any possible coding disagreement between the first and second coder
was resolved by a tie-breaking third coder. These double-coding and majority decision rules served
to minimize the proportion of false-positive and to maximize the proportion of true-positive
classifications of blog posts as thematically relevant. The coding process also minimized bias due
to any remaining idiosyncratic coder error post-training by implementing a set of rules aimed at
avoiding a monopolization of coding for specific sources by any individual coder. Appendix D3
gives a detailed description of the online coding tool and process used, including a list of coder
allocation rules.
Coders performed a total of 13,750 coding acts, including all initial coding (round 1), double-
coding (round 2), and tie-breaker coding (round 3), as part of the main coding process to produce the
human-coded gold standard data.
This procedure identified 184 thematically relevant blog posts for
Australia (out of 7,503), 28 for Switzerland (out of 1,635), and 99 for Turkey (out of 3,565) (see
Appendix D3).
Receiver operating characteristic
The first performance metric we evaluated and optimized was the area under curve (AUC) in the
receiver operating characteristic (ROC) plot. The ROC curve indicates the performance of a binary
classifier at different classification thresholds (see Japkowicz & Shah, 2014). A perfect classifier (i.e.,
one that yields 100% true positives and 0% false positives) is represented as a dot in the (0, 1) corner of
the ROC plot. The diagonal corresponds to a bad classifier that is equivalent to randomly guessing
results. The AUC is a single scalar value representing the expected ROC performance of a classifier. As
the portion of the area under the ROC curve, its value will always be between 0 and 1.0 with an
uninformative classifier indicated by a value of 0.5. The AUC is equivalent to the probability that the
classifier will rank a randomly chosen positive instance higher than a randomly chosen negative
instance (Fawcett, 2006, p. 868).
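Fawcett's rank interpretation suggests a direct way to compute the AUC without plotting the curve. A minimal sketch with toy relevance scores and gold-standard labels (not the evaluation code used in the study):

```python
# AUC as a rank probability: the chance that a randomly chosen relevant
# (positive) document gets a higher relevance score than a randomly
# chosen irrelevant (negative) one; ties count as half.
def auc_rank_probability(scores, labels):
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n)
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

scores = [0.9, 0.8, 0.4, 0.3, 0.1]  # illustrative relevance scores
labels = [1, 1, 0, 1, 0]            # illustrative gold-standard labels
print(round(auc_rank_probability(scores, labels), 3))  # 0.833
```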
The ROC curves and the AUC show that classification based on the 500-topic models worked well
for the Swiss and Australian corpora, while performance dropped slightly but remained acceptable for
the Turkish corpus (Figure 5). The diminished classification performance for the Turkish corpus could
be a function of problems with the analyzed text data, preprocessing procedures, or different coding
styles of the coders during Step 5.
The ROC curves also show that classification performance is sensitive to the selected number of
topics. While in Australia the model estimating 100 topics had a slight performance edge, the 500-
topic model was superior in the other two countries. In Switzerland in particular, the models
estimating 100 and 1000 topics performed poorly. This supports the choice of topic models with the
model hyperparameter set to estimate 500 topics for all countries and languages as these consistently
exhibited good performance. It also shows the importance of making an informed choice about the
topic model hyperparameter in any given research project.
Precision, recall, and F1
The classification problem we address here involves an extremely unbalanced dataset with a very low
prevalence of the positive class in the raw dataset. In such situations, the ROC and AUC can be
misleading, which is why we also evaluated the precision, recall, average precision, and F1-scores for
the positive class (Manning et al., 2008).
Precision is the fraction of items correctly classified as relevant (TP) among all items classified as
relevant by the classifier (TP+FP): TP/(TP+FP). Recall is the fraction of all relevant items (TP + FN)
that were correctly classified as relevant by the classifier (TP): TP/(TP+FN). The F1-measure is the
harmonic mean of a classifier’s recall and precision: F1 = 2 × (precision × recall)/(precision + recall).
Its highest possible value is 1, indicating perfect precision and recall. The F1-measure thus balances
precision and recall and is a common measure of overall classification performance.
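These definitions can be sketched directly. The confusion-matrix counts below are hypothetical, but chosen so that precision and recall approximate the Australian EITM values reported in Table 2:

```python
# Precision, recall, and F1 from confusion-matrix counts (TP = true
# positives, FP = false positives, FN = false negatives). Counts are
# illustrative stand-ins.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 21 of 100 selected documents truly relevant, 39 relevant missed
p, r, f1 = prf1(tp=21, fp=79, fn=39)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.21 0.35 0.26
```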
Table 2 (right-hand side) reports the classification results at the relevance score cutoff where the
EITM classifier reaches the maximum F1-score. At this cutoff, the F1-score is .26 for the Australian, .34
for the Swiss, and .25 for the Turkish corpus for the respective model with 500 topics. Precision and
recall of the classifiers are similar across the three countries.
In order to benchmark EITM’s classification performance, we emulated a simple keyword-based
search approach of the sort that is often used but rarely explicated in applied communication research:
the researcher creates a search query that is designed to maximize recall while avoiding the excessive
inclusion of irrelevant articles that may occur due to the inclusion of theme-unspecific signifiers, for
example, political party names like “CDU” in Germany. Recent examples of studies using this kind of
approach in communication research include Eshbaugh-Soha (2010), Gentzkow and Shapiro (2010),
Figure 5. Classifier receiver operating characteristic (ROC) and area under the curve (AUC) for Australia, Switzerland, and Turkey at
100, 500, 1000 topics.
and Schäfer et al. (2014). In our benchmarking exercise, we used the thematic keywords collected in
the expert survey to build the search query. Table 2 (left-hand side) shows the results of this simple yet
common approach to document set discovery compared against the manual gold standard coding for
Australian, Swiss, and Turkish blogs.
When compared to EITM (right-hand side), the keyword-based search selects a larger number of
documents and therefore has a much higher recall. However, this comes at the price of vastly reduced
precision, as most of the selected documents are not actually relevant. Compared to the keyword-based
search query, the EITM-based classification increases the F1-score by a factor of 3.5 to 4.8.
But rather than just looking at this improved classification performance, it is important to note that
more generally EITM bestows researchers with greater flexibility in fine-tuning the classification to
their specific research needs. A keyword-based search query yields exactly one result and can only be
adjusted by including or excluding specific keywords or changing Boolean operators. With EITM, the
researcher can tune the classification result in a more fine-grained fashion by setting custom cutoff
values for the relevance scores assigned to documents in the corpus.
Figure 6 displays the precision-recall curves for document classification based on topic models with
100, 500 and 1000 topics for the Australian, Swiss, and Turkish blog posts in the MIND dataset. It also
includes the F1 and average precision (AP) scores for each classifier. The AP is the mean of the
precision over all possible relevance cutoff values weighted by class distribution. It is the average
probability, across the full range of possible cutoffs, that a document picked up based on the EITM
(pre-)classifier actually is relevant.
The figure again shows that, in this application, the topic models estimating 500 topics tended to
perform best, with the exception of Australia, where the model estimating 100 topics scored highest. It
also illustrates the trade-off between precision and recall for the classifiers. As the recall of the classifier
is increased by lowering the relevance score cutoff point, the precision of the classification decreases.
That is, if a researcher requires high recall, they can keep drawing documents with a lower relevance
score at the expense of an increasing number of irrelevant documents and more time needed for sifting
through irrelevant documents.
Table 3 juxtaposes the average precision (AP) of the selected models estimating 500 topics with the
prevalence of the positive class (i.e., relevant documents) in the dataset. The prevalence is the fraction
of relevant documents in the document corpus and the probability of picking up a thematically relevant
document when making a random draw from the corpus.
In our unbalanced dataset, the AP scores indicate a significant improvement when selecting
relevant documents. Given that AP is the average probability, over all possible cutoffs, that
a document retrieved as relevant by the EITM (pre-)classifier is actually relevant, we find for the
Australian corpus, for which the AP is .16, that the EITM-based pre-classification increases the
probability of picking up a relevant document from the pre-classified corpus by a factor of 6.5
compared to random draws from the full Australian corpus (i.e., the prevalence). The probability of
retrieving a relevant document based on EITM was 12.3 times higher for the Swiss corpus and 5.4
times higher for the Turkish corpus.
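The speed-up factors above follow directly from dividing AP by prevalence; a quick check using the corpus counts and AP values reported in Table 3:

```python
# Speed-up factor = average precision (AP) / prevalence of relevant
# documents. Counts and AP values are taken from Table 3.
corpora = {                 # (positive class, total items, AP)
    "Australia":   (184, 7503, 0.16),
    "Switzerland": (28, 1635, 0.21),
    "Turkey":      (99, 3565, 0.15),
}
for country, (pos, total, ap) in corpora.items():
    prevalence = pos / total
    print(country, round(prevalence, 3), round(ap / prevalence, 1))
# Australia 0.025 6.5
# Switzerland 0.017 12.3
# Turkey 0.028 5.4
```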
We can use the efficiency gains observed for documents collected from blogs to estimate the
number of working hours saved for a task commonly encountered by applied researchers: sampling
a fixed number of thematically relevant documents from a corpus containing relevant and irrelevant
Table 2. Classification performance of EITM versus a baseline keyword-based search approach.

              Baseline keyword-query search        EITM
Country       Precision   Recall   F1             Precision   Recall   F1
Australia     0.04        0.99     0.07           0.21        0.35     0.26
Switzerland   0.04        0.96     0.07           0.28        0.42     0.34
Turkey        0.04        0.92     0.07           0.18        0.43     0.25

Note. For EITM, the precision, recall, and F1-score are reported at the maximum obtained F1-score for the models with 500 topics in Australia, Switzerland, and Turkey for all blog items in the MIND dataset.
documents. In our use case, we test the efficiency gains achieved through EITM for the example task of
generating from the full MIND dataset a sample of 300 thematically relevant documents for each of the
three countries (Australia, Switzerland, and Turkey), with a target sample size of 100 per document
category (blog posts, online news articles, printed news articles) per country (or fewer if the MIND
dataset contained less than 100 relevant documents for a document category in a given country). Based
on the theoretical “speed-up” as observed by measuring the average precision for the manual
identification of thematically relevant blog posts, we estimate that in this application the EITM
classification reduces the work required to reach these target sample sizes from the other two
document types in the MIND dataset (i.e., online and printed newspaper articles) by 87% (see
Table 3).
Figure 6. Precision/recall curve for the positive class for Australia, Switzerland, and Turkey at 100, 500, 1000 topics.
Note. F1 is the maximum obtained F1-score and AP is the average precision for the classifier.
Table 3. Observed speed-up (for blogs) and estimated efficiency gains (for websites & printed newspapers) for document discovery based on EITM (compared to a fully manual approach).

                                Blogs (Observed)                                 Websites & Printed Newspapers (Estimated)
Country       Positive Class   Total Items   Prevalence   AP    Speed-up   Coding Hours   Hours   Hours Saved   Relative Reduction
Australia     184              7503          .02          .16   6.5        30.5           11.7    64.6          85%
Switzerland   28               1635          .02          .21   12.3       10.3           13.3    149.8         92%
Turkey        99               3565          .03          .15   5.4        37.0           16.8    73.9          81%
Total                                                                                     41.8    288.3         87%

Note. Positive class is the number of identified relevant news items; total items is the number of all items in this category; prevalence is the calculated proportion of positive items; AP is the average precision of the EITM classifier based on estimated relevance scores and human-coding decisions; speed-up is the estimated factor by which EITM accelerates the manual identification of relevant documents; coding hours are the human-coding times observed in the online coding tool in hours; hours saved and relative reduction of coding time for websites and printed newspapers are estimates based on the speed-up observed for the classification of blog posts.
It is important to consider that we tested the EITM-based classifiers on blog posts. Looking at the
average time coders needed to decide about the relevance of documents, we find that blog posts appear
to be more difficult to code than the other document categories: in the example task described above,
for example, coders spent 30.5 of the total 42.2 coding hours spent on the three Australian document
corpora coding blog posts only. Therefore, we may assume that performance estimates based on blog
data are conservative and likely to improve for applications to more standardized genres of text like
news items from printed newspapers and news websites, because these are generally more structured
and use a more conventional writing style, which makes them easier to classify. Considering the speed-
up estimates above, this translates into 64.6 hours of manual coding time that a researcher would save
when aiming to sample the target number of 300 thematically relevant documents from the Australian
corpus. In the same task, coders spent a total of 23.6 hours coding the Swiss corpora to identify the
target number of relevant documents within each category of documents (blog posts, online news
articles, and printed newspaper articles). They spent 44% (10.3 hours) of their overall working time
coding blog posts, which translates into 149.8 hours of coding time saved; for Turkey, it took coders
a total of 53.8 coding hours to identify the fixed target number of relevant articles for each document
category. Coders spent 69% (37 hours) of their overall working time coding blogs, which translates
into 73.9 hours of saved manual coding time needed to meet the target size of the relevant document
sample. With the target sample sizes applied to the three MIND document categories, we estimate that
in this realistic example sampling task, applying the EITM approach will have saved at least 288.3 hours
of manual coding time in these three countries alone. Of course, one feature of the EITM process is
that it is highly scalable, and the larger the dataset to which it is applied, the greater the efficiency gains
will be.
It is important to note that in our validation study it would not have been necessary to code all
items for blogs in Australia to achieve the goal of drawing a sample of 100 items. For Swiss and Turkish
blogs, all items needed to be coded, given that the total number of relevant items in the corpus was
below the target sample size. Since all items falling into the blog category had to be coded to create the
gold standard data, the potential time saved for Australian blogs was not included in the estimated
total hours saved.
However, the performance data indicated that the classification problem in this case study and
dataset is very hard. The AUC and average precision scores suggest that relevance score estimation was
far from perfect. This, in turn, raises the question of whether selecting items based on the estimated
relevance scores may introduce a selection bias into the sampling process.
Bias can be introduced by every imperfect classification process and especially by keyword-based
selection processes. Within the EITM framework, the intermediate step of selecting relevant topics
(Step 5) is key to reducing selection bias. One option that it offers researchers is to bring in the
surveyed experts again, rather than the researchers themselves, to select and validate the topics.
Another option EITM offers is to optimize the ROUGE score, which is the recall of the expert
keywords by the topic model. Selecting a topic model with a high ROUGE score ensures broad
coverage of expert-named debates and can be expected to reduce bias in the resulting classification.
For example, if the goal is to sample news items related to discussions of public health in 2020,
a keyword search might overwhelmingly return news items related to COVID-19. If the domain
experts also listed a debate about alcohol abuse that is not covered by the topic model the ROUGE
score will drop. In such a situation, it is possible to adjust the topic model to include this debate, for
example, by varying hyperparameters such as the number of topics. Another option that would be
open to researchers using EITM is to decrease the weight of the topics related to COVID-19, to actively
influence the relevance scores, so that it becomes easier to discover news items debating other public
health issues.
Finally, EITM users can make use of chunking to further reduce classification bias: In the case study
presented here, we decided not to draw single documents based on the relevance score until we
reached the sample size. Instead, we drew larger chunks of 10,000 documents from the fully ranked
document set, working our way down from the top of the relevance distribution, and randomized the
order of documents within chunks. This made it possible to discover relevant news items with a lower
relevance score while still eliminating those news items that were extremely unlikely to be related to
the master topic. The EITM process allows researchers to reduce bias and maximize recall by adjusting
either the chunk size or the probability cutoff values according to the amount of time that they are able
to invest into the item-selection process.
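The chunking procedure can be sketched as follows. Chunk size, scores, and the random seed below are illustrative stand-ins (the study used chunks of 10,000 documents):

```python
# Rank all documents by relevance score, take fixed-size chunks from the
# top of the ranking, and shuffle within each chunk so that relevant
# items with lower scores can still surface during manual review.
import random

def chunked_order(doc_scores, chunk_size, seed=0):
    rng = random.Random(seed)
    ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
    order = []
    for i in range(0, len(ranked), chunk_size):
        chunk = ranked[i:i + chunk_size]
        rng.shuffle(chunk)  # randomize order within the chunk only
        order.extend(chunk)
    return order

scores = {f"d{i}": 1 - i / 10 for i in range(10)}  # toy relevance scores
print(chunked_order(scores, chunk_size=5))  # d0–d4 shuffled, then d5–d9 shuffled
```

Coders then work through the ordered list chunk by chunk until the target sample size is reached.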
In sum, researchers using EITM should be aware of the risks of selection bias, as with any other
classification technique. The examples above show that several options exist to address and hopefully
reduce such bias within the EITM process.
Concluding remarks
We propose an expert-informed hybrid approach to discovering document subsets belonging to
a researcher-defined master topic within large sets of text data. The expert-informed topic modeling
(EITM) approach combines the efficiency and scalability of unsupervised text analysis with the deep
knowledge of experts within the relevant social and substantive domain. It is a hybrid procedure in
that it involves the discovery of relevant topics as well as the classification of documents into these
topics. This means that the procedure is doubly open. First, it opens up the process of document set
discovery to inductive algorithm-driven discovery. Second, it allows for post-hoc input of expert
knowledge to validate and weight algorithmic results and guide document set classification decisions.
At its heart, the EITM approach is a post-hoc way of adding problem-specific structure to unsuper-
vised computational model output by drawing on expert knowledge. It thus can be thought of as
a “semi-supervised” approach to document set discovery.
EITM is the first validated method for (semi-)automated document discovery that does not require
detailed contextual knowledge on the part of the primary researchers. The approach is flexible in that it
can be applied to any given subject domain. Unlike deterministic keyword-based strategies, EITM
provides a probabilistic method for identifying relevant sub-topics that researchers were not pre-
viously aware of (e.g., specific public debates in specific social contexts at a specific time) and text
documents that deal with these topics. The EITM approach therefore is particularly helpful
in situations in which researchers lack comprehensive domain knowledge such as cross-nationally
comparative studies or studies of highly specialized expert or subcultural discourses. It is also flexible
in that it makes a valid topic-based identification of document subsets from unfamiliar knowledge
domains more efficient in research projects, irrespective of their degree of tolerance for error in
document classification.
One possible alternative to surveying experts for the generation of debate themes and keywords may
be to start with a set of keywords defined by the researcher and then refine this set using computer-aided
methods such as that proposed by King et al. (2017). This approach may work well in cases in which the
master topic of interest is signified by a set of keywords that has a high likelihood of occurring in each of
the initially unknown sub-debates of interest. It may work less well when the master topic of interest is
very broad and encompasses debates that may not share a “common denominator” of keywords. For
example, the master topic in our use case (the public role of religion in societal life) encompassed some
clearly relevant debates (e.g., the “Handschlag-Debatte” [handshake debate] in Switzerland, and the debates on same sex
marriage and on adoption rights for same-sex couples in the USA), and some articles on these debates
did not include generic references to the master topic. In such a constellation an expert-driven approach
may provide better coverage of all relevant debates compared to the researcher-driven, computer-
assisted approach by King and colleagues. In short, emic domain knowledge simply may be superior to
computational and researcher inputs in some situations that are commonly encountered by commu-
nication researchers, especially in cross-national studies of text-as-data.
More recently, word embeddings and neural networks have been proposed for text classification
(Rudkowsky et al., 2018; Turney & Pantel, 2010). Even though there has been recent work on the
interpretability of word embeddings (Jang & Myaeng, 2017; Şenel et al., 2018), one benefit of using
LDA for the initial clustering of the search space is that the results returned by LDA are more easily
interpreted by humans. This allows a manual selection of topics for subsequent classification. LDA outputs can thus serve as inductive starting points for identifying relevant and irrelevant topics and documents without requiring training data for supervised machine learning.
Moreover, its LDA-based approach makes EITM transparent from start to finish. The LDA clustering results help researchers gain insights into the specific full document set at hand. We regard this as an advantage over modern pre-trained models like BERT, which are trained on massive general text corpora (Devlin et al., 2019; Kaliyar, 2020). However, the EITM framework also, in principle, allows for the use of different clustering algorithms that may have distinct advantages in different research contexts.
Finally, once a researcher has selected a set of documents using EITM, they may use this set
together with the topic distribution generated by LDA to produce training data for machine learning
and iteratively increase the classification performance even further.
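The bootstrapping idea described above might look as follows. The simulated topic proportions, the choice of relevant topics and cutoff, and the logistic-regression classifier are all illustrative assumptions, not part of the EITM workflow as reported.

```python
# Illustrative sketch: use an EITM-selected document set as labeled training
# data and the LDA topic proportions as features for a supervised classifier.
# All numbers here are simulated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Assume a fitted topic model gave us per-document topic proportions
# (here: 200 documents, 10 topics, randomly simulated).
doc_topics = rng.dirichlet(np.ones(10), size=200)

# Assume topics 0 and 1 were judged relevant to the master topic; a document
# counts as selected (positive) if their combined share exceeds a cutoff.
relevance = doc_topics[:, [0, 1]].sum(axis=1)
labels = (relevance > 0.3).astype(int)

# Train a simple classifier on the topic-proportion features.
clf = LogisticRegression(max_iter=1000).fit(doc_topics, labels)
print("training accuracy:", clf.score(doc_topics, labels))
```

In a real application the labels would come from the EITM selection (or a manually validated subset of it) rather than a simulated cutoff, and the classifier could then score new, unseen documents.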
In projects where researchers aim for a reduction of error in automated text subset discovery, rather than its elimination, an EITM approach alone can achieve this goal. However, as shown above, in
projects where the researchers aim at achieving a maximally error-free topic-based identification of
document subsets, pre-classification of documents based on EITM-generated relevance scores
improves the efficiency of the process by reducing the average time needed for manual coders in
a follow-up classification task to identify thematically relevant items within the larger search set of
documents. The topics identified as relevant based on EITM are used to estimate for each document in
the search set the continuous probability of it belonging to the master topic. Researchers can then
define a probability threshold that must be met for any given document to be classified as relevant for
further analysis of any form. As we have shown, the EITM approach provides a substantively valid,
efficient, and flexible framework for document-set discovery under imperfect domain knowledge of
researchers. Here, we have validated it with corpora in three different languages (English, German, and Turkish).
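The relevance-scoring and thresholding step described above can be sketched in a few lines; the topic proportions, relevant-topic indices, and cutoff below are invented for illustration.

```python
# Sketch of document relevance scoring and thresholding. doc_topics holds
# per-document topic proportions from a fitted topic model; relevant_topics
# are the topic indices judged to belong to the master topic. All values are
# illustrative.
doc_topics = [
    [0.70, 0.20, 0.10],   # document 0
    [0.05, 0.15, 0.80],   # document 1
    [0.40, 0.35, 0.25],   # document 2
]
relevant_topics = [0, 1]  # topics judged relevant to the master topic
threshold = 0.6           # researcher-defined relevance cutoff

def relevance(topic_dist, relevant):
    """Estimated probability that a document belongs to the master topic:
    the summed proportions of its relevant topics."""
    return sum(topic_dist[k] for k in relevant)

selected = [i for i, d in enumerate(doc_topics)
            if relevance(d, relevant_topics) >= threshold]
print(selected)  # documents kept for further analysis → [0, 2]
```

Because the score is continuous, the threshold can be moved up or down depending on the project's tolerance for false positives versus false negatives.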
In future applications, the EITM process can easily be expanded to other languages and is open to
further extensions based on researcher needs and resources. For example, future applications may use
a two-stage expert survey design, in which experts not only provide open lists of topics and keywords
but are surveyed a second time to validate the final list of relevant topics to improve the resulting
estimates of document relevance probabilities. It is also possible to apply EITM not only to estimate
a single topic model per country as we did here but to make more fine-grained distinctions, for
example, by estimating one model per media type in each country, in order to allow for an even more
context-sensitive detection of topics and debates.
The proliferation of large, heterogeneous sets of media content and other politically relevant text data
will make the specific classification task that EITM solves more common. First, primary researchers will
not be able to know the discourses, media contexts or countries they aim to study deeply enough in
advance to be able to make valid document set choices upfront. This is a pressing issue in comparative
studies that reach beyond a small set of countries well known to the primary researchers. It is also
important in studies of specialized subcultures and professional milieus for which few prior insights
exist. In this sense, EITM opens the door to uncharted territory. Second, EITM’s continuous relevance
estimates allow for researcher-defined cutoffs and enhanced efficiency in selecting documents and thus
increase researcher flexibility. Finally, continuous relevance scores can also be used as inputs for more
refined substantive analyses, for example, by comparing item-level content features in various brackets of
topical relevance. In sum, EITM provides a new hybrid framework with multiple possible extensions to
enable further analysis in text-as-data studies. Importantly, it not only provides a way for social scientists
to improve the validity of their document discovery procedures, but also supports their learning about
social contexts they may want to study but have incomplete prior knowledge about.
Notes

1. There are also more minor differences that follow from the basic differences between the approaches. For
example, because it focuses on inductive learning about initially unknown debates in situations where the
researchers are lacking necessary domain knowledge, EITM draws on probabilistic topic models for the
unsupervised estimation of topics and mixed-membership classification of documents into these topics. In
contrast, the KLR approach generalizes after a researcher-selected “seed” search query to identify a reference
document set for training purposes and thus relies on supervised machine learning classifiers, which may
introduce biasing path dependencies into the process.
2. Experts were given the following instructions: “Please give a brief description of each topic that comes to your
mind, using the designated fields. In a second step, please enter for each topic 3–5 keywords that characterize the
topic and that are, in your best judgment, helpful in identifying news stories covering this topic.”
3. Using only expert-provided keywords is a conservative approach. Another possible approach would be to expand
the list of inputted keywords to synonyms. However, such expansion may introduce an additional need to
validate the resultant synonyms, which could be done by feeding potential synonyms to experts in a second
survey round. It is also important to note that an exhaustive list of keywords will be more important for
deterministic approaches based on search-queries. One benefit of using topic modeling for document discovery
is that, as a probabilistic approach, it will capture each term in its topic solution that is systematically associated
with the expert inputs, which means that we can expect EITM to function well even if the expert inputs are
lexically incomplete so long as they are thematically exhaustive. Similarly, the probabilistic process ensures that
topics that are well described with only a few keywords are not ranked lower than those with many relevant
keywords as such highly discriminant keywords will be assigned greater weight for their respective topic.
4. Australian blogs: 7,503 (round 1) + 274 (round 2) + 141 (round 3) = 7,918 coding acts; Swiss blogs: 1,635
(round 1) + 60 (round 2) + 32 (round 3) = 1,727 coding acts; Turkish blogs: 3,565 (round 1) + 318 (round 2) + 222
(round 3) = 4,105 coding acts.
5. Note that both the speed-up and the efficiency-gain analyses assess the acceleration of the document discovery process; they do not amount to a comprehensive cost-benefit analysis, which would include the effort involved in the other steps of the EITM process, such as conducting the expert survey.
Disclosure statement
We have no known conflict of interest to disclose.
Funding

This research was funded by the German Research Foundation (DFG) Grant WE 2888/6-1.
Eike Mark Rinke
Timo Dobbrick
Charlotte Löb
Cäcilia Zirn
Hartmut Wessler
References

Barberá, P., Boydstun, A. E., Linn, S., McMahon, R., & Nagler, J. (2021). Automated text classification of news articles:
A practical guide. Political Analysis, 29(1), 19–42.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and
Psychological Measurement, 41(3), 687–699.
Brier, A., & Hopp, B. (2010). Computer assisted text analysis in the social sciences. Quality & Quantity, 45(1), 103–128.
Denny, M. J., & Spirling, A. (2018). Text preprocessing for unsupervised learning: Why it matters, when it misleads, and
what to do about it. Political Analysis, 26(2), 168–189.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for
language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. https://doi.
Eshbaugh-Soha, M. (2010). The tone of local presidential news coverage. Political Communication, 27(2), 121–140.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
Gentzkow, M., & Shapiro, J. M. (2010). What drives media slant? Evidence from U.S. daily newspapers. Econometrica, 78
(1), 35–71.
Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for
political texts. Political Analysis, 21(3), 267–297.
Guo, L., Vargo, C. J., Pan, Z., Ding, W., & Ishwar, P. (2016). Big social data analytics in journalism and mass
communication: Comparing dictionary-based text analysis and unsupervised topic modeling. Journalism & Mass
Communication Quarterly, 93(2), 332–359.
Jacobi, C., van Atteveldt, W., & Welbers, K. (2016). Quantitative analysis of large amounts of journalistic texts using
topic modelling. Digital Journalism, 4(1), 89–106.
Jang, K.-R., & Myaeng, S.-H. (2017). Elucidating conceptual properties from word embeddings. Proceedings of the 1st
Workshop on Sense, Concept and Entity Representations and Their Applications, Valencia, Spain, 91–95. https://doi.
Japkowicz, N., & Shah, M. (2014). Evaluating learning algorithms: A classification perspective. Cambridge University Press.
Kaliyar, R. K. (2020). A multi-layer bidirectional transformer encoder for pre-trained word embedding: A survey of
BERT. 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India,
King, G., Lam, P., & Roberts, M. E. (2017). Computer-assisted keyword and document set discovery from unstructured
text. American Journal of Political Science, 61(4), 971–988.
Kopecký, P., Sahling, J.-H. M., Panizza, F., Scherlis, G., Schuster, C., & Spirova, M. (2016). Party patronage in
contemporary democracies: Results from an expert survey in 22 countries from five regions. European Journal of
Political Research, 55(2), 416–431.
Krippendorff, K. (2018). Content analysis: An introduction to its methodology (4th ed.). Sage.
Landers, R. N., Brusso, R. C., Cavanaugh, K. J., & Collmus, A. B. (2016). A primer on theory-driven web scraping:
Automatic extraction of big data from the Internet for use in psychological research. Psychological Methods, 21(4),
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In M.-F. Moens & S. Szpakowicz (Eds.),
Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, (pp. 74–81). Association
for Computational Linguistics.
Mahdavi, P. (2019). Scraping public co-occurrences for statistical network analysis of political elites. Political Science
Research and Methods, 7(2), 385–392.
Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G., Reber, U.,
Häussler, T., Schmid-Petri, H., & Adam, S. (2018). Applying LDA topic modeling in communication research:
Toward a valid and reliable methodology. Communication Methods and Measures, 12(2–3), 93–118.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
Mayring, P. (2014). Qualitative content analysis: Theoretical foundation, basic procedures and software solution. http://
Newman, D., Bonilla, E. V., & Buntine, W. (2011). Improving topic coherence with regularized topic models. In
J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds..), Advances in Neural Information
Processing Systems 24 (pp. 496–504). Curran Associates, Inc.
Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. Human Language
Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational
Linguistics, Los Angeles, CA, 100–108.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine
translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA,
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the LREC
2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, 45–50.
Rinke, E. M., Löb, C., Dobbrick, T., & Wessler, H. (2019). Mannheim International News Discourse Data Set (MIND)
[Data set].
Rudkowsky, E., Haselmayer, M., Wastian, M., Jenny, M., Emrich, Š., & Sedlmair, M. (2018). More than bags of words:
Sentiment analysis with word embeddings. Communication Methods and Measures, 12(2–3), 140–157. https://doi.
Schäfer, M. S., Ivanova, A., & Schmidt, A. (2014). What drives media attention for climate change? Explaining issue
attention in Australian, German and Indian print media from 1996 to 2010. International Communication Gazette, 76
(2), 152–176.
Şenel, L. K., Utlu, İ., Yücesoy, V., Koç, A., & Çukur, T. (2018). Semantic structure and interpretability of word
embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10), 1769–1779. https://doi.
Trilling, D., & Jonkman, J. G. F. (2018). Scaling up content analysis. Communication Methods and Measures, 12(2–3),
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial
Intelligence Research, 37, 141–188.
van Atteveldt, W., & Peng, T.-Q. (2018). When communication meets computation: Opportunities, challenges, and
pitfalls in computational communication science. Communication Methods and Measures, 12(2–3), 81–92. https://
Whitaker, E. M. (2017). Emic and etic analysis. In B. S. Turner (Ed.), The Wiley-Blackwell encyclopedia of social theory
(pp. 1–2). Wiley-Blackwell.