Gregor Wiedemann

Leibniz-Institut für Medienforschung | Hans-Bredow-Institut (HBI) · Media Research Methods Lab

Dr.-Ing. (Computer Science), M.A. (Political Science)
NLP for the social and communication sciences, senior researcher at Leibniz-Institute for Media Research

About

85
Publications
28,986
Reads
1,809
Citations
Introduction
Working at the intersection of social science and computer science with regard to research methods, especially on natural language processing for qualitative data analysis.
Additional affiliations
September 2020 - present
Leibniz-Institut für Medienforschung | Hans-Bredow-Institut (HBI)
Position
  • Senior Researcher
Description
  • Head of the Media Research Methods Lab, Coordinator of the project "Social Media Observatory" within the Research Institute for Social Cohesion (FGZ/RISC)
July 2018 - July 2018
GESIS - Leibniz-Institute for the Social Sciences
Position
  • Lecturer
Description
  • Teaching of the Course: "Big Data module II - Text Mining for Social Scientists" (1 week, full-time)
April 2018 - July 2018
Hamburg University
Position
  • Lecturer
Description
  • Co-teaching summer semester course "Text Mining" (BA level)

Publications (85)
Preprint
Few-shot learning and parameter-efficient fine-tuning (PEFT) are crucial to overcome the challenges of data scarcity and ever growing language model sizes. This applies in particular to specialized scientific domains, where researchers might lack expertise and resources to fine-tune high-performing language models to nuanced tasks. We propose PETap...
Conference Paper
Solidarity is a crucial concept to understand social relations in societies. In this study, we investigate the frequency of (anti-)solidarity towards women and migrants in German parliamentary debates between 1867 and 2022. Using 2,864 manually annotated text snippets, we evaluate large language models (LLMs) like Llama 3, GPT-3.5, and GPT-4. We fi...
Conference Paper
Full-text available
Argument mining usually operates on short, decontextualized argumentative units such as main and subordinate clauses, or full sentences as proxies for arguments. Argumentation in digital media environments, however, is embedded in larger contexts. Especially on social media platforms, argumentation unfolds in dialog threads or tree structures wher...
Article
Full-text available
Pre-trained language models (PLM) based on transformer neural networks developed in the field of natural language processing (NLP) offer great opportunities to improve automatic content analysis in communication science, especially for the coding of complex semantic categories in large datasets via supervised machine learning. However, three charac...
Article
Full-text available
In recent years, Telegram has become a relevant part of the political public sphere. The instant messaging service supports both interpersonal exchange and the distribution of information to audiences of almost any size. At the same time, it has so far been comparatively successful in evading all attempts at regulation,...
Article
Full-text available
This article describes the basic concept, ethical and legal considerations, technical implementation as well as resulting tools and data collections of the Social Media Observatory (SMO). Since 2020, the SMO has been developed as an open science research infrastructure within the Research Institute Social Cohesion (RISC) in Germany. It focuses on (the su...
Conference Paper
Full-text available
We investigate the semantic retrieval potential of pre-trained contextualized word embeddings (CWEs) such as BERT, in combination with explicit linguistic information, for various NLP tasks in an information retrieval setup. In this paper, we compare different strategies to aggregate contextualized word embeddings along lexical, syntactic, or gramm...
Article
Full-text available
This software review provides a systematic overview of R packages for topic modeling. These packages facilitate the application of computational text analysis to conduct research compatible with a wide variety of methodological frameworks employed in the social and communication sciences. For this overview, the analysis process is divided into four...
Conference Paper
Full-text available
Protest events provide information about social and political conflicts, the state of social cohesion and democratic conflict management, as well as the state of civil society in general. Social scientists are therefore interested in the systematic observation of protest events. With this paper, we release the first German language resource of prot...
Conference Paper
Full-text available
We approach aspect-based argument mining as a supervised machine learning task to classify arguments into semantically coherent groups referring to the same defined aspect categories. As an exemplary use case, we introduce the Argument Aspect Corpus-Nuclear Energy that separates arguments about the topic of nuclear energy into nine major aspects. S...
Chapter
Full-text available
For automatic content analysis, topic models became an increasingly popular research method to reveal thematic structures of large document collections. However, research interests often go beyond topics that are limited to broad discourse-level semantics. On a more fine-grained level, it is also of interest that arguments, stances, frames, or disc...
Conference Paper
Full-text available
To ease the difficulty of argument stance classification, the task of same side stance classification (S3C) has been proposed. In contrast to actual stance classification, which requires a substantial amount of domain knowledge to identify whether an argument is in favor or against a certain issue, it is argued that, for S3C, only argument similari...
Preprint
Full-text available
This article introduces the interactive Leipzig Corpus Miner (iLCM), a newly released, open-source software to perform automatic content analysis. Since the iLCM is based on the R programming language, its generic text mining procedures provided via a user-friendly graphical user interface (GUI) can easily be extended using the integrated IDE R...
Article
Full-text available
In recent years, (retro-)digitizing paper-based files became a major undertaking for private and public archives as well as an important task in electronic mailroom applications. As first steps, the workflow usually involves batch scanning and optical character recognition (OCR) of documents. In the case of multi-page documents, the preservation of...
Chapter
Full-text available
Social networks such as Facebook give their users the opportunity to discuss the content of traditional mass media that is frequently linked there. In the process, people with very different political attitudes encounter one another. Increasingly, this leads to discriminatory comments, which are contradicted with counter-speech. The article...
Article
Full-text available
Topic modeling enables researchers to explore large document corpora. Large corpora, however, can be extremely costly to model in terms of time and computing resources. In order to circumvent this problem, two techniques have been suggested: (1) to model random document samples, and (2) to prune the vocabulary of the corpus. Although frequently...
Article
Full-text available
Two different perspectives on argumentation have been pursued in computer science research, namely approaches of argument mining in natural language processing on the one hand, and formal argument evaluation on the other hand. So far these research areas are largely independent and unrelated. This article introduces the agenda of our recently start...
Article
Full-text available
The open REFI-QDA standard allows for the exchange of entire projects from one QDA software to another, on condition that software vendors have built the standard into their software. To reveal the new opportunities emerging from overcoming QDA software lock-in, we describe an experiment with the standard in four separate research projects done by...
Preprint
Full-text available
Fine-tuning of pre-trained transformer networks such as BERT yields state-of-the-art results for text classification tasks. Typically, fine-tuning is performed on task-specific training datasets in a supervised manner. One can also fine-tune in an unsupervised manner beforehand by further pre-training the masked language modeling (MLM) task. Hereby, in...
Article
Full-text available
Insight problem solving has been conceptualized as a dynamic search through a constrained search space where a non-obvious solution needs to be found. Multiple sources of task difficulty have been defined that can keep the problem solver from finding the right solution such as an overly large search space or knowledge constraints requiring a change...
Preprint
Full-text available
Topic modeling enables researchers to explore large document corpora. Large corpora, however, can sometimes be extremely costly to model in terms of time and computing resources. In order to circumvent this problem, two techniques have been suggested, that is (1) to model random document samples and (2) to prune the vocabulary of the corpus. Althou...
Conference Paper
Full-text available
Contextualized word embeddings (CWE) such as provided by ELMo (Peters et al., 2018), Flair NLP (Akbik et al., 2018), or BERT (Devlin et al., 2019) are a major recent innovation in NLP. CWEs provide semantic vector representations of words depending on their respective context. Their advantage over static word embeddings has been shown for a number...
Preprint
Full-text available
Contextualized word embeddings (CWE) such as provided by ELMo (Peters et al., 2018), Flair NLP (Akbik et al., 2018), or BERT (Devlin et al., 2019) are a major recent innovation in NLP. CWEs provide semantic vector representations of words depending on their respective context. Their advantage over static word embeddings has been shown for a number...
Conference Paper
Full-text available
De-identification is the task of detecting protected health information (PHI) in medical text. It is a critical step in sanitizing electronic health records (EHRs) to be shared for research. Automatic de-identification classifiers can significantly speed up the sanitization process. However, obtaining a large and diverse dataset to train such a cla...
Conference Paper
Full-text available
We contrast three views of how words contribute to a listener's understanding of a sentence and compare corresponding quantitative models of how the listener's probabilistic prediction on sentence completion is affected in online comprehension. The Semantic Similarity Model presupposes that the predictor of a word given a preceding discourse is the...
Conference Paper
Full-text available
We present a neural network based approach of transfer learning for offensive language detection. For our system, we compare two types of knowledge transfer: supervised and unsupervised pre-training. Supervised pre-training of our bidirectional GRU-3-CNN architecture is performed as multi-task learning of parallel training of five different tasks....
Preprint
Full-text available
De-identification is the task of detecting protected health information (PHI) in medical text. It is a critical step in sanitizing electronic health records (EHRs) to be shared for research. Automatic de-identification classifiers can significantly speed up the sanitization process. However, obtaining a large and diverse dataset to train such a clas...
Article
Supervised machine learning is a promising methodological innovation for content analysis (CA) to approach the challenge of ever-growing amounts of text in the digital era. Social scientists have pointed to accurate measurement of category proportions and trends in large collections as their primary goal. Proportional classification, for example, a...
Chapter
While corpus linguistic approaches such as key term extraction and collocation analysis are already part of the toolbox in some branches of discourse analysis, social scientists only recently became aware of the many opportunities provided by text mining. This contribution introduces unsupervised and supervised machine learning techniques to analys...
Preprint
Full-text available
We investigate different strategies for automatic offensive language classification on German Twitter data. For this, we employ a sequentially combined BiLSTM-CNN neural network. Based on this model, three transfer learning tasks to improve the classification performance with background knowledge are tested. We compare 1. Supervised category transf...
Preprint
Full-text available
For named entity recognition (NER), bidirectional recurrent neural networks became the state-of-the-art technology in recent years. Competing approaches vary with respect to pre-trained word embeddings as well as models for character embeddings to represent sequence information most effectively. For NER in German language texts, these model variati...
Conference Paper
Full-text available
We investigate different strategies for automatic offensive language classification on German Twitter data. For this, we employ a sequentially combined BiLSTM-CNN neural network. Based on this model, three transfer learning tasks to improve the classification performance with background knowledge are tested. We compare 1. Supervised category transf...
Chapter
Full-text available
Investigative journalism in recent years is confronted with two major challenges: (1) vast amounts of unstructured data originating from large text collections such as leaks or answers to Freedom of Information requests, and (2) multi-lingual data due to intensified global cooperation and communication in politics, business and civil society. Faced...
Conference Paper
Full-text available
For named entity recognition (NER), bidirectional recurrent neural networks became the state-of-the-art technology in recent years. Competing approaches vary with respect to pre-trained word embeddings as well as models for character embeddings to represent sequence information most effectively. For NER in German language texts, these model variati...
Preprint
Full-text available
We introduce an advanced information extraction pipeline to automatically process very large collections of unstructured textual data for the purpose of investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organi...
Preprint
Full-text available
Investigative journalism in recent years is confronted with two major challenges: 1) vast amounts of unstructured data originating from large text collections such as leaks or answers to Freedom of Information requests, and 2) multi-lingual data due to intensified global cooperation and communication in politics, business and civil society. Faced w...
Preprint
Insight problem solving has been conceptualized as a dynamic search through a constrained search space where a non-obvious solution needs to be found. Multiple sources of task difficulty have been defined that can keep the problem solver from finding the right solution such as an overly large search space or knowledge constraints requiring a change...
Conference Paper
Full-text available
In recent years, (retro-)digitizing paper-based files became a major undertaking for private and public archives as well as an important task in electronic mailroom applications. As a first step, the workflow involves scanning and Optical Character Recognition (OCR) of documents. Preservation of document contexts of single page scans is a major req...
Conference Paper
Full-text available
The iLCM project pursues the development of an integrated research environment for the analysis of structured and unstructured data in a "Software as a Service" architecture (SaaS). The research environment addresses requirements for the quantitative evaluation of large amounts of qualitative data with text mining methods as well as requirements fo...
Conference Paper
Full-text available
The iLCM project pursues the development of an integrated research environment for the analysis of structured and unstructured data in a "Software as a Service" architecture (SaaS). The research environment addresses requirements for the quantitative evaluation of large amounts of qualitative data with text mining methods as well as requirements fo...
Preprint
Full-text available
The iLCM project pursues the development of an integrated research environment for the analysis of structured and unstructured data in a "Software as a Service" architecture (SaaS). The research environment addresses requirements for the quantitative evaluation of large amounts of qualitative data with text mining methods as well as requirements fo...
Article
Full-text available
Social media are an emerging new paradigm in interdisciplinary research in crisis informatics. They bring many opportunities as well as challenges to all fields of application and research involved in the project of using social media content for an improved disaster management. Using the Central European flooding 2013 as our case study, we optimiz...
Article
Latent Dirichlet allocation (LDA) topic models are increasingly being used in communication research. Yet, questions regarding reliability and validity of the approach have received little attention thus far. In applying LDA to textual data, researchers need to tackle at least four major challenges that affect these criteria: (a) appropriate pre-pr...
Preprint
Full-text available
For digitization of paper files via OCR, preservation of document contexts of single scanned images is a major requirement. Page stream segmentation (PSS) is the task to automatically separate a stream of scanned images into multi-page documents. This can be immensely helpful in the context of "digital mailrooms" or retro-digitization of large pape...
Conference Paper
Full-text available
The article introduces our concept and experiences of teaching text mining in R to humanists and social scientists in a one-week course. We teach methods to support the entire analysis workflow, from data import and conversion, basic (linguistic) preprocessing to actual analysis such as key-term extraction, co-occurrence statistics, topic models an...
Book
Full-text available
Gregor Wiedemann evaluates text mining applications for social science studies with respect to conceptual integration of consciously selected methods, systematic optimization of algorithms and workflows, and methodological reflections relating to empirical research. In an exemplary study, he introduces workflows to analyze a corpus of around 600,00...
Chapter
Chapter 3 has introduced a selection of Text Mining (TM) procedures and integrated them into a complex workflow to analyze large quantities of textual data for social science purposes. In Chapter 4 this workflow has been applied to a corpus of two newspapers to answer a political science question on the development of democratic discourses in Germa...
Chapter
Although there is a long tradition of Computer Assisted Text Analysis (CATA) in the social sciences, it developed largely in parallel to QDA. Only a few years ago, awareness of the potential of TM for QDA slowly started to emerge. In this chapter, I reflect on the debate on the use of software in qualitative social science research together with ap...
Chapter
In light of recent research debates on computational social science and Digital Humanities (DH) as by now adolescent disciplines dealing with big data (Reichert, 2014), I sought to answer in which ways Text Mining (TM) applications are able to support Qualitative Data Analysis (QDA) in the social sciences in a manner that fruitfully inte...
Chapter
The last chapter has already demonstrated that Text Mining (TM) applications can be a valid approach to social science research questions and that existing studies employ single TM procedures to investigate larger text collections. However, to benefit most effectively from the use of TM and to be able to develop complex research designs meeting req...
Chapter
The Text Mining (TM) workflows presented in the previous chapter provided a variety of results which will be combined in the following to a comprehensive study on democratic demarcation in Germany. The purpose of this chapter is to present an example of how findings from the introduced set of TM applications on large text collections contribute to...
Chapter
Digitalization and informatization of science during the last decades have widely transformed the ways in which empirical research is conducted in various disciplines. Computer-assisted data collection and analysis procedures even led to the emergence of new subdisciplines such as bioinformatics or medical informatics. The humanities (including soc...
Conference Paper
Full-text available
In terminology work, natural language processing, and digital humanities, several studies address the analysis of variations in context and meaning of terms in order to detect semantic change and the evolution of terms. We distinguish three different approaches to describe contextual variations: methods based on the analysis of patterns and linguis...
Conference Paper
Full-text available
http://www.dhd2016.de/abstracts/sektionen-001.html For the analysis of large amounts of qualitative text data, the social sciences have various conventional and innovative methods of content and discourse analysis at their disposal. Methodologically, classical social science content analysis can, with methods of supervised machine...
Chapter
Full-text available
When studying online communication, researchers are confronted with vast amounts of unstructured text data and experience severe limitations to the established methods of manual quantitative content analysis. Text mining methods developed in computational natural language processing (NLP) allow the automatic capture of semantics in massive populati...
Poster
Full-text available
The identification of well-defined categorical contents in text is a typical task in methods of content and discourse analysis. The combination of social science-related content analysis and (semi)-automatic text classification from natural language processing in the form of an active learning process can be seen as an innovative and new approach...
Book
Using theoretical reflections, concrete instructions, and individual case studies, this volume introduces the fundamentals of using text mining methods in the social sciences. Insofar as society, and thus also politics, is linguistically mediated for the participating actors, the analysis of language allows inferences about...
Chapter
Full-text available
See more at: http://research.europeana.eu/blogpost/extending-the-method-toolbox-text-mining-for-social-science-and-humanities-research
Chapter
Qualitative methods that are meant to enable statements about social reality through the analysis of texts undoubtedly belong to the contemporary canon of social science research (cf. Stulpe / Lemke in this volume). The sociology of knowledge and hermeneutics are just as relevant concepts as grounded theory, discourse analysis, or...
Chapter
The Leipzig Corpus Miner (LCM) is a web application that bundles various text mining methods for the analysis of large amounts of qualitative data. Through an easy-to-use user interface, the LCM provides full-text access to 3.5 million newspaper texts, which can be filtered into subcollections by search terms and metadata...
Chapter
This chapter summarizes the results of the case studies from Part II of the volume. It becomes clear that the use of text mining in qualitative social research offers the opportunity to productively resolve the opposition between quality and quantity in cases where large amounts of data are available. If, beyond purely data-driven...
Article
Due to the lack of canonical heuristics in the digital humanities, social science research using Text Mining tools requires a blended reading approach. Integrating quantitative and qualitative analyses of complex textual data progressively, blended reading brings up various requirements for the implementation of Text Mining infrastructures. The ar...
Conference Paper
Full-text available
This paper presents a procedure to retrieve subsets of relevant documents from large text collections for Content Analysis, e.g. in social sciences. Document retrieval for this purpose needs to take account of the fact that analysts often cannot describe their research objective with a small set of key terms, especially when dealing with theoretic...
Conference Paper
Full-text available
This paper presents the "Leipzig Corpus Miner", a technical infrastructure for supporting qualitative and quantitative content analysis. The infrastructure aims at the integration of "close reading" procedures on individual documents with procedures of "distant reading", e.g. lexical characteristics of large document collections. Therefore informati...
Chapter
Full-text available
This chapter examines the educational work of the German domestic intelligence service "Verfassungsschutz". It explains the historical development of the concept "Verfassungsschutz durch Aufklärung" (protection of the constitution through education). Using a case example, the "Planspiel Extremismus und Demokratie" (simulation game on extremism and democracy), problematic aspects of the intelligence service's educational work...
Article
Full-text available
Two developments in computational text analysis may change the way qualitative data analysis in social sciences is performed: 1. the availability of digital text worth investigating is growing rapidly, and 2. the improvement of algorithmic information extraction approaches, also called text mining, allows for further bridging the gap between quali...
Article
Two developments in computational text analysis may change the way qualitative data analysis in social sciences is performed: 1. the availability of digital text worth investigating is growing rapidly, and 2. the improvement of algorithmic information extraction approaches, also called text mining, allows for further bridging the gap between quali...
Article
Full-text available
Securitization policies in recent years have strengthened critical discourses on surveillance. Nonetheless, it seems that the critique does not lead to pivotal changes in politics or individual behavior considering privacy issues. In contrast to the trade-off between security and freedom proclaimed by the liberal critics of surveillance this articl...
Chapter
Calls to abandon the concept of extremism are often countered by pointing to a lack of alternatives. In contrast, this chapter uses an action plan for strengthening democratic culture drawn up for Leipzig to show by example how certain events, structures, and ideologies can be problematized without recourse to...
Book
Freedom dies with security! Or does it? If security policymakers had their way, almost any means would seem justified to stop acts of terrorism as early as the planning stage. Critics of state surveillance of internet and telephone data, by contrast, see "informational self-determination" as increasingly threatened. At the same time, many...