Gregor Wiedemann · Leibniz-Institut für Medienforschung | Hans-Bredow-Institut (HBI) · Media Research Methods Lab
Dr.-Ing. (Computer Science), M.A. (Political Science)
NLP for the social and communication sciences, senior researcher at Leibniz-Institute for Media Research
About
85 Publications · 28,986 Reads · 1,809 Citations
Introduction
Working at the intersection of social science and computer science with regard to research methods, especially on natural language processing for qualitative data analysis.
Publications (85)
Few-shot learning and parameter-efficient fine-tuning (PEFT) are crucial to overcome the challenges of data scarcity and ever-growing language model sizes. This applies in particular to specialized scientific domains, where researchers might lack the expertise and resources to fine-tune high-performing language models for nuanced tasks. We propose PETap...
Solidarity is a crucial concept to understand social relations in societies. In this study, we investigate the frequency of (anti-)solidarity towards women and migrants in German parliamentary debates between 1867 and 2022. Using 2,864 manually annotated text snippets, we evaluate large language models (LLMs) like Llama 3, GPT-3.5, and GPT-4. We fi...
Argument mining usually operates on short, decontextualized argumentative units such as main and subordinate clauses, or full sentences as proxies for arguments. Argumentation in digital media environments, however, is embedded in larger contexts. Especially on social media platforms, argumentation unfolds in dialog threads or tree structures wher...
Pre-trained language models (PLM) based on transformer neural networks developed in the field of natural language processing (NLP) offer great opportunities to improve automatic content analysis in communication science, especially for the coding of complex semantic categories in large datasets via supervised machine learning. However, three charac...
In recent years, Telegram has become a relevant part of the political public sphere. The instant messaging service supports both interpersonal exchange and the distribution of information to audiences of almost any size. At the same time, it has so far evaded all attempts at regulation comparatively successfully,...
This article describes the basic concept, ethical and legal considerations, and technical implementation, as well as the resulting tools and data collections, of the Social Media Observatory (SMO). Since 2020, the SMO has been developed as an open science research infrastructure within the Research Institute Social Cohesion (RISC) in Germany. It focuses on (the su...
We investigate the semantic retrieval potential of pre-trained contextualized word embeddings (CWEs) such as BERT, in combination with explicit linguistic information, for various NLP tasks in an information retrieval setup. In this paper, we compare different strategies to aggregate contextualized word embeddings along lexical, syntactic, or gramm...
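The aggregation of contextualized word embeddings described in this abstract can be illustrated with a minimal sketch. The four-dimensional token vectors below are invented stand-ins, not real model output, and mean-pooling is only one of the aggregation strategies the paper compares:

```python
# Hypothetical illustration: aggregating per-token contextualized vectors
# (e.g. as produced by BERT-style models) into one phrase-level vector.
# The vectors are made-up 4-dimensional stand-ins for real embeddings.

def mean_pool(token_vectors):
    """Average token vectors element-wise into a single phrase vector."""
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors)
            for i in range(dim)]

# Two tokens of a phrase, each with a (fake) contextualized embedding.
tokens = [
    [0.5, 1.0, 0.0, 0.25],   # vector for token "nuclear"
    [0.25, 0.0, 0.5, 0.75],  # vector for token "energy"
]
phrase_vector = mean_pool(tokens)
print(phrase_vector)  # [0.375, 0.5, 0.25, 0.5]
```

In a real retrieval setup, such pooled vectors would be compared by cosine similarity against query representations; max-pooling or syntax-guided selection of tokens are alternative aggregation choices.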
This software review provides a systematic overview of R packages for topic modeling. These packages facilitate the application of computational text analysis to conduct research compatible with a wide variety of methodological frameworks employed in the social and communication sciences. For this overview, the analysis process is divided into four...
Protest events provide information about social and political conflicts, the state of social cohesion and democratic conflict management, as well as the state of civil society in general. Social scientists are therefore interested in the systematic observation of protest events. With this paper, we release the first German language resource of prot...
We approach aspect-based argument mining as a supervised machine learning task to classify arguments into semantically coherent groups referring to the same defined aspect categories. As an exemplary use case, we introduce the Argument Aspect Corpus-Nuclear Energy that separates arguments about the topic of nuclear energy into nine major aspects. S...
For automatic content analysis, topic models became an increasingly popular research method to reveal thematic structures of large document collections. However, research interests often go beyond topics that are limited to broad discourse-level semantics. On a more fine-grained level, it is also of interest that arguments, stances, frames, or disc...
To ease the difficulty of argument stance classification, the task of same side stance classification (S3C) has been proposed. In contrast to actual stance classification, which requires a substantial amount of domain knowledge to identify whether an argument is in favor or against a certain issue, it is argued that, for S3C, only argument similari...
This article introduces the interactive Leipzig Corpus Miner (iLCM), a newly released, open-source software to perform automatic content analysis. Since the iLCM is based on the R programming language, its generic text mining procedures, provided via a user-friendly graphical user interface (GUI), can easily be extended using the integrated IDE R...
In recent years, (retro-)digitizing paper-based files has become a major undertaking for private and public archives, as well as an important task in electronic mailroom applications. As first steps, the workflow usually involves batch scanning and optical character recognition (OCR) of documents. In the case of multi-page documents, the preservation of...
Social networks such as Facebook offer their users the opportunity to discuss the content of traditional mass media, which is widely linked there. In doing so, people with very different political attitudes encounter one another. Discriminatory comments are increasingly common and are met with counter-speech. The article an...
Topic modeling enables researchers to explore large document corpora. Large corpora, however, can be extremely costly to model in terms of time and computing resources. In order to circumvent this problem, two techniques have been suggested: (1) to model random document samples, and (2) to prune the vocabulary of the corpus. Although frequently...
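The vocabulary-pruning technique mentioned in this abstract can be sketched in a few lines. This is an illustrative stand-in, not the paper's actual implementation; the document-frequency threshold `min_df` is an assumed pruning criterion:

```python
from collections import Counter

# Illustrative sketch: prune a corpus vocabulary by removing terms that
# occur in fewer than min_df documents, one way to cut topic modeling cost.

def prune_vocabulary(docs, min_df=2):
    """Keep only terms that occur in at least min_df documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    keep = {term for term, count in df.items() if count >= min_df}
    return [[t for t in doc if t in keep] for doc in docs]

docs = [["topic", "model", "corpus"],
        ["topic", "sample", "corpus"],
        ["vocabulary", "topic"]]
pruned = prune_vocabulary(docs, min_df=2)
print(pruned)  # [['topic', 'corpus'], ['topic', 'corpus'], ['topic']]
```

Rare terms contribute little to broad topic structure but inflate the vocabulary, so dropping them shrinks the document-term matrix the model has to fit.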
Two different perspectives on argumentation have been pursued in computer science research, namely approaches of argument mining in natural language processing on the one hand, and formal argument evaluation on the other hand. So far, these research areas have remained largely independent and unrelated. This article introduces the agenda of our recently start...
The open REFI-QDA standard allows for the exchange of entire projects from one QDA software to another, on condition that software vendors have built the standard into their software. To reveal the new opportunities emerging from overcoming QDA software lock-in, we describe an experiment with the standard in four separate research projects done by...
Fine-tuning of pre-trained transformer networks such as BERT yields state-of-the-art results for text classification tasks. Typically, fine-tuning is performed on task-specific training datasets in a supervised manner. One can also fine-tune in an unsupervised manner beforehand by further pre-training on the masked language modeling (MLM) task. Hereby, in...
Insight problem solving has been conceptualized as a dynamic search through a constrained search space where a non-obvious solution needs to be found. Multiple sources of task difficulty have been defined that can keep the problem solver from finding the right solution such as an overly large search space or knowledge constraints requiring a change...
Contextualized word embeddings (CWE) such as provided by ELMo (Peters et al., 2018), Flair NLP (Akbik et al., 2018), or BERT (Devlin et al., 2019) are a major recent innovation in NLP. CWEs provide semantic vector representations of words depending on their respective context. Their advantage over static word embeddings has been shown for a number...
De-identification is the task of detecting protected health information (PHI) in medical text. It is a critical step in sanitizing electronic health records (EHRs) to be shared for research. Automatic de-identification classifiers can significantly speed up the sanitization process. However, obtaining a large and diverse dataset to train such a cla...
We contrast three views of how words contribute to a listener's understanding of a sentence and compare corresponding quantitative models of how the listener's probabilistic prediction on sentence completion is affected in online comprehension. The Semantic Similarity Model presupposes that the predictor of a word given a preceding discourse is the...
We present a neural network based approach of transfer learning for offensive language detection. For our system, we compare two types of knowledge transfer: supervised and unsupervised pre-training. Supervised pre-training of our bidirectional GRU-3-CNN architecture is performed as multi-task learning of parallel training of five different tasks....
Supervised machine learning is a promising methodological innovation for content analysis (CA) to approach the challenge of ever-growing amounts of text in the digital era. Social scientists have pointed to accurate measurement of category proportions and trends in large collections as their primary goal. Proportional classification, for example, a...
While corpus linguistic approaches such as key term extraction and collocation analysis are already part of the toolbox in some branches of discourse analysis, social scientists only recently became aware of the many opportunities provided by text mining. This contribution introduces unsupervised and supervised machine learning techniques to analys...
We investigate different strategies for automatic offensive language classification on German Twitter data. For this, we employ a sequentially combined BiLSTM-CNN neural network. Based on this model, three transfer learning tasks to improve the classification performance with background knowledge are tested. We compare 1. Supervised category transf...
For named entity recognition (NER), bidirectional recurrent neural networks became the state-of-the-art technology in recent years. Competing approaches vary with respect to pre-trained word embeddings as well as models for character embeddings to represent sequence information most effectively. For NER in German language texts, these model variati...
Investigative journalism in recent years has been confronted with two major challenges: (1) vast amounts of unstructured data originating from large text collections such as leaks or answers to Freedom of Information requests, and (2) multi-lingual data due to intensified global cooperation and communication in politics, business, and civil society. Faced...
We introduce an advanced information extraction pipeline to automatically process very large collections of unstructured textual data for the purpose of investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organi...
Insight problem solving has been conceptualized as a dynamic search through a constrained search space where a non-obvious solution needs to be found. Multiple sources of task difficulty have been defined that can keep the problem solver from finding the right solution such as an overly large search space or knowledge constraints requiring a change...
The iLCM project pursues the development of an integrated research environment for the analysis of structured and unstructured data in a "Software as a Service" architecture (SaaS). The research environment addresses requirements for the quantitative evaluation of large amounts of qualitative data with text mining methods as well as requirements fo...
Social media are an emerging paradigm in interdisciplinary crisis informatics research. They bring many opportunities as well as challenges to all fields of application and research involved in using social media content for improved disaster management. Using the Central European flooding of 2013 as our case study, we optimiz...
Latent Dirichlet allocation (LDA) topic models are increasingly being used in communication research. Yet, questions regarding reliability and validity of the approach have received little attention thus far. In applying LDA to textual data, researchers need to tackle at least four major challenges that affect these criteria: (a) appropriate pre-pr...
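Challenge (a) above, appropriate pre-processing, can be made concrete with a minimal sketch of how such decisions shape the document-term input an LDA model actually sees. The tokenizer and stopword list here are deliberately simplistic stand-ins for real pipeline choices:

```python
from collections import Counter

# Minimal sketch: pre-processing decisions (lowercasing, punctuation
# stripping, stopword removal) determine the document-term counts that
# serve as LDA input. Stopword list and tokenizer are toy assumptions.

STOPWORDS = {"the", "a", "of", "and"}

def preprocess(text):
    tokens = [t.lower().strip(".,") for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]

docs = ["The rise of topic models.", "Topic models and the media."]
doc_term = [Counter(preprocess(d)) for d in docs]
print(sorted(doc_term[0].items()))  # [('models', 1), ('rise', 1), ('topic', 1)]
```

Changing any of these choices (e.g. a different stopword list, or stemming) yields a different input matrix, and thus potentially different topics, which is exactly why reliability and validity need explicit attention.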
For digitization of paper files via OCR, preservation of document contexts of single scanned images is a major requirement. Page stream segmentation (PSS) is the task of automatically separating a stream of scanned images into multi-page documents. This can be immensely helpful in the context of "digital mailrooms" or retro-digitization of large pape...
The article introduces our concept and experiences of teaching text mining in R to humanists and social scientists in a one week course. We teach methods to support the entire analysis workflow, from data import and conversion, basic (linguistic) preprocessing to actual analysis such as key-term extraction, co-occurrence statistics, topic models an...
Gregor Wiedemann evaluates text mining applications for social science studies with respect to conceptual integration of consciously selected methods, systematic optimization of algorithms and workflows, and methodological reflections relating to empirical research. In an exemplary study, he introduces workflows to analyze a corpus of around 600,00...
Chapter 3 has introduced a selection of Text Mining (TM) procedures and integrated them into a complex workflow to analyze large quantities of textual data for social science purposes. In Chapter 4 this workflow has been applied to a corpus of two newspapers to answer a political science question on the development of democratic discourses in Germa...
Although there is a long tradition of Computer-Assisted Text Analysis (CATA) in the social sciences, it has developed rather in parallel to QDA. Only a few years ago did the realization of TM potentials for QDA slowly start to emerge. In this chapter, I reflect on the debate about the use of software in qualitative social science research together with ap...
In light of recent research debates on computational social science and the Digital Humanities (DH) as by-now adolescent disciplines dealing with big data (Reichert, 2014), I sought to answer in which ways Text Mining (TM) applications are able to support Qualitative Data Analysis (QDA) in the social sciences in a manner that fruitfully inte...
The last chapter has already demonstrated that Text Mining (TM) applications can be a valid approach to social science research questions and that existing studies employ single TM procedures to investigate larger text collections. However, to benefit most effectively from the use of TM and to be able to develop complex research designs meeting req...
The Text Mining (TM) workflows presented in the previous chapter provided a variety of results, which are combined in the following into a comprehensive study on democratic demarcation in Germany. The purpose of this chapter is to present an example of how findings from the introduced set of TM applications on large text collections contribute to...
Digitalization and informatization of science during the last decades have widely transformed the ways in which empirical research is conducted in various disciplines. Computer-assisted data collection and analysis procedures even led to the emergence of new subdisciplines such as bioinformatics or medical informatics. The humanities (including soc...
In terminology work, natural language processing, and digital humanities, several studies address the analysis of variations in context and meaning of terms in order to detect semantic change and the evolution of terms. We distinguish three different approaches to describe contextual variations: methods based on the analysis of patterns and linguis...
http://www.dhd2016.de/abstracts/sektionen-001.html
For the analysis of large amounts of qualitative text data, the social sciences have various conventional and innovative methods of content and discourse analysis at their disposal. Classical social science content analysis can methodically be combined with methods of supervised machine...
When studying online communication, researchers are confronted with vast amounts of unstructured text data and experience severe limitations to the established methods of manual quantitative content analysis. Text mining methods developed in computational natural language processing (NLP) allow the automatic capture of semantics in massive populati...
The identification of well-defined categorical contents in text is a typical task in methods of content and discourse analysis. The combination of social science content analysis and (semi-)automatic text classification from natural language processing in the form of an active learning process can be seen as an innovative new approach...
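The active-learning idea in this abstract can be sketched with a toy uncertainty-sampling loop: from a pool of unlabeled texts, select for manual coding those the classifier is least certain about. The texts and predicted probabilities below are invented stand-ins for real classifier output:

```python
# Hedged sketch of uncertainty sampling in active learning: texts whose
# predicted probability is closest to 0.5 are sent to the human coder.
# Pool entries and probabilities are fabricated for illustration.

def select_for_annotation(pool, n=2):
    """pool: list of (text, predicted_probability) pairs."""
    return sorted(pool, key=lambda item: abs(item[1] - 0.5))[:n]

pool = [("text A", 0.97), ("text B", 0.52), ("text C", 0.10), ("text D", 0.45)]
queried = select_for_annotation(pool, n=2)
print([t for t, _ in queried])  # ['text B', 'text D']
```

After the analyst labels the queried texts, the classifier is retrained and the loop repeats, which typically reaches a usable category model with far fewer manual codings than random sampling.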
By means of theoretical reflections, concrete instructions, and individual case studies, the volume introduces the basics of using text mining methods in the social sciences. Insofar as society, and thus also politics, is linguistically mediated for the participating actors, the analysis of language allows inferences about...
See more at: http://research.europeana.eu/blogpost/extending-the-method-toolbox-text-mining-for-social-science-and-humanities-research
Qualitative methods that are meant to enable statements about social reality through the analysis of texts are undoubtedly part of the contemporary canon of social science research (cf. Stulpe / Lemke in this volume). Sociology of knowledge and hermeneutics are just as pertinent concepts as grounded theory, discourse analysis, or...
The Leipzig Corpus Miner (LCM) is a web application that bundles various text mining methods for the analysis of large amounts of qualitative data. Through an easy-to-use interface, the LCM provides full-text access to 3.5 million newspaper texts, which can be filtered into subcollections by search terms and metadata...
This contribution summarizes the results of the case studies from Part II of the volume. It becomes clear that the use of text mining in qualitative social research offers the chance to productively dissolve the opposition between quality and quantity in cases where large amounts of data are available. If, beyond purely data-driven...
Social science research using Text Mining tools requires, due to the lack of canonical heuristics in the digital humanities, a blended reading approach. Integrating quantitative and qualitative analyses of complex textual data progressively, blended reading brings up various requirements for the implementation of Text Mining infrastructures. The ar...
This paper presents a procedure to retrieve subsets of relevant documents from large text collections for Content Analysis, e.g. in social sciences. Document retrieval for this purpose needs to take account of the fact that analysts often cannot describe their research objective with a small set of key terms, especially when dealing with theoretic...
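The retrieval idea can be illustrated with a toy query-based ranking; plain TF-IDF scoring stands in here for the paper's more elaborate procedure, and the corpus and query terms are invented:

```python
import math
from collections import Counter

# Toy sketch: rank documents of a collection by TF-IDF relevance to a set
# of query terms in order to retrieve a subcollection for Content Analysis.

def tfidf_rank(query_terms, docs):
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term

    def score(doc):
        tf = Counter(doc)
        return sum(tf[t] * math.log(n / df[t]) for t in query_terms if df[t])

    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)

docs = [["democracy", "debate", "press"],
        ["football", "league", "press"],
        ["democracy", "election", "debate"]]
ranking = tfidf_rank(["democracy", "debate"], docs)
print(ranking)  # document indices, most relevant first
```

A single short query is exactly the limitation the paper addresses; in practice the term set would be expanded iteratively (e.g. from reference documents) rather than fixed up front.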
This paper presents the "Leipzig Corpus Miner", a technical infrastructure for supporting qualitative and quantitative content analysis. The infrastructure aims at the integration of "close reading" procedures on individual documents with procedures of "distant reading", e.g. lexical characteristics of large document collections. Therefore informati...
This contribution deals with the educational work of the German domestic intelligence service "Verfassungsschutz". It explains the historical development of the concept of "Verfassungsschutz durch Aufklärung" (protecting the constitution through education). Using the case example of the "Planspiel Extremismus und Demokratie" simulation game, problematic aspects of the intelligence service's educational work...
Two developments in computational text analysis may change the way qualitative data analysis in the social sciences is performed: (1) the availability of digital text worth investigating is growing rapidly, and (2) the improvement of algorithmic information extraction approaches, also called text mining, allows for further bridging the gap between quali...
Securitization policies in recent years have strengthened critical discourses on surveillance. Nonetheless, it seems that the critique does not lead to pivotal changes in politics or individual behavior concerning privacy issues. In contrast to the trade-off between security and freedom proclaimed by the liberal critics of surveillance, this articl...
Demands to abandon the concept of extremism are frequently countered by pointing to a lack of alternatives. In contrast, using the example of an action plan drawn up for Leipzig to strengthen democratic culture, this contribution shows how certain events, structures, and ideologies can be problematized without recourse to...
Freedom dies with security! Or does it? If security politicians have their way, almost any means seems justified to prevent acts of terrorism while they are still in the planning phase. Critics of state surveillance of internet and telephone data, by contrast, see "informational self-determination" as increasingly threatened. At the same time, many citizens...