Andreas Witt’s research while affiliated with Leibniz Institute for the German Language and other places


Publications (90)


Fig. 1 Ratio of impact indicating sentences in our dataset, numbers in %.
Fig. 3 Distribution of impact categories among domains, in %.
Fig. 5 Most frequent words per impact category (absolute numbers).
Comparison of the performance of the classification models on the main categories across different methods, domains, and intensities (without impact-irrelevant sentences)
Accuracy and F1 of few-shot prompting in ChatGPT across different domains and intensities. The scores for high-intensity impact in the mobility domain are excluded due to an insufficient number of instances for comparison.
Impact Classification within and beyond Academia: Domain-Robust Annotation and the Capacity of Large Language Models
  • Preprint
  • File available

November 2024

·

15 Reads

·

[...]

·

Andreas Witt

Prior analyses and assessments of the impact of scientific research have mainly relied on analyzing its scope within academia and its influence within scholarly circles. However, by not considering the broader societal, economic, and policy implications of research projects, these studies overlook the ways in which scientific discoveries contribute to technological innovation, public health improvements, environmental sustainability, and other areas of real-world application. We expand upon this prior work by developing and validating a conceptual and computational solution to automatically identify and categorize the impact of scientific research within and especially beyond academia based on text data. We first empirically develop and evaluate an annotation schema to capture and classify the impact of research projects based on research reports from different scientific domains. We then annotate a large dataset of more than 45k sentences extracted from research reports for the developed impact categories. We examine the annotated dataset for patterns in the distribution of impact categories across different scientific domains, co-occurrences of impact categories, and signal words of impact. Using the annotated texts and the novel classification schema, we investigate the performance of large language models (LLMs) for automated impact classification. Our results show that fine-tuning the models on our annotated datasets statistically significantly outperforms zero- and few-shot prompting approaches. This indicates that state-of-the-art LLMs without fine-tuning may not work well for novel classification schemas such as our impact classification schema, and in turn highlights the importance of diligent manual annotations as an empirical basis in the field of computational social science.



Open Science and Language Data: Expectations vs. Reality: The Role of Research Data Infrastructures

September 2023

·

35 Reads

Proceedings of the Conference on Research Data Infrastructure

Language data are essential for any scientific endeavor. However, unlike numerical data, language data are often protected by copyright, as they easily meet the threshold of originality. The role of research infrastructures (such as CLARIN, DARIAH, and Text+) is to bridge the gap between uses allowed by statutory exceptions and the requirements of Open Science. This is achieved on the one hand by sharing language data produced by research organisations with the widest possible circle of persons, and on the other by mutualizing efforts towards copyright clearance and appropriate licensing of datasets.


WebLicht overview
The CLARIN infrastructure as an interoperable language technology platform for SSH and beyond

June 2023

·

87 Reads

·

5 Citations

Language Resources and Evaluation

CLARIN is a European Research Infrastructure Consortium developing and providing a federated and interoperable platform to support scientists in the field of the Social Sciences and Humanities in carrying out language-related research. This contribution provides an overview of the entire infrastructure with a particular focus on tool interoperability, ease of access to research data, tools and services, the importance of sharing knowledge within and across (national) communities, and community building. By taking into account FAIR principles from the very beginning, CLARIN has become a successful example of a research infrastructure that is actively used by its members. The benefits CLARIN members reap from their infrastructure secure a future for their common good that is both sustainable and attractive to partners beyond the original target groups.



Language Matters

October 2022

·

83 Reads

·

2 Citations

CLARIN stands for “Common Language Resources and Technology Infrastructure”. In 2012 CLARIN ERIC was established as a legal entity with the mission to create and maintain a digital infrastructure to support the sharing, use, and sustainability of language data (in written, spoken, or multimodal form) available through repositories from all over Europe, in support of research in the humanities and social sciences and beyond. Since 2016 CLARIN has had the status of Landmark research infrastructure and currently it provides easy and sustainable access to digital language data and also offers advanced tools to discover, explore, exploit, annotate, analyse, or combine such datasets, wherever they are located. This is enabled through a networked federation of centres: language data repositories, service centres, and knowledge centres with single sign-on access for all members of the academic community in all participating countries. In addition, CLARIN offers open access facilities for other interested communities of use, both inside and outside of academia. Tools and data from different centres are interoperable, so that data collections can be combined and tools from different sources can be chained to perform operations at different levels of complexity. The strategic agenda adopted by CLARIN and the activities undertaken are rooted in a strong commitment to the Open Science paradigm and the FAIR data principles. This also enables CLARIN to express its added value for the European Research Area and to act as a key driver of innovation and contributor to the increasing number of industry programmes running on data-driven processes and the digitalization of society at large.


Text+: Language- and text-based Research Data Infrastructure

April 2022

·

41 Reads

·

1 Citation

Text+ aims to develop a research data infrastructure for Humanities disciplines and beyond whose primary research focus is on language and text. Text+ will be flexible, scalable, and thus open for different discipline-specific requirements. By offering easy access to high quality research data, Text+ will support a maximum of methodological diversity, which in turn is a prerequisite for innovative and transdisciplinary research. Text+ focuses on Collections, Lexical Resources and Editions. These data domains have a long tradition of research and are linked to mature methodological paradigms that require distinctive but also cross-disciplinary practices of data generation, curation and management. The three types of research data are indispensable for a wide range of Humanities disciplines, including, but not limited to, Classical Philology, Linguistics, Literary Studies, Social and Cultural Anthropology, Non-European Cultures, Jewish Studies and Religious Studies, Philosophy, and language- and text-based research in the Social and Political Sciences. From the outset, 26 data centres will participate in Text+ that are technically sound and that are highly regarded in their fields of specialisation. They will provide data, tools, and services for the analysis and re-use of research data across a broad range of disciplines. By grouping data, tools, and services into thematic clusters, an optimal bundling is achieved. There are 34 institutions participating in Text+ that represent the communities addressed by Text+ as broadly as possible: research libraries, universities, Digital Humanities data centres as well as members of the Union of German Academies of Arts and Sciences and of the Leibniz Society. In addition, leading computing centres ensure robust and persistent operation of services for a distributed research data infrastructure. 
The high level of interest in Text+ is not only evidenced by the substantial in-kind contributions by the Text+ partner institutions, but is also documented by the more than 120 research-driven user stories and by the large number of letters of support from the communities of interest participating in Text+. At the heart of the governance structure are three scientific coordination committees for the data domains and one for the infrastructure. Their task is to continuously evaluate the portfolio of data, tools and services and to promote its further development according to the priorities of the participating disciplines in coordination with the infrastructure providers. The research data management strategy of Text+ is the core instrument for achieving the main objectives of Text+ in the NFDI context. It paves the way for the integration of data, tools and services into an infrastructure that meets relevant standards and implements the FAIR and CARE principles.


Figure 1: Cash and in-kind contributions by the member countries to the DARIAH ERIC in 2015 according to the DARIAH ERIC statutes, p.19 (note: figures have substantially changed since 2015)
Figure 2: Formation process of the CLARIN in-kind contribution and the important role of the CLARIN centres (Source: own presentation)
Figure 3: Process for the description of the DARIAH-DE in-kinds highlighting the collection process (Source: own presentation)
In-kind Contributions for two ERICs in one National Initiative: Practices and Experiences in CLARIAH-DE

September 2021

·

13 Reads

An ERIC (European Research Infrastructure Consortium) operates and provides research infrastructure offerings, usually within a disciplinary scope, for researchers regardless of their national or institutional background. Apart from direct funding, an ERIC is built upon the national contributions from partner countries, notably cash and in-kind contributions. This concept, the procedures, advantages and options for future enhancement are described in this paper from the German perspective focussing on the in-kinds. The paper is intended to support an informed discussion on the advancement of the in-kind contribution concept. The overall aim is to contribute to the efficiency and uptake of the ERIC's research infrastructure offerings.



Digital Research Infrastructure

March 2021

·

152 Reads

·

1 Citation

Digital research infrastructures can be divided into four categories: large equipment, IT infrastructure, social infrastructure, and information infrastructure. Modern research institutions often employ both IT infrastructure and information infrastructure, such as databases or large-scale research data. In addition, information infrastructure depends to some extent on IT infrastructure. In this paper, we discuss the IT, information, and legal infrastructure issues that research institutions face.


Citations (49)


... To enhance the uptake, visibility, operation, and transnational collaboration of K-centres, CLARIN developed a variety of initiatives. Their list and detailed descriptions can be found on the CLARIN website under the tile "Learn & Exchange" as well as in overview articles by Jong et al. (2022), Jong et al. (2018), and Branco et al. (2023) among others. We will briefly present some of these initiatives, grouping them according to their functions. ...

Reference:

Transnational Research Infrastructure: A Journey Through CLARIN Knowledge Centres
The CLARIN infrastructure as an interoperable language technology platform for SSH and beyond

Language Resources and Evaluation

... It has different performance indicators in six perspectives: competitiveness, sustainable development, innovation, strategic knowledge partnership, human capital and internal business processes. Rezapour et al. propose a method for evaluating the impact of funded research beyond academia that takes into account social and economic aspects [44]. Schubert defined 12 measures that determine a research output: publications, citations, conference articles, international co-publications, professorial job offers, advisory services for companies, cooperation with companies, membership in advisory boards, number of doctoral titles, number of state doctoral theses, editorships, and scholarships [45]. ...

Beyond Citations: Corpus-based Methods for Detecting the Impact of Research Outcomes on Society

... To fill this gap, Witt et al. (2018) evaluated the impact of research projects beyond academia. Based on the analysis of scientific project reports, they empirically developed an alternative classification schema for categorizing impact, which distinguishes between monetary and non-monetary impact and includes the subcategories economic, technical, socio-cultural, political-legal, environmental impact and income for research institutions. ...

Impact of Scientific Research beyond Academia: An Alternative Classification Schema

... Various forms of statistical techniques are implemented in order to visualize the data patterns as well as trends of the text [3]. With the adoption of various sorts of parsing mechanisms as well as the inclusion of various linguistic characteristics, the operation of text mining is carried out [4]. It also involves a significant removal of data redundancy and inclusion of significant knowledge from the highly structured data. ...

Modeling, Learning, and Processing of Text Technological Data Structures
  • Citing Book
  • January 2012

Studies in Computational Intelligence

... For expressing such relationships, external references are used in XML and they can be seen to correspond to foreign key references in relational databases. In terms of external references, XML-based data can be reorganized into different kinds of hierarchies by XSLT, for example (Lemnitzer et al. 2013). The conceptual ER-model presented in this paper is more general than DTD and our database schema given in "Appendix". ...

Representing human and machine dictionaries in markup languages (SGML, XML)
  • Citing Book
  • January 2014

... The focus was placed on word-medial variation, since the corpus-linguistic preparation of the relevant cases requires the least effort and a selection had to be made for the sample in any case. 7 Further information on the corpus analysis platform KorAP used here can be found in Diewald et al. (2016) and here: <https://www.ids-mannheim.de/digspra/kl/projekte/korap/> (last accessed on 23. ...

KorAP Architecture – Diving in the Deep Sea of Corpus Data

... 6 This attention is driven by a variety of factors, including the desire to disambiguate queries as reflections of information needs 7 , differences arising from personal requirements 8 and varying levels of complexity of distinct tasks 9 – as evidenced by new tracks within the Special Interest Group on Information Retrieval (SIGIR) of the Association for Computing Machinery (ACM). 10 Document representation models can be based on two distinct perspectives (Mehler et al., 2010a): one focused on their meaning (or content) and the other focused on their genre-related function. In order to illustrate these two dimensions, consider the example of personal academic homepages (Rehm, 2002). ...

Introduction: Modeling, Learning and Processing of Text-Technological Data Structures

Studies in Computational Intelligence

... feldt and Sperberg-McQueen, 2001) and Generalized Ordered-Descendant Direct Acyclic Graphs (GODDAG, cf. McQueen and Huitfeldt, 2004) Multi-colored Trees (MCT, cf. Jagadish et al., 2004) or Delay Nodes (cf. Le Maitre, 2006). XCONCUR, formerly known as MuLaX (cf. Hilbert, 2005 and Hilbert et al., 2005) has been recently accompanied by XCONCUR-CL (cf. Schonefeld, 2007, Witt et al., 2007) as a constraint-based validation language. Although some of these approaches (e.g. LMNL, TexMECS, XCONCUR) support inline annotation of multiple annotation layers, these documents can get very complex when dealing with a large number of annotation layers. As a drawback, both, design and implementation of most of thes ...

On the lossless transformation of single-file, multi-layer annotations into multi-rooted trees
  • Citing Article
  • January 2007

... The policy and user management component Kustvakt provides several configurations related to user and policy management; for instance, it is possible to set up the default authorization scopes and the expiration period for authorization codes and access tokens (see Section 2.1). Moreover, default foundries for different annotation levels can be configured, as well as the behaviour of the query rewrite mechanism (Bański et al. 2014) which is fundamental to KorAP. ...

Access Control by Query Rewriting: the Case of KorAP