
Maciej Ogrodniczuk
- PhD
- Head of Department at Institute of Computer Science, Polish Academy of Sciences
Universal Discourse | CLARIN-PL | DARIAH-PL | LLMs4EU | ENEOLI | ParlaMint | ParlaCAP | PLLuM | HIVE | Jasnopis
About
- Publications: 123
- Reads: 18,000
- Citations: 584
Introduction
Research interests: corpus linguistics, semantic description of Polish, annotation and resolution of reference and discourse relations, linguistic processing of parliamentary data
Current institution
Additional affiliations
March 2010 - present
Institute of Computer Science, Polish Academy of Sciences
Position
- Associate Professor | Head of Linguistic Engineering Group
Description
- Head of the Linguistic Engineering Group
February 2010 - May 2020
Institute of Computer Science, Polish Academy of Sciences
Position
- Professor (Associate)
Description
- Head of Linguistic Engineering Group
Education
February 2001 - June 2006
October 1995 - June 2000
Publications (123)
Jasnopis, an original application that measures text comprehensibility, was created in 2015. Over a decade it has grown from a research initiative into a project that not only keeps developing thanks to its scientific foundations but also responds to market needs. Today Jasnopis is an application and a team of experts who work on the tool itself and develop a diverse...
The paper presents the results of the ParlaMint II project, which comprise comparable corpora of parliamentary debates of 29 European countries and autonomous regions, covering at least the period from 2015 to 2022, and containing over 1 billion words. The corpora are uniformly encoded, contain rich metadata about their 24 thousand speakers, and ar...
This paper presents the Polish Discourse Corpus, a pioneering resource of this kind for Polish and the first corpus in Poland to employ the ISO standard for discourse relation annotation. The Polish Discourse Corpus adopts ISO 24617-8, a segment of the Language Resource Management-Semantic Annotation Framework (SemAF), which outlines a set of core...
The ParlaMint project, a CLARIN flagship initiative, seeks to standardize the representation of parliamentary data across diverse languages and regions. Version 3.0 of ParlaMint encompasses corpora from 26 European countries and autonomous regions, available for download and search under the CC-BY license. These corpora adhere to a common XML encod...
This paper presents efforts towards creating a tool for translating texts from Middle Polish into modern Polish. Archaic texts sourced from the CBDU digital library were translated into modern language using ChatGPT and the resulting parallel corpus was used to train a neural text-to-text model. We assessed the results using automatic metrics and p...
This paper explores a discourse relations annotation project carried out under the CLARIN-PL initiative, leveraging the ISO 24617-8 standard. The goal is to boost research interoperability and foster multilingual research. Our team of three linguist-annotators tackled the annotation of a corpus spanning several genres, including, e.g., literature...
This paper explores the performance of the T5 text-to-text transfer-transformer language model together with some other generative models on the task of generating keywords from abstracts of scientific papers. Additionally, we evaluate the possibility of transferring keyword extraction and generation models tuned on scientific text collections to l...
The quality of language technology (LT) for Polish has greatly improved recently, influenced by three independent trends. The first one is Poland-specific and concerns the increase in national funding of both scientific and R&D projects, resulting in the construction of The National Corpus of Polish and the development of the CLARIN-PL and DARIAH-P...
Recently proposed systems for open-domain question answering (OpenQA) require large amounts of training data to achieve state-of-the-art performance. However, data annotation is known to be time-consuming and therefore expensive to acquire. As a result, the appropriate datasets are available only for a handful of languages (mainly English and Chine...
This article juxtaposes the reflections on the expectations from plain language as declared by the participants of workshops delivered by the Jasnopis team with the status of linguistic research on phenomena such as linguistic awareness, linguistic norm, or attitudes towards language. These arguments serve the purpose of proposing a comprehensive a...
In The Digital Library of Polish and Poland-related Ephemeral Prints from the 16th, 17th and 18th Centuries a small fraction of items contains manually created Latin–Polish dictionaries explaining Latin fragments injected into Polish content. At the same time, rapid development of machine translation creates new opportunities for creating such dict...
The paper explores the relevance of the Text-To-Text Transfer Transformer language model (T5) for Polish (plT5) to the task of intrinsic and extrinsic keyword extraction from short text passages. The evaluation is carried out on the new Polish Open Science Metadata Corpus (POSMAC), which is released with this paper: a collection of 216,214 abstract...
The paper explores the idea of detecting and correcting post-OCR errors in a corpus of Polish scientific abstracts by first evaluating several available spellchecking approaches and then reusing one of the rule-based solutions to eliminate frequent errors most likely resulting from technical problems of the OCR process. The fine-tuning consisted in...
This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were supposed to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for...
We propose a text style transfer method for replacing vulgar expressions in Polish utterances with their non-vulgar equivalents while preserving the meaning of the text. We fine-tune three pre-trained language models on a newly created parallel corpus of vulgar/non-vulgar sentence pairs, then we evaluate style transfer accuracy, content preservatio...
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by organizers, using available data, and are evaluated according to pre-established procedures. It has been organized since 2017 and each year the winning systems become the state-of-...
This paper presents the ParlaMint corpora containing transcriptions of the sessions of the 17 European national parliaments with half a billion words. The corpora are uniformly encoded, contain rich meta-data about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples...
The paper presents an experiment intended to overcome the problem of searching for different spelling variants in old Polish prints. In the case of The Digital Library of Polish and Poland-related Ephemeral Prints from the 16th, 17th and 18th Centuries two concurrent layers of text (transliteration and transcription) underlying selected digital lib...
We introduce the Question Answering Challenge, a shared task organised at PolEval 2021. The task involves answering open-domain free-form questions in Polish through an automatic system, without human intervention or accessing external services. We describe the motivation behind the problem, explore various question types and formulations, and l...
The paper presents a series of experiments related to enhancing the content of digital library items with links to relevant Wikipedia entries that could offer the reader additional background information. Two methods of gathering such links are investigated: a Wikifier-based solution and search in Wikipedia using its integrated engine. The results...
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade h...
The article analyzes differences in the description of discourse relations in corpus research, in particular with reference to the use of discourse markers: expressions that tie together subsequent fragments of the text and provide information about the nature of these relations. The text presents three concepts of the description of explicitn...
The National Corpus of Polish emerged as a cumulative result of many years of work on large reference corpora by computer scientists and linguists in Poland. While its impact on research in linguistics, humanities and language technology is unquestionable and highly significant, the construction of the national corpus was halted in 2011. In the pap...
The paper presents two experiments related to enhancing the content of a digital library with data from external repositories. The concept involves three related resources: a digital library of Middle Polish prints where items are stored in image form, the same items in textual form in a linguistically annotated corpus, and a dictionary of Middle P...
The article presents current research on coreference resolution for Polish, from the development of a sufficiently general model of reference relations to the implementation of tools that use this model to automatically detect coreference in written texts. The task is accomplished using a corpus approach, with manual annotation of reference structures, verifica...
TED-Multilingual Discourse Bank, or TED-MDB, is a multilingual resource where TED talks are annotated at the discourse level in 6 languages (English, Polish, German, Russian, European Portuguese, and Turkish) following the aims and principles of PDTB. We explain the corpus design criteria, which have three main features: the linguistic characteristi...
This paper discusses the use of figurative language found in online blogs and used as source material for the research project Synamet, the aim of which is to create a semantically and grammatically annotated corpus of synaesthetic metaphors for Polish. Particular emphasis is laid upon the verbal synaesthesia found in the...
This paper presents the Multiservice platform and its integration with the CLARIN Language Resources Switchboard. Multiservice combines a set of offline natural language processing tools for the Polish language. It features, among others, disambiguating tagging, dependency parsing and coreference resolution. A demonstration version of the platform,...
This paper examines the portability of Stanford’s multi-pass rule-based sieve coreference resolution system to an inflectional language (Polish) with a different annotation scheme. The presented system is implemented in BART, a modular toolkit later adapted to the sieve architecture by Baumann et al. The sieves for Polish include processing of zero su...
The text is a journalistic attempt to outline further directions of work on the computational processing of Polish, in view of the intensive development of digital tools and resources for the Polish language and the tightening cooperation between Polish research centres engaged in computational linguistics. The author considers the most important topic to be the renewal...
The article attempts to frame directions for future work on computational processing of Polish in the face of recent intensive development of electronic tools and resources and close co-operation between Polish research centres involved in computational linguistics. The author regards renewing the work on the National Corpus of Polish as the most...
The History of the 17th and 18th c. Polish Language Laboratory, Institute of Polish Language, Polish Academy of Sciences, is in the process of creating two large databases: The Electronic Dictionary of the 17th−18th c. Polish and The Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772), the latter in cooperation with the Institute of...
In this paper we present a series of new methods of measuring the readability of Polish non-literary texts. Starting with a short discussion of previous approaches, we attempt to identify new factors influencing readability and propose a new formula taking into account various linguistic features of a text. We also implement two corpus-based me...
Poster from the conference Slavic Corpus Linguistics: The Historical Dimension, Tromsø, Norway, April 21–22, 2015. Title: In search of a method of automatic measurement of readability of informational texts. Summary: Numerous texts that are incomprehensible for some of their intended recipients can be found in the Polish public space. In many cases, the problem is related to the structure of the text, which can be measured objectively. Methods of determining the degree...
Numerous texts that are incomprehensible for some of their intended recipients can be found in the Polish public space. In many cases, the problem is related to the structure of the text, which can be measured objectively. Methods of determining the degree of readability of texts have been developed for many languages. The importance of this proble...
'Coreference' presents the specificities of reference, anaphora and coreference in Polish, establishes an identity-of-reference annotation model and presents the methodology used to create the corpus of Polish general nominal coreference. Various resolution approaches are presented, followed by their evaluation. By discussing the subsequent steps of building a c...
This paper describes the results of creating a shallow grammar of Polish capable of detecting multi-level nested nominal phrases, intended to be used as mentions in coreference resolution tasks. The work is based on an existing grammar developed for the National Corpus of Polish and evaluated on the manually annotated Polish Coreference Corpus.
This paper presents the first preliminary results of analysing the marks collected in the tables of the META-NET series of Language White Papers for the CESAR project languages. Although preliminary, the results are useful for showing where real gaps in language resources and tools can be detected.
This paper attempts a preliminary interpretation of the occurrence of different types of linguistic constructs in the manually-annotated Polish Coreference Corpus by providing analyses of various statistical properties related to mentions, clusters and near-identity links. Among others, frequency of mentions, zero subjects and singleton clusters is...
This article presents the Polish Summaries Corpus, a new resource created to support the development and evaluation of tools for automated single-document summarization of Polish. The Corpus contains a large number of manual summaries of news articles, with many independently created summaries for a single text. Such an approach is supposed to ove...
Digital libraries are frequently treated merely as a new method of storing digitized artifacts, with all the consequences of transferring long-established ways of dealing with physical objects into the digital world. Such an attitude improves availability, but often neglects other opportunities offered by global and immediate access, virtuality and linki...
Measuring the readability of a text is the first sensible step towards its simplification. In this paper we present an overview of the most common approaches to automatic measurement of readability. Of those described, we implemented and evaluated the Gunning FOG index and the Flesch-based Pisarek method. We also present two other approaches. The first one is based...
This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative's work throughout Europe in order to boost progress a...
This paper discusses different methods of estimating the inter-annotator agreement in manual annotation of Polish coreference and proposes a new BLANC-based annotation agreement metric. The commonly used agreement indicators are calculated for mention detection, semantic head annotation, near-identity markup and coreference resolution.
This book constitutes the refereed proceedings of the 9th International Conference on Advances in Natural Language Processing, PolTAL 2014, Warsaw, Poland, in September 2014. The 27 revised full papers and 20 revised short papers presented were carefully reviewed and selected from 83 submissions. The papers are organized in topical sections on morp...
The Polish Coreference Corpus (PCC) is a large corpus of Polish general nominal coreference built upon the National Corpus of Polish. With its 1900 documents from 14 text genres, containing about 540,000 tokens, 180,000 mentions and 128,000 coreference clusters, the PCC is among the largest coreference corpora in the international community. It has...
This paper reports on linguistic features and decisions that we find vital in the process of annotation and resolution of coreference for highly inflectional languages. The presented results have been collected during preparation of a corpus of general direct nominal coreference of Polish. Starting from the notion of a mention, its borders and pote...
This paper presents a new implementation of the multi-purpose set of NLP tools for Polish, made available online in a common web service framework. The tool set comprises a morphological analyzer, a tagger, a named entity recognizer, a dependency parser, a constituency parser and a coreference resolver. Additionally, a web application offering chai...
Creating a coreference corpus for an inflectional and free-word-order language is a challenging task due to specific syntactic features largely ignored by existing annotation guidelines, such as the absence of definite/indefinite articles (making quasi-anaphoricity very common), frequent use of zero subjects or discrepancies between syntactic and s...
This paper reports on a preliminary experiment aimed at verifying whether extraction of nominal facts corresponding to world knowledge from both structured and unstructured data could be effectively performed and its results used as a source of pragmatic knowledge for coreference resolution in Polish. Being a proof of concept only, this appr...
Creating a coreference resolution tool for a new language is a challenging task due to substantial effort required by development of associated linguistic data, regardless of rule-based or statistical nature of the approach. In this paper, we test the translation- and projection-based method for an inflectional language, evaluate the result on a co...
It has been recently discussed in linguistics that the notion of identity in the task of coreference resolution is of continuous nature, ranging from “complete” identity to non-identity. The current paper confronts this idea with experimental data for Polish, resulting in a new approach to the notion of identity. It extends the definition of corefe...
The ATLAS project, started in March 2010, intends to create a multilingual language processing framework integrating the common set of linguistic tools for a group of European languages, among them Polish. The chained tools producing multi-level UIMA-encoded annotation of texts can be used by NLP applications for complex language-intensive operatio...
Merging of Language Resources is not only a matter of mapping between different annotation schemata but also of linguistic tools coping with heterogeneous annotation formats in order to produce one single output. In this paper we present a web content management system, ATLAS, which succeeded in integrating and harmonizing resources and tools for six lan...
The paper gives a brief summary of one of the most recent efforts to build the pan-European language technology infrastructure: META-NET (a Network of Excellence consisting of 54 research centres from 33 countries) and, specifically, its Central and South-European participating project, CESAR. One of the major activities of the project i...
The article presents the results from the project of the thematic Digital Library of Polish and Poland-related Ephemeral Prints from the 16th, 17th and 18th Centuries, intending to preserve the unique multilingual material, make it available for education and extend it with the joint efforts of historians, philologists, librarians and computer scie...
This paper presents the ATLAS platform, a multilingual language processing framework integrating the common set of linguistic tools for a group of European languages (less-resourced: Bulgarian, Croatian, Greek, Polish and Romanian together with English and German as reference languages). State-of-the-art NLP functionality offered by the platform all...