
Zdenek Zabokrtsky, Charles University in Prague
100 publications, 9,390 reads, 1,525 citations
Publications (100)
The paper presents an overview of the third edition of the shared task on multilingual coreference resolution, held as part of the CRAC 2024 workshop. Similarly to the previous two editions, the participants were challenged to develop systems capable of identifying mentions and clustering them based on identity coreference. This year's edition took...
In light of the recent push for the creation and unification of large morphologically annotated resources, there is a call for (preferably language-independent, low-resource) methods of morph classification. This paper reports on a pilot experiment on morph classification of the Czech language. We have performed two experiments - root morph recogni...
This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were supposed to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for...
Words of any language are to some extent related through the ways they are formed. For instance, the verb exemplify and the noun examples are both based on the word example, but the verb is derived from it, while the noun is inflected. In Natural Language Processing of Russian, the inflection is satisfactorily processed; however, there are only a f...
Our work aims at developing a multilingual data resource for morphological segmentation. We present a survey of 17 existing data resources relevant for segmentation in 32 languages, and analyze diversity of how individual linguistic phenomena are captured across them. Inspired by the success of Universal Dependencies, we propose a harmonized scheme...
The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian,...
This study presents a preliminary overview of 18 resources which contain morphematic segmentation of word forms or of lemmas in various languages, or from which such segmentation could be derived.
The article presents a semi-automatic method for the construction of word-formation networks focusing particularly on derivation. The proposed approach applies a sequential pattern mining technique to construct useful morphological features in an unsupervised manner. The features take the form of regular expressions and later they are used to feed...
The paper deals with harmonisation of existing data resources containing word-formation features by converting them into a common file format and partially aligning their annotation schemas. We summarise (dis)similarities between the resources and describe individual steps of the harmonisation procedure, including manual annotations and application...
The quality of human translation was long thought to be unattainable for computer translation systems. In this study, we present a deep-learning system, CUBBITT, which challenges this view. In a context-aware blind evaluation by human judges, CUBBITT significantly outperformed professional-agency English-to-Czech news translation in preserving text...
In this paper, we introduce a new and improved version of DeriSearch, a search engine and visualizer for word-formation networks. Word-formation networks are datasets that express derivational, compounding and other word-formation relations between words. They are usually expressed as directed graphs, in which nodes correspond to words and edges to...
This paper deals with automatic morphological segmentation of Czech lemmas contained in the word-formation network DeriNet. Capturing derivational relations between base and derived lemmas, and segmenting lemmas into sequences of morphemes are two closely related formal models of how words come into existence. Thus we propose a novel segmentation m...
This article gives an overview of how sentence meaning is represented in eleven deep-syntactic frameworks, ranging from those based on linguistic theories elaborated for decades to rather lightweight NLP-motivated approaches. We outline the most important characteristics of each framework and then discuss how particular language phenomena are treat...
The aim of this paper is to open a discussion on harmonization of existing data resources related to derivational morphology. We present a newly assembled collection of eleven harmonized resources named "Universal Derivations" (clearly being inspired by the success story of the Universal Dependencies initiative in treebanking), as well as the harmo...
In this work, we introduce a new large hand-annotated morpheme-segmentation lexicon of Persian words and present an algorithm that builds a morphological network using this segmented lexicon. The resulting network captures both derivational and inflectional relations. The algorithm for inducing the network approximates the distinction between root m...
Morphological segmentation of words is the process of dividing a word into smaller units called morphemes; it is tricky especially when a morphologically rich or polysynthetic language is under question. In this work, we designed and evaluated several Recurrent Neural Network (RNN) based models as well as various other machine learning-based approa...
We focus on the task of unsupervised lemmatization, i.e. grouping together inflected forms of one word under one label (a lemma) without the use of annotated training data. We propose to perform agglomerative clustering of word forms with a novel distance measure. Our distance measure is based on the observation that inflections of the same word te...
This dataset includes 45300 Persian word forms which are manually segmented into sequences of morphemes.
We present a corpus of Czech sentences with manually annotated named entities, in which a rich two-level hierarchy of named entity types was used. The corpus was the first available large Czech named entity resource and since 2007, it has stimulated the research in this field for Czech. We describe the two-level fine-grained hierarchy allowing embe...
Dependency parsers are almost ubiquitously evaluated on their accuracy scores, but these scores say nothing of the complexity and usefulness of the resulting structures. As dependency parses are basic structures on which other systems are built, it would seem more reasonable to judge these parsers down the NLP pipeline. In this chapter, we will di...
We present a work in progress aimed at extracting translation pairs of source and target dependency treelets to be used in a dependency-based machine translation system. We introduce a novel unsupervised method for parallel tree segmentation based on Gibbs sampling. Using the data from a Czech-English parallel treebank, we show that the procedure c...
The paper introduces the DeriNet lexical database, which includes more than 969,000 Czech words interconnected by 718,000 links corresponding to derivational relations (relations between a base word and a word derived from it). Derivational relations were identified by semi-automatic procedures and manual annotation. As the DeriNet network is fully...
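A network of base-to-derived links like the one described above can be sketched as a small directed graph. This is only an illustration, not the DeriNet data format or API: the class name `DerivationNetwork`, its methods, the English example words, and the assumption that each word has at most one base word are all hypothetical.

```python
from collections import defaultdict


class DerivationNetwork:
    """Toy word-formation network (assumption: at most one base per word)."""

    def __init__(self):
        self.base_of = {}                       # derived word -> its base word
        self.derived_from = defaultdict(list)   # base word -> derived words

    def add_link(self, base, derived):
        # One directed edge: base word -> word derived from it.
        self.base_of[derived] = base
        self.derived_from[base].append(derived)

    def root(self, word):
        # Follow base links upward until a word without a base is reached.
        seen = set()
        while word in self.base_of and word not in seen:
            seen.add(word)
            word = self.base_of[word]
        return word

    def family(self, word):
        # Depth-first listing of all words that share the root of `word`.
        result, stack, visited = [], [self.root(word)], set()
        while stack:
            current = stack.pop()
            if current in visited:
                continue
            visited.add(current)
            result.append(current)
            stack.extend(reversed(self.derived_from.get(current, [])))
        return result


# Hypothetical English links, used only for illustration.
net = DerivationNetwork()
net.add_link("example", "exemplify")
net.add_link("exemplify", "exemplification")
print(net.root("exemplification"))
print(net.family("exemplify"))
```

Storing both directions of each link keeps root lookup and family listing simple, at the cost of duplicating every edge.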
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Ma...
The relationship between two important semantic properties (polysemy and synonymy) of language and one of the most fundamental syntactic network properties (a degree of the node) is observed. Based on the synergetic theory of language, it is hypothesized that a word which occurs in more syntactic contexts, i.e. it has a higher degree, should be mo...
We present HamleDT, a HArmonized Multi-LanguagE Dependency Treebank. HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. In the present article, we provide a thorough investigation and discussion of a number of phenomena that are...
In the present paper, we describe the development of the lexical network DeriNet, which captures core word-formation relations on the set of around 266 thousand Czech lexemes. The network is currently limited to derivational relations because derivation is the most frequent and most productive word-formation process in Czech. This limitation is ref...
In this paper, we show some properties of function words in dependency trees. Function words are grammatical words, such as articles, prepositions, pronouns, conjunctions, or auxiliary verbs. These words are often short and very frequent in texts and therefore many of them can be easily recognized. We formulate a hypothesis that function words tend...
This paper revisits the projection-based approach to the dependency grammar induction task. Traditional cross-lingual dependency induction tasks, one way or the other, depend on the existence of bitexts or target language tools such as part-of-speech (POS) taggers to obtain reasonable parsing accuracy. In this paper, we transfer dependency parsers using...
Paratactic syntactic structures are notoriously difficult to represent in dependency formalisms. This has painful consequences such as high frequency of parsing errors related to coordination. In other words, coordination is a pending problem in dependency analysis of natural languages. This paper tries to shed some light on this area by bringing a...
Morph length is one of the indicative features that help in learning the morphology of languages, in particular agglutinative languages. In this paper, we introduce a simple unsupervised model for morphological segmentation and study how the knowledge of morph length affects the performance of the segmentation task under the Bayesian framework. The mod...
The possibility of deleting a word from a sentence without violating its syntactic correctness belongs to traditionally known manifestations of syntactic dependency. We introduce a novel unsupervised parsing approach that is based on a new n-gram reducibility measure. We perform experiments across 18 languages available in CoNLL data and we show th...
This paper describes a system for unsupervised dependency parsing based on the Gibbs sampling algorithm. The novel approach introduces a fertility model and a reducibility model, the latter of which assumes that dependent words can be removed from a sentence without violating its syntactic correctness.
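The reducibility idea can be illustrated with a deliberately simplified word-level sketch. It treats a word as reducible in a sentence when the sentence with that word deleted is itself attested in the corpus. This is an assumption made for illustration only; it is not the n-gram measure or the model from the papers, and the function name `reducible_counts` and the toy corpus are hypothetical.

```python
from collections import Counter


def reducible_counts(sentences):
    """Per word: fraction of its occurrences whose deletion leaves an attested sentence."""
    attested = {tuple(s) for s in sentences}
    reducible, total = Counter(), Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            total[word] += 1
            # Sentence with the i-th word removed.
            remainder = tuple(sent[:i] + sent[i + 1:])
            if remainder and remainder in attested:
                reducible[word] += 1
    return {w: reducible[w] / total[w] for w in total}


# Toy corpus (hypothetical): each sentence is a list of tokens.
corpus = [
    ["the", "old", "dog", "sleeps"],
    ["the", "dog", "sleeps"],
    ["dogs", "sleep"],
]
scores = reducible_counts(corpus)
print(scores["old"], scores["dog"])
```

In this toy corpus only "old" can be deleted to give another attested sentence, which matches the intuition that modifiers are dependents.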
One of the most notable recent improvements of the TectoMT English-to-Czech translation is a systematic and theoretically supported revision of formemes, the annotation of morpho-syntactic features of content words in deep dependency syntactic structures based on the Prague tectogrammatics theory. Our modifications aim at reducing data sparsity, i...
Dependency parsing has made many advancements in recent years, in particular for English. There are a few dependency parsers that achieve comparable accuracy scores with each other but with very different types of errors. This paper examines creating a new dependency structure through ensemble learning using a hybrid of the outputs of various parse...
We propose HamleDT – HArmonized Multi-LanguagE Dependency Treebank. HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. While the license terms prevent us from directly redistributing the corpora, most of them are easily acquirab...
CzEng 1.0 is an updated release of our Czech-English parallel corpus, freely available for non-commercial research or educational purposes. In this release, we approximately doubled the corpus size, reaching 15 million sentence pairs (about 200 million tokens per language). More importantly, we carefully filtered the data to reduce the amount of no...
We present the Prague Dependency Treebank 2.5, the newest version of PDT and the first to be released under a free license. We show the benefits of PDT 2.5 in comparison to other state-of-the-art treebanks. We present the new features of the 2.5 release, how they were obtained and how reliably they are annotated. We also show how they can be used i...
We introduce and describe ongoing work on our Indonesian dependency treebank. We describe characteristics of the source data as well as our annotation guidelines for creating the dependency structures. Reported within are the results from the start of the Indonesian dependency treebank. We also show ensemble dependency parsing and self tr...
In this work, we present first results on noun phrase coreference resolution on Czech data. As the data resource for our experiments, we employed a yet unfinished and unpublished extension of the Prague Dependency Treebank 2.0, which captures noun phrase coreference and bridging relations. Incompleteness of the data influenced one of our motivations ---...
The syntax of natural language has been the focus of linguistics for decades. Complex network theory, being one of the new research tools, opens new perspectives on the syntactic properties of language. Despite numerous partial achievements, some fundamental problems remain unsolved. Specifically, although statistical properties typical for complex networ...
Accuracy of dependency parsers is one of the key factors limiting the quality of dependency-based machine translation. This paper deals with the influence of various dependency parsing approaches (and also different training data size) on the overall performance of an English-to-Czech dependency-based statistical translation system implemented in th...
Very few attempts have been reported in the literature on dependency parsing for Tamil. In this paper, we report results obtained for Tamil dependency parsing with rule-based and corpus-based approaches. We designed an annotation scheme partially based on the Prague Dependency Treebank (PDT) and manually annotated Tamil data (about 3000 words) with depend...
The present paper describes Treex (formerly TectoMT), a multi-purpose open-source framework for developing Natural Language Processing applications. It facilitates the development by exploiting a wide range of software modules already integrated in Treex, such as tools for sentence segmentation, tokenization, morphological analysis, part-of-spe...
In the present paper we describe TectoMT, a multi-purpose open-source NLP framework. It allows for fast and efficient development of NLP applications by exploiting a wide range of software modules already integrated in TectoMT, such as tools for sentence segmentation, tokenization, morphological analysis, POS tagging, shallow and deep syntax parsin...
We present a tool for annotation of semantic inter-sentential discourse relations on the tectogrammatical layer of the Prague Dependency Treebank (PDT). We present the way of helping the annotators by several useful features implemented in the annotation tool, such as a possibility to combine surface and deep syntactic representation of sentences d...
We describe two systems for English-to-Czech machine translation that took part in the WMT09 translation task. One of the systems is a tuned phrase-based system and the other one is based on a linguistically motivated analysis-transfer-synthesis approach.
We would like to draw attention to Hidden Markov Tree Models (HMTM), which are to our knowledge still unexploited in the field of Computational Linguistics, in spite of highly successful Hidden Markov (Chain) Models. In dependency trees, the independence assumptions made by HMTM correspond to the intuition of linguistic dependency. Therefore we sug...
The present paper summarizes our recent results concerning English-Czech Machine Translation implemented in the TectoMT framework. The system uses tectogrammatical trees as the transfer medium. A detailed analysis of errors made by the previous version of the system (considered as the baseline) is presented first. Then several improvements of the...
This paper gives an overview of the current state of the Prague English Dependency Treebank project. It is an updated version of a draft text that was released along with a CD presenting the first 25% of the PDT-like version of the Penn Treebank - WSJ section (PEDT 1.0). Before the January 2009 release, the conversion from the original phrase s...
We present a new English→Czech machine translation system combining linguistically motivated layers of language description (as defined in the Prague Dependency Treebank annotation scenario) with statistical NLP approaches.
This paper describes CzEng 0.7, a new release of a Czech-English parallel corpus freely available for research and educational purposes. We provide basic statistics of the corpus and focus on data produced by a community of volunteers. Anonymous contributors manually correct the output of a machine translation (MT) system, generating on average 2000...
Petr Sgall: Language in Its Multifarious Aspects. Edited by Eva Hajičová and Jarmila Panevová. Prague: Charles University in Prague – The Karolinum Press, 2006. 558 pp.
This paper deals with the treatment of Named Entities (NEs) in Czech. We introduce a two-level NE classification. We have used this classification for manual annotation of two thousand sentences, gaining more than 11,000 NE instances. Employing the annotated data and Machine-Learning techniques (namely the top-down induction of decision trees), we...
Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories. Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler. NEALT Proceedings Series, Vol. 1 (2007), 79-84. © 2007 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically pub...
This volume offers a selection from the papers presented at the 2005 Annual Symposium on Arabic Linguistics, held at the University of Illinois at Urbana-Champaign. The papers cover a variety of topics in Arabic Linguistics, ranging from the lexicon, phonology, syntax and computational linguistics.
In this paper we describe in detail two dependency parsing techniques developed and evaluated using the Prague Dependency Treebank 2.0. Then we propose two approaches for combining various existing parsers in order to obtain better accuracy. The highest parsing accuracy reported in this paper is 85.84 %, which represents 1.86 % improvement compared t...
In this paper we deal with a new rule-based approach to the Natural Language Generation problem. The presented system synthesizes Czech sentences from Czech tectogrammatical trees supplied by the Prague Dependency Treebank 2.0 (PDT 2.0). Linguistically relevant phenomena including valency, diathesis, condensation, agreement, word order, punctuation an...
The paper presents the initial release of the Slovene Dependency Treebank, currently containing 2000 sentences or 30,000 words. Our approach to annotation is based on the Prague Dependency Treebank, which serves as an excellent model due to the similarity of the languages, the existence of a detailed annotation guide and an annotation editor. The i...
In this paper we report our work on the system of grammatemes (mostly semantically-oriented counterparts of morphological categories such as number, degree of comparison, or tense), the concept of which was introduced in Functional Generative Description, and is now further elaborated in the context of Prague Dependency Treebank 2.0. We present als...
VALLEX is a linguistically annotated lexicon aiming at a description of syntactic information which is supposed to be useful for NLP. The lexicon contains roughly 2500 manually annotated Czech verbs with over 6000 valency frames (summer 2005). In this paper we introduce VALLEX and describe an experiment where VALLEX frames were assigned to 10,000 co...
The aim of this paper is two-fold. First, we want to present a part of the annotation scheme of the Prague Dependency Treebank 2.0 related to the annotation of coreference on the tectogrammatical layer of sentence representation (more than 45,000 textual and grammatical coreference links in almost 50,000 manually annotated Czech sentences). Second,...
In this paper we report our work on the system of grammatemes (mostly semantically-oriented counterparts of morphological categories such as number, degree of comparison, or tense), the concept of which was introduced in Functional Generative Description, and is now further elaborated in the context of Prague Dependency Treebank 2.0. We present a...
This research note reports on the work in progress which regards automatic transformation of phrase-structure syntactic trees of Arabic into dependency-driven analytical ones. Guidelines for these descriptions have been developed at the Linguistic Data Consortium, University of Pennsylvania, and at the Faculty of Mathematics and Physics and the Fac...
$x \in X,\ y \in X\}$; the alternative approach uses the formula $(X + X)_{\mathrm{CFA}} = \{x + x : x \in X\}$. (Similarly for other operations.) Recall that there is no difference if the two arguments are different quantities, say $X, Y$: $X + Y = (X + Y)_{\mathrm{CFA}} = \{x + y : x \in X,\ y \in Y\}$. Different results are obtained only in case when some of the variables appears repeated...
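The difference between the two approaches can be checked on a small example. The sketch below makes a simplifying assumption: crisp finite sets of numbers stand in for fuzzy quantities, so it illustrates only the set formulas quoted above, not fuzzy arithmetic itself. The function names `standard_op` and `constrained_self_op` are hypothetical.

```python
import operator


def standard_op(op, xs, ys):
    # Standard approach: every pair (x, y) is combined,
    # even when both arguments are the same variable.
    return {op(x, y) for x in xs for y in ys}


def constrained_self_op(op, xs):
    # Constrained approach for a repeated variable:
    # both positions must take the same value x.
    return {op(x, x) for x in xs}


X = {1, 2}
Y = {10, 20}

# Repeated variable: the two approaches give different results.
print(standard_op(operator.add, X, X), constrained_self_op(operator.add, X))
print(standard_op(operator.sub, X, X), constrained_self_op(operator.sub, X))

# Distinct variables: only the pairwise formula applies, so nothing changes.
print(standard_op(operator.add, X, Y))
```

For subtraction the constrained result collapses to the single value 0, which shows how the constraint limits the growth of vagueness.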
The valency lexicon of Czech verbs has been intensively worked on for more than a year, and now we have at our disposal a detailed description of valency frames of several hundred verbs. Presently, the challenge naturally arises to use the existing lexicon for capturing valency of other word classes. In this paper, we focus on valency of nouns derive...
A lexicon containing a certain kind of syntactic information about verbs is one of the crucial prerequisites for most tasks in Natural Language Processing. The goal of the project described in the paper is to create a human- and machine-readable lexicon capturing in detail the valency behavior of hundreds of the most frequent Czech verbs. Manual annotation ef...
In the standard fuzzy arithmetic, the vagueness of fuzzy quantities always increases. G. J. Klir [2, 3] suggests an alternative – the constrained fuzzy arithmetic – which reduces this effect. On the other hand, it significantly increases the complexity of computations in comparison to the classical calculus of fuzzy quantities. So far, little atte...
A syntactic lexicon of verbs with subcategorization information is crucial for NLP. Two phases of creating such a lexicon are presented. The first phase consists of the automatic preprocessing of source data, in which particular valency frames are proposed. Where it is possible, the functors are assigned, otherwise the set of possible functors is proposed....
This paper presents work in progress, the goal of which is to develop a module for automatic transition from analytic tree structures to tectogrammatical tree structures within the Prague Dependency Treebank project. Several rule-based and dictionary-based methods were combined in order to be able to make maximal use of both information extractable...
In this paper we present the results of our experiments with modifications of the feature set used in the Czech mutation of the Maximum Spanning Tree parser. First we show how new feature templates improve the parsing accuracy and second we decrease the dimensionality of the feature space to make the parsing process more effective without sacrifici...
Functional Generative Description (FGD) is a stratificational dependency-based approach to natural language description, which has been developed by Petr Sgall and his collaborators in Prague since the 1960s. Although FGD bears surprisingly many resemblances to the Meaning-Text Theory, to our knowledge there is no reasonably detailed comparative stu...
Accuracy of dependency parsers is one of the key factors limiting the quality of dependency-based machine translation. This paper deals with the influence of various dependency parsing approaches (and also different training data size) on the overall performance of an English-to-Czech dependency-based statistical translation system implemented in...
In this paper, we focus on alignment of Czech and English tectogrammatical dependency trees. The alignment of deep syntactic dependency trees can be used for training transfer models for machine translation systems based on analysis-transfer-synthesis architecture. The results of our experiments show that shifting the alignment task from the word...
We introduce a substantial update of the Prague Czech-English Dependency Treebank, a parallel corpus manually annotated at the deep syntactic layer of linguistic representation. The English part consists of the Wall Street Journal (WSJ) section of the Penn Treebank. The Czech part was translated from the English source sentence by sentence. This pa...
The aim of this paper is to describe and evaluate a system that automates a part of the transition from analytical to tectogrammatical tree structures within the Prague Dependency Treebank. In particular, it assigns functors to autosemantic words. The system is based on the machine learning approach of decision tree induction. The resulting softwar...
We introduce a bilingual MR lexicon of Swedish support verb constructions that lemmatizes their noun components (predicate nouns). The lexicon is meant to be part of a valency lexicon of common Swedish verbs. It is based on the valency theory developed within the Functional Generative Description and it is enriched with Lexical Functions. In order...
Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 211-222. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electroni...