Gideon Kotzé

Masaryk University | MUNI

Doctor of Philosophy
Started new position at the DISSINET project at Masaryk University

16
Publications
5,615
Reads
96
Citations
Citations since 2016
6 Research Items
45 Citations

Publications (16)
Chapter
We describe past and present work surrounding the development of treebank related NLP resources for Georgian. In particular, we provide an overview of efforts made in the development of a morphologically and syntactically annotated treebank for this non-configurational language, as well as its application in the development of a syntactic parser. B...
Article
As more natural language processing (NLP) applications benefit from neural network based approaches, it makes sense to re-evaluate existing work in NLP. A complete pipeline for digitisation includes several components handling the material in sequence. Image processing after scanning the document has been shown to be an important factor in final qu...
Conference Paper
Full-text available
We discuss issues surrounding the development of a treebank and a vanilla context-free grammar for the Georgian language. We also investigate ongoing parallel corpus and treebank creation and alignment experiments with German. An investigation of the output suggests that scarcity of training data may be caused not only by a relatively small parall...
Article
Full-text available
We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the nonterminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-...
Conference Paper
Although their use in training quality machine translation systems has been proven, parallel corpora — large collections of translated texts — are generally hard to come by for the majority of languages. To counteract this fact, a relatively small collection may be processed in more depth by further cleaning and more accurately splitting and aligni...
Article
Full-text available
We present a series of experiments involving the machine translation of Zulu to English using a well-known statistical software system. Due to morphological complexity and relative scarcity of resources, the case of Zulu is challenging. Against a selection of baseline models, we show that a relatively naive approach of dividing Zulu words into syll...
Chapter
Full-text available
In this paper the PaCo-MT project is described, in which Parse and Corpus-based Machine Translation has been investigated: a data-driven approach to stochastic syntactic rule-based machine translation. In contrast to the phrase-based statistical machine translation systems (PB-SMT) which are string-based and do not use any linguistic knowledge, an M...
Thesis
Full-text available
Large collections of translated texts—called parallel corpora—are often automatically aligned on word and sentence level to be used as training data for machine translation systems. We may also choose to syntactically analyze the sentences to produce syntax trees. If we do this on both sides and the nodes of the trees are also aligned, the end resu...
Article
Previous experiments suggest that a rule-based approach to tree alignment error correction serves to be an effective complement to statistical alignment. We show how, using relatively few features, an implementation of Brill's Transformation-Based Learning algorithm improves the results of a high precision model of the statistical aligner Lingua-Al...
Article
Full-text available
This paper reports on-going work on building a large automatically tree-aligned parallel treebank in the context of a syntax-based machine translation (MT) approach. For this we develop a discriminative tree aligner based on a log-linear model with a rich feature set. We incorporate various language-independent and language-specific features taki...
Article
Full-text available
In this paper we propose a discriminative framework for automatic tree alignment. We use a rich feature set and a log-linear model trained on small amounts of hand-aligned training data. We include contextual features and link dependencies to improve the results even further. We achieve an overall F-score of almost 80% which is significantly better...
Article
Full-text available
Development of an Afrikaans wordnet: methodology and integration. The Afrikaans wordnet is a lexical-conceptual network in the form of an electronic lexical database, developed at the North-West University. In this article, a methodology for a semi-automatic construction of the entries – so-called synonym sets – is investigated. Firstly, a backgrou...
Article
Automatic sub-tree alignment of parallel treebanks often displays regular errors that can be corrected by improving the alignment model. However, if the aligner is statistical, often much more training data is needed to properly address these errors. In some cases, a rule-based approach to error correction can provide a quick and conveni...
Article
Automatic alignment of parallel treebanks often displays regular errors that can be corrected by improving the alignment model. However, if the aligner is statistical, often much more training data is needed to properly address these errors. In some cases, a rule-based approach to error correction may provide a quick and convenient solution. We pr...
Article
In this paper, we present results of an on-going investigation of a manually aligned parallel treebank and an automatic tree aligner. We establish the features that show a significant correlation with alignment performance. We present those features with the biggest correlation scores and discuss their significance, with mention of future applic...

Projects (4)
Project
Document, research and make available data on South African Sign Language (SASL) place names.
Project
This project uses the methods of social network analysis, geoinformatics, and natural language processing to shed new light on the social, spatial, and discursive patterns of medieval dissident Christianities, heresy trials, and inquisitorial records. Our case studies focus on Languedoc from the 1230s to the 1320s; Lombardy and Tuscany from the 1240s to the 1300s; and England from the 15th to the 16th centuries, thereby covering various dissident religious cultures such as Cathars, Waldensians, Beguins, Fraticelli, Guglielmites, and Lollards. The project is funded by an ERC Consolidator grant (2021-2026). Previously it has received an EXPRO grant from the Czech Science Foundation (2019-2021).