Linking Basque Lexical Resources
- David Lindemann
- Mikel Alonso
In this paper, we present a workflow for historical dictionary digitization, using a 1745 Spanish-Basque-Latin dictionary as use case. Starting from scanned facsimile images, we arrive at representing attestations of modern standard Basque lexemes as Linked Data, in the form in which they appear in the dictionary. We are also able to produce an index of the dictionary, i.e. a Basque-Spanish version, and to map extracted Spanish and Basque lexical items to entries in reference dictionary lemma lists. The workflow is based entirely on freely available software: OCR and information extraction are performed using machine learning algorithms, while data exhibits and the transcription curation environment are provided via Wikisource and Wikidata. Our evaluation of a first iteration of the workflow suggests that it can deal with early modern printed dictionary text, and that it significantly reduces manual effort at the different stages.
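The final stage of such a workflow, representing an attestation of a modern standard lemma in its historical printed spelling as Linked Data, can be sketched as below. This is only an illustrative sketch: the Turtle property names, the base URI, and the example forms are hypothetical placeholders, not the paper's actual data model.

```python
# Illustrative sketch (not the project's actual data model): serializing one
# dictionary attestation as Linked Data in Turtle syntax. Property names,
# base URI, and page reference are invented placeholders.

def attestation_to_turtle(lemma_uri, attested_form, page):
    """Render an attestation of a modern standard Basque lemma,
    in the historical spelling found in the dictionary, as Turtle."""
    return "\n".join([
        f"<{lemma_uri}> ex:attestedForm _:a1 .",
        f'_:a1 ex:writtenRep "{attested_form}"@eu ;',
        f'     ex:foundOnPage "{page}" .',
    ])

print(attestation_to_turtle(
    "http://example.org/lexeme/etxe",  # modern standard lemma 'etxe' (house)
    "echea",                           # spelling as it appears in the 1745 print
    "p. 1",                            # placeholder facsimile reference
))
```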
Lexical resources originally meant as human-readable dictionaries, or lexical-semantic databases designed for other purposes, are most often developed in isolation from each other, so that linking data across resources, which undoubtedly adds value for both human readers and knowledge-based computational applications, requires some sort of mapping strategy. This presentation offers a brief survey of automated and manually supervised approaches for mapping content of different machine-readable lexical resources (digitized or born-digital) on the lemma level, as has been done on multiple occasions, and on the concept (word sense) level, as undertaken in current research projects. In this context, we pay special attention to WordNet as pivot resource for lemma and sense linking, and to its limitations from a lexicographer's point of view, together with some ideas for possible workarounds.
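The lemma-level mapping through a pivot resource mentioned above can be sketched as a join on shared synset identifiers. The toy dictionaries and synset ids below are invented for illustration; a real setting would use WordNet synset identifiers attached to each resource's lemmas.

```python
# Minimal sketch of lemma-level linking through a pivot resource (e.g. WordNet).
# Each resource maps its lemmas to pivot synset identifiers; ids are invented.

resource_a = {"etxe": {"02913152-n"}, "ur": {"14845743-n"}}      # Basque lemmas
resource_b = {"house": {"02913152-n"}, "water": {"14845743-n"}}  # English lemmas

def link_lemmas(a, b):
    """Link lemmas of two resources whenever they share a pivot synset."""
    links = []
    for lemma_a, syns_a in a.items():
        for lemma_b, syns_b in b.items():
            if syns_a & syns_b:  # at least one synset in common
                links.append((lemma_a, lemma_b))
    return sorted(links)

print(link_lemmas(resource_a, resource_b))
# → [('etxe', 'house'), ('ur', 'water')]
```

Note that this naive join inherits the pivot's limitations: lemmas absent from WordNet cannot be linked at all, and polysemous pivots over-generate candidate links, which is one of the lexicographic concerns raised here.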
In 1982, Ibon Sarasola published a Basque frequency dictionary, based on a corpus from 1977. In the decades since, the amount of text written in Basque, as well as of electronic resources, has grown exponentially. Based on the data available today, the goal of this research is to develop a frequency lemma list for standard Basque, with a twofold aim: on the one hand, to propose a Basque lemma list for a bilingual dictionary under development at UPV/EHU, and on the other, to contrast the contents of existing lemma lists, in order to evaluate the adequacy of the list we produce and to compare the lists with each other.
This poster presents preliminary considerations for a new project: a merged set of Basque (legacy) lexical resources, or unified lexical database. At this preliminary stage, our main attention lies on the catalogue of data sources, on philological problems (e.g. regarding lemmatization), and on the design of the database. We propose a data model and a workflow for the inclusion of all kinds of Basque dictionaries and other resources such as the Basque WordNet and NLP lexicons. The data found in Basque dictionaries pose several problems, such as the presence of dialectal and historical forms from before and after the creation of a Basque standard in 1968, and inconsistencies in lemmatization and in the treatment of homography, homonymy, and polysemy. We present some solutions for these problems, as well as an XML schema for the lexical database and example datasets. The new merged resource may be used as a Basque diachronic lexicographical corpus or serve as a data source for the creation of new lexicographical products. Using word sense identifiers found in the Basque WordNet, the resource may also be linked to lexical resources of other languages.
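A merged-entry record of the kind described, a standard lemma grouping variant forms from several source dictionaries, might be built as sketched below. The element and attribute names are illustrative only; they are not the project's actual XML schema, and the source labels and variant spellings are invented.

```python
# Hedged sketch of one merged-database entry: a standard lemma plus
# dialectal/historical variants, each tagged with its source dictionary.
# Element/attribute names and the example data are illustrative, not the
# project's actual schema.
import xml.etree.ElementTree as ET

def merged_entry(lemma, variants):
    """Build one entry element; variants is a list of (form, source) pairs."""
    entry = ET.Element("entry", lemma=lemma)
    for form, source in variants:
        ET.SubElement(entry, "variant", source=source).text = form
    return ET.tostring(entry, encoding="unicode")

print(merged_entry("etxe", [("eche", "SourceDict1745"), ("etše", "SourceDict1905")]))
```

Keeping the source dictionary on every variant form is what makes the merged resource usable as a diachronic corpus: each attested spelling remains traceable to the work it came from.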
This paper presents a simple method for drafting bilingual dictionary content using existing lexical and NLP resources for Basque. The method consists of five steps, three belonging to semi-automatic drafting and two to semi-automatic and manual post-editing: (1) the building of a corpus-based frequency lemma list; (2) the drafting of syntactical entities belonging to a lemma sign; (3) the drafting of word senses belonging to syntactical entities; (4) a semi-automatic detection of gaps regarding syntactical entities; and (5) manual detection of word sense gaps. The described method relies on the exploitation of existing resources for Basque, and on the multilingual cross-references present in WordNet. The application of the described method follows two goals: (1) the drafting of a series of bilingual dictionaries with Basque, and (2) a contribution to the updating and enrichment of two Basque NLP resources used for the drafting, EDBL and EusWN.
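Step (4), the semi-automatic detection of gaps regarding syntactical entities, can be sketched as a set difference between a reference lexicon and the draft. The data below is invented for illustration; in practice the reference side would come from a resource such as EDBL.

```python
# Sketch of step (4): gap detection as a set difference. Each entity is a
# (lemma, category) pair; the example data is invented for illustration.

reference = {("etxe", "noun"), ("etxe", "verb-stem"), ("ur", "noun")}  # e.g. from a reference lexicon
draft = {("etxe", "noun"), ("ur", "noun")}                             # entities drafted so far

gaps = sorted(reference - draft)  # in the reference but missing from the draft
print(gaps)
# → [('etxe', 'verb-stem')]
```

The symmetric difference is also informative: entities present in the draft but absent from the reference lexicon are candidates for enriching the NLP resource itself, which matches the second goal stated above.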
This paper presents a simple methodology to create corpus-based frequency lemma lists, applied to the case of the Basque language. Since the first work on the matter in 1982, the amount of text written in Basque and language resources related to this language has grown exponentially. Based on state-of-the-art Basque corpora and current NLP technology, we develop a frequency lemma list for standard Basque. Our aim is twofold: On the one hand, to propose a primary Basque lemma list for a bilingual dictionary that is currently being worked on at UPV/EHU, and on the other, to contrast existing Basque dictionary lemma lists with frequency data, in order to evaluate the adequacy of our proposal and to compare lemma lists with each other.
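The core of the methodology, counting lemmas rather than surface forms over a lemmatized corpus, can be sketched as follows. The toy surface-form-to-lemma table stands in for a real Basque lemmatizer, and the tokens are invented for illustration.

```python
# Minimal sketch of corpus-based frequency lemma list building: map surface
# forms to lemmas (here via a toy lookup table standing in for a real Basque
# lemmatizer), count lemmas, and rank them by frequency.
from collections import Counter

lemma_of = {"etxea": "etxe", "etxean": "etxe",
            "ura": "ur", "urak": "ur", "eta": "eta"}

tokens = ["etxea", "eta", "etxean", "ura", "urak", "eta", "etxea"]
freq = Counter(lemma_of[t] for t in tokens)

for lemma, count in freq.most_common():
    print(lemma, count)
```

Lemma-level counting matters for an agglutinative language like Basque, where a single lemma surfaces in many inflected forms, so token-level frequencies would severely fragment the counts.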
In this paper, we present a simple method for drafting sense-disambiguated bilingual dictionary content using lexical data extracted from merged wordnets, on the one hand, and from BabelNet, a very large resource built automatically from wordnets and other sources, on the other. Our motivation for using English-Basque as a showcase is the fact that Basque still lacks bilingual lexicographical products of significant size and quality for any combination other than with the five major European languages. At the same time, it is our aim to provide a comprehensive guide to bilingual dictionary content drafting using English as pivot language, by bootstrapping wordnet-like resources; an approach that may be of interest for lexicographers working on dictionary projects dealing with other combinations that have not been covered in lexicography but for which such resources are available. We present our experiments, together with an evaluation, in two dimensions: (1) a quantitative evaluation describing the intersections of the obtained vocabularies with a basic lemma list of Standard Basque, the language for which we intend to provide dictionary drafts, and (2) a manual qualitative evaluation measuring the adequacy of the bootstrapped translation equivalences. We thus compare recall and precision of the applied dictionary drafting methods over different subsets of the draft dictionary data. We also discuss advantages and shortcomings of the described approach in general, and draw conclusions about the usefulness of the selected sources in the lexicographical production process.
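The pivot-based drafting and its precision/recall evaluation can be sketched together: Basque and English lemmas are joined through shared synset ids, and the resulting draft equivalences are scored against a small gold set. All data here is invented for illustration; note how a polysemous pivot entry ("argi") over-generates a candidate that lowers precision.

```python
# Sketch of pivot-based translation drafting plus evaluation. Lemmas map to
# pivot synset ids; a draft pair is emitted whenever the id sets intersect.
# Synset ids, lemma-to-synset assignments, and the gold set are invented.

eu = {"etxe": {"s1"}, "ur": {"s2"}, "argi": {"s3", "s4"}}
en = {"house": {"s1"}, "water": {"s2"}, "light": {"s3"}, "clear": {"s4"}}

draft = {(b, e) for b, sb in eu.items() for e, se in en.items() if sb & se}
gold = {("etxe", "house"), ("ur", "water"), ("argi", "light")}  # accepted equivalences

precision = len(draft & gold) / len(draft)  # 3 of 4 draft pairs accepted
recall = len(draft & gold) / len(gold)      # all 3 gold pairs recovered
print(sorted(draft), precision, recall)
```

Splitting the draft into subsets (e.g. pairs backed by one pivot synset versus several, or wordnet-derived versus BabelNet-derived pairs) and scoring each subset separately is what allows the per-source comparison described above.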