Article

The Apertium bilingual dictionaries on the web of data


Abstract

Bilingual electronic dictionaries contain collections of lexical entries in two languages, with explicitly declared translation relations between such entries. Nevertheless, they are typically developed in isolation, in their own formats and accessible through proprietary APIs. In this paper we propose the use of Semantic Web techniques to make translations available on the Web to be consumed by other semantically enabled resources in a direct manner, based on standard languages and query means. In particular, we describe the conversion of the Apertium family of bilingual dictionaries and lexicons into RDF (Resource Description Framework) and how their data have been made accessible on the Web as linked data. As a result, all the converted dictionaries (many of them covering under-resourced languages) are interconnected and can be easily traversed from one to another to obtain, for instance, translations between language pairs not originally connected in any of the original dictionaries.
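The abstract above describes consuming the converted dictionaries through standard query means. The Python sketch below shows what such a linked-data consumer might look like; the SPARQL endpoint URL, the lemon and translation-module prefixes, and the property path are assumptions based on the general description here and would need to be checked against the actual Apertium RDF documentation.

```python
# A minimal sketch of consuming Apertium RDF as linked data with SPARQL.
# The endpoint URL and the exact vocabulary IRIs (lemon plus its translation
# module) are assumptions, not verified against the published dataset.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://linguistic.linkeddata.es/sparql"  # assumed endpoint location

QUERY = """
PREFIX lemon: <http://lemon-model.net/lemon#>
PREFIX tr:    <http://purl.org/net/translation#>

# Find target written forms that translate a given source written form,
# following source entry -> sense -> Translation -> sense -> target entry.
SELECT DISTINCT ?targetForm WHERE {
  ?srcEntry  lemon:canonicalForm/lemon:writtenRep "bench"@en ;
             lemon:sense ?srcSense .
  ?trans     tr:translationSource ?srcSense ;
             tr:translationTarget ?tgtSense .
  ?tgtEntry  lemon:sense ?tgtSense ;
             lemon:canonicalForm/lemon:writtenRep ?targetForm .
  FILTER(lang(?targetForm) = "es")
}
"""

def translations():
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["targetForm"]["value"] for b in results["results"]["bindings"]]

if __name__ == "__main__":
    print(translations())
```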


... A subset of 22 such dictionaries was initially converted into RDF (Resource Description Framework), published as linked open data on the Web, and made available for access and querying in a way compliant with Semantic Web standards [13]. More recently, an updated version of the Apertium RDF graph has been released, covering 53 language pairs [11] (see Fig. 1). ...
... It must therefore be a subgraph of some biconnected component. Generating dictionaries for particular language pairs is a popular instance of the more general dictionary enrichment problem. (https://github.com/martavillegas/ApertiumRDF) ...
... Apertium language data is all released under the GNU General Public Licence, a free/open-source licence. This has, in particular, made it easy for derivatives of Apertium bilingual dictionaries to be published, such as Lexical Markup Framework (LMF) dictionaries and the RDF dictionaries mentioned in this paper [11,13]. Methods like ours can be used to extend or create Apertium bilingual dictionaries. ...
Article
Full-text available
In recent years, we have witnessed a steady growth of linguistic information represented and exposed as linked data on the Web. Such linguistic linked data have stimulated the development and use of openly available linguistic knowledge graphs, as is the case with the Apertium RDF, a collection of interconnected bilingual dictionaries represented and accessible through Semantic Web standards. In this work, we explore techniques that exploit the graph nature of bilingual dictionaries to automatically infer new links (translations). We build upon a cycle density based method: partitioning the graph into biconnected components for a speed-up, and simplifying the pipeline through a careful structural analysis that reduces hyperparameter tuning requirements. We also analyse the shortcomings of traditional evaluation metrics used for translation inference and propose to complement them with new ones, both-word precision (BWP) and both-word recall (BWR), aimed at being more informative of algorithmic improvements. Over twenty-seven language pairs, our algorithm produces dictionaries about 70% the size of existing Apertium RDF dictionaries at a high BWP of 85% from scratch within a minute. Human evaluation shows that 78% of the additional translations generated for dictionary enrichment are correct as well. We further describe an interesting use-case: inferring synonyms within a single language, on which our initial human-based evaluation shows an average accuracy of 84%. We release our tool as free/open-source software which can not only be applied to RDF data and Apertium dictionaries, but is also easily usable for other formats and communities.
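As a rough illustration of the graph-based inference idea summarised in this abstract (biconnected components plus cycle density), the following Python sketch scores candidate translation pairs by how densely interconnected a short cycle through them is. It is a simplified reconstruction of the general technique, not the authors' released tool; the scoring function and threshold are illustrative assumptions.

```python
# A simplified sketch of cycle-density translation inference over a
# multilingual translation graph, assuming an undirected graph whose nodes
# are (language, lemma) pairs and whose edges are existing translations.
import networkx as nx

def candidate_score(G: nx.Graph, u, v) -> float:
    """Score a candidate translation (u, v): find a short cycle through u
    and v via two paths and measure how densely its nodes interconnect."""
    if v in G[u]:
        return 1.0  # already a known translation
    try:
        path1 = nx.shortest_path(G, u, v)
    except nx.NetworkXNoPath:
        return 0.0
    H = G.copy()
    H.remove_edges_from(zip(path1, path1[1:]))
    try:
        path2 = nx.shortest_path(H, u, v)
    except nx.NetworkXNoPath:
        return 0.0
    cycle_nodes = set(path1) | set(path2)
    sub = G.subgraph(cycle_nodes)
    n = sub.number_of_nodes()
    return sub.number_of_edges() / (n * (n - 1) / 2) if n > 1 else 0.0

def infer(G: nx.Graph, src_lang: str, tgt_lang: str, threshold: float = 0.5):
    """Propose new (src, tgt) translations, restricted to biconnected
    components so that every candidate lies on at least one cycle."""
    proposals = []
    for comp in nx.biconnected_components(G):
        if len(comp) < 3:
            continue  # a single edge cannot support any cycle
        sources = [n for n in comp if n[0] == src_lang]
        targets = [n for n in comp if n[0] == tgt_lang]
        for u in sources:
            for v in targets:
                s = candidate_score(G, u, v)
                if s >= threshold and v not in G[u]:
                    proposals.append((u, v, s))
    return proposals

if __name__ == "__main__":
    G = nx.Graph()
    G.add_edges_from([
        (("en", "bench"), ("es", "banco")),
        (("es", "banco"), ("ca", "banc")),
        (("ca", "banc"), ("en", "bench")),
        (("es", "banco"), ("fr", "banc")),
        (("fr", "banc"), ("ca", "banc")),
    ])
    print(infer(G, "en", "fr"))  # proposes ("en","bench") -> ("fr","banc")
```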
... not only does the resource become bidirectional and diachronic, with the opportunity to extend to multilingualism, but it also allows for the "aggregation and integration of linguistic resources" [15], thereby serving as a potential aid in the development of future language resources for isiXhosa, an under-resourced language in South Africa [1] (p. 1). Examples of projects which serve their data as Linked Data include Princeton WordNet 3.1 (PWN), a large lexical database, DBpedia, a knowledge base which extracts structured content from various Wikimedia projects, BabelNet, a multilingual encyclopedic dictionary, and the Apertium Bilingual Dictionaries (ABD), with the latter three projects running on a Virtuoso server [16][17][18]. ...
... As shown in Table 1, Ontolex-Lemon was the only model which fulfilled all the modelling requirements. PWN and BabelNet have published their datasets in RDF format using the lemon model, as has Zhishi.lemon, the lexical realisation of Zhishi.me, a Chinese dataset in the Linked Open Data cloud, and the ABD, a machine translation platform with up to 40 language pairs [18] (p. 2), [38] (p. 47). ...
... Moving on to the versioning of RDF datasets within the context of Linguistic Linked Data, versioning is used by BabelNet, although it is applied globally for their BabelNet-lemon schema description, with Flati et al. acknowledging that "maybe a more sophisticated infrastructure would be needed in order to express more complex versioning description needs" [1] (p. 6), [56]. When the generation and publication of RDF data for the ABD was detailed by Gracia et al., versioning was not included in the discussion [18]. Beyond [18,53,57-59], it does not appear that versioning has been discussed further within the domain of Linguistic Linked Data, and in the context of vocabularies used by BabelNet, Flati et al. commented that changes are unaccounted for "and this aspect might thus be investigated in more detail in the [near] future by the whole community" [1] (p. 6), [56]. ...
Article
Full-text available
The English-Xhosa Dictionary for Nurses (EXDN) is a bilingual, unidirectional printed dictionary in the public domain, with English and isiXhosa as the language pair. By extending the digitisation efforts of EXDN from a human-readable digital object to a machine-readable state, using Resource Description Framework (RDF) as the data model, semantically interoperable structured data can be created, thus enabling EXDN’s data to be reused, aggregated and integrated with other language resources, where it can serve as a potential aid in the development of future language resources for isiXhosa, an under-resourced language in South Africa. The methodological guidelines for the construction of a Linguistic Linked Data framework (LLDF) for a lexicographic resource, as applied to EXDN, are described, where an LLDF can be defined as a framework: (1) which describes data in RDF, (2) using a model designed for the representation of linguistic information, (3) which adheres to Linked Data principles, and (4) which supports versioning, allowing for change. The result is a bidirectional lexicographic resource, previously bounded and static, now unbounded and evolving, with the ability to extend to multilingualism.
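For readers unfamiliar with what such an RDF representation looks like in practice, the following minimal Python/rdflib sketch encodes one illustrative English-isiXhosa entry pair using the OntoLex-Lemon core and vartrans vocabularies. The base URI, entry identifiers and the specific word pair are made up for illustration; the EXDN conversion itself may model entries differently.

```python
# A minimal sketch, assuming the OntoLex-Lemon core and vartrans modules,
# of how one EXDN-style entry pair (English "nurse" / isiXhosa "umongikazi",
# used purely as an illustrative pair) could be expressed in RDF with rdflib.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
VARTRANS = Namespace("http://www.w3.org/ns/lemon/vartrans#")
BASE = Namespace("http://example.org/exdn/")   # hypothetical base URI

g = Graph()
g.bind("ontolex", ONTOLEX)
g.bind("vartrans", VARTRANS)

def add_entry(lemma: str, lang: str) -> URIRef:
    entry = BASE[f"{lang}/{lemma}"]
    form = BASE[f"{lang}/{lemma}#form"]
    sense = BASE[f"{lang}/{lemma}#sense"]
    g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
    g.add((entry, ONTOLEX.canonicalForm, form))
    g.add((form, RDF.type, ONTOLEX.Form))
    g.add((form, ONTOLEX.writtenRep, Literal(lemma, lang=lang)))
    g.add((entry, ONTOLEX.sense, sense))
    g.add((sense, RDF.type, ONTOLEX.LexicalSense))
    return sense

en_sense = add_entry("nurse", "en")
xh_sense = add_entry("umongikazi", "xh")
# Directed dictionary article English -> isiXhosa, expressed as a
# translatability relation between the two senses.
g.add((en_sense, VARTRANS.translatableAs, xh_sense))

print(g.serialize(format="turtle"))
```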
... (There had been an attempt to include varieties of historic languages within ISO 639-6, but this Part was withdrawn in 2014. 15 ) Bellandi et al. [2] discuss the modeling of linguistic data from Old Occitan (a Romance language spoken during the Middle Ages in what is today southern France) and other languages using OntoLex-Lemon. To code their Old Occitan lexemes, they use the tag 'aoc': lemon:writtenRep "canabo"@aoc [2,4]. ...
... Nǁng is the name of a dialect cluster of the !Ui-Tuu language family (formerly referred to as Southern Khoisan), spoken over a geographically large area in the southern Kalahari Desert; N|uu is the Western variety of Nǁng, and ǁ'Au the Eastern variety ([16,11-17]; [33]; [5,27]). Both dialects are near-extinct, with two speakers of ǁ'Au and three speakers of N|uu as of 2013 (with the most fluent speaker of N|uu acting as a language teacher to young people); all Nǁng speakers use Afrikaans as their main language [5,15-16]. Since the late 19th century, linguists have collected data on Khoisan languages: this data is sparse, heterogeneous and difficult to access, with misclassified languages, inappropriate language names and insufficient metadata as examples of the challenges faced, in addition to the identity of diverse corpora in archival material being hard to assess, both in relation to each other and to modern languages ([16,5-8]; [5,2]). ...
Conference Paper
Full-text available
In recent years, the modeling of data from linguistic resources with Resource Description Framework (RDF), following the Linked Data paradigm and using the OntoLex-Lemon vocabulary, has become a prevalent method to create datasets for a multilingual web of data. An important aspect of data modeling is the use of language tags to mark lexicons, lexemes, word senses, etc. of a linguistic dataset. However, attempts to model data from lesser-known languages show significant shortcomings with the authoritative list of language codes by ISO 639: for many lesser-known languages spoken by minorities and also for historical stages of languages, language codes, the basis of language tags, are simply not available. This paper discusses these shortcomings based on the examples of three such languages, i.e., two varieties of click languages of Southern Africa together with Old French, and suggests solutions for the issues identified.
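The practical core of the problem is that RDF literals carry BCP 47 language tags built from ISO 639 codes. The short sketch below shows one common workaround, a private-use subtag, applied with rdflib; this is offered as a generic illustration and is not necessarily the solution the paper itself proposes.

```python
# A small sketch of the issue raised here: RDF literals carry BCP 47
# language tags, and for varieties without a suitable registered code one
# common workaround (not necessarily the proposal made in this paper) is a
# private-use subtag. rdflib checks only that the tag is syntactically
# plausible, not that it is registered.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
EX = Namespace("http://example.org/lexicon/")  # hypothetical base URI

g = Graph()
g.bind("ontolex", ONTOLEX)
form = EX["canabo#form"]
g.add((form, RDF.type, ONTOLEX.Form))
# A base language subtag plus a private-use subtag ("x-...") keeps the tag
# well formed while still distinguishing an unregistered variety.
g.add((form, ONTOLEX.writtenRep, Literal("canabo", lang="oc-x-old")))
print(g.serialize(format="turtle"))
```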
... A subset of the family of bilingual dictionaries developed in Apertium was converted to the LMF [7] ISO standard as part of the METANET4U Project. From that subset of Apertium dictionaries, only the entries in Apertium which were annotated as nouns, proper nouns, verbs, adjectives and adverbs were considered (from a long list of heterogeneous parts of speech present across datasets). This LMF subset constituted the basis for the first RDF representation of the Apertium dictionaries [10], which was released as LLOD (we will refer to it as Apertium RDF v1.0 in the rest of this paper). Such an RDF version of the Apertium dictionary data was based on the lemon model, the predecessor of OntoLex-Lemon, and its translation module [9]. ...
Chapter
Full-text available
We describe the use of linguistic linked data to support a cross-lingual transfer framework for sentiment analysis in the pharmaceutical domain. The proposed system dynamically gathers translations from the Linked Open Data (LOD) cloud, particularly from Apertium RDF, in order to project a deep learning-based sentiment classifier from one language to another, thus enabling scalability and avoiding the need of model re-training when transferred across languages. We describe the whole pipeline traversed by the multilingual data, from their conversion into RDF based on a new dynamic and flexible transformation framework, through their linking and publication as linked data, and finally their exploitation in the particular use case. Based on experiments on projecting a sentiment classifier from English to Spanish, we demonstrate how linked data techniques are able to enhance the multilingual capabilities of a deep learning-based approach in a dynamic and scalable way, in a real application scenario from the pharmaceutical domain.
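To make the projection idea concrete, here is a deliberately simplified Python sketch: a source-language sentiment lexicon is carried over to the target language through translation pairs of the kind retrievable from Apertium RDF. The toy weights and hard-coded pairs are placeholders; the actual system described above uses a deep learning classifier and dynamic SPARQL lookups rather than this bag-of-words scheme.

```python
# A schematic illustration (not the authors' deep-learning pipeline) of the
# projection idea: a sentiment lexicon learned for English is carried over
# to Spanish through translation pairs such as those retrievable from
# Apertium RDF, so no Spanish training data or model re-training is needed.
from collections import defaultdict

# English sentiment weights (toy values standing in for a trained model).
en_weights = {"good": 1.2, "effective": 0.9, "bad": -1.1, "nausea": -0.7}

# Translation pairs as they might be gathered from the linked-data cloud
# (hand-written here; in the real pipeline they come from SPARQL queries).
en_es = [("good", "bueno"), ("effective", "eficaz"),
         ("bad", "malo"), ("nausea", "náusea")]

def project(weights: dict, pairs: list) -> dict:
    """Project source-language feature weights onto target-language words,
    averaging when several source words map to the same target word."""
    acc, counts = defaultdict(float), defaultdict(int)
    for src, tgt in pairs:
        if src in weights:
            acc[tgt] += weights[src]
            counts[tgt] += 1
    return {w: acc[w] / counts[w] for w in acc}

es_weights = project(en_weights, en_es)

def score(text: str, weights: dict) -> float:
    return sum(weights.get(tok, 0.0) for tok in text.lower().split())

print(score("el tratamiento es bueno y eficaz", es_weights))  # positive
```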
... In particular, this includes lexical data, as the (directed multi-)graph is generally recognized to be a generic formalism for the representation of dictionaries and machine-readable lexical resources. As such, already the Lexical Markup Framework (Francopoulo et al., 2006, LMF) built on feature structures (largely equivalent to directed multi-graphs, but serialized in XML), and the increasing popularity of OntoLex-Lemon (and RDF) for lexical resources mostly reflects a transition from traditional XML-based representations to RDF-based representations of the same underlying data structure (Gracia et al., 2018). In contrast to XML, which provides validation on a syntactic level only, the RDF data model makes it possible to formalize the semantics independently of constraints on their order of representation. ...
... Since 2012, the Ontology-Lexicon W3C Community Group has been further developing this model towards a generic data model for lexical resources, and its application to the historical lexicography of a medieval language variety is the main contribution of our paper. Despite the growing popularity of the Linked Data paradigm in application to lexicographic resources (Witte et al., 2011; Bouda and Cysouw, 2012; Declerck et al., 2015), and in particular, adaptations of Lemon (Borin et al., 2014; Klimek and Brümmer, 2015; Bosque-Gil et al., 2016; Gracia et al., 2018), the focus of current activities in this direction lies on the modern stages of the languages. Notable exceptions in this context include etymological dictionaries, e.g., on Germanic languages (Chiarcos and Sukhareva, 2014), and dictionaries of classical languages, e.g., on Ancient Greek (Khan et al., 2017). ...
Conference Paper
Full-text available
The adaptation of novel techniques and standards in computational lexicography is taking place at an accelerating pace, as manifested by recent extensions beyond the traditional XML-based paradigm of electronic publication. One important area of activity in this regard is the transformation of lexicographic resources into (Linguistic) Linked Open Data ([L]LOD), and the application of the OntoLex-Lemon vocabulary to electronic editions of dictionaries. At the moment, however, these activities focus on machine-readable dictionaries, natural language processing and modern languages and found only limited resonance in philology in general and in historical language stages in particular. This paper presents an endeavor to transform the resources of a comprehensive dictionary of Old French into LOD using OntoLex-Lemon and it sketches the difficulties of modeling particular aspects that are due to the medieval stage of the language.
... On the other hand, translation data is abundant due to the existence of large parallel corpora used in machine translation as well as large multilingual lexical resources such as BabelNet (Navigli and Ponzetto, 2012) and Apertium (Forcada et al., 2011; Gracia et al., 2018). As such, it seems natural that the use of these resources can provide important evidence for translation and an approach by means of translation graphs and clustering algorithms could be highly effective. ...
... For the intercultural analysis we take the Apertium graph of translations (Gracia et al., 2018) as the basis of our analysis. We plot the immediate neighborhoods of the word "fish" based on the translations given in this resource in Figure 6. ...
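The neighbourhood inspection mentioned in this context can be reproduced in a few lines with networkx; the sketch below uses toy edges standing in for the Apertium RDF translations of "fish" and simply prints the immediate (radius-1) neighbourhood.

```python
# A minimal sketch of the kind of neighbourhood inspection described above:
# build a translation graph and look at the immediate neighbours of an
# English word. The edges here are toy stand-ins for Apertium RDF data.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    (("en", "fish"), ("es", "pez")),
    (("en", "fish"), ("es", "pescado")),
    (("en", "fish"), ("fr", "poisson")),
    (("es", "pescado"), ("fr", "poisson")),
])

neighbourhood = nx.ego_graph(G, ("en", "fish"), radius=1)
for u, v in neighbourhood.edges():
    print(u, "<->", v)
```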
Article
Full-text available
Word senses are the fundamental unit of description in lexicography, yet it is rarely the case that different dictionaries reach any agreement on the number and definition of senses in a language. With the recent rise in natural language processing and other computational approaches there is an increasing demand for quantitatively validated sense catalogues of words, yet no consensus methodology exists. In this paper, we look at four main approaches to making sense distinctions: formal, cognitive, distributional, and intercultural and examine the strengths and weaknesses of each approach. We then consider how these may be combined into a single sound methodology. We illustrate this by examining two English words, “wing” and “fish,” using existing resources for each of these four approaches and illustrate the weaknesses of each. We then look at the impact of such an integrated method and provide some future perspectives on the research that is necessary to reach a principled method for making sense distinctions.
... A subset of the family of bilingual dictionaries developed in Apertium was converted to the LMF [7] ISO standard as part of the METANET4U Project. From that subset of Apertium dictionaries, only the entries in Apertium which were annotated as nouns, proper nouns, verbs, adjectives and adverbs were considered (from a long list of heterogeneous parts of speech present across datasets). This LMF subset constituted the basis for the first RDF representation of the Apertium dictionaries [10], which was released as LLOD (we will refer to it as Apertium RDF v1.0 in the rest of this paper). Such an RDF version of the Apertium dictionary data was based on the lemon model, the predecessor of OntoLex-Lemon, and its translation module [9]. ...
Conference Paper
Full-text available
We describe the use of linguistic linked data to support a cross-lingual transfer framework for sentiment analysis in the pharmaceutical domain. The proposed system dynamically gathers translations from the Linked Open Data (LOD) cloud, particularly from Apertium RDF, in order to project a deep learning-based sentiment classifier from one language to another, thus enabling scalability and avoiding the need of model retraining when transferred across languages. We describe the whole pipeline traversed by the multilingual data, from their conversion into RDF based on a new dynamic and flexible transformation framework, through their linking and publication as linked data, and finally their exploitation in the particular use case. Based on experiments on projecting a sentiment classifier from English to Spanish, we demonstrate how linked data techniques are able to enhance the multilingual capabilities of a deep learning-based approach in a dynamic and scalable way, in a real application scenario from the pharmaceutical domain.
... From Table 1 (continued), with columns: linked dataset, published date, base vocabulary, linked-to vocabulary, linked-from vocabulary, star rating, comments. The ACORN-SAT Linked Climate Dataset [32]: published 2016; base vocabularies: W3C RDF Data Cube Vocabulary, W3C Semantic Sensor Network (SSN) ontology, acorn-sat observation ontology, acorn-series time series ontology, climate ontology, raindist rainfall district ontology; linked-to vocabularies: Vocabulary of Interlinked Datasets, RDF Data Cube Vocabulary, Simple Knowledge Organization System, Semantic Sensor Network ontology, GeoNames ontology (version 3.1), Basic Geo (WGS84 lat/long) vocabulary, Time Ontology in OWL, DOLCE+DnS Ultralite; comments: external datasets already based on the SSN ontology link to it. The Apertium Bilingual Dictionaries on the Web of Data [22]: published 2016; base vocabularies: lemon, LexInfo, lemon translation module; linked-to vocabularies: lemon, LexInfo; comments: lemon along with the lemon translation module serves as the basis for the lemon-ontolex model; star rating: 5 ...
... The Apertium RDF Graph contains the RDF version of the Apertium bilingual dictionaries (Forcada et al., 2011), which have been transformed into RDF and published on the Web following the Linked Data principles. As described by Gracia et al. (2015), the core linguistic data of the Apertium RDF Graph was modeled using lemon, the LExicon Model for ONtologies (McCrae et al., 2012), while the translations between lexical entries used the lemon translation module (Gracia et al., 2014). Currently, the Apertium RDF Graph includes data from 22 Apertium bilingual dictionaries and it is expected that more Apertium data will be included in the near future. ...
Conference Paper
Full-text available
The experiments presented here exploit the properties of the Apertium RDF Graph, principally cycle density and nodes' degree, to automatically generate new translation relations between words, and therefore to enrich existing bilingual dictionaries with new entries. Currently, the Apertium RDF Graph includes data from 22 Apertium bilingual dictionaries and constitutes a large unified array of linked lexical entries and translations that are available and accessible on the Web (http://linguistic.linkeddata.es/apertium/). In particular, its graph structure allows for interesting exploitation opportunities, some of which are addressed in this paper. Two 'massive' experiments are reported: in the first one, the original EN-ES translation set was removed from the Apertium RDF Graph and a new EN-ES version was generated. The results were compared against the previously removed EN-ES data and against the Concise Oxford Spanish Dictionary. In the second experiment, a new non-existent EN-FR translation set was generated. In this case the results were compared against a converted wiktionary English-French file. The results we got are really good and perform well for the extreme case of correlated polysemy. This led us to address the possibility to use cycles and node degree to identify potential oddities in the source data. If cycle density proves efficient when considering potential targets, we can assume that in dense graphs nodes with low degree may indicate potential errors.
... As mentioned previously, lemon and its successor OntoLex-Lemon have been widely adopted for the modelling and publishing of lexica and dictionaries as linked data. The core module has proven to be reasonably effective in capturing some of the most typical kinds of lexical information contained in dictionaries and lexical resources in general (e.g., [41-45]). However, there are certain fairly common situations in which the model falls short, most notably in the representation of certain elements of dictionaries and other lexicographic datasets [46]. ...
Article
Full-text available
This article provides a comprehensive and up-to-date survey of models and vocabularies for creating linguistic linked data (LLD) focusing on the latest developments in the area and both building upon and complementing previous works covering similar territory. The article begins with an overview of some recent trends which have had a significant impact on linked data models and vocabularies. Next, we give a general overview of existing vocabularies and models for different categories of LLD resource. After which we look at some of the latest developments in community standards and initiatives including descriptions of recent work on the OntoLex-Lemon model, a survey of recent initiatives in linguistic annotation and LLD, and a discussion of the LLD metadata vocabularies META-SHARE and lime. In the next part of the paper, we focus on the influence of projects on LLD models and vocabularies, starting with a general survey of relevant projects, before dedicating individual sections to a number of recent projects and their impact on LLD vocabularies and models. Finally, in the conclusion, we look ahead at some future challenges for LLD models and vocabularies. The appendix to the paper consists of a brief introduction to the OntoLex-Lemon model.
... This definition [2] is closely related to TransGraphs [6] in existing NLP literature and Semantic Maps in Cognitive Science. One method to populate such a graph is by leveraging multiple bilingual dictionaries many of which are freely available online through projects like Apertium RDF [7]. We adopt the same to stay consistent with the TIAD'21 test set distribution. ...
Preprint
Full-text available
This paper describes an approach used to generate new translations from raw bilingual dictionaries as part of the 4th Translation Inference Across Dictionaries (TIAD 2021) shared task. We propose Augmented Cycle Density (ACD) as a framework that combines insights from two state-of-the-art methods that require no sense information or parallel corpora: Cycle Density (CD) and One Time Inverse Consultation (OTIC). The task results show that across 3 unseen language pairs, ACD's predictions have more than double the coverage (74%) of OTIC at almost the same precision (76%). ACD combines CD's scalability - leveraging rich multilingual graphs for better predictions - and OTIC's data efficiency - producing good results with the minimum possible resource of one pivot language.
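For readers unfamiliar with OTIC, the following compact Python sketch implements a common formulation of it: pivot-based candidate generation scored by the overlap of pivot translations. It illustrates the baseline idea only, not the ACD system or the exact TIAD implementation, and the scoring formula and threshold are assumptions.

```python
# A compact sketch of One Time Inverse Consultation (OTIC), one of the two
# ingredients combined by ACD above, using a common scoring formulation.
from collections import defaultdict

def otic(src_pivot: list, pivot_tgt: list, threshold: float = 0.5):
    """Infer source->target translations through a pivot language.

    src_pivot: (source word, pivot word) pairs
    pivot_tgt: (pivot word, target word) pairs
    """
    sp = defaultdict(set)   # source word -> pivot translations
    tp = defaultdict(set)   # target word -> pivot translations
    pt = defaultdict(set)   # pivot word  -> target translations
    for s, p in src_pivot:
        sp[s].add(p)
    for p, t in pivot_tgt:
        pt[p].add(t)
        tp[t].add(p)

    results = []
    for s, pivots in sp.items():
        candidates = set().union(*(pt[p] for p in pivots)) if pivots else set()
        for t in candidates:
            overlap = pivots & tp[t]
            score = 2 * len(overlap) / (len(pivots) + len(tp[t]))
            if score >= threshold:
                results.append((s, t, round(score, 2)))
    return results

# English -> Spanish via a French pivot (toy data).
en_fr = [("bank", "banque"), ("bank", "banc"), ("bench", "banc")]
fr_es = [("banque", "banco"), ("banc", "banco"), ("banc", "banqueta")]
print(otic(en_fr, fr_es))
```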
Article
The adoption of Semantic Web technologies and the Linked Data paradigm has been driven by the need to ensure the construction of resources that are at the same time interoperable, shareable and reusable by the scientific community. OntoLex-Lemon, which exploits the expressive power of ontologies, has come to be the de facto standard model for the representation of lexica and terminologies. The number of users potentially interested in editing or consuming OntoLex-Lemon data is thus very large. Unfortunately, the use of ontology editors for constructing such language resources can be very tedious due to the complexity and verbosity of the model, which heavily relies on specific modeling technicalities. This underlines the importance of developing tools and services that facilitate the creation and editing of resources and bring lexicography and terminology closer to the Semantic Web. In this paper we present LexO, a collaborative web editor for easily building and managing lexical and terminological resources in the context of the Semantic Web, based on the OntoLex-Lemon model. It makes the model accessible to users who do not possess the needed technical skills, thus allowing for wider adoption of new technological advances in the Semantic Web.
Conference Paper
Full-text available
We describe the use of Linguistic Linked Open Data (LLOD) to support a cross-lingual transfer framework for concept detection in online health communities. Our goal is to develop multilingual text analytics as an enabler for analyzing health-related quality of life (HRQoL) from self-reported patient narratives. The framework capitalizes on supervised cross-lingual projection methods, so that labeled training data for a source language are sufficient and are not needed for target languages. Cross-lingual supervision is provided by LLOD lexical resources to learn bilingual word embeddings that are simultaneously tuned to represent an inventory of HRQoL concepts based on the World Health Organization's quality of life surveys (WHOQOL). We demonstrate that lexicon induction from LLOD resources is a powerful method that yields rich and informative lexical resources for the cross-lingual concept detection task which can outperform existing domain-specific lexica. Furthermore, in a comparative evaluation we find that our models based on bilingual word embeddings exhibit a high degree of complementarity with an approach that integrates machine translation and rule-based extraction algorithms. In a combined configuration, our models rival the performance of state-of-the-art cross-lingual transformers, despite being of considerably lower model complexity.
Chapter
Full-text available
We describe the use of Linguistic Linked Open Data (LLOD) to support a cross-lingual transfer framework for concept detection in online health communities. Our goal is to develop multilingual text analytics as an enabler for analyzing health-related quality of life (HRQoL) from self-reported patient narratives. The framework capitalizes on supervised cross-lingual projection methods, so that labeled training data for a source language are sufficient and are not needed for target languages. Cross-lingual supervision is provided by LLOD lexical resources to learn bilingual word embeddings that are simultaneously tuned to represent an inventory of HRQoL concepts based on the World Health Organization’s quality of life surveys (WHOQOL). We demonstrate that lexicon induction from LLOD resources is a powerful method that yields rich and informative lexical resources for the cross-lingual concept detection task which can outperform existing domain-specific lexica. Furthermore, in a comparative evaluation we find that our models based on bilingual word embeddings exhibit a high degree of complementarity with an approach that integrates machine translation and rule-based extraction algorithms. In a combined configuration, our models rival the performance of state-of-the-art cross-lingual transformers, despite being of considerably lower model complexity.
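The supervision step described above, learning bilingual word embeddings from LLOD-derived translation pairs, can be illustrated with a standard orthogonal (Procrustes) alignment followed by nearest-neighbour lexicon induction. The sketch below uses tiny hand-made vectors in place of real embeddings and is a generic illustration of the technique, not the authors' model.

```python
# A minimal sketch: align monolingual word embeddings into a shared space
# with an orthogonal (Procrustes) mapping learned from seed translation
# pairs, then induce new lexicon entries by nearest-neighbour search.
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal W minimising ||XW - Y||, from paired seed translations."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def nearest(vec: np.ndarray, vocab: dict) -> str:
    words = list(vocab)
    M = np.stack([vocab[w] for w in words])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    sims = M @ (vec / np.linalg.norm(vec))
    return words[int(np.argmax(sims))]

# Toy English and Spanish embedding spaces (3-dimensional for brevity).
en = {"pain": np.array([1.0, 0.1, 0.0]), "sleep": np.array([0.0, 1.0, 0.2]),
      "fatigue": np.array([0.9, 0.3, 0.1])}
es = {"dolor": np.array([0.1, 1.0, 0.0]), "sueño": np.array([1.0, 0.0, 0.2]),
      "fatiga": np.array([0.3, 0.9, 0.1])}

# Seed pairs (e.g. gathered from LLOD lexical resources).
seeds = [("pain", "dolor"), ("sleep", "sueño")]
X = np.stack([en[s] for s, _ in seeds])
Y = np.stack([es[t] for _, t in seeds])
W = procrustes(X, Y)

# Induce a translation for a word outside the seed lexicon.
print(nearest(en["fatigue"] @ W, es))  # expected: "fatiga"
```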
Chapter
In previous chapters, we discussed how to model linguistic data sets using the Resource Description Framework as a basis to publish them as linked data on the Web. In this chapter, we describe a methodology that can be followed in the transformation of legacy linguistic datasets into linked data. The methodology comprises different tasks, including the specification, modelling, generation, linking, publication and exploitation of the data. We will discuss specific guidelines that can be applied in the transformation of particular types of resources, such as bilingual/multilingual dictionaries, WordNets, terminologies and corpora.
Chapter
This chapter introduces the Lexicon Model for Ontologies (lemon) as defined by the Ontolex W3C community group. The model was originally developed to enrich ontologies with lexical information expressing how the elements of the ontology including classes, properties and individuals are referred to in a given language. In this chapter we cover the core of the Ontolex-lemon model as well as the extra modules developed by the Ontolex group on syntax and semantics, decomposition, variation and translation and metadata. We then briefly describe some applications of the model.
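The defining pattern of the model described in this chapter, a lexical entry whose sense points to the ontology entity it lexicalises, can be written down in a few rdflib statements. The namespace IRIs below are the commonly published W3C and LexInfo ones; the example resource URIs and the DBpedia target are illustrative.

```python
# A minimal sketch of the core OntoLex-Lemon pattern: a lexical entry whose
# sense points, via ontolex:reference, to the ontology entity it lexicalises.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
LEXINFO = Namespace("http://www.lexinfo.net/ontology/2.0/lexinfo#")
EX = Namespace("http://example.org/lexicon/")     # made-up base URI
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()
for prefix, ns in [("ontolex", ONTOLEX), ("lexinfo", LEXINFO)]:
    g.bind(prefix, ns)

entry, form, sense = EX["cat-n"], EX["cat-n#form"], EX["cat-n#sense"]
g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, LEXINFO.partOfSpeech, LEXINFO.noun))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, ONTOLEX.writtenRep, Literal("cat", lang="en")))
g.add((entry, ONTOLEX.sense, sense))
g.add((sense, ONTOLEX.reference, DBR["Cat"]))  # lexicon-ontology link

print(g.serialize(format="turtle"))
```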
Article
Full-text available
Knowledge graphs have, for the past decade, been a hot topic both in public and private domains, typically used for large-scale integration and analysis of data using graph-based data models. One of the central concepts in this area is the Semantic Web, with the vision of providing a well-defined meaning to information and services on the Web through a set of standards. Particularly, linked data and ontologies have been quite essential for data sharing, discovery, integration, and reuse. In this paper, we provide a systematic literature review on knowledge graph creation from structured and semi-structured data sources using Semantic Web technologies. The review takes into account four prominent publication venues, namely, Extended Semantic Web Conference, International Semantic Web Conference, Journal of Web Semantics, and Semantic Web Journal. The review highlights the tools, methods, types of data sources, ontologies, and publication methods, together with the challenges, limitations, and lessons learned in the knowledge graph creation processes.
Conference Paper
Full-text available
The English-Xhosa Dictionary for Nurses is a unidirectional dictionary with English and isiXhosa as the language pair, published in 1935 and recently converted to Linguistic Linked Data. Using the Ontolex-Lemon model, an ontological framework was created, where the purpose was to present each lexical entry as "historically dynamic" instead of "ontologically static" (Veltman, 2006:6, cited in Rafferty, 2016:5), therefore the provenance information and generation of linked data for an ontological framework with instances constantly evolving was given particular attention. The output is a framework which provides guidelines for similar applications regarding URI patterns, provenance, versioning, and the generation of RDF data.
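A minimal sketch of what such URI patterns and provenance statements could look like is given below, using rdflib and the PROV-O vocabulary; the versioning scheme shown (a /v1, /v2 path suffix) is an assumption for illustration and not necessarily the pattern adopted in the framework.

```python
# A short sketch (not the authors' exact scheme) of a versioned URI pattern
# and PROV-O provenance: each new version of a lexical entry gets its own
# URI and records what it was derived from and when it was generated.
from datetime import datetime, timezone
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
BASE = "http://example.org/exdn/"   # hypothetical namespace

def versioned_uri(entry_id: str, version: int) -> URIRef:
    # e.g. http://example.org/exdn/entry/nurse-n/v2
    return URIRef(f"{BASE}entry/{entry_id}/v{version}")

g = Graph()
g.bind("prov", PROV)
g.bind("ontolex", ONTOLEX)

v1 = versioned_uri("nurse-n", 1)
v2 = versioned_uri("nurse-n", 2)
for v in (v1, v2):
    g.add((v, RDF.type, ONTOLEX.LexicalEntry))
g.add((v2, PROV.wasDerivedFrom, v1))
g.add((v2, PROV.generatedAtTime,
       Literal(datetime.now(timezone.utc).isoformat(), datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```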
Article
Full-text available
As the interest of the Semantic Web and computational linguistics communities in linguistic linked data (LLD) keeps increasing and the number of contributions that dwell on LLD rapidly grows, scholars (and linguists in particular) interested in the development of LLD resources sometimes find it difficult to determine which mechanism is suitable for their needs and which challenges have already been addressed. This review seeks to present the state of the art on the models, ontologies and their extensions to represent language resources as LLD by focusing on the nature of the linguistic content they aim to encode. Four basic groups of models are distinguished in this work: models to represent the main elements of lexical resources (group 1), vocabularies developed as extensions to models in group 1 and ontologies that provide more granularity on specific levels of linguistic analysis (group 2), catalogues of linguistic data categories (group 3) and other models such as corpora models or service-oriented ones (group 4). Contributions encompassed in these four groups are described, highlighting their reuse by the community and the modelling challenges that are still to be faced.
Article
Data reconciliation and Named Entity Recognition (NER) are concepts closely related to the domain of data carpentry in general and library carpentry in particular. In this context, part III of the three-part series on library carpentry (parts I and II were published in the April and June issues of this journal) is an attempt to apply library carpentry methods in the core areas of information organization in a library of any type or size, along with additional utilities like cross-linking of data sources, automatic translation, sentiment analysis and so on. A total of five case studies are included in this research study covering these areas, with a focus on a do-it-yourself mode.
Preprint
Full-text available
The focus of this thesis is broadly on the alignment of lexicographical data, particularly dictionaries. In order to tackle some of the challenges in this field, two main tasks of word sense alignment and translation inference are addressed. The first task aims to find an optimal alignment given the sense definitions of a headword in two different monolingual dictionaries. This is a challenging task, especially due to differences in sense granularity, coverage and description in two resources. After describing the characteristics of various lexical semantic resources, we introduce a benchmark containing 17 datasets of 15 languages where monolingual word senses and definitions are manually annotated across different resources by experts. In the creation of the benchmark, lexicographers' knowledge is incorporated through the annotations where a semantic relation, namely exact, narrower, broader, related or none, is selected for each sense pair. This benchmark can be used for evaluation purposes of word-sense alignment systems. The performance of a few alignment techniques based on textual and non-textual semantic similarity detection and semantic relation induction is evaluated using the benchmark. Finally, we extend this work to translation inference where translation pairs are induced to generate bilingual lexicons in an unsupervised way using various approaches based on graph analysis. This task is of particular interest for the creation of lexicographical resources for less-resourced and under-represented languages and also, assists in increasing coverage of the existing resources. From a practical point of view, the techniques and methods that are developed in this thesis are implemented within a tool that can facilitate the alignment task.
Article
Domain-specific terminologies play a central role in many language technology solutions. Substantial manual effort is still involved in the creation of such resources, and many of them are published in proprietary formats that cannot be easily reused in other applications. Automatic term extraction tools help alleviate this cumbersome task. However, their results are usually in the form of plain lists of terms or as unstructured data with limited linguistic information. Initiatives such as the Linguistic Linked Open Data cloud (LLOD) foster the publication of language resources in open structured formats, specifically RDF, and their linking to other resources on the Web of Data. In order to leverage the wealth of linguistic data in the LLOD and speed up the creation of linked terminological resources, we propose TermitUp, a service that generates enriched domain specific terminologies directly from corpora, and publishes them in open and structured formats. TermitUp is composed of five modules performing terminology extraction, terminology post-processing, terminology enrichment, term relation validation and RDF publication. As part of the pipeline implemented by this service, existing resources in the LLOD are linked with the resulting terminologies, contributing in this way to the population of the LLOD cloud. TermitUp has been used in the framework of European projects tackling different fields, such as the legal domain, with promising results. Different alternatives on how to model enriched terminologies are considered and good practices illustrated with examples are proposed.
Chapter
Full-text available
Linked Data technologies and methods are enabling the creation of a data network where pieces of data are interconnected on the Web using machine-readable formats such as Resource Description Framework (RDF). This paradigm offers great opportunities to connect and make available knowledge in different languages. However, in order to make this vision a reality, there is a need for guidelines, techniques, and methods that allow publishers of data to overcome language and technological barriers. In this chapter, we review existing methodologies from the point of view of multilingualism and propose a series of guidelines to help publishers when publishing Linked Data in several languages.
Conference Paper
Full-text available
The experiments presented here exploit the properties of the Apertium RDF Graph, principally cycle density and nodes' degree, to automatically generate new translation relations between words, and therefore to enrich existing bilingual dictionaries with new entries. Currently, the Apertium RDF Graph includes data from 22 Apertium bilingual dictionaries and constitutes a large unified array of linked lexical entries and translations that are available and accessible on the Web (http://linguistic.linkeddata.es/apertium/). In particular, its graph structure allows for interesting exploitation opportunities, some of which are addressed in this paper. Two 'massive' experiments are reported: in the first one, the original EN-ES translation set was removed from the Apertium RDF Graph and a new EN-ES version was generated. The results were compared against the previously removed EN-ES data and against the Concise Oxford Spanish Dictionary. In the second experiment, a new non-existent EN-FR translation set was generated. In this case the results were compared against a converted wiktionary English-French file. The results we got are really good and perform well for the extreme case of correlated polysemy. This led us to address the possibility to use cycles and node degree to identify potential oddities in the source data. If cycle density proves efficient when considering potential targets, we can assume that in dense graphs nodes with low degree may indicate potential errors.
Article
Full-text available
Contributive resources, such as Wikipedia, have proved to be valuable to Natural Language Processing or multilingual Information Retrieval applications. This work focusses on Wiktionary, the dictionary part of the resources sponsored by the Wikimedia Foundation. In this article, we present our extraction of multilingual lexical data from Wiktionary and provide it to the community as Multilingual Lexical Linked Open Data (MLLOD). This lexical resource is structured using the lemon model. This data, called DBnary, is registered at http://thedatahub.org/dataset/dbnary.
Article
Full-text available
Lexica and terminology databases play a vital role in many NLP applications, but currently most such resources are published in application-specific formats, or with custom access interfaces, leading to the problem that much of this data is in “data silos” and hence difficult to access. The Semantic Web and in particular the Linked Data initiative provide effective solutions to this problem, as well as possibilities for data reuse by inter-lexicon linking, and incorporation of data categories by dereferencable URIs. The Semantic Web focuses on the use of ontologies to describe semantics on the Web, but currently there is no standard for providing complex lexical information for such ontologies and for describing the relationship between the lexicon and the ontology. We present our model, lemon, which aims to address these gaps while building on existing work, in particular the Lexical Markup Framework, the ISOcat Data Category Registry, SKOS (Simple Knowledge Organization System) and the LexInfo and LIR ontology-lexicon models.
Article
Full-text available
Manually constructing multilingual translation lexicons can be very costly, both in terms of time and human effort. Although there have been many efforts at (semi-)automatically merging bilingual machine readable dictionaries to produce a multilingual lexicon, most of these approaches place quite specific requirements on the input bilingual resources. Unfortunately, not all bilingual dictionaries fulfil these criteria, especially in the case of under-resourced language pairs. We describe a low cost method for constructing a multilingual lexicon using only simple lists of bilingual translation mappings. The method is especially suitable for under-resourced language pairs, as such bilingual resources are often freely available and easily obtainable from the Internet, or digitised from simple, conventional paper-based dictionaries. The precision of random samples of the resultant multilingual lexicon is around 0.70-0.82, while coverage for each language, precision and recall can be controlled by varying threshold values. Given the very simple input resources, our results are encouraging, especially in incorporating under-resourced languages into multilingual lexical resources.
Article
Full-text available
Apertium is a free/open-source platform for rule-based machine translation. It is being widely used to build machine translation systems for a variety of language pairs, especially in those cases (mainly with related-language pairs) where shallow transfer suffices to produce good quality translations, although it has also proven useful in assimilation scenarios with more distant pairs involved. This article summarises the Apertium platform: the translation engine, the encoding of linguistic data, and the tools developed around the platform. The present limitations of the platform and the challenges posed for the coming years are also discussed. Finally, evaluation results for some of the most active language pairs are presented. An appendix describes Apertium as a free/open-source project. Keywords: Free/open-source machine translation; Rule-based machine translation; Apertium; Shallow transfer; Finite-state transducers
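To give a flavour of the shallow-transfer pipeline summarised in this abstract, the toy Python sketch below walks a Spanish phrase through analysis, lexical transfer and generation using hand-written lookup tables in place of Apertium's finite-state data. It illustrates only the shape of the pipeline; the real engine also applies structural transfer rules and handles ambiguity, which are omitted here.

```python
# A toy illustration of a shallow-transfer pipeline
# (morphological analysis -> lexical transfer -> generation), with
# hand-written dictionaries standing in for Apertium's finite-state data.

# Monolingual analysis: Spanish surface form -> (lemma, part of speech).
ANALYSER = {"mesa": ("mesa", "n"), "blanca": ("blanco", "adj")}

# Bilingual dictionary: (Spanish lemma, POS) -> Catalan lemma.
BIDIX = {("mesa", "n"): "taula", ("blanco", "adj"): "blanc"}

# Monolingual generation: (Catalan lemma, POS, gender) -> surface form.
GENERATOR = {("taula", "n", "f"): "taula", ("blanc", "adj", "f"): "blanca"}

def translate(sentence: str) -> str:
    out = []
    for word in sentence.lower().split():
        lemma, pos = ANALYSER[word]                    # analysis
        tgt_lemma = BIDIX[(lemma, pos)]                # lexical transfer
        out.append(GENERATOR[(tgt_lemma, pos, "f")])   # generation
    return " ".join(out)

print(translate("mesa blanca"))  # -> "taula blanca"
```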
Article
Full-text available
Optimizing the production, maintenance and extension of lexical resources is one of the crucial aspects impacting Natural Language Processing (NLP). A second aspect involves optimizing the process leading to their integration in applications. In this respect, we believe that the production of a consensual specification on multilingual lexicons can be a useful aid for the various NLP actors. Within ISO, one purpose of LMF (ISO-24613) is to define a standard for lexicons that covers multilingual data.
Article
Full-text available
This paper presents an Italian to Catalan RBMT system automatically built by combining the linguistic data of the existing pairs Spanish-Catalan and Spanish-Italian. A lightweight manual postprocessing is carried out in order to fix inconsistencies in the automatically derived dictionaries and to add very frequent words that are missing according to a corpus analysis. The system is evaluated on the KDE4 corpus and outperforms Google Translate by approximately ten absolute points in terms of both TER and GTM.
Article
Full-text available
We define deterministic augmented letter transducers (DALTs), a class of finite-state transducers which provide an efficient way of implementing morphological analysers which tokenize their input (i.e., divide texts in tokens or words) as they analyse it, and show how these morphological analysers may be maintained (i.e., how surface form-lexical form transductions may be added or removed from them) while keeping them minimal; efficient algorithms for both operations are given in detail. The algorithms may also be applied to the incremental construction and maintenance of other lexical modules in a machine translation system such as the lexical transfer module or the morphological generator.
Article
Full-text available
When using a third language to construct a bilingual dictionary, it is necessary to discriminate equivalencies from inappropriate words derived as a result of ambiguity in the third language. We propose a method to treat this by utilizing the structures of dictionaries to measure the nearness of the meanings of words. The resulting dictionary is a word-to-word bilingual dictionary of nouns and can be used to refine the entries and equivalencies in published bilingual dictionaries.
1 Introduction
When vocabulary cannot be found in bilingual dictionaries, it is frequently obtained by using a third language as an intermediary. This indicates that supplemental information may lie in other forms in other dictionaries. Here we try using electronic dictionaries, which can be reformed on a large scale, to extract this information so that we can obtain subsidiary data and refine a direct bilingual dictionary. Looking up words in bilingual dictionaries intermediating the third language is a m...
Chapter
Most of the linguistic resources available today are about the world's major languages. This paper discusses two projects which have world-wide coverage as their aim. Glottolog/Langdoc is an attempt to attain near-complete bibliographical coverage for the world's lesser-known languages (i.e. 95% of the world's linguistic diversity), which then provides solid empirical ground for extensional definitions of languages and language classification. Automated Similarity Judgment Program (ASJP) online provides standardized lexical distance data for 5800 languages from Brown et al. (2008) as Linked Data. These two projects are the first attempt at a Typological Linked Data Cloud, to which PHOIBLE by Moran (this vol.) can easily be added in the future.
Article
We present an automatic approach to the construction of BabelNet, a very large, wide-coverage multilingual semantic network. Key to our approach is the integration of lexicographic and encyclopedic knowledge from WordNet and Wikipedia. In addition, Machine Translation is applied to enrich the resource with lexical information for all languages. We first conduct in vitro experiments on new and existing gold-standard datasets to show the high quality and coverage of BabelNet. We then show that our lexical resource can be used successfully to perform both monolingual and cross-lingual Word Sense Disambiguation: thanks to its wide lexical coverage and novel semantic relations, we are able to achieve state-of the-art results on three different SemEval evaluation tasks.
Article
The Web has witnessed an enormous growth in the amount of semantic information published in recent years. This growth has been stimulated to a large extent by the emergence of Linked Data. Although this brings us a big step closer to the vision of a Semantic Web, it also raises new issues such as the need for dealing with information expressed in different natural languages. Indeed, although the Web of Data can contain any kind of information in any language, it still lacks explicit mechanisms to automatically reconcile such information when it is expressed in different languages. This leads to situations in which data expressed in a certain language is not easily accessible to speakers of other languages. The Web of Data shows the potential for being extended to a truly multilingual web as vocabularies and data can be published in a language-independent fashion, while associated language-dependent (linguistic) information supporting the access across languages can be stored separately. In this sense, the multilingual Web of Data can be realized in our view as a layer of services and resources on top of the existing Linked Data infrastructure adding (i) linguistic information for data and vocabularies in different languages, (ii) mappings between data with labels in different languages, and (iii) services to dynamically access and traverse Linked Data across different languages. In this article, we present this vision of a multilingual Web of Data. We discuss challenges that need to be addressed to make this vision come true and discuss the role that techniques such as ontology localization, ontology mapping, and cross-lingual ontology-based information access and presentation will play in achieving this. Further, we propose an initial architecture and describe a roadmap that can provide a basis for the implementation of this vision.
P. Archer, S. Goedertier, and N. Loutas. Study on persistent URIs. Technical report, Dec. 2012. Available from http://joinup.ec.europa.eu/sites/default/files/D7.1.3%20-%20Study%20on%20persistent%20URIs_0.pdf.
D. Lewis. Interoperability challenges for linguistic linked data. In Proc. of Open Data on the Web ODW'13, Apr. 2013.
M. Ehrmann, F. Cecconi, D. Vannella, J. P. McCrae, P. Cimiano, and R. Navigli. Representing multilingual data as linked data: the case of BabelNet 2.0. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May 2014. ELRA.
G. Francopoulo, N. Bel, M. George, N. Calzolari, M. Monachini, M. Pet, and C. Soria. Lexical markup framework (LMF) for NLP multilingual resources. In Proc. of the Workshop on Multilingual Language Resources and Interoperability, pages 1-8, Sydney, Australia, July 2006. ACL.