When linguistics meets web technologies. Recent advances in modelling linguistic linked data

Abstract

This article provides a comprehensive and up-to-date survey of models and vocabularies for creating linguistic linked data (LLD), focusing on the latest developments in the area and both building upon and complementing previous works covering similar territory. The article begins with an overview of some recent trends which have had a significant impact on linked data models and vocabularies. Next, we give a general overview of existing vocabularies and models for different categories of LLD resource. After this, we look at some of the latest developments in community standards and initiatives, including descriptions of recent work on the OntoLex-Lemon model, a survey of recent initiatives in linguistic annotation and LLD, and a discussion of the LLD metadata vocabularies META-SHARE and lime. In the next part of the paper, we focus on the influence of projects on LLD models and vocabularies, starting with a general survey of relevant projects, before dedicating individual sections to a number of recent projects and their impact on LLD vocabularies and models. Finally, in the conclusion, we look ahead at some future challenges for LLD models and vocabularies. The appendix to the paper consists of a brief introduction to the OntoLex-Lemon model.
Semantic Web 0 (0) 1
IOS Press
When Linguistics Meets Web Technologies.
Recent advances in Modelling Linguistic
Linked Open Data
Anas Fahad Khan a, Christian Chiarcos b, Thierry Declerck c, Daniela Gifu d,
Elena González-Blanco García e, Jorge Gracia f, Maxim Ionov b, Penny Labropoulou h,
Francesco Mambrini i, John P. McCrae j, Émilie Pagé-Perron k, Marco Passarotti i,
Salvador Ros Muñoz l, Ciprian-Octavian Truică m
a Istituto di Linguistica Computazionale «A. Zampolli», Consiglio Nazionale delle Ricerche, Italy
E-mail: fahad.khan@ilc.cnr.it
b Applied Computational Linguistics Lab, Goethe-Universität Frankfurt am Main, Germany
E-mail: chiarcos@informatik.uni-frankfurt.de,
E-mail: ionov@informatik.uni-frankfurt.de
c DFKI GmbH, Multilinguality and Language Technology, Saarbrücken, Germany
E-mail: declerck@dfki.de
d Faculty of Computer Science, Alexandru Ioan Cuza University of Iasi, Romania
E-mail: daniela.gifu@info.uaic.ro
e Laboratory of Innovation on Digital Humanities, IE University, Spain
E-mail: egonzalezblanco@faculty.ie.edu
f Aragon Institute of Engineering Research, University of Zaragoza, Spain
E-mail: jogracia@unizar.es
g Applied Computational Linguistics Lab, Goethe-Universität Frankfurt am Main, Germany
E-mail: ionov@informatik.uni-frankfurt.de
h Institute for Language and Speech Processing, Athena Research Center, Greece
E-mail: penny@athenarc.gr
i CIRCSE Research Centre, Università Cattolica del Sacro Cuore, Milan, Italy
E-mail: francesco.mambrini@unicatt.it,
E-mail: marco.passarotti@unicatt.it
j Insight SFI Research Centre for Data Analytics, Data Science Institute, National University of Ireland Galway,
Ireland
E-mail: john.mccrae@insight-centre.org
k Wolfson College, University of Oxford, United Kingdom
E-mail: emilie.page-perron@wolfson.ox.ac.uk
l Laboratory of Innovation on Digital Humanities, National Distance Education University UNED, Spain
E-mail: sros@scc.uned.es
m Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University
Politehnica of Bucharest, Romania
E-mail: ciprian.truica@upb.ro
1570-0844/0-1900/$35.00 © 0 IOS Press and the authors. All rights reserved
Abstract. This article provides an up-to-date and comprehensive survey of models (including vocabularies, taxonomies and
ontologies) used for representing linguistic linked data (LLD). It focuses on the latest developments in the area and both builds
upon and complements previous works covering similar territory. The article begins with an overview of recent trends which
have had an impact on linked data models and vocabularies, such as the growing influence of the FAIR guidelines, the funding
of several major projects in which LLD is a key component, and the increasing importance of the relationship of the digital
humanities with LLD. Next, we give an overview of some of the most well known vocabularies and models in LLD. After this
we look at some of the latest developments in community standards and initiatives such as OntoLex-Lemon as well as recent
work which has been carried out in corpora and annotation and LLD, including a discussion of the LLD metadata vocabularies
META-SHARE and lime and language identifiers. In the following part of the paper we look at work which has been realised in
a number of recent projects and which has a significant impact on LLD vocabularies and models.
Keywords: linguistic linked data, FAIR, corpora, annotation, language resources, OntoLex-Lemon, Digital Humanities, metadata,
models
1. Introduction
The growing popularity of linked data, and espe-
cially of linked open data (that is, linked data with an
open license), as a means of publishing language re-
sources (lexica, corpora, data categories, etc.) necessi-
tates a greater emphasis on models for linguistic linked
data (LLD) since these are key to what makes linked
data resources so reusable and so interoperable (at a
semantic level). The purpose of this article is to
provide a comprehensive and up-to-date survey of
models used for representing linguistic linked data.
It will focus on the latest developments and will both
build upon and complement previous works covering
similar territory, while avoiding too much repetition
of, and overlap with, the latter.
In the following section, Section 2, we give an
overview of a number of trends from the last few years
which have had, or which are likely to have, a signifi-
cant impact on the definition and/or use of LLD mod-
els. We relate these trends to the rest of the article by
highlighting relevant sections of the article (in bold).
This overview of trends will help to locate the present
work within a wider research context, something that
is extremely useful in an area as active as linguistic
linked data, as well as assisting readers in navigating
the rest of the article. Next, in Section 2.4, we com-
pare the present article with other related work, includ-
ing an earlier survey of LLD models, in order to help
clarify the topics and approach of the present work.
Section 3 gives an overview of the most widely used
models in LLD. Then in Section 4, we look at recent
developments in community standards and initiatives.
These include the latest extensions of the
OntoLex-Lemon model in Section 4.1, a discussion of
relevant work in corpora and annotations in Section 4.2,
and a section on metadata, Section 4.3. Finally there is
a section discussing projects, Section 5, and the
conclusion, Section 6.
2. Setting the Scene: An Overview of Relevant
Trends for LLD
The trends we have decided to focus on in this
overview are the FAIRification of data in Section 2.1,
the importance of projects to LLD models in Section
2.2, and finally the increasing importance of Digital
Humanities use cases in Section 2.3.
2.1. FAIR New World
With the growing importance of Open Science
initiatives, and especially those promoting the FAIR
guidelines (where FAIR stands for Findable,
Accessible, Interoperable and Reusable) [1], and the
consequent emphasis on the modelling, creation and
publication of language resources as FAIR digital
resources, shared models and vocabularies have begun
to take on an increasingly prominent role. Although the
linguistic linked data community has been active in
promoting shared RDF vocabularies and models for
years, this new emphasis on FAIR is likely to have a
considerable impact in several ways, not least in terms
of the necessity for these models to demonstrate greater
coverage and to be more interoperable with one
another. We will look at one series of FAIR-related
recommendations for models in Section 3 and see how
they might be applied to the case of LLD. However, in
the rest of the subsection we will take a closer look
at the FAIR principles themselves and show why their
widespread adoption is likely to lead to a greater role
for LLD models and vocabularies in the future.
In The FAIR Guiding Principles for scientific data
management and stewardship [1], the article which
first articulated the well known FAIR principles, the
authors clearly state that the criteria proposed by these
principles are intended both "for machines and peo-
ple" and that they provide "‘steps along a path’ to ma-
chine actionability", where the latter is understood to
describe structured data that would allow a "computa-
tional data explorer" to determine:
- The type of a "digital research object"
- Its usefulness with respect to tasks to be carried out
- Its usability especially with respect to licensing
  issues, represented in a way that would allow the
  agent to take "appropriate action".
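As a rough illustration of what this kind of machine actionability might involve, the following sketch shows a hypothetical agent that inspects a metadata record to establish the type of a resource and its licence before deciding what to do. The record layout, the `decide_action` helper and the list of licences treated as open are assumptions made purely for illustration; only the `rdf:type`, `dcterms:license` and `lime:Lexicon` URIs are existing vocabulary terms.

```python
# Hypothetical sketch: an agent inspecting metadata to decide what it may do.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT_LICENSE = "http://purl.org/dc/terms/license"
LEXICON = "http://www.w3.org/ns/lemon/lime#Lexicon"

# Licences that this (hypothetical) agent treats as permitting automatic reuse.
OPEN_LICENSES = {
    "https://creativecommons.org/licenses/by/4.0/",
    "https://creativecommons.org/publicdomain/zero/1.0/",
}


def decide_action(metadata, wanted_type):
    """Return 'reuse', 'ask-permission' or 'skip' for one metadata record.

    `metadata` maps property URIs to lists of values (an assumed layout).
    """
    if wanted_type not in metadata.get(RDF_TYPE, []):
        return "skip"  # not the kind of digital research object we need
    licenses = metadata.get(DCT_LICENSE, [])
    if any(lic in OPEN_LICENSES for lic in licenses):
        return "reuse"  # the licence is explicit and open
    return "ask-permission"  # right type, but licence missing or not open


record = {
    RDF_TYPE: [LEXICON],
    DCT_LICENSE: ["https://creativecommons.org/licenses/by/4.0/"],
}
print(decide_action(record, LEXICON))
```

The point of the sketch is only that both decisions depend on the type and the licence being stated explicitly with shared identifiers; a real agent would of course work over RDF data rather than a Python dictionary.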
The current popularity of the FAIR principles and,
in particular, their promotion by governments and
research funding bodies, such as the European
Commission,1 through several national and international
initiatives reflects a wider recognition of the potential of
structured and machine actionable data in changing
how research is carried out, and especially in helping
to support open science practices. The FAIR ideal, in
short, is to allow machines as much autonomy as pos-
sible in working with data, by the expedient of render-
ing as much of the semantics of that data explicit (and
machine actionable) as possible.
Publishing data using a standardised data model like
the Resource Description Framework2 (RDF), which
was specifically intended to facilitate interoperability
and interlinking between datasets, along with the
other standards proposed in the Semantic Web stack
and the technical infrastructure which has been
developed in order to support it, obviously goes a long
way towards facilitating the publication of datasets as
FAIR data. In addition, however, it is also vital that
there exist specialised vocabularies/terminologies/models
and data category registries in order to ensure a viable,
domain-wide level of interoperability and re-usability
of data. These resources serve to describe the
shared theoretical assumptions held by a community of
experts with regard to the semantics of the terms used
by that community, and do so in a form that is (to
varying extents) readable by computational agents.
The following FAIR principles are especially salient
here:
F2. data are described with rich metadata.
I1. (meta)data use a formal, accessible, shared,
and broadly applicable language for knowledge
representation.
I2. (meta)data use vocabularies that follow FAIR
principles.
1 https://ec.europa.eu/info/sites/info/files/turning_fair_into_reality_0.pdf
2 https://www.w3.org/TR/rdf-primer/
It is important to note that the emphasis placed
on machine actionability in FAIR resources (that
is, on enabling computational agents to find rele-
vant datasets and resources and to take "appropriate
action" when they find them) gives Semantic Web
vocabularies/registries a substantial advantage over
other (non-Semantic Web native) standards in the
field of linguistics like the Text Encoding Initiative
(TEI) guidelines3 [2], the Lexical Markup Framework
(LMF) [3] or the Morpho-syntactic Annotation Frame-
work (MAF) [4].
For a start, none of these other standards possess
a ‘native’, widely-used, widely technically supported
knowledge representation language for describing the
semantics of vocabulary terms in a machine readable
way, or at least nothing as powerful as the Web
Ontology Language (OWL)4 or the Semantic Web Rule
Language (SWRL)5. For instance, there is no
standardised way of describing the meanings of morphemes,
lexemes, lemmas, etc. in TEI in a machine actionable
way.
The ability to give precise, axiomatic definitions of
terms in a formal knowledge representation (KR)
language (allied with already established conceptual
modelling techniques and ontology engineering best
practices) is especially helpful in humanistic disciplines
such as linguistics or literary scholarship, where there
can often be quite different definitions of the same
or similar core concepts, e.g., with respect to
different scholarly traditions or schools of thought. Using
a machine readable description in OWL, once again
in conjunction with an ontology modelling
methodology such as OntoClean [5], and together with a
more human readable description given as documentation,
can help to clarify (within the expressive limitations
of OWL) what we mean when we use a concept like
‘Sense’ or ‘Morpheme’ in a dataset. It also facilitates
the machine readable description of the relationship
between different definitions of concepts across
languages or traditions.
3 https://tei-c.org/guidelines/
4 https://www.w3.org/TR/2012/REC-owl2-overview-20121211/
5 https://www.w3.org/Submission/SWRL/
Secondly, thanks to the use of a shared data model
and a powerful native linking mechanism, linguistic
linked data datasets can be easily (and in a standard
way) integrated with/enriched by (linked data) datasets
belonging to other disciplines, for instance
geographical and historical datasets or gazetteers and
authority lists. OWL, and vocabularies such as PROV-O,6
also allow us to add information pertaining to when
something happened, or whether we are describing a
hypothesis7 or not (in which case, also who made it and
when). Once again, all of these things can be described
in a way that makes the semantics of the information
(relatively) explicit and machine actionable through
the use of pre-existing standards and technologies,
including the Semantic Web query language SPARQL
(SPARQL Protocol and RDF Query Language) as well
as freely available Semantic Web reasoning engines.
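The integration step described above can be sketched very simply if we treat each dataset as a set of subject-predicate-object triples. The following is a minimal illustration in plain Python rather than with an RDF library; the two datasets, all `example.org` URIs, the property names and the coordinate values are hypothetical and chosen only to show the mechanism.

```python
# Two independently published (hypothetical) datasets as sets of triples.
LEX = "http://example.org/lexicon/"
GAZ = "http://example.org/gazetteer/"

lexicon = {
    (LEX + "entry/roma", LEX + "prop/denotesPlace", GAZ + "place/rome"),
}
gazetteer = {
    (GAZ + "place/rome", GAZ + "prop/latitude", "41.9"),
    (GAZ + "place/rome", GAZ + "prop/longitude", "12.5"),
}

# Because both datasets share the same triple model, integrating them
# amounts to a set union; the shared URI for the place acts as the link.
merged = lexicon | gazetteer


def objects(graph, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def follow(graph, subject, link, target_property):
    """Follow `link` from `subject`, then read `target_property` there."""
    return [
        value
        for target in objects(graph, subject, link)
        for value in objects(graph, target, target_property)
    ]


print(follow(merged, LEX + "entry/roma",
             LEX + "prop/denotesPlace", GAZ + "prop/latitude"))
```

In practice the same lookup would typically be expressed as a SPARQL query over the merged graph; the sketch only shows why a common data model and shared identifiers make that query possible.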
Moreover, the pursuit of the FAIR ideal has opened
the way to new means of publishing datasets which
offer enhanced opportunities for the re-use of such
data in an automatic or semi-automatic way. These
include, for instance, nanopublications, cardinal
assertions and knowlets8. The potential of these new
publishing approaches for discovering new facts, as well
as for comparing concepts and tracking how single
concepts change, is well described in [6].
The field of language resources offers us a rich array
of highly structured kinds of datasets, structured
according to a series of widely shared conventions (this
is what makes the definition of models and
vocabularies for lexica, corpora, etc., so viable in the
first place), something that would seem to lend itself
well to making such resources FAIR in the
machine-oriented spirit of the original description of
those principles, as well as to the new data publication
approaches previously mentioned. However, the better
and more expressive the underlying models are, the
more effective they will be.
6 https://www.w3.org/TR/prov-o/
7 For which one could use the Semantic Web ontology
CRMInf (http://www.cidoc-crm.org/crminf/).
8 Nanopublications are defined as the "smallest possible machine
readable graph-like structure that represents a meaningful
assertion" [6] and consist of publishing a single subject-predicate-object
triple with full provenance information; a generalisation of this idea
is that of the cardinal assertion, where a single assertion is
associated with more than one provenance graph. A knowlet consists of a
collection of multiple cardinal assertions with the same subject
concept [6] and can be viewed as locating that concept in a rich
‘conceptual space’. For instance, this could be a cloud of predicates
centered around a word or a sense.
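The relationship between these three notions can be made more concrete with a small data structure sketch. The code below follows the informal definitions given in the footnote above rather than any published nanopublication schema: the class and function names, the `ex:` identifiers and the use of a plain string for a provenance graph are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopublication:
    """One subject-predicate-object assertion plus its provenance."""
    subject: str
    predicate: str
    obj: str
    provenance: str  # identifier of a provenance graph (hypothetical)


def cardinal_assertions(nanopubs):
    """Group identical assertions, collecting every provenance graph for each.

    An assertion with a single source is kept here as a degenerate case.
    """
    grouped = defaultdict(set)
    for n in nanopubs:
        grouped[(n.subject, n.predicate, n.obj)].add(n.provenance)
    return dict(grouped)


def knowlet(nanopubs, concept):
    """All cardinal assertions whose subject is the given concept."""
    return {
        triple: sources
        for triple, sources in cardinal_assertions(nanopubs).items()
        if triple[0] == concept
    }


pubs = [
    Nanopublication("ex:bank", "ex:hasSense", "ex:bank_financial", "ex:prov1"),
    Nanopublication("ex:bank", "ex:hasSense", "ex:bank_financial", "ex:prov2"),
    Nanopublication("ex:bank", "ex:hasSense", "ex:bank_river", "ex:prov1"),
    Nanopublication("ex:river", "ex:relatedTo", "ex:bank", "ex:prov3"),
]
k = knowlet(pubs, "ex:bank")
print(len(k))
print(sorted(k[("ex:bank", "ex:hasSense", "ex:bank_financial")]))
```

Here the knowlet for `ex:bank` contains two distinct assertions, one of which is supported by two provenance graphs; the assertion about `ex:river` is excluded because it has a different subject.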
In order to ensure the continued effectiveness of
linked data and the Semantic Web in facilitating the
creation of FAIR resources, it is vital that pre-existing
vocabularies/models/data registries be re-used
whenever possible in the modelling of user data; this of
course also means ensuring that these models have
sufficient coverage and defining extensions when this is
not the case, as well as creating training materials
suitable for different groups of users. Part of the intention
of this article, together with the foundational work
carried out in [7], is to provide an overview of what exists
in terms of LLD-focused models, and to look at
the areas which are receiving most attention in order to
highlight those which are so far underrepresented. In
addition, in Section 3 we look at the most well known
LLD models in the light of a recent series of
recommendations on the publication of models as FAIR
resources.
2.2. The Importance of Projects and Community
Initiatives in LLD
One significant indicator of the success which LLD
has had in the last few years is the variety of new
funded projects which have included the publication
of linguistic datasets as linked data as a core theme.
These include projects at a continental or transnational
level, notably European H2020 projects, ERCs and
COST actions, as well as projects at the national and
regional levels. Arguably, this recent success in
obtaining project funding reflects a much wider recognition
of the importance of linked data as a means of ensuring
the interoperability and accessibility of language
resources both to the research community and to a wider
public. It also demonstrates the continuing
maturation of the field as LLD continues to be applied
to new domains and use cases, in many cases within
the context of the projects alluded to above. In
addition, these projects offer us clear examples of the
use ‘in the wild’, so to speak, of the LLD vocabularies
and models we will look at, and demonstrate their
application to a large number of medium to large scale
datasets.
We have therefore decided to dedicate a section of
the current article, Section 5, to a detailed discussion
of the current situation as regards research projects and
LLD models and vocabularies. This includes a detailed
overview of the area, Section 5.1, along with
extended descriptions of a number of projects which we
regard as the most significant from the point of view of
LLD models and vocabularies. These are (in order of
appearance): the Linked Open Dictionaries (LiODi)
project (Section 5.2.1); the Poetry Standardization
and Linked Open Data (POSTDATA) project (Section
5.2.2); the LiLa: Linking Latin ERC project (Section
5.2.4); the Prêt-à-LLOD project (Section 5.2.5); and the
European network for Web-centred linguistic data
science (NexusLinguarum) COST action (Section 5.2.6). A
list of all the projects described in Section 5 can be
found in Table 3.
Note, however, that although the projects which we
will discuss in Section 5 have, in many cases, set the
agenda for the development of LLD models and vo-
cabularies, much of the actual work on the definition of
these resources was carried out, and is being carried
out, within community groups such as the W3C
OntoLex group. We therefore include an update on
community standards and initiatives in Section 4. These
include a subsection on the latest activities in the On-
toLex group (Section 4.1); a discussion of recent work
on LLD models for corpora and annotation (Section
4.2); and similarly for what concerns models and vo-
cabularies for LLD resource metadata (Section 4.3).
Section 5.1.2 features a discussion of the relation-
ship between community initiatives and projects.
2.3. The Relationship of LLD to the Digital
Humanities
Several of the projects which we will discuss in this
article are related to the area of Digital Humanities
(DH). This is the third major trend which we want
to highlight here, since it represents a move away (or
rather a branching off) from LLD’s beginnings in com-
putational linguistics and natural language processing
(although these latter two still perhaps represent the
majority of applications of LLD), something that calls
for a shift in emphases in the definition and coverage
of LLD models. This overlap between LLD and DH is
particularly apparent in the modelling of corpora an-
notation (Section 4.2) and in support for lexicographic
use cases (see Section 4.1.1 and Section 5.2.3). Indeed
one obvious example of these shared concerns is the
publication of retro-digitised dictionaries as LLD lex-
ica (a major theme of the ELEXIS project, see Sec-
tion 5.2.3). The latter use case confronts us with the
challenge of formally modelling both the content of a
lexicographic work, that is the linguistic descriptions
which it contains, as well as those aspects which per-
tain to it as a physical text to be represented in digital
form. In the latter case, this includes the representation
of (elements of) the form of the text, i.e., its structural
layout and overall visual appearance.9 In fact, as we
discuss in our description of the OntoLex
Lexicography module in Section 4.1.1, even the structural
division of lexicographic works into textual units such as
entries and senses is not always isomorphic to the
representation of the lexical content of those units using
OntoLex-Lemon classes such as LexicalEntry and
LexicalSense.
We may also wish to model different aspects of the
history of the lexicographic work as physical text.10
All of this calls for a much richer provision of meta-
data categories than had previously been considered
for LLD lexica, both at the level of the whole work as
well as at the level of the entry. It also requires the ca-
pacity to model salient aspects of the same artefact or
resource at different levels of description (something
which is indeed offered by the OntoLex Lexicogra-
phy module, see Section 4.1.1). We discuss metadata
challenges in humanities use cases in Section 4.3. A
related topic is the relationship between notions such
as word from the lexical/linguistic and the
philological points of view. More broadly speaking, the
relationship between linguistic and philological
annotations of text is a topic which is just starting to gain
attention within the context of LLD. It is being studied
both at the level of community initiatives (see Section
4.2) and in projects such as LiLa (see Section
5.2.4) and POSTDATA (Section 5.2.2).
An additional series of challenges arises in the
consideration of resources for classical and historical
languages, or indeed, historical stages of modern
languages. For instance, in the case of lexical resources
for historical languages we often come up against the
necessity of having to model attestations (something
that is discussed in Section 4.1.3) which sometimes
cite reconstructed texts, as well as the desirability of
being able to represent different scholarly and
philological hypotheses, for instance when it comes to
modelling etymologies. The LiLa project [9] (see Section
5.2.4 for a more detailed description) provides a good
example of the challenges and opportunities of adopting
the LLD model to represent linguistic (meta)data for both
lexical and textual resources for a classical language
(Latin).
9 Encompassing what the TEI dictionary chapter guidelines call
the typographical and editorial views. See
https://www.tei-c.org/release/doc/tei-p5-doc/en/html/DI.html#DIMV
10 For example, in the case of older resources, annotating instances
where the content has been superseded by subsequent scholarly
work. Or we might want to track the evolution of a historically
significant lexicographic work over the course of a number of editions,
in order to see, for example, how changes in entries reflected both
linguistic and wider, non-linguistic trends. This was in fact one of
the motivations behind the Nénufar project [8], described in Section
5.1.1.
One extremely important (non RDF-based) standard
for encoding documents in the Digital Humanities is
TEI/XML. In the current article we discuss the rela-
tionship between TEI and RDF-based annotation ap-
proaches in Section 4.2.1, and introduce the new lexi-
cographic TEI-based standard TEI Lex-0 and describe
current work on a crosswalk between OntoLex-Lemon
and the latter in Section 5.2.3.
Finally, see Section 5.1.1 for an overview of a num-
ber of projects combining DH and LLD.
2.4. Related Work
The current work is intended, among other things,
both to complement and to update a previous general
survey on models for representing LLD, published
by Bosque-Gil et al. in 2018 [7]. Although we are now
only two years on from the publication of that article,
we feel that enough has happened in the intervening
time period to justify a new survey article. In addition,
we believe that we cover a much wider range of topics
than the previous article and that our focus is also quite
different. Broadly speaking, that previous work offered
a classification of various different LLD vocabularies
according to the different levels of linguistic
description that they covered. The current paper, however,
concentrates more on the use of LLD vocabularies in
practice and on their availability (this is very much how
we have approached the survey in Section 3). Moreover,
the present article includes a detailed discussion of
recent work in the use of LLD models and vocabularies
in corpora and annotation, Section 4.2, as well as an
extensive section on metadata, Section 4.3, neither of
which was given the same detailed level of coverage
in [7]. Additionally, we also cover the following
initiatives which were not discussed in [7] because they
had not yet gotten underway:
- The development of new OntoLex-Lemon modules
  for morphology (Section 4.1.2) and for frequency,
  attestations, and corpus information (Section 4.1.3).
- An important new initiative in aligning LLD
  vocabularies for corpora and annotation, described
  in Section 4.2.5.
In what follows we will assume that the reader already
has some grounding in linked data in general (including
a basic familiarity with the Resource Description
Framework (RDF), RDF Schema (RDFS) and the Web
Ontology Language (OWL)) and in linguistic linked
data in particular. The recently published Linguistic
linked data: representation, generation and
applications [10] should, however, give the interested reader
who is missing this minimal background a
comprehensive introduction to and overview of the latter field,
focusing on more established models and
vocabularies and their application rather than on recent
developments. Another important new book on the topic of
LLD which has relevance to the current work is
the collected volume Development of linguistic linked
open data resources for collaborative data-intensive
research in the language sciences [11], which aims to
describe major developments since 2015. It consists
mostly of position papers by researchers from the
linguistics and the language resource communities.
3. LLD Models: An Overview
Summary. The current section will give an overview
of some of the most well known and/or widely used
models and vocabularies in LLD. A summary of the
models discussed in the current section (and in the
whole article) can be found in Tables 1 and 2 (with
Table 1 dealing with published LLD models/vocabularies
and Table 2 with models/vocabularies that are currently
unavailable or no longer updated). An account of some
of the latest developments with regard to these models,
on the other hand, can be found in Section 4.
We will classify each of the models described in this
section according to the scheme given in the linguis-
tic LOD cloud diagram11 (the cloud itself is described
in [12]), namely:
- Corpora (and Linguistic Annotations) (Section 3.1)
- Lexicons and Dictionaries (Section 3.2)
- Terminologies, Thesauri and Knowledge Bases
  (Section 3.3)
- Linguistic Resource Metadata (Section 3.4)
- Linguistic Data Categories (Section 3.5)
- Typological Databases (Section 3.6)
11 http://linguistic-lod.org/llod-cloud
For each category we list the most prominent and/or
widely used LLD models/vocabularies that belong to
that category (the relevant section is given in
parentheses after each category in the list above). These
models were either originally designed to help encode
that kind of dataset or have been widely
appropriated for that end; in the case of the category
Linguistic Data Categories we list linked data linguistic data
categories. For instance, the OntoLex-Lemon model
falls under Lexicons and Dictionaries since it was
initially conceived as a means of enriching ontologies
with lexical information, that is, of lexicalising
ontological concepts, but subsequently gained popularity as
a means of encoding linked data lexica, although it can
also be used for modelling and publishing other kinds
of datasets. Tables 1 and 2 give a summary of the LLD
vocabularies and models covered in this paper (with
the relevant sections of the article listed).
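To give a sense of what lexicalising an ontological concept with OntoLex-Lemon looks like in data terms, the following minimal sketch builds a handful of triples for a single lexical entry and prints them in N-Triples syntax. It uses only the Python standard library rather than an RDF toolkit. The OntoLex-Lemon terms (`LexicalEntry`, `canonicalForm`, `writtenRep`, `sense`, `reference`) come from the model itself, whereas the `example.org` URIs, the helper functions and the choice of the word "cat" are assumptions made for illustration.

```python
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace for the example data


def iri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(text, lang):
    """A language-tagged literal; only backslashes and quotes are escaped."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"@' + lang


entry = EX + "lex/cat"
triples = [
    # The entry itself, its canonical form and that form's written form.
    (iri(entry), iri(RDF_TYPE), iri(ONTOLEX + "LexicalEntry")),
    (iri(entry), iri(ONTOLEX + "canonicalForm"), iri(entry + "#form")),
    (iri(entry + "#form"), iri(ONTOLEX + "writtenRep"), literal("cat", "en")),
    # The sense links the entry to a concept in a (hypothetical) ontology.
    (iri(entry), iri(ONTOLEX + "sense"), iri(entry + "#sense1")),
    (iri(entry + "#sense1"), iri(ONTOLEX + "reference"), iri(EX + "ontology/Cat")),
]


def to_ntriples(rows):
    """Serialise (subject, predicate, object) term strings as N-Triples."""
    return "\n".join(" ".join(row) + " ." for row in rows)


print(to_ntriples(triples))
```

The last two triples carry the lexicalisation proper: the sense connects a word in the lexicon to a concept defined elsewhere, which is the original use case mentioned above.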
Below we describe our methodology for the rest of
the section. In Section 3.7 we discuss tools and plat-
forms for the publication of LLD.
Our Approach to Classification
This section is intended as an overview, so we do not
give a detailed description of single models. Several of
these models are described in more detail in the rest of
the article, or in the Appendix in the case of
OntoLex-Lemon. Others can be found in the previous survey
paper [7]. Instead we will describe them here on the
basis of a number of criteria, many of which are related
to their status as FAIR resources, and in particular to
their status as FAIR models and vocabularies.
Specifically, we will refer to a recent draft survey on FAIR
Semantics [13], the result of a dedicated brainstorming
workshop of the FAIRsFAIR project.12 This report
outlined a number of recommendations and best practices
for FAIR semantic artefacts, where these are defined as
"machine -actionable and -readable formalisation[s] of
a conceptualisation enabling sharing and reuse by
humans and machines" (the term includes: taxonomies,
thesauri, ontologies).
From all of the recommendations listed in [13] we
have selected the following subset on the basis of
their salience to the set of models and vocabularies
under discussion (with justifications for recommenda-
tions based on those given in [13]):
(P-Rec 2) Ensure there is a separate URI for the metadata and that they are published separately; this helps in making the resource more findable and supports the extraction of this metadata;
(P-Rec 4) Publish semantic artefacts and their contents in a semantic repository, in order to be able to exploit repository technologies for findability and re-use of semantic artefacts;
(P-Rec 6) Retrievability through search engines;
(P-Rec 10) Use a foundational ontology to align semantic artefacts (this enhances re-usability);
(P-Rec 13) Create documented crosswalks and bridges;
(P-Rec 16) Ensure clear licensing of semantic artefacts.
12 https://www.fairsfair.eu/
To start with, the recommendations (P-Rec 2), (P-Rec 4), and (P-Rec 10) have been followed by none of the models/vocabularies which we look at below. Following these three recommendations would, however, greatly help to make these resources (and the datasets they help to encode) more FAIR, and we regard their adoption as a desirable future objective for the models and vocabularies listed below, bringing them into line with the latest thinking on making such kinds of resource FAIR.13 As for recommendation (P-Rec 13), at the time of writing we can only mention ongoing efforts at developing a TEI Lex-0/OntoLex-Lemon crosswalk, described in Section 5.2.3.
We will use (P-Rec 6) and (P-Rec 16) to help us analyse the models and vocabularies that follow. For instance, several of the models mentioned can be found on the Linked Open Vocabularies (LOV)14 search engine15 [15] and in the DBpedia archivo ontology archive.17 In cases where licensing information is available as machine-actionable metadata, using properties like dct:license and URIs such as https://creativecommons.org/publicdomain/zero/1.0/, we will point this out as it enhances the re-usability of those resources.
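To make the notion of machine-actionable licensing concrete, the following minimal Python sketch checks whether a vocabulary's metadata states its license as a URI rather than as free text. It is an illustration only: the function name, the regular expression (which handles only simple N-Triples statements) and the sample vocabulary http://example.org/vocab are assumptions introduced here, not part of any of the models surveyed.

```python
import re

# Full IRIs of two properties used to state a licence (dct:license, cc:license).
LICENSE_PREDICATES = {
    "http://purl.org/dc/terms/license",
    "http://creativecommons.org/ns#license",
}

# Simplified pattern for one N-Triples statement: <s> <p> then an <IRI> or a "literal".
# Assumption: no blank nodes and no escaped quotes; enough for a sketch.
TRIPLE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|".*")\S*\s*\.\s*$')


def machine_actionable_licenses(ntriples: str, subject: str) -> list[str]:
    """Return licence IRIs stated about `subject`; literal-valued licences are skipped."""
    found = []
    for line in ntriples.splitlines():
        match = TRIPLE.match(line)
        if not match:
            continue  # blank lines, comments, unsupported statements
        s, p, o = match.groups()
        if s == subject and p in LICENSE_PREDICATES and o.startswith("<"):
            found.append(o[1:-1])  # strip the angle brackets
    return found


# Hypothetical metadata for an illustrative vocabulary (not a real resource).
SAMPLE = """
<http://example.org/vocab> <http://purl.org/dc/terms/license> <https://creativecommons.org/publicdomain/zero/1.0/> .
<http://example.org/vocab> <http://purl.org/dc/terms/rights> "Free to reuse" .
"""

print(machine_actionable_licenses(SAMPLE, "http://example.org/vocab"))
```

Only the first statement counts here: the second gives rights information as a plain string, which a human can read but which software cannot reliably interpret.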
In addition to the written descriptions of different LLD models given below, we also give a tabular summary of the most significant/stable/widely available18 of these models in Table 1. This also points, in relevant cases, to the sections of the paper where a more in-depth description of said model is given. Every one of the models listed in the table is an OWL ontology. We will also list the other vocabularies which they make use of (aside from the vocabularies OWL, RDF, and RDFS which are common to all of the vocabularies on the list). These include the well known ontologies/vocabularies: XML Schema Definition19 (XSD); the Friend of a Friend Ontology20 (FOAF); the Simple Knowledge Organisation System21 (SKOS); Dublin Core22 (DC); Dublin Core Metadata Initiative (DCMI) Metadata Terms;23 the Data Catalog Vocabulary24 (DCAT), described also in Section 4.3; the PROV Ontology25 (PROV-O).
13 The adoption of foundational ontologies, for instance, might help to alleviate some of the problems raised by the proliferation of independent developments as described in [7].
14 https://lov.linkeddata.es/dataset/lov
15 Note that the LOV site provides a list of criteria for inclusion on their search engine [14]16
17 http://archivo.dbpedia.org/
18 Several of the models which are described in the rest of the section are no longer available, but may be interesting for historical reasons.
In addition, the table also mentions the following vocabularies.
Activity Streams (AS): a vocabulary for activity streams.26
GOLD: an ontology for describing linguistic data, which is described in Section 3.5.
MARL: a vocabulary for describing and annotating subjective opinions.27
ITSRDF: an ontology used within the Internationalization Tag Set.28
The Creative Commons vocabulary29 (CC).
VANN: a vocabulary for annotating vocabulary descriptions.30
SKOS-XL: an extension of SKOS with extra support for “describing and linking lexical entities”.31 SKOS and SKOS-XL are, along with lemon and its successor OntoLex-Lemon, amongst the most well known ways of enriching linked data taxonomies and conceptual hierarchies with linguistic information. We will look at the use of a SKOS-XL vocabulary in the context of a project on the classification of folk tales in Section 5.
19 https://www.w3.org/TR/xmlschema-0/
20 http://xmlns.com/foaf/spec/
21 https://www.w3.org/2004/02/skos/
22 https://www.dublincore.org/specifications/dublin-core/dcmi-terms/
23 https://www.dublincore.org/specifications/dublin-core/dcmi-terms/
24 https://www.w3.org/TR/vocab-dcat-2/
25 https://www.w3.org/TR/prov-o/
26 https://www.w3.org/TR/activitystreams-vocabulary/
27 http://www.gsi.dit.upm.es/ontologies/marl/
28 https://www.w3.org/TR/its20/
29 https://creativecommons.org/ns
30 https://vocab.org/vann/
31 https://www.w3.org/TR/skos-reference/skos-xl.html
3.1. Vocabularies and Models for Corpora and Linguistic Annotations
Linguistic annotation, e.g. for digital editions, corpora, and the linking of texts with external resources, has long been a topic of interest in the context of RDF and linked data. Coexisting with relational databases, XML-based formats (most notably TEI, see Section 4.2) and simple text-based formats, RDF-based annotation models have been undergoing steady development and are increasingly being used in research and industry. Currently, there are two primary RDF vocabularies widely used for text annotations: the NLP Interchange Format (NIF),32 used mostly in the language technology sector, and Web Annotation,33 formerly known as Open Annotation (abbreviated here as OA), used in digital humanities, life sciences and bioinformatics. Both models have their advantages and shortcomings, and a number of extensions to them have been proposed. Most importantly, there is a need for synchronization between the two. Both are available in LOV34 and archivo35 (the NIF core in the case of NIF36). The Web Annotation model, although it is covered by a W3C software and document notice and license, does not express this information in the form of triples in the resource metadata; NIF, on the other hand, does express licensing information as machine-actionable metadata.
More details about both models and their recent developments are given in Section 4.2. Other vocabularies described in that section include POWLA, CoNLL-RDF and Ligt. The first of these, POWLA,37 is available on archivo,38 and is the only one of the three to be so. CoNLL-RDF39 gives its version information as a string using the owl:versionInfo property and is covered by a CC-BY 4.0 license as specified in the LICENSE.data page.40
32 https://nif.readthedocs.io/en/latest/
33 https://www.w3.org/TR/annotation-model/
34 https://lov.linkeddata.es/dataset/lov/vocabs/nif and https://lov.linkeddata.es/dataset/lov/vocabs/oa
35 http://archivo.dbpedia.org/info?o=http://www.w3.org/ns/oa
36 http://archivo.dbpedia.org/info?o=http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core
37 http://purl.org/powla/powla.owl
38 https://archivo.dbpedia.org/info?o=http://purl.org/powla/powla.owl
39 http://purl.org/acoli/conll#
40 https://github.com/acoli-repo/conll-rdf/blob/master/LICENSE.data.txt
| Name | Other Vocabularies/Models Used | LLO Category | Licenses | Versions (at time of writing 26/07/21) | Extended Coverage in Current Article |
|---|---|---|---|---|---|
| OntoLex-Lemon | CC, DC, FOAF, SKOS, XSD | Lexicons and Dictionaries | CC0 1.0 | Version 1.0, 2016 (but this is closely based on the prior lemon model [16]) | Section 3.2, Section 4.3.3 and Appendix A |
| Lexicog (OntoLex-Lemon) | DC, LexInfo, SKOS, VOID, XSD | Lexicons and Dictionaries | CC0 | Version 1.0 (2019-03-08) | Section 3.2 and Section 4.1.1 |
| MMoOn | DC, FOAF, GOLD, LexVo, OntoLex-Lemon, SKOS, XSD | Terminologies, Thesauri and KBs (Morphology) | CC-BY 4.0 | Version 1.0, 2016 | Section 3.3 |
| Web Annotation Data Model (OA) | AS, FOAF, PROV, SKOS, XSD | Corpora and Linguistic Annotations | W3C Software and Document Notice and License | Version "2016-11-12T21:28:11Z" | Section 3.1 and Section 4.2.3 |
| NLP Interchange Format (NIF Core) | DC, DCTERMS, ITSRDF, levont, MARL, OA, PROV, SKOS, VANN, XSD | Corpora and Linguistic Annotations | Apache 2.0 and CC-BY 3.0 | Version 2.1.0 | Section 3.1 and Section 4.2.2 |
| POWLA | FOAF, DC, DCT | Corpora and Linguistic Annotations | NA | Last Updated 2018-04-03 | Section 4.2 |
| CoNLL-RDF | DC, NIF Core, XSD | Corpora and Linguistic Annotations | Apache 2.0 and CC-BY 4.0 | Last Updated 2020-05-26 | Section 4.2.4 |
| Ligt | DC, NIF Core, OA | Corpora and Linguistic Annotations | NA | Version 0.2 (2020-05-26) | Section 4.2.4 |
| META-SHARE | CC, DC, DCAT, FOAF, SKOS, XSD | Linguistic Resource Metadata | CC-BY 4.0 | Version 2.0 (pre-release) | Section 3.4 and Section 4.3.2 |
| OLiA | DCT, FOAF, SKOS | Linguistic Data Categories | CC-BY-SA 3.0 | Version last updated 27/02/20 | Section 3.5 |
| LexInfo | CC, Ontolex, TERMS, VANN | Linguistic Data Categories | CC-BY 4.0 | Version 3.0, 14/06/2014 | Section 3.5 |
| LexVo | FOAF, SKOS, SKOSXL, XSD | Typological Databases | CC-BY-SA 3.0 | Version 2013-02-09 | Section 3.6 |

Table 1
Summary of published LLD vocabularies
| Name | LLO Category | Status (at time of writing 26/07/21) | Extended Coverage in Current Article |
|---|---|---|---|
| OntoLex-Lemon: FrAC | Lexicons and Dictionaries | Under Development | Section 4.1.3 |
| OntoLex-Lemon: Morphology | Lexicons and Dictionaries | Under Development | Section 4.1.2 |
| PHOIBLE | Terminologies, Thesauri and KBs | Unavailable | Section 3.3 |
| FRED | Corpora and Linguistic Annotations | Project Specific Vocabulary | Section 4.2 |
| NAF | Corpora and Linguistic Annotations | Project Specific Vocabulary | Section 4.2 |
| GOLD | Linguistic Data Categories | No Longer Updated | Section 3.5 |

Table 2
Other LLD Vocabularies Discussed in this Paper

3.2. Lexicons and Dictionaries
The most well known model for the creation and publication of lexica and dictionaries as linked data is the OntoLex-Lemon model41 [17], an output of the W3C ontolex working group which manages its ongoing development and further extension (see Appendix A for an introduction to the model with examples and Section 4.1 for extensions and further developments). It is based on a previous model, the LExicon Model for ONtologies (lemon) [16]. Like its predecessor, OntoLex-Lemon was designed with the intention of enriching ontologies with linguistic information and not of modelling dictionaries and lexicons per se. Thanks to its popularity, however, it has come to take on the status of a de facto standard for the modelling and codification of lexical resources in RDF in general (including, for instance, retrodigitized dictionaries and wordnets). Resources which have been modelled using OntoLex-Lemon include: the LLD version of the Princeton Wordnet,42 DBnary (the linked data version of Wiktionary) [18], and the massive multilingual knowledge graph Babelnet [19]. The OntoLex-Lemon model is modular and consists of a core module along with modules for Syntax and Semantics,43 Decomposition,44 and Variation and Translation,45 as well as a dedicated metadata module, lime46 (all of these modules are described in Appendix A, except for lime which is described in Section 4.3.3).
41 The URI for OntoLex-Lemon is: http://www.w3.org/ns/lemon/ontolex and the OntoLex-Lemon guidelines can be found at https://www.w3.org/2016/05/ontolex/.
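As a rough illustration of how the core module is used in practice, the Python sketch below assembles a minimal lexical entry (entry, canonical form, sense, ontology reference) and serialises it as N-Triples. The module namespaces are the module URIs given in the footnotes with a trailing "#"; the helper names, the http://example.org/lexicon# namespace and the choice of a DBpedia resource as the reference are assumptions made for this example. Appendix A remains the place to look for a proper introduction to the model.

```python
# Namespace IRIs of the OntoLex-Lemon modules (module URIs plus a trailing "#").
MODULES = {
    "ontolex": "http://www.w3.org/ns/lemon/ontolex#",
    "synsem": "http://www.w3.org/ns/lemon/synsem#",
    "decomp": "http://www.w3.org/ns/lemon/decomp#",
    "vartrans": "http://www.w3.org/ns/lemon/vartrans#",
    "lime": "http://www.w3.org/ns/lemon/lime#",
}
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/lexicon#"  # hypothetical namespace for the illustrative data


def iri(prefix: str, local: str) -> str:
    """Expand a module prefix and local name into a bracketed IRI."""
    return f"<{MODULES[prefix]}{local}>"


def lexical_entry(lemma: str, lang: str, reference: str) -> list[str]:
    """Build N-Triples lines for one entry; assumes `lemma` is safe to use in an IRI."""
    entry, form, sense = (f"<{EX}{lemma}_{part}>" for part in ("entry", "form", "sense"))
    return [
        f"{entry} <{RDF_TYPE}> {iri('ontolex', 'LexicalEntry')} .",
        f"{entry} {iri('ontolex', 'canonicalForm')} {form} .",
        f'{form} {iri("ontolex", "writtenRep")} "{lemma}"@{lang} .',
        f"{entry} {iri('ontolex', 'sense')} {sense} .",
        f"{sense} {iri('ontolex', 'reference')} <{reference}> .",
    ]


for line in lexical_entry("cat", "en", "http://dbpedia.org/resource/Cat"):
    print(line)
```

The sketch uses only the core module; the other namespaces are listed to show how each module is addressed independently of the core.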
OntoLex-Lemon is available on LOV, as is its predecessor lemon.47 Its modules are, however, listed separately:48 the core;49 lime;50 vartrans;51 synsem;52 and the decomp module.53 Three of its modules are available on archivo: the core,54 the lime metadata module55 and the Variation and Translation module.56 All of the OntoLex modules have their licenses (CC0 1.0) described with RDF triples using the CC vocabulary57 with a URI as an object. Version information is described using owl:versionInfo.
The OntoLex-Lemon Lexicography module58 (described in more detail in Section 4.1.1) was published separately from OntoLex-Lemon. It is not yet available on LOV; it is, however, available on archivo.59 The license (CC-Zero) is described with RDF triples using the CC vocabulary60 and DC61 with a URI as an object. Version information is described using owl:versionInfo.
42 http://wordnet-rdf.princeton.edu/about
43 http://www.w3.org/ns/lemon/synsem
44 http://www.w3.org/ns/lemon/decomp
45 http://www.w3.org/ns/lemon/vartrans
46 http://www.w3.org/ns/lemon/lime
47 https://lov.linkeddata.es/dataset/lov/vocabs/lemon
48 See the Appendix for a description of each, aside from lime, which is described in Section 4.3.3.
49 https://lov.linkeddata.es/dataset/lov/vocabs/ontolex
50 https://lov.linkeddata.es/dataset/lov/vocabs/lime
51 https://lov.linkeddata.es/dataset/lov/vocabs/vartrans
52 https://lov.linkeddata.es/dataset/lov/vocabs/synsem
53 https://lov.linkeddata.es/dataset/lov/vocabs/lexdcp
54 https://archivo.dbpedia.org/info?o=http://www.w3.org/ns/lemon/ontolex
55 http://archivo.dbpedia.org/info?o=http://www.w3.org/ns/lemon/lime
56 http://archivo.dbpedia.org/info?o=http://www.w3.org/ns/lemon/vartrans
57 Using the cc:license property.
58 The guidelines for the module can be found at https://www.w3.org/2019/09/lexicog/; the URL for the module is http://www.w3.org/ns/lemon/lexicog#
59 https://archivo.dbpedia.org/info?o=http://www.w3.org/ns/lemon/lexicog
60 Using the cc:license property.
61 Using dc:rights.
3.3. Vocabularies for Terminologies, Thesauri and Knowledge Bases
The Simple Knowledge Organisation System (SKOS) is a W3C recommendation for the creation of terminologies and thesauri, or more broadly speaking, knowledge organisation systems.62 We will not discuss it in any depth here since it is a general purpose vocabulary which is applied well beyond the domain of language resources.
In terms of specialised vocabularies or models for the modelling of linguistic knowledge bases, and aside from the linguistic data category registries which will be discussed in Section 3.5, we can list two. The first is the MMoOn ontology,63 which was designed for the creation of detailed morphological inventories [20]. It does not currently seem to be available on any semantic repositories/archives/search engines but it does have its own dedicated website64 which offers a SPARQL endpoint (although this was down at the time of writing). Its license information (it has a CC-BY 4.0 license) is available as triples using dct:license with a URI as an object.
PHOIBLE is an RDF model for creating phonological inventories [7]. At the time of writing, PHOIBLE data was no longer available as a complete RDF graph, but only in its native (XML) format from which RDF fragments are dynamically generated. The original data remains publicly available,65 but on the PHOIBLE website it is only possible to browse and export selected content into RDF/XML.66 Since it no longer provides resolvable URIs for its components, PHOIBLE data no longer fits within the narrower scope of LLD vocabularies. It does, however, maintain a non-standard way of linking, as it has been absorbed into the Cross-Linguistic Linked Data infrastructure [21, CLLD] (along with other resources from the typology domain). CLLD datasets and their RDF exports continue to be available as open data under https://clld.org/; see Section 3.6 below for additional details.
62 https://www.w3.org/2004/02/skos/
63 https://github.com/MMoOn-Project/MMoOn/blob/master/core.ttl
64 https://mmoon.org/
65 https://github.com/clld/phoible/tree/master/phoible/static/data
66 See, for example, https://phoible.org/inventories/view/161.
3.4. Linguistic Resource Metadata
Due to the importance of the topic, we give a much fuller overview in Section 4.3; here we will only look at accessibility issues for the two models for language resource metadata which we discuss there. These are the META-SHARE ontology67 and lime. The latter has been introduced above and is described in more detail in Section 4.3.3. The former is currently in its pre-release version 2.0 (the last update being 2020-03-20). Its license information (it has a CC-BY 4.0 license) is available as triples using dct:license with a URI as an object.
3.5. Linguistic Data Categories
History
As of 2010, two major repositories were in widespread use by different communities for addressing the harmonization and linking of linguistic resources via their data categories. In computational lexicography and language technology, the most widely applied terminology repository was ISOcat [22], which provided, via persistent URIs, human-readable and XML-encoded information about linguistic data categories that were relevant for linguistic annotation, the encoding of electronic dictionaries and language resource metadata. In the field of language documentation and typology, the General Ontology of Linguistic Description (GOLD) emerged in the early 2000s [23], having been originally developed in the context of the project Electronic Metastructure for Endangered Languages Data (E-MELD, 2002-2007).68 GOLD stood out in particular because of its excellent coverage of low resource languages. In the RELISH project, a curated mirror of GOLD-2010 was incorporated into ISOcat [24]. Unfortunately, since then, GOLD development has stalled and, while the resource is still being maintained by the LinguistList (along with the data from related projects) and still remains accessible,69 it has not been updated since [25] (and for this reason we have not included it in our summary table). In part, its function seems to have been taken over by ISOcat; it is worth pointing out here, however, that the ISOcat registry now exists only as a static, archived resource and no longer as an operational system.
67 http://www.meta-share.org/ontologies/meta-share/meta-share-ontology.owl/documentation/index-en.html
68 http://emeld.org/
69 https://linguistlist.org/projects/gold.cfm
The Current Situation
The ‘official’ successor of ISOcat, the CLARIN Concept Registry, is briefly discussed in Section 4.3 below (it is not, strictly speaking, a linked data vocabulary). Another of its successors is the LexInfo ontology,70 the data category register used in OntoLex-Lemon, which re-appropriates many of the concepts contained in ISOcat for use within the lexical domain (dictionaries, terminologies, lexica). Currently in its third version, LexInfo can be found both on the LOV search engine71 and on archivo;72 in both cases, however, it appears in its second version. Version 3.0 has been under development since late 2019 in a community-guided process via GitHub73 and is not yet registered with either service. LexInfo has a CC-BY 4.0 license. This is described with RDF triples using the CC vocabulary and DCT, with a URI as an object in both cases. Version information is described using owl:versionInfo.
For linguistic data categories in linguistic annotation (of corpora and by NLP tools), a separate terminology repository exists in the form of the Ontologies of Linguistic Annotation [26, OLiA].74 OLiA has been developed since 2005 in an effort to link community-maintained terminology repositories such as GOLD, ISOcat or the CLARIN Concept Registry with annotation schemes and domain- or community-specific models such as LexInfo or the Universal Dependencies specifications by means of an intermediate “Reference Model”. OLiA consists of a set of modular, interlinked ontologies and is designed as a native linked data resource. Its primary contributions are to provide machine-readable documentation of annotation guidelines and a linking with and among other terminology repositories. It has been suggested that such a collection of linking models, developed in an open source process via GitHub, may be capable of circumventing some of the pitfalls of earlier, monolithic solutions of the ISOcat era [27]. At the moment, OLiA covers annotation schemes for more than 100 languages, for morphosyntax, syntax, discourse and aspects of semantics and morphology. OLiA has a CC-BY 4.0 license; this is described using the Dublin Core property license with a URI as an object.
70 https://lexinfo.net/
71 https://lov.linkeddata.es/dataset/lov/vocabs/lexinfo
72 http://archivo.dbpedia.org/info?o=http://www.lexinfo.net/ontology/2.0/lexinfo
73 It will be the first version that is compliant with OntoLex-Lemon.
74 http://purl.org/olia
3.6. Vocabularies for Typological Datasets
Relevant Resources and Initiatives
Linguistic typology is commonly defined as the field of linguistics that studies and classifies languages based on their structural features [28]. The field has natural ties with language documentation and, accordingly, considerable work on linguistic typology and linked data has been conducted in the context of the GOLD ontology (see above, Section 3.5). We can identify the following relevant datasets.
One of the main contributors and advisors to the scientific study of typology is the Association for Linguistic Typology (ALT).75 It facilitates the description of the typological patterns underlying datasets. One of the most well-known resources that ALT makes available is the World Atlas of Language Structures (WALS)76 [29, 30], which is a large database of phonological, grammatical, and lexical properties of languages gathered from descriptive materials. This resource can be used interactively online and can also be downloaded. The CLLD77 (Cross-Linguistic Linked Data) project integrates WALS, thus offering a framework that structures this typological dataset using the Linked Data principles.
Another resource that provides web-based access to a large collection of typological datasets is the Typological Database System (TDS) [31, 32]. The main goals of TDS are to offer users a linguistic knowledge base and content metadata. The knowledge base includes a general ontology and dictionary of linguistic terminology, while the metadata describes the content of the term ontology databases. TDS supports unified querying across all the typological resources hosted with the help of an integrated ontology. The CLARIN Virtual Language Observatory (VLO)78 incorporates TDS among its repositories.
Finally, another group of datasets relevant for typological research includes large-scale collections of lexical data, as provided, for example, by PanLex79 and Starling.80 An early RDF edition of PanLex has been described by [33] and was incorporated in the initial version of the Linguistic Linked Open Data cloud diagram. At the time of writing, however, this early RDF
75 https://linguistic-typology.org/
76 https://wals.info/
77 https://clld.org/
78 https://vlo.clarin.eu/
79 http://panlex.org
80 https://starling.rinet.ru/
version does not seem to be accessible anymore. In-
stead, CSV and JSON dumps are being provided from
the PanLex website. On this basis [34] describe a fresh
OntoLex-Lemon edition