
Use and Evaluation of Wordnets as Lexicographical Resource



See presentation on
16 July 2018
wnlex workshop, ljubljana
wnlex2018 workshop:
wordnets as lexicographical resources
Motivation for this workshop
From dictionary to wordnet
The relation between mostly concept-based lexical-semantic networks (wordnets) and
lemma-based lexical resources (dictionaries) has been explored so far mainly for
wordnet-building purposes, and such projects and related issues are well documented.
From wordnet to dictionary
In spite of not being meant to serve lexicographical purposes, wordnets have become a
de facto standard for the drafting of dictionary content. Experiences and related issues
have just started to be systematically discussed.
Our Goal
A survey of solved and unsolved issues regarding wordnet-based lexicography
Data models and interoperability of lexical resources
Lexicographical processes, workflows
co-organized by:
wnlex speakers
Andrea Bellandi
Institute for Computational Linguistics «A. Zampolli», Pisa, Italy
Martin Benjamin
Kamusi Project (
John McCrae
National University of Ireland Galway / Ollscoil na hÉireann Gaillimh, Ireland
Darja Fišer
University of Ljubljana, Slovenia
Fahad Khan
Institute for Computational Linguistics «A. Zampolli», Pisa, Italy
David Lindemann
Universität Hildesheim, Germany
Maciej Piasecki
Wrocław University of Technology, Wrocław, Poland
wnlex registered participants
I am interested in the differences between human-oriented dictionaries and NLP-oriented lexical
I am working in the Elexis project and I am interested in defining extensions to the W3C OntoLex-Lemon
model. I am also a user of WN, and my main aim is to combine OntoLex and WN to allow
senses to be attributed to morphological variants of lemmas, when this is needed.
I would like to expand my knowledge in wordnets and learn how to incorporate the acquired
knowledge in my current work with dictionaries. I am particularly interested in the linked-data
qualities of wordnets and learning of ways to overcome the limited nature of lemma-based
Present my poster and discuss future research directions with invited speakers and other
I am interested in expanding my knowledge in Wordnets as lexical resources, and how to utilize
related methods in the process of compiling a dictionary.
Want to get some insights on lexicography and some basic knowledge, also regarding needs of
users and struggles.
Interested in wordnets as lexicographical resource as well as in data models for wordnet-like
concept-based resources.
Time Schedule
09:00 - 09:45 Use and Evaluation of Wordnets as Lexicographical Resources
- David Lindemann
09:45 - 10:30 Lexicographic Perspective on Wordnet Interoperability in CLARIN
- Darja Fišer and Maciej Piasecki
10:30 - 11:00 Coffee break
11:00 - 11:45 Representing WordNets with OntoLex and the Global Wordnet Formats
- John McCrae
11:45 - 12:30 Linking Lexicographic Resources: The Opportunities and Challenges Offered by the
Semantic Web
- Fahad Khan & Andrea Bellandi
12:30 - 13:45 Lunch Break
13:45 - 14:15 Poster Session - Presentation of accepted posters
14:15 - 15:00 Wordnet as a crowd source for untreated languages, concepts, and data elements
- Martin Benjamin
15:00 - 15:45 Corpus-based Wordnet Development and plWordNet as a Relational Semantic Dictionary
- Maciej Piasecki
15:45 - 16:00 Coffee break
16:00 - 17:00 Wrap-up & Discussion - All participants
Introductory Speech:
Use and Evaluation of Wordnets
as Lexicographical Resources
David Lindemann
University of Hildesheim
Introductory Speech: Overview
WordNet as lexicographical resource
Why WordNet?
A lexicographer's view on wordnet data
Language related issues, English bias
Glosses / Definitions
Lexical-semantic relations
Translation equivalents
Sense granularity
Data Models
Wordnet in lexicographical workflows
Open questions to work on
Why WordNet?
Princeton WordNet = WordNet for English (Miller, Fellbaum et al.)
Many other wordnets with links to Princeton WN items, cf. OMWN (Bond et al.)
De facto standard for multilingual lexicography from scratch
Data model suitable for cross-language links at sense level
High rate of coverage of English standard lemma lists / English conceptualisations
High precision due to high amounts of manual lexicographical work
Examples for multilingual e-dictionaries based on wordnet data
BabelNet (+Wikipedia, etc.) - (Navigli et al.)
Kamusi (+crowdsourcing) - (Benjamin)
If you have a wordnet, use it!
Example multilingual lexicography with Basque
Language resources for Basque: A somewhat paradoxical situation
Lack of bilingual dictionaries (beyond ES, FR, EN, RU)
Availability of quite a large and precise, hand-crafted WordNet based on PWN synsets, and
therefore aligned to a whole lot of languages
Research questions
Bilingual dictionary drafting using aligned wordnets - What does the lexicographer find?
Beyond synonymy and translation equivalence: What about all the other item types in a dictionary?
About the lemma: Phonetics, Morphology, Valency, Collocations,...
About the word sense: Other SemRels, Definitions, Example sentences
How can the Basque Wordnet benefit from wn-based lexicography?
Model for a bootstrapping loop
concept (synset) ID 02389559-n: „equus asinus“
Why WordNet?
Concept-based resource
links senses to senses
intra-language: lexical-semantic relations (hyponymy, meronymy, etc.)
cross-language: (different types of) conceptual equivalence
links senses to lexical items
intra-language: lexical-semantic relations (synonymy, antonymy)
cross-language: translation equivalence (as lexicalisations of equivalent concepts)
[Figure: Italian synset {asino, ciuco} and Spanish synset {asno, burro} linked to the same concept; further label: hoofed mammal. Source: Open Multilingual WordNet (Bond et al.)]
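The cross-language linking described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Open Multilingual WordNet API: the dictionary layout, the language codes and the function name `translation_candidates` are assumptions made for the example; only the synset ID and the four lemmas are taken from the slide.

```python
from itertools import product

# Toy data: synset ID -> lemmas per language (layout is an assumption).
SYNSETS = {
    "02389559-n": {"ita": ["asino", "ciuco"], "spa": ["asno", "burro"]},
}


def translation_candidates(synsets, source, target):
    """Pair every source lemma with every target lemma of the same synset."""
    pairs = []
    for synset_id, lemmas in synsets.items():
        for src, tgt in product(lemmas.get(source, []), lemmas.get(target, [])):
            pairs.append((src, tgt, synset_id))
    return pairs


candidates = translation_candidates(SYNSETS, "ita", "spa")
for src, tgt, synset_id in candidates:
    print(f"{src} -> {tgt} (via {synset_id})")
```

Every lemma pair that shares a synset becomes a translation equivalent candidate, which is why the result still needs human evaluation before it enters a dictionary entry.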
Lemma-oriented LR vs. Concept-oriented LR
Polysemy: 3 word senses
Concept-oriented LR
[Figure labels: Pferd, Ross, Springer; relations shown: synonymy, translation equivalence]
Why WordNet?
Links to other resources
by lemma-sign (string) or lempos-entity (lemma with POS)
Always possible, but loses homograph / word sense disambiguation
by lexical item
by lemma_senseNr string (Bank_1 vs. Bank_2): Princeton WN
by lexItemID: GermaNet data model
by sense
by senseID: Danish LR family
ILI: Open Multilingual WordNet
Global WordNet Grid / CILI (Bond, McCrae & Vossen 2016)
[Figure labels: Wordnet data model (concept x, concept y; unit 1_a, unit 2_a, unit 2_b; entity 1, entity 2); Linked Lexical Resources for Danish]
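The difference between the linking levels can be made concrete with a small sketch. The tuple layout, the concept IDs and the helper `index_by` are hypothetical and only serve the illustration; the `Bank_1` / `Bank_2` key format follows the slide.

```python
from collections import defaultdict

# Hypothetical lexical units: (lemma, POS, sense number, concept ID).
UNITS = [
    ("Bank", "n", 1, "concept_x"),
    ("Bank", "n", 2, "concept_y"),
]


def index_by(units, key):
    """Group lexical units under the link key produced by `key`."""
    index = defaultdict(list)
    for unit in units:
        index[key(unit)].append(unit)
    return dict(index)


by_lemma = index_by(UNITS, lambda u: u[0])               # lemma-sign
by_lempos = index_by(UNITS, lambda u: f"{u[0]}-{u[1]}")  # lempos-entity
by_unit = index_by(UNITS, lambda u: f"{u[0]}_{u[2]}")    # lemma_senseNr string

print(len(by_lemma["Bank"]), len(by_lempos["Bank-n"]), len(by_unit["Bank_1"]))
```

A link by lemma-sign or lempos-entity reaches both units at once, so the sense distinction is lost; only the key at lexical-unit level (or a sense ID) resolves to a single concept.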
Wordnets in Lexicography: Some drawbacks & pitfalls
English-biased conceptualisation of our world
English-biased data model
Glosses, Definitions
Lexical-semantic relations
Relation fuzziness
Translation equivalence
Relation fuzziness
Errors of translations from PWN
Sense granularity
WordNet as lexicographical resource: Language related issues
Language-related issues:
Princeton WordNet: A data model for English
Adaptation to features of other languages
Example: Aspect in Slavic languages like Slovene, Polish
An English-biased data model leads to verbs with different aspect/Aktionsart being represented as synonyms
Adaptation of data model: One-to-many correspondences between verbs, equivalence typology
WordNet building: Translate Princeton WN vs. new, independent WN data model
[Figure: verb items linked to Slavic verb_imp and Slavic verb_perf]
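One way to picture a one-to-many correspondence that keeps aspect apart is sketched below. The `VerbLink` class and the function name are hypothetical, and the Polish pair pisać / napisać ('write', imperfective / perfective) is an illustration chosen for this sketch, not an example from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbLink:
    source: str  # verb item on the English side
    target: str  # Slavic verb
    aspect: str  # "imp" (imperfective) or "perf" (perfective)


# Illustrative data, not taken from any wordnet.
LINKS = [
    VerbLink("write", "pisać", "imp"),
    VerbLink("write", "napisać", "perf"),
]


def targets_by_aspect(links, source):
    """Collect the targets of one source verb, grouped by aspect."""
    grouped = {}
    for link in links:
        if link.source == source:
            grouped.setdefault(link.aspect, []).append(link.target)
    return grouped


print(targets_by_aspect(LINKS, "write"))
```

With the aspect stored on the link, the two Slavic verbs are no longer forced into one synonym set just because they share an English counterpart.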
Use of wordnet data in lexicography: WN Glosses
Glosses, Definitions
Glosses: Hint for disambiguation for the human wordnet user
Just enough to be able to use as disambiguator
Definition in a language dictionary: Hint for the human dictionary user
Encyclopedic value as stand-alone text element
Bilingual dictionaries: Hints for word sense disambiguation in a foreign language
WN glosses as lexicographic definitions? cf. Benjamin 2016
WN lexical-semantic relations and Lexicography
horse #1
equine > odd-toed_ungulate > ungulate > placental > mammal > vertebrate > chordate > animal
horse #2
chessman > man > game_equipment
horse #3
gymnastic_apparatus > sports_equipment
Too fine-grained for a 1:1 use in a language dictionary?
More complete, more accurate than the information found in many dictionaries
(Pedersen et al. 2018: 103)
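The three hypernym chains above can be reproduced with a simple parent lookup. This is a sketch over a hand-typed dictionary containing only the nodes shown on the slide; the keys `horse#1` to `horse#3` and the function `hypernym_chain` are naming assumptions, and a real wordnet allows several hypernyms per synset.

```python
# Child -> parent, typed in from the chains on the slide.
HYPERNYM = {
    "horse#1": "equine",
    "equine": "odd-toed_ungulate",
    "odd-toed_ungulate": "ungulate",
    "ungulate": "placental",
    "placental": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
    "horse#2": "chessman",
    "chessman": "man",
    "man": "game_equipment",
    "horse#3": "gymnastic_apparatus",
    "gymnastic_apparatus": "sports_equipment",
}


def hypernym_chain(node, parents):
    """Follow parent links upwards until no further hypernym is recorded."""
    chain = []
    while node in parents:
        node = parents[node]
        chain.append(node)
    return chain


for sense in ("horse#1", "horse#2", "horse#3"):
    print(sense, ">", " > ".join(hypernym_chain(sense, HYPERNYM)))
```

A lexicographer who finds the full chain too fine-grained can cut it at any depth, for example by keeping only the first element as a genus term.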
WN lexical-semantic relations and Lexicography
glass #1 {glass, drinking glass}
a container for holding liquids while drinking
> container > instrumentality
glass #2 {glass, glassful}
the quantity a glass will hold
> containerful > indefinite_quantity
snow #1 {snow, snowfall}
precipitation falling from clouds in the form of icy crystals
> precipitation > weather > atmospheric_phenomenon
snow #2 {snow}
a layer of snowflakes (white crystals of frozen water) covering the ground
> layer > region > location > object
More complete, more accurate than the information found in many dictionaries
WN lexical-semantic relations and Lexicography
In wordnets: alternative lexicalisations for the same concept, interchangeable in a context
Quasi-synonymy is sometimes represented as a hyponymy relation, with a gloss concerning register
English {chalk, crank, glass, ice, methamphetamine, methamphetamine hydrochloride, Methedrine,
meth, deoxyephedrine, chicken feed, shabu, trash}
English {policeman, officer, police officer} a member of a police force
English {cop, bull, copper, pig, fuzz} uncomplimentary terms for a policeman
Danish {betjent, funktionær, ordenshåndhæver, panser, politibetjent, strisser, strømer, tjenestemand}
In Lexicography: always quasi-synonymy (register, sociolect, dialect… pragmatics)
Thesauri (e.g. Lexical items bear usage labels
Translation equivalence: Cross-language linking of items
Interlingual indices
Open Multilingual WordNet (OMWN, cf. Bond & Foster 2013)
PWN synsets as pivot sense grid
Global WordNet Grid (Vossen, Bond & McCrae 2016)
English-independent sense repository
Bilingual Dictionary Drafting using OMWN
Quantitative evaluation using source language lemma list as standard
Qualitative evaluation by human annotators: Adequacy as translation equivalent candidate
Do I want this candidate as it is to appear in my dictionary entry as an equivalent? OK
Is it an acceptable equivalent, but does it need some manual editing? FUZZY
Is this noise / an inadequately matched equivalent pair? FALSE
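The two evaluations can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the sample labels are invented, and whether FUZZY candidates count towards precision is left as a parameter because the slides do not say how they were counted.

```python
from collections import Counter


def precision(labels, count_fuzzy=False):
    """Share of candidates judged OK (optionally also FUZZY) by annotators."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    accepted = counts["OK"] + (counts["FUZZY"] if count_fuzzy else 0)
    return accepted / total


def recall(covered_lemmas, lemma_list):
    """Share of the source-language lemma list that received a candidate."""
    lemma_set = set(lemma_list)
    if not lemma_set:
        return 0.0
    return len(lemma_set & set(covered_lemmas)) / len(lemma_set)


# Invented annotation sample, for illustration only.
sample = ["OK", "OK", "FUZZY", "FALSE"]
print(precision(sample), precision(sample, count_fuzzy=True))
print(recall(["a", "b", "x"], ["a", "b", "c", "d"]))
```

Reporting both the strict and the lenient precision makes it visible how much manual editing the FUZZY candidates would still require.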
Translation equivalents extracted from WN: evaluation
Lindemann et al. 2014: German-Basque
GermaNet v8 – BasqueWN v3: 21% recall, 83% precision
Lindemann & Kliche 2017: Basque-English
BasqueWN v3 – Princeton WN v3: 31% recall, 89% precision
Set of student assessments, BA course in computational lexicography, Hildesheim 2017
English – WOLF (FrenchWN): 58% precision
English – WONEF (FrenchWN): 74% precision
English – GermaNet v8: 87% precision
[English – BabelNet v3.7 German: 61% precision]
Fuzzy equivalency (interlingual quasi-synonymy)
More fine-grained evaluation of wordnet as multilingual lexicographical resource
List of criteria for being represented in a more advanced wordnet data model
3 typologies of translation equivalence
Maks 2007: OMBI project (reverting bilingual dictionaries)
Adamska-Sałaciak 2010: Typology of interlingual equivalence
Rudnicka 2017: Features of a „super strong“ interlingual equivalence
Example pair: horse / nag
OMBI (Maks 2007)
Contrasts in conceptual equivalence
Near Equivalent
Pragmatic Contrasts
Formal vs. neutral
Old-fashioned vs. neutral
Variant status
Preferred synonym vs. term variant
Contrasts in degree of lexicalisation
(established lexical unit vs. explanatory equivalent)
Fully lexicalised
Translation Equivalence (Adamska-Sałaciak 2010)
Type C: Cognitive (a.k.a. semantic, systemic, prototypical, conceptual, decontextualised, notional)
Has to be an established LU of TL > not always possible to provide
Type E: Explanatory (a.k.a. descriptive)
Always possible to provide
Type F: Functional (a.k.a. situational, communicative, discourse, dynamic)
Adequate translation in context, without word-level correspondence
Type T: Translational (a.k.a. insertable, textual, contextual)
Adequate translation in context, word-level correspondence
Rudnicka et al. 2017 and implications
Super-strong equivalence:
i. identity in grammatical category (given from the synset mapping)
ii. identity in number
iii. identity in sense (synset (and lexical unit) relation structure and gloss)
iv. identity in register
v. identity in countability
vi. compatibility in (semantic) gender (if relevant/applicable)
vii. ‘first choice’ equivalent: listed first in bilingual dictionaries
viii. bidirectional
ix. high translation probability if it appears in a parallel corpus
x. unique for a single lexical unit
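The ten criteria can be read as a checklist in which every item has to hold. The sketch below encodes that reading; the class, the field names and the sample pair are hypothetical, and how each criterion is actually decided (dictionaries, parallel corpora, annotators) is outside the sketch.

```python
from dataclasses import dataclass, fields


@dataclass
class EquivalenceFeatures:
    same_grammatical_category: bool     # i
    same_number: bool                   # ii
    same_sense: bool                    # iii
    same_register: bool                 # iv
    same_countability: bool             # v
    compatible_gender: bool             # vi
    first_choice: bool                  # vii
    bidirectional: bool                 # viii
    high_translation_probability: bool  # ix
    unique_for_lexical_unit: bool       # x


def failed_criteria(features):
    """Names of the criteria that do not hold for this pair."""
    return [f.name for f in fields(features) if not getattr(features, f.name)]


def is_super_strong(features):
    """True only if all ten criteria hold."""
    return not failed_criteria(features)


# Invented pair that differs in register only.
pair = EquivalenceFeatures(True, True, True, False, True, True, True, True, True, True)
print(is_super_strong(pair), failed_criteria(pair))
```

Keeping the list of failed criteria, rather than a single yes/no value, is what a more advanced wordnet data model would need in order to type weaker kinds of equivalence.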
Sense granularity
WN sense clustering (creation of coarse senses): Several approaches
Surveys: Peters, Peters & Vossen 1998; Agirre & Lopez de Lacalle 2003 [senseval-2]
The “autohyponymy“ problem (Pociello, Agirre & Aldezabal 2011)
Princeton WN 3.0
Basque WN 3.0
WordNet sense clustering: Translation similarity
Candidates for [...] according to semantic distance calculated from [...]
Resnik & Yarowsky; Chugur, Gonzalo & Verdejo 2002
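A very simple reading of translation similarity is: two senses of a word are close if many languages translate them with the same word. The sketch below implements only this reading and is not the measure defined in the cited papers; the function name and the data layout are assumptions, and the Spanish values are added for illustration (the German ones follow the Pferd / Springer slide).

```python
def translation_proximity(translations_a, translations_b):
    """Share of shared languages in which both senses get the same translation."""
    shared = set(translations_a) & set(translations_b)
    if not shared:
        return 0.0
    same = sum(1 for lang in shared if translations_a[lang] == translations_b[lang])
    return same / len(shared)


# Sense -> translation per language (illustrative values).
HORSE_ANIMAL = {"deu": "Pferd", "spa": "caballo"}
HORSE_CHESS = {"deu": "Springer", "spa": "caballo"}

print(translation_proximity(HORSE_ANIMAL, HORSE_CHESS))
```

Sense pairs with a high score would be candidates for merging into a coarser sense; pairs that most languages keep apart would stay separate.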
Basque Lexical Resources / A model for BDD
Basque landscape of lexical resources
Lexicography: Basque Language Academy
Lexicography: Basque Language Institute @ EHU
NLP: IXA CL-group @ EHU
Lexicography / NLP: Elhuyar
Scarcity of bilingual dictionaries
only ES, FR, EN, RU meet state of the art
State of the art NLP lexical resources
parameter files for spell checkers, taggers, RBMT engines
Basque WordNet, part of MCR
built by the ‘expand’ method
fully aligned to PWN 3.0
Bilingual Dictionary Drafting (BDD)
Starting point: merged NLP lexicon and Wordnet (Lindemann & San Vicente 2016)
Basque Lexical Resources
Corpus-based frequency lemma list for Basque
Lemmata extracted from ETC (Sarasola, Salaburu & Landa 2013),
and Elh200 (Leturia 2014)
Comparison to 6 reference resources: 4 Dictionaries, Basque WN, 1 NLP lexicon
► Lindemann & San Vicente (2015)
Basque Dictionary Draft: (1) Homograph Level
Basic list of lemma-signs: 57,000
20+ occurrences in 200M-corpus and in 1+ reference resource
Frequency data from Elh200 corpus
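The selection rule (20+ corpus occurrences and presence in 1+ reference resource) can be sketched as a filter. The function name, the parameter names and all frequencies below are invented for the illustration; only the two thresholds come from the slide.

```python
def select_lemma_signs(corpus_freq, reference_resources, min_freq=20, min_resources=1):
    """Keep lemma-signs that are frequent enough and attested in reference resources."""
    selected = []
    for lemma, freq in corpus_freq.items():
        attested = sum(1 for resource in reference_resources if lemma in resource)
        if freq >= min_freq and attested >= min_resources:
            selected.append(lemma)
    return sorted(selected)


# Invented toy data: corpus frequencies and two reference headword lists.
corpus_freq = {"etxe": 5000, "adar": 300, "xyz": 25, "rare": 3}
reference_resources = [{"etxe", "adar"}, {"etxe", "rare"}]

print(select_lemma_signs(corpus_freq, reference_resources))
```

The frequency threshold filters out rare forms, while the reference-resource condition filters out frequent corpus noise that no dictionary or lexicon records.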
(1) Homograph, (2) Syntactical Entity
Syntactical Entities (lempos-entities) from Elh200 corpus
Corpus pos-tagged with EusTagger, based on EDBL data
Frequency data for each lempos-entity
interesting for lexicographer
interesting for dictionary user
(1) Homograph, (2) Syntactical Entity, (3) Sense
Word senses from EusWN (Basque WordNet)
Linking of senses to syntactical entities (as child elements)
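The three levels (homograph, syntactical entity, sense) nest naturally as parent and child elements. The sketch below shows one possible shape; the class names, field names, the frequency and the synset IDs are hypothetical and do not reproduce the actual dictionary draft format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sense:
    synset_id: str  # link to a wordnet synset


@dataclass
class SyntacticalEntity:
    pos: str
    frequency: int  # corpus frequency of the lempos-entity
    senses: List[Sense] = field(default_factory=list)


@dataclass
class Homograph:
    lemma_sign: str
    entities: List[SyntacticalEntity] = field(default_factory=list)


def sense_count(homograph):
    """Total number of senses below one homograph."""
    return sum(len(entity.senses) for entity in homograph.entities)


# Invented example entry.
adar = Homograph("adar", [SyntacticalEntity("noun", 1200, [Sense("syn-a"), Sense("syn-b")])])
print(adar.lemma_sign, sense_count(adar))
```

A syntactical entity with an empty `senses` list is exactly the kind of gap that the draft makes visible to the lexicographer.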
Drafted Basque dictionary content
POS | Corpus-based SE | SE with one or more word senses | Word senses | Word senses per SE | SE present in corpus but not in EusWN | SE present in EusWN but not found in corpus
Verbs | 4,151 | 1,636 | 6,567 | 4.01 | 2,515 | 279
Nouns | 23,921 | 15,193 | 30,613 | 2.01 | 8,728 | 3,479
Proper Nouns | 2,443 | 132 | 153 | 1.16 | 2,311 | 60
Adjectives | 6,147 | 50 | 141 | 2.82 | 6,097 | 8
Adverbs | 1,556 | 0 | 0 | 0.00 | 1,556 | 0
Total | 38,218 | 17,011 | 37,474 | 2.20 | 21,207 | 3,826
Dictionary Draft SE Gap Detection: semi-automatic
Blank SE (present in EDBL, not in EusWN):
Find corresponding synset in Princeton WordNet, copy ID
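The automatic half of this step amounts to a set difference between the two resources. The sketch below assumes that syntactical entities are represented as (lemma, POS) pairs; the function name and the sample entries are invented for the illustration.

```python
def blank_entities(edbl_entities, euswn_entities):
    """Syntactical entities present in EDBL but without any EusWN sense."""
    return sorted(set(edbl_entities) - set(euswn_entities))


# Invented toy data: (lemma, POS) pairs.
edbl = {("adar", "n"), ("etxe", "n"), ("azkar", "adv")}
euswn = {("adar", "n"), ("etxe", "n")}

for lemma, pos in blank_entities(edbl, euswn):
    print(f"blank SE: {lemma} ({pos})")
```

Finding the corresponding Princeton WordNet synset for each blank entity and copying its ID remains the manual half of the workflow.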
Dictionary Draft Sense Gap Detection: Manual work!
Sense | Definition EN | EusWN 3.0 synset | EN synset | CAT synset
adar_1 | one of the bony outgrowths on the heads of certain ungulates | adar_1 | horn_2 | banya_1
adar_2 | a railway line connected to a trunk line | adar_2 | branch_line_1 |
adar_3 | a warning signal that is a loud wailing [...] | adar_3, sirena_2 | |
adar_4 | a local branch of some fraternity or [...] | adar_4 | chapter_3 | capítol_2
adar_5 | a division of a stem, or secondary stem arising from the main stem of a plant | adar_5, abar_2 | branch_2 | branca_1
adar_6 | an alarm device that makes a loud warning sound | sirena_4, adar_6 | |
adar_7 | a device used for easing the foot into a [...] | zapata_sartzeko_1 | shoehorn_1 | calçador_1
Manual postediting of WordNet-based dictionary drafts
Crowdsourcing (as in
User is prompted to fill lexical gaps in their language's WN (which is aligned to other WNs)
Language community empowerment (
Alignment at concept (word sense) level from the very beginning
Concepts new to the multilingual WN: Global WN Grid
Lexicographical workflows for a bootstrapping loop
Manual editing of bilingual dictionary drafts
Reuse of hand-validated data for
upgrading the original resources
Planned project: New series of bilingual
dictionaries with Basque
Image Source: Wikimedia Commons
'Bootstrapping Loop'
Every single bit of manual work,
every gap that is filled,
every sense that is split,
every link that is set,
every error that is found,
shall make it possible to upgrade both
EDBL and Basque WordNet.
Application example:
WordNet/BabelNet bootstrapping for EUS-SLO
Basque (EUS) – Slovene (SLO): A totally uncovered pair of 'smaller' languages
Quantitative Evaluation
Recall: Synsets that contain 1+ Basque standard headword and 1+ Slovene item
EusWN / SloWNet 20% (66% of 30%)
BabelNet 31% (78% of 40%)
Recall on 5,000 most frequent Basque headwords (BabelNet): 74% (3,707)
Recall on 20,000 most frequent Basque headwords (BabelNet): 53% (10,549)
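The recall definition used here (synsets that contain 1+ Basque standard headword and 1+ Slovene item) can be sketched as follows. The function name, the language codes, the synset IDs and the toy lemmas are assumptions for the illustration and are not taken from EusWN, sloWNet or BabelNet.

```python
def pair_recall(headwords, synsets, source, target):
    """Share of headwords found in 1+ synset that also holds 1+ target item."""
    headword_set = set(headwords)
    if not headword_set:
        return 0.0
    covered = set()
    for lemmas in synsets.values():
        if lemmas.get(target):
            covered.update(headword_set.intersection(lemmas.get(source, [])))
    return len(covered) / len(headword_set)


# Invented toy data.
HEADWORDS = ["asto", "zaldi", "adar", "etxe"]
ALIGNED = {
    "s1": {"eus": ["asto"], "slv": ["osel"]},
    "s2": {"eus": ["zaldi"], "slv": []},
    "s3": {"eus": ["adar"], "slv": ["rog"]},
}

print(pair_recall(HEADWORDS, ALIGNED, "eus", "slv"))
```

The absolute counts on the slide are consistent with the percentages given: 3,707 / 5,000 is about 74% and 10,549 / 20,000 is about 53%.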
Qualitative Evaluation
Precision: Unknown. EN-SL precision to be measured first.
Basque Lexical Resources today
[Figure labels: Basque Wordnet, Basque SemCor, lemma sign, lexical unit, entity, Existing]
lemma sign: lemma string without POS and sense disambiguation
lempos-entity: lemma-sign with POS, all word senses
lexical unit: lemma-sign with POS and unique word sense
Scenario: Basque Lexical Resources
[Figure labels: Basque Wordnet, Basque SemCor, New Series of bilingual dictionaries with Basque, lexical unit, entity, Existing]
lempos-entity: lemma-sign with POS, all word senses
lexical unit: lemma-sign with POS and unique word sense
Data modeling
Three entities to link item types to: lempos entity, lexical unit, concept
Treatment of Homonymy
[Figure: Princeton WordNet 3.1, noun “tear” vs. Cambridge Learner's Dictionary, noun “tear”]
Workflow proposal for Basque
Bilingual Dictionary Draft for Basque-English including sense-to-sense mappings
Encouraging recall and precision rates; can be applied to other language pairs
Preliminaries for a research project
Bilingual Dictionary Drafts for many uncovered language pairs
Data model that allows
Manual and semi-automated (bulk) editing
Edition of e-dictionaries including more item types
Retro-updating of original resources:
'Bootstrapping Loop'
Engagement of lexicographers for editing 'their' language pair
Edition of a new series of bilingual dictionaries with Basque
Summary: Some open questions
Multilingual WordNet: Data modeling
Types of translation equivalence
Representation of relations between synsets / between lexical units
Inclusion of / linking to more lexicographic item types
Homonymy vs. Polysemy
Interoperability with existing standards
Linking of lexical resources of different shape
lemma-based resources, lemma-based links
concept-based resources, concept-based links
Evaluation of automatically built resources
Definition of lexicographic workflows
Hand-crafted edits / upgrades of wordnet-dictionaries
Tutorials / best practice guidelines
Thank you for your attention
Please find the bibliography at: