All content in this area was uploaded by Emil Svoboda on Aug 16, 2022
Determining the origins of words using machine learning
A computational tool for linguists, proudly presented by
Emil Svoboda
under the guidance of Prof. Martin Haspelmath.
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University
●Title
●Bio
●Introduction
○Parent retrieval
○Word classification
○Tree model
●Data sources
○DeriNet
○Universal Derivations
○Neoclassical compounding
●PaReNT
○Principle
○Performance
○Application
●Acknowledgements
●References
Table of contents
Introduction
●A speaker of a language L can (usually) find the origin of a word in L:
○determination ← determine (eng)
○dodělat ← dělat (ces)
○Ärztin ← Arzt (ger)
○perrito ← perro (esp)
○pindakaas ← pinda, kaas (dut)
○ бурелом ← буря, ломать (rus)
○étudiante ← étudiant (fra)
●We call the origin word(s) on the right parent(s):
○Given a word, the PaReNT tool must find its parents.
○We call this task parent retrieval.
Parent retrieval
●We can define three broad classes of words:
1. Unmotivated words (0 parents):
a. casa ← Ø (esp)
b. lebka ← Ø (ces)
c. grün ← Ø (ger)
2. Derivatives (1 parent):
a. dodělat ← dělat (ces)
b. annehmen ← nehmen (ger)
c. muchachito ← muchacho (esp)
3. Compounds (2+ parents):
a. portefeuille ← porte, feuille (fra)
b. achteruitkijkspiegel ← achteruit, kijken, spiegel (dut)
●PaReNT must classify any given word into one of these. We call this word classification.
Word classification
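The three classes above are defined by the number of parents, which can be sketched as follows. This is only an illustration of the definition for words whose parents are already known; it is not how PaReNT itself classifies a word, and the function name `classify_by_parents` is a hypothetical one chosen for this sketch.

```python
from typing import Sequence


def classify_by_parents(parents: Sequence[str]) -> str:
    """Map a list of parents to one of the three classes from the slide."""
    if len(parents) == 0:
        return "unmotivated"
    if len(parents) == 1:
        return "derivative"
    return "compound"  # two or more parents


# Examples taken from the slide above (word -> known parents).
EXAMPLES = {
    "casa": [],
    "dodělat": ["dělat"],
    "achteruitkijkspiegel": ["achteruit", "kijken", "spiegel"],
}

labels = {word: classify_by_parents(parents) for word, parents in EXAMPLES.items()}
for word, label in labels.items():
    print(f"{word}: {label}")
```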
●Derivation and compounding may occur several times over, forming chains:
○literatura → literární
○teorie → teoretický
○literární + teoretický → literárněteoretický (ces)
●Assuming such chains always begin with an unmotivated word, they can be naturally combined into trees:
Tree model
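The chains above can be combined into a tree by following parent links until unmotivated words are reached. The sketch below takes one possible reading, an ancestry tree rooted at the analysed word with unmotivated words as leaves; the original figure is not available here, so the orientation of the tree, the nested-tuple representation, and the helper names are assumptions.

```python
# Parent relations from the slide: each word maps to its list of parents.
# Unmotivated words have an empty list.
PARENTS = {
    "literatura": [],
    "teorie": [],
    "literární": ["literatura"],
    "teoretický": ["teorie"],
    "literárněteoretický": ["literární", "teoretický"],
}


def build_tree(word, parents):
    """Return a nested (word, [subtrees]) tuple; recursion stops at unmotivated words."""
    return (word, [build_tree(parent, parents) for parent in parents[word]])


def render(tree, depth=0):
    """Return the tree as a list of indented lines, one word per line."""
    word, children = tree
    lines = ["  " * depth + word]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines


tree = build_tree("literárněteoretický", PARENTS)
print("\n".join(render(tree)))
```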
Data
●A collection of trees describing Czech word-formation families
●Used to only contain derivation; now also contains compounding
●DeriNet 2.1 offers:
○Over 1 million lexemes
○782 thousand derivational relations
○1,952 compounding relations
○202 affixoids
DeriNet
●Extension of DeriNet into other languages
●Contains 31 data resources covering 21 languages
Universal Derivations (UDer)
●How do we analyze “biotechnology”?
○derivative?
■“bio” is too lexical for an affix
■“bio” + “log(-y)” → biology
○compound?
■there’s no isolated “bio”
●Affixoids: -psych-, -log-, -tri-, -metr-, -pseud-...
○Don’t occur by themselves, but can serve as the base of
■derivation: -psych- → psychosis
■compounding: -psych- + -metr- → psychometry
■conversion: -psych- → psycho
●Affixoids are hypothesized to be shared across the languages of UDer
Neoclassical compounding
PaReNT (Parent Retrieval Neural Tool)
●What is/are the parent(s) of the word “šálkožrout”?
●You can try and find its tree in DeriNet to find out…
○…but you won’t, because it’s a nonce word.
○Yet it is transparent for a native speaker.
○If you don’t speak Czech, you don’t even know if it’s a compound.
●We can, however, train a neural network using:
○DeriNet (ces)
○Universal Derivations (ces, ger, eng, dut, rus, fra, esp)
○Wiktionary (ces, ger, eng, dut, rus, fra, esp)
○GermaNet (ger)
○CELEX (ger, eng)
Principle
Principle
literárněteoretický → literární, teoretický
teoretický → teorie
literární → literatura
teorie → teorie
…
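The pairs above suggest that each word is paired with its parent(s), and that an unmotivated word is paired with itself. A minimal sketch of producing such pairs from parent relations follows. The tab-separated output, the space-joined target, and the names `to_training_pairs` and `decode_target` are assumptions made for illustration; the slides do not specify the actual data format used to train PaReNT.

```python
# Parent relations for the words shown on the slide (word -> list of parents).
PARENTS = {
    "literárněteoretický": ["literární", "teoretický"],
    "teoretický": ["teorie"],
    "literární": ["literatura"],
    "teorie": [],
}


def to_training_pairs(parents):
    """Turn word -> parents relations into (input, target) string pairs."""
    pairs = []
    for word, word_parents in parents.items():
        # Assumption: an unmotivated word is paired with itself ("teorie teorie").
        target = word_parents if word_parents else [word]
        pairs.append((word, " ".join(target)))
    return pairs


def decode_target(word, target):
    """Invert the encoding: a target equal to the word itself means no parents."""
    predicted = target.split()
    return [] if predicted == [word] else predicted


pairs = to_training_pairs(PARENTS)
for source, target in pairs:
    print(f"{source}\t{target}")
```

Joining parents with a space only works if no parent contains a space itself, which holds for the single-word examples shown here.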
●What is/are the parent(s) of the word “šálkožravec”?
●You can try and find its tree in DeriNet…
○…but you won’t, because it’s a nonce word.
○Yet it is transparent for a native speaker.
○If you don’t speak Czech, you don’t even know if it’s a compound.
●We can, however, train a neural network using:
○DeriNet (ces)
○Universal Derivations (ces, ger, eng, dut, rus, fra, esp)
○Wiktionary (ces, ger, eng, dut, rus, fra, esp)
○GermaNet (ger)
○CELEX (ger, eng, dut)
●The result is a model that can retrieve the parent(s) of unseen words
Principle
Parent retrieval
fallback → fall, back
development → develop
Arzt → Arzt
Performance
Word classification
fallback → “compound”
development → “derivative”
Arzt → “unmotivated”
Language  Accuracy
Czech     72%
German    64%
English   69%
Spanish   59%
French    54%
Dutch     73%
Russian   72%
Total     66%

Language  Accuracy
Czech     65%
German    87%
English   83%
Spanish   61%
French    61%
Dutch     91%
Russian   69%
Total     74%
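The "Total" rows in both tables are consistent with an unweighted mean over the seven languages, which the short check below reproduces. The slides do not state how the totals were computed, so the averaging scheme is an inference from the numbers, and the variable names are chosen only for this sketch.

```python
# Per-language accuracies (in percent) copied from the two tables above,
# in the order in which the tables appear.
first_table = {
    "Czech": 72, "German": 64, "English": 69, "Spanish": 59,
    "French": 54, "Dutch": 73, "Russian": 72,
}
second_table = {
    "Czech": 65, "German": 87, "English": 83, "Spanish": 61,
    "French": 61, "Dutch": 91, "Russian": 69,
}


def unweighted_mean(accuracies):
    """Average the per-language accuracies, giving each language equal weight."""
    return sum(accuracies.values()) / len(accuracies)


first_total = round(unweighted_mean(first_table))
second_total = round(unweighted_mean(second_table))
print(first_total, second_total)  # 66 74
```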
How similar are languages with respect to naming conventions?
1. We gathered a list of 1800 Czech expressions for female entities:
1.1. E.g.: waitress, girl, cleaning lady…
2. We translated them into:
2.1. German, English, Spanish, French, Dutch, Russian
3. We classified the resulting expressions into one of the following strategies:
3.1. unmarked, phrase, compound, derivative, unmotivated
4. We calculated how often each language used each of the strategies.
5. We clustered the languages accordingly.
Application
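Step 4 above amounts to turning each language's list of strategy labels into a relative-frequency distribution over the five strategies. A minimal sketch follows; the toy labels are invented for illustration and are not data from the study, and the function name `strategy_distribution` is hypothetical.

```python
from collections import Counter

# The five strategies named on the slide, in a fixed order.
STRATEGIES = ("unmarked", "phrase", "compound", "derivative", "unmotivated")


def strategy_distribution(labels):
    """Return the relative frequency of each strategy, ordered as in STRATEGIES."""
    if not labels:
        raise ValueError("at least one label is required")
    unknown = set(labels) - set(STRATEGIES)
    if unknown:
        raise ValueError(f"unknown strategies: {sorted(unknown)}")
    counts = Counter(labels)
    return [counts[strategy] / len(labels) for strategy in STRATEGIES]


# Invented toy labels (NOT results from the study), four expressions per language.
toy_labels = {
    "German": ["derivative", "derivative", "compound", "unmotivated"],
    "English": ["unmarked", "unmarked", "compound", "phrase"],
}

distributions = {lang: strategy_distribution(labels) for lang, labels in toy_labels.items()}
for lang, dist in distributions.items():
    print(lang, dist)
```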
Results:
[Figure: clustering of the languages; panels: Jensen-Shannon divergence, Euclidean distance]
Application
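The two measures named in the results can be computed between any pair of strategy distributions as sketched below. The example distributions are invented to make the arithmetic easy to follow and are not results from the study; the base-2 logarithm is also an assumption, since the slides do not say which base was used.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q from their midpoint."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


def euclidean_distance(p, q):
    """Straight-line distance between two distributions viewed as vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Invented distributions over (unmarked, phrase, compound, derivative, unmotivated).
lang_a = [0.5, 0.5, 0.0, 0.0, 0.0]
lang_b = [0.0, 0.0, 0.0, 0.5, 0.5]

print(js_divergence(lang_a, lang_b))       # disjoint support gives the maximum, 1 bit
print(euclidean_distance(lang_a, lang_b))
```

Computing one of these measures for every pair of languages yields a distance matrix, which a standard clustering routine can then take as input. Note that SciPy's `scipy.spatial.distance.jensenshannon` returns the square root of the divergence (the Jensen-Shannon distance), so its values are not directly comparable with the function above.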
This work was supported by the Grant No. START/HUM/010
of Grant schemes at Charles University (reg. No.
CZ.02.2.69/0.0/0.0/19_073/0016935). It used language
resources developed, stored, and distributed by the
LINDAT/CLARIAH-CZ project.
Acknowledgements
Thank you for your attention!
●Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast Neural Machine Translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia. Association for Computational Linguistics.
●Lukáš Kyjánek, Zdeněk Žabokrtský, Magda Ševčíková, Jonáš Vidra. Universal Derivations Kickoff: A Collection of Eleven Harmonized Derivational Resources for Eleven Languages. In Proceedings of the 2nd Workshop on Resources and Tools for Derivational Morphology. Prague: Charles University. ISBN: 978-80-88132-08-0. 2022.
●Lukáš Kyjánek, Emil Svoboda, Jan Bodnár, Jonáš Vidra, Magda Ševčíková, Zdeněk Žabokrtský. Modelling Macro-Level Competition in Word Formation: A Case Study on Seven Languages. Submitted to COLING 2022.
●Emil Svoboda & Magda Ševčíková. Splitting and Identifying Czech Compounds: A Pilot Study. In Proceedings of the Third Workshop on Resources and Tools for Derivational Morphology (DeriMo 2021). France, 2021, pp. 125–134.
●Emil Svoboda & Magda Ševčíková. Word Formation Analyzer for Czech: Automatic Parent Retrieval and Classification of Word Formation Processes. The Prague Bulletin of Mathematical Linguistics. Prague, April 2022, 118, pp. 55–73.
●Jonáš Vidra, Zdeněk Žabokrtský, Magda Ševčíková, Lukáš Kyjánek. 2019. DeriNet 2.0: Towards an All-in-One Word-Formation Resource. In Proceedings of the 2nd Workshop on Resources and Tools for Derivational Morphology. Prague: Charles University. ISBN: 978-80-88132-08-0.
References