
PaReNT: Determining the origins of words using machine learning

A computational tool for linguists, proudly presented by
Emil Svoboda
under the guidance of Prof. Martin Haspelmath.
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University
Table of contents
Title
Bio
Introduction
  Parent retrieval
  Word classification
  Tree model
Data sources
  DeriNet
  Universal Derivations
  Neoclassical compounding
PaReNT
  Principle
  Performance
  Application
Acknowledgements
References
Introduction
Parent retrieval
A speaker of a language L can (usually) find the origin of a word in L:
determination → determine (eng)
dodělat → dělat (ces)
Ärztin → Arzt (ger)
perrito → perro (esp)
pindakaas → pinda, kaas (dut)
бурелом → буря, ломать (rus)
étudiante → étudiant (fra)
We call the origin word(s) on the right the parent(s).
Given a word, the PaReNT tool must find its parents. We call this task parent retrieval.
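As a rough picture of the task's input and output, parent retrieval can be read as a function from a word to a list of parents. The sketch below is only a lookup table filled with the examples from this slide; the name `retrieve_parents` is hypothetical, and PaReNT itself is a trained model, not a dictionary.

```python
# Toy illustration of the parent retrieval task (not the PaReNT model).
# The table holds only the example words shown on the slide.
KNOWN_PARENTS = {
    "determination": ["determine"],      # eng
    "dodělat": ["dělat"],                # ces
    "Ärztin": ["Arzt"],                  # ger
    "perrito": ["perro"],                # esp
    "pindakaas": ["pinda", "kaas"],      # dut
    "бурелом": ["буря", "ломать"],       # rus
    "étudiante": ["étudiant"],           # fra
}


def retrieve_parents(word):
    """Return the listed parents of `word`, or None if the word is unknown."""
    return KNOWN_PARENTS.get(word)


print(retrieve_parents("pindakaas"))    # ['pinda', 'kaas']
print(retrieve_parents("šálkožravec"))  # None
```

A lookup of this kind fails on any word missing from the table, which is the gap a learned model is meant to close.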
Word classification
We can define three broad classes of words:
1. Unmotivated words (0 parents):
a. casa → Ø (esp)
b. lebka → Ø (ces)
c. grün → Ø (ger)
2. Derivatives (1 parent):
a. dodělat → dělat (ces)
b. annehmen → nehmen (ger)
c. muchachito → muchacho (esp)
3. Compounds (2+ parents):
a. portefeuille → porte, feuille (fra)
b. achteruitkijkspiegel → achteruit, kijken, spiegel (dut)
PaReNT must classify any given word into one of these classes. We call this task word classification.
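The three classes are defined purely by the number of parents, which can be written down directly. This is a sketch of the definition only: the function name is hypothetical, and it assumes the parents are already known, whereas PaReNT has to predict the class from the word alone.

```python
def classify_by_parent_count(parents):
    """Map a list of parents to one of the three classes on the slide."""
    if len(parents) == 0:
        return "unmotivated"
    if len(parents) == 1:
        return "derivative"
    return "compound"


# One example per class, taken from the slide.
examples = {
    "casa": [],
    "annehmen": ["nehmen"],
    "achteruitkijkspiegel": ["achteruit", "kijken", "spiegel"],
}
for word, parents in examples.items():
    print(word, classify_by_parent_count(parents))
```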
Tree model
Derivation and compounding may occur several times over, forming chains:
literatura → literární
teorie → teoretický
literární + teoretický → literárněteoretický (ces)
Assuming such chains always begin with an unmotivated word, they can be naturally combined into trees.
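The chain example can be stored as child-to-parent links and walked upward until unmotivated words are reached. The following is a minimal sketch under that assumption; the `PARENTS` table and the `roots` function are illustrative names, not part of DeriNet or PaReNT.

```python
# Child -> parents links taken from the Czech example on the slide.
PARENTS = {
    "literární": ["literatura"],
    "teoretický": ["teorie"],
    "literárněteoretický": ["literární", "teoretický"],
}


def roots(word):
    """Follow parent links upward and collect the unmotivated words reached."""
    parents = PARENTS.get(word, [])
    if not parents:  # no recorded parent: treat the word as unmotivated
        return [word]
    found = []
    for parent in parents:
        for root in roots(parent):
            if root not in found:
                found.append(root)
    return found


print(roots("literárněteoretický"))  # ['literatura', 'teorie']
```

Because a compound has more than one parent, walking upward from it can end in several unmotivated words, as it does here.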
Data
DeriNet
A collection of trees describing Czech word-formation families
Used to contain only derivation; now also contains compounding
DeriNet 2.1 offers:
Over 1 million lexemes
782 thousand derivational relations
1,952 compounding relations
202 affixoids
Universal Derivations (UDer)
Extension of DeriNet into other languages
Contains 31 data resources covering 21 languages
Neoclassical compounding
How do we analyze “biotechnology”?
As a derivative?
  “bio” is too lexical for an affix
  “bio” + “log(-y)” → biology
As a compound?
  There’s no isolated “bio”
Affixoids: -psych-, -log-, -tri-, -metr-, -pseud-...
They don’t occur by themselves, but they do occur in the results of:
  derivation: -psych- → psychosis
  compounding: -psych- + -metr- → psychometry
  conversion: -psych- → psycho
Affixoids are hypothesized to be shared across the languages of UDer
PaReNT (Parent Retrieval Neural Tool)
Principle
What is/are the parent(s) of the word “šálkožravec”?
You can try to find its tree in DeriNet…
…but you won’t, because it’s a nonce word.
Yet it is transparent to a native speaker.
If you don’t speak Czech, you can’t even tell whether it’s a compound.
We can, however, train a neural network using:
DeriNet (ces)
Universal Derivations (ces, ger, eng, dut, rus, fra, esp)
Wiktionary (ces, ger, eng, dut, rus, fra, esp)
GermaNet (ger)
CELEX (ger, eng, dut)
Example word → parent(s) pairs:
literárněteoretický → literární, teoretický
teoretický → teorie
literární → literatura
teorie → teorie
The result is a model that can retrieve the parent(s) of unseen words.
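The slides do not say how the word and parent pairs are encoded for the network. One plausible encoding, assumed here purely for illustration, is a character-level source sequence and a character-level target sequence with a separator token between parents. `SEP` and `to_pair` are hypothetical names.

```python
SEP = "<sep>"  # hypothetical separator token placed between parents


def to_pair(word, parents):
    """Turn (word, parents) into character-level source and target sequences.

    A word with no parents is mapped to itself, mirroring the
    'teorie teorie' row on the slide.
    """
    targets = parents if parents else [word]
    source = list(word)
    target = []
    for i, parent in enumerate(targets):
        if i > 0:
            target.append(SEP)
        target.extend(parent)  # extending with a string appends its characters
    return source, target


src, tgt = to_pair("literárněteoretický", ["literární", "teoretický"])
print(" ".join(src), "=>", " ".join(tgt))
```

Mapping unmotivated words to themselves keeps every training example in the same word-to-sequence shape, so a single model can cover all three word classes.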
Performance
Parent retrieval
fallback → fall, back
development → develop
Arzt → Arzt

Language   Accuracy
Czech      72%
German     64%
English    69%
Spanish    59%
French     54%
Dutch      73%
Russian    72%
Total      66%

Word classification
fallback → “compound”
development → “derivative”
Arzt → “unmotivated”

Language   Accuracy
Czech      65%
German     87%
English    83%
Spanish    61%
French     61%
Dutch      91%
Russian    69%
Total      74%
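The slide does not explain how the Total rows are computed. They are at least consistent with an unweighted mean over the seven languages, which the check below reproduces. Whether the totals were actually obtained this way is an assumption, and the names `first_table`, `second_table` and `macro_average` are illustrative.

```python
# Per-language accuracies (%) copied from the two tables, in their original order.
first_table = {"Czech": 72, "German": 64, "English": 69, "Spanish": 59,
               "French": 54, "Dutch": 73, "Russian": 72}
second_table = {"Czech": 65, "German": 87, "English": 83, "Spanish": 61,
                "French": 61, "Dutch": 91, "Russian": 69}


def macro_average(table):
    """Unweighted mean over languages, rounded to the nearest percent."""
    return round(sum(table.values()) / len(table))


print(macro_average(first_table))   # 66
print(macro_average(second_table))  # 74
```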
Application
How similar are languages with respect to naming conventions?
1. We gathered a list of 1800 female entities in Czech:
1.1. Ex.: waitress, girl, cleaning lady
2. We translated them into:
2.1. German, English, Spanish, French, Dutch, Russian
3. We classified the resulting expressions into one of the following strategies:
3.1. unmarked, phrase, compound, derivative, unmotivated
4. We calculated how often each language used each of the strategies,
5. and clustered the languages accordingly.
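Steps 4 and 5 amount to turning each language's strategy counts into a frequency distribution and then comparing the distributions pairwise. Jensen-Shannon divergence and Euclidean distance are the two measures named on the results slide; the sketch below computes both for two hypothetical languages. The counts are made up for illustration and are not results from the study, and the clustering method itself is not stated in the slides.

```python
import math

STRATEGIES = ["unmarked", "phrase", "compound", "derivative", "unmotivated"]


def distribution(counts):
    """Normalize raw strategy counts into relative frequencies."""
    total = sum(counts[s] for s in STRATEGIES)
    return [counts[s] / total for s in STRATEGIES]


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def euclidean(p, q):
    """Euclidean distance between two frequency vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Made-up counts for two hypothetical languages, NOT data from the study.
lang_a = distribution({"unmarked": 10, "phrase": 20, "compound": 30,
                       "derivative": 35, "unmotivated": 5})
lang_b = distribution({"unmarked": 40, "phrase": 25, "compound": 10,
                       "derivative": 20, "unmotivated": 5})

print(jensen_shannon(lang_a, lang_b), euclidean(lang_a, lang_b))
```

A matrix of such pairwise values over all seven languages is the kind of input a standard clustering procedure would take.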
Results: clustering of the languages under two measures, Jensen-Shannon divergence and Euclidean distance (figure not reproduced).
Acknowledgements
This work was supported by Grant No. START/HUM/010 of the Grant schemes at Charles University (reg. No. CZ.02.2.69/0.0/0.0/19_073/0016935). It used language resources developed, stored, and distributed by the LINDAT/CLARIAH-CZ project.
Thank you for your attention!
References
Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast Neural Machine Translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia. Association for Computational Linguistics.
Lukáš Kyjánek, Zdeněk Žabokrtský, Magda Ševčíková, and Jonáš Vidra. 2019. Universal Derivations Kickoff: A Collection of Eleven Harmonized Derivational Resources for Eleven Languages. In Proceedings of the 2nd Workshop on Resources and Tools for Derivational Morphology. Prague: Charles University. ISBN: 978-80-88132-08-0.
Lukáš Kyjánek, Emil Svoboda, Jan Bodnár, Jonáš Vidra, Magda Ševčíková, and Zdeněk Žabokrtský. 2022. Modelling Macro-Level Competition in Word Formation: A Case Study on Seven Languages. Submitted to COLING 2022.
Emil Svoboda and Magda Ševčíková. 2021. Splitting and Identifying Czech Compounds: A Pilot Study. In Proceedings of the Third Workshop on Resources and Tools for Derivational Morphology (DeriMo 2021), pages 125–134, France.
Emil Svoboda and Magda Ševčíková. 2022. Word Formation Analyzer for Czech: Automatic Parent Retrieval and Classification of Word Formation Processes. The Prague Bulletin of Mathematical Linguistics 118, pages 55–73. Prague, April 2022.
Jonáš Vidra, Zdeněk Žabokrtský, Magda Ševčíková, and Lukáš Kyjánek. 2019. DeriNet 2.0: Towards an All-in-One Word-Formation Resource. In Proceedings of the 2nd Workshop on Resources and Tools for Derivational Morphology. Prague: Charles University. ISBN: 978-80-88132-08-0.