Computational, Cognitive, and Linguistic Approaches
to the Analysis of Complex Words and Collocations (CCLCC 2014)
Workshop organized as part of the
ESSLLI European Summer School in Logic, Language and Information
August 11-15, 2014 (ESSLLI first week), Tübingen, Germany
Proceedings
Verena Henrich & Erhard Hinrichs (eds.)
Tübingen, Germany, August 2014
CCLCC website: http://www.sfs.uni-tuebingen.de/~vhenrich/cclcc_2014/
Publisher:
Department of Linguistics (SfS)
University of Tübingen
Wilhelmstr. 19
72074 Tübingen, Germany
and
Collaborative Research Center: Emergence of Meaning (SFB 833)
University of Tübingen
Nauklerstr. 35
72074 Tübingen, Germany
Contact:
acl-sekretariat@sfs.uni-tuebingen.de
http://www.sfs.uni-tuebingen.de/
No part of this book may be reproduced in any form without the prior written permission of
the editors.
This volume has been compiled from the pdf files supplied by the authors.
Table of Contents
List of Reviewers
Workshop Program
Acknowledgments
Preface
INVITED TALKS
Invited Talk: Compound stress, informativity and semantic transparency
Melanie Bell
Invited Talk: The Semantics of Word Collocations from a Distributional Point of View
Eduard Hovy
SUBMITTED PAPERS
Statistical methods for Estonian particle verb extraction from text corpus
Eleri Aedmaa
Variational models in collocation: taxonomic relations and collocates inheritance
Laura Giacomini
Automatic Collocation Extraction and Classification of Automatically Obtained Bigrams
Daria Kormacheva, Lidia Pivovarova and Mikhail Kopotev
Semantic modeling of collocations for lexicographic purposes
Lothar Lemnitzer and Alexander Geyken
Treatment of Multiword Expressions and Compounds in Bulgarian
Petya Osenova and Kiril Simov
Cross-language description of shape: Shape-related properties and Artifacts as retrieved from conventional and novel collocations across different languages
Francesca Quattri
Using compound lists for German decompounding in a back-off scenario
Pedro Bispo Santos
Multi-label Classification of Semantic Relations in German Nominal Compounds using SVMs
Daniil Sorokin, Corina Dima and Erhard Hinrichs
Too Colorful To Be Real. The meanings of multi word patterns
Konrad Szczesniak
Verb-Noun Collocations in PolNet 2.0
Zygmunt Vetulani and Grażyna Vetulani
Variational models in collocation: taxonomic relations and collocates inheritance
Laura Giacomini
Department of Translation and Interpreting
University of Heidelberg
Plöck 57a, D-69117 Heidelberg
laura.giacomini@iued.uni-heidelberg.de
Abstract
This paper presents part of the results of investigations conducted at Heidelberg University on corpus methods in translation practice and, in particular, on paradigmatic collocates variation. It concentrates on collocates inheritance across emotion words, considering different syntactic frames and a multilingual perspective, in order to highlight the potential benefits of this approach for the automatic analysis of word combinations and its applications, e.g. in e-lexicography and machine translation.
1 Introduction: Purpose and Method
Paradigmatic variation in collocational structures, on the level of both the base(s) and the collocate(s), plays a key role in language production (cf. Hall 2010, Nuccorini 2001) and is far from being limited to the mutual substitutability of near-synonymous lexical elements. In particular, the inheritance of collocates (cf. the definition of base/collocate in Hausmann 1999), observed in the context of an ontology-based semantic analysis, turns out to be an interesting example of how languages tend to build collocational clusters and patterns that are poorly represented in existing lexicographic resources and still cannot be adequately captured by available corpus query systems.
Initial observations made by Giacomini (2012) on collocates inheritance across emotion words in Italian can be summarised as follows:
a- meaning relations inside a semantic field: given a semantic field, a number of semantic (here taxonomic) relations can be identified between its lexical items;
b- semantically-based collocates inheritance: a corpus-based study of the collocational behaviour of these items shows that collocates of hypernymic bases are frequently inherited by the hyponymic bases, according to semantic contiguity patterns sanctioned by language use.
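The two observations above can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the toy taxonomy (emozione as the hypernymic base of five emotion nouns) and the function name `inherited_collocates` are assumptions made for illustration, while the collocate sets are taken from the Italian excerpts reported in Tables 1 and 2.

```python
from typing import Dict, List, Set

# Assumed toy taxonomy for illustration: hypernymic base -> direct hyponymic bases.
TAXONOMY: Dict[str, List[str]] = {
    "emozione": ["paura", "odio", "rabbia", "gioia", "piacere"],
}

# Collocates attested per base, following the Italian excerpts in Tables 1 and 2.
COLLOCATES: Dict[str, Set[str]] = {
    "emozione": {"suscitare", "ancestrale", "effimero"},
    "paura": {"suscitare", "ancestrale"},
    "odio": {"suscitare", "ancestrale"},
    "rabbia": {"suscitare"},
    "gioia": {"effimero"},
    "piacere": {"effimero"},
}


def inherited_collocates(
    hypernym: str,
    taxonomy: Dict[str, List[str]],
    collocates: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Map each collocate of the hypernymic base to the hyponyms sharing it."""
    result: Dict[str, List[str]] = {}
    for collocate in sorted(collocates.get(hypernym, set())):
        # A hyponym "inherits" the collocate if it is attested with it as well.
        result[collocate] = [
            hyponym
            for hyponym in taxonomy.get(hypernym, [])
            if collocate in collocates.get(hyponym, set())
        ]
    return result


for collocate, heirs in inherited_collocates("emozione", TAXONOMY, COLLOCATES).items():
    print(collocate, "->", ", ".join(heirs))
```

In this toy data, the semantically generic collocate suscitare is shared by more hyponyms than ancestrale or effimero, which mirrors the distinction between generic and specific selection rules drawn in Section 2.1.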
This paper expands on the topic of collocates inheritance by considering different syntactic frames and a multilingual perspective, in order to highlight the potential benefits of this approach for the automatic analysis of word combinations and its applications, e.g. in e-lexicography and machine translation. The paper presents part of the results of investigations conducted at the Department of Translation and Interpreting of Heidelberg University on corpus methods in translation practice.
Lexical information on word combinations such as collocations (Burger 2007) was automatically retrieved from large multilingual web corpora, evaluated syntactically and semantically, and compared with lexicographic data from collocation dictionaries. The focus on relatively small semantic fields, such as some subfields of emotions, together with an ontology-based approach to the lexicon, had the advantage of highlighting the fine-grained semantic clustering of collocational elements and allowed for possible generalisations about this type of paradigmatic variation.
2 Observing Collocates Inheritance in Multilingual Corpora
2.1 Data and analysis
The excerpts from the extracted data contain equivalent collocations in four languages (Italian, French, German and English). The data refer to general-language nouns denoting emotions and to the collocations they build in some of their usual syntagmatic constellations. For each collocational pattern, the hypernymic base is emphasized in bold letters and is followed by a list of relevant hyponymic bases that share the same collocate. Taxonomic relations were assessed by using existing language-specific lexical ontologies such as the Princeton WordNet and by introducing the necessary adjustments on the basis of multilingual studies on emotion concepts and words (cf. Niedenthal 2004).
Lexical information was extracted with the help of the corpus query system Sketch Engine (https://the.sketchengine.co.uk) from large web corpora in the four reference languages, namely itWac, frWac1.1, deTenTen10, and ukWac. These corpora include around 1.5-2.8 billion tokens and are PoS-tagged and lemmatised. This level of annotation was required in order to also identify co-occurring but non-adjacent bases and collocates. In particular, collocation candidates were retrieved by means of the Word Sketch function, which groups the collocates of a given lexeme according to predetermined syntactic patterns.
Relevance and arrangement of equivalent bases
were determined through frequency criteria and
statistical association measures (MI and logDice).
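The paper names the two association measures without giving their formulas. The following Python sketch uses the commonly cited definitions, namely pointwise mutual information and logDice in the form documented for Sketch Engine; whether the study relied on exactly these variants is an assumption. The function names are hypothetical, and the frequencies in the usage lines are invented for illustration, not corpus counts.

```python
import math


def mutual_information(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Pointwise MI: log2 of observed over expected co-occurrence frequency.

    f_xy: co-occurrence frequency of base and collocate
    f_x, f_y: corpus frequencies of base and collocate
    n: corpus size in tokens
    """
    return math.log2((f_xy * n) / (f_x * f_y))


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice: 14 + log2 of the Dice coefficient; independent of corpus size."""
    return 14 + math.log2((2 * f_xy) / (f_x + f_y))


# Invented frequencies, for illustration only.
print(round(mutual_information(f_xy=8, f_x=16, f_y=16, n=1024), 2))
print(round(log_dice(f_xy=50, f_x=100, f_y=100), 2))
```

Because logDice does not depend on corpus size, it is the more convenient of the two measures for comparing candidates across corpora of different sizes, such as the four web corpora used here.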
Tables 1 and 2 show a selection of collocation candidates obtained from the data analysis and display the absolute frequency of each candidate in the corpus. The excerpts include only direct co-hyponyms of a specific base (the base is written in bold characters), but deeper and/or multiple taxonomic levels should also be taken into account in a large-scale analysis. The cross-linguistic comparison serves illustrative purposes and is restricted to the most frequent equivalents of the same concept in the displayed languages; not least because of its context-free nature, it is not meant to exclude other lexical combinations.
The first data set (Table 1) covers binary combinations with a few syntactic variations on the multilingual level (signaled by =, e.g. nominal compounds like Angstschrei alongside n-grams). Despite the limited semantic specificity of the collocates, their inheritance is governed by selection preferences which do not seem to differ substantially across the four languages.

N(base)+PP | N+PP(base) | V+N(base)
paura (1073), terrore (80), orrore (66), angoscia (79) della morte | grido di paura (27), spavento (24), terrore (61), orrore (17) | suscitare emozioni (942), paura (154), odio (71), rabbia (59)
peur (155), terreur (47), horreur (21), angoisse (40) de la mort | cri de peur (19), terreur (53), horreur (7), panique (7) | susciter émotions (669), crainte (153), colère (211), haine (51)
Angst/Furcht (1020/113), Schrecken (4), Panik (2) vor dem Tod | =Angstschrei (94), =vor Angst schreien (76) | Emotionen (45), Gefühl (127), Angst (45), Hass (7) hervorrufen
fear/=afraid (546/58), terror (28), horror (37) of death | =to scream in fear/fright (12/7), horror (12), terror (47) | to arouse emotions (123), fear (84), hatred (16), anger (67)
Table 1: Generic selection rules.

Table 2 shows collocations following more stringent selection rules. These rules concern, for instance, the polarity of emotion concepts: ancestral modifies names of negative emotions, whereas fleeting usually accompanies positive feelings. Another example is emotion nouns which, especially in their role as subjects, require verbal collocates with specific aspect and Aktionsart (e.g. to creep, denoting a non-stative, continuous action performed by emotions that can manifest themselves gradually and almost unnoticed).
N(base)+V | A+N(base) | A+N(base)
la paura (12), terrore (6), panico (6), angoscia (6) si insinua | emozione (13), odio (6), paura (189) ancestrale | emozione (8), gioia (38), piacere (48) effimero/a
la peur (11), panique (5), angoisse (4) s'insinue | émotion (6), haine (31), peur (87) ancestrale | sentiment (5), joie (23), plaisir (42) éphémère
die Angst (5), Panik (2) schleicht sich ein | ursprüngliche Emotion (3), Hass (4), Angst (4), =Urangst (595) | vergängliche Gefühle (2), Freude (18)
fear (10), panic (2) creeps in | ancestral emotion (9), hatred (7), fear (9) | fleeting feeling (5), emotion (8), happiness (7), joy (9)
Table 2: Specific selection rules.

2.2 Results interpretation
The following observations and hypotheses can now be made in relation to the presented data:
• collocates inheritance seems to be particularly recurrent in the case of abstract (or, better, second entity) words, which often feature fuzzy semantic boundaries and overlapping traits;
• due to the overall tendency towards terminological univocity in specialised languages, collocates inheritance is likely to affect the general language more than specialised languages (cf. the analysis of ansia, angoscia, panico and fobia both in general language and in the domains of psychology, psychiatry and philosophy in Giacomini 2012; the study highlighted interesting differences in the way the same lexical items behaved in general language and in specialised language from a collocational perspective, with the exception of the subclass of their compounds);
• all selected co-occurrences are compositional, whereas non-compositionality (cf., for instance, semi-idioms such as to frighten sb out of their wits, peur bleue, Heidenangst) possibly inhibits taxonomically contiguous bases from sharing their collocates;
• generally speaking, in a monolingual context, collocations can be semantically grouped together along evident taxonomic patterns across a number of syntactic structures;
• however, the identification of inherited collocates can also highlight differences and similarities in the way in which distinct languages form collocates clusters along their own reality categorization and encoding models.
The findings of the study introduced in this paper are based on data extracted from web corpora and largely match the results obtained from newspaper corpora in Giacomini (2012 and 2013). Testing the validity of the original hypotheses in other semantic fields and specialised domains, also by using alternative corpus types and text genres, could contribute towards a better understanding of the phenomenon. A comparison between corpus data and lexicographic data included in collocation dictionaries (Macmillan Collocations Dictionary, Macmillan 2010; Dizionario Combinatorio Italiano, Benjamins 2013; Dictionnaire des combinaisons de mots, Le Robert 2007; Wörterbuch der Kollokationen im Deutschen, de Gruyter 2010) reveals the lack, at least in printed lexicographic resources, of an overall cross-referencing system that enables the user to recognize shared collocates. The electronic medium has the potential to offer this type of information, and the representation of collocations in e-lexicography would benefit significantly from further studies on this topic.
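One way an electronic dictionary could expose shared collocates is an inverted index from collocates to the bases they combine with. The Python sketch below is only an illustration of that idea under stated assumptions: the function names `build_collocate_index` and `shared_bases` are hypothetical, no existing dictionary is claimed to work this way, and the entries are the English adjective-noun candidates and frequencies from Table 2.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (base, collocate, absolute frequency), taken from the English row of Table 2.
ENTRIES: List[Tuple[str, str, int]] = [
    ("emotion", "ancestral", 9),
    ("hatred", "ancestral", 7),
    ("fear", "ancestral", 9),
    ("feeling", "fleeting", 5),
    ("emotion", "fleeting", 8),
    ("happiness", "fleeting", 7),
    ("joy", "fleeting", 9),
]


def build_collocate_index(
    entries: Iterable[Tuple[str, str, int]],
) -> Dict[str, Dict[str, int]]:
    """Invert base -> collocate records into collocate -> {base: frequency}."""
    index: Dict[str, Dict[str, int]] = defaultdict(dict)
    for base, collocate, frequency in entries:
        index[collocate][base] = frequency
    return index


def shared_bases(index: Dict[str, Dict[str, int]], collocate: str) -> List[str]:
    """Return all bases sharing a collocate, most frequent first."""
    bases = index.get(collocate, {})
    # Sort by descending frequency, then alphabetically to break ties.
    return sorted(bases, key=lambda base: (-bases[base], base))


index = build_collocate_index(ENTRIES)
print(shared_bases(index, "ancestral"))
print(shared_bases(index, "fleeting"))
```

From the entry for fear, such an index would let the user follow the collocate ancestral to the other bases that share it, which is the kind of cross-reference that printed dictionaries lack.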
3 Conclusions
The practice of translation as well as linguistic applications such as e-lexicography could derive tangible benefits from an in-depth investigation of paradigmatic collocates variation, both from a language-specific and a cross-linguistic point of view.
For NLP purposes in general, this investigation could lead to the specification of suitable statistical methods for identifying inheritance patterns in corpora (cf. Roark/Sproat 2007 and the work done by Alonso Ramos et al. 2010). The development of collocation-based interlinguistic models would be particularly useful in the field of machine translation and in enhancing the functionality of translation memories. Finally, it is crucial to stress the importance of lexical ontologies for avoiding a fragmentary approach to the investigation of collocations and for allowing a better descriptive representation of the lexicon.
References
Margarita Alonso Ramos et al. 2010. Tagging collocations for learners. eLexicography in the 21st century: new challenges, new applications. Proceedings of eLex: 675-380.
Harald Burger. 2007. Phraseologie: Eine Einführung am Beispiel des Deutschen. Erich Schmidt Verlag, Berlin.
Laura Giacomini. 2013. Languages in Comparison(s): Using Corpora to Translate Culture-Specific Similes. SILTA. Studi Italiani di Linguistica Teorica e Applicata 2013/2: 247-270.
Laura Giacomini. 2012. Un dizionario elettronico delle collocazioni come rete di relazioni lessicali. Peter Lang, Frankfurt/Main.
Timothy Hall. 2010. L2 Learner-Made Formulaic Expressions and Constructions. Teachers College, Columbia University Working Papers in TESOL and Applied Linguistics. Vol. 10, No. 2: 1-18.
Franz Josef Hausmann. 1999. Praktische Einführung in den Gebrauch des Student's Dictionary of Collocations. Student's Dictionary of Collocations. Cornelsen, Berlin: iv-xiii.
Paula M. Niedenthal. 2004. A prototype analysis of the French category 'émotion'. Cognition and Emotion. 18 (3): 289-312.
Stefania Nuccorini. 2001. Introduction. When a torch becomes a candle: variation in phraseology. SILTA. Studi Italiani di Linguistica Teorica e Applicata 2001/2: 193-198.
Brian Roark and Richard Sproat. 2007. Computational approaches to morphology and syntax. Oxford University Press, Oxford, UK.