Variational models in collocation: taxonomic relations and collocates inheritance

Computational, Cognitive, and Linguistic Approaches
to the Analysis of Complex Words and Collocations (CCLCC 2014)
Workshop organized as part of the
ESSLLI European Summer School in Logic, Language and Information
August 11-15, 2014 (ESSLLI first week), Tübingen, Germany
Proceedings
Verena Henrich & Erhard Hinrichs (eds.)
Tübingen, Germany, August 2014
CCLCC website: http://www.sfs.uni-tuebingen.de/~vhenrich/cclcc_2014/
Publisher:
Department of Linguistics (SfS)
University of Tübingen
Wilhelmstr. 19
72074 Tübingen, Germany
and
Collaborative Research Center: Emergence of Meaning (SFB 833)
University of Tübingen
Nauklerstr. 35
72074 Tübingen, Germany
Contact:
acl-sekretariat@sfs.uni-tuebingen.de
http://www.sfs.uni-tuebingen.de/
No part of this book may be reproduced in any form without the prior written permission of
the editors.
This volume has been compiled from the pdf files supplied by the authors.
Table of Contents
List of Reviewers
Workshop Program
Acknowledgments
Preface
INVITED TALKS
Invited Talk: Compound stress, informativity and semantic transparency
Melanie Bell
Invited Talk: The Semantics of Word Collocations from a Distributional Point of View
Eduard Hovy
SUBMITTED PAPERS
Statistical methods for Estonian particle verb extraction from text corpus
Eleri Aedmaa
Variational models in collocation: taxonomic relations and collocates inheritance
Laura Giacomini
Automatic Collocation Extraction and Classification of Automatically Obtained Bigrams
Daria Kormacheva, Lidia Pivovarova and Mikhail Kopotev
Semantic modeling of collocations for lexicographic purposes
Lothar Lemnitzer and Alexander Geyken
Treatment of Multiword Expressions and Compounds in Bulgarian
Petya Osenova and Kiril Simov
Cross-language description of shape: Shape-related properties and Artifacts as retrieved from conventional and novel collocations across different languages
Francesca Quattri
Using compound lists for German decompounding in a back-off scenario
Pedro Bispo Santos
Multi-label Classification of Semantic Relations in German Nominal Compounds using SVMs
Daniil Sorokin, Corina Dima and Erhard Hinrichs
Too Colorful To Be Real. The meanings of multi word patterns
Konrad Szczesniak
Verb-Noun Collocations in PolNet 2.0
Zygmunt Vetulani and Grażyna Vetulani
Variational models in collocation: taxonomic relations and collocates inheritance
Laura Giacomini
Department of Translation and Interpreting
University of Heidelberg
Plöck 57a, D-69117 Heidelberg
laura.giacomini@iued.uni-heidelberg.de
Abstract
The paper presents part of the results obtained within the framework of investigations conducted at Heidelberg University on corpus methods in translation practice and, in particular, on the topic of paradigmatic collocates variation. It concentrates on collocates inheritance across emotion words by focusing on different syntactic frames and a multilingual perspective in order to highlight the potential benefits of this approach for the automatic analysis of word combinations and its applications, e.g. in the fields of e-lexicography and machine translation.
1 Introduction: Purpose and Method
Paradigmatic variation in collocational structures, on the level of both the base(s) and the collocate(s), plays a key role in language production (cf. Hall 2010, Nuccorini 2001) and is far from being limited to the mutual substitutability of near-synonymic lexical elements. In particular, the inheritance of collocates (cf. the definition of base/collocate in Hausmann 1999), observed in the context of an ontology-based semantic analysis, turns out to be an interesting example of how languages tend to build collocational clusters and patterns that are poorly represented in existing lexicographic resources and still cannot be sufficiently grasped by available corpus query systems.
Initial observations made by Giacomini (2012) on collocates inheritance across emotion words in Italian can be summarised as follows:
a- meaning relations inside a semantic field: given a semantic field, a number of semantic (here taxonomic) relations can be identified between its lexical items;
b- semantically-based collocates inheritance: a corpus-based study of the collocational behaviour of these items shows that collocates of hypernymic bases are frequently inherited by the hyponymic bases according to semantic contiguity patterns acknowledged by language use.
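The two observations can be illustrated with a minimal Python sketch. The frequencies are the Italian figures reported in Tables 1 and 2 of this paper (with emozioni lemmatised to emozione); the hand-coded `HYPONYMS` mapping and the function `inherited_collocates` are hypothetical stand-ins for a lexical ontology and for the semantic evaluation step, not the procedure actually used in the study.

```python
from typing import Dict, Iterable, Set

# Collocate -> base -> absolute corpus frequency (Italian rows of Tables 1 and 2).
COLLOCATES: Dict[str, Dict[str, int]] = {
    "suscitare": {"emozione": 942, "paura": 154, "odio": 71, "rabbia": 59},
    "ancestrale": {"emozione": 13, "odio": 6, "paura": 189},
    "effimero": {"emozione": 8, "gioia": 38, "piacere": 48},
}

# Assumption: a tiny hand-coded taxonomy standing in for a lexical ontology.
HYPONYMS: Dict[str, Set[str]] = {
    "emozione": {"paura", "odio", "rabbia", "gioia", "piacere"},
}


def inherited_collocates(
    hypernym: str,
    hyponyms: Iterable[str],
    collocates: Dict[str, Dict[str, int]],
    min_freq: int = 1,
) -> Dict[str, Set[str]]:
    """Map each hyponym to the collocates it shares with the hypernym."""
    hyponyms = list(hyponyms)
    result: Dict[str, Set[str]] = {h: set() for h in hyponyms}
    for collocate, bases in collocates.items():
        if bases.get(hypernym, 0) < min_freq:
            continue  # the hypernym itself does not take this collocate
        for h in hyponyms:
            if bases.get(h, 0) >= min_freq:
                result[h].add(collocate)
    return result


inherited = inherited_collocates("emozione", HYPONYMS["emozione"], COLLOCATES)
for base in sorted(inherited):
    print(base, sorted(inherited[base]))
```

The `min_freq` threshold is an illustrative parameter: raising it discards collocates that are only weakly attested with either the hypernym or the hyponym.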
This paper expands on the topic of collocates inheritance by focusing on different syntactic frames and a multilingual perspective in order to highlight the potential benefits of this approach for the automatic analysis of word combinations and its applications, e.g. in the fields of e-lexicography and machine translation. The paper presents part of the results obtained within the framework of investigations conducted at the Department of Translation and Interpreting of Heidelberg University on corpus methods in translation practice.
Lexical information on word combinations such as collocations (Burger 2007) was automatically retrieved from large multilingual web corpora, syntactically and semantically evaluated, and compared with lexicographic data from collocation dictionaries. The focus on relatively small semantic fields, such as some subfields of emotions, and an ontology-based approach to the lexicon had the advantage of highlighting fine-grained semantic clustering of collocational elements and allowed for possible generalisations on this type of paradigmatic variation.
2 Observing Collocates Inheritance in Multilingual Corpora
2.1 Data and analysis
The excerpts from the extracted data contain equivalent collocations in four languages (Italian, French, German and English). The data refer to general-language nouns denoting emotions and to the collocations they build in some of their usual syntagmatic constellations. For each collocational pattern, the hypernymic base is emphasized in bold letters and is followed by a list of relevant hyponymic bases that share the same collocate. Taxonomic relations were assessed by using existing language-specific lexical ontologies such as the Princeton WordNet and by introducing the necessary adjustments on the basis of multilingual studies on emotion concepts and words (cf. Niedenthal 2004).
Lexical information was extracted with the help of the corpus-query system Sketch Engine (https://the.sketchengine.co.uk) from large web corpora in the four reference languages, namely itWac, frWac1.1, deTenTen10, and ukWac, which contain around 1.5-2.8 billion tokens and are PoS-tagged and lemmatised. This level of annotation was required in order to also identify co-occurring but non-adjacent bases and collocates. In particular, collocation candidates were retrieved by means of the Word Sketch function, which groups collocates of a given lexeme along predetermined syntactic patterns.
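Why lemmatisation and PoS tagging matter for non-adjacent pairs can be shown with a small Python sketch. It is a deliberately simplified window-based search for adjective + noun candidates and does not reproduce the Word Sketch function; the tag names (`ADJ`, `NOUN`), the window size and the example sentence are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (lemma, coarse PoS tag); tag names are assumed


def adj_noun_pairs(sentence: List[Token], window: int = 3) -> List[Tuple[str, str]]:
    """Return (adjective, noun) lemma pairs at most `window` tokens apart."""
    pairs = []
    for i, (noun, pos_i) in enumerate(sentence):
        if pos_i != "NOUN":
            continue
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            lemma_j, pos_j = sentence[j]
            if j != i and pos_j == "ADJ":
                pairs.append((lemma_j, noun))
    return pairs


def count_pairs(sentences: Iterable[List[Token]], window: int = 3) -> Counter:
    """Count candidate pairs over a collection of tagged sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(adj_noun_pairs(sentence, window))
    return counts


# Hypothetical tagged sentence: "an ancestral and almost irrational fear"
EXAMPLE: List[Token] = [
    ("a", "DET"), ("ancestral", "ADJ"), ("and", "CCONJ"),
    ("almost", "ADV"), ("irrational", "ADJ"), ("fear", "NOUN"),
]

print(adj_noun_pairs(EXAMPLE, window=4))
```

With a window of one token only the adjacent adjective is found, whereas a wider window also retrieves ancestral, which is separated from the noun by three tokens.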
Relevance and arrangement of equivalent bases
were determined through frequency criteria and
statistical association measures (MI and logDice).
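For reference, the two association measures can be computed as in the following Python sketch. It assumes the commonly used definitions, MI = log2(f_xy * N / (f_x * f_y)) and logDice = 14 + log2(2 * f_xy / (f_x + f_y)); the paper does not state which variants or thresholds were applied, and the counts in the example call are invented for illustration.

```python
import math


def mutual_information(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """MI score: log2 of observed over expected co-occurrence frequency."""
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all frequencies must be positive")
    return math.log2(f_xy * n / (f_x * f_y))


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice score: 14 + log2 of the Dice coefficient (maximum value 14)."""
    if min(f_xy, f_x, f_y) <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts: pair frequency, base frequency, collocate frequency, corpus size.
print(round(mutual_information(10, 100, 100, 10_000), 3))
print(round(log_dice(10, 100, 100), 3))
```

Unlike MI, logDice does not depend on the corpus size, which makes its values easier to compare across corpora of different sizes such as the four used here.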
Tables 1 and 2 show a selection of collocation candidates obtained from the data analysis and display the absolute frequency of each candidate in the corpus. The excerpts include only direct co-hyponyms of a specific base (the base is written in bold characters), but deeper and/or multiple taxonomic levels should also be taken into account in a large-scale analysis. The cross-linguistic comparison serves demonstration purposes and is restricted to the most frequent equivalents of the same concept in the displayed languages; not least because of its context-free nature, it is not meant to exclude other lexical combinations.
The first data set (Table 1) covers binary combinations with a few syntactic variations on the multilingual level (signaled by =, e.g. nominal compounds like Angstschrei besides n-grams). Despite the limited semantic specificity of the collocates, their inheritance is governed by selection preferences which do not seem to differ substantially across the four languages.
Italian
  N(base)+PP: paura (1073), terrore (80), orrore (66), angoscia (79) della morte
  N+PP(base): grido di paura (27), spavento (24), terrore (61), orrore (17)
  V+N(base): suscitare emozioni (942), paura (154), odio (71), rabbia (59)
French
  N(base)+PP: peur (155), terreur (47), horreur (21), angoisse (40) de la mort
  N+PP(base): cri de peur (19), terreur (53), horreur (7), panique (7)
  V+N(base): susciter émotions (669), crainte (153), colère (211), haine (51)
German
  N(base)+PP: Angst/Furcht (1020/113), Schrecken (4), Panik (2) vor dem Tod
  N+PP(base): =Angstschrei (94), =vor Angst schreien (76)
  V+N(base): Emotionen (45), Gefühl (127), Angst (45), Hass (7) hervorrufen
English
  N(base)+PP: fear/=afraid (546/58), terror (28), horror (37) of death
  N+PP(base): =to scream in fear/fright (12/7), horror (12), terror (47)
  V+N(base): to arouse emotions (123), fear (84), hatred (16), anger (67)
Table 1: Generic selection rules.
Table 2 shows collocations following more stringent selection rules. These rules regard, for instance, the polarity of emotion concepts: ancestral modifies names of negative emotions, whereas the word fleeting usually accompanies positive feelings. Another example is provided by emotion nouns which, especially in their role as subjects, require verbal collocates with specific aspect and Aktionsart (e.g. to creep, denoting a non-stative, continuous action performed by emotions that can manifest themselves gradually and almost unnoticed).
Italian
  N(base)+V: la paura (12), terrore (6), panico (6), angoscia (6) si insinua
  A+N(base): emozione (13), odio (6), paura (189) ancestrale
  A+N(base): emozione (8), gioia (38), piacere (48) effimero/a
French
  N(base)+V: la peur (11), panique (5), angoisse (4) s'insinue
  A+N(base): émotion (6), haine (31), peur (87) ancestrale
  A+N(base): sentiment (5), joie (23), plaisir (42) éphémère
German
  N(base)+V: die Angst (5), Panik (2) schleicht sich ein
  A+N(base): ursprüngliche Emotion (3), Hass (4), Angst (4), =Urangst (595)
  A+N(base): vergängliche Gefühle (2), Freude (18)
English
  N(base)+V: fear (10), panic (2) creeps in
  A+N(base): ancestral emotion (9), hatred (7), fear (9)
  A+N(base): fleeting feeling (5), emotion (8), happiness (7), joy (9)
Table 2: Specific selection rules.
2.2 Results interpretation
The following observations and hypotheses can now be made in relation to the presented data:
- collocates inheritance seems to be particularly recurrent in the case of abstract (or, better, second entity) words, which often feature fuzzy semantic boundaries and overlapping traits;
- due to the overall tendency of specialised languages towards terminological univocity, collocates inheritance is likely to affect the general language more than specialised languages (cf. the analysis of ansia, angoscia, panico and fobia both in general language and in the domains of psychology, psychiatry and philosophy in Giacomini 2012; the study highlighted interesting differences in the way in which the same lexical items behaved in general language and in specialised language from a collocational perspective, with the exception of the subclass of their compounds);
- all selected co-occurrences are compositional, whereas non-compositionality (cf., for instance, semi-idioms such as to frighten sb out of their wits, peur bleue, Heidenangst) possibly inhibits taxonomically contiguous bases from sharing their collocates;
- generally speaking, in a monolingual context, collocations can be semantically grouped together along evident taxonomic patterns across a number of syntactic structures; however,
- the identification of inherited collocates can also highlight differences and similarities in the way in which distinct languages form collocates clusters along their own reality categorization and encoding models.
The findings of the study introduced in this paper are based on data extracted from web corpora and largely match the results obtained from newspaper corpora in Giacomini (2012 and 2013). Testing the validity of the original hypotheses in other semantic fields and specialised domains, also by using alternative corpus types and text genres, could contribute towards a better understanding of the phenomenon. A comparison between corpus data and lexicographic data included in collocation dictionaries (Macmillan Collocations Dictionary, Macmillan 2010; Dizionario Combinatorio Italiano, Benjamins 2013; Dictionnaire des combinaisons de mots, Le Robert 2007; Wörterbuch der Kollokationen im Deutschen, de Gruyter 2010) reveals the lack, at least in printed lexicographic resources, of an overall cross-referencing system that would enable the user to recognize shared collocates. Undoubtedly, the electronic medium has the potential to offer this type of information, and the representation of collocations in e-lexicography would derive significant benefits from further studies on this topic.
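As a rough illustration of what such a cross-referencing system might look like in an electronic dictionary, the following Python sketch inverts a base-to-collocates mapping so that each entry can point to the other bases sharing a given collocate. The entries reuse the English collocations from Tables 1 and 2; the data structure and the function name `build_cross_references` are hypothetical and not taken from any of the dictionaries mentioned.

```python
from collections import defaultdict
from typing import Dict, List

# Base -> collocates, taken from the English rows of Table 2.
ENTRIES: Dict[str, List[str]] = {
    "fear": ["ancestral", "creep in"],
    "panic": ["creep in"],
    "hatred": ["ancestral"],
    "emotion": ["ancestral", "fleeting"],
    "joy": ["fleeting"],
    "happiness": ["fleeting"],
    "feeling": ["fleeting"],
}


def build_cross_references(
    entries: Dict[str, List[str]],
) -> Dict[str, Dict[str, List[str]]]:
    """For each base and collocate, list the other bases sharing that collocate."""
    by_collocate = defaultdict(set)
    for base, collocates in entries.items():
        for collocate in collocates:
            by_collocate[collocate].add(base)
    return {
        base: {c: sorted(by_collocate[c] - {base}) for c in collocates}
        for base, collocates in entries.items()
    }


xrefs = build_cross_references(ENTRIES)
for collocate, others in xrefs["fear"].items():
    print(f"fear + {collocate}: see also {', '.join(others)}")
```

A dictionary interface could attach such "see also" lists to each collocation, and a taxonomy could additionally be used to order them by hypernym and hyponym.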
3 Conclusions
The practice of translation as well as linguistic applications such as e-lexicography could derive tangible benefits from an in-depth investigation of paradigmatic collocates variation, from both a language-specific and a cross-linguistic point of view.
For NLP purposes in general, this investigation could lead to the specification of suitable statistical methods for the identification of inheritance patterns in corpora (cf. Roark/Sproat 2007 and work done by Alonso Ramos et al. 2010). The development of collocation-based interlinguistic models would be particularly useful in the field of Machine Translation and in enhancing the functionality of Translation Memories. Finally, it is crucial to stress the importance of lexical ontologies for avoiding a fragmentary approach to collocation investigation and for allowing a better descriptive representation of the lexicon.
References
Margarita Alonso Ramos et al. 2010. Tagging collocations for learners. eLexicography in the 21st century: new challenges, new applications. Proceedings of eLex: 375-380.
Harald Burger. 2007. Phraseologie: Eine Einführung am Beispiel des Deutschen. Erich Schmidt Verlag, Berlin.
Laura Giacomini. 2013. Languages in Comparison(s): Using Corpora to Translate Culture-Specific Similes. SILTA. Studi Italiani di Linguistica Teorica e Applicata 2013/2: 247-270.
Laura Giacomini. 2012. Un dizionario elettronico delle collocazioni come rete di relazioni lessicali. Peter Lang, Frankfurt/Main.
Timothy Hall. 2010. L2 Learner-Made Formulaic Expressions and Constructions. Teachers College, Columbia University Working Papers in TESOL and Applied Linguistics. Vol. 10, No. 2: 1-18.
Franz Josef Hausmann. 1999. Praktische Einführung in den Gebrauch des Student's Dictionary of Collocations. Student's Dictionary of Collocations. Cornelsen, Berlin: iv-xiii.
Paula M. Niedenthal. 2004. A prototype analysis of the French category 'émotion'. Cognition and Emotion. 18 (3): 289-312.
Stefania Nuccorini. 2001. Introduction. When a torch becomes a candle: variation in phraseology. SILTA. Studi Italiani di Linguistica Teorica e Applicata 2001/2: 193-198.
Brian Roark and Richard Sproat. 2007. Computational approaches to morphology and syntax. Oxford University Press, Oxford, UK.