
Moving away from Semantic Overfitting in disambiguation datasets


Abstract

Entities and events in the world have no frequency, but our communication about them and the words we use to refer to them do have a strong frequency profile. Language expressions and their meanings follow a Zipfian distribution, featuring a small number of very frequent observations and a very long tail of low-frequency observations. Since our NLP datasets sample texts but do not sample the world, they are no exception to Zipf's law. This causes a lack of representativeness in our NLP tasks, leading to models that can capture the head phenomena in language and texts but fail when dealing with the long tail. We therefore propose a referential challenge for semantic NLP that reflects a higher degree of ambiguity and variation and captures a large range of small real-world phenomena. To perform well, systems would have to show deep understanding of the linguistic tail.
Moving away from Semantic Overfitting
Marten Postma, Filip Ilievski, Piek Vossen, and Marieke van Erp
http://www.understandinglanguagebymachines.org
Vrije Universiteit Amsterdam
Entities and events in the world have no frequency.
In communication, language expressions and meanings follow a Zipfian distribution:
- a small number of very frequent observations
- a very long tail of low-frequency observations
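To make the head/tail imbalance concrete, here is a minimal sketch (not from the poster) of how much probability mass a Zipfian distribution concentrates in its most frequent types; the vocabulary size and exponent s=1.0 are arbitrary illustrative choices.

```python
# Illustrative sketch: mass of the head vs. the long tail
# under a Zipfian distribution p(rank) proportional to 1/rank**s.
# n_types=10_000 and s=1.0 are assumptions, not poster data.

def zipf_masses(n_types=10_000, s=1.0):
    """Return normalized Zipfian probabilities for ranks 1..n_types."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]

probs = zipf_masses()
head = sum(probs[:100])   # the 100 most frequent types (1% of the vocabulary)
tail = sum(probs[100:])   # the remaining 9,900 types
print(f"head mass: {head:.2f}, tail mass: {tail:.2f}")
```

Under these assumptions, the top 1% of types already carries roughly half of all observations, which is why a model can score well by handling only the head.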
[Figure: two panels (2005 and 2015) comparing the popularity of the referents of "Ronaldo": Cristiano Ronaldo, Ronaldo de Lima, and Ronaldinho.]
NLP datasets sample texts but not the world:
- a lack of representativeness of long-tail phenomena:
  - models overfit semantically to head phenomena of time-bound training data
  - models underfit semantically to tail phenomena of time-bound target data
Task motivation
- incentivize deep semantic processing linked to the head and the tail phenomena
- the set of references in the long tail is enormous, excessively ambiguous, and context-dependent
- no QA task has deliberately addressed the problem of long-tail (co)reference

Approach
- an event-driven QA task
- high referential complexity
- representation of both global and local events
Requirements
- Multiple event instances per event topic, e.g. the murder of John Doe and the murder of Jane Roe
- Multiple event mentions per event instance within the same document
- Multiple documents with varying document creation times in which the same event instance is described, to capture topical information over time
- Event confusability by combining one or multiple confusion factors, e.g. polysemy, location, participants
- Representation of non-dominant events and entities, i.e. instances that receive little media coverage
Confusion factors

Confusion factor           Example
ambiguity of event forms   John Smith fires a gun / John Smith fires an employee
variance of event forms    John Smith kills John Doe / John Smith murders John Doe
time                       murder A that happened in June / murder B in October
participants               murder A committed by John Doe / murder B committed by the Roe couple
location                   murder A that happened in Zaire / murder B in Oklahoma
Task creation
1. Pick a subset of ECB+ topics. We favor seminal events (e.g. murder) whose surface forms have a high lexical ambiguity and/or variance.
2. Select one or more confusability factors.
3. Increase the number of events for an event topic, based on the confusability factors.
4. Retrieve multiple event mentions for each event. We use local news sources to ensure low dominance.
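The task-creation steps above can be sketched as simple data structures. This is a hypothetical illustration, assuming nothing about the actual ECB+ tooling; the names `EventTopic`, `EventInstance`, `CONFUSION_FACTORS`, and `build_topic` are all invented for this sketch.

```python
# Hypothetical sketch of the task-creation pipeline; names are
# illustrative, not taken from the ECB+ corpus or its tooling.
from dataclasses import dataclass, field

# The five confusion factors listed in the poster.
CONFUSION_FACTORS = {"ambiguity", "variance", "time", "participants", "location"}

@dataclass
class EventInstance:
    description: str
    mentions: list = field(default_factory=list)  # documents mentioning this instance

@dataclass
class EventTopic:
    name: str                                # e.g. "murder", a seminal event type
    factors: set = field(default_factory=set)
    instances: list = field(default_factory=list)

def build_topic(name, factors, seed_instances):
    """Steps 1-3: pick a topic, select confusion factors,
    and collect event instances for the topic accordingly."""
    unknown = set(factors) - CONFUSION_FACTORS
    if unknown:
        raise ValueError(f"unknown confusion factors: {unknown}")
    return EventTopic(name=name, factors=set(factors), instances=list(seed_instances))

topic = build_topic(
    "murder",
    {"time", "location"},
    [EventInstance("murder A in Zaire"), EventInstance("murder B in Oklahoma")],
)
# Step 4 would then retrieve mentions per instance from (local) news sources.
print(topic.name, sorted(topic.factors), len(topic.instances))
```

The sketch only enforces that selected factors come from the poster's list; the substantive work of steps 3 and 4 (growing instance sets and retrieving mentions) is left abstract.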