PresentationPDF Available

Recognizing blends: First experiments with PYBOSSA

Authors:

Abstract

In this presentation, we show how the PYBOSSA crowdsourcing platform can be used for collection of linguistic knowledge about the Dutch language. As a pilot study, we ask speakers of Dutch to recognize and analyze blends, a certain type of compound word.
Recognizing blends:
First experiments with
PYBOSSA
Peter Dekker
Tanneke Schoonheim
Instituut voor de Nederlandse Taal
enetCollect WG3&5 meeting Leiden, 24-25 October 2018
Introduction of INT
Instituut voor de Nederlandse Taal (Dutch Language Institute)
Scholarly institute in the field of the Dutch language
Central position in the Dutch-speaking world
Developer, keeper and distributor of corpora, lexica, dictionaries
and grammars
Provider of necessary building blocks of the study of Dutch
Introduction of INT
Current staff of INT:
17 (computational and corpus) linguists,
lexicographers, terminologists,
4 linguistic assistants and trainees
5 software engineers, 1 system administrator
5 administration and communication
Introduction of INT
Current projects at INT:
Contemporary and historical dictionaries and dictionary portals
Contemporary and historical corpora and lexica
Grammar portal, spelling database, terminology lists
Infrastructure, tools and data for linguistic research (CLARIN)
Introduction of INT
Relatively new projects at INT:
Development and hosting of products for educational purposes, such as
Bilingual dictionaries (New Greek, Portuguese, Estonian)
Dutch Word Combinations
Corpus Eenvoudig Nederlands (Corpus of Elementary Dutch)
Can crowdsourcing help us developing these and other educational products?
Research objective
Crowdsourcing:
Task solved by public: answer unknown
User details not important
Traditional (socio)linguistic research/survey:
Answer of task known beforehand, in many cases
User details important
Our research objective: combination
PYBOSSA can be suitable, more solutions exist
PYBOSSA installation
Hosted version (crowdcrafting.org) vs hosting on own server
Own server: existing infrastructure at INT, more flexibility
Clear installation guide on PYBOSSA website
Complexity: PYBOSSA consists of multiple software packages
Ansible script: recipe for reproducible installation
Experiments with blends: Data
Blend
Compound of two words, where parts of the words are lost
Signifies a new meaning, related to the words it consists of
Examples:
glamping (glamour + camping)
mup (millenial + yup)
Blends in English: Gries, S. T. (2004). Shouldn't it be breakfunch? A quantitative analysis of blend structure in English.
Linguistics, 639-668.
Experiments with blends: Data
Blends are part of two related projects at INT:
Algemeen Nederlands Woordenboek (ANW; Dictionary of Contemporary Dutch)
Neologism portal (upcoming)
Neologism Workflow at INT:
1. Data from newspapers and websites
2. Processed automatically, new words put aside
3. Lexicographer selects neologisms, creates entries in
a. neologism portal (all neologisms)
b. ANW dictionary (rooted neologisms only)
Experiments with blends: Crowd
Where did we find the crowd?
Newsletter Instituut voor de Nederlandse Taal
Congress Internationale Vereniging voor Neerlandistiek
Experiments with blends: Jobs
Can the crowd help in recognizing and analyzing blends?
Two jobs created in PYBOSSA:
Blend recognition
Recognize blends in a text
Blend analysis
Analyze the words a blend consists of
10 tasks per job
User interface design
Freedom in UI design: design using HTML and Javascript
PYBOSSA only loads and saves tasks from database
User details
Blends analysis
Blends recognition
Results: Age
Results: Gender
Results: Location
Results: Blends analysis
n = 326
Results: Blends analysis for preferendum
Analysis Frequency
referendum,
prefereren
154
referendum,
preferentie
60
referendum, pre 16
[Don’t know] 11
referendum,
preferent
8
Multiple word forms (noun, verb) for prefer
Multiple interpretations (at least when word is
presented without context):
referendum + to prefer
referendum + pre
Results: Blends recognition
n = 223
Results: Blends recognition for twittie
Recognized
blends
Frequency
twittie 122
twittie, fittie 56
fittie 16
twittie, tweet, fittie 5
[Do not know] 4
twittie: twitter + fittie ‘fight’ (slang)
fittie itself also occurred in text:
misinterpreted as blend
More input fields (3) than real blends per task (1):
stimulates giving more blends
User feedback
English language of PYBOSSA, while tasks are
about Dutch
Too many buttons
Task not always clear
Make welcome page attractive
Experiences with PYBOSSA
Benefits
Freedom when developing tasks
Share tasks with other researchers
Everything else (account system and loading/saving tasks) handled by PYBOSSA
Quick answers from developers via bug tracker
Drawbacks
No ready-made translation for all languages
PYBOSSA not designed for asking user details
When uploading large number of tasks, there is no clear end of job (you have to code that yourself)
User cannot easily go back to a previous task
User identification by IP address does not always work
Possible alternative for some purposes: Google Forms
Future experiments
Neologisms and dialects
User detail prediction as reward
Interesting issue:
Is PyBossa better suited for these tasks than for instance Google forms?
Conclusion
Is crowdsourcing useful for the analysis of blends?
Yes, because it gives an insight in how blends are interpreted by non-linguists.
Is PYBOSSA useful for this kind of crowdsourcing?
Yes, powerful platform, with its own strengths and drawbacks.
ResearchGate has not been able to resolve any citations for this publication.
ResearchGate has not been able to resolve any references for this publication.