
When to use PYBOSSA? Case studies on crowdsourcing for Dutch



Presentation on when to use PYBOSSA for language crowdsourcing, illustrated by case studies in which the Dutch Language Institute applied PYBOSSA to Dutch.
When to use PYBOSSA?
Case studies on crowdsourcing
for Dutch
Peter Dekker
Tanneke Schoonheim
Instituut voor de Nederlandse Taal
enetCollect WG1 meeting Gothenburg, 5-7 December 2018
Introduction of INT
Instituut voor de Nederlandse Taal (Dutch Language Institute)
Scholarly institute in the field of the Dutch language
Central position in the Dutch-speaking world (the
Netherlands, Flanders (Belgium), Surinam, Dutch
Developer, keeper and distributor of corpora, lexica,
dictionaries and grammars
Making resources about the Dutch language accessible for
researchers and the general public
Introduction of INT: Documenting language
Current projects at INT:
Contemporary and historical dictionaries, corpora and lexica
Grammar portal, spelling database, terminology lists
Introduction of INT: Language learning
Development and hosting of products for educational purposes, such as
Bilingual dictionaries (Modern Greek, Portuguese, Estonian)
Dutch Word Combinations
Corpus of Elementary Dutch
Research question on experimenting with PYBOSSA
How can we use crowdsourcing for documenting language and language learning?
When and how can PYBOSSA be applied for these purposes?
Our experiments: Taalradar
Blends (September 2018)
Neologisms (November 2018-present)
Language variation (November 2018-present)
Blends: Data
Compound of two words, where parts of the words are lost
Signifies a new meaning, related to the words it consists of
glamping (glamour + camping)
mup (millennial + yup)
Can the crowd help in recognizing and analyzing blends?
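The definition above (a blend keeps the start of one word and the end of another) can be sketched as a small check. This is only an illustration, not the analysis used in the experiment; the function name `blend_splits` is hypothetical, and it assumes the simplest case, where the blend is a prefix of the first word followed by a suffix of the second.

```python
def blend_splits(blend, first, second):
    """Return every (prefix, suffix) pair such that blend == prefix + suffix,
    prefix starts `first`, and suffix ends `second`.

    Assumption: only the simple "start of word 1 + end of word 2" pattern
    is covered; real blends can be messier.
    """
    splits = []
    for cut in range(1, len(blend)):
        prefix, suffix = blend[:cut], blend[cut:]
        if first.startswith(prefix) and second.endswith(suffix):
            splits.append((prefix, suffix))
    return splits


print(blend_splits("glamping", "glamour", "camping"))
print(blend_splits("mup", "millennial", "yup"))
```

Several splits are usually possible for one blend, which is one reason a human judgment from the crowd is useful.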
Blends: Crowd
Where did we find the crowd?
Newsletter Instituut voor de Nederlandse Taal (3873 subscribers, 519 clicks)
Congress Internationale Vereniging voor Neerlandistiek
Announcements on linguistics blog, Twitter, LinkedIn
Result: 549 participants, spread across two tasks
PYBOSSA installation
Hosted version vs. hosting on own server
Own server: existing infrastructure at INT, more flexibility
Clear installation guide on PYBOSSA website
Complexity: PYBOSSA consists of multiple software packages
Ansible script: recipe for reproducible installation
User details
Blends analysis
Results: Blends analysis
n = 326
Results: Blends analysis for preferendum
Analysis            Frequency
referendum, pre     16
[Don’t know]        11
Multiple word forms (noun, verb) for prefer
Multiple interpretations (at least when word is
presented without context):
referendum + to prefer
referendum + pre
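A frequency table like the one above can be produced by counting identical answers. The sketch below is a minimal illustration under assumptions: each contribution is taken to be a dictionary with an `info` field holding the answer, and the key name `analysis` is hypothetical. The toy input reproduces only the two rows listed above.

```python
from collections import Counter


def tally_analyses(task_runs, field="analysis"):
    """Count how often each analysis was submitted.

    Assumption: every task run stores the contributor's answer under
    run["info"][field]; the field name is hypothetical.
    """
    return Counter(run["info"][field] for run in task_runs)


# Toy input built from the two frequencies shown on the slide.
runs = ([{"info": {"analysis": "referendum, pre"}}] * 16
        + [{"info": {"analysis": "[Don't know]"}}] * 11)

for analysis, frequency in tally_analyses(runs).most_common():
    print(f"{analysis}\t{frequency}")
```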
General experiences with PYBOSSA
Freedom when developing tasks
Share tasks with other researchers. Our repository:
Account system and loading/saving tasks handled by PYBOSSA
Quick answers from developers via bug tracker
No ready-made translation for all languages
No kiosk mode: multiple anonymous logins from same computer not allowed
No built-in possibility to stop after number of tasks and show end screen
User cannot go back to a previous task
PYBOSSA for linguistic crowdsourcing?
PYBOSSA designed for "pure" crowdsourcing
Problems when using PYBOSSA for linguistic crowdsourcing
No built-in support for asking user details
Has to be presented as task
User details openly visible
User identification by IP address: problem when using in classroom
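Why IP-based identification is a problem in a classroom can be shown with a small sketch. It assumes contributions carry a `user_id` (set for registered users) and a `user_ip` (used for anonymous users); the helper `count_contributors` and the toy data are hypothetical.

```python
def count_contributors(task_runs):
    """Count distinct contributors, falling back to the IP address
    when no account is attached to a contribution.

    Assumption: each run is a dict with "user_id" and "user_ip" keys.
    """
    identities = set()
    for run in task_runs:
        if run.get("user_id") is not None:
            identities.add(("user", run["user_id"]))
        else:
            identities.add(("ip", run.get("user_ip")))
    return len(identities)


# Three anonymous students behind the same school network address
# (192.0.2.10 is a reserved documentation address).
classroom = [{"user_id": None, "user_ip": "192.0.2.10"} for _ in range(3)]
print(count_contributors(classroom))
```

Three students are counted as a single contributor here, so the platform cannot tell their answers apart.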
Alternatives: Survey tools
Survey tools can be an alternative
If you do not need the power and customizability of PYBOSSA
If mentioned drawbacks are problematic for you
Google Forms: bad for privacy, data stored on external server
Open source: installation and data on own server
LimeSurvey: limited free version
JD Esurvey
Our experiments: Taalradar
Blends (September 2018)
Neologisms (November 2018-present)
Language variation (November 2018-present)
Crowdsourcing for documenting language:
Automating lexicography
Automate lexicographic workflow for neologisms
Neologism workflow at INT:
1. New words automatically collected from newspapers and websites
2. Collect judgments on durability of neologisms via crowdsourcing
3. Lexicographer selects neologisms, added to:
a. neologism portal (all neologisms)
b. ANW dictionary (rooted neologisms only)
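The three workflow steps above can be sketched roughly as follows. This is a toy illustration, not the pipeline used at INT: all function names are hypothetical, the "lexicon lookup" in step 1 is a simplification, and the 1-5 rating scale and the ratings themselves are invented for the example. The final selection remains with the lexicographer; the code only produces a shortlist.

```python
from collections import defaultdict
from statistics import mean


def find_candidates(tokens, known_words):
    """Step 1 (simplified): keep words that are absent from a known lexicon."""
    candidates = []
    for token in tokens:
        word = token.lower()
        if word.isalpha() and word not in known_words and word not in candidates:
            candidates.append(word)
    return candidates


def mean_judgments(judgments):
    """Step 2: average the crowd's durability ratings per word.

    `judgments` is an iterable of (word, rating) pairs; the 1-5 scale
    is an assumption made for this sketch.
    """
    per_word = defaultdict(list)
    for word, rating in judgments:
        per_word[word].append(rating)
    return {word: mean(ratings) for word, ratings in per_word.items()}


def shortlist(scores, threshold):
    """Step 3 (support only): words to show the lexicographer, best first."""
    kept = [word for word, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda word: -scores[word])


tokens = "We gingen glamping met een mup".split()
known = {"we", "gingen", "met", "een"}
print(find_candidates(tokens, known))

# Invented ratings, for illustration only.
scores = mean_judgments([("glamping", 5), ("glamping", 4), ("mup", 2), ("mup", 1)])
print(shortlist(scores, threshold=3.0))
```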
Crowdsourcing for language learning
First experiments to get to know PYBOSSA, more focused on language documentation in general than
on language learning in particular
Use crowdsourcing to make dictionary material accessible for language learning
Add associated words to a given list of entries (horse: riding, saddle, pony, tail, chess, gymnastics,
…) to make clusters of sense-related associations, for instance for helping people with
Is PYBOSSA useful for linguistic crowdsourcing?
Is PYBOSSA useful for crowdsourcing for educational purposes?
Is PYBOSSA useful for crowdsourcing for language learning?
Yes, powerful platform, with its own strengths and drawbacks.
But you always have to be aware of what you ask from the crowd.
We are happy to share our tasks or collaborate!
... Pybossa was chosen as the crowdsourcing platform because a) it is free and b) the custom tasks (interface) can be written in JavaScript. In addition, one of the team members of the research project has robust experience with using Pybossa in other crowdsourcing projects (Dekker & Schoonheim 2018a, 2018b) and has direct access to a local installation (INL), which ensures that the output data can be kept safely. A multilanguage (Portuguese, Serbian, Dutch and Slovene) crowdsourcing project was created with a common landing page, where the crowd was first asked to pick their language and was then transferred to the corresponding language home page. ...
Corpora are valuable sources for the development of language learning materials (e.g., books, grammars, dictionaries, exercises), because they contain language as produced in natural contexts. Even though corpora are getting larger, mainly due to crawling data from the web, their pedagogical use remains rather challenging. Not all texts are appropriate for language learning or teaching purposes, as they can contain sensitive or offensive content, in addition to exhibiting structural problems and errors, among other issues. Cleaning a corpus for pedagogical purposes is, however, very time-consuming if done manually. In this paper we present a new and more effective method for creating problem-labelled pedagogical corpora for a group of languages, namely Portuguese, Serbian, Slovene, Dutch and Estonian, by means of crowdsourcing. First, we report on an experiment aimed at verifying the adequacy of crowdsourcing as a technique for corpus labelling. We then outline the lessons learned and discuss how these have led us to explore an alternative way of compiling pedagogical corpora through gamification.