enetCollect
COST Action CA16105
http://enetcollect.eurac.edu/
enetcollect@gmail.si
COST is supported by the EU Framework Programme Horizon 2020

Introduction
Web corpora are valuable sources for the development of language learning exercises. However, the data may contain inappropriate or even offensive language, and therefore requires checking and filtering before pedagogical use. We propose a language-independent crowdsourcing approach to clean up such corpora, which we apply to Portuguese, Dutch and Serbian as case studies.
Challenges
•Evaluation of the efficiency of
crowdsourcing for large-scale data
processing.
•Designing the project so that it not only collects valuable and reliable results but also keeps the crowd motivated to participate.
Future work
•Create a learner's dictionary from the cleaned web corpora
•Create a machine learning classifier that labels sentences by the appropriateness of their content, trained on the sentences obtained in the crowdsourcing step
Crowdsourcing corpus cleaning for
language learning
- an approach proposal
Tanara Zingano Kuhn, CELGA-ILTEC, University of Coimbra, Portugal
tanarazingano@outlook.com
Peter Dekker, Dutch Language Institute, The Netherlands
peter.dekker@ivdnt.org
Branislava Šandrih, University of Belgrade, Serbia
branislava.sandrih@fil.bg.ac.rs
Rina Zviel-Girshin, Ruppin Academic Center, Israel
rinazg@gmail.com
Approach proposal
•Sentences to be judged by native
speakers are selected from a sample
corpus and consist of potentially good
and “bad” (offensive) sentences.
•Inappropriate sentences are included
as ground truth for analysis.
•Potentially good sentences are extracted from the corpus with Sketch Engine GDEX filtering enabled, and are then filtered using a long blacklist of offensive and controversial words.
•Bad sentences are obtained from the corpus without GDEX filtering and are filtered using a short blacklist of offensive words; the remainder is kept.
•Both blacklists are automatically extended with synonyms, using the semantic similarity of words in a word embeddings model.
•After the crowdsourcing experiment, contributor judgments can be fed to a machine learning classification model for automatic cleanup of the remaining corpus.
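The blacklist-based selection of candidate sentences described above could be sketched as follows. All names here are hypothetical, and a simple sentence-length check stands in for Sketch Engine's GDEX scoring, which is not reproduced here:

```python
import re

def build_candidates(sentences, long_blacklist, short_blacklist):
    """Split corpus sentences into 'potentially good' and 'bad' pools.

    Sketch of the poster's selection step: a length check stands in for
    GDEX filtering; blacklists are sets of lowercase words.
    """
    def contains_any(sentence, blacklist):
        words = set(re.findall(r"\w+", sentence.lower()))
        return any(term in words for term in blacklist)

    good, bad = [], []
    for s in sentences:
        # "Good" pool: passes the GDEX stand-in and the long blacklist.
        if 5 <= len(s.split()) <= 20 and not contains_any(s, long_blacklist):
            good.append(s)
        # "Bad" pool: matches the short blacklist of offensive words,
        # serving as ground truth for the crowdsourcing experiment.
        elif contains_any(s, short_blacklist):
            bad.append(s)
    return good, bad
```

Keeping the two pools separate lets the ground-truth "bad" sentences be mixed into the judging task to check contributor reliability.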
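The automatic blacklist extension could work roughly as below: any vocabulary word whose embedding is sufficiently close to a blacklisted word is added. The tiny 2-d vectors and the 0.7 threshold are illustrative stand-ins for a real word embeddings model:

```python
import math

def expand_blacklist(blacklist, vectors, threshold=0.7):
    """Add vocabulary words whose embedding is close to any blacklist word.

    `vectors` maps each word to its embedding (a list of floats); in
    practice this would come from a trained word embeddings model.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    expanded = set(blacklist)
    for word, vec in vectors.items():
        if word in expanded:
            continue
        # Keep the word if it is semantically close to any seed term.
        if any(seed in vectors and cosine(vec, vectors[seed]) >= threshold
               for seed in blacklist):
            expanded.add(word)
    return expanded
```

With a real model, a nearest-neighbour query (e.g. the top-k most similar words per seed term) would replace the full vocabulary scan.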