COST Action CA16105
COST is supported by the
EU Framework Programme
Web corpora are valuable sources for the
development of language learning
exercises. However, the data may contain
inappropriate or even offensive language,
thus requiring data checking and filtering
before pedagogical use. We propose an
approach to cleaning up such corpora, which
we apply to Portuguese, Dutch and
Serbian as case studies.
•Evaluate the efficiency of
crowdsourcing for large-scale data
cleaning.
•Design the project properly, so that
not only are valuable and reliable
results collected, but the crowd also
feels motivated to participate.
•Create a learner's dictionary from
the cleaned web corpora.
•Create a machine learning classifier
that can classify sentences according
to the appropriateness of their
content, based on the sentences
obtained in the crowdsourcing step.
Crowdsourcing corpus cleaning for
language learning - an approach proposal
Tanara Zingano Kuhn, CELGA-ILTEC, University of Coimbra, Portugal
Peter Dekker, Dutch Language Institute, The Netherlands
Branislava Šandrih, University of Belgrade, Serbia
Rina Zviel-Girshin, Ruppin Academic Center, Israel
•Sentences to be judged by native
speakers are selected from a sample
corpus and consist of potentially good
and “bad” (offensive) sentences.
•Inappropriate sentences are included
as ground truth for analysis.
•Potentially good sentences are
extracted from the corpus with Sketch
Engine's GDEX filtering enabled, and
then filtered using a long blacklist
of offensive and controversial words.
•Bad sentences are obtained from the
corpus without GDEX filtering and
filtered using a short blacklist of
offensive words; the remainder is
kept.
•Both blacklists are automatically
extended with synonyms using
semantic similarities of words from a
word embeddings model.
•After the crowdsourcing experiment,
the contributor judgments can be fed
to a machine learning classification
model for automatic cleanup of the
remaining corpus.
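As a rough illustration, the two-way sentence selection with blacklists could be sketched as follows. The word lists are placeholders, Sketch Engine's GDEX quality filter is not reproduced, and the short-blacklist filter is read here as keeping the matching sentences, since these serve as the ground-truth inappropriate examples.

```python
# Sketch of blacklist-based sentence selection (hypothetical word
# lists; Sketch Engine's GDEX quality filter is not reproduced here).
LONG_BLACKLIST = {"slur", "offensiveword", "controversialword"}  # placeholders
SHORT_BLACKLIST = {"slur"}                                       # placeholder

def tokens(sentence):
    """Lower-cased word set of a sentence, punctuation stripped."""
    return {w.strip(".,!?;:").lower() for w in sentence.split()}

def good_candidates(sentences):
    # Potentially good: no long-blacklist word occurs at all.
    return [s for s in sentences if not tokens(s) & LONG_BLACKLIST]

def bad_candidates(sentences):
    # "Bad": at least one word from the short blacklist occurs.
    return [s for s in sentences if tokens(s) & SHORT_BLACKLIST]
```

In a real pipeline the token matching would of course use a proper tokenizer and lemmatizer for Portuguese, Dutch or Serbian rather than whitespace splitting.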
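The automatic blacklist extension could be sketched as a nearest-neighbour lookup over word vectors. The tiny hand-made embeddings and the similarity threshold below are purely illustrative; in practice the vectors would come from a word embeddings model trained on the web corpus.

```python
import numpy as np

# Hypothetical toy embeddings; a real pipeline would load vectors
# from a model trained on the corpus (e.g. word2vec or fastText).
embeddings = {
    "idiot": np.array([0.90, 0.10, 0.00]),
    "fool":  np.array([0.85, 0.15, 0.00]),
    "moron": np.array([0.80, 0.20, 0.05]),
    "kind":  np.array([0.00, 0.10, 0.95]),
}

def cosine(u, v):
    """Cosine similarity of two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def extend_blacklist(blacklist, embeddings, threshold=0.95):
    """Add every vocabulary word whose vector is close enough to a
    blacklisted word; the threshold is an illustrative knob."""
    extended = set(blacklist)
    for word in blacklist:
        if word not in embeddings:
            continue
        for other, vec in embeddings.items():
            if other not in extended and cosine(embeddings[word], vec) >= threshold:
                extended.add(other)
    return extended

extended = extend_blacklist({"idiot"}, embeddings)
```

With these toy vectors, "fool" and "moron" are pulled into the extended blacklist while "kind" is not.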
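Once contributor judgments are in, a text classifier can be trained on them. A minimal scikit-learn sketch, with made-up labelled sentences standing in for real crowd judgments; TF-IDF plus logistic regression is one plausible model choice, not the one prescribed by the poster:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up crowd judgments: 1 = appropriate, 0 = inappropriate.
sentences = [
    "The weather was lovely today.",
    "She reads a book every evening.",
    "The museum opens at nine.",
    "Our teacher explained the lesson well.",
    "You are a worthless idiot.",
    "Shut up, you stupid fool.",
    "That moron ruined everything.",
    "What an idiotic, stupid remark.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF features feeding a logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(sentences, labels)

# The fitted model can then score unseen corpus sentences.
predictions = clf.predict(["The lesson starts at ten.", "You stupid idiot."])
```

The same fitted pipeline would then be run over the remaining corpus to flag sentences for removal.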