
Crowdsourcing corpus cleaning for language learning - an approach proposal

  • Centre for General and Applied Linguistics Studies (CELGA-ILTEC)
  • Faculty of Philology, Belgrade University
COST Action CA16105. COST is supported by the EU Framework Programme Horizon 2020.
Web corpora are valuable sources for the
development of language learning
exercises. However, the data may contain
inappropriate or even offensive language,
thus requiring data checking and filtering
before pedagogical use. We propose a
language-independent crowdsourcing
approach to clean up such corpora, which
we apply to Portuguese, Dutch and
Serbian as case studies.
Evaluation of the efficiency of crowdsourcing for large-scale data
Proper design of the project, so that not only valuable and reliable results can be collected, but the crowd also feels motivated to participate.
Future work
Create a learner's dictionary from cleaned web corpora
Create a machine learning classifier that classifies sentences according to the appropriateness of their content, based on the sentences obtained after the crowdsourcing step
Crowdsourcing corpus cleaning for
language learning
- an approach proposal
Tanara Zingano Kuhn, CELGA-ILTEC, University of Coimbra, Portugal
Peter Dekker, Dutch Language Institute, The Netherlands
Branislava Šandrih, University of Belgrade, Serbia
Rina Zviel-Girshin, Ruppin Academic Center, Israel
Approach proposal
Sentences to be judged by native
speakers are selected from a sample
corpus and consist of potentially good
and “bad” (offensive) sentences.
Inappropriate sentences are included
as ground truth for analysis.
Potentially good sentences are
extracted from the corpus, with
Sketch Engine GDEX filtering on, and
filtered using a long blacklist of
offensive and controversial words.
Bad sentences are obtained from the
corpus, without GDEX filtering, and
filtered using a short blacklist of
offensive words; the remainder is kept.
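The two selection steps can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes that good candidates are GDEX-filtered sentences containing no word from the long blacklist, and that bad candidates are unfiltered sentences containing at least one word from the short blacklist. The function names `select_good` and `select_bad` and the simple regex tokenizer are hypothetical.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (simplified)."""
    return re.findall(r"\w+", sentence.lower())


def select_good(gdex_sentences, long_blacklist):
    """Keep GDEX-filtered sentences that contain no long-blacklist word.

    Assumption: the long blacklist is used to exclude sentences.
    """
    blocked = {w.lower() for w in long_blacklist}
    return [s for s in gdex_sentences if not blocked & set(tokenize(s))]


def select_bad(raw_sentences, short_blacklist):
    """Keep unfiltered sentences that contain a short-blacklist word.

    Assumption: the short blacklist is used to find offensive sentences.
    """
    blocked = {w.lower() for w in short_blacklist}
    return [s for s in raw_sentences if blocked & set(tokenize(s))]
```

Matching on whole tokens rather than substrings avoids flagging harmless words that merely contain a blacklisted string.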
Both blacklists are automatically
extended with synonyms using
semantic similarities of words from a
word embeddings model.
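One way to picture the blacklist extension is a nearest-neighbour lookup by cosine similarity. The sketch below is only an assumption about how such an extension could work: the embeddings are a plain dictionary of toy vectors, and `extend_blacklist`, `top_n` and `min_similarity` are hypothetical names and thresholds, not values from the poster.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extend_blacklist(blacklist, embeddings, top_n=2, min_similarity=0.8):
    """Add the nearest embedding neighbours of each blacklisted word.

    embeddings: dict mapping word -> vector (hypothetical toy model).
    Only the top_n neighbours above min_similarity are added.
    """
    extended = set(blacklist)
    for word in blacklist:
        if word not in embeddings:
            continue  # words unknown to the model cannot be expanded
        scored = [
            (cosine(embeddings[word], vec), other)
            for other, vec in embeddings.items()
            if other != word
        ]
        scored.sort(reverse=True)
        for similarity, other in scored[:top_n]:
            if similarity >= min_similarity:
                extended.add(other)
    return extended
```

In practice the vectors would come from a trained word embeddings model for the target language, and the automatically added words would likely need a manual check, since embedding neighbours are not always synonyms.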
After performing the crowdsourcing
experiment, contributor judgments
can be fed to a machine learning
classification model, for automatic
cleanup of the remaining corpus.
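As a rough illustration of this last step, contributor judgments could be aggregated per sentence and used as training labels. The sketch below assumes majority voting and a small bag-of-words Naive Bayes model with add-one smoothing; the poster does not name an aggregation method or a model, so both choices, along with all function names, are assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (simplified)."""
    return re.findall(r"\w+", sentence.lower())


def majority_label(judgments):
    """Most frequent label among one sentence's judgments.

    On a tie, the label seen first wins.
    """
    return Counter(judgments).most_common(1)[0][0]


def train_naive_bayes(labelled):
    """Count words per label from (sentence, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for sentence, label in labelled:
        label_counts[label] += 1
        for token in tokenize(sentence):
            word_counts[label][token] += 1
            vocab.add(token)
    return word_counts, label_counts, vocab


def classify(sentence, model):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(sentence):
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A usage example with invented judgments:

```python
judgments = {
    "You are a stupid fool.": ["bad", "bad", "good"],
    "The weather is nice today.": ["good", "good", "good"],
}
labelled = [(s, majority_label(j)) for s, j in judgments.items()]
model = train_naive_bayes(labelled)
label = classify("That was stupid.", model)
```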
This paper reports on an assessment task carried out among students of Tallinn University and the University of Tartu, who speak Estonian at B2-C1 proficiency level, and among lexicographers working at the Institute of the Estonian Language. The purpose of the task was to determine whether, according to the above two types of annotators, authentic and unedited corpus sentences would be suitable as example sentences for learners’ dictionaries on B2-C1 level. The results of the assessment task were also to help evaluate the output of version 1.4 of the Estonian module of GDEX (GDEX 1.4) used to choose and display web sentences in the Institute’s new language portal Sõnaveeb. GDEX (Good Dictionary Example) is a function of the corpus query system Sketch Engine, designed to find optimal example sentence candidates from large corpora. The results of the assessment task confirmed three hypotheses: 1) Before displaying authentic corpus sentences to end-users, a filtering of corpus sentences is necessary; 2) GDEX 1.4 can identify good example candidates from corpora and filter out inappropriate candidates; 3) example sentences compiled by lexicographers are suitable example sentences. Both types of annotators considered as many as 96% of the dictionary examples to be suitable example sentences and 85% of corpus sentences chosen as good examples by GDEX 1.4. Only 6% of the sentences that were discarded by GDEX 1.4 were considered as suitable, meaning that 94% of the bad candidates had been filtered out successfully. As for unfiltered corpus sentences, 60% of those were considered unsuitable. When asking for the annotators’ reasons for considering a sentence unsuitable, the most common arguments were that the sentences include anaphora and hence need more context, or that the sentences are colloquial, too long or too short. © 2019, Estonian Association Applied Linguists. All rights reserved.