Conference PaperPDF Available

Towards educating and motivating the crowd-a crowdsourcing platform for harvesting the fruits of NLP students' labour

Authors:

Abstract

This paper presents an idea to bring crowdsourcing to a higher level, for the purpose of acquiring valuable machine translation and natural language processing resources. In the proposed scenario, students are being educated in order to improve the quality and effectiveness of their natural language processing (NLP) related work. Their motivation is ensured by introducing an element of gamification-a ranking is kept, where the best contributing users are decorated with medals. The ranking is available at all times to all users and is always up-to-date, hence the effects of the contributions are immediately visible to the users. This scenario was applied to a group of students enrolled in Natural Language Processing course, who were presented with a task of collecting parallel corpora for less-resourced language pairs, in this case Croatian-English and English-Croatian. The whole experiment was supervised with the help of a custom-made open-source system named TMrepository, developed and maintained by the authors of this paper.
Towards educating and motivating the crowd – a crowdsourcing platform
for harvesting the fruits of NLP students' labour
Rafał Jaworski1, Sanja Seljan2 and Ivan Dunđer2
1Adam Mickiewicz University in Poznań, Poland
rjawor@amu.edu.pl
2University of Zagreb, Croatia
{sseljan,idundjer}@ffzg.hr
Abstract
This paper presents an idea to bring crowdsourcing to a higher level, for the purpose of acquiring valuable machine translation and
natural language processing resources. In the proposed scenario, students are being educated in order to improve the quality and
effectiveness of their natural language processing (NLP) related work. Their motivation is ensured by introducing an element of
gamification a ranking is kept, where the best contributing users are decorated with medals. The ranking is available at all times to all
users and is always up-to-date, hence the effects of the contributions are immediately visible to the users. This scenario was applied to a
group of students enrolled in Natural Language Processing course, who were presented with a task of collecting parallel corpora for less-
resourced language pairs, in this case Croatian-English and English-Croatian. The whole experiment was supervised with the help of a
custom-made open-source system named TMrepository, developed and maintained by the authors of this paper.
Keywords: crowdsourcing, gamification, NLP, machine translation resources, parallel corpora, sentence alignment, less-resourced
languages, Croatian, TMrepository
1. Introduction
Acquiring corpora is an essential task when it comes to
machine translation or other aspects of natural language
processing. However, due to the lack of available digitised
textual corpora the preparation of crucial language
resources becomes very challenging, especially for low-
resourced languages, such as Croatian.
Globalisation and the increase in user-generated content
have boosted the demand for translation. At the same time,
the development of computer-aided translation tools has
enabled virtual collaboration, which is evident in all
translation scenarios and across the whole translation
process from authors to publishers, from translation
agencies to translators (O'Brien, 2011). Collaborative
translation essentially means that two or more agents, i.e.
translators, cooperate in some way to produce a
translation, with the possible inclusion of machine
translation and computer-assisted translation.
Gamification can be defined as the application of game
design principles to change behaviour in non-gaming
context (Robson et al., 2016). If implemented and applied
properly, gamification can increase engagement of all
participants and the interactivity of elements in a specific
environment.
Gamification primarily aims at increasing the positive
motivations of users towards specific activities or the use
of technology, and thereby, increasing the quantity and
quality of the output or outcome of the given activities
(Morschheuser et al., 2017).
However, applying gamification is not straightforward, as
the source of idea, i.e. games, are complex and therefore
difficult to incorporate in other environments.
Gamification also involves understanding and deeper
knowledge of psychology. Furthermore, gamification is
commonly also to affect users’ behaviour which increases
the complexity of application and design of gamification
(Morschheuser et al., 2017).
Crowdsourcing refers to recruiting an undefined, i.e. very
often unknown, large group of people by making an open
call, to handle a specific task which would otherwise be
assigned to in-house employees (O'Brien, 2011).
The basic idea of crowdsourcing derives from the fact that
labour of a talented crowd isn’t always free or cheap, but
it costs less than paying traditional in-house employees
(Howe, 2006).
Crowdsourcing allows the power of the crowd to
accomplish tasks that were once the province of just a
specialised (Howe, 2008).
The paper presented by (Quinn and Bederson, 2011) points
out several dimensions of crowdsourcing: motivation,
quality, aggregation, human skills, participation time and
cognitive load. The paper categorises crowdsourcing into
seven genres: Game with a purpose, Mechanised labour
with payment, Wisdom of crowds with hundreds of people
participating and thinking independently, Crowdsourcing
by the unpaid general public usually motivated by
curiosity, Dual-purpose work in order to perform a task
which can’t be performed automatically (e.g. transcribing
old scanned books which cannot be processed by OCR),
Grand search with the task of finding the solution for a
specific problem, and Knowledge collection from
volunteer contributors based on the idea to create large
databases of common facts.
2. Related work
The following subsections discuss related work associated
with gamification and crowdsourcing in natural language
processing
2.1. Gamification
Gamification has been successfully applied for the purpose
of machine translation evaluation. For instance, in one
research evaluators were required to provide a quality
score, and in return they received feedback in form of stars
that reflect how close their very score is to a gold-standard
score (Abdelali et al., 2016). The intention of such a simple
gamification strategy was to keep the evaluators engaged.
2.2. Crowdsourcing in NLP
The paper by (Sabou et al., 2014) presents a research on
crowdsourcing used for the acquisition of annotated
corpora, which are crucial for experimenting with natural
language processing algorithms. In the paper,
crowdsourcing is approached as a project-based task
divided into specific phases.
Another paper shows that crowdsourcing is used for the
collection of training data and to perform annotation on
clustering and parsing, backed with a statistical natural
language processing analysis (Li et al., 2015). However,
the authors point out that crowdsourcing in science is still
not a recognised research method.
Crowdsourcing can also be used for the collection of
unstructured documents and reports in big open data
(Munro et al., 2012). In the research, several methods have
been applied, such as search-log-based detection, machine
learning techniques and corrected crowdsourced
annotation with the ultimate goal to detect seasonal
epidemics on a global scale.
Crowdsourcing is nowadays being used in order to satisfy
the translation demand, as collaborating on translation
projects clearly has benefits (O'Brien, 2011).
Another paper also discusses the potential of collaborative
work for natural language processing tasks (Sabou et al.,
2012).
A collaborative three-phase workflow for language
translation as a cost-effective crowdsourcing approach was
also proposed (Ambati et al., 2012). The given workflow
breaks an atomic task of sentence translation into separate
phases: word translation, assisted sentence translation and
synthesis of translations. Such sub-tasks are at all times
consistent and verifiable, and enable better translations for
a vast range of diverse non-expert translators.
Another research confirmed that crowdsourcing can play a
very important role in building necessary language
resources (Zaidan and Callison-Burch, 2011). By using
Amazon’s Mechanical Turk, it is indeed possible to obtain
high-quality translations from non-professional translators
while keeping the overall costs below the price of
professional translations.
A similar approach shed some more light on the obstacles
and challenges when formulating crowdsourcing
translation tasks on Amazon’s Mechanical Turk for the
purpose of machine translation (Ambati and Vogel, 2010).
One of the important conclusions was that when working
with non-professional translators it is very important to
address quality concerns alongside checking for any usage
of automatic translation services.
One paper considers crowdsourcing translation to be the
next big breakthrough in the translation industry (Muntés-
Mulero et al., 2012). Interactive, elastic and scalable
machine translation will be important to assist the crowd,
whereas by harvesting the crowd potential to deal with
tasks such as translating or post-editing, the translation
environment becomes capable of increasing its capacity to
quickly deal with large amounts of data and information.
Approaches to teaching computer-assisted translation in
the 21st century are tightly connected with the rise of
emerging translation standards and fast-growing
technologies in the translation industry (Muegge, 2013).
Nowadays, teaching computer-assisted translation heavily
relies on collaborative translation, machine translation,
translation management systems, and especially
crowdsourcing. Taking advantage of a cloud-based and
open-source learning management system in combination
with the crowdsourcing principle, for teaching purposes is
proposed. Such an online learning system should support
students with any internet-enabled computer device, give
them access to translation technology from anywhere and,
at the same time, reduce teaching costs. It should also
enable students to participate in real-time collaborative
work and editing exercises (e.g. NLP tasks, post-editing of
machine translations etc.), allow them to submit
translation-specific assignments and to track their
progress.
Distance and blended learning are also vital features of the
training and teaching environment for NLP-related tasks
(Canovas and Samson, 2011). Classroom facilities also
dictate and limit teaching approaches, therefore, allowance
must be made for hardware diversity. Other aspects such
as application of different operating systems, available
computer programs etc. influence what is perceived the
most suitable for specific training objectives.
3. The TMrepository system as a
crowdsourcing platform
TMrepository is a custom-made web application created
for the purpose of this research. It is being constantly
developed and maintained by the authors of this paper, and
it is publicly available at:
http://concordia.vm.wmi.amu.edu.pl/tmrepository/
Its main goal is to provide an easy-to-use repository of
translation memories, aligned at the sentence level. In this
particular research it was used to store and handle
Croatian-English and English-Croatian translation
memories. TMrepository also features gamification-
inspired functionalities in order to boost the motivation of
contributors.
TMrepository is intended to be used in the future by
computer science and information science students that
work on machine translations and other NLP-related tasks,
or students of translation studies, who produce their own
translation memories during their education process.
Registration to the system is free and open. Upon login,
the user is presented with the list of their own
contributions.
The main activity of the user in the system is uploading
new resources. This is done by the means of the upload
form. In the form, the user is asked to provide the title of
the translation memory, its brief description and type of
resource. Available types are the following:
manual translation
manual translation - automatically aligned
corpus - automatically aligned
renowned corpus
Translation memories of the type "manual translation" are
produced by human translators during their work with a
Computer-Aided Translation (CAT) tool. They translate
documents in a sentence-by-sentence mode, therefore the
resulting TM is already perfectly aligned at the sentence
level and the quality of the translations is assumed to be
high. Another option is "manual translation - automatically
aligned". It is reserved for translation memories which
were produced by human translators but were not aligned
at the sentence level, e.g. a collection of MS Word
documents, where source and target texts are in separate
documents. Such documents must be split into sentences
and aligned. While this can be done in external software,
the TMrepository provides a feature of automatic splitting
and aligning of a pair of MS Word documents. However,
as this process is done automatically, it may not yield
perfect alignments (extensive research on the impact of the
automatic splitting and aligning quality is planned for
future research). Therefore, the translation memory type
"manual translation - automatically aligned" has slightly
lower expected quality than the "manual translation".
Another type of translation memories in the system is
"corpus - automatically aligned". These translation
memories are collected automatically from various
sources, predominantly from multilingual websites. The
collection process includes automatic scraping of source
and target texts, splitting into sentences and aligning. As
in this case the text scraping is another process done
automatically, the quality of resulting translation
memories is expected to be lower than "manual translation
- automatically aligned".
The last option, "renowned corpus", refers to already
existing and aligned corpora, whose quality is expected to
be high and which do not need quality checking, e.g.
Europarl (Koehn, 2005). As one of the very important
goals of TMrepository is to concentrate in one system all
available resources for a specific language pair, the authors
decided to allow users to upload already well-known and
precompiled high standard corpora.
This study focuses on collecting translation memories of
the type "corpus - automatically aligned'', understood as
resources automatically aligned by the contributing user,
consisting of translations performed by other people.
Available import formats include: a pair of text files, a
TMX file and a pair of Word documents. Import from TXT
files assumes that two text files with UTF-8 encoding,
having equal number of lines are provided. This form
reduces various obstacles of text discrepancies, as
presented in Seljan et al. (2007). One of the files contains
sentences in L1, and the other in L2, accepting only 1:1
alignment. TMX files are processed in a natural manner
L1 and L2 are clearly marked in the TMX file.
TMrepository imports units (tu) with a single variant (tuv)
in L1, and a single variant in L2. In order to avoid problems
with excessive memory consumption for large TMX files,
a custom made stream TMX parser was implemented. The
last option, a pair of Word documents in DOC or DOCX
format, automatically aligns any two Word documents on
sentence level. The segmentation is performed by an SRX-
based splitter and the alignment is done by the hunalign
software (Varga et al., 2005). There are no assumptions as
to the format of documents.
Furthermore, an additional gamification element for
evaluation purposes is implemented into the system.
Namely, users fluent in languages of the parallel data are
encouraged to engage in reviewing of uploaded aligned
corpora in the system. A user can perform a review of 50
translation units (in one go) selected at random out of all
units that do not come from renowned corpora and have
not been checked before. Therefore, all registered users
can participate in the evaluation process and are rewarded
10 points for each reviewed translation unit, which is equal
to uploading a new translation memory of 10 units. Each
translation unit can be accepted, which is the default and
more frequently chosen option, or rejected.
For every translation memory a specific translation
memory overview report is created on demand, which
shows how many translation units of the current translation
memory were checked and how many of those were
accepted. It serves as an indicator of the translation
memory quality.
The ranking lists all the contributing users, sorted by the
total number of sentence pairs they uploaded in a
descending order. First three users are graphically exposed
Fig. 1Ranking of contributing users
and awarded virtual medals. This element is used to
introduce competitiveness among the users and thus
increase their motivation. Importantly, the ranking also
reveals the titles of translation memories contributed by
the users. This is done in order to provide inspiration for
other users, concerning potential sources of corpora.
For instance, noticing that people are uploading Lord of
the Rings and other books might direct the search for
corpora to some other book titles. Similarly, the fact of
uploading TV manuals by one of the users might inspire
others to acquire different translated user manuals and
technical documentation. At all times, every user is able to
see the current ranking (see Fig. 1).
The administrators of the TMrepository system have full
access to all translation memories uploaded by the users
via the administrative panel. The panel also shows the
current translation unit (i.e. sentence pair) count in all the
corpora.
4. Experimental scenario and results
This paper presents a study on applying crowdsourcing
and gamification methods to a group of students with the
use of the TMrepository system. The initial experiments
were conducted in Poland.
A total of seven students who were taking part in the
experiment were participating in an academic course of
Natural Language Processing, as a part of their computer
science studies. Before the experiment, they had
completed between four to six semesters of their studies.
Therefore, their background covered areas, such as: basic
algorithms and C++ programming, intermediate
programming, object-oriented programming (C\#, Java or
PHP), basic web applications, mathematical analysis,
algebra, logic and set theory. The background knowledge
and technical skills were helpful in crawling the web and
acquiring parallel data.
The students had no prior training in linguistics, machine
translation or natural language processing. Furthermore,
they were all native Polish speakers who did not speak
Croatian, which was one of the reasons to include the
reviewing feature in the system. Even though Polish and
Croatian, as two Slavic languages, and are to some extent
similar, they are not mutually understandable to speakers
of only one of these languages. Nevertheless, the system
interface is designed in English and is suitable for users
who are not familiar or fluent in languages of the parallel
data.
During the course, the students were first taught the basics
of natural language processing using Linux bash scripts
and the Python programming language. Exercises covered,
among others: character and word frequency calculations,
extracting and counting n-grams, sorting lines of input by
different criteria. Afterwards, the students were given a
lecture presenting the basics of web scraping, with the use
of simple command-line tools such as wget and Python’s
urllib module. The lecture also mentioned the web
crawling software framework PyCrawler.
After the lectures, the students were asked to start a project
of their choice. They chose from a variety of suggested
subjects, such as spell checkers, chatbots, lemmatisers,
surname inflectors and many others. One of the suggested
tasks was collecting Croatian-English or English-Croatian
translation memories. The students who chose this task
were instructed to acquire translation memories by "any
means necessary", and to upload the harvested resources
to the TMrepository system. The system was presented and
explained to the students beforehand, as part of the training
required in order to be able to use the system.
In total, 996343 translation units were collected during the
study (see Fig. 1 which reports the actual numbers of units
of participating users). Most common sources for parallel
corpora included: commonly known Croatian-English
parallel corpora, such as SETIMES or Ted Talks, but also
books, technical documentation, song lyrics and business
correspondence documentation.
The main characteristics of the collected parallel data are
discrepancies in size of the translation memories, diversity
of domains and domain independence, language register
differences and style diversities. For instance, as the
uploaded renowned corpora were very large in size, the
distribution of collected units is very skewed across
participants, i.e. one participant collected more than half of
the total units, whereas another collected only 25 units.
The system also allows users to update collected resources.
Taking into account that only seven participants took part
in this pilot study, the authors consider the magnitude of
the collected language resources a promising success.
Furthermore, the resulting corpus contains interesting,
original and previously unseen language resources, such as
the song lyrics corpus.
5. Conclusions and future work
Data collection is traditionally very important in machine
translation and other natural language processing tasks.
Parallel data is frequently being used for training machine
translation models, tuning machine translation systems,
extracting data for developing monolingual statistical
language models, building translation memories etc.
Parallel corpora are also being utilised as an input for a
vast range of natural language processing tasks.
Unfortunately, the acquisition of large amounts of parallel
data is very time-consuming and tedious. Furthermore,
parallel data is often collected biasedly, i.e. steered by
various motivations. Crowdsourcing, on the other hand,
can be initiated by an open call directed to the unknown
public, or, as in this case, directed to a group of NLP
students in order to harvest their labour.
As a result, the authors created a new tool, i.e. a custom-
built crowdsourcing platform named TMrepository for
which students managed to collect a total of nearly one
million translation units. The collected data is ready to be
exploited for the purpose of machine translation and in a
variety of NLP tasks.
The authors’ main focus for future work lies on building
machine translation systems that are fed with corpora
collected by the proposed method. Furthermore, the
authors plan to combine the TMrepository system with
other developed NLP software, such as Concordia a tool
which encompasses capabilities of standard concordance
searchers and translation memory systems (Jaworski et al.,
2017). The process of collecting resources with the help of
students will be continued, resulting in a constantly
expanding repository of machine translation resources, i.e.
parallel corpora. As for now, the system does not check for
duplicate translation units, and therefore, it is possible to
upload the same translation units. However, an intelligent
deduplication mechanism is planned in the near future.
Considering the variety, uniqueness and magnitude of the
corpora, the authors expect the performance of the
machine translation pipeline and possibly other natural
language processing tools trained with the acquired data to
be competitive among other solutions for a less-resourced
language, such as Croatian.
References
Abdelali, A., Durrani, N. and Guzmán, F. (2016)
iAppraise: A Manual Machine Translation Evaluation
Environment Supporting Eye-tracking. In Proceedings of
the Demonstrations Session, NAACL HLT 2016, The
2016 Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, San Diego California, USA,
June 12-17, 2016, pages 1721. The Association for
Computational Linguistics, 2016.
Ambati, Vamshi and Stephan Vogel. (2010). Can Crowds
Build Parallel Corpora for Machine Translation Systems?
In Proceedings of the NAACL HLT 2010 Workshop on
Creating Speech and Language Data with Amazon’s
Mechanical Turk, CSLDAMT ’10, pages 6265,
Stroudsburg, PA, USA, 2010. Association for
Computational Linguistics.
Ambati, Vamshi, Stephan Vogel, and Jaime Carbonell.
(2012) Collaborative workflow for crowdsourcing
translation. In Proceedings of the ACM 2012 conference
on Computer Supported Cooperative Work, CSCW ’12,
pages 11911194, New York, NY, USA, 2012.
Canovas, Marcos and Richard Samson (2011). Open
source software in translator training. Tradumática, 9,
2011.
Howe, Jeff. (2006) The Rise of Crowdsourcing. Wired
Magazine, 14(6), 2006.
Howe, Jeff. (2008) Crowdsourcing: Why the Power of
the Crowd Is Driving the Future of Business. Crown
Publishing Group, New York, NY, USA, 1 edition, 2008.
Jaworski, Rafał, Ivan Dunđer, and Sanja Seljan (2017).
Usability Analysis of the Concordia Tool Applying
Novel Concordance Searching. In Proceedings of the
10th International Conference on Natural Language
Processing (HrTAL2016). In print, Springer Verlag
Lecture Notes in Computer Science (LNCS/LNAI).
Croatian Language Technologies Society (Zagreb),
Institute of Linguistics, Faculty of Humanities and Social
Sciences, University of Zagreb (Zagreb), 2017.
Koehn, P. (2005) Europarl: A Parallel Corpus for
Statistical Machine Translation. MT Summit 2005.
Li, Hanchuan, Haichen Shen, Shengliang Xu, and Congle
Zhang. (2015). Visualizing NLP annotations for
Crowdsourcing. CoRR, abs/1508.06044, 2015.
Morschheuser, Benedikt, Karl Werder, Juho Hamari, and
Julian Abe. (2017). How to gamify? Development of a
method for gamification. In Proceedings of the 50th
Annual Hawaii International Conference on System
Sciences (HICSS), Hawaii, USA, 2017.
Muegge, Uwe. (2013). Teaching computer-assisted
translation in the 21st century. Frank & Timme, 2013.
Munro, Robert, Lucky Gunasekara, Stephanie Nevins,
Lalith Polepeddi, and Evan Rosen. (2012). Tracking
Epidemics with Natural Language Processing and
Crowdsourcing. In Wisdom of the Crowd, Papers from
the 2012 AAAI Spring Symposium, volume SS-12-06 of
AAAI Technical Report. AAAI, 2012.
Muntés-Mulero, Victor, Patricia Paladini, Marc Solé, and
Jawad Manzoor. (2012). Multiplying the Potential of
Crowdsourcing with Machine Translation. In
Proceedings, Tenth Biennial Conference AMTA, San
Diego, 2012.
Packham, Sean and Hussein Suleman. (2015).
Crowdsourcing a Text Corpus is not a Game, pages 225
234. Springer International Publishing, Cham, 2015.
Quinn, Alexander J. and Benjamin B. Bederson. (2011)
Human Computation: A Survey and Taxonomy of a
Growing Field. In Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems,
CHI ’11, pages 14031412, New York, NY, USA, 2011.
ACM.
Robson, Karen, Kirk Plangger, Jan H. Kietzmann, Ian
McCarthy, and Leyland Pitt. (2016). Game on: Engaging
customers and employees through gamification. Business
Horizons, 59(1):2936, 2016.
Sabou, Marta, Kalina Bontcheva, and Arno Scharl.
(2012). Crowdsourcing Research Opportunities: Lessons
from Natural Language Processing. In Proceedings of the
12th International Conference on Knowledge
Management and Knowledge Technologies, i-KNOW ’12,
pages 17:117:8,
Seljan, S., Gašpar, A. and Pavuna, D. Sentence Alignment
as the Basis For Translation Memory Database. In: The
Future of Information Sciences: INFuture 2007 - Digital
Information and Heritage. Zagreb: Department of
Information Sciences, Faculty of Humanities and Social
Sciences, Zagreb, 2007. 299-311.
Varga, D., Halácsy, P., Kornai, A., Nagy, V., Németh, L.,
Trón, V. (2005) Parallel corpora for medium density
languages. In: Proceedings of the RANLP 2005, pages
590596.
Zaidan, Omar F. and Chris Callison-Burch. (2011).
Crowdsourcing Translation: Professional Quality from
Non-professionals. In Proceedings of the 49th Annual
Meeting of the Association for Computational
Linguistics: Human Language Technologies - Volume 1,
HLT ’11, pages 12201229
... This knowledge, however, derives from several disciplines, such as computer science, information science, computational linguistics, statistics etc. Therefore, such specially crafted corpora can also be used for teaching purposes in various courses in higher education [14], where they could be examined in numerous ways in order to comprehend the underlying fundamentals of phishing attack methods and strategies. ...
... This knowledge, however, derives from several disciplines, such as computer science, information science, computational linguistics, statistics etc. Therefore, such specially crafted corpora can also be used for teaching purposes in various courses in higher education [14], where they could be examined in numerous ways in order to comprehend the underlying fundamentals of phishing attack methods and strategies. ...
... Parallel corpora can be used in the education process, especially in the domain of computer and information sciences, for research on NLP [12], in language studies on tasks of evaluating and assessing semantics [13], for the translation process and terminology analysis [14], for post-editing tasks after the use of machine translation [15], or CAT technology [16]. However, building scalable, high-quality parallel corpora is a challenging and resource-intensive task in terms of time, effort, cost, and knowledge. ...
Article
Full-text available
Parallel corpora have been widely used in the fields of natural language processing and translation as they provide crucial multilingual information. They are used to train machine translation systems, compile dictionaries, or generate inter-language word embeddings. There are many corpora available publicly; however, support for some languages is still limited. In this paper, the authors present a framework for collecting, organizing, and storing corpora. The solution was originally designed to obtain data for less-resourced languages, but it proved to work very well for the collection of high-value domain-specific corpora. The scenario is based on the collective work of a group of people who are motivated by the means of gamification. The rules of the game motivate the participants to submit large resources, and a peer-review process ensures quality. More than four million translated segments have been collected so far.
... Therefore, human experts should look into the data, but this process is very time-consuming and expensive. This is why another research tried a gamification approach in order to increase user engagement in a specially-built crowdsourcing platform for collecting and verifying parallel corpora for Croatian [4]. ...
... The authors plan to manually correct the verse alignments in the data set, i.e. to set it to the required 1-1 parallel corpus format, so that the misalignments do not influence the overall evaluation process. Verification of the data set alignments could be done through a speciallybuilt NLP platform [15]. ...
... Another research tried a gamification approach in order to increase user engagement in a specially built crowdsourcing platform for collecting parallel corpora for Croatian [7]. ...
Article
Full-text available
Machine translation is increasingly becoming a hot research topic in information and communication sciences, computer science and computational linguistics, due to the fact that it enables communication and transferring of meaning across different languages. As the Croatian language can be considered low-resourced in terms of available services and technology, development of new domain-specific machine translation systems is important, especially due to raised interest and needs of industry, academia and everyday users. Machine translation is not perfect, but it is crucial to assure acceptable quality, which is purpose-dependent. In this research, different statistical machine translation systems were built – but one system utilized domain adaptation in particular, with the intention of boosting the output of machine translation. Afterwards, extensive evaluation has been performed – in form of applying several automatic quality metrics and human evaluation with focus on various aspects. Evaluation is done in order to assess the quality of specific machine-translated text.
Article
Full-text available
Automatsko strojno prevođenje sve je popularnija istraživačka tema u znanosti i raznim znanstvenim disciplinama, kao što su informacijske i komunikacijske znanosti, računarstvo, računalna lingvistika i sl. Razlog tome je prvenstveno to što danas omogućuje nezaobilaznu komunikaciju i brz prijenos informacija između različitih prirodnih jezika. To je posebno bitno za manje govorene jezike poput hrvatskoga, za koji još uvijek ne postoji dovoljan broj softverskih alata i digitalnih resursa potrebnih za razvoj specijaliziranih i kvalitetnih sustava za strojno prevođenje koji bi bili optimizirani za upotrebu u jednom specifičnom području. Sve brži rast količine podataka i sve veća potreba raznih dionika u sektorima industrije, gospodarstvu, znanosti, ali i u svakodnevnom životu ljudi impliciraju motivaciju za sistematiziranim i organiziranim razvojem te naknadnom prilagodbom sustava za automatsko strojno prevođenje za različite jezične parove. Budući da strojni prijevodi nisu savršeni, važno je primijeniti metode za računalno generiranje prijevoda prihvatljive razine kvalitete koja ovisi o samom zadatku i području primjene sustava za strojno provođenje. U ovom radu analiziran je model sustava za automatsko statističko strojno prevođenje, njegove komponente te uloga i značaj pojedinih elemenata unutar modela.
Conference Paper
Full-text available
We present iAppraise: an open-source framework that enables the use of eye-tracking for MT evaluation. It connects Appraise, an open-source toolkit for MT evaluation, to a low-cost eye-tracking device, to make its usage accessible to a broader audience. It also provides a set of tools for extracting and exploiting gaze data, which facilitate eye-tracking analysis. In this paper, we describe different modules of the framework, and explain how the tool can be used in a MT evaluation scenario. During the demonstration, the users will be able to perform an evaluation task, observe their own reading behavior during a replay of the session , and export and extract features from the data.
Article
Full-text available
Translator training implies the use of procedures and tools that allow students to become familiar with professional contexts. Specialised open source software includes professional quality tools and procedures accessible to academic institutions and distance students working at home. Authentic projects using open source software and crowdsourcing procedures are indispensable resources in translator training. RESUM (El programari lliure en la formació del traductor) La formació de traductors implica l´ús de procediments i eines que permetin els estudiants familiaritzar-se amb contextos professionals. El software lliure especialitzat inclou eines de qualitat professional i procediments accessibles per a les institucions acadèmiques i els estudiants a distància que treballen a casa seva. Els projectes reals que utilitzen software lliure i traducció col·laborativa (crowdsourcing) constitueixen recursos indispensables en la formació de traductors. Paraules clau: formació de traductors, software lliure, traducció assistida per ordinador, aprenentatge col·laboratiu, enfocament socioconstructivista, projectes de traducció, traducció col·laborativa RESUMEN (El software libre en la traducción del traductor) La formación de traductores implica el uso de procedimientos y herramientas que permitan a los estudiantes familiarizarse con contextos profesionales. El software libre especializado incluye herramientas de calidad profesional y procedimientos accesibles para las instituciones académicas y los estudiantes a distancia que trabajan en su casa. Los proyectos reales que utilizan software libre y traducción colaborativa (crowdsourcing) son recursos indispensables en la formación de traductores. Palabras clave: formación de traductores, software libre, traducción asistida por ordenador, aprendizaje colaborativo, enfoque socioconstructivista, proyectos de traducción, traducción colaborativa.
Conference Paper
Full-text available
Machine Translation (MT) is said to be the next lingua franca. With the evolution of new technologies and the capacity to produce a humungous number of written digital documents, human translators will not be able to translate documentation fast enough. However, some applications require a level of quality that is still beyond that provided by MT. Thanks to the increased capacity of communication provided by new technologies, people can also interact and collaborate to work remotely. With this, crowd computing is becoming more common and it has been proposed as a feasible solution for translation. In this paper, we discuss about the relationship between crowdsourcing and MT, and the main challenges for the MT community to multiply the potential of the crowd.
Article
Visualizing NLP annotation is useful for the collection of training data for the statistical NLP approaches. Existing toolkits either provide limited visual aid, or introduce comprehensive operators to realize sophisticated linguistic rules. Workers must be well trained to use them. Their audience thus can hardly be scaled to large amounts of non-expert crowdsourced workers. In this paper, we present CROWDANNO, a visualization toolkit to allow crowd-sourced workers to annotate two general categories of NLP problems: clustering and parsing. Workers can finish the tasks with simplified operators in an interactive interface, and fix errors conveniently. User studies show our toolkit is very friendly to NLP non-experts, and allow them to produce high quality labels for several sophisticated problems. We release our source code and toolkit to spur future research.
Conference Paper
The rapid growth of human computation within research and industry has produced many novel ideas aimed at organizing web users to do great things. However, the growth is not adequately supported by a framework with which to understand each new system in the context of the old. We classify human computation systems to help identify parallels between different systems and reveal "holes" in the existing work as opportunities for new research. Since human computation is often confused with "crowdsourcing" and other terms, we explore the position of human computation with respect to these related topics.
Conference Paper
In this paper we explore the challenges in crowdsourcing the task of translation over the web in which remotely located translators work on providing translations independent of each other. We then propose a collaborative workflow for crowdsourcing translation to address some of these challenges. In our pipeline model, the translators are working in phases where output from earlier phases can be enhanced in the subsequent phases. We also highlight some of the novel contributions of the pipeline model like assistive translation and translation synthesis that can leverage monolingual and bilingual speakers alike. We evaluate our approach by eliciting translations for both a minority-to-majority language pair and a minority-to-minority language pair. We observe that in both scenarios, our workflow produces better quality translations in a cost-effective manner, when compared to the traditional crowdsourcing workflow.
Article
Remember outsourcing? Sending jobs to India and China is so 2003. The new pool of cheap labor: everyday people using their spare cycles to create content, solve problems, even do corporate R
Can Crowds Build Parallel Corpora for Machine Translation Systems?
  • Ambati
  • Stephan Vamshi
  • Vogel
Ambati, Vamshi and Stephan Vogel. (2010). Can Crowds Build Parallel Corpora for Machine Translation Systems? In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, CSLDAMT '10, pages 62-65, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
How to gamify? Development of a method for gamification
  • Benedikt Morschheuser
  • Karl Werder
  • Juho Hamari
  • Julian Abe
Morschheuser, Benedikt, Karl Werder, Juho Hamari, and Julian Abe. (2017). How to gamify? Development of a method for gamification. In Proceedings of the 50th