Conference Paper

Assembling Chinese-Mongolian Speech Corpus via Crowdsourcing

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

Chinese-Mongolian Speech Corpus (CMSC) is utilized in many practical applications in recent years, and it is a kind of low-resource corpus due to its high-cost construction. We describe a crowdsourcing method to build a collection of bilingual speech corpus through the use of a messaging app called WeChat, in which followers can send voice and text message to our Official Account Platform freely. Owing to most followers are fluent in Chinese and Mongolian, we gathered natural speech recordings in our daily life, and constructed a parallel speech corpus of 20547 utterances from 296 speakers, totalling 21.43 h of speech, during the first 25 days that collecting notification was pushed. Moreover, we present a quality control measure in the evaluation part that independent subscribers voted on the translations of each source sentence and it improves the quality of corpus markedly. We show that WeChat Official Account Platform can be used to assemble speech corpus quickly and cheaply, with near-expert accuracy. As the basic research content of natural language processing (NLP), the construction of bilingual speech corpus via crowdsourcing has a reference value for the similar studies.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the authors.

ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
We describe our experience using both Amazon Mechanical Turk (MTurk) and Crowd-Flower to collect simple named entity annotations for Twitter status updates. Unlike most genres that have traditionally been the focus of named entity experiments, Twitter is far more informal and abbreviated. The collected annotations and annotation techniques will provide a first step towards the full study of named entity recognition in domains like Facebook and Twitter. We also briefly describe how to use MTurk to collect judgements on the quality of "word clouds."
Conference Paper
Full-text available
EuroGOV is a multilingual web corpus that was created to serve as the document collection for WebCLEF, the CLEF 2005 web retrieval task. EuroGOV is a collection of web pages crawled from the European Union portal, European Union member state governmental web sites, and Russian governmental web sites. The corpus contains over 3 million documents written in more than 20 different European languages. In this paper we provide a detailed description of the EuroGOV collection.
Article
Full-text available
We present a compendium of recent and current projects that utilize crowdsourcing technologies for language studies, finding that the quality is comparable to controlled laboratory experiments, and in some cases superior. While crowdsourcing has primarily been used for annotation in recent language studies, the results here demonstrate that far richer data may be generated in a range of linguistic disciplines from semantics to psycholinguistics. For these, we report a number of successful methods for evaluating data quality in the absence of a 'correct' response for any given data point.
Article
The ability to reliably identify sarcasm and irony in text can improve the performance of many Natural Language Processing (NLP) systems including summarization, sentiment analysis, etc. The existing sarcasm detection systems have focused on identifying sarcasm on a sentence level or for a specific phrase. However, often it is impossible to identify a sentence containing sarcasm without knowing the context. In this paper we describe a corpus generation experiment where we collect regular and sarcastic Amazon product reviews. We perform qualitative and quantitative analysis of the corpus. The resulting corpus can be used for identifying sarcasm on two levels: a document and a text utterance (where a text utterance can be as short as a sentence and as long as a whole document).
Conference Paper
Recent work has established the efficacy of Amazon's Mechanical Turk for constructing parallel corpora for machine translation research. We apply this to building a collection of parallel corpora between English and six languages from the Indian subcontinent: Bengali, Hindi, Malayalam, Tamil, Telugu, and Urdu. These languages are low-resource, under-studied, and exhibit linguistic phenomena that are difficult for machine translation. We conduct a variety of baseline experiments and analysis, and release the data to the community.
Article
Remember outsourcing? Sending jobs to India and China is so 2003. The new pool of cheap labor: everyday people using their spare cycles to create content, solve problems, even do corporate R
Article
This paper describes the approach to spoken corpus design used by the British National Corpus (BNC) project.1 A twopart approach to spoken corpus design has been adopted. The demographic approach uses demographic parameters to sample the everyday speech of the population of British English speakers in the United Kingdom. The context governed approach is designed to cover the full range of linguistic variation found in spoken language using a typology based on four contextual categories: educational, business, public/institutional, and leisure. Details of the processing of recordings are given, together with a description of the context features included in the corpus (such as setting, location, topic, and participant details). The processing and transcription of BNC spoken recordings will be discussed in a forthcoming paper
Lexical tagging of Mongolian corpus. S.C. Inne Mon
  • J Xingwu
Research on the construction of Uygur, Kazak and Kirgiz public opinion tagging corpus based on crowdsourcing
  • H Chen
Building a speech corpus
  • S Adolphs
  • D Knight
Research on Mongolian speech recognition
  • R Mu
Problems of recording tagging and solutions in “Mongolian Speech Corpus
  • R Yu
Processing of Mongolian by computer
  • I Dawa
  • Y Zhang
  • K Uezono
  • S Zhang
Research on Uyghur speech language speech corpus of telephone channel
  • Y Tingyang
  • X Huadong
  • L Wang