Book

Machine Learning in Translation Corpora Processing

Abstract

This book reviews ways to improve statistical machine translation of speech between Polish and English. Research has been conducted mostly on dictionary-based, rule-based, and syntax-based machine translation techniques. The most popular methodologies and tools are not well-suited for the Polish language and therefore require adaptation, and language resources are lacking in both parallel and monolingual data. The main objective of this volume is to develop an automatic and robust Polish-to-English translation system to meet specific translation requirements and to develop bilingual textual resources by mining comparable corpora.
... Although an automatic translator capable of replacing humans has not yet been developed, according to the Oxford Report, whose authors examine how susceptible jobs are to computerisation, the estimated probability of computerisation for the profession of translator and interpreter is 38% [Frey, Osborne 2013: 67]. As is well known, both the computing power of computers and the text base that enables artificial intelligence to learn are constantly growing [Kisielewicz 2017: 326-327; Chan Sin-wai 2018; Wołk 2019]. What is more, in Michael Cronin's view, machine translators can not only learn but are also able to enter into some types of dialogue and interaction, which makes them increasingly reliable tools [Cronin 2016: 119-138]. ...
Article
This article provides an overview of the challenges faced at the intersection of philosophy and translation studies. The interdisciplinary scope under consideration covers philosophical contributions to theoretical investigation of translation, which are substantively grounded in the experiences of practicing translators, amongst whom are philosophers. Furthermore, the paper emphasizes the prominent role of philosophy in establishing the fundamental conditions and concepts of translation, and addresses the numerous philosophical approaches contributing to the interdisciplinary discourse on the nature of translation. Considerable attention is given to metaphysical and epistemological points that seem to play a significant role in conceptualising translation issues and establishing the theoretical foundations of translation studies, as well as in providing a deeper insight into the translation process. On the whole, the philosophisation of the translation phenomenon contributes to the substantial growth of both philosophical thought and translation studies.
... From the above discussion, we may state that Turkish NLP studies have to deal with language-processing tasks before modelling a solution to the target problem. In general, most words are composed of many morphemes and may occur only once in the training data, which gives rise to the so-called data-sparsity and curse-of-dimensionality problems [42,43] from a computational modelling point of view. It is important to observe that this complexity constrains the implementation of state-of-the-art models and algorithms developed, for example, for English. ...
Article
Language-model pre-training architectures have been shown to be useful for learning language representations. Bidirectional Encoder Representations from Transformers (BERT), a recent deep bidirectional self-attention representation learned from unlabelled text, has achieved remarkable results in many natural language processing (NLP) tasks with fine-tuning. In this paper, we demonstrate the efficiency of BERT for a morphologically rich language, Turkish. Traditionally, morphologically complex languages require extensive pre-processing steps to make the data suitable for machine learning (ML) algorithms. In particular, tokenization, lemmatization or stemming, and feature-engineering tasks are needed to obtain an efficient data model and to overcome data-sparsity and high-dimensionality problems. In this context, we selected five Turkish NLP research problems from the literature: sentiment analysis, cyberbullying identification, text classification, emotion recognition, and spam detection. We then compared the empirical performance of BERT with baseline ML algorithms, and found improved results on the selected NLP problems while eliminating heavy pre-processing tasks.
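The subword-unit idea that lets models such as BERT sidestep hand-built stemmers for morphologically rich languages can be illustrated with a toy example. BERT itself uses WordPiece tokenization; the sketch below implements the closely related byte-pair encoding (BPE) procedure in plain Python, on a few invented Turkish-like word forms. The word list and merge count are arbitrary choices for illustration, not taken from the paper.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn byte-pair-encoding merges from a list of word occurrences.
    Frequent character pairs are merged into subword units, so a shared
    stem like 'ev' (house) surfaces without any morphological analyser."""
    # represent each word as a tuple of symbols (initially characters)
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # re-segment every word with the newly merged symbol
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab
```

Run on forms sharing the stem "ev" (e.g. "evlerde", "evler", "evde"), the very first merge learned is ('e', 'v'): the shared morpheme emerges from frequency statistics alone, which is the intuition behind subword vocabularies for agglutinative languages.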
Article
Re-speaking is a mechanism for obtaining high-quality subtitles for use in live broadcasts and other public events. Because it relies on humans to perform the actual re-speaking, the task of estimating the quality of the results is nontrivial. Most organizations rely on human effort for quality assessment, but purely automatic methods have been developed for similar problems (such as Machine Translation). This paper compares several of these methods: BLEU, EBLEU, NIST, METEOR, METEOR-PL, TER, and RIBES. These are then matched against the human-derived NER metric commonly used in re-speaking. The purpose of this paper is to assess whether the above automatic metrics, normally used for MT system evaluation, can be used in lieu of the manual NER metric to evaluate re-speaking transcripts.
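Of the automatic metrics compared above, BLEU is the most widely used. As a rough illustration of what such a metric computes, here is a minimal, unsmoothed sentence-level BLEU in plain Python. This is a sketch for intuition only: standard BLEU is computed at corpus level, and smoothed variants exist for short sentences, so it is not a replacement for reference implementations such as sacrebleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n), times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped counts: a candidate n-gram is credited at most
        # as many times as it appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_prec_sum += math.log(overlap / total)
    # brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec_sum / max_n)
```

An identical candidate and reference score 1.0, a fully disjoint pair scores 0.0, and partial overlaps fall in between; metrics like NIST, METEOR, and TER differ mainly in how they weight, match, or edit these overlaps.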
Conference Paper
A parallel text is the set formed by a text and its translation (in which case it is called a bitext) or translations. Parallel text alignment is the task of identifying correspondences between blocks or tokens in each half of a bitext. Aligned parallel corpora are used in several areas of linguistics and computational linguistics research. In this paper, a survey on parallel text alignment is presented: the historical background is provided, and the main methods are described. A list of relevant tools and projects is presented as well.
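A classic family of alignment methods covered by such surveys is length-based sentence alignment in the style of Gale and Church. The sketch below is a heavily simplified, hypothetical version: it aligns two lists of sentence lengths by dynamic programming over 1-1, 1-0, 0-1, 2-1 and 1-2 "beads", using a naive absolute-difference cost instead of the original probabilistic score, and the fixed SKIP penalty is an arbitrary assumption.

```python
def align_lengths(src_lens, tgt_lens):
    """Length-based sentence alignment by dynamic programming,
    a simplified sketch of the Gale & Church approach.
    Returns the chosen bead sequence, e.g. [(1, 1), (2, 1), ...]."""
    INF = float("inf")
    SKIP = 50  # hypothetical fixed penalty for unmatched sentences

    def match(s, t):
        # cheap stand-in for the original probabilistic length cost
        return abs(s - t)

    n, m = len(src_lens), len(tgt_lens)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            moves = []  # (next_i, next_j, step_cost, bead)
            if i < n and j < m:
                moves.append((i + 1, j + 1, match(src_lens[i], tgt_lens[j]), (1, 1)))
            if i < n:
                moves.append((i + 1, j, SKIP, (1, 0)))
            if j < m:
                moves.append((i, j + 1, SKIP, (0, 1)))
            if i + 1 < n and j < m:  # two source sentences -> one target
                moves.append((i + 2, j + 1,
                              match(src_lens[i] + src_lens[i + 1], tgt_lens[j]), (2, 1)))
            if i < n and j + 1 < m:  # one source sentence -> two target
                moves.append((i + 1, j + 2,
                              match(src_lens[i], tgt_lens[j] + tgt_lens[j + 1]), (1, 2)))
            for ni, nj, c, bead in moves:
                if cost[i][j] + c < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + c
                    back[ni][nj] = (i, j, bead)
    # trace back the cheapest bead sequence
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, bead = back[i][j]
        beads.append(bead)
        i, j = pi, pj
    return beads[::-1]
```

Given two halves whose sentence lengths roughly track each other, the cheapest path recovers the 1-1 correspondences, and a merged sentence on one side shows up as a 2-1 bead; real aligners refine this with lexical evidence.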
Article
On the matter of memory, there is no comparison. Neural networks are potentially faster and more accurate than humans.
Conference Paper
To improve the translation quality of less-resourced language pairs, the most natural answer is to build ever larger aligned training data sets, that is, to make those language pairs well resourced. But aligned data is not always easy to collect. In contrast, monolingual data is usually easier to access. In this paper we show how to leverage unrelated, unaligned monolingual data to construct additional training data that varies only a little from the original training data. We measure the contribution of such additional data to translation quality. We report an experiment between Chinese and Japanese in which we use 70,000 sentences of unrelated, unaligned monolingual additional data in each language to construct new sentence pairs that are not perfectly aligned. We add these sentence pairs to a training corpus of 110,000 sentence pairs, and report an increase of 6 BLEU points.