Chapter
Article
This paper presents an open-source and extendable Morphological Analyser cum Generator (MAG) for Tamil named ThamizhiMorph. Tamil is a low-resource language in terms of NLP tools and applications. In addition, most of the available tools are neither open nor extendable. A morphological analyser is a key resource for the storage and retrieval of morphophonological and morphosyntactic information, especially for morphologically rich languages, and is also useful for developing applications within Machine Translation. This paper describes how ThamizhiMorph is designed using a Finite-State Transducer (FST) and implemented using Foma. We discuss our design decisions based on the peculiarities of Tamil and its nominal and verbal paradigms. We specify a high-level meta-language to efficiently characterise the language's inflectional morphology. We evaluate ThamizhiMorph using text from a Tamil textbook and the Tamil Universal Dependency treebank version 2.5. The evaluation and error analysis attest to a very high performance level, with the identified errors being mostly due to out-of-vocabulary items, which are easily fixable. In order to foster further development, we have made our scripts, the FST models, lexicons, Meta-Morphological rules, lists of generated verbs and nouns, and test data sets freely available for others to use and extend.
Conference Paper
This paper evaluates the impact of different types of data sources in developing a domain-specific statistical machine translation (SMT) system for official government letters, between the low-resourced language pair Sinhala and Tamil. The baseline was built with a small in-domain parallel data set containing official government letters. The translation system was evaluated with two different test datasets. Test data from the same sources as the training and tuning data gave a higher score due to over-fitting, while test data from a different source resulted in a considerably lower score. To improve translation, more data was collected from (a) government sources other than official letters (pseudo in-domain), and (b) online sources such as blogs, news and wiki dumps (out-domain). Use of pseudo in-domain data showed an improvement for both test sets, as the language is formal and the context was similar to that of the in-domain data, though the writing style varies. Out-domain data, however, had no positive impact, in either filtered or unfiltered form, as the writing style was different and the context was much more general than that of the official government documents.
Chapter
This chapter presents the Dutch Parallel Corpus (DPC), a 10-million-word, high-quality, sentence-aligned parallel corpus for the language pairs Dutch-English and Dutch-French. The corpus contains five different text types and is balanced with respect to text type and translation direction. Rich metadata information is stored for each text sample. All texts included in the corpus have been cleared from copyright. The entire corpus is aligned at sentence level and enriched with linguistic annotations. Twenty-five thousand words of the Dutch-English part have been manually aligned at the sub-sentential level. The corpus is released as full texts in XML format and can also be queried via a web concordancer.
Article
Statistical machine translation is one of the most promising and prominent machine translation strategies. Because it applies even to structurally dissimilar language pairs, it has proven suitable for large-scale text translation. Demand for automatic translation between Sinhala and Tamil has been rising for several decades. A statistical approach is the preferred way to address the lack of a machine translation tool for these languages. Because the two languages are similar, a statistical approach can work well without heavy reliance on linguistic knowledge. In this research, a basic translation system was modelled and implemented, with parallel corpora prepared from parliament order papers. This paper reports only the preliminary system runs, without parameter refinement or the final design and evaluation strategies. The language model, translation model and decoder configurations follow recent literature. To improve output quality, the MERT technique is integrated to tune the decoder. To avoid depending solely on BLEU, two other automatic metrics, TER and NIST, are used to evaluate different aspects. Directions for future research and for refining the system are also identified.
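As a rough illustration of the kind of automatic metric mentioned in the abstract above, the following is a minimal sketch of a single-reference, sentence-level BLEU score without smoothing. It is not the evaluation setup of the cited paper, which is not described here; the function names (`ngrams`, `modified_precision`, `simple_bleu`) are hypothetical, and real evaluations normally use corpus-level BLEU from established tooling.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=4):
    """Single-reference, sentence-level BLEU without smoothing (illustrative only)."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        # Without smoothing, any zero precision makes the geometric mean zero.
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalise candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)
```

Calling `simple_bleu(hypothesis.split(), reference.split())` on whitespace-tokenised strings returns a value between 0 and 1. The clipping step is what stops a candidate from scoring well by repeating a single matching word, and the brevity penalty discourages overly short output.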
Article
For many types of machine learning algorithms, one can compute the statistically "optimal" way to select training data. In this paper, we review how optimal data selection techniques have been used with feedforward neural networks. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are computationally expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate. Empirically, we observe that the optimality criterion sharply decreases the number of training examples the learner needs in order to achieve good performance.
Chapter
The syllabic orientation to the study of writing systems leads to a five-way typology, which is useful in investigating the historical development of writing. The five types of writing system are: logosyllabary, syllabary, abjad, alphabet, and abugida. While some or all of the modern-day West African syllabaries may have come about by stimulus diffusion, this is unlikely in the extreme for syllabaries like those of the Caroline Islands and Alaska. The first alphabet was that for Greek, and the Greek alphabet influenced Western Civilization. The names of the first two Greek letters, alpha and beta, provide the word alphabet. Abjad is an Arabic word containing the first four letters of the Arabic script. In an abugida, each letter represents a consonant followed by the vowel /a/ or an individual vowel. The branch of scholarship that has been concerned with writing systems is philology, the study of texts in all their complexity.
Jeyakaran, M.: A novel kernel regression based machine translation system for Sinhala-Tamil translation, unpublished BSc thesis, University of Colombo (2013)
Thampoe, H.D.: Sinhala and Tamil: a case of contact-induced restructuring. Ph.D. thesis, Newcastle University (2017)
Devadath, V., Kurisinkel, L.J., Sharma, D.M., Varma, V.: A sandhi splitter for Malayalam. In: Proceedings of the 11th International Conference on Natural Language Processing, pp. 156-161 (2014)
Morishita, M., Suzuki, J., Nagata, M.: JParaCrawl: a large scale web-based English-Japanese parallel corpus. arXiv preprint arXiv:1911.10668 (2019)
De Silva, N.: Survey on publicly available Sinhala natural language processing tools and research. arXiv preprint arXiv:1906.02358 (2019)