Preprint

Automated Transcription of Non-Latin Script Periodicals: A Case Study in the Ottoman Turkish Print Archive


Abstract

A pre-print of our paper can be found at https://arxiv.org/abs/2011.01139. Our study utilizes deep learning methods for the automated transcription of late nineteenth- and early twentieth-century periodicals written in Arabic script Ottoman Turkish (OT) using the Transkribus platform. We discuss the historical situation of OT text collections and how they were, for the most part, excluded from the late twentieth-century corpus digitization that took place in many Latin script languages. This exclusion has two basic reasons: the technical challenges of OCR for Arabic script languages, and the rapid abandonment of that very script in the Turkish historical context. In the specific case of OT, opening periodical collections to digital tools requires training handwritten text recognition (HTR) models to generate transcriptions in the Latin writing system of contemporary readers of Turkish, and not, as some may expect, in right-to-left Arabic script text. In the paper we discuss the challenges of training such models where one-to-one correspondence between the writing systems does not exist, and we report results based on our HTR experiments with two OT periodicals from the early twentieth century. Finally, we reflect on the potential domain bias of HTR models in historical languages exhibiting spatio-temporal variance, as well as the significance of working between writing systems for language communities that have experienced language reform and script change.

Article
The second half of the nineteenth century saw the establishment of several Hebrew newspapers in Eastern Europe and Palestine that provided a platform for a lively political discourse reflecting varied ideological approaches. This work focuses on one decade, 1874-1883, in the relatively long lifespan of the Hebrew weekly HaTzfira, which was founded in Warsaw in 1862. Applying computational tools to the study of the early Hebrew press requires a unique effort. The Hebrew language in general is distinct in its characters, morphological structure, and word order. The contribution of this proof-of-concept study is twofold. First, computational analysis provides a long-term indication of trends in the discourse that cannot be attained through qualitative study. Second, on the micro level, computational analysis can potentially shed light, in a diachronic perspective, on the use of a specific term or the discussion of a specific geographical location.
Article
The aim of this study is to introduce Dervaze (dervaze.com), an electronic spelling guide, together with a morphological translation tool that converts Ottoman texts into Turkish and Turkish texts into Ottoman. Dervaze currently contains 72,600 Ottoman Turkish words. For each word, the dictionary gives the Latin-script spelling, the Ottoman-script spelling, and the abjad value. Because it is an electronic resource, different relations between words (synonymy, shared root, etc.) can be added, and it can list words with similar Ottoman spellings. The spelling guide is planned to cover the entire vocabulary of the Ottoman Turkish period. Dervaze also powers the translation program we have prepared to make texts written in Ottoman Turkish accessible today. The translation tool performs a morphological analysis of a Turkish word, identifies its root and suffixes, and then converts the root and each suffix into Ottoman separately. This approach was chosen because the spelling rules of Turkish and Ottoman cannot be mapped exactly onto each other. Since the goal is to translate Ottoman Turkish texts of the nineteenth and twentieth centuries, the sources of the spelling guide were selected accordingly. The Redhouse dictionary published in 1856 and Belviranlı's Osmanlıca İmla Lügati, which is based on the rules of the 1928 İmla Encümeni, were preferred; in addition, various word lists found on the Internet were used to record spelling variants. Spelling variants are especially important for the Turkish elements of Ottoman Turkish, whose spelling rules were never clearly fixed. These variants can be recognized electronically only if they are listed, and for this reason all spelling variants have been included in the dictionary entries. As the OCR (Optical Character Recognition) features of the system we are building come into use, all words of the period will be scanned from the original sources, and the different spellings will also be included in the guide. Through a survey of sources that proceeds backward from the nineteenth century, we aim to record the entire vocabulary of Ottoman Turkish.
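The root-and-suffix conversion described in this abstract can be illustrated with a minimal sketch. It is not Dervaze's implementation: the tiny lexicon, the greedy segmentation, and the names ROOTS, SUFFIXES, segment, and to_ottoman are all assumptions made for illustration, and a real system would need a full morphological analyzer and a far larger dictionary.

```python
# Toy lexicon: illustrative assumptions only, not data taken from Dervaze.
ROOTS = {"kitap": "كتاب", "ev": "او"}
SUFFIXES = {"lar": "لر", "ler": "لر", "da": "ده", "de": "ده"}


def segment(word):
    """Split a Turkish word into (root, [suffixes]) by greedy longest match.

    Returns None when the word cannot be fully covered by the toy lexicon.
    """
    for root in sorted(ROOTS, key=len, reverse=True):
        if not word.startswith(root):
            continue
        rest = word[len(root):]
        suffixes = []
        while rest:
            for suffix in sorted(SUFFIXES, key=len, reverse=True):
                if rest.startswith(suffix):
                    suffixes.append(suffix)
                    rest = rest[len(suffix):]
                    break
            else:
                break  # no known suffix matches the remainder
        if not rest:
            return root, suffixes
    return None


def to_ottoman(word):
    """Convert root and suffixes separately, then join them in logical order."""
    parsed = segment(word)
    if parsed is None:
        return None
    root, suffixes = parsed
    return ROOTS[root] + "".join(SUFFIXES[s] for s in suffixes)


if __name__ == "__main__":
    for w in ["kitaplar", "evlerde", "xyz"]:
        print(w, "->", to_ottoman(w))
```

Looking up each morpheme separately means that vowel-harmony variants such as "lar" and "ler" can share a single Ottoman spelling, which is one reason a per-morpheme table is easier to maintain than a letter-by-letter mapping.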
Article
Electronic technology has changed the way scholars in the humanities do their work, creating two distinct groups of scholars: first, those who perform leading-edge humanities computing research (a relatively small number); and second, scholars who perform traditional humanities research with new electronic tools (a fairly large number). How is it possible to bring these two groups together? The Text Creation Partnership at the University of Michigan provides one way of providing services to both. And as the electronic publishing community looks for ways to provide reliable cyberinfrastructure in the humanities, the Text Creation Partnership provides a model for building large digital collections that meet the needs of future scholars.
Article
The written Persian language is remarkable for its stability over a millennium of time. In contrast, the interesting thing about Ottoman written culture is that although Ottoman Turkish was intimately linked with Persian throughout its existence, although Ottoman scribes based their organization and culture on that of Persian scribes, and although Persian literature and documents formed the most important models for those of the Ottomans, the Ottoman written language was not at all stable or unchanging. To an Ottomanist, it seems odd even to think about an unchanging language, because Ottoman Turkish was constantly changing and the changes were one of its most notable features. Ottoman was similar to Persian, however, in that it was a written lingua franca for the governing elite of an empire whose people spoke a variety of different languages and dialects, whether other varieties of Turkish or other languages entirely, such as Greek, Serbian, or Arabic. It therefore shared many of Persian's characteristics as an elite administrative and literary vehicle. The culture of the scribal cadre who were the producers and upholders of written Turkish was, as far as we know, similar to that of the Persian scribes, as described by Hanaway in chapter 2. But there are striking differences in the outcome. If in Persia the scribes were the guardians of the stability of the written language, in the Ottoman Empire the scribal class was responsible for its transformations. In addition, the Ottoman elite was multilingual; its members wrote in Arabic and Persian as well as Turkish and probably spoke several other languages. Mehmed II, for example, knew Greek well. © 2012 by the University of Pennsylvania Museum of Archaeology and Anthropology.
Article
The high level of ambiguity of the Arabic script poses special challenges to developers of NLP tools in areas such as morphological analysis, named entity extraction and machine translation. These difficulties are exacerbated by the lack of comprehensive lexical resources, such as proper noun databases, and the multiplicity of ambiguous transcription schemes. This paper focuses on some of the linguistic issues encountered in two subdisciplines that play an increasingly important role in Arabic information processing: the romanization of Arabic names and the arabization of non-Arabic names. The basic premise is that linguistic knowledge in the form of linguistic rules is essential for achieving high accuracy.
http://www.digitalhumanities.org/dhq/vol/10/4/000268/000268.html
[Andrews et al. 2008] Andrews, W., Inan, M., Kebeli, S., Waters, S., 2008. Rethinking the Transcription of Ottoman Texts: The Case for Reversible Transcription.
[Anhegger 1988] Anhegger, R., 1988. On Transcribing Ottoman Texts. Manuscripts of the Middle East 12-15.
[Boeschoten 1988] Boeschoten, H.E., 1988. Why Transcribe Ottoman Turkish Texts? Manuscripts of the Middle East 3, 23-26.
[Korkut 2019] Korkut, J., 2019. Morphology and Lexicon-Based Machine Translation of Ottoman Turkish to Modern Turkish. https://www.cs.princeton.edu/~ckorkut/papers/ottoman.pdf
[Romanov 2017] Romanov, M., Miller, M.T., Savant, S.B., Kiessling, B., 2017. Important New Developments in Arabographic Optical Character Recognition (OCR). arXiv:1703.09550 [cs].