Article

Automated Transcription of Non-Latin Script Periodicals: A Case Study in the Ottoman Turkish Print Archive


Abstract

Link to article: http://www.digitalhumanities.org/dhq/vol/16/2/000577/000577.html

Our study discusses the automated transcription, using deep learning methods, of a digital newspaper collection printed in a historical language, Arabic-script Ottoman Turkish (OT), dating to the late nineteenth and early twentieth centuries. We situate OT text collections within a larger history of digitization of periodicals, underscoring special challenges faced by Arabic-script languages. Our paper approaches the question of automated transcription of non-Latin script languages, such as OT, from the broader perspective of debates surrounding OCR use for historical archives. In our study with OT, we have opted for training handwritten text recognition (HTR) models that generate transcriptions in the left-to-right, Latin writing system familiar to contemporary readers of Turkish, and not, as some scholars may expect, in right-to-left Arabic-script text. As a one-to-one correspondence between the writing systems of OT and modern Turkish does not exist, we also discuss approaches to transcription and the creation of ground truth, and argue that the challenges faced in the training of HTR models also call into question straightforward notions of transcription, especially where divergent writing systems are involved. Finally, we reflect on the potential domain bias of HTR models in other historical languages exhibiting spatio-temporal variance, as well as the significance of working between writing systems for language communities that have also experienced language reform and script change.


... In the broader field of digital humanities, several pioneering projects have demonstrated the potential of deep learning in handling ancient scripts. For instance, projects like [2] and [3] have successfully bridged the gap between historical and modern languages, enhancing accessibility and understanding. Similarly, [4] and [5] have shown significant advancements in recognizing and digitizing scripts with complex characteristics. ...
Article
This study presents the development and evaluation of a deep learning-based optical character recognition (OCR) model specifically designed for recognizing Old Turkic script. Utilizing a convolutional neural network (CNN), the project aimed to achieve high classification accuracy across a dataset comprising 38 distinct Old Turkic characters. To enhance the model’s robustness and generalization capabilities, sophisticated data augmentation techniques were employed, generating 760 augmented images from the original 38 characters. The model was rigorously trained and validated, achieving an overall accuracy of 96.34%. Evaluation metrics such as precision, recall, and F1-scores were systematically analyzed, showing superior performance in most classes while identifying areas for further optimization. The results underscore the effectiveness of CNN architectures in specialized OCR tasks, demonstrating their potential in preserving and digitizing historical scripts. This study not only advances the field of document analysis and OCR but also contributes to the digital preservation and accessibility of ancient scripts.
... However, some studies have explored direct conversion of Ottoman images to Turkish text using OCR or HTR techniques [12] [13] [14]. [13] employs deep learning techniques for automated transcription of late 19th- and early 20th-century periodicals written in Arabic-script Ottoman using the Transkribus platform. [14] addresses the challenge of automatic transcription of printed Ottoman documents. ...
Preprint
Ottoman-Turkish transliteration is a relatively new problem. To make a vast amount of historical documents, books, newspapers, and magazines accessible to a wider audience unfamiliar with the Ottoman script, it is necessary to transliterate the Ottoman script into the Latin-based Turkish script. This study employs traditional NLP techniques to develop a dictionary-based Ottoman-Turkish transliteration system. Using a dataset of 2403 sentences and 31K words, we achieved a Word Error Rate (WER) of 20.69% (raw) and 6.31% (normalized), and a Character Error Rate (CER) of 6.46% (raw) and 3.01% (normalized), with BLEU scores of 51.90 (raw) and 77.18 (normalized). The results show that the proposed system performs promisingly for Ottoman-Turkish transliteration.
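For readers unfamiliar with the two error rates reported above, the following is a minimal sketch of how WER and CER are conventionally computed: the Levenshtein (edit) distance between a reference and a system output, divided by the reference length in words or characters. It is an illustration only. The function names are hypothetical, and it does not reproduce the study's own evaluation code or its normalization step, neither of which is described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because a single wrong character makes the whole word count as an error, WER is typically much higher than CER on the same output, which is consistent with the gap between the two figures reported above.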
... It is emphasized that modern metrics used in intralingual translation processes provide an objective measurement of system performance. It has been noted that deep learning-based systems are more efficient than manual methods in automatic transcription of old-alphabet Ottoman Turkish texts [10,26]. ...
Conference Paper
Old-alphabet Ottoman Turkish translation systems are methods developed for the translation of historical texts into Latin-based Turkish. While traditional methods (transcription and simplification) remain faithful to the historical context, modern deep learning-based approaches offer the advantage of speed and accuracy. Modern technologies are supported by Natural Language Processing and context-sensitive models. However, problems such as limited data sets and context losses limit the success of these systems. In the future, hybrid translation models may combine the strengths of traditional and modern methods. Diversification of datasets and development of context-sensitive algorithms can increase translation accuracy. Additionally, applications supported by digital archives and educational materials can enable translation systems to reach large audiences. The wider development of these systems with international cooperation offers an important opportunity to preserve the cultural and historical heritage in the transition from old-alphabet Ottoman Turkish to Latin-based Turkish. Ottoman Turkish translation systems are not only a linguistic transformation tool, but also a bridge carrying historical and cultural memory into the future. Therefore, the implementation of technological developments in harmony with language and culture is critical to increase success in this field.
... The platform distinguishes itself by swiftly adapting to contemporary technological frameworks, providing functionalities that include right-to-left HTR. It also offers crucial content analysis tools such as TEI and XML formats, and thematic tagging capabilities for Ottoman Turkish texts (Kırmızıaltın and Wrisley, 2022). The DOC initiative also dissects periods through text analysis and digital edition of Ottoman Turkish periodicals Kadınlar Dünyası and Küçük Mecmua, tracing discursive and linguistic shifts in response to socio-political dynamics. ...
Article
Digital Humanities (DH) have revolutionized the way we approach historical and cultural studies by integrating advanced information technology, enriching our understanding of the humanities. This evolution is particularly evident in Ottoman studies, where DH tools have fostered a dynamic, analytical engagement with historical sources, moving beyond mere digitization to facilitate deeper, multidimensional analysis. This article examines the progress and current state of digital Ottoman studies, highlighting the importance of interdisciplinary collaboration and the transformative potential of digital methodologies. Through a critical review of existing projects and potential future directions, it underscores the significant methodological shift towards "Digital Humanities and Ottoman Studies 2.0," emphasizing a deeper, analytical discourse on Ottoman history. This shift not only challenges traditional research boundaries but also democratizes access to knowledge, offering new insights into Ottoman studies. The integration of digital tools in Ottoman studies exemplifies the broader impact of DH on historical research, suggesting ongoing innovations and a promising future for the field.
... Various issues encountered in working with digitisation in relation to Ottoman manuscripts were highlighted in the study undertaken by Kirmizialtin and Wrisley (2022). Their study of a digital newspaper collection printed in Arabic-script Ottoman Turkish in the late 19th and early 20th centuries emphasised the difficulties of automated transcription of non-Latin script languages along with difficulties in training HTR models, particularly with writing systems which have experienced change over time. ...
... It is anticipated that transliteration applications built on advancing technology, software, and artificial intelligence will in future remove this language and alphabet barrier for researchers who wish to work on intralingual translation. Studies that investigate automating transliteration, such as the article "Automated Transcription of Non-Latin Script Periodicals: A Case Study in the Ottoman Turkish Print Archive" (Kirmizialtin & Wrisley, 2020), indicate that technological leaps in the AI-based automatic transliteration of printed Ottoman Turkish texts will occur in the near future. Other examples worth mentioning are "LexiQamus" (LexiQamus, 2016), which deciphers illegible Ottoman Turkish words and gives their meanings, and the Transkribus project (Transkribus, 2023), a very valuable and promising initiative that will enable Ottoman Turkish texts to be transliterated automatically into Latin letters by artificial intelligence. ...
Article
Intralingual translation is a subfield of Translation Studies that, with recent research, has begun to attract more attention than in the past. Although the boundaries of intralingual translation were first drawn theoretically by Roman Jakobson in 1959, the vagueness of the line between the intralingual and the interlingual has led various researchers to problematize Jakobson's definitions. Because intralingual translation is an act that takes place within a single language/culture, research in this area is often specific to one culture, and it can therefore be argued that it holds a special place within Translation Studies. With the language reforms and alphabet change carried out in the history of the Republic of Turkey, intralingual translation forms a distinctive field that, unlike in other cultures and countries, draws on an ideological root. Despite its culture-specific nature and its special position in the Republic of Turkey, many points within Translation Studies still await clarification through research in this area. This study examines the theoretical framework of intralingual translation and attempts to investigate its place in Translation Studies, the difficulties encountered in Translation Studies in Turkey, which are attributed to three main causes, and its role in Turkish modernization. Within this framework, the case study examines the cross-temporal intralingual translations of Hüseyin Rahmi Gürpınar's novel Efsuncu Baba, with the aim of conceptualizing intralingual translation as a subfield of Translation Studies and repositioning it within the discipline. The study compares and analyzes the first edition of Efsuncu Baba, printed in the Arabic alphabet in 1924, and the different intralingual translations subsequently made in 1954, 1966, 1995, and 2009.
Article
The article addresses the multilingual landscape in Digital Humanities, focusing on understanding its practitioners. We adopt the concept of user profiles from UX design to help create visibility and empathy for the unique needs of multilingual scholars. In a DH2023 workshop, using a dataset of six user profiles, participants examined multilingual DH, exploring the complex interaction between language use, identity, inclusivity, and infrastructure. Only by including multilingual perspectives, we argue, can DH promote diverse knowledge systems towards more supportive infrastructures and a more inclusive scholarly community.
Article
Multilingual expression is not exclusive to scholars in the (digital) humanities, but it is a lived reality of a great number of people around the world. The authors of this article argue that there is a specific role to be played by the digital humanist in describing and modelling the design of workflows that assume multilinguality (and multiscriptual and multidirectional practices). This work cannot be left only to the tech industry and commercial interests. On the other hand, the larger community of digital humanists is not fully aware of the issues that multilingual users and communities face. In this paper we argue that one way this can be done most effectively in these early stages is by user persona creation. Our method is to perceive the problem from a UX (user experience) persona design point of view. The present paper synthesizes our efforts to date in creating data-driven UX profiles (based upon our insights drawn from a survey, multiple interactive workshops, an open forum series organized by the authors, and a workshop at the DH Unbound conference 2022), which aim to capture shared experiences of multilingual DH textual research, recognizing how such multilinguality might appear as a marginal phenomenon. Also drawing on research in persona studies, our paper attempts to theorize the UX profile of each specific persona, not in isolation, but in interaction with other users, bringing those personas into dialogue. The purpose of this dialogue is to centre multilingual voices with shared concerns in the scholarly community of DH. We argue that so-called “marginal” multilingual cases constitute a much larger proportion of the scholarly community than is commonly believed, and as such, we must distinguish such cases from the concept of “edge cases” in UX product development research. At the same time, in such interaction between personas, conflicts emerge. Both shared and conflicting concerns are the most interesting results of our inquiry. This is why we have chosen to present our results in the form of a fictional plenary discussion to explore what new spaces of possibility can be created in the global DH community. On the other hand, for once, a diversity of these global, multilingual voices is actually recorded for the larger community, and we hope they provide it with a starting point for inclusive discussions about infrastructure and multilinguality. In sum, we argue that user persona creation can be an effective tool for familiarizing the larger community of digital humanists with the issues that multilingual users and communities face.
Article
This article presents a web-based optical character recognition (OCR) system, developed within the "Osmanlıcadan Günümüz Türkçesine Uçtan Uca Aktarım Projesi" (a project for end-to-end conversion from Ottoman to contemporary Turkish), that uses deep neural network models to convert images of Ottoman (Ottoman Turkish) documents printed in the naskh style into text. The system's deep neural network architecture consists of CNN layers, which are widely used in image recognition, and bidirectional LSTM layers, a type of RNN widely used in natural language processing. Three datasets were prepared for training (original, synthetic, and hybrid), and three OCR models with the same names were built from them. The original dataset consists of about 1,000 pages and the synthetic dataset of about 23,000 pages. These three models, collectively called Osmanlica.com OCR, were compared with the Arabic and Persian models of Tesseract, the Arabic model of Google Docs, the Arabic model of ABBYY FineReader, and the OCR models/tools of the company Miletos, using a set of 21 original document pages that we prepared for testing. Because the ground-truth and OCR output texts contain user- and software-induced errors, the texts were passed through a special normalization process before comparison. The comparison was made on three versions of the text (raw, normalized, and joined) and with three measures (character, ligature, and word recognition). The Osmanlica.com Hybrid model produced markedly better results than the other models, with accuracy rates of 88.86% raw, 96.12% normalized, and 97.37% joined in character recognition; 80.48% raw, 91.60% normalized, and 97.37% joined in connected character string (ligature) recognition; and 44.08% raw and 66.45% normalized in word recognition. To observe the effects of the distinctive characteristics of the Ottoman alphabet on OCR, the article carries out a frequency analysis of Ottoman at the character, ligature, and word levels. In this frequency analysis, the characters of the alphabet were grouped by distinguishing features such as the ability to join, the letter body, the position and number of dots, the character type, and the source language, and frequencies were computed per group. In the comparison experiments, character recognition accuracy was also computed and examined per group. The experiments did not stop at character recognition accuracy: errors were examined in detail, and letter-level OCR errors were expressed in terms of insertion, deletion, and substitution operations. In this way, which letter is most often confused with which letters, which letters are most often missed, which kinds of errors occur more often in which situations, and similar questions were answered not only for our own OCR model but for all the other models as well. We believe these findings will contribute both to the pre- and post-processing of the data and to the improvement of the models. The test dataset, which contains the 21 original document page images used in the comparison, the ground-truth texts, the models' OCR outputs, and the Python program that performs the normalization and computes the accuracy rates, is shared at osmanlica.com/test.
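The breakdown of letter-level errors into insertions, deletions, and substitutions described above can be approximated with the Python standard library. The sketch below is an assumption-laden illustration, not the authors' published program: the function name `error_breakdown` is hypothetical, and `difflib.SequenceMatcher` produces a readable alignment that is not guaranteed to be a minimum-edit-distance one, so the counts may differ slightly from a strict Levenshtein alignment.

```python
from collections import Counter
from difflib import SequenceMatcher


def error_breakdown(reference, hypothesis):
    """Count character insertions, deletions and substitutions, plus
    (reference_char, ocr_char) confusion pairs, from a difflib alignment."""
    counts = Counter()
    confusions = Counter()
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters in long inputs, which would distort character-level counts.
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_part, hyp_part = reference[i1:i2], hypothesis[j1:j2]
        if tag == "delete":        # characters the OCR output missed
            counts["deletion"] += len(ref_part)
        elif tag == "insert":      # characters the OCR output added
            counts["insertion"] += len(hyp_part)
        elif tag == "replace":
            # Pair characters position by position; any length surplus is
            # counted as extra deletions or insertions.
            paired = min(len(ref_part), len(hyp_part))
            counts["substitution"] += paired
            for ref_char, hyp_char in zip(ref_part, hyp_part):
                confusions[(ref_char, hyp_char)] += 1
            counts["deletion"] += len(ref_part) - paired
            counts["insertion"] += len(hyp_part) - paired
    return counts, confusions
```

Summing the `confusions` counters over a whole test set and calling `most_common()` on the result would give a ranked list of which letters are most often mistaken for which, in the spirit of the analysis the abstract describes.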
Preprint
HTR model development has become a conventional step for digital humanities projects. The performance of these models, often quite high, relies on manual transcription and numerous handwritten documents. Although the method has proven successful for Latin scripts, a similar amount of data is not yet achievable for scripts considered poorly endowed, like Arabic scripts. In that respect, we introduce and assess a new modus operandi for HTR model development and fine-tuning dedicated to the Arabic Maghribī scripts. The comparison between several state-of-the-art HTR systems demonstrates the relevance of a word-based neural approach specialized for Arabic, capable of achieving an error rate below 5% with only 10 pages manually transcribed. These results open new perspectives for the processing of Arabic scripts and, more generally, of poorly endowed languages. This research is part of the development of the RASAM dataset in partnership with the GIS MOMM and the BULAC.
Conference Paper
“The East Meets the West: The Establishment of the Ottoman Culinary Culture and its Cultural Importance.” Food in any culture is bound up with many religious, political, secular and social aspects and practices of a community. Food is a site which helps us understand how a flavour is widely accepted and developed. Conceptually, this site, food, will enable us to comprehend many other features of a society within the context of everyday practices, such as how a space is transformed into a secular or a religious space. This study looks at Turkish food in India as a site of inquiry where Indians accepted a composite culture through the mixing of available flavours. The contested meanings, symbolism and authenticity, and spatial significance of Turkish food are some of the concerns that the study takes up. It is believed that food always plays a vital role in bringing different cultures together. This paper is an attempt to understand the evolution of Turkish food on Indian tables. In this paper, we study the Turkish food culture which Indians wholeheartedly accepted and preserved through generations. Indian food culture is unique in many ways, but within the state, tastes differ among communities. The food habits of Indians also raise many questions, such as their relationship with Ottomans, Arabs, Portuguese, British and fellow Indians. In our paper, we study the evolution of Turkish food in India and its present state.
Article
A recent project at the University of Denver Libraries used handwritten text recognition (HTR) software to create transcriptions of records from the Jewish Consumptives’ Relief Society (JCRS), a tuberculosis sanatorium located in Denver, Colorado from 1904 to 1954. Among a great many other potential uses, these type- and hand-written records give insight into the human experience of disease and epidemic, its treatment, its effect on cultures, and of Jewish immigration to and early life in the American West. Our intent is to provide these transcripts as data so the text may be computationally analyzed, pursuant to a larger effort in developing capacity in services and infrastructure to support digital humanities as a library, and to contribute to the emerging HTR ecosystem in archival work. Just because we can, however, doesn’t always mean we should: the realities of publishing large datasets online that contain medical and personal histories of potentially vulnerable people and communities introduce serious ethical considerations. This paper both underscores the value of HTR and frames ethical considerations related to protecting data derived from it. It suggests a terms-of-use intervention perhaps valuable to similar projects, one that balances meeting the research needs of digital scholars with the care and respect of persons, their communities and inheritors, whose lives produced the very data now valuable to those researchers.