
The Potential of GPT in Ottoman Studies: Computational Analysis of Evliya Çelebi's Travelogue with NLP and Text Mining and Digital Edition with TEI


Abstract

This study presents a methodological approach for analyzing Evliya Çelebi's 17th-century travelogue of Istanbul through Natural Language Processing (NLP), multi-model text mining (in particular the use of GPT), and a digital edition based on the Text Encoding Initiative (TEI). Using co-occurrence and frequency analysis, sentiment analysis with transformer models such as BERT, topic modeling with Latent Dirichlet Allocation (LDA), and Named Entity Recognition (NER), the research tests the technical limits of uncovering semantic and thematic patterns in the travelogue. It specifically addresses the challenges of applying computational analysis methods to Ottoman Turkish and emphasizes the need for language models tailored to historical texts, highlighting the potential of GPT in this context. The study also applies TEI standards to a digital edition of the travelogue, addressing the challenges and potential of reconstituting Ottoman Turkish texts in digital formats. Combining quantitative and thematic analyses, this work examines the socio-cultural and historical content of the travelogue in depth. The article aims to contribute to digital humanities, and specifically to digital history in Ottoman studies, by integrating technological methods into the analysis and understanding of historical and linguistic texts.
(5-7 October 2023, Istanbul)
(Preprint V1)
Fatma Aladağ
Keywords: GPT, LLM, NLP, NER, LDA, Co-occurrence Analysis, Frequency Analysis, TEI, 17th Century,
Digital Humanities, Digital History
Digital Humanities opens new horizons for the in-depth analysis and understanding of historical
texts through text mining. This interdisciplinary field enables complex analytical [...] When the
rich linguistic structure of Ottoman Turkish is analyzed with the computational techniques that
digital humanities offers researchers, in-depth information on the social and cultural
dynamics of historical periods can be obtained.
Text mining, which occupies an important place in digital humanities and derives high-quality
information from text, is an interdisciplinary field at the intersection of data science,
linguistics and computer science. It focuses on analyzing large volumes of text with advanced
algorithms and statistical methods in order to extract information. Various methodologies and
algorithms are used to transform the unstructured source texts into structured data suitable
for analysis. This article presents a comprehensive application and methodology for analyzing
the 17th-century Ottoman traveler Evliya Çelebi's Istanbul travelogue, which has a wealth of
socio-cultural, geographical and historical [...] but also an important historical document
that provides detailed information about the [...] encyclopedic in scope, covering descriptions
of daily life, architecture, customs and administrative systems of the Ottoman Empire. The
application of text mining to such a corpus requires an in-depth understanding of the language
and context, which is complicated by the Arabic and Persian influences on Ottoman Turkish.
The unique features of Ottoman Turkish reveal the necessity of language models that can handle
[...] of Ottoman Turkish, standard text mining tools need to be improved. Preprocessing
techniques such as tokenization, lemmatization and tagging should be redefined to accommodate
historical syntax and morphology. Efforts by computer scientists to digitize Ottoman Turkish,
which began in the 1990s, had by 2023 reached a significant level of development in automatic
transcription and digital edition thanks to artificial intelligence.
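The preprocessing concern raised above can be illustrated with a minimal sketch. It is not the pipeline used in this study: it only assumes Latin-script transliterated text and shows why tokenization and case folding need language-aware handling, since the default lowercasing in most tools mishandles the Turkish dotted and dotless "i". The function names and the set of apostrophe-like characters are hypothetical choices made for illustration.

```python
import re
import unicodedata
from collections import Counter

# A token is a run of letters (this includes â, î, û, ş, ğ, ı), optionally joined
# by an apostrophe-like mark as used for suffixes or ayn/hamza in transliteration.
# The exact character set is an assumption for illustration.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’ʿʾ][^\W\d_]+)*")


def turkish_lower(text):
    """Lowercase with Turkish rules: 'İ' -> 'i' and 'I' -> 'ı'."""
    # NFC first, so a decomposed 'I' + combining dot becomes a single 'İ'.
    text = unicodedata.normalize("NFC", text)
    return text.replace("İ", "i").replace("I", "ı").lower()


def tokenize(text):
    """Split text into lowercase word tokens, keeping diacritics intact."""
    return TOKEN_RE.findall(turkish_lower(text))


def frequencies(text, stopwords=frozenset()):
    """Count token frequencies, skipping any tokens listed in stopwords."""
    return Counter(tok for tok in tokenize(text) if tok not in stopwords)
```

For example, `tokenize("İstanbul'un IŞIK")` yields `["istanbul'un", "ışık"]`, whereas plain `str.lower()` would turn "IŞIK" into "işik". Lemmatization and tagging are left out here because they would require a morphological resource for Ottoman Turkish.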
This study discusses the challenges, limitations and potential of state-of-the-art text mining
techniques when applied to travelogue texts. The capacities of existing technological
approaches are revealed through their application both to the Ottoman Turkish text converted
into Latin script and to the simplified Turkish text of the travelogue. The study thereby also
aims to provide insights for the computational analysis and digitization of Ottoman Turkish
texts.
Computational Analyses on Travelogue: Application of Multi-Model Text Mining and NLP Techniques
Transformation from Text to Data: Preparing the Travelogue for Analysis and Preprocessing
Analyzing the Digital Interaction of Tradesmen and Shops
Frequency Analysis of the Travelogue with Named Entity Recognition (NER)
Network-Based Semantic Profile of "Istanbul" and "Mahalle" (Neighborhood): Co-occurrence Analysis
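As a rough illustration of the technique named in this heading, the sketch below counts which words appear within a fixed window around a target term such as "mahalle". It is a generic windowed co-occurrence count, not the procedure used in this study; the function name, the window size and the sample tokens are assumptions made for illustration only.

```python
from collections import Counter


def cooccurrences(tokens, target, window=5):
    """Count tokens appearing within `window` positions of each `target` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the target occurrence itself
                counts[tokens[j]] += 1
    return counts


# Illustrative tokens only; these are not taken from the travelogue.
sample = "istanbul mahalle cami istanbul çarşı mahalle".split()
neighbors = cooccurrences(sample, "mahalle", window=1)
```

The resulting counts can be read as weighted edges from the target term to each neighbor, which is the usual starting point for drawing a co-occurrence network.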
Topic Modeling of the Travelogue Using Latent Dirichlet Allocation (LDA)
Istanbul, Galata and Üsküdar: Sentiment Analysis of Spaces
A Regex-Based Analysis of the Travelogue: Phrases and Reduplications
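A regex-based search for reduplications of the kind this heading names might look like the following sketch. The two patterns are heuristics chosen for illustration (exact repetition, and the Turkish "m-" echo form); they are assumptions, not the expressions used in this study, and the sample sentence is modern Turkish rather than text from the travelogue.

```python
import re

# Exact repetition of the same word, separated by a space or hyphen,
# e.g. "yavaş yavaş". Input is assumed to be lowercased already.
EXACT = re.compile(r"\b(\w+)[ -]\1\b")

# Echo reduplication: the second word repeats the first with its initial
# consonant replaced by "m" (or "m" prefixed to a vowel-initial word),
# e.g. "kitap mitap". Words that already start with "m" are excluded so
# that exact repetitions such as "mavi mavi" are not counted twice.
M_FORM = re.compile(r"\b(?!m)([^\W\d_]?)(\w+)\s+m\2\b")


def find_reduplications(text):
    """Return (kind, matched phrase) pairs for both heuristic patterns."""
    found = [("exact", m.group(0)) for m in EXACT.finditer(text)]
    found += [("m-form", m.group(0)) for m in M_FORM.finditer(text)]
    return found


# Illustrative sentence only.
hits = find_reduplications("yavaş yavaş yürüdü, kitap mitap aldı")
```

Both patterns will produce some false positives on real text, so the matches are best treated as candidates for manual review rather than as a final count.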
Designing a Digital Edition of the Travelogue with the Text Encoding Initiative (TEI)