David Joseph Wrisley’s research while affiliated with New York University and other places


Publications (4)


Figure 1 Sample pages from the volume "File 5/190 II Manumission of slaves at Muscat: individual cases," showing examples of handwritten English and Arabic together with typewritten English and blank pages. Depicted here are folios 149r-151r (images 306-310). Digitized by the Qatar Digital Library under an Open Government Licence.
Figure 2 Key terms ("nakhuda" [upper left], "powers" [upper right], "weapons" [center right], "traffic" [center right], "clerk" [bottom center], and "mistress" [lower left]) in red and their contextual associations in black, plotted in a two-dimensional space generated using PCA on our custom-trained word vector model (WVM). The figure illustrates how specific words are spatially distributed based on their contextual similarities. The model was trained on 2-grams with 150 dimensions, 20 iterations, a 6-word window, and negative sampling of 5.
Exploring Gulf Manumission Documents with Word Vectors
  • Article
  • Full-text available

December 2024 · 15 Reads · David Joseph Wrisley

In this article we analyze a corpus related to manumission and slavery in the Arabian Gulf in the late nineteenth and early twentieth centuries, which we created using Handwritten Text Recognition (HTR). The corpus comes from India Office Records (IOR) R/15/1/199 File 5. Spanning the period from the 1890s to the early 1940s and comprising 977K words, it contains a variety of perspectives on manumission and slavery in the region, from manumission requests to administrative documents reflecting colonial approaches to the institution of slavery. We use word2vec with the wordVectors package in R to highlight how the method can uncover semantic relationships within historical texts, demonstrating exploratory semantic queries, investigation of word analogies, and vector operations on the corpus content. We argue that advances in applied computer vision such as HTR are promising for historians working in colonial archives, and that, while our method is reproducible, issues remain related to language representation and the limitations of scale in smaller datasets. Even though HTR corpus creation is labor-intensive, word vector analysis remains a powerful tool of computational analysis for corpora where HTR error is present.
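The workflow the abstract describes (training word vectors on an HTR-derived corpus, then running similarity queries, analogies, and a PCA projection of key terms) can be sketched as follows. This is a minimal Python analogue using gensim and scikit-learn; the article itself uses the wordVectors package in R. The file name and preprocessing are assumptions, while the hyperparameters mirror the Figure 2 caption.

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
from sklearn.decomposition import PCA

# Tokenized sentences from the HTR-generated corpus (hypothetical path).
with open("ior_r15_1_199_file5.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f if line.strip()]

# Detect frequent 2-grams so multiword terms train as single tokens,
# mirroring the "trained on 2-grams" setting in the Figure 2 caption.
bigram = Phraser(Phrases(sentences, min_count=5))
sentences = [bigram[s] for s in sentences]

# Hyperparameters from the caption: 150 dimensions, 6-word window,
# negative sampling of 5, 20 iterations (epochs). The skip-gram choice
# (sg=1) is an assumption; the caption does not specify the architecture.
model = Word2Vec(
    sentences, vector_size=150, window=6, negative=5,
    epochs=20, sg=1, min_count=5, workers=4,
)

# Exploratory semantic queries and vector arithmetic of the kind the
# article demonstrates, using terms that appear in the figure.
print(model.wv.most_similar("nakhuda", topn=10))
print(model.wv.most_similar(positive=["mistress", "manumission"], topn=10))

# Project selected key terms into two dimensions with PCA, as in Figure 2.
terms = ["nakhuda", "powers", "weapons", "traffic", "clerk", "mistress"]
coords = PCA(n_components=2).fit_transform([model.wv[t] for t in terms])
for term, (x, y) in zip(terms, coords):
    print(f"{term}: ({x:.3f}, {y:.3f})")
```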


Automated Transcription of Non-Latin Script Periodicals: A Case Study in the Ottoman Turkish Print Archive

July 2022 · 34 Reads · 14 Citations

Link to article: http://www.digitalhumanities.org/dhq/vol/16/2/000577/000577.html

Our study discusses the automated transcription, using deep learning methods, of a digital newspaper collection printed in a historical language, Arabic-script Ottoman Turkish (OT), dating to the late nineteenth and early twentieth centuries. We situate OT text collections within a larger history of the digitization of periodicals, underscoring the special challenges faced by Arabic-script languages. Our paper approaches the question of automated transcription of non-Latin-script languages, such as OT, from the broader perspective of debates surrounding OCR use for historical archives. In our study of OT, we have opted to train handwritten text recognition (HTR) models that generate transcriptions in the left-to-right Latin writing system familiar to contemporary readers of Turkish, and not, as some scholars may expect, in right-to-left Arabic-script text. As a one-to-one correspondence between the writing systems of OT and modern Turkish does not exist, we also discuss approaches to transcription and the creation of ground truth, and we argue that the challenges faced in training HTR models also call into question straightforward notions of transcription, especially where divergent writing systems are involved. Finally, we reflect on the potential domain bias of HTR models in other historical languages exhibiting spatio-temporal variance, as well as the significance of working between writing systems for language communities that have also experienced language reform and script change.
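The paper reports results from HTR experiments; the standard way to quantify such results is the character error rate (CER), the edit distance between a model's transcription and the ground truth, normalized by the length of the ground truth. Below is a minimal, dependency-free Python sketch of that metric. The file names are hypothetical, and this is a generic illustration rather than the authors' own evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance over reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

# Hypothetical files: the model's transcription and the ground truth.
with open("htr_output.txt", encoding="utf-8") as h, \
     open("ground_truth.txt", encoding="utf-8") as r:
    print(f"CER: {cer(h.read(), r.read()):.2%}")
```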


Figure 1: Distribution of Named Entities in the training and gold test datasets
Figure 2: Named Entity Recognition output from the en_core_web_sm model tested on gold data
Table: Custom tags used for our historical corpora
Table: Varieties of models used in our research
Table: Evaluation results for NER models in section 5, tested on gold annotated data
Fine Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Arabian/Persian Gulf

March 2022 · 75 Reads · 5 Citations

Digital Humanities in the Nordic and Baltic Countries Publications

Text recognition technologies increase access to global archives and make possible their computational study using techniques such as Named Entity Recognition (NER). In this paper, we present an approach to extracting a variety of named entities (NE) from unstructured historical datasets in open digital collections dealing with a space of informal British empire: the Persian Gulf region. The sources are largely concerned with people, places, and tribes, as well as economic and diplomatic transactions in the region. Since models in state-of-the-art NER systems function with limited tag sets and are generally trained on English-language media, they struggle to capture entities of interest to the historian and do not perform well on entities transliterated from other languages. We build custom spaCy-based NER models trained on domain-specific annotated datasets. We also extend the set of named entity labels provided by spaCy and focus on detecting entities of non-Western origin, particularly from Arabic and Farsi. We test and compare the performance of blank, pre-trained, and merged spaCy-based models, and we suggest further improvements. Our study makes an intervention into thinking beyond Western notions of the entity in digital historical research by creating more inclusive models using non-metropolitan corpora in English.
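As a sketch of what extending spaCy's label set and fine-tuning on domain-specific annotations can look like, here is a minimal Python example using spaCy v3. The training sentence, its entity spans, and the custom TRIBE label are illustrative toy data, not the authors' dataset; a real run would train on their annotated corpora and compare blank, pre-trained, and merged pipelines as the abstract describes.

```python
import random
import spacy
from spacy.training import Example

nlp = spacy.blank("en")   # blank pipeline; the paper also tests pre-trained ones
ner = nlp.add_pipe("ner")

# Toy annotated data; offsets are character spans into the text.
TRAIN_DATA = [
    ("Shaikh Isa bin Ali met the Al Bu Falasah near Dubai.",
     {"entities": [(0, 18, "PERSON"), (27, 40, "TRIBE"), (46, 51, "GPE")]}),
]

# Register every label, including custom ones beyond spaCy's default set.
for _text, ann in TRAIN_DATA:
    for _start, _end, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for _epoch in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer, losses=losses)

# Apply the fine-tuned model to unseen text.
doc = nlp("The Al Bu Falasah settled on the coast.")
print([(ent.text, ent.label_) for ent in doc.ents])
```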


Fig 1. Character correspondence chart of polyphonic OT letters
Fig 2. A snapshot from the Transkribus interface demonstrating the transcription process. Note the left-justified, yet reversed, Latin-alphabet transcription (center bottom) of the OT text (top right). The OT text displayed in the canvas tool is from Küçük Mecmua.
Automated Transcription of Non-Latin Script Periodicals: A Case Study in the Ottoman Turkish Print Archive

November 2020 · 51 Reads

Our study utilizes deep learning methods for the automated transcription of late nineteenth- and early twentieth-century periodicals written in Arabic-script Ottoman Turkish (OT) using the Transkribus platform. We discuss the historical situation of OT text collections and how they were largely excluded from the late twentieth-century corpus digitization that took place for many Latin-script languages. This exclusion has two basic reasons: the technical challenges of OCR for Arabic-script languages, and the rapid abandonment of that very script in the Turkish historical context. In the specific case of OT, opening periodical collections to digital tools requires training HTR models to generate transcriptions in the Latin writing system familiar to contemporary readers of Turkish, and not, as some may expect, in right-to-left Arabic-script text. In the paper we discuss the challenges of training such models where a one-to-one correspondence between the writing systems does not exist, and we report results based on our HTR experiments with two OT periodicals from the early twentieth century. Finally, we reflect on the potential domain bias of HTR models in historical languages exhibiting spatio-temporal variance, as well as the significance of working between writing systems for language communities that have experienced language reform and script change.
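The reason a one-to-one correspondence fails, as Fig. 1's chart of polyphonic letters suggests, is that a single Arabic-script letter can stand for several modern Turkish letters depending on context. A small illustrative subset of that one-to-many relation in Python (the selection of letters and readings is ours, not the paper's chart):

```python
# A few polyphonic Ottoman Turkish letters and their possible modern
# Turkish readings (an illustrative subset, not the paper's full chart).
POLYPHONIC_OT = {
    "و": ["v", "o", "ö", "u", "ü"],  # vav
    "ك": ["k", "g", "ğ", "n"],       # kef
    "ي": ["y", "ı", "i"],            # ye
}

for letter, readings in POLYPHONIC_OT.items():
    print(f"{letter} -> {', '.join(readings)}")

# Because each letter admits several readings, a lookup table cannot
# choose among them; the HTR model must learn the correct reading from
# context, which is why transcription is treated as a modeling problem
# rather than a character-by-character substitution.
```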

Citations (2)


... The balanced performance of both RTC-NER models was hence assured. Corpus Annotation 'NER-Annotator', a user-friendly web interface for manual annotation of entities for spaCy model training, was used [25]. We defined a set of custom tags/labels of relevance to RTC incidents, as presented in Table 2. ...

Reference:

Geo-parsing and Analysis of Road Traffic Crash Incidents for Data-Driven Emergency Response Planning
Fine Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Arabian/Persian Gulf

Digital Humanities in the Nordic and Baltic Countries Publications

... It is emphasized that modern metrics used in intralingual translation processes provide an objective measurement of system performance. It has been noted that deep learning-based systems are more efficient than manual methods in automatic transcription of old-alphabet Ottoman Turkish texts [10,26]. ...

Automated Transcription of Non-Latin Script Periodicals: A Case Study in the Ottoman Turkish Print Archive
  • Citing Article
  • July 2022