All content in this area was uploaded by Suphan Kirmizialtin on Jul 20, 2022
Handwritten Text Recognition (HTR) for Archival Collections
Suphan Kirmizialtin, NYUAD
We are using Handwritten Text Recognition (HTR) to automate the transcription of the archival collections to:
●increase accessibility of the collections and open them up to wider audiences
●generate searchable metadata and/or document summaries
●create keyword searchable and computer readable corpora
●open the collections up to higher order analysis and distant reading tools
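To illustrate what "keyword searchable" can mean in practice, the following is a minimal Python sketch of an inverted index built over transcribed pages. It is not part of any HTR tool; the function names, page identifiers, and sample texts are hypothetical, and a production system would more likely rely on a search engine with language-specific tokenization.

```python
import re
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        # \w+ is a simple word matcher; real corpora may need better tokenization.
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_id)
    return dict(index)


def search(index: dict[str, set[str]], keyword: str) -> list[str]:
    """Return the sorted ids of pages containing the keyword (case-insensitive)."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical transcriptions, standing in for HTR output.
pages = {
    "vol1_p1": "Letter from the Resident at Bushire",
    "vol1_p2": "Report on trade at Basra",
}
index = build_index(pages)
matches = search(index, "Bushire")
```

Once pages are transcribed, the same index can be rebuilt or extended as new volumes are processed.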
Automatic Text Recognition for Historical Archives with HTR
Image source: Wikimedia Commons
https://upload.wikimedia.org/wikipedia/commons/4/4d/Wikimedia_OCR_advanced_form.png
Optical Character Recognition (OCR)
English-language text OCR’d with Tesseract
Image from https://readcoop.eu/insights/ocr-vs-htr/
●The most commonly used OCR algorithms are trained on modern, printed texts
●OCR tends to work best for printed texts produced with modern publishing technologies; it is prone to higher error rates when applied to handwritten texts, non-Latin script texts, texts with complex page layouts, and historical corpora
●HTR is developed for handwritten collections and works well for historical documents with complex page layouts and ‘messy’ handwriting
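Error rates of this kind are commonly reported as a character error rate (CER): the edit distance between an automatic transcription and a manually prepared reference, divided by the length of the reference. The following is a minimal Python sketch of that metric, not the evaluation code of any particular OCR or HTR tool; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

A lower CER means the automatic transcription is closer to the reference; comparing the CER of an OCR engine and an HTR model on the same sample pages is one way to quantify the difference described above.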
HTR:
●deep learning approach to text recognition
●“language-agnostic”: relies on brute-force techniques that utilize large data sets for pattern recognition
●scalable corpora creation
●allows recycling and sharing of training data and text recognition models
Automatic Text Recognition of the Ottoman Turkish Print Collections and the Bushire Residency Archive with HTR
Bushire Residency Archive:
●British colonial presence in the Gulf (late 18th to mid-20th century)
●Archive of 26K+ handwritten pages, 180+ volumes
●Written in multiple hands
●Collection digitized and available at Qatar Digital Library, but not in computer-readable format
Ottoman Turkish Print Collections:
●Arabic-script Ottoman Turkish collections
●1,500 periodicals from the 1840s to the 1920s
●~400,000 pages
●printed in multiple, similar typefaces
●Collection digitized and available at HTU website, but not in computer-readable format
Additional Sources
HTR: Transkribus How-to Guides
Transkribus Lite
New Zealand Alpine Journal Archive
Beyond 2022: Public Record Office of Ireland
Qatar Digital Library
Terras, M., et al., 2018. Enabling complex analysis of large-scale digital collections: humanities research, high-performance computing, and transforming access to British Library digital collections. Digital Scholarship in the Humanities 33, 456–466. https://doi.org/10.1093/llc/fqx020
Thank you!
suphan@nyu.edu