Handwritten Text Recognition (HTR) for Archival Collections

Authors:
Suphan Kirmizialtin, NYUAD
We are using Handwritten Text Recognition (HTR) to automate the transcription of archival collections in order to:
increase accessibility of the collections and open the collections up to wider audiences
generate searchable metadata and/or document summaries
create keyword searchable and computer readable corpora
open the collections up to higher order analysis and distant reading tools
Automatic Text Recognition for Historical Archives with HTR
Image source: Wikimedia Commons
https://upload.wikimedia.org/wikipedia/commons/4/4d/Wikimedia_OCR_advanced_form.png
Optical Character Recognition (OCR)
English language text OCR’d with Tesseract
Image from https://readcoop.eu/insights/ocr-vs-htr/
The most commonly used OCR algorithms are trained on modern printed texts
OCR tends to work best for printed texts produced with modern publishing technologies; it is
prone to higher error rates when applied to handwritten texts, non-Latin script texts, texts with
complex page layouts, and historical corpora
HTR is developed for handwritten collections and works well for historical documents with complex
page layouts and ‘messy’ handwriting
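The "error rates" mentioned above are commonly reported as a character error rate (CER): the edit distance between a recognized text and a manually checked transcription, divided by the length of that transcription. The following is only a minimal sketch of that measure in Python; the function names and the two sample strings are hypothetical illustrations and are not taken from the collections or tools discussed in this presentation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical sample strings (assumption, for illustration only).
reference = "the residency at bushire"   # manually checked transcription
hypothesis = "the residencv at bushirc"  # imagined recognizer output
cer = character_error_rate(reference, hypothesis)
print(f"CER: {cer:.3f}")
```

A lower CER means fewer manual corrections per page, which is one way the suitability of OCR or HTR output for a given collection might be compared.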
HTR:
deep learning approach to text recognition
“language-agnostic”: relies on brute-force techniques that utilize large data sets for pattern recognition
scalable corpora creation
allows recycling and sharing of training data and text recognition models
https://readcoop.eu/
Automatic Text Recognition of the Ottoman Turkish Print Collections and the
Bushire Residency Archive with HTR
Bushire Residency Archive: British colonial presence in the Gulf (late 18th to mid-20th
century)
Archive of 26K+ handwritten pages, 180+ volumes
Written in multiple hands
Collection digitized and available at Qatar Digital
Library, but is not in computer-readable format
Arabic script Ottoman Turkish collections
1,500 periodicals from 1840s to 1920s
~400,000 pages
printed in similar but multiple typefaces
Collection digitized and available at HTU website, but
is not in computer-readable format
Established Projects with HTR
https://www.nzaj-archive.nz/#/
https://beyond2022.ie/
Additional Sources
HTR: Transkribus How-to Guides
Transkribus Lite
New Zealand Alpine Journal Archive
Beyond 2022: Public Record Office of Ireland
Qatar Digital Library
Terras, M., et al., 2018. Enabling complex analysis of large-scale digital collections: humanities research,
high-performance computing, and transforming access to British Library digital collections. Digital
Scholarship in the Humanities 33, 456–466. https://doi.org/10.1093/llc/fqx020
Thank you!
suphan@nyu.edu