Automated Transcription of Handwritten Text: READ and TRANSKRIBUS (An
Experiment with Transcribing Letters of Andrej Kmeť)
Prof. PhDr. Dušan Katuščák, PhD
Abstract
The study gives a brief description of the READ (Recognition and Enrichment of
Archival Documents) European research project [1], [2], implemented in 2016-2019
under the Horizon 2020 programme. The research project was supervised directly by
the European Commission and was evaluated by independent evaluators on an annual
basis [3]. The main outcome of the project is the Transkribus platform and tool, which
represents a fundamental global innovation focused on transcription of historical
manuscripts and documents. The author, as one of the evaluators of the READ
Project, describes his experience and knowledge gained during the experimental
transcription of the handwritten letters by Andrej Kmeť. He explains his view of Digital
Humanities as the methodological context of the READ project, and provides a brief
description of the scanning process, image loading, segmentation and automatic
transcription, as well as particular examples of automatic transcription of the
handwritten letters by Andrej Kmeť, and the experiment's results.
1.1 Introduction
Digital Humanities (DH) is a general name for an area that serves as an umbrella
for different fields of scientific and practical activity aimed at the use of ICT in the
social sciences and humanities. In essence, we believe it is "old wine in a new bottle",
because digital technologies have been used in the social sciences and humanities,
to a varying extent, since the 1970s. Digital Humanities are seen as a concept with a
broad extent and unclear content. As far as the extent of the concept is concerned, it
does not refer to a single entity but to what is common to many individual entities.
What these entities have in common is the application of what is traditionally referred
to as information and communication technology (ICT), or digital technology. Digital
technology [4] is the content of the concept of digital humanities because it is a
common feature of all the elements of the set of entities to which the term refers.
Digital Humanities are neither a branch of science nor a discipline. They represent a
cross-sectional methodology in the social sciences and humanities, applied in
research, development, management and practice.
[1] https://read.transkribus.eu
[2] Mühlberger, Günter. READ (Recognition and Enrichment of Archival Documents) - 2016-2019. [project study].
Available at:
https://www.academia.edu/22653102/H2020_Project_READ_Recognition_and_Enrichment_of_Archival_Documents_-_2016-2019
[3] Christophe DOIN. Project Officer. European Commission. DG CONNECT C1. EUFO 01/150A. Rue Robert Stumper.
L-2350 Luxembourg-Ville. Luxembourg. Christophe.DOIN@ec.europa.eu
Reinhard Altenhöner. Deputy Director General. Staatsbibliothek zu Berlin - Preußischer Kulturbesitz
Zentralabteilung SV/Z. Potsdamer Straße 33, 10785 Berlin. E-mail: reinhard.altenhoener@sbb.spk-berlin.de
Lorna M. Hughes. Professor of Digital Humanities. Head of Subject. Information Studies. 11 University Gardens.
University of Glasgow. Glasgow, G12 8QQ. Scotland. E-mail: Lorna.Hughes@glasgow.ac.uk
Dušan Katuščák. Professor of Library and Information Science. Silesian University in Opava. Faculty of Philosophy
and Science. The Institute of the Czech language and Library Science; State Research Library, Banská Bystrica.
Dusan.katuscak@fpf.slu.cz
[4] LIS: Library and Information Science / Studies
In our opinion, in the field of library and information science and practice (LIS [5]),
Slovakia caught the first wave of implementing digital technologies in the social
sciences and humanities in the late seventies, and we look to the years to come with
some optimism and expectations of continuity and innovation, too. We believe there
have been both early and advanced phases or stages of Digital Humanities in the
field of library and information systems and services in Slovakia. Digital Humanities
have significantly affected LIS and other humanities and social sciences, and their
influence is set to grow.
The first stage of Digital Humanities in the field is the machine processing of the
Slovak National Bibliography (SNB) (1968-1975), relatively soon after the discovery
of an integrated chip, in 1970.
The second stage is the mechanisation and automation (of SNB) (P13 State
Programme) in 1975-1980.
The third stage includes IKIS and CASLIN (Integrated Cooperative Information
System and Czech and Slovak Information Network) (P18 State Programme)
launched in 1985.
The fourth stage is the information society programme including in particular the
third-generation library and information system – KIS3G - and the Slovak Library
portal (a state programme for supporting the implementation of standards,
internationalisation of the sector), 1994-2005. In the European context, this project is
viewed as one of the most significant practical services for knowledge management [6].
The fifth stage is a qualitatively new aspect of Digital Humanities. It represents a true
scientification of our field, extensive scientific collaboration and the application of
an engineering approach (chemists, biologists, computer scientists) to specific
problems of our field (acidic paper). This included preservation research under the
KNIHA SK (deacidification) state basic research programme in 2000-2010 and
specific applied solutions to the major professional and civilisational problem of
perishing paper information carriers from the period 1830-1990.
The sixth stage is a unique digitisation project – the Digital Library and Digital
Archives, which has a national and European dimension conceived as a service for
the humanities and social sciences in a generously-funded Information Society
Operational Programme (OPIS2) (EU-funded and state programme) in 2004-2015.
The seventh stage is characterised by such accompanying phenomena of digital
humanities as Big Data, Open Data, Open Access, Open Archive, Linked Data, data
visualisation, the use of digital content, cloud computing, etc.
Due to the cross-sectional methodology, digital humanities are fading away, the
old paradigm in LIS based on accumulation is slowly becoming obsolete, and the
emphasis is now placed on the use of the accumulated records, knowledge and data.
Document records and documents in digital form have been produced and stored in
databases and repositories for decades. However, the mere use of digital records is
insufficient (for science, research, education, entertainment, industry, economy,
business, public and private sectors).
[6] European Commission. The fact sheets present an overview of the state and progress of eGovernment in European
countries. Joinup is a joint initiative by the Directorate General for Informatics (DG DIGIT) and the Directorate General for
Communications Networks, Content & Technology (DG CONNECT). Production/Publishing: ISA Editorial Team, Wavestone
Luxembourg S.A. May 2018. Available at:
https://joinup.ec.europa.eu/sites/default/files/inline-files/eGovernment_in_Slovakia_2018_0.pdf
1.2 Attributes of Digital Humanities
In Digital Humanities, new methods of research and ICT use are applied [7]. Research
in DH is characterised by:
1. cooperation of researchers in research projects,
2. scientification in social sciences and humanities,
3. interdisciplinarity – informatics, chemistry, history, economics, medicine,
sociology, education, psychology...
4. teamwork (inter-institutional, international, among universities, libraries, archives,
galleries, museums),
5. significant involvement of ICT in research, education and access to knowledge,
6. artificial intelligence (Hidden Markov Model, HMM) – recognition of speech,
handwriting, gestures, bioinformatics.
We consider the concept of digital humanities as the common name for all
applications of information and communication technologies (ICT) in social sciences
and humanities, in the related fields and disciplines and the corresponding practice.
In the social sciences and humanities and in practice, the knowledge and tools from
the fields of ICT are used [8]. The flow of knowledge is not unidirectional from ICT to
the practice of social sciences and humanities, because the application of ICT
knowledge, methods and tools in the social sciences and humanities in turn places
requirements on ICT. For example, there is such interaction in LIS in the
requirements related to integrated library and information systems, infrastructures
and workflow of digitisation, optical character recognition, text analysis, information
retrieval tools, long-term archiving of digital content, data formats, databases, etc. As
knowledge, methods and tools of ICT disciplines are used in social sciences and
humanities and in practice, they can be considered as fields that belong under a
common umbrella of Digital Humanities.
1.3 Digital Humanities and the READ Project
The READ Project has all the attributes of a digital humanities methodology. The
project was implemented under the Horizon 2020 framework programme for
research and innovation, contract number 674943. The project ended on
June 30th, 2019. The final evaluation of the project took place on September 12th,
2019 in Luxembourg. The project was coordinated by its author, Professor Günter
Mühlberger (University of Innsbruck, Digitisation and Digital Preservation Group).
[7] Digital technology: a) a field of scientific and engineering knowledge dealing with the creation and practical use of
digital or computerised devices, methods, systems, etc.; b) digital equipment, methods, systems, etc. created with such
knowledge; c) the application of the knowledge for practical purposes, for example, in the field of digital communications and
social media. (According to: https://www.dictionary.com/browse/digital-technology)
[8] Theoretical Informatics (also for natural sciences); Applied Informatics; Software Engineering (also for natural
sciences); Economic Informatics; Telecommunications; Military communication and information systems; Telecommunication
technology; Telecommunication systems; Computer Engineering; Artificial Intelligence; Information Systems; Information
Theory; Process Management; Robotics (also for mechanical engineering); Cybernetics; Technical Cybernetics; Other related
fields of Information and Communication Technology.
The University of Innsbruck has explored the core technology of handwriting
recognition, layout analysis and keyword search for historical documents in
collaboration with 13 other partners from Europe since 2016. Research into all
three of these areas is undertaken by university teams in Valencia and Rostock, at
the Vienna University of Technology and at other research institutions participating
in the READ Project.
The READ Project was supported by the European Union with a sum of EUR 8.2
million. The financing period ended on June 30th, 2019. However, follow-up projects
are being started to continue the basic and applied research. The author of the study
seeks to involve Slovak and Czech institutions in this exceptional scientific innovation
effort covered by the concept of digital humanities.
1.4 Importance of the Transkribus Platform [9]
The Transkribus platform implements the outcome of basic research. In addition to
basic research, creating the Transkribus research platform was one of the main
objectives of the READ project.
Approximately EUR 2.5 million of the above EUR 8.2 million was invested in the
development of the research infrastructure, which has taken digitisation, recognition,
transcription and search in historical documents to a completely new technological
level. The technology, which is based on machine learning methods, is particularly
important for:
- archives, libraries and museums that want to improve access to their collections,
- researchers in the humanities, who can build their research on completely new
foundations ("digital humanities"),
- the general public, which benefits from dramatically improved access to "family
data" in the archives, and
- computer scientists and technology providers, who receive very substantial data
sets for their research, thus enabling them to develop improved algorithms and
methods.
Transkribus has a transformative power for the entire process of value creation in the
digitisation of historical documents. According to the NUMERIC (2010) statistics there
are about 26.98 billion pages in European archives. It is expected that about 10.45
billion pages of this amount will gradually become digitised.
The Slovak archives hold an estimated 170 kilometres of archival materials.
For over twenty years, Manuscriptorium, a digital library of manuscripts containing
over 46,000 fully digitised documents and approximately 400,000 descriptive
records [10], has been built cooperatively in the Czech Republic. In the
archives, one metre accounts for about 7,000 pages. Ideally, digitisation should
include automated conversion of selected archival handwritten, typewritten and other
materials. That is why Transkribus!
[9] Mühlberger, Günter. READ. D3.4. READ Platform Business Implementation. Report for Period 3. [Confidential].
05.08.2019. H2020 Project 674943.
[10] PSOHLAVEC, Tomáš. Digitální knihovna Manuscriptorium. In: Libraries V4 in the Decoy of Digital Age. Proceedings
of 6th Colloquium of Library and Information Experts of the V4+ Countries held from 31st May to 1st June 2016 in Brno. Brno :
Moravská zemská knihovna v Brně, 2016. S.(cze) 367-374. ISBN 978-80-7051-216-6 (paperback)
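As a rough illustration of the scale implied by the figures quoted above, the following Python sketch multiplies the estimated 170 kilometres of Slovak archival materials by about 7,000 pages per metre. The function name and constants are introduced here for illustration only; the result is an order-of-magnitude estimate, not a figure reported by the archives.

```python
# Rough estimate of the number of pages held in Slovak archives,
# using the two figures quoted in the text.
ARCHIVE_LENGTH_KM = 170   # estimated length of archival materials
PAGES_PER_METRE = 7_000   # approximate number of pages per metre


def estimate_pages(length_km: float, pages_per_metre: int) -> int:
    """Convert a length in kilometres into an approximate page count."""
    metres = length_km * 1_000
    return round(metres * pages_per_metre)


total_pages = estimate_pages(ARCHIVE_LENGTH_KM, PAGES_PER_METRE)
print(f"{total_pages:,} pages (about {total_pages / 1e9:.2f} billion)")
```

Under these assumptions the estimate comes to roughly 1.19 billion pages.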
1.5 Unique Features of Transkribus
Transkribus is the only platform in the world that allows non-technical users to train
the neural network and specific models, which are then able to recognise
manuscripts and printed materials in any language and script with good to
excellent results.
Printed publications from the 16th to 19th centuries can be recognised at an error
rate significantly below one percent, the various manuscripts at 2 to 5 %, and
collective manuscripts at 6 to 10 %. A few years ago, these figures were quite
unthinkable.
Automated transcription using the Transkribus platform often provides nearly flawless
text. However, this is only possible by training, or teaching the system, and through
patient work on creating a model for a specific manuscript or collection. It is also one
of the strongest arguments in favour of using the platform as it allows each individual
user to train models corresponding exactly to their requirements. In practice, this
means that if we have one handwriting style on more than 10,000 pages (e.g.
Lauček's collection), then we will train a model on 50 to 70 pages. After that,
Transkribus is able to automatically transcribe the remaining pages with good
accuracy and at least significantly facilitate text editing, modification, translation, full
text search etc.
1.6 Experiment
Automatic transcription of handwritten text is what historians, linguists, archivists,
librarians, documentalists and all others who come into contact with handwritten text
have dreamt of for decades [11]. Step by step, automatic transcription of manuscripts
is becoming a reality. In the background, there is massive international basic research
in artificial intelligence and thousands of hours of work.
Preliminary information about the work with the Transkribus platform was published
in a blog and a Facebook status. The huge interest in this work came as a surprise.
It is nevertheless understandable, because many historians, linguists, librarians,
educators and others are becoming more skilled in the use of new technologies in
their work and understand that innovation which facilitates their work is very important.
Transkribus, of course, is not a substitute for professional and scientific erudition of
historians and archivists. Automatic transcription is only one step in scientific work.
This is followed by historical research of the text and the context of the transcribed
texts and information, editing texts obtained by transcription, identification of entities,
keywords that are discovered in the text (dates, names of people, geographic
locations, corporations, etc.).
[11] In 1991, in cooperation with Ján Mišík, we made attempts to use a character recognition system for automatic
transcription of handwritten catalogue records from the old catalogue of the Slovak National Library (Matica slovenská). The
resulting efficiency of IRIS OCR transcription was about 35 to 40 %, and the transcription was unusable.
The goal of more extensive transcription using the cutting-edge Transkribus platform
is to make available unique collections of documents and archival units, usually
preserved in the archives in only one copy. That is the difference between the
holdings of libraries and archives. Archives hold unique, authentic original documents,
collections and archival units, while libraries hold titles of documents that often exist
in hundreds to thousands of copies.
After transcription of historical texts and manuscripts, digital content may be edited,
rendered, used and made accessible for use on a larger scale, including in public
information systems and services. In addition, the transcribed original text, for
example in Latin, Hungarian, German or another language, can be at least
approximately translated into another language automatically. This quite substantially
changes the nature of the work of archivists and historians.
Here are the results for those interested in this work [12].
1.7 What was the goal of the experiment?
The subject of the experiment was a collection of handwritten correspondence of
Andrej Kmeť, mostly in Slovak, kept in the Library of the Slovak National Museum
(SNM) in Martin, with the prior gracious consent of the director of the museum,
Dr. Mária Halmová [13]. The material comprised the letters of Andrej Kmeť
(SNM, Martin) from 1841-1908. Andrej Kmeť and his correspondence are the subject
of systematic research by Karol Hollý [14], [15], who also provides additional
resources relating to Kmeť's manuscript heritage.
For further experiments, materials from the archives of the Zay family, the Bučany
archives of the SNA and Lauček's collection (1500-1800) were scanned
(approximately 5,000 pages). In the future we plan to scan, complete virtually and
make available the entire collection of unpublished manuscripts of Martin Lauček.
Collectanea was used by historians who directly quoted or translated some parts of it.
Our goal was to scan, transcribe and allow translation of the whole collection, or at
least some selected parts. Overall, this includes 22 volumes and the estimated
quantity is more than 10,000 pages. It is an extremely valuable collection, especially
for the history of the Lutheran Church, but also for the history of our modern age.
Very meritorious work in processing, scanning, translation and publishing on the
history of the Lutheran Church in Slovakia has been done since 2004 by the
Association of Lutherans of the Augsburg Confession of the Považie District
(ZEAVPS) in Dolné Srnie and by its extremely active but modest collector and
organiser of the activities, the Association's chairman, Mgr. Pavel Černaj. P. Černaj
also collected information about Martin Lauček [16] and his manuscript collection
Collectanea, relying mainly on the basic book by Ján Ďurovič [17].
[12] The detailed instructions for working with the Transkribus platform are contained in good and available
documentation. This study contains only the essential information and knowledge of a particular experiment requiring
approximately 1,000 hours, because it was necessary to study the entire system and become familiar with the architecture and
documentation. Experience, know-how and expertise were gained, described herein only in general terms.
[13] The author of this paper wishes to thank archivist Mgr. Viera Varínska and librarian Mgr. Anna Peťová for permitting
the work in the library premises and for providing assistance. Thanks for finding the information about the circumstances of
Andrej Kmeť's life in Prenčov go to Ms. Oľga Kuchtová of Banská Štiavnica. Thanks for allowing the author to scan the
collection of Martin Lauček (Collectanea) in the premises of the Slovak National Archives in Bratislava are extended to the
Central Archives Administration of the Ministry of the Interior, and thanks for technical assistance go to Mgr. Eva Kowalská,
PhD. of the Historical Institute of the Slovak Academy of Sciences in Bratislava.
[14] HOLLÝ, Karol: Andrej Kmeť a slovenské národné hnutie: Sondy do života a kreovanie historickej pamäti do roku
1914. (Andrej Kmeť and Slovak national movement: probes into the lives and the creation of historical memory before
1914.) Bratislava : Veda Historický ústav SAV, 2015. 279 p. ISBN 978-80-224-1480-7
[15] HOLLÝ, Karol: Veda a slovenské národné hnutie : snahy o organizovanie a inštitucionalizovanie vedy v slovenskom
národnom hnutí v dokumentoch 1863-1898. (Science and Slovak national movement: the efforts to organise and institutionalise
science in the Slovak national movement in documents from 1863 to 1898.) Bratislava : Historický ústav SAV v Typoset Print s.r.o.,
2013.
Figure 1 An older handwritten letter by Andrej Kmeť
Figure 2 A handwritten letter by Andrej Kmeť
[16] Martin Lauček, služobník Slova Božieho Cirkvi augsburského vyznania v Skalici. (Martin Lauček, the servant of the
Word of God, Church of the Augsburg Confession in Skalica.) Centuria diplomatum et epistolarum Thurzonianarum. Sto
Turzovských listov. Diel 1. Ed. Pavel Černaj. Dolné Srnie : ZEAVPS, 2016. 78 p. ISBN 978-80-89486-13-7
[17] Ďurovič, Ján: Martin Lauček, tolerančný kňaz spisovateľ. (Martin Lauček, priest and writer of tolerance) Myjava
1933.
Figure 3 A sample Latin manuscript by Martin Lauček. Collectanea Vol. 18.
Figure 4 A sample of Lauček's manuscript referring to Juraj Thurzo
1.1.1.1 SCANNING
Scanning of approximately 3,500 pages took place between 23rd and 30th May 2018
in the Library of the SNM. The ScanTent (scanning tent) equipment and the DocScan
application were used for scanning.
This equipment was used for the purpose of verifying the entire Transkribus workflow,
including the ScanTent and DocScan equipment offered. It is well known that many
archives have already scanned some parts of their collections at more or less good
quality. The equipment selected in this case is useful where collections have not
been scanned yet. It is also known that ordinary researchers and users may not take
any archival materials away from study rooms, and taking amateur photographs
with a mobile phone or a camera is problematic for larger volumes
(thousands of pages).
Therefore, ScanTent and DocScan are a good, affordable and acceptable choice,
although with some practical issues (format, focus, quality). It should be noted,
however, that this is photography rather than scanning.
Figure 5 ScanTent
Five archival boxes were scanned. Some of the letters were on multiple pages, there
were also some incomplete pages, blank pages etc. One image can also contain
more pages of a handwritten document. In the scanning step, images are created
and not actual pages, unless a page is scanned individually. It is preferable to scan
sheets by pages, individually, because if a sheet is scanned as a double page, then
one will have to organise pages in the right order in the post-processing step.
The total scanning time was approximately 20 hours. Scanning was performed in
single-page mode, sheet by sheet, not in series mode (with automatic capture after a
page is turned), as the handwritten material is on separate sheets of different
formats. Part of the material comprises original letters, another part consists of
photocopies. The original letters in particular are often on brittle paper, which
requires some conservation and preservation measures. For business cards and
similar small formats, DocScan required the scanned object to be moved closer and
zoomed in on; this was resolved by placing a blank A4 sheet underneath the sheet.
Some sheets were damaged (a missing corner, damaged edges). In such cases, the
system reported "no page found". This was resolved by placing a white sheet as a
background under the scanned page and its missing parts, after which DocScan was
able to focus.
Some items from the first box needed re-scanning, because not enough attention
had been paid to focusing. The materials in the first box were used to gain experience
and to calibrate the procedure for the remaining boxes. DocScan focuses on a sheet's
surface in several spots, indicated by red and green markers. When the focus is
satisfactory, "OK" appears and the picture can be taken. A Samsung Galaxy 6 mobile
phone running the Android operating system was used. There were some issues with
transferring data from the Samsung (Android) device to a MacBook Air (macOS). In
the end, a Windows PC was used to load the images from the Samsung device.
Figure 6 Letters of Andrej Kmeť in an archival box
When scanning, it is possible to connect DocScan directly to the server and the
Transkribus platform (in Innsbruck or Rostock) and scan directly to the platform,
which provides experimental transcription of handwritten or printed text in Latin
or other scripts. This option was not used because of insufficient connectivity. Some
operations on the Transkribus platform required the use of such tools as Preview,
Adobe Acrobat, FileZilla and others.
The scanned digital content (images) was:
1. prepared for further processing in the DocScan software (content
identification, metadata),
2. recorded without modification on CD-ROM for use at the discretion of the SNM
and Archives management,
3. prepared for upload to the Transkribus platform and for further processing in
the Transkribus software. Loading, segmentation and transcription of the
handwritten text followed.
The digital content was divided according to the arrangement found in the archival
boxes. Five compact discs (CDs) were recorded and handed over to the manager of
the Ethnographic Museum of the Slovak National Museum in Martin, Dr. Mária
Halmová. The collection's custodians can now use and publish the digital content.
Further, a CD may be placed in each box. The custodians can then decide whether to
allow access to the CD or to work with the relatively brittle original paper sheets.
1.1.1.2 LOADING DIGITAL IMAGES AFTER SCANNING
The scanned images can be processed either locally or edited after being imported to
the remote Transkribus server. Before importing to the server and using the
Transkribus platform, one has to sign in, download the platform and create one's own
private collection, which is available only to the user who created it, unless the user
decides otherwise. A transcriber can grant students, operators or collaborators
access to certain operations, for example access to the collection for the
preparation of training samples, editing after transcription, etc. Automatic transcription
is carried out exclusively on the remote server using the Transkribus infrastructure.
Locally, it is possible to work with one's own documents and collections as needed.
Figure 7 Imported files (pages, owner, date, collection)
Before importing, it is necessary to create one's private collection and folder. A single
upload and import of images can be up to 500 megabytes in size. If the imported
images are larger in total, they can be divided and uploaded in multiple batches.
Larger image files can also be uploaded and imported using an FTP client, via URL
or DFG Viewer METS. Images can be imported as PDFs, JPGs, TIFFs and other
formats. The collection of images created by scanning the letters of Andrej Kmeť was
11.7 gigabytes in size at 300 dpi resolution. The research did not examine the effect
of scan resolution on the accuracy of the automatic conversion by Transkribus,
although, hypothetically, there may be a meaningful relationship.
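The batching described above can be sketched in Python as follows. This is only an illustration of dividing image files into groups that stay under the 500-megabyte limit mentioned in the text; the function names are hypothetical, decimal megabytes are assumed, and the code does not call any Transkribus interface.

```python
from typing import Iterable

# 500 MB per upload, as stated in the text (decimal megabytes are an assumption).
MAX_BATCH_BYTES = 500 * 1000 * 1000


def split_into_batches(files: Iterable[tuple[str, int]],
                       limit: int = MAX_BATCH_BYTES) -> list[list[str]]:
    """Greedily group (name, size_in_bytes) pairs into batches within the limit."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for name, size in files:
        if size > limit:
            raise ValueError(f"{name} alone exceeds the batch limit")
        # Start a new batch when adding this file would exceed the limit.
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


def minimum_batches(total_bytes: int, limit: int = MAX_BATCH_BYTES) -> int:
    """Lower bound on the number of batches (ceiling division)."""
    return -(-total_bytes // limit)


print(minimum_batches(11_700_000_000))  # the 11.7 GB Kmeť collection
```

For the 11.7-gigabyte collection this gives a lower bound of 24 uploads, which suggests why an FTP client may be the more practical route for collections of this size.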
Our experience shows that before importing it is advisable to check the digital
images for quality, sharpness, completeness, page orientation, and so on. After
some practice, the images were imported as PDFs.
1.1.1.3 SEGMENTATION
Once the files have been imported, an automated segmentation process must
be run on the server. For segmentation of text and images, the client
application must be connected to the server. Segmentation means that the image of
a handwritten textual document, which is still stored on the server as an image, is
automatically divided into blocks, areas and lines of text. Manual corrections can be
made as necessary. These involve, for example, merging and splitting blocks,
expanding the boundary of a segment, and the like.
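The block and line hierarchy produced by segmentation, and one of the manual corrections mentioned above (merging two blocks), can be pictured with the following Python sketch. All class and function names are hypothetical and chosen for illustration; they do not reproduce the internal data model of Transkribus, and simple rectangles stand in for whatever shapes the platform actually stores.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Axis-aligned bounding box in image pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int

    def union(self, other: "Box") -> "Box":
        """Smallest box that contains both boxes."""
        return Box(min(self.left, other.left), min(self.top, other.top),
                   max(self.right, other.right), max(self.bottom, other.bottom))


@dataclass
class TextLine:
    box: Box
    text: str = ""  # empty until the line has been transcribed


@dataclass
class TextRegion:
    box: Box
    lines: list[TextLine] = field(default_factory=list)


def merge_regions(a: TextRegion, b: TextRegion) -> TextRegion:
    """Merge two text blocks into one, keeping lines in top-to-bottom order."""
    lines = sorted(a.lines + b.lines, key=lambda line: line.box.top)
    return TextRegion(box=a.box.union(b.box), lines=lines)


upper = TextRegion(Box(0, 0, 100, 50),
                   [TextLine(Box(0, 0, 100, 20)), TextLine(Box(0, 25, 100, 45))])
lower = TextRegion(Box(0, 60, 120, 100), [TextLine(Box(0, 60, 120, 80))])
merged = merge_regions(lower, upper)
print(merged.box, len(merged.lines))
```

Sorting the lines by their top coordinate is a simplification; reading order on a real page with columns or marginal notes would need more care.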
Figure 8 A sample of segmented text from the Andrej Kmeť collection of letters with marked text block and lines of
text
1.1.1.4 HTR MACHINE TRAINING [18]
A sample set of pages is selected from the imported collection according to a
certain algorithm; it is then used for training and setting up a model for a
certain handwriting type. It is necessary to show the machine some correct
examples of text. The machine then learns the patterns of letters and words in
accordance with the training set.
If a collection of texts is written in more than one hand, it will be necessary to
select an appropriate size for the test sample. Page selection can also be performed
automatically so that a sample of about 20,000 words is prepared. The training
dataset is created directly in the Transkribus editor client, both locally and on the
server.
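One way to picture automatic page selection for a sample of about 20,000 words is the Python sketch below. It is an assumption-laden illustration, not the selection algorithm used by Transkribus: it supposes that an approximate word count per page is already known, and it simply draws pages at random until the target is reached. The function name and parameters are hypothetical.

```python
import random


def select_training_pages(word_counts: dict[str, int],
                          target_words: int = 20_000,
                          seed: int = 0) -> list[str]:
    """Pick pages at random until their combined word count reaches the target."""
    pages = sorted(word_counts)          # fixed starting order for reproducibility
    random.Random(seed).shuffle(pages)   # seeded shuffle, so the sample is repeatable
    selected: list[str] = []
    total = 0
    for page in pages:
        if total >= target_words:
            break
        selected.append(page)
        total += word_counts[page]
    return selected


# Hypothetical collection: 200 pages with roughly 250 words each.
counts = {f"page_{i:03d}": 250 for i in range(200)}
sample = select_training_pages(counts)
print(len(sample), sum(counts[p] for p in sample))
```

Random selection spreads the sample across the collection, which may matter when several hands or periods are represented; a real workflow might instead stratify by writer or archival box.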
Basically, the handwritten text has to be transcribed carefully and very accurately,
line by line, with no corrections. The text should be transcribed in the language used
at the time of its creation, including all grammatical errors, and in line with the
further instructions and manuals that are available for this operation.
The author and creator of the transcription model should determine the order of text
parts, tagging, selecting and editing keywords, descriptive metadata, and so on. The
outcome of transcription is then viewable and can be evaluated on a test dataset. If
the outcome is satisfactory, the remaining files or the entire collection can be
transcribed automatically.
[18] HTR = Handwritten Text Recognition. It covers the recognition of historical texts, letters, postcards, manuscripts and
medieval documents. The HTR engine comes from the Computational Intelligence Technology Lab (CITlab).
Figure 9 Example of a page edited for a training set
1.1.1.5 AUTOMATIC TRANSCRIPTION
Automatic transcription serves as the basis for scientific editing, in which the text can
be modified, corrected, proofread and explicitly enriched through additional data,
context data, deciphering, tagging, notes, metadata, annotations, correction of
diacritical marks, abbreviations, uppercase and lowercase letters, paleographical
processing, ligatures, etc.
Automatic transcription was carried out after a run of training and testing, using
a custom transcription model trained with HTR+.
Figure 10 A screen with data following automatic conversion using the internal model A_DUSAN_KMET.
The result of learning in the automatic transcription of Andrej Kmeť's handwritten
letters was a character error rate (CER) of 1.37 % on the training dataset and
1.76 % on the test dataset. The training set contained 29,411 words and 4,573 lines.
The model can be deployed on the entire collection.
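The character error rate reported above is conventionally computed as the edit (Levenshtein) distance between the automatic transcription and the correct transcription, divided by the length of the correct transcription. The following is a minimal sketch of that common definition. The function names and the sample strings are illustrative assumptions, and the exact normalisation used inside Transkribus may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER in percent: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Hypothetical line: one substituted and one missing character.
reference = "historical letter"
hypothesis = "histor1cal leter"
cer = character_error_rate(reference, hypothesis)
print(round(cer, 2))
```

Here two character edits against a 17-character reference give a CER of roughly 11.8 %; a CER of 1.76 % corresponds to fewer than two wrong characters per hundred.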
Figure 11 Demonstration of "raw" automatic transcription. Error rate is shown for each line (added by the author).
During Y3 of the project, the new HTR+ engine from the CITlab team was
implemented. It reduced character error rates dramatically, by 60-80 %. The
training process was also simplified and sped up by a factor of at least 5 by using
GPU servers for computation. The new HTR+ was announced and introduced to users
at the 2nd Transkribus User Conference in Vienna in November 2018.
Figure 12 Summary of trials and errors when working on the Transkribus platform (from 22.81 % error rate to 1.76
% error rate). Transcription efficiency has significantly improved by use of HTR+ as recommended by Professor
Muehlberger.
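The improvement summarised in Figure 12 can be expressed as a relative reduction of the error rate. The short sketch below only restates the two figures given in the caption; the helper name `relative_reduction` is an assumption. The result covers the whole sequence of trials, not the effect of HTR+ alone.

```python
def relative_reduction(before, after):
    """Relative decrease in percent when going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Error rates from Figure 12: first trial 22.81 %, final result 1.76 %.
reduction = relative_reduction(22.81, 1.76)
print(round(reduction, 1))
```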
1.8 The Future of Transkribus
The project ended on June 30th, 2019. Experts and institutions are interested in the
continuation and development of the Transkribus service. Currently (2019), there are
more than 20,000 Transkribus users. The plan is to continue the research and
implementation of the results in the NewsEye EU project
(https://www.newseye.eu/project/about/). The READ-COOP European Cooperative
Society (Societas Cooperativa Europaea, SCE) has been formed. As of July 1st, 2019,
the READ project has been succeeded by the READ European Cooperative Society
(SCE). READ-COOP will maintain and further develop the Transkribus platform and
the related services and tools
19
.
1.9 Possible Goals of Continuing Research at National Level
In further research it would be useful to focus on the following areas: a) the
selection and description of extensive manuscript collections of European and/or
national importance to be made available using the Transkribus platform, b)
expanding the project to the regional project on the Franciscan land register, stable
land register and definitive land register (who owned what in the 19th century), c)
training and analysis of patterns of transcription according to Modern Age collections
and languages (especially Slovak, Czech, Hungarian, Latin, German, Polish), d)
automatic transcription of a substantial part of Lauček's manuscript collection, e)
increasing the efficiency of recognition of handwritten texts through Transkribus and
related tools, f) making transcribed and interpreted collections available to the
general public through a digital repository.
Figure 13 Experimental automatic transcription of the page
19
We expect that thanks to the positive approach of the Matej Bel University and doc. Imrich Nagy, PhD. of the
Department of History, we will be able to continue this interesting innovative research work in the framework of a project.
Figure 14 Overview of automatic transcription efficiency and error rate
1.10 Conclusion: Efficiency of the Transkribus Platform
Our experience, verified by the experiment, confirms that individual handwritten
documents can be automatically transcribed at an error rate of 2 to 5 %, and
collective handwritten documents (collections) at an error rate of 6 to 10 %. The
results of transcription are readable and usable, and can be exported (DOC, TXT,
PDF, TEI, METS, etc.), edited, proofread and corrected. In the experiment, we
achieved an error rate (CER) of 1.76 %.
In terms of perception, understanding and use of transcribed text in general, the
authors of Transkribus hold that a) if the error rate is counted strictly by words, a
text with a word error rate of up to 30 % is still understandable and usable for
humans, b) if the error rate is counted strictly by characters, a text with a character
error rate of up to 15 % is still understandable and usable for humans.
In the experiment, the word error rate was 16.88 % (of an acceptable 30 %).
The character error rate was 5.89 % (of an acceptable 15 %).
The accuracy of transcription of words on the page under evaluation was 72.78 %.
The accuracy of characters transcribed from the same page was 90.52 %.
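The two thresholds above can be checked mechanically once word and character error rates are available. The sketch below assumes the common edit-distance definition of both rates, applied to word tokens for the word error rate and to single characters for the character error rate. The function names and the sample sentence are hypothetical; only the 30 % and 15 % limits and the values 16.88 % and 5.89 % come from the text.

```python
WER_LIMIT = 30.0  # percent, word-level limit cited in the text
CER_LIMIT = 15.0  # percent, character-level limit cited in the text


def levenshtein(ref, hyp):
    """Edit distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Error rate in percent relative to the length of the reference."""
    return 100.0 * levenshtein(ref_units, hyp_units) / len(ref_units)


def still_usable(wer, cer):
    """True if both rates stay within the limits cited in the text."""
    return wer <= WER_LIMIT and cer <= CER_LIMIT


# Hypothetical line with one misspelled word.
ref = "the letter was sent from the parish"
hyp = "the leter was sent from the parish"
wer = error_rate(ref.split(), hyp.split())  # counted by words
cer = error_rate(ref, hyp)                  # counted by characters
print(round(wer, 2), round(cer, 2), still_usable(wer, cer))

# Values reported for the experiment.
print(still_usable(16.88, 5.89))
```

A single wrong character makes the whole word count as an error, which is why the word error rate is normally several times higher than the character error rate for the same text.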
The Transkribus platform is an excellent tool for patient and conscientious scholars,
which can be very helpful when fine-tuning transcription. The platform is not, and
hardly ever will be, intended for "clickers", i.e. users who are accustomed to "clicking"
rather than innovating.