Conference Paper

Retro-digitizing and Automatically Structuring a Large Bibliography Collection

Authors: David Lindemann, Mohamed Khemakhem, Laurent Romary

Abstract

In this paper, we present a generic workflow for retro-digitizing and structuring large entry-based documents, using the 33,000 entries of the Internationale Bibliographie der Lexikographie by Herbert Ernst Wiegand, published in four volumes (Wiegand 2006-2014), as an example. The goal is to convert the large bibliography, at present available as a collection of images, to TEI-compliant XML, a structured format that enables enhanced interoperability and search functionalities (Lindemann, Kliche and Heid, 2018). Images of the printed publication are first processed with the Optical Character Recognition (OCR) tools that are part of the Transkribus software application (Mühlberger and Terbul, 2018), the output of which is used for creating manually validated Handwritten Text Recognition (HTR) training material. The retro-digitised output is then used to train dedicated machine learning models in GROBID-Dictionaries (Khemakhem, Foppiano and Romary, 2017), a tool for the automatic segmentation of entry-based text documents and their representation as TEI-compliant XML. Both Transkribus and GROBID-Dictionaries are tools freely available to the community. Preliminary results suggest that the proposed workflow yields good precision in retro-digitisation and segmentation.
Retro-digitizing and Automatically Structuring a Large Bibliography Collection
David Lindemann
Mohamed Khemakhem
Laurent Romary
9 December 2018
Introduction
Context
Project “LexBib”: A bibliography, e-science corpus, and domain ontology of Lexicography
Lindemann, Kliche & Heid 2018 (Proceedings of Euralex)
Bibliometrics: Citation Extraction > Citation Network
Topic Modeling, Term Extraction > Keyword Indexation
Goal: A workflow for digitization and information extraction
In the context of LexBib, many use cases
Digitization of PDFs without a (noise-free) text layer
Extraction of bibliographic references
First use case: Wiegand's large bibliography
For retro-digitization: TRANSKRIBUS
For bibliography structuring: GROBID/GROBID-dictionaries
Wiegand: ‘Internationale Bibliographie der Lexikographie’
33,000 items, 1850 to 2014
All standard item types + press articles; some with comments.
Available as book and as PDF
Wiegand: ‘Internationale Bibliographie der Lexikographie’
[Figure: facsimile page with sample bibliography entries; the extracted text layer is garbled (custom font encoding) and the entry text is not recoverable.]
Wiegand-Bib: Workflow in TRANSKRIBUS
PDF with noisy text layer → OCR → manual correction, HTR training → HTR → PDF with noise-free text layer
TRANSKRIBUS: https://transkribus.eu, Innsbruck, H2020 project “READ”
Free for academic use. Includes ABBYY OCR software and the CITlab HTR engine.
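For concreteness, the following is a minimal, hedged sketch of the PAGE XML shape in which Transkribus exports transcriptions; the file name, coordinates, and entry text are invented for illustration, and a real export carries additional metadata.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Abbreviated PAGE XML sketch: one recognised line in one text region.
     All concrete values (file name, coordinates, text) are invented;
     real Transkribus exports also include a Metadata section. -->
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="wiegand_vol1_p042.jpg" imageWidth="2480" imageHeight="3508">
    <TextRegion id="region_1">
      <Coords points="200,300 2280,300 2280,3200 200,3200"/>
      <TextLine id="line_1">
        <Coords points="200,300 2280,300 2280,360 200,360"/>
        <!-- HTR output for this line, after manual correction -->
        <TextEquiv>
          <Unicode>Mustermann, Max: Ein Beispieltitel. Berlin 1995.</Unicode>
        </TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>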
Wiegand-Bib: Information Extraction
The goal: TEI-XML (see the sketch below)
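A minimal sketch of the intended target structure, assuming a dictionary-style TEI body with one <entry> per bibliography item; the skeleton below is illustrative, not actual project output.

<!-- Sketch of the target TEI skeleton; illustrative only. -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <entry>
        <biblStruct><!-- parsed fields: author, title, imprint, ... --></biblStruct>
      </entry>
      <!-- one <entry> per item, ca. 33,000 in total -->
    </body>
  </text>
</TEI>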
GROBID: Cascaded CRF models, plugged together for this task
“Dictionary Segmentation” and “Dictionary Body Segmentation” models from GROBID-Dictionaries (https://github.com/MedKhem/grobid-dictionaries)
“Citation” model for entry (TEI <biblStruct>) segmentation from GROBID-core (https://github.com/kermitt2/grobid)
GROBID-dictionaries: Text body segmentation
Link to video file
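To illustrate this stage, here is a hedged sketch of what the intermediate output can look like once the body segmentation model has delimited the individual entries; the two entry strings are invented examples, not taken from the bibliography.

<!-- Sketch of intermediate output: entries delimited but not yet parsed;
     both entries are invented. -->
<body>
  <entry>Mustermann, Max: Ein Beispieltitel. Berlin 1995.</entry>
  <entry>Doe, Jane: Another Example Title. In: Lexicographica 12. 1996, 1-20.</entry>
</body>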
GROBID-core: Bibliography entry segmentation
Link to video file
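Schematically, the citation model then turns each raw entry string into a parsed TEI <biblStruct>. The sketch below parses the invented journal-article entry from the previous example; the exact element choices may differ in GROBID's actual output.

<!-- Sketch: one raw entry parsed into a <biblStruct>; invented content. -->
<biblStruct>
  <analytic>
    <author>Doe, Jane</author>
    <title level="a">Another Example Title</title>
  </analytic>
  <monogr>
    <title level="j">Lexicographica</title>
    <imprint>
      <biblScope unit="volume">12</biblScope>
      <date>1996</date>
      <biblScope unit="page" from="1" to="20"/>
    </imprint>
  </monogr>
</biblStruct>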
Discussion
TRANSKRIBUS
Free tool
OCR for automatically generating HTR training material
HTR training allows associating any glyph with UTF-8 characters
In this case, OCR did not recognize non-standard [^A-Za-z0-9] characters
Very useful for structure markers, like bullets, arrows, etc.
Absolutely generic, even for handwritten text
Generates PDF
Open issue: font styles
GROBID
Free and Open Source tool
Pluggable CRF models
Training & evaluation of models
Discussion
Results, so far
Manual effort in creating training material yielded very encouraging results
TRANSKRIBUS: 104 pages of training material > 99.5% precision, large character set (159 glyphs)
GROBID-dictionaries: 5 pages of training material > 100% precision for body segmentation
Outlook
Wiegand's bibliography
GROBID: evaluation and further training for entry segmentation
TEI-XML for import in LexBib database
Other use cases for TRANSKRIBUS
LexBib project: PDF text layer generation
Other use cases for GROBID / GROBID-Dictionaries
LexBib project: Citation extraction
Legacy Dictionaries
Thank you for your attention
David Lindemann,
Universität Hildesheim, UPV/EHU University of the Basque Country
Mohamed Khemakhem,
Inria ALMAnaCH, Centre Marc Bloch, Paris Diderot University
Laurent Romary,
Inria ALMAnaCH, Centre Marc Bloch, Berlin-Brandenburgische Akademie der Wissenschaften
See detailed bibliography in the proceedings
The research leading to these results has received funding from the Basque Government (Eusko Jaurlaritza, IT-665-13).
References

Khemakhem, M., Foppiano, L. and Romary, L. (2017) 'Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields', in Kosem, I. et al. (eds) Electronic lexicography in the 21st century: Lexicography from scratch. Proceedings of eLex 2017, 19-21 September 2017, Leiden, The Netherlands. Leiden: Lexical Computing. Available at: https://elex.link/elex2017/wp-content/uploads/2017/09/paper37.pdf.

Khemakhem, M., Herold, A. and Romary, L. (2018) 'Enhancing Usability for Automatically Structuring Digitised Dictionaries', in GLOBALEX workshop at LREC 2018, Miyazaki, Japan. Available at: https://hal.archives-ouvertes.fr/hal-01708137 (Accessed: 20 April 2018).

Mühlberger, G. and Terbul, T. (2018) 'Handschriftenerkennung für historische Schriften. Die Transkribus Plattform', b.i.t., 21(3), pp. 218-222.

Wiegand, H. E. (2006b) Internationale Bibliographie zur germanistischen Lexikographie und Wörterbuchforschung, Band 2: I-R. Berlin, Boston: De Gruyter. doi: 10.1515/9783110892918.

Wiegand, H. E. (2014) Internationale Bibliographie zur germanistischen Lexikographie und Wörterbuchforschung, Band 4: Nachträge. Berlin, Boston: De Gruyter. doi: 10.1515/9783110403145.