Chapter

An Interpretable Deep Learning Approach for Morphological Script Type Analysis

Article
Gille Levenson, M.: Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR). Journal of Data Mining and Digital Humanities (2023), Special Issue: Historical documents and automatic text recognition, eds. A. Pinche and P. Stokes. DOI: 10.5281/zenodo.7387376. Link to the data: https://doi.org/10.5281/zenodo.7386489. Accepted, pending final revisions.
Article
Optical character recognition (OCR) has proved a powerful tool for the digital analysis of printed historical documents. However, its ability to localize and identify individual glyphs is challenged by the tremendous variety in historical type design, the physicality of the printing process, and the state of conservation. We propose to mitigate these problems by a downstream fine-tuning step that corrects for pathological and undesirable extraction results. We implement this idea by using a joint energy-based model which classifies individual glyphs and simultaneously prunes potential out-of-distribution (OOD) samples like rubrications, initials, or ligatures. During model training, we introduce specific margins in the energy spectrum that aid this separation and explore the glyph distribution’s typical set to stabilize the optimization procedure. We observe strong classification at 0.972 AUPRC across 42 lower- and uppercase glyph types on a challenging digital reproduction of Johannes Balbus’ Catholicon, matching the performance of purely discriminative methods. At the same time, we achieve OOD detection rates of 0.989 AUPRC and 0.946 AUPRC for OOD ‘clutter’ and ‘ligatures’ which substantially improves upon recently proposed OOD detection techniques. The proposed approach can be easily integrated into the postprocessing phase of current OCR to aid reproduction and shape analysis research.
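The abstract above does not spell out its energy formulation; as a minimal illustrative sketch (not the authors' implementation), the standard free-energy OOD score derived from classifier logits can look like the following, with the threshold playing the role of a margin in the energy spectrum. All names here are my own:

```python
import numpy as np

def energy_score(logits):
    """Free energy E(x) = -log sum_k exp(logit_k); lower means more in-distribution."""
    m = np.max(logits)                      # stabilise the log-sum-exp
    return -(m + np.log(np.sum(np.exp(logits - m))))

def is_ood(logits, threshold):
    """Flag a sample (e.g. a rubrication or ligature crop) whose energy exceeds the margin."""
    return energy_score(logits) > threshold

# A confident glyph prediction has low energy; a flat, uncertain one has high energy.
confident = np.array([12.0, 0.5, 0.3, 0.1])
uncertain = np.array([0.2, 0.1, 0.3, 0.2])
assert energy_score(confident) < energy_score(uncertain)
```

In a real pipeline the threshold would be calibrated on held-out in-distribution glyphs rather than fixed by hand.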
Article
Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in Artificial Intelligence and Machine Learning have enabled analyses on a scale and in a detail that are reshaping the field of Humanities, similarly to how microscopes and telescopes have contributed to the realm of Science. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script and medium, spanning over three and a half millennia of civilisations around the ancient world. To analyse the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitisation, restoration, attribution, linguistic analysis, textual criticism, translation and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the Humanities and Machine Learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; third, flagging promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the Humanities and Machine Learning.
Article
The main goal of the present work is to determine the hand that wrote two newly discovered documents in Romania. To answer this question, the authors introduce the notion of an "Ideal Representative": an object that best represents the ideal alphabet symbol a writer had in mind when writing a document by hand. They further introduce a novel method for optimally estimating the Ideal Representative of any alphabet symbol in a given handwritten document, together with methods for comparing Ideal Representatives, so that the hand that wrote a document can be identified with high likelihood. The analysis shows that the two documents, discovered in Romania in 1998, were written by the great personality Rigas Feraios. The presented method of automatic handwriting identification appears to be of general applicability.
Article
This paper addresses the question of objective categories of medieval scripts and their elaboration through both medieval palaeography and image analysis. It introduces a dataset of 9800 images and metadata from the catalogues of dated manuscripts in France, as a ground truth and evaluation protocol, to be used for image feature analysis, taxonomy building, and clustering methods. It further compares the results of the categorization performed by two teams, one in Lyon (LIRIS/INSA, Frank Lebourgeois) and the other in Tel-Aviv (The Blavatnik School of Computer Science at Tel Aviv University, Lior Wolf). It also addresses the questions of taxonomy, interpretation and goals of the interdisciplinary research, such as development of expert systems or exploratory research.
Conference Paper
Convolutional neural networks (CNNs) have recently become the state-of-the-art tool for large-scale image classification. In this work we propose the use of activation features from CNNs as local descriptors for writer identification. A global descriptor is then formed by means of GMM supervector encoding, which is further improved by normalization with the KL-Kernel. We evaluate our method on two publicly available datasets: the ICDAR 2013 benchmark database and the CVL dataset. While we perform comparably to the state of the art on CVL, our proposed method yields about 0.21 absolute improvement in terms of mAP on the challenging bilingual ICDAR dataset.
Conference Paper
Biometric identification of persons has mainly been based on fingerprints, face, iris and other similar attributes. We propose a handwriting-based biometric identification system using a large database of Arabic handwritten documents. The system first extracts, from each handwritten sample, a set of features including run lengths, edge-hinge and edge-direction features. These features are used by a Multiclass SVM (Support Vector Machine) classifier. Experiments are conducted on a new large database of Arabic handwritings contributed by 1000 writers. The highest identification rate achieved by the combination of run-length and edge-hinge features stands at 84.10%.
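As an illustration of one of the feature families mentioned above, here is a minimal sketch of horizontal run-length extraction from a binarized image. The function name and binning are my own; the paper's exact feature computation, the edge-hinge features, and the SVM stage are not reproduced:

```python
def run_length_histogram(image, max_run=8):
    """Normalised histogram of horizontal ink run lengths in a binary image
    (1 = ink, 0 = background); runs longer than max_run share the last bin."""
    hist = [0] * max_run
    for row in image:
        run = 0
        for px in list(row) + [0]:      # sentinel 0 closes a trailing run
            if px:
                run += 1
            elif run:
                hist[min(run, max_run) - 1] += 1
                run = 0
    total = sum(hist) or 1
    return [h / total for h in hist]

page = [
    [1, 1, 0, 1, 0, 0, 1, 1, 1],   # runs of length 2, 1, 3
    [0, 1, 1, 1, 1, 0, 0, 0, 1],   # runs of length 4, 1
]
hist = run_length_histogram(page)
```

Vertical and diagonal run lengths are computed the same way on a transposed or sheared image, and the concatenated histograms feed the classifier.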
Article
This article shows how the System for Palaeographic Inspections (SPI) software suite developed at the University of Pisa can be used to assist palaeographers in their attempts to classify and identify medieval scripts. Working with a small corpus of Tuscan manuscripts from the tenth through twelfth centuries, now owned by the Biblioteca Comunale degli Intronati in Siena, the article shows how the software can be used to characterise the "calligraphic ideal" for each script in a given manuscript, compare letterforms in different scribes' work, and define relationships among individual scripts and manuscripts. The article concludes with a discussion of potential improvements for the SPI system.
Conference Paper
This paper presents a method for extracting rotation-invariant features from images of handwriting samples that can be used to perform writer identification. The proposed features are based on the Hinge feature [1], extended by incorporating derivatives between several points along the ink contours. Finally, we concatenate the proposed features into one feature vector to characterize the writing style of a given handwritten text. The proposed method has been evaluated on the Firemaker and IAM datasets for writer identification, showing promising performance gains.
Conference Paper
Codebook-based representations have been effectively employed for writer identification. Most of the codebook-based methods generate a codebook by clustering a set of patterns extracted from an independent data set. The probability of occurrence of the codebook patterns in a given writing is then used to characterize its author. This study investigates the hypothesis that the codebook is merely a representation space and the codebook patterns themselves do not affect the writer identification performance. The idea is validated by first using codebooks in different scripts from those of writings in question and then by using a synthetically generated codebook. A number of data sets with handwritten samples in Arabic, French, English, German, Urdu and Greek are considered in our series of evaluations. Experiments conducted with different codebooks report interesting results which validate the ideas put forward in this study.
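The codebook pipeline described above can be sketched in a few lines: assign each extracted fragment to its nearest codebook pattern and compare the resulting occurrence histograms. The Euclidean assignment and chi-squared distance below are illustrative choices, not necessarily those of the study:

```python
import numpy as np

def codebook_histogram(fragments, codebook):
    """Probability of occurrence of each codebook pattern in a writing sample:
    every fragment votes for its nearest codebook entry (Euclidean distance)."""
    hist = np.zeros(len(codebook))
    for frag in fragments:
        hist[np.argmin(np.linalg.norm(codebook - frag, axis=1))] += 1
    return hist / max(hist.sum(), 1)

def chi2_distance(h1, h2, eps=1e-9):
    """Chi-squared distance between two occurrence histograms."""
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])            # two toy 2-D "patterns"
fragments = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]])
hist = codebook_histogram(fragments, codebook)           # -> [1/3, 2/3]
```

Note that this encoding only records *which* bin a fragment falls into, which is exactly why the study's hypothesis (that the codebook is merely a representation space) is plausible: any codebook inducing a similar partition yields similar histograms.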
Article
In this paper, a new technique for offline writer identification is presented, using connected-component contours (COCOCOs or CO3s) in uppercase handwritten samples. In our model, the writer is considered to be characterized by a stochastic pattern generator, producing a family of connected components for the uppercase character set. Using a codebook of CO3s from an independent training set of 100 writers, the probability-density function (PDF) of CO3s was computed for an independent test set containing 150 unseen writers. Results revealed a high sensitivity of the CO3 PDF for identifying individual writers on the basis of a single sentence of uppercase characters. The proposed automatic approach bridges the gap between image-statistics approaches on one end and manually measured allograph features of individual characters on the other end. Combining the CO3 PDF with an independent edge-based orientation and curvature PDF yielded very high correct identification rates.
Article
If current OCR engineering trends continue, then, we believe, "general-purpose" systems (fully automatic and non-retargetable systems) will leave many potential users unsatisfied, and lucrative application niches unfilled, for years to come. However, for users who care enough to volunteer some manual effort to help customize the system to their document(s), significantly higher accuracy may be achievable without delay. We discuss in detail two state-of-the-art document recognition systems, Lucent Technologies' Table Reader System (TRS) and Xerox's "document image decoding" (DID) research prototype, which yield high accuracy by relying on explicitly stated models of properties of the target document, whether iconic (known typefaces and image degradations), geometric (restricted classes of layouts), or symbolic (linguistic and pragmatic contextual constraints). How great are the performance advantages that can be realized by sacrificing automation in the...
Article
There are two types of information in each handwritten word image: explicit information which can be easily read or derived directly, such as lexical content or word length, and implicit attributes such as the author's identity. Whether features learned by a neural network for one task can be used for another task remains an open question. In this paper, we present a deep adaptive learning method for writer identification based on single-word images using multi-task learning. An auxiliary task is added to the training process to enforce the emergence of reusable features. Our proposed method transfers the benefits of the learned features of a convolutional neural network from an auxiliary task such as explicit content recognition to the main task of writer identification in a single procedure. Specifically, we propose a new adaptive convolutional layer to exploit the learned deep features. A multi-task neural network with one or several adaptive convolutional layers is trained end-to-end, to exploit robust generic features for a specific main task, i.e., writer identification. Three auxiliary tasks, corresponding to three explicit attributes of handwritten word images (lexical content, word length and character attributes), are evaluated. Experimental results on two benchmark datasets show that the proposed deep adaptive learning method can improve the performance of writer identification based on single-word images, compared to non-adaptive and simple linear-adaptive approaches.
Book
First submission on Sept. 21st, 2012; peer reviews and comments communicated on June 3rd, 2013; submission of revised version on Nov. 22nd, 2013; copy editing communicated on Oct. 20th, 2017; revised version submitted on Nov. 3rd, 2017.
Article
This paper presents a texture based approach for identification of writers from offline images of handwriting. Contrary to the classical texture based techniques which extract texture information at page or block level, we exploit the texture at a very small observation scale. The proposed technique divides a given handwriting into small fragments and considers each fragment as a texture. Texture descriptors including histograms of Local Binary Patterns (LBP), Local Ternary Patterns (LTP) and Local Phase Quantization (LPQ) are then computed from these fragments. The writer of a document is characterized by the set of histograms calculated from all the fragments in the writing. Two writings are compared by computing the distance between the descriptors of their writing fragments. The technique evaluated on IFN/ENIT and IAM databases comprising handwritten text in Arabic and English, respectively, realized high identification rates.
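Of the three descriptors named above, LBP is the simplest to sketch. The following is a minimal 8-neighbour LBP histogram over one grayscale fragment (my own simplification: ties count as 1, and LTP and LPQ are omitted):

```python
import numpy as np

def lbp_histogram(patch):
    """256-bin histogram of 8-neighbour Local Binary Pattern codes: each interior
    pixel is encoded by thresholding its 8 neighbours against the centre value."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = patch.shape
    hist = np.zeros(256)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            code = 0
            for bit, (di, dj) in enumerate(offsets):
                if patch[i + di, j + dj] >= patch[i, j]:
                    code |= 1 << bit
            hist[code] += 1
    return hist / max(hist.sum(), 1)

# On a perfectly flat fragment every neighbour ties with the centre,
# so all interior pixels map to code 255.
flat = np.full((4, 4), 7)
assert lbp_histogram(flat)[255] == 1.0
```

In the fragment-based scheme of the paper, one such histogram is computed per small fragment, and a writer is represented by the whole set of fragment histograms rather than one page-level descriptor.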
Article
In this paper, we propose a novel junction detection method in handwritten images, which uses the stroke-length distribution in every direction around a reference point inside the ink of texts. Our proposed junction detection method is simple and efficient, and yields a junction feature in a natural manner, which can be considered as a local descriptor. We apply our proposed junction detector to writer identification in two ways: direct junction matching and by Junclets which is a codebook-based representation trained from the detected junctions. A new challenging data set which contains multiple scripts (English and Chinese) written by the same writers is introduced to evaluate the performance of the proposed junctions for writer identification. Furthermore, two other common data sets are used to evaluate our junction-based descriptor. Experimental results show that our proposed junction detector is stable under rotation and scale changes, and the performance of writer identification indicates that junctions are important atomic elements to characterize the writing styles. The proposed junction detector is applicable to both historical documents and modern handwritings, and can be used as well for junction retrieval. Keywords: Handwriting recognition, junction detection, writer identification, Junclets
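A simplified version of the directional stroke-length measurement underlying the junction detector can be sketched as follows (8 discrete directions over a binary ink image; the actual detector's distribution-based criteria are not reproduced):

```python
def stroke_lengths(image, y, x):
    """Ink run length from (y, x) in each of 8 discrete directions, walking
    until background (0) or the image border is reached."""
    steps = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]
    lengths = []
    for dy, dx in steps:
        n, cy, cx = 0, y + dy, x + dx
        while 0 <= cy < len(image) and 0 <= cx < len(image[0]) and image[cy][cx]:
            n += 1
            cy, cx = cy + dy, cx + dx
        lengths.append(n)
    return lengths

# At the centre of a plus-shaped crossing, ink extends in exactly 4 directions,
# which is the kind of directional profile a junction detector looks for.
cross = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
assert sum(l > 0 for l in stroke_lengths(cross, 1, 1)) == 4
```

The resulting length-per-direction vector is what can serve as a local descriptor: codebooks of such vectors (the paper's "Junclets") then follow the usual codebook-histogram pipeline.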
Article
This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work relies on the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. These patterns are then processed to group similar patterns together in single clusters so that different letterforms of a book can be extracted and analysed to compute redundancy rates. This process allows a significant reduction of the number of letterforms to be recognized. Once the clustering of letterforms is done, a user may assign a label to each cluster using the second software, RETRO. Labels are then automatically assigned to each corresponding character to perform the text transcription of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical fonts recognition methods to improve the recognition results of OCRs on rare or specific books. 
The identification of typographic materials could also be useful for the study of both the aesthetic (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and economic aspects of printing historically. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.
Article
The Cairo Genizah is a collection containing approximately 250,000 hand-written fragments of mainly Jewish texts discovered in the late 19th century. The fragments are today spread out in some 75 libraries and private collections worldwide, and there is an ongoing effort to document and catalogue all extant fragments. Paleographic information plays a key role in the study of the Genizah collection. Script style, and – more specifically – handwriting, can be used to identify fragments that might originate from the same original work. Such matched fragments, commonly referred to as "joins", are currently identified manually by experts, and presumably only a small fraction of existing joins have been discovered to date. In this work, we show that automatic handwriting matching functions, obtained from non-specific features using a corpus of writing samples, can perform this task quite reliably. In addition, we explore the problem of grouping various Genizah document by script style, without being provided any prior information about the relevant styles. The results show that the automatically obtained grouping agrees, for the most part, with the paleographic taxonomy. In cases where the system fails, it is due to apparent similarities between related scripts.
Article
Manuscript transcription in the Cotton Nero A.x. Project is at a graphetic level and captures each distinguishable glyph used by the scribe. When the transcription is organized as a series of XML entities within a codicological DTD, a search-and-count algorithm can be applied to the database of graphetic information. Initial statistical analysis of the data reveals dramatic changes in the scribe's writing system at two points in the manuscript that are roughly coincident with quire boundaries (and also textual boundaries). Hypotheses that will guide further investigation of this phenomenon include the possibility that substantial gaps of time separated the scribe's work in copying the four Middle English poems that make up the manuscript.
Article
We propose an effective method for automatic writer recognition from unconstrained handwritten text images. Our method relies on two different aspects of writing: the presence of redundant patterns in the writing and its visual attributes. Analyzing small writing fragments, we seek to extract the patterns that an individual employs frequently as he writes. We also exploit two important visual attributes of writing, orientation and curvature, by computing a set of features from writing samples at different levels of observation. Finally we combine the two facets of handwriting to characterize the writer of a handwritten sample. The proposed methodology evaluated on two different data sets exhibits promising results on writer identification and verification.
Article
Recent advances in 'off-line' writer identification allow for new applications in handwritten text retrieval from archives of scanned historical documents. This paper describes new algorithms for forensic or historical writer identification, using the contours of fragmented connected components in free-style handwriting. The writer is considered to be characterized by a stochastic pattern generator, producing a family of character fragments (fraglets). Using a codebook of such fraglets from an independent training set, the probability distribution of fraglet contours was computed for an independent test set. Results revealed a high sensitivity of the fraglet histogram in identifying individual writers on the basis of a paragraph of text. Large-scale experiments on the optimal size of Kohonen maps of fraglet contours were performed, showing usable classification rates within a non-critical range of Kohonen map dimensions. The proposed automatic approach bridges the gap between image-statistics approaches and purely knowledge-based manual character-based methods.
Article
In this paper we introduce the numerical tools that have been developed in the context of the Graphem project, in order to automate or leverage several steps in the study of medieval writing samples. We first describe various kinds of features that have been extracted from the samples, and then present two graphical tools to compare writing samples according to the features that have been extracted.
Article
This thesis develops analysis methodologies for describing and comparing ancient handwritten scripts: global methods that require no segmentation. It proposes new robust descriptors based on second-order statistics, the essential contribution resting on the notion of generalised co-occurrence, which measures the joint probability distribution of information extracted from the images. This is an extension of the grey-level co-occurrence used until now to characterise textures, and it allowed us to design several kinds of co-occurrences: spatial ones, relating to the local orientations and curvatures of shapes, and parametric ones, which measure the evolution of an image undergoing successive transformations. Since the number of descriptors obtained is very (indeed too) high, we propose reduction methods built on the most recent techniques of multidimensional statistical analysis. These approaches led us to introduce the notion of eigen co-occurrence matrices, which contain the essential information needed to describe images finely with a reduced number of descriptors. In the applied part we propose unsupervised classification methods for medieval scripts; the number of groups and their contents depend on the parameters used and the methods applied. We also developed a search engine for similar scripts. Within the ANR-MCD Graphem project, we devised methods for analysing and tracking the evolution of scripts through the Middle Ages.
Conference Paper
This paper presents our first contribution to the discrimination of medieval manuscript texts, intended to assist palaeographers in dating ancient manuscripts. Our method is based on spatial grey-level dependence (SGLD), which measures the joint probability between grey-level values of pixel pairs for each displacement. We use Haralick features to characterise 15 Latin medieval text styles and then 7 Arabic styles. The achieved discrimination results are between 50% and 81% for the medieval Latin styles, and up to 100% for the Arabic ones.
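SGLD is better known as the grey-level co-occurrence matrix. A minimal sketch of the matrix for one displacement, plus the Haralick contrast feature, might look like this (illustrative only; a real setup would quantise to more grey levels and aggregate several displacements):

```python
import numpy as np

def cooccurrence(img, dy, dx, levels):
    """Normalised grey-level co-occurrence matrix for displacement (dy, dx):
    joint probability of grey levels (a, b) occurring at that offset."""
    m = np.zeros((levels, levels))
    h, w = img.shape
    for i in range(max(0, -dy), min(h, h - dy)):
        for j in range(max(0, -dx), min(w, w - dx)):
            m[img[i, j], img[i + dy, j + dx]] += 1
    return m / max(m.sum(), 1)

def haralick_contrast(p):
    """Haralick contrast: sum over (a, b) of (a - b)^2 * p(a, b)."""
    idx = np.arange(p.shape[0])
    return float(np.sum((idx[:, None] - idx[None, :]) ** 2 * p))

# Alternating columns at displacement (0, 1) always pair level 0 with level 1,
# giving the maximum contrast of 1 for two grey levels.
stripes = np.array([[0, 1, 0, 1],
                    [0, 1, 0, 1]])
assert abs(haralick_contrast(cooccurrence(stripes, 0, 1, 2)) - 1.0) < 1e-9
```

Other Haralick features (energy, homogeneity, correlation, entropy) are computed from the same matrix, which is what makes a single pass over the displacements yield a whole feature vector per text style.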
Article
To maintain OCR accuracy with decreasing quality of page image composition, production, and digitization, it is essential to tune the system to each document. We propose a prototype extraction method for document-specific OCR systems. The method automatically generates training samples from unsegmented text images and the corresponding transcripts. It is tolerant of transcription errors, so a transcript produced automatically by an imperfect omnifont OCR system can be used. The method is based on new algorithms for estimating character widths, character locations in a word, and match/nonmatch probabilities from unsegmented text. An experimental word recognition system is designed and developed to combine prototype extraction algorithms and segmentation-free word recognition. The system can adapt itself to different page images and achieve high recognition accuracy on heavily degraded print.
Article
An approach to supervised training of character templates from page images and unaligned transcriptions is proposed. The template training problem is formulated as one of constrained maximum likelihood parameter estimation within the document image decoding framework. This leads to a three-phase iterative training algorithm consisting of transcription alignment, aligned template estimation (ATE), and channel estimation steps. The maximum likelihood ATE problem is shown to be NP-complete and, thus, an approximate solution approach is developed. An evaluation of the training procedure in a document-specific decoding task, using the University of Washington UW-II database of scanned technical journal articles, is described.
Article
We describe an automated script identification system for typeset document images. Templates for each script are created by clustering textual symbols from a training set. Symbols from new images are compared to the templates to find the best script. Our current system processes thirteen scripts with minimal preprocessing and high accuracy.
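A bare-bones version of the matching step just described could look like this: score each symbol against its nearest template per script and pick the script with the lowest mean distance. The nearest-template rule with Euclidean distance is an assumption for illustration; the clustering that builds the templates is omitted:

```python
import numpy as np

def best_script(symbols, templates_by_script):
    """Assign the script whose templates best match the page's symbols: each
    symbol scores against its nearest template, and the script with the
    lowest mean distance wins."""
    scores = {}
    for script, templates in templates_by_script.items():
        t = np.asarray(templates, dtype=float)
        dists = [np.min(np.linalg.norm(t - s, axis=1)) for s in symbols]
        scores[script] = float(np.mean(dists))
    return min(scores, key=scores.get)

# Toy 2-D "symbol features"; real templates would be clustered glyph bitmaps.
templates = {"latin": [[0.0, 0.0], [1.0, 0.0]],
             "greek": [[5.0, 5.0], [6.0, 5.0]]}
page_symbols = np.array([[0.1, 0.0], [0.9, 0.1]])
assert best_script(page_symbols, templates) == "latin"
```

Averaging over many symbols is what makes the decision robust: a few atypical glyphs cannot flip the page-level vote.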
HTRomance, Medieval Italian corpus of ground-truth for Handwritten Text Recognition and Layout Segmentation
Alba, R., Rubin, G., Boschetti, F., Fischer, F., Clérice, T., Chagué, A.: HTRomance, Medieval Italian corpus of ground-truth for Handwritten Text Recognition and Layout Segmentation [dataset] (2023). https://doi.org/10.5281/zenodo.8272751, https://github.com/HTRomance-Project/medieval-italian, v1.0.1
Lineamenti di Storia della scrittura latina: dalle lezioni di Paleografia (Bologna a.a. 1953-54)
Cencetti, G.: Lineamenti di Storia della scrittura latina: dalle lezioni di Paleografia (Bologna a.a. 1953-54). Guerrini Ferri, G., Bologna (1997)
The palaeography of Gothic manuscript books: From the twelfth to the early sixteenth century
Derolez, A.: The palaeography of Gothic manuscript books: From the twelfth to the early sixteenth century. Cambridge University Press (2003)
HTRomance, Medieval Spain corpus of ground-truth for Handwritten Text Recognition and Layout Segmentation
Bordier, J., Gille Levenson, M., Brisville-Fertin, O., Clérice, T., Chagué, A.: HTRomance, Medieval Spain corpus of ground-truth for Handwritten Text Recognition and Layout Segmentation [dataset] (2023), https://github.com/HTRomance-Project/middle-ages-in-spain, v0.0.6
Choco-Mufin, a tool for controlling characters used in OCR and HTR projects
Clérice, T., Pinche, A.: Choco-Mufin, a tool for controlling characters used in OCR and HTR projects (Sep 2021). https://doi.org/10.5281/zenodo.5356154, https://github.com/PonteIneptique/choco-mufin