Thibault Clérice’s research while affiliated with Centre Jean Perrin and other places


Publications (18)


Diachronic Document Dataset for Semantic Layout Analysis
  • Preprint

November 2024 · 2 Reads

Thibault Clérice · Juliette Janes · Hugo Scheithauer · [...]

We present a novel, open-access dataset designed for semantic layout analysis, built to support document recreation workflows through mapping with the Text Encoding Initiative (TEI) standard. This dataset includes 7,254 annotated pages spanning a large temporal range (1600-2024) of digitised and born-digital materials across diverse document types (magazines, papers from sciences and humanities, PhD theses, monographs, plays, administrative reports, etc.) sorted into modular subsets. By incorporating content from different periods and genres, it addresses varying layout complexities and historical changes in document structure. The modular design allows domain-specific configurations. We evaluate object detection models on this dataset, examining the impact of input size and subset-based training. Results show that a 1280-pixel input size for YOLO is optimal and that training on subsets generally benefits from incorporating them into a generic model rather than fine-tuning pre-trained weights.
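The evaluation above relies on YOLO-style object detection, which expects annotations as normalised (x-center, y-center, width, height) tuples rather than the pixel-corner boxes typical of layout ground truth. As a minimal, hypothetical sketch of that conversion step (not the dataset's actual tooling):

```python
def to_yolo_bbox(x0, y0, x1, y1, page_w, page_h):
    """Convert absolute pixel corners (x0, y0, x1, y1) of a zone
    to YOLO's normalised (x_center, y_center, width, height)."""
    xc = (x0 + x1) / 2 / page_w
    yc = (y0 + y1) / 2 / page_h
    w = (x1 - x0) / page_w
    h = (y1 - y0) / page_h
    return xc, yc, w, h
```

Any zone expressed in pixel coordinates (e.g. derived from a TEI-encoded facsimile) could be mapped this way before training a detector.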



Molyé: A Corpus-based Approach to Language Contact in Colonial France
  • Preprint
  • File available

August 2024 · 21 Reads

Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.



Synthetic Lines from Historical Manuscripts: An Experiment Using GAN and Style Transfer

January 2024 · 8 Reads · 5 Citations

Lecture Notes in Computer Science

Given enough data of sufficient quality, HTR systems can achieve high accuracy, regardless of language, script or medium. Despite the growing pooling of datasets, the question of the required quantity of training material remains crucial for the transfer of models to out-of-domain documents, or for the recognition of new scripts and under-resourced character classes. We propose a new data augmentation strategy using generative adversarial networks (GAN). Inspired by synthetic line generation for printed documents, our objective is to generate handwritten lines in order to massively produce data for a given style or under-resourced character class. Our approach, based on a variant of ScrabbleGAN, demonstrates its feasibility for various scripts, whether in the presence of a high number and variety of abbreviations (Latin), of spellings or letter forms (Medieval French), in a situation of data scarcity (Armenian), or for a very cursive script (Arabic Maghribi). We then study the impact of synthetic line generation on HTR by evaluating the gain for out-of-domain documents and under-resourced classes.



You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine

December 2023 · 10 Reads · 9 Citations

Journal of Data Mining & Digital Humanities

Layout Analysis (the identification of zones and their classification) is, along with line segmentation, the first step in Optical Character Recognition and similar tasks. The ability to distinguish the main body of text from marginal text or running titles makes the difference between extracting the full text of a digitized book and producing noisy output. We show that most segmenters focus on pixel classification, and that polygonization of this output has not been used as a target in the latest competitions on historical documents (ICDAR 2017 and onwards), despite being the focus in the early 2010s. We propose to shift, for efficiency, the task from pixel-classification-based polygonization to object detection using isothetic rectangles. We compare the output of Kraken and YOLOv5 in terms of segmentation and show that the latter severely outperforms the former on small datasets (1,110 samples and below). We release two datasets for training and evaluation on historical documents, as well as a new package, YALTAi, which injects YOLOv5 into the segmentation pipeline of Kraken 4.1.
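An isothetic rectangle is simply the tightest axis-aligned box around a region, which is what an object detector predicts in place of a free-form polygon. A hypothetical one-liner (an illustration, not YALTAi's implementation) deriving one from a segmentation polygon:

```python
def isothetic_rectangle(polygon):
    """Tightest axis-aligned (isothetic) bounding rectangle of a
    polygon given as a list of (x, y) points: (x_min, y_min, x_max, y_max)."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)
```

The trade-off is that curved or rotated zones lose their exact outline, but detection becomes a much cheaper regression problem than per-pixel classification.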


[Figures: excerpt and transcription example from Marie de Gournay, Egalité, 1622; distribution of the prints in the training corpus per decade; confusion tables for the best Calamari models (in-domain test on 17th c. prints; out-of-domain tests on 16th and 18th c. prints)]
OCR17: Ground Truth and Models for 17th c. French Prints (and hopefully more)

June 2023 · 42 Reads · 4 Citations

Journal of Data Mining & Digital Humanities

Machine learning begins with machine teaching: in this paper, we present the data that we have prepared to kick-start the training of reliable OCR models for 17th-century prints written in French. The construction of a representative corpus is a major challenge: we need to gather documents from different decades and of different genres to cover as many sizes, weights and styles as possible. Because historical prints contain glyphs and typefaces that have since disappeared, transcription is a complex act, for which we present guidelines. Finally, we provide preliminary results based on these training data, along with experiments to improve them.


[Figures: examples of contraction use of superscript letters (manuscripts BIS 193, CML 13027, Montpellier H-318, Vat. Pal. lat. 373); examples from the CML 13027 manuscript; examples from Latin 16195, Phi. 10 a. 135, BIS 193, CML 13027, Egerton 821 and Latin 6395; snippet of Arabic numerals from BnF, lat. 15461, fol. 13r for comparison]
CREMMA Medii Aevi: Literary Manuscript Text Recognition in Latin

April 2023 · 88 Reads · 3 Citations

Journal of Open Humanities Data

This paper presents a novel segmentation and handwritten text recognition dataset for Medieval Latin from the 11th to the 16th century. It connects with Medieval French datasets, as well as earlier Latin datasets, by enforcing common guidelines, bringing 263,000 new characters and now totaling over a million characters for medieval manuscripts in both languages. We provide our own addition to Ariane Pinche’s Old French guidelines to deal with specific Latin cases. We also offer an overview of how we addressed this dataset compilation through the use of pre-existing resources. With a higher abbreviation ratio and a better representation of abbreviating marks, we offer new models that outperform the Old French base model on Latin datasets, improving accuracy by 5% on unknown Latin manuscripts.
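Accuracy gains like the 5% reported above are conventionally measured through character error rate (CER): the edit distance between the reference transcription and the model output, divided by the reference length. A self-contained sketch (an illustration, not the paper's evaluation code):

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance between the reference
    transcription and the HTR output, divided by reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / m if m else 0.0
```

On abbreviation-heavy medieval material, a single mis-read abbreviating mark can expand into several character errors once resolved, which is why guidelines for encoding such marks matter so much for this metric.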


Artificial colorization of digitized microfilms: a preliminary study

April 2023 · 42 Reads · 1 Citation

Journal of Data Mining & Digital Humanities

Many digitized manuscripts available online are actually digitized microfilms, a technology dating back to the 1930s. With the progress of artificial colorization, we hypothesize that microfilms could be colorized with these recent technologies, testing InstColorization. We train a model over an ad hoc dataset of 18,788 color images that are artificially gray-scaled for this purpose. With promising results in terms of colorization but clear limitations due to the difference between artificially gray-scaled images and "naturally" gray-scaled microfilms, we evaluate the impact of this artificial colorization on two downstream tasks using Kraken: layout analysis and text recognition. Unfortunately, the results show little to no improvement, which limits the interest of artificial colorization of manuscripts in the computer vision domain.
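Artificially gray-scaling a color image usually means collapsing each RGB pixel to a luminance value; a common recipe (assumed here as an illustration, not necessarily the paper's exact preprocessing) uses the ITU-R BT.601 weights:

```python
def grayscale(pixel):
    """Collapse one (R, G, B) pixel to a single luminance value
    using the ITU-R BT.601 weights (0.299, 0.587, 0.114)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)
```

The gap the abstract points to is precisely that real microfilm response curves differ from any such fixed linear weighting, so a colorizer trained on synthetic gray-scale pairs sees a distribution it never encounters at test time.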


Citations (7)


... Handwritten word recognition is an open problem in the field of pattern recognition, since the writing styles in handwritten documents are highly diverse and it is also a complex problem [1]. A lot of work has been done in this field for different languages such as Latin [1,2], Hindi [3], Norwegian [4] and Persian/Arabic [5]. Handwritten word recognition in handwritten documents in English [1,2], Chinese [2,6], and Japanese [2,6] has been done with a good accuracy. ...

Reference:

A novel word recognition system in Persian/Arabic handwritten words using stacking ensemble classifier of deep learning
CATMuS Medieval: A Multilingual Large-Scale Cross-Century Dataset in Latin Script for Handwritten Text Recognition and Beyond
  • Citing Chapter
  • September 2024

... Synthetic data generation can be a way to solve this problem. Many approaches, the majority of which are based on generative adversarial networks (GANs) [7], have been proposed for synthetic data generation in multiple domains: semantic segmentation [8,9], handwritten text recognition, for both contemporary documents [10,11] and historical documents [12,13], as well as scene text detection and recognition [14,15]. ...

Synthetic Lines from Historical Manuscripts: An Experiment Using GAN and Style Transfer
  • Citing Chapter
  • January 2024

Lecture Notes in Computer Science

... Its latest release, YOLOv11 [15], has been released in October 2024. This high throughput and accuracy added to the bounding-box compatible nature of most layouts have led, across fields ranging from DH to CV, to multiple benchmarks for various document layout analyses, including generic [7], domain-specific [17], and partial [4] applications. These models demonstrate superior adaptability to smaller datasets compared to R-CNN and mixed transformer approaches [14]. ...

You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine
  • Citing Article
  • December 2023

Journal of Data Mining & Digital Humanities

... In order to produce different OCR transcriptions, we used two freely available OCR systems. First we used Kraken, a tool which is currently being developed and proposes convenient interfaces for DH researchers [15] and easily makes it possible to fine-tune for particular languages [27] or periods [8]. On the other hand, Tesseract [23] is a tool which can also be fine-tuned for particular tasks but still shows very good performance with its default configuration [6]. ...

OCR17: Ground Truth and Models for 17th c. French Prints (and hopefully more)

Journal of Data Mining & Digital Humanities

... The concept of "stylistic fingerprints", comparable to an author's DNA, was pointed out by Gogołek (2006), with roots tracing back to the early 20th century when Markov successfully confirmed the authorship of Eugene Onegin (Sękiewicz, 2012). Stylometric analysis, particularly in the form of analyzing bigram distributions, has demonstrated remarkable accuracy in identifying the works of various authors, including Hemingway, Poe, Baldwin, Joyce, Shakespeare, Cummings, Washington, and Lincoln (Camps, Clérice, & Pinche, 2021), thereby underscoring the efficacy of stylometry in uncovering textual DNA. ...

Noisy medieval data, from digitized manuscript to stylometric analysis: Evaluating Paul Meyer’s hagiographic hypothesis
  • Citing Article
  • November 2021

Digital Scholarship in the Humanities

... Libraries, archives and museums, among others, are digitizing large numbers of historical sources, from which high quality data must be extracted for further study by specialists of human sciences following new approaches such as "distant reading" (Moretti, 2013). Many (sub)tasks such as automatic OCR post-correction (Rijhwani et al., 2021) and linguistic annotation (Camps et al., 2021) benefit from pre-trained language models to improve their accuracy. ...

Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

Journal of Data Mining & Digital Humanities

... Indeed, to save time, morphological information was not added manually, but was instead projected using the lexicon of inflected forms Morphalou [ATILF-CNRS and Université de [...]]. Since the initial publication of this article, the effects of Unicode NFKD normalisation have also been tested, with an unclear effect on training accuracy (Gabay et al. [2020b]). A recent version of the model for Old French can be found as part of the web application Deucalion; the models are also directly usable through Pyrrha's interface. ...

Standardizing linguistic data: method and tools for annotating (pre-orthographic) French
  • Citing Conference Paper
  • October 2020