The Learnable Typewriter: A Generative Approach to Text Line Analysis
Ioannis Siglidis Nicolas Gonthier Julien Gaubil Tom Monnier Mathieu Aubry
LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France
https://imagine.enpc.fr/˜siglidii/learnable-typewriter
Abstract
We present a generative document-specific approach to
character analysis and recognition in text lines. Our main
idea is to build on unsupervised multi-object segmentation
methods and in particular those that reconstruct images
based on a limited amount of visual elements, called sprites.
Our approach can learn a large number of different char-
acters and leverage line-level annotations when available.
Our contribution is twofold. First, we provide the first adap-
tation and evaluation of a deep unsupervised multi-object
segmentation approach for text line analysis. Since these
methods have mainly been evaluated on synthetic data in
a completely unsupervised setting, demonstrating that they
can be adapted and quantitatively evaluated on real text im-
ages, and that they can be trained using weak supervision, represents significant progress. Second, we demonstrate the potential
of our method for new applications, more specifically in the
field of paleography, which studies the history and variations
of handwriting, and for cipher analysis. We evaluate our ap-
proach on three very different datasets: a printed volume of
the Google1000 dataset [19,46], the Copiale cipher [2,27]
and historical handwritten charters from the 12th and early
13th century [6,44].
1. Introduction
A popular approach to document analysis in the 1990s
was to learn document-specific character prototypes, which
enabled Optical Character Recognition (OCR) [1,28,29,47]
but also other applications, such as font classification [21]
or document image compression and rendering [38]. This
idea culminated in 2013, with the Ocular system [3] which
proposed a generative model for printed text lines inspired
by the printing process and held the promise of achieving a
complete explanation of their appearance. These document-
specific generative approaches were however overshadowed
by discriminative approaches, whose sole purpose is to per-
form predictions, and which lead to higher performance at
the cost of interpretability, e.g. [16,33]. In this paper, we
explore how modern deep approaches make it possible to revisit and extend model-based approaches to text line analysis.

Figure 1. The Learnable Typewriter. (a) Given a text line dataset, we learn to reconstruct images to discover the underlying characters. Such a generative approach can be used to analyze complex ciphers (b, results on a cipher [27]) and as an automatic tool to help the study of handwriting variations in historical documents (c, application to paleography [44]).

In
particular, we demonstrate an approach that can deal with
challenging examples of handwritten documents, opening a
new perspective for the automatic analysis of ciphered docu-
ments and the study of historical handwriting, paleography.
While discriminative approaches are largely dominant in
today’s deep learning-based computer vision, a recent set
of works revisited generative approaches for unsupervised
multi-object segmentation [5,7,9–11,17,18,23,37,41,48]. Most of them provide results on synthetic data or simple
real images [37], and sometimes show qualitative results on
simple printed text images [40,41]. Surprisingly, images of
handwritten characters, which famously drove the development of convolutional neural networks [31,32] and generative adversarial networks [14], were largely overlooked
by these approaches.
We build on recent sprite-based unsupervised image de-
composition approaches [37,41] that provide an interpretable
decomposition of images into a dictionary of visual elements,
referred to as sprites. These methods jointly optimize both
the sprites and the neural networks that predict their position
and color. Intuitively, we would like to adapt these methods
so that, from text lines that are extracted from any given
document, they could learn sprites that correspond to each
character. By adapting MarioNette [41] to perform text line
analysis, we provide a quantitative evaluation on real data and
an analysis of the limitations of state-of-the-art unsupervised
multi-object segmentation approaches. We argue that text
line recognition highlights the challenges they face, but also
their benefits, and should be used as a benchmark in future work on such methods.
One of the core limitations we highlight is the intrinsic
ambiguity of the unsupervised multi-object segmentation
task: there is no clear reason why a ’d’ character should not
be decomposed into the concatenation of a ’c’ and ’l’ sprite.
Such reuse of character parts is actually a design choice for
many common fonts (see for example [42]). Handcrafted
priors could be added in the specific case of printed text, for
example via a loss that limits how close sprites can be when
reconstructing a given image. Unfortunately, such priors
will typically not generalize well to more challenging cases
where different characters overlap or ligatures are present,
and even less to images which are more complex than text
lines. A more generic solution is to leverage weak supervi-
sion. For example, an important line of work uses language
models to identify characters [3,19,30]. However, these
cannot be applied to unknown ciphers, or historical texts
which are often strongly abbreviated. Instead, we propose
to disambiguate sprites by relying on line-level transcriptions of text lines, which are widely available and easy to produce
with dedicated software, e.g. [24]. We believe that similar
weak (i.e., image-level) annotations could also be considered
for more general images.
Contribution. To summarize, we present:
• a deep generative approach to text line analysis, inspired by deep unsupervised multi-object segmentation approaches and adapted to work in both a weakly supervised and an unsupervised setting,
• a demonstration of the potential of our approach in challenging applications, particularly ciphered text analysis and paleographic analysis,
• an extended evaluation on three very different datasets: a printed volume of the Google1000 dataset [19,46], the Copiale cipher [2,27] and historical handwritten charters from the 12th and early 13th century [6,44].
Our complete implementation can be found at:
github.com/ysig/learnable-typewriter.
2. Related works
Text recognition. Image Text Recognition, including Op-
tical Character Recognition (OCR) and Handwritten Text
Recognition (HTR), is a classic pattern recognition prob-
lem, and one of the earliest successful applications of deep learning [31,32]. The mainstream approaches for text line
recognition rely on discriminative supervised learning. Typi-
cally, a Convolutional Neural Network (CNN) encoder will
map the input image to a sequence of features and a decoder
will associate them to the ground truth, e.g. through a re-
current architecture trained with a Connectionist Temporal
Classification (CTC) loss [4,8,15,16,39], or a transformer
trained with cross entropy [25,33].
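To make this standard discriminative pipeline concrete, the sketch below shows a CNN encoder mapping a line image to a feature sequence and a recurrent decoder producing per-position character log-probabilities suitable for a CTC loss. It is purely illustrative: the layer sizes, module names, and architecture choices are assumptions, not taken from any of the cited systems.

```python
import torch
import torch.nn as nn

class CRNNLineRecognizer(nn.Module):
    """Illustrative CNN encoder + recurrent decoder for CTC-based line recognition.
    Layer sizes are assumptions, not those of any cited system."""
    def __init__(self, num_chars, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                       # maps (B, 1, H, W) to features
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),            # collapse height, keep width
        )
        self.rnn = nn.LSTM(feat_dim, feat_dim, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * feat_dim, num_chars + 1)  # +1 for the CTC blank

    def forward(self, x):                               # x: (B, 1, H, W) line image
        f = self.cnn(x).squeeze(2).permute(0, 2, 1)     # (B, T, feat_dim)
        h, _ = self.rnn(f)
        return self.head(h).log_softmax(-1)             # per-position log-probabilities
```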
More related to our work, ScrabbleGAN [12] proposed
a generative adversarial approach for semi-supervised text
recognition, but their method is neither able to reconstruct
an input text line nor to decompose it into individual charac-
ters. Also related are methods which use already annotated
sprites (referred to as exemplars or supports) to perform
OCR/HTR [43,50] by matching them to text lines. Recent
unsupervised approaches either cluster input images in a
feature space [2] or rely on an existing text corpus of the
recognized language [19]. Closest to our work are classical
prototype-based methods [1,28,29,47] and in particular the
Ocular system [3] which follows a generative probabilistic
approach to jointly model text and character fonts in bi-
narized documents, and is optimized through Expectation
Maximization (EM). Different from us, it also relies on a pre-
trained n-gram language model, originally for the English language and later extended to multiple languages [13].
Unsupervised multi-object segmentation.
Unsuper-
vised multi-object segmentation corresponds to a family of
approaches that decompose and segment scenes into multiple
objects in an unsupervised manner [26]. Some techniques
perform decomposition by computing pixel level segmen-
tation masks over the whole input image [5,10,17,18,48],
while others focus on smaller regions of the input image
and learn to compose objects in an iterative fashion, mostly
using a recurrent architecture [7,9,11,23]. All of these
techniques can isolate objects by producing segmentation
masks, but our goal is also to capture recurring visual
elements.
We thus build on techniques that explicitly model the ob-
jects located inside the input image, by associating them to a
set of image prototypes referred to as sprites [37,41]. Sprites
are color images with an additional transparency channel and
are accompanied by a transformation that composes them
onto a target canvas. However, DTI-Sprites [37] can only predict a small number of sprites for a collection of fixed-size images, and fails to scale when the size of images and the number of objects inside each image grow to those of real documents. At the same time, MarioNette [41] suffers from high reconstruction error, learning fuzzy sprites that suboptimally reconstruct multiple letters on a text toy-dataset.

Figure 2. Overview. (a) An image is encoded into a sequence of features, each decoded by the Typewriter module into image layers. They are then fused by alpha compositing with a predicted uniform background. (b) The Typewriter module takes a feature as input, computes sprites and associated probabilities from learned latent codes, and composes them into a composite sprite that is transformed and positioned onto an image-sized canvas.
3. The Learnable Typewriter
Given a collection of text lines written using consistent
font or handwriting, our goal is to learn the shape of all the
characters it contains as well as a deep network that can
predict the exact way these characters were used to generate
any input text line. Since complete supervision (i.e., the
position and shape of every character used in a document)
for such a task would be extremely costly to obtain, we
propose to proceed in an analysis-by-synthesis fashion and
to build on sprite-based unsupervised image decomposition
approaches [37,41] which jointly learn a set of character
images - called sprites - and a network that transforms and
positions them on a canvas in order to reconstruct input
lines. Because the definition of sprites can be ambiguous,
we introduce a complementary weak-supervision from line-
level transcriptions.
In this section, we first present an overview of our image
model and approach (Section 3.1). Then, we describe the
deep architecture we use (Section 3.2). Finally, we discuss
our loss and training procedure (Section 3.3).
Notations. We write $a_{1:n}$ for the sequence $\{a_1, \ldots, a_n\}$, and use bold letters $\mathbf{a}$ for images. An RGBA image $\mathbf{a}$ corresponds to an RGB image denoted by $\mathbf{a}^c$, alongside an alpha-transparency channel denoted by $\mathbf{a}^\alpha$. We use $\theta$ as a generic notation for network parameters, and thus any character indexed by $\theta$, e.g., $a_\theta$, is a network.
3.1. Overview and image model
Figure 2a presents an overview of our pipeline. An input image $\mathbf{x}$ of size $H \times W$ is fed to an encoder network $e_\theta$, generating a sequence of $T$ features $f_{1:T}$ associated to uniformly-spaced locations $x_{1:T}$ in the image. Each feature $f_t$ is processed independently by our Typewriter module (Section 3.2), which outputs an RGBA image $\mathbf{o}_t$ corresponding to a character. The images $\mathbf{o}_{1:T}$ are then composited with a canvas image we call $\mathbf{o}_{T+1}$. This canvas image $\mathbf{o}_{T+1}$ is a completely opaque image (zero transparency). Its color is predicted by a Multi-Layer Perceptron (MLP) $b_\theta$ which takes as input the features $f_{1:T}$ and outputs RGB values. All resulting images $\mathbf{o}_{1:T+1}$ can be seen as ordered image layers and merged using alpha compositing, as proposed by both [37,41]. More formally, the reconstructed image $\hat{\mathbf{x}}$ can be written:

$$\hat{\mathbf{x}} = \sum_{t=1}^{T+1} \Big[ \prod_{j<t} (1 - \mathbf{o}^\alpha_j) \Big] \mathbf{o}^\alpha_t\, \mathbf{o}^c_t . \quad (1)$$

In practice, we randomize the order of $\mathbf{o}_{1:T}$ in the compositing to reduce overfitting, as advocated by the MarioNette approach [41]. The full system is differentiable and can be trained end-to-end.
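As a concrete illustration, Eq. (1) amounts to a front-to-back alpha-compositing loop over the ordered layers. The PyTorch sketch below assumes the layers are already split into color and alpha tensors; it is an illustration of the formula, not the authors' implementation.

```python
import torch

def alpha_composite(layers_rgb, layers_alpha):
    """Front-to-back compositing of Eq. (1).

    layers_rgb:   (T + 1, 3, H, W) color images o^c_t (last layer = background)
    layers_alpha: (T + 1, 1, H, W) transparencies o^alpha_t (background alpha = 1)
    """
    out = torch.zeros_like(layers_rgb[0])
    transmittance = torch.ones_like(layers_alpha[0])  # prod_{j<t} (1 - o^alpha_j)
    for rgb, alpha in zip(layers_rgb, layers_alpha):
        out = out + transmittance * alpha * rgb
        transmittance = transmittance * (1.0 - alpha)
    return out  # reconstructed image x_hat, shape (3, H, W)
```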
3.2. Typewriter architecture
We now describe in detail the Typewriter module, which takes as input a feature $f$ from the encoder and its position $x$, and outputs an image layer $\mathbf{o}$ to be composited. An overview of the module is presented in Figure 2b. At a high level, it is similar to the MarioNette architecture [41], but handles blanks (i.e., the generation of a completely transparent image) in a different way and has a richer deformation model, similar to the one used in DTI-Sprites [37]. More specifically, the module jointly learns RGBA images called sprites, corresponding to character images, and networks that use the feature $f$ to predict a probability for each sprite and a transformation of the sprite. We detail how we obtain the following three elements: the set of $K$ parameterized sprites, the sprite compositing, and the transformation model.
Sprite parametrization. We model characters as a set of $K$ sprites which are defined using a generator network. More specifically, we learn $K$ latent codes $z_{1:K}$ which are used as input to a generator network $g_\theta$ in order to generate sprites $s_{1:K} = g_\theta(z_{1:K})$. These sprites are images with a single channel that corresponds to their opacity. Similar to DTI-Sprites [37], we model a variable number of sprites with an empty (i.e., completely transparent) sprite which we write $s_{K+1}$. In comparison with directly learning sprites in the image space as in DTI-Sprites [37], we found that using a generator network yields faster and better convergence.
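This parametrization can be sketched as a bank of learned latent codes decoded into single-channel opacity maps. The latent dimension, hidden size, and 2-layer MLP generator below are assumptions made for illustration (the paper uses such an MLP in the supervised setting and a U-Net generator in the unsupervised one, see Sec. 3.3).

```python
import torch
import torch.nn as nn

class SpriteBank(nn.Module):
    """K learned latent codes z_{1:K} decoded by g_theta into opacity sprites.
    Sizes and the MLP generator are illustrative assumptions."""
    def __init__(self, K, sprite_size=32, latent_dim=64):
        super().__init__()
        self.z = nn.Parameter(torch.randn(K, latent_dim))     # latent codes z_{1:K}
        self.sprite_size = sprite_size
        self.generator = nn.Sequential(                        # generator g_theta
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, sprite_size * sprite_size),
        )

    def forward(self):
        alpha = torch.sigmoid(self.generator(self.z))          # opacities in [0, 1]
        return alpha.view(-1, 1, self.sprite_size, self.sprite_size)  # s_{1:K}
```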
Sprite probabilities and compositing. To predict a probability $p_k$ for each sprite $s_k$, each latent code $z_k$ is associated through a network $p_\theta$ to a probability feature $z^p_k = p_\theta(z_k)$ of the same dimension $D$ as the encoder features ($D = 64$ in our experiments). We additionally optimize directly a probability feature $z^p_{K+1}$ which we associate to the empty sprite. Given a feature $f$ predicted by the encoder, we predict the probability $p_k$ of each sprite $s_k$ by computing the dot product between the probability features $z^p_{1:K+1}$ and a learned projection of the feature $\pi_\theta(f)$, and applying a softmax to the result:

$$p_{1:K+1}(f) = \mathrm{softmax}\big(\lambda\, z^p_{1:K+1} \cdot \pi_\theta(f)^T\big), \quad (2)$$

where $\cdot$ is the dot product applied to each element of the sequence, $\lambda = 1/\sqrt{D}$ is a scalar temperature hyper-parameter, and the softmax is applied to the resulting vector. We use these probabilities to combine the sprites into the weighted average $s = \sum_{k=1}^{K} p_k\, g_\theta(z_k)$. Note that this compositing can be interpreted as an attention operation [45]:

$$s = \mathrm{attention}(\bar{Q}, \bar{K}, \bar{V}) = \mathrm{softmax}\Big(\frac{\bar{Q}\bar{K}^T}{\sqrt{D}}\Big)\bar{V}, \quad (3)$$

with $\bar{Q} = \pi_\theta(f)$, $\bar{K} = p_\theta(z_{1:K+1})$, $\bar{V} = g_\theta(z_{1:K+1})$, $D$ the dimension of the features, and by convention $g_\theta(z_{K+1})$ is the empty sprite and $p_\theta(z_{K+1}) = z^p_{K+1}$.

We found that learning to predict the probability features $z^p_{1:K}$ from the sprite latent codes $z_{1:K}$, similar to MarioNette [41], yields slightly better results than directly optimizing $z^p_{1:K}$. Note that we learn a probability code $z^p_{K+1}$ to compute the probability of empty sprites instead of having a separate mechanism as in MarioNette [41], because it is critical for our supervised loss (see Sec. 3.3).
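The soft sprite selection of Eqs. (2)-(3) can be sketched as follows for a single encoder feature. Tensor shapes are assumptions; the empty sprite is represented as an all-zero image so that summing over all $K+1$ sprites matches the weighted average above.

```python
import torch
import torch.nn.functional as F

def sprite_attention(f_proj, z_p, sprites):
    """Soft sprite selection of Eqs. (2)-(3) for one encoder feature.

    f_proj:  (D,)              projected feature pi_theta(f)
    z_p:     (K + 1, D)        probability features p_theta(z_k); last = empty sprite
    sprites: (K + 1, 1, h, w)  generated sprites g_theta(z_k); last one all-zero
    """
    D = f_proj.shape[0]
    logits = (z_p @ f_proj) / D ** 0.5                   # lambda * z^p . pi_theta(f)
    probs = F.softmax(logits, dim=0)                     # p_{1:K+1}(f)
    s = (probs.view(-1, 1, 1, 1) * sprites).sum(dim=0)   # weighted-average sprite s
    return s, probs
```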
Positioning and coloring. The final step of our module is to position the selected sprite in a canvas of size $H \times W$ and to adapt its color. We implement this operation as a sequence of a spatial transformer [22] and a color transformation, similar to DTI-Sprites [37]. More specifically, the feature $f$ is given as input to a positioning network $t_\theta$, which predicts 3 parameters for isotropic scaling and 2D translation used by a spatial transformer [22] to deform $s$, and to a network $c_\theta$, which predicts the color of the sprite. Finally, using the location $x$ associated to the feature $f$, we paste the deformed, colored sprite onto a background canvas of size $H \times W$ at position $x$ to obtain a reconstructed RGBA image layer $\mathbf{o}$. Positioning the sprites with respect to the position of the associated local features helps us obtain results co-variant to translations of the text lines. To produce the background canvas, the features $f_{1:T}$ are first each passed through a shared MLP $c^{bkg}_\theta$. We then use bi-linear interpolation to upscale these $T$ colors to fit the size of the input image. Details on the parametrization of the transformation networks are presented in the supplementary material.
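A minimal sketch of the positioning step with a spatial transformer [22] is given below. The way scale and translation enter the affine matrix is our own convention, not the authors' exact parametrization, which is described in their supplementary material.

```python
import torch
import torch.nn.functional as F

def place_sprite(sprite_rgba, scale, tx, ty, H, W):
    """Scale and translate a colored RGBA sprite onto an H x W canvas using a
    spatial transformer. Parameter conventions here are assumptions.

    sprite_rgba: (4, h, w) tensor; scale, tx, ty: floats predicted by t_theta.
    """
    # affine_grid maps output coordinates to sprite coordinates, so we use the
    # inverse transform: isotropic scaling and 2D translation, no rotation.
    theta = torch.tensor([[1.0 / scale, 0.0, -tx],
                          [0.0, 1.0 / scale, -ty]]).unsqueeze(0)
    grid = F.affine_grid(theta, size=(1, 4, H, W), align_corners=False)
    layer = F.grid_sample(sprite_rgba.unsqueeze(0), grid,
                          padding_mode='zeros', align_corners=False)
    return layer.squeeze(0)  # RGBA image layer o of size (4, H, W)
```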
3.3. Losses and training details
Our system is designed in an analysis-by-synthesis spirit,
and thus relies mainly on a reconstruction loss. This recon-
struction loss can be complemented by a loss on the selected
sprites in the supervised setting where each text line is paired
with a transcription. In the following, we define these losses
for a single text line image and its transcription, using the
notations of the previous section.
Reconstruction loss. Our core loss is a simple mean square error between the input image $\mathbf{x}$ and its reconstruction $\hat{\mathbf{x}}$ predicted by our system as described in Sec. 3.1:

$$\mathcal{L}_{rec}(\mathbf{x}, \hat{\mathbf{x}}) = \| \mathbf{x} - \hat{\mathbf{x}} \|^2 . \quad (4)$$
Figure 3. Results on a printed document (Google1000). (a) Supervised, (b) Unsupervised. In both the supervised and unsupervised settings, our method produces meaningful sprites and accurate reconstructions. See text for details and more reconstructions in the supplementary material. In the supervised case we show the 60 most used sprites in the train set, and we show all the sprites in the unsupervised case.
Supervised learning. The intrinsic ambiguity of the sprite decomposition problem may result in sprites that do not correspond to individual characters. Using line-level annotation is an easy way to remove this ambiguity. We find that simply adding the classical CTC loss [15], computed on the sprite probabilities, to our reconstruction loss is enough to learn sprites that exactly correspond to characters. More specifically, we choose the number of sprites as the number of different characters, arbitrarily associate each sprite to a character, and associate the empty sprite to the separator token of the CTC. Then, given the one-hot line-level annotation $y$ and the predicted sprite probabilities $\hat{y} = (p_{1:K+1}(f_1), \ldots, p_{1:K+1}(f_T))$, we optimize our system parameters by minimizing:

$$\mathcal{L}_{sup}(\mathbf{x}, y, \hat{\mathbf{x}}, \hat{y}) = \mathcal{L}_{rec}(\mathbf{x}, \hat{\mathbf{x}}) + \lambda_{ctc}\, \mathcal{L}_{ctc}(y, \hat{y}), \quad (5)$$

where $\lambda_{ctc}$ is a hyper-parameter and $\mathcal{L}_{ctc}(y, \hat{y})$ is the CTC loss computed between the ground-truth $y$ and the predicted probabilities $\hat{y}$. In our experiments we used $\lambda_{ctc} = 0.1$ for printed text and $\lambda_{ctc} = 0.01$ for handwritten text.
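The supervised objective of Eq. (5) can be sketched with standard PyTorch losses. Normalization constants (mean vs. sum) and tensor shapes are assumptions; the empty sprite plays the role of the CTC blank, as described above.

```python
import torch
import torch.nn.functional as F

def supervised_loss(x, x_hat, log_probs, targets, target_lengths, lambda_ctc=0.1):
    """Eq. (5): reconstruction loss plus CTC loss on the sprite probabilities.

    x, x_hat:       (B, 3, H, W) input lines and their reconstructions
    log_probs:      (T, B, K + 1) log of p_{1:K+1}(f_t); empty sprite = CTC blank
    targets:        (B, L) padded character indices of the line transcriptions
    target_lengths: (B,) true transcription lengths
    lambda_ctc:     0.1 for printed text, 0.01 for handwritten text in the paper
    """
    rec = F.mse_loss(x_hat, x)  # mean squared error, up to a normalization constant
    input_lengths = torch.full((x.shape[0],), log_probs.shape[0], dtype=torch.long)
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=log_probs.shape[-1] - 1)  # empty sprite as separator token
    return rec + lambda_ctc * ctc
```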
Implementation and training details. We train on the Google1000 [46] and Fontenay [44] datasets with lines of height $H = 64$, and on the Copiale dataset [27] with $H = 92$. The generated sprites $s_{1:K}$ are of size $H/2 \times H/2$. In the supervised setting, we use as many sprites as there are characters; in the unsupervised setting we set $K = 60$ for Google1000 and $K = 120$ for the Copiale cipher. In the supervised case we train for 100 epochs on Google1000 and for 500 epochs on Copiale with a batch size of 16, and we select the model that performs best on the validation set for evaluation. In the unsupervised setting we use line crops of width $W = 2H$, train for 1000 epochs on Google1000 and for 5000 on the Copiale cipher with a batch size of 32, and use the final model. The number of epochs is much higher in the unsupervised case than in the supervised case, in part because the network sees only a small crop of each line at each epoch, but each epoch is thus much faster to perform.

Our encoder network is a ResNet-32-CIFAR10 [20], truncated after layer 3, with a Gaussian feature pooling described in the supplementary material. For our unsupervised experiments, we use as generator $g_\theta$ the U-Net architecture of Deformable Sprites [49], which converged quickly; for our supervised experiments, we use a 2-layer MLP similar to MarioNette [41], which produces sprites of higher quality. The networks $\pi_\theta$ and $p_\theta$ are single linear layers followed by layer normalization. We use the AdamW [34] optimizer with a learning rate of $10^{-4}$ and apply a weight decay of $10^{-6}$ to the encoder parameters.
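A minimal sketch of this optimizer setup is given below; it assumes the model exposes its encoder as `model.encoder` (a hypothetical attribute name) and applies weight decay to the encoder parameters only.

```python
import torch

def build_optimizer(model, lr=1e-4, encoder_weight_decay=1e-6):
    """AdamW with weight decay restricted to the encoder parameters.
    The 'encoder.' prefix is an assumed naming convention."""
    encoder_params = [p for n, p in model.named_parameters()
                      if n.startswith('encoder.')]
    other_params = [p for n, p in model.named_parameters()
                    if not n.startswith('encoder.')]
    return torch.optim.AdamW(
        [{'params': encoder_params, 'weight_decay': encoder_weight_decay},
         {'params': other_params, 'weight_decay': 0.0}],
        lr=lr)
```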
4. Experiments
4.1. Datasets and metric
We experiment with three datasets with different char-
acteristics: Google1000 [46], the Copiale cipher [27] and
Fontenay manuscripts [6,44]. We check that our method
leads to transcription results on par with related baselines using the Character Error Rate metric.

Figure 4. Results on the Copiale cipher [27]. (a) Supervised, (b) Unsupervised. Despite the high number of characters and their variability, our method learns meaningful sprites and performs accurate reconstructions in both settings. In both the supervised and the unsupervised settings, we show the 108 most used sprites from the training set.
Google1000.
The Google1000 dataset contains scanned
historical printed books, arranged into Volumes [46]. We
use the English Volume 0002, which we process with the pre-processing code of [19], using 317 out of 374 pages and a train-val-test split with 5097, 567 and 630 lines respectively. This leads to a total of 83 distinct annotated characters.
Although supervised printed font recognition is largely con-
sidered a solved problem, this document is still challenging
for an analysis-by-synthesis approach, containing artifacts
such as ink bleed, age degradation, as well as variance in
illumination and geometric deformations.
The Copiale cipher.
The Copiale cipher is an oculist German text dating back to an 18th-century secret society [27]. Unlike Baró et al. [2], who use a binarized version of the dataset, we train our model on the original text-line images, which we segmented using docExtractor [35] and manually assigned to the annotations, respecting the train-val-test split of Baró et al. [2] with 711, 156 and 908 lines respectively. The total number of distinct annotated characters is
112. This dataset is challenging because of the large number
and wide variety of characters and calligraphic elements.
However, the characters are in general well separated from
each other, and despite being handwritten, most of them
exhibit a limited variability.
Fontenay manuscripts.
The Fontenay dataset contains
digitized charters that originate from the Cistercian abbey
of Fontenay in Burgundy (France) [6,44] and were created
during the 12th and early 13th century. Each document has
been digitized and each line has been manually segmented
and transcribed. For our experiments, we selected a subset
of 14 different documents sharing a similar script which falls
into the family of praegothica scripts. These correspond
to 163 lines, using 47 distinct characters. While they were
carefully written and preserved, these documents are still
very challenging (Figure 6). They exhibit degradation, clear
intra-document letter shape variations, and letters can over-
lap or be joined by ligature marks. Moreover, each document
represents only a small amount of data, e.g., the ones used
in Figure 6 contain between 8 and 25 lines.
Metric.
To evaluate our models for Optical Character
Recognition, we use the standard Character Error Rate (CER)
metric. Given ground-truth and predicted sequences of char-
acters, $\sigma$ and $\hat{\sigma}$, it is defined as:

$$\mathrm{CER}(\sigma, \hat{\sigma}) = \frac{S + D + I}{|\sigma|}, \quad (6)$$

the minimum number of substitutions $S$, deletions $D$, and insertions $I$ of characters necessary to match the predicted sequence $\hat{\sigma}$ to the ground truth sequence $\sigma$, normalized by the size of the ground truth sequence $|\sigma|$. For simplicity, we ignore spaces. Predictions are obtained by selecting at every position the character associated to the most likely sprite.

Table 1. Recognition results on Google1000 [46]. The 'Rec.' column emphasizes methods that can reconstruct the input image.

Method                  Type     Rec.   CER ↓
DTI-Sprites [37]        unsup.   ✓      18.4%
FontAdaptor [50]        1-shot          6.7%
ScrabbleGAN [12]        sup.            0.6%
Learnable Typewriter    unsup.   ✓      8.1%
Learnable Typewriter    sup.     ✓      0.9%
In the supervised setting, the association between sprites
and characters is fixed at the beginning of training. In the
unsupervised setting, we associate every sprite to a single
character using a simple assignment strategy described in
supplementary material. More complex assignments, for
example associating sprite bi-grams to individual charac-
ters, or even incorporating their relative positions, could be
considered for a performance boost.
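For reference, the CER of Eq. (6) can be computed with a standard dynamic-programming edit distance; the sketch below ignores spaces, as in the paper.

```python
def character_error_rate(gt, pred):
    """CER of Eq. (6): edit distance (substitutions, deletions, insertions)
    between predicted and ground-truth sequences, normalized by |gt|."""
    gt = gt.replace(' ', '')
    pred = pred.replace(' ', '')
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(pred) + 1))
    for i, g in enumerate(gt, start=1):
        curr = [i]
        for j, p in enumerate(pred, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (g != p)))    # substitution
        prev = curr
    return prev[-1] / max(len(gt), 1)
```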
4.2. Results and analysis
4.2.1 Printed fonts (Google1000)
Qualitative results on Google1000 in both the supervised
and the unsupervised settings are presented in Figure 3 and
quantitative results are reported in Table 1. In the supervised
setting (Figure 3a), the sprites closely correspond to the char-
acters (with the exception of very rare characters, such as
the capital ’Z’ character, as can be seen in supplementary
material), the reconstruction is of very high quality and each
character is reconstructed with the expected sprite. In the
unsupervised setting (Figure 3b), the characters can be re-
constructed using several sprites (e.g. ’r’ and ’m’ sprites
used jointly to better reconstruct the ’m’), some rarer charac-
ters are not modeled, some sprites include parts of several
characters, and several sprites can reconstruct the same char-
acter to account for appearance variation (e.g. two sprites for
the very frequent ’e’). Some sprites also do not correspond
clearly to any character, and are typically not used by the
network. These behaviors are expected because of the ambi-
guity of the reconstruction and the statistics of the character
frequencies: without additional supervision, there is a clear
benefit for the network to model well variations of common
characters, and to approximate or discard rare ones.
The quantitative results (Tab. 1) are consistent with these
observations. The CER in the supervised setting is close to
perfect, while it is 8.1% for the unsupervised setting. To
give references for these performances, we trained Scrabble-
GAN [12] and tested FontAdaptor [50] on our data, using
the code provided by the authors, and we adapted DTI-Sprites [36] to text lines using a simple greedy strategy (see details in supplementary material).

Table 2. Recognition results on Copiale [27]. The 'Rec.' column emphasizes methods that can reconstruct the input image. (*) See text for details.

Method                  Type       Rec.   CER ↓
HTRbyMatching [43]      few-shot          10–47%*
Learnable Typewriter    unsup.     ✓      52.7%
Learnable Typewriter    sup.       ✓      4.8%

The adaptation of
DTI-Sprites [36] provides an unsupervised baseline, that we
clearly outperform. FontAdaptor [50] is a 1-shot method that
learns to match given examples of each character to text lines.
It leads to a CER of 6.7% which, as expected, is better than
our unsupervised approach but clearly worse than the su-
pervised approach. Finally, we selected ScrabbleGAN [12]
to compare with a supervised approach since this specific
architecture includes a generator network, able to generate
an image from a character sequence. Note however that this
generator cannot reconstruct a specific image, that it is not
as structured as ours and that the network cannot decompose
an input image into characters. ScrabbleGAN [12] produced
a CER of 0.6%, slightly better than but comparable to our super-
vised approach. Note that the goal of our approach is not
to boost CER performances, but instead to learn character
models and image decomposition.
4.2.2 Ciphered handwritten text (Copiale)
Qualitative results on the Copiale cipher in both the su-
pervised and unsupervised settings are presented in Fig-
ure 4. The learned sprites are harder to analyze than the
ones learned from printed English text, but similar trends
appear: our supervised approach learns accurate shapes for
all characters except a few rare ones; our unsupervised ap-
proach learns several versions of frequent characters, leads to some unused non-interpretable sprites, and learns strokes that are reused to reconstruct different characters and enable modeling shape variations. Both reconstructions are of high quality
and as expected the supervised version uses a single sprite
per character.
As a sanity check for our method, we again provide quan-
titative results for CER in Table 2. We were not able to com-
pare with FontAdaptor [50] in this case, due to its limited
alphabet size. We compare instead to the HTRbyMatching
method [43]. HTRbyMatching was evaluated on a wide
range of few-shot scenarios, ranging from a scenario similar
to FontAdaptor where a single annotation of each character
is available, to another where 5 samples are available for
each character together with 5 completely annotated pages.
Figure 5. Sprites learned for similar documents in Praegothica script. Each line corresponds to a different document. Looking at any column, one can notice the small differences that characterize the handwriting in each document. Colored boxes correspond to cases analyzed in more detail in Figure 6.

Results were reported only for confident character predictions with different confidence thresholds, but by summing the
error rate of the predicted symbols and the percentage of
non-annotated symbols, one can estimate the CER to vary
between 10% and 47% depending on the scenario. This is
consistent with the quantitative results we obtain with our
approach, which are better in the supervised setting (5.5%) and worse in the unsupervised one (52.9%). These are also
consistent with the qualitative results: given that many char-
acters are reconstructed using several sprites, one would
have to associate sprite bi-grams to characters to obtain good
CER performances. Designing a principled approach for
merging sprites falls outside the scope of this work. However,
we demonstrate the effect of a simple post-processing trick
where we associate new sprites to the most frequent bi-grams
and tri-grams. It boosts our unsupervised performance to
29.9% CER, a significant improvement, which validates our
previous qualitative analysis.
4.2.3 Historical texts and paleography (Fontenay)
To test our approach in a more challenging case and demon-
strate its potential for paleographic analysis, we applied it
to a collection of 14 historical charters from the Fontenay
abbey [6,44]. While they all use similar scripts from the
Praegothica type, they also exhibit clear variations. One of
the goals of a paleographic analysis would be to identify and
characterize these variations. We focus on the variations in
the shape of letters, which are quite challenging to describe
with natural language. One solution would be to choose a
specific example for each letter in each document, or the manual drawing of a 'typical' one by a paleographer. However, this is very time consuming and might reflect priors or biases from the historian in addition to the actual variations. Instead, we propose to fine-tune our Learnable Typewriter approach on each document and visualize the sprites associated to each character and each document. Because of the difficulty of the dataset, we focus on the results of our supervised setting.

Figure 6. The sprites summarize the key attributes of a character in each specific document, averaging its variations. Note the complexity of the documents: characters can overlap or be connected by ligatures, the parchment is often stained, and there are important intra-document character variations. (a) 'a' and 'g' sprites for each document and associated examples of the character. Note how the variations of the descending part of the 'g' sprites closely match the variations observed in the documents. Also note the subtle variations of the 'a', which are clear in the sprites but would be hard to notice and describe from the original images for a non-expert. (b) The appearance variations of individual instances associated to the 'e' character in the document are accurately visually summarized by the sprite. (c) The double appearance of the ascending line of the 'd' sprite shown on the left is related to the co-existence of two different kinds of 'd' in the document, as shown in the examples on the right.
Figure 5 visualizes the sprites obtained for five different documents, from the characters 'a' to 'h', and Figure 6 high-
lights different aspects of the results. Figure 6a emphasizes
the fact that the differences in the learned sprites correspond
to actual variations in the different documents, whether sub-
tle, such as for the ’a’ sprite, or clearer, such as for the
descending part of the ’g’ sprite. Figure 6b shows how a
sharp sprite can be learned for the character ’e’, summariz-
ing accurately its shape despite variations in the different
occurrences. Finally, Figure 6c shows the case of a docu-
ment in which two types of ’d’ co-exist. In this case, the
learned sprite, shown on the left, reassembles an average of
the two, with both versions of the ascending parts visible
with intermediate transparency. Such a limitation could be
overcome by learning several sprites per character. We thus
experimented with learning two per character, simply by
summing their probabilities when optimizing the CTC-loss.
We find that when different appearances of the same letter
exist, the two sprites learn two different appearances. In
Figure 6c, we show the example of the two different learned
’d’ sprites on the right of the original one.
Our approach could benefit paleographic analysis in more
ways than simply analyzing character shapes. Indeed,
our model also gives access to the position and scale varia-
tion for each letter. This would enable a quantitative analysis
of more global appearance factors of the text, related to
the space between letters or their respective size variations.
Because they would be tremendously tedious to annotate,
such variations have rarely been quantified, and their anal-
ysis could open new research topics, for example the study
of the handwriting evolution of a single writer copying a
book across several months. Another natural application of
our approach is font or writer classification, which could be
achieved either by using a single model to compare position and error statistics for the different letters and bi-grams, or
by training different models for different fonts or writers.
The main advantage compared to most existing approaches
would be the high interpretability of the predictions, which a
user could easily validate.
5. Conclusion
We have presented a document-specific generative ap-
proach to document analysis. Inspired by deep unsuper-
vised multi-object segmentation methods, we extended them
to accurately model standard printed documents as well as much more complex ones, such as a handwritten ciphered manuscript or ancient charters. We showed that a completely unsupervised approach suffers from the ambiguity of the decomposition problem and from imbalanced character distributions.
We thus extended these approaches using weak supervision
to obtain high-quality results. Finally, we demonstrated the
potential of our Learnable Typewriter approach for a novel
application: paleographic analysis.
Acknowledgments
We would like to thank Malamatenia Vlachou and Do-
minique Stutzmann for sharing ideas, insights and data
for applying our method in paleography; Vickie Ye and
Dmitriy Smirnov for useful insights and discussions; Ro-
main Loiseau, Mathis Petrovich, Elliot Vincent, Sonat
Baltacı for manuscript feedback and constructive insights.
This work was partly supported by ANR project En-
Herit ANR-17-CE23-0008, ANR project VHS ANR-21-
CE38-0008 and HPC resources from GENCI-IDRIS (2021-
AD011011697R1).
References
[1]
Henry S Baird. Model-directed document image analysis. In
Proceedings of the Symposium on Document Image Under-
standing Technology, volume 1, page 3, 1999. 1,2
[2]
Arnau Baró, Jialuo Chen, Alicia Fornés, and Beáta Megyesi. Towards a Generic Unsupervised Method for Transcription of Encoded Manuscripts. In Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, pages 73–78, Brussels, Belgium, May 2019. ACM. 1,2,6
[3]
Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. Un-
supervised Transcription of Historical Documents. In Pro-
ceedings of the 51st Annual Meeting of the Association for
Computational Linguistics, volume 1, pages 207–217, 2013.
1,2
[4]
Théodore Bluche and Ronaldo Messina. Gated convolutional recurrent neural networks for multilingual handwriting recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 646–651. IEEE, 2017. 2
[5]
Christopher P Burgess, Loic Matthey, Nicholas Watters,
Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander
Lerchner. Monet: Unsupervised scene decomposition and
representation. arXiv preprint arXiv:1901.11390, 2019. 1,2
[6]
Jean-Baptiste Camps, Chahan Vidal-Gorène, Dominique Stutzmann, Marguerite Vernet, and Ariane Pinche. Data Diversity in handwritten text recognition: challenge or opportunity? In DH2022 Local Organizing Committee, editor, Digital Humanities 2022. Conference Abstracts (The University of Tokyo, Japan, 25-29 July 2022), pages 160–165. Tokyo, 2022. 1,2,5,6,8
[7]
Eric Crawford and Joelle Pineau. Spatially invariant unsuper-
vised object detection with convolutional neural networks. In
Proceedings of the AAAI Conference on Artificial Intelligence,
volume 33, pages 3412–3420, 2019. 1,2
[8]
Arthur Flor de Sousa Neto, Byron Leite Dantas Bezerra, Alejandro Héctor Toselli, and Estanislau Baptista Lima. HTR-Flor: A deep learning system for offline handwritten text recognition. In 2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pages 54–61. IEEE, 2020. 2
[9]
Fei Deng, Zhuo Zhi, Donghun Lee, and Sungjin Ahn. Gener-
ative scene graph networks. In International Conference on
Learning Representations, 2020. 1,2
[10]
Patrick Emami, Pan He, Sanjay Ranka, and Anand Rangara-
jan. Efficient iterative amortized inference for learning sym-
metric and disentangled multi-object representations. In Inter-
national Conference on Machine Learning, pages 2970–2981.
PMLR, 2021. 1,2
[11]
S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval
Tassa, David Szepesvari, Koray Kavukcuoglu, and Geoffrey E.
Hinton. Attend, Infer, Repeat: Fast Scene Understanding
with Generative Models. In Advances in Neural Information
Processing Systems, volume 29, Aug. 2016. 1,2
[12]
Sharon Fogel, Hadar Averbuch-Elor, Sarel Cohen, Shai Ma-
zor, and Roee Litman. ScrabbleGAN: Semi-Supervised Vary-
ing Length Handwritten Text Generation. arXiv:2003.10557
[cs], Mar. 2020. 2,7
[13]
Dan Garrette, Hannah Alpert-Abrams, Taylor Berg-
Kirkpatrick, and Dan Klein. Unsupervised Code-Switching
for Multilingual Historical Document Transcription. In Pro-
ceedings of the 2015 Conference of the North American Chap-
ter of the Association for Computational Linguistics: Human
Language Technologies, pages 1036–1041, Denver, Colorado,
2015. Association for Computational Linguistics. 2
[14]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. Advances in
neural information processing systems, 27, 2014. 1
[15]
Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 369–376, New York, NY, USA, 2006. Association for Computing Machinery. 2,5
[16]
Alex Graves and Jürgen Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. Advances in Neural Information Processing Systems, 21:545–552, 2008. 1,2
[17]
Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In International Conference on Machine Learning, pages 2424–2433. PMLR, 2019. 1,2
[18]
Klaus Greff, Sjoerd Van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. Advances in Neural Information Processing Systems, 30, 2017. 1,2
[19]
Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman.
Learning to Read by Spelling: Towards Unsupervised Text
Recognition. arXiv:1809.08675 [cs], Dec. 2018. 1,2,6
[20]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Proceed-
ings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016. 5
[21]
Judith Hochberg, Patrick Kelly, Timothy Thomas, and Lila
Kerns. Automatic script identification from document images
using cluster-based templates. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 19(2):176–181, 1997. 1
[22]
Max Jaderberg, Karen Simonyan, and Andrew Zisserman.
Spatial Transformer Networks. In Advances in Neural Infor-
mation Processing Systems, volume 28, 2015. 4
[23]
Jindong Jiang and Sungjin Ahn. Generative neurosymbolic
machines. Advances in Neural Information Processing Sys-
tems, 33:12572–12582, 2020. 1,2
[24]
Philip Kahle, Sebastian Colutto, Günter Hackl, and Günter Mühlberger. Transkribus - a service platform for transcription, recognition and retrieval of historical documents. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 4, pages 19–24. IEEE, 2017. 2
[25]
Lei Kang, Pau Riba, Marçal Rusiñol, Alicia Fornés, and Mauricio Villegas. Pay attention to what you read: Non-recurrent handwritten text-line recognition. arXiv preprint arXiv:2005.13044, 2020. 2
[26]
Laurynas Karazija, Iro Laina, and Christian Rupprecht. Clevr-
Tex: A Texture-Rich Benchmark for Unsupervised Multi-
Object Segmentation. arXiv:2111.10265 [cs], Nov. 2021.
2
[27]
Kevin Knight, Beata Megyesi, and Christiane Schaefer. The
Copiale Cipher. In Proceedings of the ACL Workshop on
Building and Using Comparable Corpora, pages 2–9, 2011.
1,2,5,6,7
[28]
Gary E Kopec and Mauricio Lomelin. Document-specific
character template estimation. In Document Recognition III,
volume 2660, pages 14–26. SPIE, 1996. 1,2
[29]
Gary E Kopec and Mauricio Lomelin. Supervised template
estimation for document image decoding. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 19(12):1313–
1324, 1997. 1,2
[30]
Gary E Kopec, Maya R Said, and Kris Popat. N-gram lan-
guage models for document image decoding. In Document
Recognition and Retrieval IX, volume 4670, pages 191–202.
SPIE, 2001. 2
[31]
Yann LeCun, Bernhard Boser, John S Denker, Donnie Hen-
derson, Richard E Howard, Wayne Hubbard, and Lawrence D
Jackel. Backpropagation applied to handwritten zip code
recognition. Neural computation, 1(4):541–551, 1989. 1,2
[32]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 1,2
[33]
Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Flo-
rencio, Cha Zhang, Zhoujun Li, and Furu Wei. Trocr:
Transformer-based optical character recognition with pre-
trained models. arXiv preprint arXiv:2109.10282, 2021. 1,
2
[34]
Ilya Loshchilov and Frank Hutter. Decoupled weight decay
regularization. arXiv preprint arXiv:1711.05101, 2017. 5
[35]
Tom Monnier and Mathieu Aubry. docExtractor: An off-
the-shelf historical document element extraction. In 2020
17th International Conference on Frontiers in Handwriting
Recognition (ICFHR), pages 91–96, Dortmund, Germany,
Sept. 2020. IEEE. 6
[36]
Tom Monnier, Thibault Groueix, and Mathieu Aubry. Deep
Transformation-Invariant Clustering. In NeurIPS, Oct. 2020.
7
[37]
Tom Monnier, Elliot Vincent, Jean Ponce, and Mathieu Aubry.
Unsupervised Layered Image Decomposition into Object Pro-
totypes. In Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pages 8640–8650, Apr. 2021. 1,
2,3,4,7
[38]
Joseph C Nolan and Robert Filippini. Method and appa-
ratus for creating a high-fidelity glyph prototype from low-
resolution glyph images, Apr. 20 2010. US Patent 7,702,182.
1
[39]
Joan Puigcerver. Are multidimensional recurrent layers really
necessary for handwritten text recognition? In 2017 14th
IAPR International Conference on Document Analysis and
Recognition (ICDAR), volume 1, pages 67–72. IEEE, 2017. 2
[40]
Pradyumna Reddy, Paul Guerrero, and Niloy J Mitra. Search
for concepts: Discovering visual concepts using direct opti-
mization. arXiv preprint arXiv:2210.14808, 2022. 1
[41]
Dmitriy Smirnov, Michael Gharbi, Matthew Fisher, Vitor
Guizilini, Alexei A. Efros, and Justin Solomon. MarioNette:
Self-Supervised Sprite Learning. arXiv:2104.14553 [cs], Apr.
2021. 1,2,3,4,5
[42]
M. Solomon. The Art of Typography: An Introduction to
Typo-icon-ography. Art Direction Book Company, 1994. 2
[43]
Mohamed Ali Souibgui, Alicia Fornés, Yousri Kessentini, and Crina Tudor. A few-shot learning approach for historical ciphered manuscript recognition. CoRR, abs/2009.12577, 2020. 2,7
[44]
Dominique Stutzmann. Fontenay dataset. Original charters from Fontenay before 1213. https://doi.org/10.5281/zenodo.6507963. 1,2,5,6,8
[45]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. 4
[46]
L. Vincent. Google Book Search: Document Understanding
on a Massive Scale. In Ninth International Conference on
Document Analysis and Recognition (ICDAR 2007) Vol 2,
pages 819–823, Curitiba, Parana, Brazil, Sept. 2007. IEEE. 1,
2,5,6,7
[47]
Yihong Xu and George Nagy. Prototype extraction and adap-
tive ocr. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 21(12):1280–1296, 1999. 1,2
[48]
Yanchao Yang, Yutong Chen, and Stefano Soatto. Learning to
manipulate individual objects in an image. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 6558–6567, 2020. 1,2
[49]
Vickie Ye, Zhengqi Li, Richard Tucker, Angjoo Kanazawa,
and Noah Snavely. Deformable sprites for unsupervised video
decomposition. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2022. 5
[50]
Chuhan Zhang, Ankush Gupta, and Andrew Zisserman. Adap-
tive Text Recognition through Visual Matching. ECCV 2020,
Sept. 2020. 2,7