Deep Learning Features for Handwritten Keyword Spotting
Baptiste Wicht, Andreas Fischer, Jean Hennebert
University of Fribourg, Switzerland
HES-SO, University of Applied Science of Western Switzerland
Email: baptiste.wicht@unifr.ch, andreas.fischer@unifr.ch, jean.hennebert@unifr.ch
Abstract—Deep learning has had a significant impact on diverse
pattern recognition tasks in recent years. In this paper, we
investigate its potential for keyword spotting in handwritten
documents by designing a novel feature extraction system based
on Convolutional Deep Belief Networks. Sliding window features
are learned from word images in an unsupervised manner. The
proposed features are evaluated both for template-based word
spotting with Dynamic Time Warping and for learning-based
word spotting with Hidden Markov Models. In an experimental
evaluation on three benchmark data sets with historical and
modern handwriting, it is shown that the proposed learned
features outperform three standard sets of handcrafted features.
I. INTRODUCTION
Although it has been the subject of research for a long time,
handwriting recognition remains a largely unsolved problem [1].
Under difficult conditions, such as large vocabularies, different
writing styles or degraded documents, keyword spotting solutions
have been suggested instead of a complete transcription
to spot words in document images [2].
Keyword spotting methods can be separated into two categories.
Template-based methods compare template images
of the keyword query with document images. This has the
advantage that template images are easy to obtain and that no
knowledge of the underlying language is necessary. However,
at least one template image is necessary for each keyword
query. Moreover, these systems typically do not generalize
well to unknown writing styles. Dynamic Time Warping
(DTW) has been extensively studied to match template images
with segmented word images based on a sliding window [3]
and different features, such as word profiles [4], closed con-
tours [5] or gradient features [6], [7]. Recent segmentation-
free methods match template images with whole document
images [8], [9], [10].
On the other hand, learning-based systems use supervised
learning to train keyword models. These methods
are expected to generalize better to unknown writing styles
but they require a considerable amount of labeled training
data. Hidden Markov models (HMM) have been proposed for
modeling words [11] or characters [12], [13]. The character-
based approach is inspired by systems for complete transcrip-
tion [14]. It does not depend on keyword images for training
and can be used to spot arbitrary keywords. Another character-
based approach is proposed in [15] using recurrent neural
networks.
Fig. 1: System overview. Unlabeled data trains the deep learning feature
extractor; a keyword query and a word image are then scored either with
DTW or with an HMM (the latter trained on labeled data) to produce a
keyword score.
Both categories rely on features extracted from the
images. Such features are generally handcrafted, and optimizing
them for different data sets is often difficult. Deep
Learning solutions have shown that it is possible to learn
features directly from pixels. Restricted Boltzmann Machines
(RBM) [16] have been extensively used to extract features
from data sets [17]. Once stacked into Deep Belief Networks
(DBN), they are able to extract multi-layer features from
images [18], [19]. Convolutional RBMs have proven especially
successful on images [18], [20]. General Convolutional Neural
Networks (CNNs) are also used to extract features on large
data sets of images [21], [22] or videos [23].
In the present paper, we investigate the potential of deep
learning for handwritten keyword spotting by designing a
novel feature extraction system based on Convolutional Deep
Belief Networks. This system has the advantage that features
are learned from the images using unsupervised learning, mak-
ing use of unlabeled handwriting images which are abundantly
available. Also, this system does not require knowledge of the
language and its alphabet, which is particularly convenient for
historical manuscripts. However, it requires a segmentation of
document images into words, which can be prone to errors.
Moreover, contrary to handcrafted feature sets, it needs to
be trained. Both DTW and HMMs have been used to spot
keywords based on the deep learning features. An overview
of the system is presented in Figure 1.
Our research focuses on handwritten documents (such as let-
ters, memorandums and historical manuscripts). The proposed
features have been tested on three well-known benchmark
data sets for keyword spotting (IAM offline database, George
Washington database and Parzival database) and are compared
with three benchmark feature sets [6], [7], [14].
The rest of this paper is organized as follows. The feature
extraction system is introduced in Section II. Section III
presents the spotting methods. The experimental setup is
detailed in Section IV and results are discussed in Section V.
Finally, conclusions are drawn in Section VI.
II. FEATURE EXTRACTION
In the proposed system, small patches are extracted from the
segmented word images using a horizontal sliding window
(Section II-A) and features are extracted from each patch using
a Convolutional Deep Belief Network (Section II-C).
A. Image Preprocessing and Patch Extraction
The proposed system uses segmented, binary word images.
The images are binarized using a simple global threshold after
local edge enhancement. After segmentation, the word images
are normalized to remove the skew and slant of the text. This
process is described in detail in [14]. Finally, the word images
are resized to a third of their height. This research focuses
on word spotting; therefore, word segmentation errors are
not taken into account. Instead, the perfectly segmented word
images of the benchmark data sets are considered.
A horizontal sliding window is used to extract patches from
each image. The window is W pixels wide and has the same
height as the image (no vertical overlap). The window is
moved one pixel at a time from left to right. Therefore, for
an image of width N and height H, N patches of dimension
W × H are extracted. Pixels outside the boundaries of the
image are considered to be background pixels.
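As an illustration, the patch extraction step could be sketched as follows (Python/NumPy; the function name, the centering of the window, and the use of 0 as the background value are our assumptions, not specifications from the paper):

    import numpy as np

    def extract_patches(image, W=20):
        # Slide a W-pixel-wide window over a word image of shape (H, N),
        # one pixel at a time, yielding N patches of shape (H, W).
        # Pixels outside the image are treated as background (assumed 0).
        H, N = image.shape
        half = W // 2
        padded = np.pad(image, ((0, 0), (half, W - half)), constant_values=0)
        return [padded[:, x:x + W] for x in range(N)]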
B. Convolutional Restricted Boltzmann Machine
A Restricted Boltzmann Machine (RBM) is a generative
stochastic Artificial Neural Network (ANN), introduced in
1986 under the name Harmonium [24]. It has two layers of
neurons, a visible layer and a hidden layer, without any
connection between units of the same layer, i.e. the neurons
form a bipartite graph. RBMs are designed to learn probability
distributions over input samples. An RBM is trained to
maximize the log-likelihood of the learned input distribution.
Since exact computation of the gradients is intractable,
Markov chain Monte Carlo methods were used to approximate
them, but they are not efficient. Contrastive Divergence (CD)
was later introduced [25] to train RBMs much faster, in a
completely unsupervised manner, i.e. without using labels.
CD trains an RBM in a manner similar to gradient descent
for an ANN. It approximates the log-likelihood gradients by
minimizing the reconstruction error, thus training the model
like an autoencoder.
The simple RBM model can be extended to a Convolutional
Restricted Boltzmann Machine (CRBM) model [18]. By using
convolution to connect layers together, it learns features shared
among all locations in an image, an idea known as weight
sharing [26], [27]. This brings translation invariance to the
learned features. Moreover, this also reduces memory footprint
and improves the performance so that learning is able to scale
to large images. A CRBM can be trained like an RBM, using
a form of convolutional Contrastive Divergence. The proposed
feature extraction system is based on this model.
Fig. 2: Convolutional Restricted Boltzmann Machine
Fig. 3: Convolutional Deep Belief Network features
Figure 2 shows an example of a CRBM. Like the RBM,
it has two layers, the visible layer and the hidden layer. K
convolutional filters connect the two layers. The visible
layer is composed of NV × NV neurons, while the hidden layer
is made of K groups of NH × NH neurons. By definition,
the filters are constrained to a shape of NW × NW (with
NW = NV − NH + 1). While only square two-dimensional filters are
considered in our research, the model allows filters of arbitrary
shape and dimensions.
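To make the training procedure concrete, the following sketch performs one CD-1 update for a single-channel binary CRBM with K filters (Python with NumPy/SciPy). It is a minimal sketch under simplifying assumptions: a single learning rate, no momentum or weight decay, and none of the probabilistic max-pooling machinery of [18].

    import numpy as np
    from scipy.signal import correlate2d, convolve2d

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_step(v, filters, b, c, lr=1e-3):
        # v: (NV, NV) visible patch; filters: (K, NW, NW);
        # b: (K,) hidden biases; c: scalar visible bias.
        K = filters.shape[0]
        # Positive phase: per-group hidden probabilities ('valid' filtering).
        h_pos = np.array([sigmoid(correlate2d(v, filters[k], mode='valid') + b[k])
                          for k in range(K)])
        h_sample = (np.random.rand(*h_pos.shape) < h_pos).astype(float)
        # Reconstruction: 'full' convolution maps hidden groups back to visible.
        v_recon = sigmoid(sum(convolve2d(h_sample[k], filters[k], mode='full')
                              for k in range(K)) + c)
        # Negative phase: hidden probabilities from the reconstruction.
        h_neg = np.array([sigmoid(correlate2d(v_recon, filters[k], mode='valid') + b[k])
                          for k in range(K)])
        # CD-1 gradient approximation: positive minus negative correlations.
        for k in range(K):
            filters[k] += lr * (correlate2d(v, h_pos[k], mode='valid')
                                - correlate2d(v_recon, h_neg[k], mode='valid'))
        b += lr * (h_pos.sum(axis=(1, 2)) - h_neg.sum(axis=(1, 2)))
        c += lr * (v.sum() - v_recon.sum())
        return filters, b, c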
C. Feature Extraction
To extract features from one patch of the image, two CRBMs
are stacked to form a Convolutional Deep Belief Network
(CDBN) [18]. The complete network is trained as a feature
extractor using Contrastive Divergence. This training being
unsupervised, no labels are necessary to train the network.
The network is trained one layer after another. After the first
CRBM has been trained, its weights are frozen and its outputs
are used as the input of the second layer (the second layer
learns from the features extracted by the first layer).
To further improve translation invariance and feature robustness,
and to control overfitting, Convolutional Neural Networks
use pooling layers to shrink the representation by a small
factor. Max pooling computes the maximum activation of the
units in a small region of the feature map. Such a layer shrinks
each dimension of the feature maps by a factor C. In our
research we only considered non-overlapping pooling, i.e. the
stride is equal to the pooling ratio (S=C). In the proposed
network, each CRBM is followed by a Max Pooling layer.
An overview of the complete network used for feature
extraction is shown in Figure 3.
From an image X formed of N sliding window patches,
the features can be extracted using the network as follows.
One patch is passed to the first CRBM layer and its pooled
activation probabilities are computed. They are passed to
the second CRBM layer, from which the pooled activation
probabilities are taken as the final features of the patch. For
the complete image, the features of each patch are combined
in F(X) as a sequence of feature vectors:

F(X) = [CDBN(x1), CDBN(x2), ..., CDBN(xN)]    (1)
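The following sketch shows this pipeline (Python/NumPy). It is illustrative only: hidden_probs is an assumed interface returning the K activation-probability maps of a trained CRBM, not a function of our actual C++ implementation.

    import numpy as np

    def max_pool(fmap, C=2):
        # Non-overlapping C x C max pooling (stride S = C); remainder
        # rows/columns that do not fill a block are dropped.
        H, W = fmap.shape
        H2, W2 = H - H % C, W - W % C
        return fmap[:H2, :W2].reshape(H2 // C, C, W2 // C, C).max(axis=(1, 3))

    def cdbn_features(patch, layer1, layer2, C=2):
        # Pooled activations of layer 1 feed layer 2; the pooled
        # activations of layer 2 form the feature vector of the patch.
        h1 = [max_pool(m, C) for m in layer1.hidden_probs(patch)]
        h2 = [max_pool(m, C) for m in layer2.hidden_probs(np.stack(h1))]
        return np.concatenate([m.ravel() for m in h2])

    def F(X, layer1, layer2):
        # Eq. (1): one feature vector per sliding window patch.
        return [cdbn_features(x, layer1, layer2) for x in X]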
D. Feature Normalization
At each position of the window, the system extracts K
groups of features. Each of these feature groups is normalized
so that their components sum to 1. This normalization process
can be compared to a simple form of local contrast normal-
ization, thus improving the invariance to the writing style.
While global normalization is not crucial when HMMs are
used for word spotting, it is very important when DTW is
used, because DTW is based on Euclidean distances. Therefore,
the final features are normalized so that each feature has zero
mean and unit variance. This proved to perform better than
linear scaling to the [0, 1] interval and significantly improved
performance for DTW, while slightly improving performance
for HMM.
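Both normalization steps can be sketched as follows (Python/NumPy; the epsilon guard against all-zero groups is our own addition):

    import numpy as np

    def normalize_groups(features, K):
        # Local normalization: each of the K feature groups sums to 1.
        groups = np.split(features, K)
        return np.concatenate([g / (g.sum() + 1e-8) for g in groups])

    def standardize(sequences):
        # Global normalization: zero mean and unit variance per dimension,
        # computed over the feature vectors of all images.
        stacked = np.vstack([v for seq in sequences for v in seq])
        mean, std = stacked.mean(axis=0), stacked.std(axis=0) + 1e-8
        return [[(v - mean) / std for v in seq] for seq in sequences]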
III. WORD SPOTTING
Once the features have been extracted for an image X, the
dissimilarity ds(X, K) between the image and the searched
keyword K is computed. In this paper, we compare two
different approaches for word scoring, namely Dynamic Time
Warping (DTW) and Hidden Markov Model (HMM).
The input of the system is a keyword query and a word
image (see Figure 1). For each input, the system must decide
whether the image is the requested keyword or not. The
decision for the image X and keyword K is based on a
threshold over the dissimilarity measure: ds(X, K) < T. The
threshold T can be selected based on a trade-off between
system precision and recall.
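In code, the spotting decision itself is a one-line comparison (the threshold T is tuned on the validation set; this trivial helper is ours):

    def spot(ds_value, T):
        # Accept the word image as an instance of the keyword when the
        # dissimilarity measure falls below the threshold.
        return ds_value < T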
A. Dynamic Time Warping
Dynamic Time Warping (DTW) is an algorithm for finding
an optimal alignment between two sequences of different
length. It is a well-established technique for word spotting [4].
The sequences are warped non-linearly so that they match each
other and their similarity can be measured. The cost of an
alignment is the sum of the distances d(x, y) over all aligned
pairs. We use the squared Euclidean distance.
The DTW distance D(F(X), F(Y)) of two feature vector
sequences F(X) and F(Y) is given by the minimum
alignment cost, found by dynamic programming. A Sakoe-
Chiba band [28] is used to speed up the search and improve
the results. This constraint limits the search for the optimal
alignment to a band of a certain width around
the diagonal. The final distance is normalized with
respect to the warping path (the length of the optimal alignment).
When several examples of the searched keyword are available
in the training set, the example that minimizes the distance
for the current image is selected. The DTW distance over the
features is used as the final dissimilarity measure ds(X, K)
between a word image X and a keyword string K.
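A compact version of this matching step is sketched below (Python/NumPy). The band half-width of 10 is an illustrative value; the paper does not state the band width used.

    import numpy as np

    def dtw_distance(A, B, band=10):
        # DTW between two sequences of feature vectors with squared Euclidean
        # local distance, a Sakoe-Chiba band, and path-length normalization.
        n, m = len(A), len(B)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            center = int(round(i * m / n))  # keep the band on the diagonal
            for j in range(max(1, center - band), min(m, center + band) + 1):
                cost = float(np.sum((A[i - 1] - B[j - 1]) ** 2))
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        # Backtrack to recover the length of the optimal warping path.
        i, j, length = n, m, 0
        while i > 0 or j > 0:
            length += 1
            moves = [(D[i - 1, j - 1], i - 1, j - 1) if min(i, j) > 0 else (np.inf, i, j),
                     (D[i - 1, j], i - 1, j) if i > 0 else (np.inf, i, j),
                     (D[i, j - 1], i, j - 1) if j > 0 else (np.inf, i, j)]
            _, i, j = min(moves, key=lambda t: t[0])
        return D[n, m] / length

    def ds_dtw(X_feats, templates):
        # With several templates of the keyword, the best match is kept.
        return min(dtw_distance(X_feats, t) for t in templates)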
B. Hidden Markov Model
A Hidden Markov Model (HMM) is a statistical learning
model, principally used for sequential pattern recognition
problems such as speech and handwriting recognition. It models
a probability distribution over consecutive feature observations.
For handwriting analysis, HMMs have the advantage that no
explicit character segmentation is needed, neither for training
nor for recognition. Instead, the HMM finds the optimal start and
end positions of the characters during recognition.
For word spotting, we have adopted the approach detailed
in [29]. In this paper, it is applied to word images rather than
line images in order to have the same experimental setup
for DTW and HMM with a focus on comparing different
feature sets. During training, character HMMs are trained
for each character contained in labeled word images. The
Baum-Welch algorithm [30] is used for training the models.
For recognition, a keyword model is created for each query
keyword by connecting character HMMs. This keyword model
is used to compute the log-likelihood score of the input word
image (p(X|K)) with the Viterbi algorithm [30]. A second
unconstrained model (the filler model) is constructed from
the character HMMs to model a word image as an arbitrary
sequence of characters. The log-likelihood obtained from the
filler model (p(X|F)) indicates the general conformance of
the word image to the trained models. The filler score is used
to normalize the keyword score with respect to the writing
style, allowing a better generalization for unknown writing
styles. This is achieved by subtracting the filler score from the
keyword score, which is also known as log-odds scoring [31].
Finally, the log-likelihood score is normalized with respect to
the keyword length in pixels (LK):

ds(X, K) = (p(X|K) − p(X|F)) / LK    (2)
This score is used as the final dissimilarity measure
ds(X, K) for word spotting. Our HMM implementation is
based on the HTK toolkit (http://htk.eng.cam.ac.uk).
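Given the two Viterbi log-likelihood scores produced by HTK, the normalization of Eq. (2) is a one-liner (a sketch; depending on the sign convention, the score may need to be negated to act as a dissimilarity):

    def hmm_score(log_p_keyword, log_p_filler, keyword_width_px):
        # Eq. (2): log-odds of keyword vs. filler model, normalized by
        # the keyword length in pixels.
        return (log_p_keyword - log_p_filler) / keyword_width_px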
IV. EXPERIMENTAL EVALUATION
To evaluate the features extracted by the proposed system,
the keyword spotting performance was measured on three
different data sets: one multiple-writer data set (IAM) and two
single-writer data sets (GW and PAR). Using the DTW and
HMM approaches, the performance of the proposed system
is compared to that of three different reference
feature sets (see Section IV-B).
For each data set, the system uses the normalized word
images, ground truth, keywords, training sets, validation sets
and test sets made available by [29].
A. Data sets
The IAM off-line database (IAM) is made of 1539
pages of modern English text from the Lancaster-Oslo/Bergen
(LOB) corpus [32], written by 657 writers. Each of the subsets
for training, validation and testing, respectively, contains text
lines from a different set of writers. Hence, the main challenge
on this data set is retrieving keywords in writing styles
unknown during training. It contains 70871 word images.
The George Washington (GW) data set [33] consists
of 20 pages of letters written in English by George Washington
and his associates. The writing style being very consistent, it
is considered a single-writer data set. It is made of 4894
word images. Due to the small number of available samples, a
four-fold cross-validation is used for the experimental evaluation.
The presented results for this data set are averaged over the
four cross-validation runs.
The Parzival data set (PAR) [34] contains 45 pages of a
medieval manuscript written in ink in the 13th century in
Middle High German. The manuscript contains
the epic poem Parzival, an important work of the European
Middle Ages. The different writing styles observed in the data set
are very similar, thus the data set is also considered a single-
writer data set. It contains 23485 word images.
B. Reference feature sets
We compare the features extracted by the proposed system
with three different feature sets known to work well for
keyword spotting. Marti2001 [14] is a well-established
heuristic feature set and has been used repeatedly for
keyword spotting. It is made of nine geometrical features
per column of the image. Rodriguez2008 [6] uses local
gradient histogram features, inspired by the SIFT descriptor,
with overlapping sliding windows. At each position, the window
is divided into a grid and histograms of orientations are
accumulated for each cell. Terasawa2009 [7] proposed a
slit-style Histogram of Gradients (HOG) feature. This is a
modification of the standard HOG feature using no horizontal
overlap and narrower windows. Moreover, it uses
the signed gradient instead of the unsigned gradient.
C. Performance Evaluation
For evaluation, a set of keywords is spotted on the test set
of each data set. Two different scenarios are considered for
performance evaluation. In the local scenario, a local threshold
is used for each keyword, to measure the Mean Average
Precision (MAP). The global scenario uses a single global
threshold to measure the Average Precision (AP). These two
values are used to assess the overall system performance. They
are computed using the trec_eval software (http://trec.nist.gov/trec_eval) [29].
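For reference, the average precision of one ranked result list can be computed as follows (a sketch of the standard measure; trec_eval additionally handles ties and interpolation details). MAP averages this value over the keywords (local thresholds), while the global AP is computed on a single ranking pooled over all keywords.

    import numpy as np

    def average_precision(scores, labels):
        # scores: dissimilarities ds(X, K); labels: 1 if the word image
        # matches the keyword, else 0. Lower dissimilarity ranks first.
        order = np.argsort(scores)
        labels = np.asarray(labels)[order]
        hits = np.cumsum(labels)
        ranks = np.arange(1, len(labels) + 1)
        relevant = labels == 1
        return (hits[relevant] / ranks[relevant]).mean() if relevant.any() else 0.0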
D. System setup
The training parameters and the structure of the different
networks have been optimized for the task. For each data set,
the parameters have been optimized individually with respect
to the performance on the validation set.
Each network is made of two CRBM layers, each being
followed by a Max Pooling layer. The pooling ratio for each
layer has been set to 2 (C = 2). Each extracted patch is 20
pixels wide (W = 20).
TABLE I: Mean Average Precision (MAP) and Average Precision (AP)
for the different features with DTW. The relative improvement over the
best baseline is also mentioned.

                     GW             PAR             IAM
System            AP     MAP     AP     MAP     AP     MAP
Marti2001       33.24   45.26  50.67   46.78   5.10   13.57
Rodriguez2008   41.20   63.39  55.82   47.52   0.80    9.73
Terasawa2009    43.76   64.80  69.10   73.49   0.56    9.55
Proposed        56.98   68.64  72.71   72.38   1.04   10.27
Improvement    23.20%   5.59%  4.96%  -1.53%     -       -
TABLE II: Mean Average Precision (MAP) and Average Precision (AP)
for the different features with HMM. The relative improvement over the
best baseline is also mentioned.

                     GW             PAR             IAM
System            AP     MAP     AP     MAP     AP     MAP
Marti2001       48.80   69.42  69.47   77.98  16.67   49.24
Rodriguez2008   32.60   59.40  25.43   32.53   5.47   21.11
Terasawa2009    68.01   79.49  90.50   90.53  59.66   71.59
Proposed        71.21   85.06  92.34   94.57  64.68   72.36
Improvement     4.49%   6.54%  1.99%   4.27%  7.76%   1.06%
The GW network is made of 8 filters of size 9×9 in the first
CRBM layer, followed by 8 filters of size 3×3 in the second.
The PAR and IAM networks have 12 9×9 filters followed
by 12 3×3 filters. Each network has been trained for 25
epochs with Contrastive Divergence, using mini-batch training
and a batch size of 64 for GW and 128 for PAR and IAM.
Before each epoch, the order of the inputs is randomized. L2
weight decay [35] has been applied to every weight to improve
generalization. The hidden biases are initialized to 0.1, the
visible biases to 0.0, and a zero-mean normal distribution with
a variance of 0.01 is used to initialize the convolutional filters.
The parameters for the reference feature sets have been
obtained from their respective published research [6], [7], [14].
The HMMs used for evaluation use 3 Gaussian mixtures for
GW, 5 for PAR and 7 for IAM. The number of states for
each character model has been optimized with respect to the
mean width of the letters, found with forced alignment using
an HMM, as proposed in [36].
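For reference, the per-data-set settings reported above can be summarized as follows (the dictionary layout is ours; all values are taken from the text):

    # Two CRBM layers, each followed by 2x2 max pooling; filters are
    # given as (count, height, width).
    SETUP = {
        "GW":  {"layer1": (8, 9, 9),  "layer2": (8, 3, 3),  "batch": 64,  "mixtures": 3},
        "PAR": {"layer1": (12, 9, 9), "layer2": (12, 3, 3), "batch": 128, "mixtures": 5},
        "IAM": {"layer1": (12, 9, 9), "layer2": (12, 3, 3), "batch": 128, "mixtures": 7},
    }
    COMMON = {"patch_width": 20, "pooling": 2, "epochs": 25,
              "hidden_bias": 0.1, "visible_bias": 0.0,
              "filter_init_std": 0.1}  # std 0.1 = variance 0.01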
V. RESULTS AND DISCUSSION
A. Results
The experimental results are presented in Table I and
Table II for DTW and HMM, respectively.
In the global scenario (one single threshold for all the key-
words), our system clearly outperforms all reference feature
sets. In the local scenario, except on PAR when using DTW,
the proposed system also outperforms the reference features.
In the following discussion, the relative improvements are dis-
cussed with respect to Terasawa2009, which has performed
best among the reference features. The very low performance
achieved on IAM with DTW for all feature sets is due to the
fact that template matching fails when the tested writing
styles are not present in the available templates. A performance
comparison in this scenario is therefore not conclusive.
Although the relative improvements are not large, it is
important to note that the Terasawa2009 baseline is already
performing very well for both AP and MAP. In the case of
PAR with HMM, the results are excellent and even small
improvements are already very interesting.
Overall, the proposed features exhibit excellent performance
and are very stable from one data set to another.
Although the data sets are very different, the results are quite
similar, while the performance of the handcrafted features
differs more across the data sets. This demonstrates an
advantage of the unsupervised feature learning system over
handcrafted features, which are harder to generalize across
data sets, although the Terasawa2009 features prove
more resilient to change than the other baselines.
B. System Optimization
While our system performs reasonably well under all
tested conditions, its optimization was challenging. Neural
networks are known to be complex to configure and tune,
mainly due to the high number of free parameters they involve.
Moreover, it was necessary to optimize the model to be able
to handle both DTW and HMM as word scoring techniques,
both being very different in their capabilities and limits.
Especially for DTW, the number of outputs is a crucial
parameter. Having too many features to compute the distance
between two aligned pairs in DTW may result in a decrease
in performance. Networks with one, two and three layers have
been evaluated. Single-layer models learned only low-level
features and produced too many features for DTW
to process. Three-layer models failed to generalize, probably
due to the small size and complexity of the input patches.
Therefore, two-layer CDBNs were selected.
Another important factor is the patch width. In every case,
the optimal patch width has been experimentally found to be
20 pixels. While wider patches increased the training time
without increasing the performance for both word spotting
techniques, narrower patches have proven highly unsuccessful.
The number of filters in each layer (K) has to be kept
relatively small for DTW to perform well, contrary to standard
CNNs, for which it generally ranges from 50 to 400 per
layer. Indeed, while increasing the number of filters generally
increases the network learning capacity, it also increases the
dimensionality of the features, which decreased the performance
with DTW. The HMM technique is less susceptible to this
problem, although a higher dimensionality does increase the
training and evaluation times. Optimizing the network for
HMM with higher K could potentially lead to better performance.
For the GW data set, standard binary hidden units proved to
be the best unit type. However, for PAR and IAM, Rectified
Linear Units (ReLU) [37] were experimentally found to be
superior for hidden units. On average, considering the data
sets and the different word scoring techniques, they improved
the AP by 13% and the MAP by 11%, on the validation set.
They still performed well on the GW data set, but were 5% to
8% less effective than binary hidden units, depending
on the conditions. This may indicate that the small number of
samples available in the GW data set was not enough to learn
generic features with ReLUs.
TABLE III: Average time, in seconds, to evaluate one cross-validation
test set of GW, and dimensionality of the features.

Method   Marti2001   Rodriguez2008   Terasawa2009   Proposed
DTW          27.16           65.05          87.60      43.90
HMM          30.16          461.97        1063.11     171.74
Dim.             9             128            384        168
On the other hand, the large number of patches extracted
from the PAR and IAM data sets made ReLUs more effective.
This could also indicate that the binary units overfit on the
larger data sets.
When ReLUs were not used, enforcing sparsity of the binary
hidden units significantly improved the performance (21% on
average with DTW and 18% with HMM, on the validation
set). When sparsity was not enforced, the binary features were
highly tied to the training set and not generic enough. To
enforce sparsity, the regularization method of Lee et al. [17]
was used. This method updates the hidden biases after each
gradient update to ensure that the target sparsity is reached,
with a certain sparsity learning rate. To let the network learn,
the sparsity parameters have to be chosen by carefully considering
both the final sparsity of the features and their performance
for the task.
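A minimal version of this sparsity update is sketched below (Python/NumPy; the target sparsity and the sparsity learning rate are illustrative values, the paper tunes them per data set):

    def sparsity_update(b, h_prob, target=0.05, sparsity_lr=0.01):
        # Push the mean activation of each hidden group toward the target,
        # in the spirit of the regularization of Lee et al. [17].
        # h_prob: (K, NH, NH) hidden activation probabilities.
        mean_act = h_prob.mean(axis=(1, 2))
        b += sparsity_lr * (target - mean_act)
        return b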
C. Runtime performance
Table III presents the time necessary for the evaluation on
one of the cross validation sets of the GW database for each
feature set and for each classifier. The time is dominated by
the classification itself rather than by the feature extraction. Both
DTW and HMM evaluation times depend on the dimensionality
of the features, with HMM being affected more strongly as the
dimensionality grows. For DTW, the evaluation is twice
as fast with our method as with Terasawa2009, and for
HMM, it is six times faster.
If we consider only feature extraction itself, our system is
almost 30 times slower than Marti2001, but it is comparable
to the more advanced features, being about 40% faster than
Terasawa2009 and 47% faster than Rodriguez2008.
While the proposed features are more efficient for testing
than other advanced features, they need to be trained. Training
the network for GW took about 4 hours to complete. However,
this is only necessary once per data set.
VI. CONCLUSION AND FUTURE WORK
A feature extraction system using Convolutional Deep Be-
lief Networks was presented for handwritten keyword spotting.
The proposed system performs unsupervised feature learning
on sliding window patches extracted from word images. The
deep learning features were experimentally evaluated using
HMM and DTW for word spotting and were compared with
three standard feature sets on three benchmark data sets. The
proposed system clearly outperformed the different baselines,
exhibiting robust performance under all tested conditions.
However, optimizing the network architecture and parame-
ters was non-trivial. In this paper, we present a configuration
that has proven stable on diverse data sets. Nevertheless, we
believe that there is still room for improvements regarding the
network design choices.
Future work could go in several directions. While the net-
works are quite similar between the different data sets, it would
still be very interesting to find a single configuration that
performs equally well under all conditions. Using grayscale
images instead of binary images could also lead to stronger
features. Finally, augmenting the data set with synthetic dis-
tortions may also lead to even more robust features.
IMPLEMENTATION
The C++ implementations of the proposed system
(https://github.com/wichtounet/word_spotting/tree/paper_v3) and our
CDBN library (https://github.com/wichtounet/dll) are freely available online.
REFERENCES
[1] A. Vinciarelli, “A survey on off-line cursive word recognition,” Pattern
recognition, vol. 35, pp. 1433–1446, 2002.
[2] R. Manmatha, C. Han, and E. M. Riseman, “Word spotting: A new
approach to indexing handwriting,” in Proceedings of the IEEE Conf.
on Computer Vision and Pattern Recognition. IEEE, 1996, pp. 631–637.
[3] T. M. Rath and R. Manmatha, “Word image matching using Dynamic
Time Warping,” in Proceedings of the IEEE Conf. on Computer Vision
and Pattern Recognition, vol. 2. IEEE, 2003, pp. 521–527.
[4] ——, “Word spotting for historical documents,” Int. Journal of Docu-
ment Analysis and Recognition (IJDAR), vol. 9, pp. 139–152, 2007.
[5] T. Adamek, N. E. O’Connor, and A. F. Smeaton, “Word matching using
single closed contours for indexing handwritten historical documents,”
Int. Journal of Document Analysis and Recognition, vol. 9, pp. 153–165,
2007.
[6] J. A. Rodríguez and F. Perronnin, “Local gradient histogram features for
word spotting in unconstrained handwritten documents,” in Proceedings
of the Int. Conf. on Frontiers in Handwriting Recognition, 2008, pp.
7–12.
[7] K. Terasawa and Y. Tanaka, “Slit style HOG feature for document image
word spotting,” in Proceedings of the IEEE Int. Conf. on Document
Analysis and Recognition. IEEE, 2009, pp. 116–120.
[8] M. Rusiñol, D. Aldavert, R. Toledo, and J. Lladós, “Browsing het-
erogeneous document collections by a segmentation-free word spotting
method,” in Document Analysis and Recognition (ICDAR), 2011 Inter-
national Conference on. IEEE, 2011, pp. 63–67.
[9] L. Rothacker, M. Rusinol, and G. A. Fink, “Bag-of-features HMMs
for segmentation-free word spotting in handwritten documents,” in
Document Analysis and Recognition (ICDAR), 2013 12th International
Conference on. IEEE, 2013, pp. 1305–1309.
[10] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, “Segmentation-free
word spotting with exemplar SVMs,” Pattern Recognition, vol. 47, no. 12,
pp. 3967–3978, 2014.
[11] J. A. Rodríguez-Serrano and F. Perronnin, “Handwritten word-spotting
using hidden Markov models and universal vocabularies,” Pattern Recog-
nition, vol. 42, pp. 2106–2116, 2009.
[12] A. Fischer, E. Indermühle, H. Bunke, G. Viehhauser, and M. Stolz,
“Ground truth creation for handwriting recognition in historical docu-
ments,” in Proceedings of the IAPR Int. Workshop on Document Analysis
Systems. ACM, 2010, pp. 3–10.
[13] A. H. Toselli and E. Vidal, “Fast HMM-filler approach for key word
spotting in handwritten documents,” in Document Analysis and Recog-
nition (ICDAR), 2013 12th International Conference on. IEEE, 2013,
pp. 501–505.
[14] U.-V. Marti and H. Bunke, “Using a statistical language model to
improve the performance of an HMM-based cursive handwriting recog-
nition system,” Int. Journal of Pattern Recognition and Artificial Intel-
ligence, vol. 15, pp. 65–90, 2001.
[15] V. Frinken, A. Fischer, R. Manmatha, and H. Bunke, “A novel word
spotting method based on recurrent neural networks,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 34, pp. 211–224,
2012.
[16] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of
data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507,
2006.
[17] H. Lee, C. Ekanadham, and A. Y. Ng, “Sparse Deep Belief Net
Model for Visual Area V2,” in Proceedings of the Advances in Neural
Information Processing Systems, 2008, pp. 873–880.
[18] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional
Deep Belief Networks for scalable unsupervised learning of hierarchical
representations,” in Proceedings of the Int. Conf. on Machine Learning.
ACM, 2009, pp. 609–616.
[19] B. Wicht and J. Hennebert, “Camera-based Sudoku recognition with
Deep Belief Network,” in Proceedings of the IEEE Int. Conf. on Soft
Computing and Pattern Recognition. IEEE, 2014, pp. 83–88.
[20] ——, “Mixed handwritten and printed digit recognition in Sudoku with
Convolutional Deep Belief Network,” in Proceedings of the IEEE Int.
Conf. on Document Analysis and Recognition. IEEE, 2015.
[21] P. Sermanet, S. Chintala, and Y. LeCun, “Convolutional neural networks
applied to house numbers digit classification,” in Pattern Recognition
(ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 3288–
3291.
[22] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, “End-to-end text
recognition with convolutional neural networks,” in Pattern Recognition
(ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 3304–
3308.
[23] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
L. Fei-Fei, “Large-scale video classification with convolutional neural
networks,” in Proceedings of the IEEE conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732.
[24] P. Smolensky, “Information processing in dynamical systems: Founda-
tions of harmony theory,” Parallel Distributed Processing, vol. 1, pp.
194–281, 1986.
[25] G. E. Hinton, “Training Products of Experts by minimizing Contrastive
Divergence,” Neural Computation, vol. 14, pp. 1771–1800, 2002.
[26] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng, “Shift-invariant sparse coding for audio
classification,” in Proceedings of the Conf. on Uncertainty in Artificial
Intelligence, 2007.
[27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten
zip code recognition,” Neural computation, vol. 1, pp. 541–551, 1989.
[28] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization
for spoken word recognition,” IEEE Transactions on Acoustics Speech
and Signal Processing, vol. 26, pp. 43–49, 1978.
[29] A. Fischer, A. Keller, V. Frinken, and H. Bunke, “Lexicon-free handwrit-
ten word spotting using character HMMs,” Pattern Recognition Letters,
vol. 33, pp. 934–942, 2012.
[30] L. R. Rabiner, “A tutorial on hidden Markov models and selected
applications in speech recognition,” Proceedings of the IEEE, vol. 77,
no. 2, pp. 257–286, 1989.
[31] C. Barrett, R. Hughey, and K. Karplus, “Scoring hidden Markov
models,” Computer applications in the biosciences: CABIOS, vol. 13,
no. 2, pp. 191–199, 1997.
[32] S. Johansson, G. N. Leech, and H. Goodluck, Manual of Information
to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for
Use with Digital Computer. Department of English, University of Oslo,
1978.
[33] V. Lavrenko, T. M. Rath, and R. Manmatha, “Holistic word recogni-
tion for handwritten historical documents,” in Proceedings of the Int.
Workshop on Document Image Analysis for Libraries. IEEE, 2004, pp.
278–287.
[34] A. Fischer, M. Wüthrich, M. Liwicki, V. Frinken, H. Bunke,
G. Viehhauser, and M. Stolz, “Automatic transcription of handwritten
medieval documents,” in Proceedings of the IEEE Int. Conf. on Virtual
Systems and Multimedia. IEEE, 2009, pp. 137–142.
[35] G. E. Hinton, “A practical guide to training restricted Boltzmann
machines,” in Neural Networks: Tricks of the Trade. Springer, 2012,
pp. 599–619.
[36] S. Günter and H. Bunke, “Optimizing the number of states, training iter-
ations and Gaussians in an HMM-based handwritten word recognizer,”
in Proceedings of the Seventh International Conference on Document
Analysis and Recognition. IEEE, 2003, pp. 472–476.
[37] V. Nair and G. E. Hinton, “Rectified Linear Units improve Restricted
Boltzmann Machines,” in Proceedings of the Int. Conf. on Machine
Learning, 2010, pp. 807–814.