Chapter

Active Interaction and Learning in Handwritten Text Transcription


Abstract

Computer-assisted systems are being increasingly used in a variety of real-world tasks, though their application to handwritten text transcription of old manuscripts remains largely unexplored. The basic idea explored in this chapter is to follow a sequential, line-by-line transcription of the whole manuscript, in which a continuously retrained system interacts with the user to transcribe each new line efficiently. User interaction is expensive in terms of time and cost, so our top priority is to make the most of each interaction while reducing the number of interactions as much as possible. To this end, we study three different frameworks: (a) improving the recognition system from newly recognized transcriptions via semi-supervised adaptation techniques; (b) studying how to adapt best from limited user supervision, which is related to active learning; and (c) developing a simple error estimate that lets the user adjust the error rate in a computer-assisted transcription task. In addition, we test these approaches on the sequential transcription of two old text documents.
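The three frameworks come together in a single interactive loop over the manuscript's text lines. The following Python sketch is only an illustration of that loop, not code from the chapter; the recognizer and user objects, their methods, and the confidence threshold are hypothetical placeholders.

```python
# Illustrative sketch of the line-by-line interactive transcription loop.
# `recognizer` and `user` are hypothetical objects, not APIs from the chapter.

def transcribe_manuscript(line_images, recognizer, user, conf_threshold=0.8):
    """Sequentially transcribe text-line images, retraining as we go."""
    transcriptions = []
    for image in line_images:
        words, confidences = recognizer.recognize(image)
        # Frameworks (b)/(c): spend limited supervision only on words the
        # system is unsure about, using confidences as an error estimate.
        for i, conf in enumerate(confidences):
            if conf < conf_threshold:
                words[i] = user.correct(image, position=i, hypothesis=words[i])
        transcriptions.append(words)
        # Framework (a): adapt image and language models with the new,
        # partially supervised transcription before the next line.
        recognizer.adapt(image, words)
    return transcriptions
```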


... Active learning is well motivated in many modern machine learning problems, where unlabeled examples may be abundant but obtaining their classes is difficult, time-consuming, or expensive [1,25]. Active learning has been applied in several fields, such as speech recognition [25,26], information extraction [27–30], robotics [31], transcription of text images [32,33], object classification [34–38,68], biometrics [65], image segmentation [66], and clustering [67]; in general, it has also been used for parameter selection [39]. ...
Article
Full-text available
DEBORA (Digital AccEss to BOoks of the RenAissance) is a multidisciplinary European project aiming at digitizing rare sixteenth-century books and thus making them more accessible. End-users, librarians, historians, researchers in book history, and computer scientists participated in the development of remote and collaborative access to digitized Renaissance books, which is necessary because digital libraries in image mode are poorly accessible through the Internet. The size of image files, the lack of a standard exchange file format suitable for progressive transmission, and limited querying possibilities currently limit remote access to digital libraries. To improve accessibility, historical documents must be digitized and retro-converted to extract a detailed description of the image contents suited to users' needs. Specialists of the Renaissance have described the metadata generally required by end-users and the ideal functionalities of the digital library. The retro-conversion of historical documents is a complex process that includes image capture, metadata extraction, image storage and indexing, automatic conversion to a reusable electronic form, publication on the Internet, and data compression for faster remote access. The steps of this process cannot be developed independently. DEBORA proposes a global approach to retro-conversion, from digitization to the final functionalities of the digital library, centered on users' needs. The retro-conversion process is mainly based on a document image analysis system that simultaneously extracts the metadata and compresses the images. We also propose a file format to describe compressed books as heterogeneous data (images/text/links/annotation/physical layout and logical structure) suitable for progressive transmission, editing, and annotation. DEBORA is an exploratory project that aims at demonstrating the feasibility of these concepts by developing prototypes tested by end-users.
Conference Paper
Full-text available
An effective approach to transcribing old text documents is to follow an interactive-predictive paradigm in which the system is guided by the human supervisor and the supervisor is assisted by the system, so as to complete the transcription task as efficiently as possible. In this paper, we focus on a particular system prototype called GIDOC, which can be seen as a first attempt to provide user-friendly, integrated support for interactive-predictive page layout analysis, text line detection, and handwritten text transcription. More specifically, we focus on the handwriting recognition part of GIDOC, for which we propose the use of confidence measures to guide the human supervisor in locating possible system errors and deciding how to proceed. Empirical results are reported on two datasets, showing that a word error rate no larger than 10% can be achieved by checking only the 32% of words recognised with the least confidence.
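As a rough illustration of this supervision strategy (not GIDOC code), the sketch below selects a fixed fraction of the least-confident recognized words for human checking; the word list and confidence values are made-up examples.

```python
# Illustrative only: pick the least-confident fraction of recognized words
# for human supervision, echoing the 32%-checked / <=10% WER trade-off above.

def words_to_supervise(words, confidences, fraction=0.32):
    """Return indices of the `fraction` least-confident words."""
    n_check = max(1, round(fraction * len(words)))
    ranked = sorted(range(len(words)), key=lambda i: confidences[i])
    return sorted(ranked[:n_check])

hypothesis = ["en", "vn", "lugar", "dela", "mancha"]
confidence = [0.95, 0.40, 0.90, 0.55, 0.85]
print(words_to_supervise(hypothesis, confidence))  # -> [1, 3]
```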
Conference Paper
Full-text available
An effective approach to transcribing handwritten text documents is to follow an interactive-predictive paradigm in which the system is guided by the user and the user is assisted by the system, so as to complete the transcription task as efficiently as possible. This approach has recently been implemented in a system prototype called GIDOC, in which standard speech technology is adapted to handwritten text (line) images: HMM-based text image modelling, n-gram language modelling, and confidence measures on recognized words. Confidence measures are used to assist the user in locating possible transcription errors, so that system output can be validated after supervising only those (few) words for which the system is not highly confident. Here, we study the effect of using these partially supervised transcriptions on the adaptation of image and language models to the task.
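One way to picture this kind of adaptation (a sketch under assumptions, not the paper's method): when updating an n-gram language model from a partially supervised transcription, user-verified words can be trusted outright, while unverified words count only if recognized with high confidence.

```python
# Hedged sketch: update bigram counts from a partially supervised
# transcription. `verified` flags user-checked words; unverified words are
# trusted only above `min_conf`. Not the paper's actual procedure.
from collections import Counter

def update_bigram_counts(counts, words, verified, confidences, min_conf=0.9):
    trusted = [v or c >= min_conf for v, c in zip(verified, confidences)]
    for i in range(len(words) - 1):
        if trusted[i] and trusted[i + 1]:
            counts[(words[i], words[i + 1])] += 1
    return counts

counts = Counter()
update_bigram_counts(counts, ["en", "un", "lugar"],
                     [True, False, True], [1.0, 0.95, 1.0])
print(counts)  # Counter({('en', 'un'): 1, ('un', 'lugar'): 1})
```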
Conference Paper
Full-text available
Annotation of digitized pages from historical document collections is very important to research on automatic extraction of text blocks, lines, and handwriting recognition. We have recently introduced a new handwritten text database, GERMANA, which is based on a Spanish manuscript from 1891. To our knowledge, GERMANA is the first publicly available database mostly written in Spanish and comparable in size to standard databases. In this paper, we present another handwritten text database, RODRIGO, completely written in Spanish and comparable in size to GERMANA. However, RODRIGO comes from a much older manuscript, from 1545, where the typical difficult characteristics of historical documents are more evident. In particular, the writing style, which has clear Gothic influences, is significantly more complex than that of GERMANA. We also provide baseline results of handwriting recognition for reference in future studies, using standard techniques and tools for preprocessing, feature extraction, HMM-based image modelling, and language modelling.
Article
Full-text available
Since their first inception more than half a century ago, automatic reading systems have evolved substantially, showing impressive performance on machine-printed text. The recognition of handwriting, however, can still be considered an open research problem due to its substantial variation in appearance. With the introduction of Markovian models to the field, a promising modeling and recognition paradigm was established for automatic offline handwriting recognition. However, no standard procedures for building Markov-model-based recognizers have been established so far, though trends toward unified approaches can be identified. It is therefore the goal of this survey to provide a comprehensive overview of the application of Markov models in the research field of offline handwriting recognition, covering both the widely used hidden Markov models and the less complex Markov-chain or n-gram models. First, we introduce the typical architecture of a Markov-model-based offline handwriting recognition system and familiarize the reader with the essential theoretical concepts behind Markovian models. Then, we give a thorough review of the solutions proposed in the literature for the open problem of how to apply Markov-model-based approaches to automatic offline handwriting recognition.
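To make the central inference step concrete, here is a compact Viterbi decoder for a discrete HMM. This is a generic toy example, not code from the survey; real handwriting recognizers emit continuous feature vectors rather than symbols, and the model parameters below are invented.

```python
# Compact Viterbi decoding for a discrete HMM, the core inference step in
# Markov-model-based recognizers. Toy example only.
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence for observation indices `obs`.
    pi: (N,) initial probs; A: (N, N) transitions; B: (N, M) emissions."""
    T, N = len(obs), len(pi)
    logd = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)   # score of moving state i -> j
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 1], pi, A, B))  # -> [0, 1, 1]
```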
Article
Full-text available
For large-vocabulary continuous speech recognition systems, the amount of acoustic training data is of crucial importance. In the past, large amounts of speech were recorded from various sources and had to be transcribed manually, so it is desirable to train a recognizer with as little manually transcribed acoustic data as possible. Since untranscribed speech is available in various forms nowadays, this paper studies the unsupervised training of a speech recognizer on recognized transcriptions. A low-cost recognizer trained with between one and six hours of manually transcribed speech is used to recognize 72 hours of untranscribed acoustic data. These transcriptions are then used, in combination with a confidence measure, to train an improved recognizer. The effect of the confidence measure, which is used to detect possible recognition errors, is studied systematically. Finally, the unsupervised training is applied iteratively. Starting with only one hour of transcribed acoustic data, a recognition system is trained fully automatically. With this iterative training procedure, the word error rates are reduced from 71.3% to 38.3% on the Broadcast News '96 evaluation test set and from 65.6% to 29.3% on the Broadcast News '98 evaluation test set. In comparison with an optimized system trained on the manually generated transcriptions of the complete 72-hour training corpus, the word error rates increase by 14.3% and 18.6% relative, respectively.
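Schematically, this iterative procedure resembles the self-training loop below. It is a sketch under assumptions, not the paper's implementation: the recognizer object and its recognize/retrain methods are hypothetical placeholders.

```python
# Schematic self-training loop with confidence filtering. `recognizer` is a
# hypothetical placeholder; `seed_data` is the small manually transcribed
# corpus the procedure starts from.

def self_train(recognizer, untranscribed, seed_data, min_conf=0.7, rounds=3):
    for _ in range(rounds):
        auto_labeled = []
        for utterance in untranscribed:
            words, confidences = recognizer.recognize(utterance)
            # keep only words whose confidence suggests they are correct
            kept = [w for w, c in zip(words, confidences) if c >= min_conf]
            if kept:
                auto_labeled.append((utterance, kept))
        # retrain on the manual seed plus the confidently self-labeled data
        recognizer = recognizer.retrain(list(seed_data) + auto_labeled)
    return recognizer
```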
Article
Full-text available
There is a huge number of historical documents in libraries and in various National Archives that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication, and extraction of specific fields are in use today. For all these tasks, a major step is document segmentation into text lines. Because of the low quality and the complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. The objective of this paper is to present a survey of existing methods developed during the last decade and dedicated to documents of historical interest.
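As one concrete example of the kind of method such surveys cover, the horizontal projection profile is a classical baseline for line segmentation. The sketch below is a simplified illustration only and deliberately ignores the skew, touching lines, and background noise that make historical documents hard.

```python
# Classical projection-profile baseline for text line segmentation; a toy
# illustration, not a method from the survey itself.
import numpy as np

def segment_lines(binary_page, min_ink=5):
    """binary_page: 2-D array with 1 = ink. Returns (top, bottom) row spans,
    one per detected text line."""
    profile = binary_page.sum(axis=1)        # ink pixels per image row
    lines, in_line, top = [], False, 0
    for row, ink in enumerate(profile):
        if ink >= min_ink and not in_line:   # line starts
            in_line, top = True, row
        elif ink < min_ink and in_line:      # line ends
            in_line = False
            lines.append((top, row))
    if in_line:                              # line touching the page bottom
        lines.append((top, len(profile)))
    return lines
```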
Article
The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer labeled training instances if it is allowed to choose the data from which it learns. An active learner may ask queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator). Active learning is well motivated in many modern machine learning problems, where unlabeled data may be abundant but labels are difficult, time-consuming, or expensive to obtain. This report provides a general introduction to active learning and a survey of the literature, including a discussion of the scenarios in which queries can be formulated and an overview of the query strategy frameworks proposed in the literature to date. An analysis of the empirical and theoretical evidence for active learning, a summary of several problem-setting variants, and a discussion of related topics in machine learning research are also presented.
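Among the query strategy frameworks such surveys discuss, uncertainty sampling is the simplest. The sketch below is only an illustration with made-up posterior estimates: it queries the instance whose most probable label the current model is least sure about.

```python
# Least-confident uncertainty sampling, the simplest query strategy: ask
# the oracle about the instance whose top predicted label is least certain.
import numpy as np

def least_confident_query(posteriors):
    """posteriors: (n_instances, n_classes) class-probability estimates.
    Returns the index of the instance to query next."""
    top_class_prob = posteriors.max(axis=1)
    return int(top_class_prob.argmin())

posteriors = np.array([[0.90, 0.10],
                       [0.55, 0.45],   # least certain -> queried
                       [0.80, 0.20]])
print(least_confident_query(posteriors))  # -> 1
```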
Article
This paper investigates various ensemble methods for offline handwritten text line recognition. To obtain ensembles of recognisers, we implement bagging, random feature subspace, and language model variation methods. For the combination, the word sequences returned by the individual ensemble members are first aligned. Then a confidence-based voting strategy determines the final word sequence. A number of confidence measures based on normalised likelihoods and alternative candidates are evaluated. Experiments show that the proposed ensemble methods can improve the recognition accuracy over an optimised single reference recogniser.
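The combination step can be pictured with the minimal sketch below, which assumes the members' word sequences have already been aligned (the alignment itself is omitted) and uses summed confidences as one plausible voting rule; the data are made-up examples, not the paper's scheme in detail.

```python
# Sketch of confidence-based voting after alignment: each position is
# decided by summing member confidences per candidate word.
from collections import defaultdict

def confidence_vote(aligned_members):
    """aligned_members: one [(word, confidence), ...] list per recogniser,
    all of equal length after alignment. Returns the winning word sequence."""
    result = []
    for position in zip(*aligned_members):
        scores = defaultdict(float)
        for word, conf in position:
            scores[word] += conf          # sum member confidences per word
        result.append(max(scores, key=scores.get))
    return result

members = [[("the", 0.9), ("cat", 0.4)],
           [("the", 0.8), ("cart", 0.7)],
           [("he", 0.3), ("cat", 0.5)]]
print(confidence_vote(members))  # -> ['the', 'cat']
```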
Conference Paper
Information extraction methods can be used to automatically "fill in" database forms from unstructured data such as Web documents or email. State-of-the-art methods have achieved low error rates but invariably make a number of errors. The goal of an interactive information extraction system is to assist the user in filling in database fields while giving the user confidence in the integrity of the data. The user is presented with an interactive interface that allows both the rapid verification of automatic field assignments and the correction of errors. In cases where there are multiple errors, our system takes user corrections into account and immediately propagates these constraints, so that other fields are often corrected automatically. Linear-chain conditional random fields (CRFs) have been shown to perform well for information extraction and other language modelling tasks due to their ability to capture arbitrary, overlapping features of the input in a Markov model. We apply this framework with two extensions: a constrained Viterbi decoding, which finds the optimal field assignments consistent with the fields explicitly specified or corrected by the user, and a mechanism for estimating the confidence of each extracted field, so that low-confidence extractions can be highlighted. Both of these mechanisms are incorporated in a novel user interface for form filling that is intuitive and speeds data entry, providing a 23% reduction in error due to automated corrections.
Interactive information extraction with constrained conditional random fields
  • T. Kristjannson
  • A. Culotta
  • P. Viola
  • A. McCallum
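To illustrate the constrained decoding idea, here is a toy sketch with hand-set log-potentials, not the paper's trained CRF: positions the user has corrected are pinned to their labels, and the decoder propagates the effect to neighbouring labels, mirroring how fixing one field can correct others automatically.

```python
# Toy sketch of constrained Viterbi over a linear-chain model. Scores are
# hand-set for illustration, not learned CRF potentials.
import numpy as np

NEG = -1e9  # effectively minus infinity

def constrained_viterbi(emit, trans, constraints):
    """emit: (T, N) log emission scores; trans: (N, N) log transition
    scores; constraints: dict mapping position -> required label index."""
    T, N = emit.shape

    def pin(t, scores):
        if t in constraints:            # forbid all but the user's label
            mask = np.full(N, NEG)
            mask[constraints[t]] = 0.0
            return scores + mask
        return scores

    logd = pin(0, emit[0].copy())
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        s = logd[:, None] + trans
        back[t] = s.argmax(axis=0)
        logd = pin(t, s.max(axis=0) + emit[t])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

emit = np.log([[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]])
trans = np.log([[0.8, 0.2], [0.2, 0.8]])
print(constrained_viterbi(emit, trans, {}))      # -> [0, 0, 0]
print(constrained_viterbi(emit, trans, {1: 1}))  # pinning one position
                                                 # flips its neighbours: [1, 1, 1]
```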