R. Manmatha's research while affiliated with Amazon and other places

Publications (149)

Preprint
Full-text available
In recent years, the dominant paradigm for text spotting is to combine the tasks of text detection and recognition into a single end-to-end framework. Under this paradigm, both tasks are accomplished by operating over a shared global feature map extracted from the input image. Among the main challenges that end-to-end approaches face is the perform...
Preprint
Text spotting end-to-end methods have recently gained attention in the literature due to the benefits of jointly optimizing the text detection and recognition components. Existing methods usually have a distinct separation between the detection and recognition branches, requiring exact annotations for the two tasks. We introduce TextTranSpotter (TT...
Article
Text spotting end-to-end methods have recently gained attention in the literature due to the benefits of jointly optimizing the text detection and recognition components. Existing methods usually have a distinct separation between the detection and recognition branches, requiring exact annotations for the two tasks. We introduce TextTranSpotter (TT...
Article
Starting from the seminal work of Fully Convolutional Networks (FCN), there has been significant progress on semantic segmentation. However, deep learning models often require large amounts of pixelwise annotations to train accurate and robust models. Given the prohibitively expensive annotation cost of segmentation masks, we introduce a self-train...
Preprint
Full-text available
We propose a novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to reason over different modalities. Thus, we first investigate the impact of each modality, and reveal the importance of the language module, especially when enriched with layout info...
Preprint
Full-text available
We present DocFormer -- a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which en...
Preprint
Full-text available
In this work, we study the problem of word-level confidence calibration for scene-text recognition (STR). Although the topic of confidence calibration has been an active research area for the last several decades, the case of structured and sequence prediction calibration has been scarcely explored. We analyze several recent STR methods and show th...
Preprint
Full-text available
We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition. To account for the sequence-to-sequence structure, each feature map is divided into different instances over which the contrastive loss is computed. This operation enables us to contrast at a sub-word level, w...
Article
Full-text available
We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition. To account for the sequence-to-sequence structure, each feature map is divided into different instances over which the contrastive loss is computed. This operation enables us to contrast at a sub-word level, w...
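The sub-word contrastive operation the abstract describes — splitting each sequential feature map into instances and computing a contrastive loss over them — can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions: `seq_instances`, `contrastive_loss`, the average-pooling scheme, and the temperature value are illustrative choices, not the authors' code.

```python
import numpy as np

def seq_instances(feats, n_inst):
    # feats: (T, C) frame features from one view of an image;
    # pool contiguous windows of frames into n_inst instance embeddings.
    splits = np.array_split(feats, n_inst, axis=0)
    return np.stack([s.mean(axis=0) for s in splits])  # (n_inst, C)

def contrastive_loss(za, zb, tau=0.1):
    # za, zb: (N, C) instance embeddings from two augmented views;
    # instance i in view A should match instance i in view B (InfoNCE-style).
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau                      # (N, N) pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal
```

Pooling windows into several instances per word image is what lets the loss operate below the word level rather than on whole-image embeddings.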
Preprint
Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomp...
Preprint
Full-text available
This paper presents the results of the Document Visual Question Answering Challenge organized as part of the "Text and Documents in the Deep Learning Era" workshop at CVPR 2020. The challenge introduces a new problem - Visual Question Answering on document images. The challenge comprised two tasks. The first task concerns asking questions on a single doc...
Preprint
Full-text available
We present a new dataset for Visual Question Answering on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. We provide a detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension. We report several baseline results by adopting existing VQA and readin...
Preprint
Deep learning usually achieves the best results with complete supervision. In the case of semantic segmentation, this means that large amounts of pixelwise annotations are required to learn accurate models. In this paper, we show that we can obtain state-of-the-art results using a semi-supervised approach, specifically a self-training paradigm. We...
Preprint
Full-text available
While image classification models have recently continued to advance, most downstream applications such as object detection and semantic segmentation still employ ResNet variants as the backbone network due to their simple and modular structure. We present a simple and modular Split-Attention block that enables attention across feature-map groups....
Preprint
Scene Text Recognition (STR), the task of recognizing text against complex image backgrounds, is an active area of research. Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes. In this paper, we introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER). SC...
Preprint
Full-text available
We propose a new end-to-end trainable model for lossy image compression which includes a number of novel components. This approach incorporates 1) a hierarchical auto-regressive model and 2) saliency in the images, focusing on reconstructing the salient regions better; 3) in addition, we empirically demonstrate that the popularly...
Preprint
Full-text available
Recently, there has been much interest in deep learning techniques to do image compression and there have been claims that several of these produce better results than engineered compression schemes (such as JPEG, JPEG2000 or BPG). A standard way of comparing image compression schemes today is to use perceptual similarity metrics such as PSNR or MS...
Preprint
Full-text available
Several deep learned lossy compression techniques have been proposed in the recent literature. Most of these are optimized by using either MS-SSIM (multi-scale structural similarity) or MSE (mean squared error) as a loss function. Unfortunately, neither of these correlate well with human perception and this is clearly visible from the resulting com...
Preprint
Full-text available
In this age of social media, people often look at what others are wearing. In particular, Instagram and Twitter influencers often provide images of themselves wearing different outfits, and their followers are often inspired to buy similar clothes. We propose a system to automatically find the closest visually similar clothes in the online Catalog (s...
Article
Full-text available
The main goal of existing word spotting approaches for searching document images has been the identification of visually similar word images in the absence of high quality text recognition output. Searching for a piece of arbitrary text is not possible unless the user identifies a sample word image from the document collection or generates the quer...
Article
Full-text available
Training robust deep video representations has proven to be much more challenging than learning deep image representations and consequently hampered tasks like video action recognition. This is in part due to the enormous size of raw video streams, the associated amount of computation required, and the high temporal redundancy. The 'true' and inter...
Article
Deep embeddings answer one simple question: How similar are two images? Learning these embeddings is the bedrock of verification, zero-shot learning, and visual search. The most prominent approaches optimize a deep convolutional network with a suitable loss function, such as contrastive loss or triplet loss. While a rich line of work focuses solely...
Conference Paper
Full-text available
An adaptive image sampling framework is proposed for identifying text regions in natural scene images. A small fraction of the pixels actually correspond to text regions. It is desirable to eliminate non-text regions at the early stages of text detection. First, the image is sampled row-by-row at a specific rate and each row is tested for containin...
Conference Paper
The task of automatic image annotation involves assigning multiple relevant labels/tags to query images based on their visual content. One of the key challenges in the multi-label image annotation task is the class imbalance problem, where frequently occurring labels suppress the participation of rarely occurring labels. In this paper, we propose to expl...
Conference Paper
Full-text available
We propose simple and effective models for image annotation that make use of Convolutional Neural Network (CNN) features extracted from an image and word embedding vectors to represent their associated tags. Our first set of models is based on the Canonical Correlation Analysis (CCA) framework that helps in modeling both views - visual features...
Article
Relevance feedback has been shown to improve retrieval for a broad range of retrieval models. It is the most common way of adapting a retrieval model for a specific query. In this work, we expand this common way by focusing on an approach that enables us to do query-specific modification of a retrieval model for learning-to-rank problems. Our appro...
Conference Paper
Event detection is a recent and challenging task. The aim is to retrieve the relevant videos given an event description. A set of training examples associated with the events is generally provided as well, since retrieving relevant videos from textual queries alone is not feasible. Early attempts at event detection are based on low-level features...
Article
In this work we present a handwritten word spotting approach that takes advantage of the a priori known order of appearance of the query words. Given an ordered sequence of query word instances, the proposed approach performs a sequence alignment with the words in the target collection. Although the alignment is quite sparse, i.e. the number of wor...
Conference Paper
Full-text available
In this work, we present a hybrid model (SVM-DMBRM) combining a generative and a discriminative model for the image annotation task. A support vector machine (SVM) is used as the discriminative model and a Discrete Multiple Bernoulli Relevance Model (DMBRM) is used as the generative model. The idea of combining both the models is to take advantage...
Article
In this paper, we present a practical and scalable retrieval framework for large-scale document image collections, for an Indian language script that does not have a robust OCR. OCR-based methods face difficulties in character segmentation and recognition, especially for the complex Indian language scripts. We realize that character recognition is...
Chapter
Shape descriptors have been used frequently to characterize an image for classification and retrieval tasks. For example, shape features are applied in medical imaging to classify whether an image shows a tumor or not. The patent office uses the similarity of shape to ensure that there are no infringements of copyrighted trademarks. This paper focu...
Article
Full-text available
Detecting multimedia events in web videos is an emerging hot research area in the fields of multimedia and computer vision. In this paper, we introduce the core methods and technologies of the framework we developed recently for our Event Labeling through ...
Conference Paper
Social media platforms allow rapid information diffusion and serve as a source of information for many of their users. In particular, on Twitter, information provided by tweets diffuses among users through retweets. Hence, being able to predict the retweet count of a given tweet is important for understanding and controlling information diffusion on...
Article
We welcome you to the Second International Workshop on Historical Document Imaging and Processing (HIP'13), held in conjunction with ICDAR 2013. HIP'11 was a gratifying and overwhelming success as attendees from six continents assembled in Beijing to share their work. The proceedings of HIP'11 have become part of the ACM Digital Library. The procee...
Conference Paper
This paper evaluates an automated scheme for aligning and combining optical character recognition (OCR) output from three scans of a book to generate a composite version with fewer OCR errors. While there has been some previous work on aligning multiple OCR versions of the same scan, the scheme introduced in this paper does not require that scans b...
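Once the three OCR outputs are aligned token-by-token, the combination step reduces to per-position voting. The toy sketch below assumes the alignment — the hard part the paper addresses — has already been done; `vote` is a hypothetical helper, not the paper's scheme:

```python
from collections import Counter

def vote(aligned):
    # aligned: list of per-position token tuples, one slot per scan
    # (None marks a token missing from that scan's OCR output);
    # pick the majority token at each position to build the composite text.
    out = []
    for slot in aligned:
        tokens = [t for t in slot if t is not None]
        if tokens:
            out.append(Counter(tokens).most_common(1)[0][0])
    return out
```

With three independent scans, a character misread in one scan is usually outvoted by the other two, which is why the composite has fewer errors than any single OCR pass.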
Conference Paper
Action recognition is one of the major challenges of computer vision. Several approaches have been proposed using different descriptors and multi-class models. In this paper, we focus on binary ranking models for the action recognition problem and address the action recognition as a ranking problem. A binary ranking model is trained for each action...
Conference Paper
The objective of this work is to show the importance of a good line segmentation to obtain better results in the segmentation of words of historical documents. We have used the approach developed by Manmatha and Rothfeder to segment words in old handwritten documents. In their work the lines of the documents are extracted using projections. In this...
Article
This paper describes an approach for identifying translations of books in large scanned book collections with OCR errors. The method is based on the idea that although individual sentences do not necessarily preserve the word order when translated, a book must preserve the linear progression of ideas for it to be a valid translation. Consider two b...
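The "linear progression of ideas" intuition can be made concrete: if shared anchor terms (e.g., matched via a translation lexicon) appear in roughly the same order in both books, the books are plausibly translations. A hypothetical sketch that scores order preservation with a longest-increasing-subsequence computation — the names and scoring are illustrative, not the paper's method:

```python
import bisect

def order_score(pos_a, pos_b):
    # pos_a/pos_b: position of each shared anchor term in book A / book B.
    # Sort anchors by their position in A, then measure how monotone the
    # B positions are via the longest increasing subsequence (LIS).
    order = [b for _, b in sorted(zip(pos_a, pos_b))]
    tails = []  # patience-sorting tails; len(tails) == LIS length
    for x in order:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails) / len(order)  # 1.0 = identical linear progression
```

A score near 1.0 means the two books present their shared terms in the same order; OCR errors only remove anchors rather than reorder them, so the measure degrades gracefully.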
Article
Conventional retrieval systems view documents as a unit and look at different retrieval types within a document. We introduce Proteus, a framework for seamlessly navigating books as dynamic collections which are defined on the fly. Proteus allows us to search various retrieval types. Navigable types include pages, books, named persons, locations,...
Article
Full-text available
An efficient word spotting framework is proposed to search text in scanned books. The proposed method allows one to search for words when optical character recognition (OCR) fails due to noise or for languages where there is no OCR. Given a query word image, the aim is to retrieve matching words in the book sorted by similarity. In the offline...
Conference Paper
Full-text available
A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words (in the order they appear in the text) which appear only once in the book. These words are referred to as "unique words" and they const...
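The "unique words" representation mentioned above is easy to compute: keep only the words that occur exactly once in a book, in reading order. A minimal sketch — the overlap measure here is an illustrative stand-in for the paper's actual matching:

```python
from collections import Counter

def unique_word_sequence(text):
    # words occurring exactly once in the book, kept in reading order
    words = text.lower().split()
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]

def overlap(a, b):
    # crude set overlap of two unique-word "fingerprints"; a high value
    # suggests the books share substantial content
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(1, min(len(sa), len(sb)))
```

Because common words are discarded, the fingerprint is short and robust to OCR noise: an OCR error typically turns a unique word into a garbled token that simply fails to match, rather than creating a false match.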
Conference Paper
Existing large-scale scanned book collections have many shortcomings for data-driven research, from OCR of variable quality to the lack of accurate descriptive and structural metadata. We argue that complementary research in inferring relational metadata is important in its own right to support use of these collections and that it can help to mitig...
Conference Paper
Full-text available
This paper aims to evaluate the accuracy of optical character recognition (OCR) systems on real scanned books. The ground truth e-texts are obtained from the Project Gutenberg website and aligned with their corresponding OCR output using a fast recursive text alignment scheme (RETAS). First, unique words in the vocabulary of the book are aligned wi...
Conference Paper
Retrieval from Hindi document image collections is a challenging task. This is partly due to the complexity of the script, which has more than 800 unique ligatures. In addition, segmentation and recognition of individual characters often becomes difficult due to the writing style as well as degradations in the print. For these reasons, robust OCRs...
Article
Keyword spotting refers to the process of retrieving all instances of a given keyword from a document. In the present paper, a novel keyword spotting method for handwritten documents is described. It is derived from a neural network-based system for unconstrained handwriting recognition. As such it performs template-free spotting, i.e., it is not n...
Chapter
Optical character recognizers have been combined with text search engines to successfully build digital book libraries in a number of European languages. The recognition of printed books in Indian languages and scripts is, however, still a challenging problem. This paper describes some of the challenges in building recognizers for Indian languages....
Conference Paper
Being able to search for words or phrases in historic handwritten documents is of paramount importance when preserving cultural heritage. Storing scanned pages of written text can save the information from degradation, but it does not make the textual information readily available. Automatic keyword spotting systems for handwritten historic documen...
Article
This paper develops word recognition methods for historical handwritten cursive and printed documents. It employs a powerful segmentation-free letter detection method based upon joint boosting with histograms of gradients as features. Efficient inference on an ensemble of hidden Markov models can select the most probable sequence of candidate chara...
Article
Recent advances in sensor networks permit the use of a large number of relatively inexpensive distributed computational nodes with camera sensors linked in a network and possibly linked to one or more central servers. We argue that the full potential of such a distributed system can be realized if it is designed as a distributed search engine whe...
Conference Paper
The word error rate of any optical character recognition system (OCR) is usually substantially below its component or character error rate. This is especially true of Indic languages in which a word consists of many components. Current OCRs recognize each character or word separately and do not take advantage of document level constraints. We propo...
Chapter
Paper archives need to be converted to electronic form to enable the search and organization of documents. This article provides a brief introduction to the basic ideas and the remaining challenges. Essentially, this involves first scanning the image, then discovering the page layout to separate text segments and finally recognizing characters and...
Conference Paper
Recent advances in sensor networks permit the use of a large number of relatively inexpensive distributed computational nodes with camera sensors linked in a network and possibly linked to one or more central servers. We argue that the full potential of such a distributed system can be realized if it is designed as a distributed search engine where...
Conference Paper
This paper presents an efficient indexing and retrieval scheme for searching in document image databases. In many non-European languages, optical character recognizers are not very accurate. Word spotting - word image matching - may instead be used to retrieve word images in response to a word image query. The approaches used for word spotting so...
Article
Today's digital libraries increasingly include not only printed text but also scanned handwritten pages and other multimedia material. There are, however, few tools available for manipulating handwritten pages. Here, we extend our algorithm from [5]based on dynamic time warping (DTW) for a word by word alignment of handwritten documents with (ASCII...
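Dynamic time warping, the core of the alignment above, finds the minimum-cost monotone warping between two sequences by dynamic programming. A minimal distance-only sketch — real word-image alignment would use profile features per word image; the function name and the Euclidean local cost are assumptions:

```python
import numpy as np

def dtw(a, b):
    # a, b: sequences of feature vectors; returns the minimal cumulative
    # distance over all monotone alignments of a against b.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)  # D[i, j]: best cost aligning a[:i], b[:j]
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            # extend by matching a[i-1] to b[j-1] after a skip in either
            # sequence or a diagonal step
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Backtracking through `D` (not shown) recovers which word images map to which transcript words; the distance alone already suffices for ranking candidate alignments.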
Article
Searching and indexing historical handwritten collections is a very challenging problem. We describe an approach called word spotting, which involves grouping word images into clusters of similar words by using image matching to find similarity. By annotating “interesting” clusters, an index that links words to the locations where they occur can be...
Conference Paper
In this paper we explore different approaches for improving the performance of dependency models on discrete features for handwriting recognition. Hidden Markov models have often been used for handwriting recognition. Conditional random fields (CRFs) allow for more general dependencies, and we investigate their use. We believe that this is the firs...
Conference Paper
Training and evaluation of techniques for handwriting recognition and retrieval is a challenge given that it is difficult to create large ground-truthed datasets. This is especially true for historical handwritten datasets. In many instances the ground truth has to be created by manually transcribing each word, which is a very labor-intensive pro...
Conference Paper
A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based online book retrieval usually requires first converting printed text into machine-readable (e.g. ASCII) text using an optical character recognitio...
Article
Many libraries, museums, and other organizations contain large collections of handwritten historical documents, for example, the papers of early presidents like George Washington at the Library of Congress. The first step in providing recognition/ retrieval tools is to automatically segment handwritten pages into words. State of the art segmentatio...
Article
This paper investigates different machine learning models to solve the historical handwritten manuscript recognition problem. In particular, we test and compare support vector machines, conditional maximum entropy models and Naive Bayes with kernel density estimates and explore their behaviors and properties when solving this problem. We focus on a...
Conference Paper
Recognition and retrieval of historical handwritten material is an unsolved problem. We propose a novel approach to recognizing and retrieving handwritten manuscripts, based upon word image classification as a key step. Decision trees with normalized pixels as features form the basis of a highly accurate AdaBoost classifier, trained on a corpus of...
Conference Paper
Shape descriptors have been used frequently as features to characterize an image for classification and image retrieval tasks. For example, the patent office uses the similarity of shape to ensure that there are no infringements of copyrighted trademarks. This paper focuses on using machine learning and information retrieval techniques to classify...
Conference Paper
We discuss the opportunities, state of the art, and open research issues in using multi-modal features in video indexing. Specifically, we focus on how imperfect text data obtained by automatic speech recognition (ASR) may be used to help solve challenging problems, such as story segmentation, concept detection, retrieval, and topic clustering. We...
Article
Numerous examples exist of the benefits of the timely access to information in emergencies and disasters. Information technology (IT) is playing an increasingly important role in information-sharing during emergencies and disasters. The effective use of IT in out-of-hospital (OOH) disaster response is accompanied by numerous challenges at the human...
Conference Paper
Most image retrieval systems only allow a fragment of text or an example image as a query. Most users have more complex information needs that are not easily expressed in either of these forms. This paper proposes a model based on the Inference Network framework from information retrieval that employs a powerful query language that allows structu...
Conference Paper
In this paper, we propose the use of the Maximum Entropy approach for the task of automatic image annotation. Given labeled training data, Maximum Entropy is a statistical technique which allows one to predict the probability of a label given test data. The technique allows relationships between features to be effectively captured and has been s...
Conference Paper
We apply a continuous relevance model (CRM) to the problem of directly retrieving the visual content of videos using text queries. The model computes a joint probability model for image features and words using a training set of annotated images. The model may then be used to annotate unseen test images. The probabilistic annotations are used for r...
Article
We propose an approach to learning the semantics of images which allows us to automatically annotate an image with keywords and to retrieve images based on text queries. We do this using a formalism that models the generation of annotated images. We assume that every image is divided into regions, each described by a continuous-valued feature vecto...
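The generative formalism described — scoring each vocabulary word for a query image by summing, over training images, a visual kernel density term times a word probability — can be caricatured as follows. This is a toy sketch: global image features, a Gaussian kernel, and crude smoothing stand in for the paper's region-based continuous-feature model, and all names are hypothetical:

```python
import numpy as np

def annotate(query_feats, train_feats, train_tags, vocab, sigma=1.0):
    # Relevance-model-style scoring: P(w, q) ≈ Σ_J P(q|J) P(w|J),
    # summing over annotated training images J.
    scores = {w: 0.0 for w in vocab}
    for feats, tags in zip(train_feats, train_tags):
        d2 = np.sum((np.asarray(query_feats) - np.asarray(feats)) ** 2)
        p_q = np.exp(-d2 / (2 * sigma ** 2))   # Gaussian kernel term P(q|J)
        for w in vocab:
            p_w = 1.0 if w in tags else 1e-3   # crudely smoothed P(w|J)
            scores[w] += p_q * p_w
    # vocabulary ranked by joint probability with the query image
    return sorted(scores, key=scores.get, reverse=True)
```

The top-ranked words become the automatic annotation; the same joint scores can be used directly to rank images against a text query, which is how annotation doubles as retrieval.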