Multi-page document analysis based on format consistency and clustering.

International Journal of Computer Applications in Technology 01/2010; 38:306-315. DOI: 10.1504/IJCAT.2010.034531
Source: DBLP


In multi-page documents, elements that belong to the same component usually share a regular format. We call this regularity 'document component intrinsic format consistency' (DCIFC). We present a new document analysis method based on DCIFC that complements traditional methods based on the visual characteristics of document elements. A key advantage of our method is that DCIFC is stable from document to document and is therefore unaffected by layout variability, a major challenge in document analysis. The method uses clustering techniques to build statistical models and then applies the models to label document components. In this way, it adapts to each specific document by exploiting the format properties particular to its components. We apply the method to several document recognition tasks and show its superior performance.
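The abstract does not specify the features or the clustering algorithm, so the following is only a minimal sketch of the general idea (cluster elements by format, then label elements with the resulting per-cluster models), not the authors' implementation. The feature tuple (font size, left indent, bold flag), the distance threshold, the simple "leader" clustering used as a stand-in, and the names `Cluster`, `cluster_by_format` and `label_element` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class Cluster:
    """A group of format vectors; its centroid acts as a tiny statistical model."""
    members: list = field(default_factory=list)

    @property
    def centroid(self):
        n = len(self.members)
        return tuple(sum(col) / n for col in zip(*self.members))


def cluster_by_format(features, threshold):
    """Leader clustering (an assumed stand-in): join the nearest cluster
    if it lies within `threshold`, otherwise start a new cluster."""
    clusters = []
    for vec in features:
        best = min(clusters, key=lambda c: dist(c.centroid, vec), default=None)
        if best is not None and dist(best.centroid, vec) <= threshold:
            best.members.append(vec)
        else:
            clusters.append(Cluster([vec]))
    return clusters


def label_element(vec, clusters):
    """Return the index of the cluster whose centroid is closest to `vec`."""
    return min(range(len(clusters)), key=lambda i: dist(clusters[i].centroid, vec))


# Hypothetical elements from one document:
# (font size in pt, left indent in pt, bold flag)
elements = [
    (18.0, 0.0, 1.0),   # looks like a heading
    (10.0, 20.0, 0.0),  # looks like body text
    (10.0, 20.0, 0.0),
    (18.0, 0.0, 1.0),
    (8.0, 40.0, 0.0),   # looks like a footnote
    (10.5, 20.0, 0.0),
]

clusters = cluster_by_format(elements, threshold=3.0)
labels = [label_element(e, clusters) for e in elements]
print(len(clusters), labels)
```

Because the clusters are built from the document being analysed, the cluster indices mean something only within that document. Mapping an index to a component name such as "heading" would need an additional step that this sketch does not cover.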

  • Proceedings DAS2000, Int'l Workshop on Document Analysis Systems; 12/2000
  • Source
    ABSTRACT: Extracting structural information from the table of contents (TOC) to help build a digital document library first requires identifying and segmenting the TOC page. The objective of creating a digital document library is to provide a non-labour-intensive, cheap and flexible way of storing, representing and managing paper documents in electronic form, to facilitate indexing, viewing, printing and extracting the intended portions. Information from the TOC pages is to be extracted for use in a document database for effective retrieval of the required pages. We present a fully automatic method for identifying and segmenting a table of contents (TOC) page from a scanned document.
    Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on; 09/2003
  • Source
    ABSTRACT: The Lister Hill National Center for Biomedical Communications has developed a system that incorporates OCR and automated recognition and reformatting algorithms to extract bibliographic citation data from scanned biomedical journal articles to populate the NLM's MEDLINE® database. The multi-engine OCR server incorporated in the system performs well in general, but fares less well with text printed in the small or italic fonts often used to print institutional affiliations. Because of poor OCR and other reasons, the resulting affiliation field frequently requires a disproportionate amount of time to manually correct and verify. In contrast, author names are usually printed in large, normal fonts that are correctly recognized by the OCR system. We describe techniques to exploit the more successful OCR conversion of author names to help find the correct affiliations from MEDLINE data.