Word Extraction from Table Regions in Document Images
DOI: 10.3745/KIPSTB.2005.12B.4.369
Conference: Digital Libraries: Implementing Strategies and Sharing Experiences, 8th International Conference on Asian Digital Libraries, ICADL 2005, Bangkok, Thailand, December 12-15, 2005, Proceedings
This paper describes a method for extracting words from table regions in document images. The proposed approach consists of two stages: cell detection and word extraction. In the cell detection stage, the table frame is first extracted by analyzing connected components, and intersection points are then detected by applying masks to the table frame. False intersections are corrected, and the locations of the cells within the table are determined. In the word extraction stage, the text region in each cell is located using the connected-component information obtained during cell detection and is segmented into text lines using projection profiles. Finally, the segmented lines are divided into words using gap clustering and special symbol detection. The method correctly assigns character components that touch the table frame to their words, and in the experiments more than 99% of words were successfully extracted from table regions.
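The word extraction stage described above (projection profiles for lines, gap clustering for words) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the binary-image representation (1 = ink, 0 = background), and the use of a simple one-dimensional two-cluster split of the gap widths are all assumptions made for illustration, and special symbol detection is left out.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed representation: 1 = ink, 0 = background


def runs_of_ink(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the profile is non-zero."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(region: Image) -> List[Tuple[int, int]]:
    """Split a text region into lines using the horizontal projection profile."""
    return runs_of_ink([sum(row) for row in region])


def gap_threshold(gaps: List[int]) -> float:
    """Cluster gap widths into two groups (1-D 2-means) and return the
    midpoint between the two cluster centres."""
    lo, hi = float(min(gaps)), float(max(gaps))
    if lo == hi:
        return hi  # assumption: identical gaps mean no word boundary
    for _ in range(20):
        small = [g for g in gaps if abs(g - lo) <= abs(g - hi)]
        large = [g for g in gaps if abs(g - lo) > abs(g - hi)]
        lo, hi = sum(small) / len(small), sum(large) / len(large)
    return (lo + hi) / 2


def segment_words(line: Image) -> List[Tuple[int, int]]:
    """Split one text line into words: ink runs in the vertical projection
    profile are merged across gaps that fall in the 'small' cluster."""
    width = len(line[0])
    profile = [sum(row[x] for row in line) for x in range(width)]
    runs = runs_of_ink(profile)
    if len(runs) < 2:
        return runs
    gaps = [b[0] - a[1] for a, b in zip(runs, runs[1:])]
    threshold = gap_threshold(gaps)
    words = [runs[0]]
    for gap, run in zip(gaps, runs[1:]):
        if gap > threshold:
            words.append(run)                    # wide gap: start a new word
        else:
            words[-1] = (words[-1][0], run[1])   # narrow gap: same word
    return words
```

For a single-row line such as `[[1,1,0,1,1,0,0,0,1,1,0,1,1]]`, the gaps between ink runs are 1, 3, and 1 pixels wide, so the sketch merges the runs on either side of each 1-pixel gap and splits at the 3-pixel gap, giving two words.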
Available from: ocean.kisti.re.kr
- "They were printed by a SAMSUNG ML8065 printer, and then copied iteratively by 8 times using a XEROX Document Centre 285 PLUS G copier, and finally scanned by an EPSON GT-30000 scanner at 200 DPI. All the document images were partitioned into word images using the system of (Jeong et al. 2005). One half of the data is used for training, and the other half is used for testing (Table 1)." (excerpt from "A Keyword Matching for the Retrieval of Low-Quality Hangul Document Images", p. 45)
ABSTRACT: Keyword retrieval is difficult for low-quality Korean document images because adjacent characters are often connected. In addition, images created from various fonts are likely to be distorted during acquisition. In this paper, we propose and test a keyword retrieval system for low-quality Korean document images that uses a support vector machine (SVM) to judge the similarity between two word images. We demonstrate that the proposed method is more effective than the accumulated Optical Character Recognition (OCR)-based searching method. Moreover, the SVM performs better than a Bayesian decision rule or an artificial neural network at determining the similarity of two images.
ABSTRACT: This paper describes the development and implementation of an algorithm that extracts words from mixed text/graphics regions in document images using statistical analysis. The algorithm is a component of DIPS (Document Images Processing System), which is based on statistical methods. To extract word images from such regions, the character components must be separated from the graphic components. For this step, we propose a separation method based on box-plot analysis of the statistics of structural components. The accuracy of this method is not sensitive to changes in the images because the separation criterion is defined by the statistics of the components themselves. The character regions are then determined by analyzing the local crowdedness of the separated character elements. Finally, we divide the character regions into text lines and word images using projection profile analysis, gap clustering, and related techniques. Because its criteria are based on the statistics of image regions, the proposed system reduces the influence of changes in the images.
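The box-plot separation step mentioned in this abstract can be sketched as follows. The abstract does not say which component statistic is used or how the fences are set, so this example assumes a single size measure per connected component (for instance bounding-box height), the conventional 1.5 × IQR upper fence, and that graphic components are the unusually large ones. The function name `split_by_boxplot` is hypothetical.

```python
from statistics import quantiles
from typing import List, Tuple


def split_by_boxplot(sizes: List[float], k: float = 1.5) -> Tuple[List[int], List[int]]:
    """Return (character_indices, graphic_indices).

    Components whose size lies above the upper box-plot fence
    Q3 + k * IQR are treated as graphics (an assumption); the rest
    are treated as characters. Needs at least two components.
    """
    q1, _, q3 = quantiles(sizes, n=4, method="inclusive")
    upper_fence = q3 + k * (q3 - q1)
    characters = [i for i, s in enumerate(sizes) if s <= upper_fence]
    graphics = [i for i, s in enumerate(sizes) if s > upper_fence]
    return characters, graphics
```

Because the fence is computed from the components of each image rather than fixed in advance, the criterion adapts to the image, which is the property the abstract attributes to the method.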