Conference Paper

Metadata Extraction System for Chinese Books

DOI: 10.1109/ICDAR.2011.156
Conference: 2011 International Conference on Document Analysis and Recognition (ICDAR 2011), Beijing, China, September 18-21, 2011
Source: DBLP


Extracting metadata from academic papers has attracted much attention from researchers over the past years, but automatic metadata extraction from books is still seldom discussed. In this paper, we address this task for Chinese books and present a system that extracts metadata from the title page of a book. The system consists of three components: metadata segmentation, metadata labeling, and post-processing. It adopts different strategies to identify different metadata types, and it draws on a variety of information sources, including geometric layout, linguistic features, semantic content, and headers and footers, to accommodate the wide range of metadata layouts. Experimental results on real-world data demonstrate the effectiveness of the proposed system.
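
The abstract names three stages (segmentation, labeling, post-processing) but gives no implementation details. The following is only a minimal illustrative sketch of how such a pipeline could be wired together, not the authors' method. Every name (`Line`, `segment`, `label`, `post_process`, `extract_metadata`), every threshold, and every keyword rule is a hypothetical assumption chosen for illustration; it assumes an upstream OCR step has already produced text lines with a vertical position and a glyph height.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    top: float     # y coordinate of the line's top edge (assumed OCR output)
    height: float  # glyph height, used here as a proxy for font size


def segment(lines, gap_ratio=0.8, size_tol=0.2):
    """Stage 1 (hypothetical heuristic): group lines into blocks.

    A line joins the previous block only if its height is similar and
    the vertical gap to the previous line is small.
    """
    blocks = []
    for line in sorted(lines, key=lambda l: l.top):
        if blocks:
            prev = blocks[-1][-1]
            gap = line.top - (prev.top + prev.height)
            same_size = abs(line.height - prev.height) <= size_tol * prev.height
            if same_size and gap <= gap_ratio * prev.height:
                blocks[-1].append(line)
                continue
        blocks.append([line])
    return blocks


def label(blocks):
    """Stage 2 (hypothetical rules): tag each block with a metadata type
    using a keyword cue first, then a layout cue (largest glyphs)."""
    tallest = max(blocks, key=lambda b: b[0].height)
    labeled = []
    for block in blocks:
        text = "".join(l.text for l in block)
        if "出版社" in text:                      # "publishing house"
            kind = "publisher"
        elif text.endswith(("著", "编", "译")):    # author/editor/translator role
            kind = "author"
        elif block is tallest:
            kind = "title"
        else:
            kind = "other"
        labeled.append((kind, text))
    return labeled


def post_process(labeled):
    """Stage 3: normalize values and keep the first block of each type."""
    record = {}
    for kind, text in labeled:
        text = text.strip()
        if kind == "author":
            # Drop the trailing role marker; a toy rule, not robust.
            text = text.rstrip("著编译").strip()
        record.setdefault(kind, text)
    return record


def extract_metadata(lines):
    return post_process(label(segment(lines)))


page = [
    Line("数字图书馆", top=100, height=40),
    Line("导论", top=145, height=40),
    Line("张三 著", top=260, height=16),
    Line("北京大学出版社", top=600, height=14),
]
print(extract_metadata(page))
```

On this made-up title page, the two large lines are merged into one title block, and the author and publisher blocks are picked up by their keyword cues. A real system would need far richer cues than these toy rules, as the abstract's list of information sources suggests.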


Available from: Liangcai Gao, Sep 17, 2015
  • Source
    ABSTRACT: Traditional Information Retrieval (IR) approaches have been successfully modified and adapted to Web IR; however, because a book is inherently different from a web page, these approaches are less effective for Book IR. A true book search engine with global scope cannot be achieved until all issues and challenges in Book IR are identified and subsequently solved. Based on a comprehensive and analytical study of the available literature and book-searching solutions, this research lists the most prominent state-of-the-art issues and challenges in Book IR and suggests some possible solutions. Our research shows that, despite some innovations in Book IR, we still have a long way to go in solving these issues and challenges.
    Full-text · Article · Jun 2014 · International journal on information