Conference Paper

Metadata Extraction System for Chinese Books.

DOI: 10.1109/ICDAR.2011.156
Conference: 2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, September 18-21, 2011
Source: DBLP

ABSTRACT Extracting metadata from academic papers has attracted much attention from researchers in recent years, but how to extract metadata automatically from books is still seldom discussed. In this paper, we address this task for Chinese books and present a system that extracts metadata from the title page of a book. The system consists of three components: metadata segmentation, metadata labeling, and post-processing. Different strategies are adopted to identify different metadata types, and a variety of information sources, including geometric layout, linguistic and semantic content, and header-footer information, are used to accommodate the wide range of metadata layouts. Experimental results on real-world data demonstrate the effectiveness of the proposed system.

  • Source
    ABSTRACT: The dramatic increase in the number of academic publications has led to a growing demand for efficient organization of the resources to meet researchers' specific needs. As a result, a number of network services have compiled databases from the public resources scattered over the Internet. Furthermore, because the publications utilize many different citation formats, the problem of accurately extracting metadata from a publication list has also attracted a great deal of attention in recent years. In this paper, we extend our previous work by using a gene sequence alignment tool to recognize and parse citation strings from publication lists into citation metadata. We also propose a new tool called BibPro. The main difference between BibPro and our previously proposed tool is that BibPro does not need any knowledge databases (e.g., an author name database) to generate a feature index for a citation string. Instead, BibPro uses only the order of punctuation marks in a citation string as its feature index to represent the string's citation format. Using this feature index, BibPro employs the Basic Local Alignment Search Tool (BLAST) to match the feature's citation sequence with the most similar citation formats in the citation database. The Needleman-Wunsch algorithm is then used to determine the best citation format for extracting the desired citation metadata. By utilizing the alignment information, which is determined by the best template, BibPro can systematically extract the fields of author, title, journal, volume, number (issue), month, year, and page information from different citation formats with a high level of precision. The experimental results show that, in terms of precision and recall, BibPro outperforms other systems (e.g., INFOMAP and ParaCite). The results also show that BibPro scales very well.
    22nd International Conference on Advanced Information Networking and Applications, AINA 2008, Workshops Proceedings, GinoWan, Okinawa, Japan, March 25-28, 2008; 01/2008
  • Source
    ABSTRACT: Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a support vector machine classification-based method for metadata extraction from the header part of research papers and show that it outperforms other machine learning methods on the same task. The method first classifies each line of the header into one or more of 15 classes. An iterative convergence procedure is then used to improve the line classification by using the predicted class labels of its neighboring lines in the previous round. Further metadata extraction is done by seeking the best chunk boundaries of each line. We found that discovery and use of the structural patterns of the data and domain-based word clustering can improve the metadata extraction performance. An appropriate feature normalization also greatly improves the classification performance. Our metadata extraction method was originally designed to improve the metadata extraction quality of the digital libraries Citeseer [S. Lawrence et al., (1999)] and EbizSearch [Y. Petinot et al., (2003)]. We believe it can be generalized to other digital libraries.
    Digital Libraries, 2003. Proceedings. 2003 Joint Conference on; 06/2003
  • Source
    Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26-28, 2004, Proceedings; 01/2004

