Conference Paper

BibPro: A Citation Parser Based on Sequence Alignment Techniques

Nat. Taiwan Univ., Taipei
DOI: 10.1109/WAINA.2008.125 Conference: 22nd International Conference on Advanced Information Networking and Applications, AINA 2008, Workshops Proceedings, GinoWan, Okinawa, Japan, March 25-28, 2008
Source: DBLP

ABSTRACT

The dramatic increase in the number of academic publications has led to a growing demand for efficient organization of these resources to meet researchers' specific needs. As a result, a number of network services have compiled databases from the public resources scattered over the Internet. Furthermore, because publications use many different citation formats, the problem of accurately extracting metadata from a publication list has attracted a great deal of attention in recent years. In this paper, we extend our previous work, which uses a gene sequence alignment tool to recognize and parse citation strings from publication lists into citation metadata, and propose a new tool called BibPro. The main difference between BibPro and our previously proposed tool is that BibPro does not need any knowledge database (e.g., an author name database) to generate a feature index for a citation string. Instead, BibPro uses only the order of punctuation marks in a citation string as the feature index that represents the string's citation format. Using this feature index, BibPro employs the Basic Local Alignment Search Tool (BLAST) to match the citation's feature sequence against the most similar citation formats in the citation database. The Needleman-Wunsch algorithm is then used to determine the best citation format for extracting the desired citation metadata. By utilizing the alignment information determined by the best template, BibPro can systematically extract the author, title, journal, volume, number (issue), month, year, and page fields from different citation formats with a high level of precision. The experimental results show that, in terms of precision and recall, BibPro outperforms other systems (e.g., INFOMAP and ParaCite). The results also show that BibPro scales very well.
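
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the punctuation set, the scoring values (match, mismatch, gap), and the function names `feature_index`, `needleman_wunsch`, and `best_template` are all assumptions made for illustration, and the BLAST candidate-retrieval step is omitted, so every template is scored directly with a standard Needleman-Wunsch global alignment.

```python
# Assumed punctuation set; the paper's actual encoding is not given in the abstract.
PUNCT = set(",.;:()\"'-")


def feature_index(citation: str) -> str:
    """Keep only the punctuation marks, in order, as a format signature."""
    return "".join(ch for ch in citation if ch in PUNCT)


def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> int:
    """Return the global alignment score of a and b (standard DP table)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


def best_template(citation: str, templates: list[str]) -> str:
    """Pick the template signature that aligns best with the citation."""
    query = feature_index(citation)
    return max(templates, key=lambda t: needleman_wunsch(query, t))


if __name__ == "__main__":
    # Hypothetical template signatures, not taken from the paper.
    templates = [",.().", ":;:"]
    print(best_template("A, B. (2008).", templates))
```

In a fuller version, the alignment traceback (not only the score) would be needed to map each segment between punctuation marks onto a field of the chosen template.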

Available from: Hung-Yu Kao
  • Source
    • "A number of methods on this task have been reported in recent years. Those approaches can be classified into three types: rule-based, learning-based and template-based approaches [2]. Rule-based methods are usually simple and effective, and have been widely used in real-world applications. "
    ABSTRACT: Extracting metadata from academic papers has attracted much attention from researchers in past years. But how to extract metadata automatically from books is still seldom discussed. In this paper, we address this task on Chinese books and present a system to extract metadata from the title page of a book. This system consists of three components: metadata segmentation, metadata labeling, and post-processing. Different strategies are adopted in the system to identify different metadata types, and a variety of information sources, including geometric layout, linguistic, semantic content and header-footer, are used to accommodate the wide range of metadata layouts. Experimental results on real-world data have demonstrated the effectiveness of the proposed system.
    Full-text · Conference Paper · Sep 2011
  • Source
    • "Figure 1. System architecture. 3.1 BibPro. BibPro [13] is a template-based citation parser, and the key idea of BibPro is using the order of punctuation marks and reserved words in a citation string to represent its citation style. For a given citation string, BibPro encodes it as a protein sequence, which preserves citation style information."
    ABSTRACT: Researchers usually present their publication records (which we call citation records in this paper) in publication lists on the Web. These lists could be an important data source for applications that need to collect more publication records than digital libraries such as DBLP provide. However, it is still not easy to design an algorithm that extracts citation records from publication lists because of the diversity of page layouts and citation formats. In this paper, we propose an automatic approach to extract citation records from publication list pages by utilizing two properties. First, citation records are usually represented as nodes at the same level in the DOM tree. Second, citation records on the same page are presented with similar HTML tags. Extensive experiments are conducted to measure the effects of all parameters and the system performance. Experimental results show that our approach performs stably and well (with an average F-measure of 86.2%).
    Full-text · Conference Paper · Oct 2010
  • Source
    • "The precision and recall rates are above 94% for both the HS dataset and the CS dataset. In this paper, we propose a new citation parser called BibPro, which retains the advantages of our previous work [22] [23] (e.g., it uses protein sequences to represent citations and employs BLAST to find similar templates) and integrates the concepts of the knowledge-based and learning-based approaches. Instead of relying on a knowledge database and heuristic rules, BibPro uses a canonicalization algorithm to systematically capture the structural features of a citation string and store these features in a sequence template."
    Full-text · Conference Paper · Jan 2008