Conference Paper

BibPro: A Citation Parser Based on Sequence Alignment Techniques.

Nat. Taiwan Univ., Taipei
DOI: 10.1109/WAINA.2008.125
Conference: 22nd International Conference on Advanced Information Networking and Applications, AINA 2008, Workshops Proceedings, Ginowan, Okinawa, Japan, March 25-28, 2008
Source: DBLP

ABSTRACT The dramatic increase in the number of academic publications has led to a growing demand for efficient organization of these resources to meet researchers' specific needs. As a result, a number of network services have compiled databases from the public resources scattered over the Internet. Furthermore, because publications use many different citation formats, the problem of accurately extracting metadata from a publication list has also attracted a great deal of attention in recent years. In this paper, we extend our previous work by using a gene sequence alignment tool to recognize and parse citation strings from publication lists into citation metadata. We also propose a new tool called BibPro. The main difference between BibPro and our previously proposed tool is that BibPro does not need any knowledge databases (e.g., an author name database) to generate a feature index for a citation string. Instead, BibPro uses only the order of punctuation marks in a citation string as the feature index that represents the string's citation format. Using this feature index, BibPro then employs the Basic Local Alignment Search Tool (BLAST) to match the citation's feature sequence against the most similar citation formats in the citation database. The Needleman-Wunsch algorithm is then used to determine the best citation format for extracting the desired citation metadata. By using the alignment information determined by the best template, BibPro can systematically extract the author, title, journal, volume, number (issue), month, year, and page fields from different citation formats with a high level of precision. The experimental results show that, in terms of precision and recall, BibPro outperforms other systems (e.g., INFOMAP and ParaCite). The results also show that BibPro scales very well.
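
The pipeline described in the abstract (punctuation-order feature index, candidate template search, Needleman-Wunsch selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the template dictionary, and the scoring values (match +1, mismatch -1, gap -1) are assumptions made for illustration, and the BLAST candidate search is replaced here by a plain scan over a small in-memory set of templates.

```python
import string

# Assumption: every ASCII punctuation character counts as a format marker.
PUNCT = set(string.punctuation)


def feature_sequence(citation: str) -> str:
    """Keep only the punctuation marks of a citation, in their original order."""
    return "".join(ch for ch in citation if ch in PUNCT)


def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> int:
    """Return the global alignment score of a and b (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


def best_template(citation: str, templates: dict) -> str:
    """Pick the template whose feature sequence aligns best with the citation.

    `templates` maps a hypothetical format name to its feature sequence.
    A linear scan stands in for the BLAST search over the citation database.
    """
    query = feature_sequence(citation)
    return max(templates, key=lambda name: needleman_wunsch(query, templates[name]))
```

A call might look like `best_template("Lee, K.: Parsing. In: WWW, pp. 3-9 (2010)", templates)`, where `templates` holds feature sequences derived from already-labeled citations. Field extraction would then follow the alignment between the citation and the chosen template, which this sketch does not cover.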

  • Source
    ABSTRACT: Today a huge number of researchers' publication list pages are available on the Web, and they could be an important resource for many value-added applications, such as citation analysis and academic network analysis. Automatically extracting citation records from those publication list pages is still a challenging problem, because many of the pages are crafted manually by the researchers themselves and their layouts can differ considerably according to each researcher's preferences. In this paper, we propose a system, called the Citation Record Extractor (CRE), to automatically extract citation records from publication list pages. Our key idea is based on two observations from publication list pages. First, citation records are usually presented in one or more contiguous regions. Second, in terms of HTML structure, citation records are usually presented using similar tag sequences and organized under a common parent node. Our system first identifies candidate common style patterns (CSPs) within pages in the DOM tree structure, and then filters out irrelevant patterns using a classifier based on the length distribution of citation records. Experimental results show that our method performs well on a real dataset, with precision and recall of 80.4% and 83.7% respectively, and achieves an F-measure above 80% for the majority (around 90%) of real-world publication list pages.
    01/2008;
  • Source
    ABSTRACT: Researchers usually present their publication records (called citation records in this paper) in publication lists on the Web. These lists could be an important data source for many applications, since they allow more publication records to be collected than from some digital libraries, such as DBLP. However, it is still not easy to design an algorithm that extracts citation records from publication lists, because of the diversity of page layouts and citation formats. In this paper, we propose an automatic approach to extracting citation records from publication list pages by exploiting two properties. First, citation records are usually represented as nodes at the same level in the DOM tree. Second, citation records in the same page are presented using similar HTML tags. Extensive experiments are conducted to measure the effects of all parameters and the system performance. The experimental results show that our approach performs stably and well (with an F-measure of 86.2% on average).
    Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on; 10/2010
  • Source
    ABSTRACT: The dramatic increase in the number of academic publications has led to a growing demand for efficient organization of these resources to meet researchers' needs. As a result, a number of network services have compiled databases from the public resources scattered over the Internet. However, different conferences and journals adopt different citation styles, so accurately extracting metadata from a citation string formatted in one of thousands of different styles is an interesting problem. It has attracted a great deal of research attention in recent years. In this paper, based on the notion of sequence alignment, we present a citation parser called BibPro that extracts the components of a citation string. To demonstrate the efficacy of BibPro, we conducted experiments on three benchmark data sets. The results show that BibPro achieved over 90 percent accuracy on each benchmark. Even with citations and associated metadata retrieved from the Web as training data, our experiments show that BibPro still achieves reasonable performance.
    IEEE Transactions on Knowledge and Data Engineering 03/2012; · 1.89 Impact Factor
