Conference Paper

Mathematical Formula Identification in PDF Documents.

DOI: 10.1109/ICDAR.2011.285
Conference: 2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, September 18-21, 2011
Source: DBLP

ABSTRACT: Recognizing mathematical expressions in PDF documents is a new and important field in document analysis. It is quite different from extracting mathematical expressions from image-based documents. In this paper, we propose a novel method that combines rule-based and learning-based approaches to detect both isolated and embedded mathematical expressions in PDF documents. Moreover, various features of formulas, including geometric layout, character, and context content, are used to adapt to a wide range of formula types. Experimental results show satisfactory performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale Chinese e-Book production.

  • ABSTRACT: Digital libraries collect, organize, and provide end users with large quantities of selected documents. While these documents come in a variety of formats, it is desirable that they be delivered to end users in a uniform way. Web formats are a suitable choice for this purpose. Web documents are very flexible in layout presentation, which is determined at runtime by the interpreter, but documents coming from a library should preserve their original layout when displayed to end users. Using raster images would not allow the user to access the actual content of the document's components (text and images). This paper presents a technique to render the original layout of a document in an HTML file, preserving the peculiarities of its components (text, images, formulas, tables, algorithms). It builds on the DoMInUS framework, which can process documents in several source formats.
    Proceedings of the 2013 ACM symposium on Document engineering; 09/2013
  • ABSTRACT: This paper presents a performance evaluation system for mathematical formula identification. First, a ground-truth dataset is constructed to facilitate the performance comparison of different mathematical formula identification algorithms. Statistical analysis shows that the dataset is diverse enough to reflect real-world documents. Second, a performance evaluation metric for mathematical formula identification is proposed, including error type definitions and scenario-adjustable scoring. The proposed metric enables in-depth analysis of mathematical formula identification systems in different scenarios. Finally, based on the proposed evaluation metric, a tool is developed to automatically evaluate mathematical formula identification results. It is worth noting that the ground-truth dataset and the evaluation tool are freely available for academic purposes.
    01/2012
