Conference Paper

Mathematical Formula Identification in PDF Documents

DOI: 10.1109/ICDAR.2011.285
Conference: 2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, September 18-21, 2011
Source: DBLP

ABSTRACT: Recognizing mathematical expressions in PDF documents is a new and important field in document analysis. It is quite different from extracting mathematical expressions in image-based documents. In this paper, we propose a novel method by combining rule-based and learning-based methods to detect both isolated and embedded mathematical expressions in PDF documents. Moreover, various features of formulas, including geometric layout, character and context content, are used to adapt to a wide range of formula types. Experimental results show satisfactory performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale Chinese e-Book production.

  • ABSTRACT: This paper presents mathematical formula detection in heterogeneous document images that may contain figures, tables, text, and math formulas. We adopt the method originally proposed for sign detection in natural images to detect non-homogeneous regions and accordingly achieve text line detection and segmentation. Novel features based on centroid fluctuation information of non-homogeneous regions are proposed to more appropriately characterize both displayed formulas and embedded formulas. By comparing the proposed method with previous works, we demonstrate the effectiveness of the proposed features.
    2013 Conference on Technologies and Applications of Artificial Intelligence (TAAI); 12/2013
  • ABSTRACT: An important initial step of mathematical formula recognition is to correctly identify the location of formulae within documents. Previous work in this area has traditionally focused on image-based documents; however, given the prevalence and popularity of the PDF format for dissemination, alternatives to image-based approaches are increasingly being explored. In this paper, we investigate the use of both machine learning techniques and heuristic rules to locate the boundaries of both isolated and embedded formulae within documents, based upon data extracted directly from PDF files. We propose four new features along with preprocessing and post-processing techniques for isolated formula identification. Furthermore, we compare, analyse and extensively tune nine state-of-the-art learning algorithms for a comprehensive evaluation of our proposed methods. The evaluation is carried out over a ground-truth dataset, which we have made publicly available, together with an application-adaptable, fine-grained evaluation metric. Our experimental results demonstrate that the overall accuracies of isolated and embedded formula identification are increased by 11.52% and 10.65%, respectively, compared with our previously proposed formula identification approach.
    Document Analysis and Recognition 09/2013; 17(3):239-255. DOI:10.1007/s10032-013-0216-1 · 0.86 Impact Factor
  • ABSTRACT: Digital Libraries collect, organize and provide to end users large quantities of selected documents. While these documents come in a variety of formats, it is desirable that they are delivered to final users in a uniform way. Web formats are a suitable choice for this purpose. While Web documents are very flexible as to layout presentation, which is determined at runtime by the interpreter, documents coming from a library should preserve their original layout when displayed to final users. Using raster images would not allow the user to access the actual content of the document's components (text and images). This paper presents a technique to render in an HTML file the original layout of a document, preserving the peculiarity of its components (text, images, formulas, tables, algorithms). It builds on the DoMInUS framework, which can process documents in several source formats.
    Proceedings of the 2013 ACM symposium on Document engineering; 09/2013
