Conference Paper

Mathematical Formula Identification in PDF Documents

DOI: 10.1109/ICDAR.2011.285
Conference: 2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, September 18-21, 2011
Source: DBLP

ABSTRACT Recognizing mathematical expressions in PDF documents is a new and important field in document analysis, and it differs substantially from extracting mathematical expressions from image-based documents. In this paper, we propose a novel method that combines rule-based and learning-based techniques to detect both isolated and embedded mathematical expressions in PDF documents. To adapt to a wide range of formula types, the method uses several kinds of formula features, including geometric layout, character, and context content. Experimental results show satisfactory performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale Chinese e-Book production.

Available from: Xiaoyan Lin, Apr 18, 2014
    • "Surprisingly, there is very little existing work on how best to realize this process. Lines of research most closely related to the present work include extracting numerical attributes (e.g., [1] [4]), supporting numerical document queries (e.g., [5] [12]), and formula identification (e.g., [7]). However, none of these existing works address the comprehensive extraction of and search for measured information in document data, as described above. "
    ABSTRACT: We present an approach to extract measured information from text (e.g., a 1370 degrees C melting point, a BMI greater than 29.9 kg/m^2). Such extractions are critically important across a wide range of domains - especially those involving search and exploration of scientific and technical documents. We first propose a rule-based entity extractor to mine measured quantities (i.e., a numeric value paired with a measurement unit), which supports a vast and comprehensive set of both common and obscure measurement units. Our method is highly robust and can correctly recover valid measured quantities even when significant errors are introduced through the process of converting document formats like PDF to plain text. Next, we describe an approach to extracting the properties being measured (e.g., the property "pixel pitch" in the phrase "a pixel pitch as high as 352 μm"). Finally, we present MQSearch: the realization of a search engine with full support for measured information.
    • "Formulas and algorithms must be determined and restored through progressive aggregation of the low-level components in the obstacles section, because the higher-level blocks of the layout section might mix them with normal running text of the document paragraphs. Inspired by [3], we look for peculiar elements in the document and group them into consistent aggregates: images of very small size overlapping to text blocks (as potential symbols), strokes (e.g., denoting ratios and roots), boxes whose text suggests the presence of mathematics or code, and so on. More specifically, we define the following classes: "
    ABSTRACT: Digital Libraries collect, organize and provide to end users large quantities of selected documents. While these documents come in a variety of formats, it is desirable that they are delivered to final users in a uniform way. Web formats are a suitable choice for this purpose. While Web documents are very flexible as to layout presentation, which is determined at runtime by the interpreter, documents coming from a library should preserve their original layout when displayed to final users. Using raster images would not allow the user to access the actual content of the document's components (text and images). This paper presents a technique to render the original layout of a document in an HTML file, preserving the peculiarity of its components (text, images, formulas, tables, algorithms). It builds on the DoMInUS framework, which can process documents in several source formats.
    Proceedings of the 2013 ACM symposium on Document engineering; 09/2013
    • "In addition, most documents in the existing datasets were published too early to obtain the corresponding source PDF documents. As a result, for mathematical formula recognition methods focused on PDF documents [9] [10] [11], it is difficult to compare the performance directly with image-based methods. "
    ABSTRACT: This paper presents a performance evaluation system for mathematical formula identification. First, a ground-truth dataset is constructed to facilitate the performance comparison of different mathematical formula identification algorithms. Statistical analysis shows that the dataset is diverse enough to reflect real-world documents. Second, a performance evaluation metric for mathematical formula identification is proposed, including error type definitions and scenario-adjustable scoring. The proposed metric enables in-depth analysis of mathematical formula identification systems in different scenarios. Finally, based on the proposed evaluation metric, a tool is developed to automatically evaluate mathematical formula identification results. It is worth noting that the ground-truth dataset and the evaluation tool are freely available for academic purposes.
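
The measured-information abstract above describes a rule-based extractor that pairs a numeric value with a measurement unit. As a rough illustration only, the idea might look like the Python sketch below. The unit list, the regular expression, and the function name `extract_measured_quantities` are all assumptions made for this example; they are not taken from the cited work, which reports a far larger unit set and robustness to PDF conversion errors that this sketch does not attempt.

```python
import re
from typing import List, Tuple

# Hypothetical, deliberately tiny unit list (an assumption for illustration).
UNITS = ["kg/m^2", "degrees C", "mm", "kg", "m"]

# Try longer units first so that "kg/m^2" is preferred over "kg".
_UNIT_ALTERNATION = "|".join(
    re.escape(unit) for unit in sorted(UNITS, key=len, reverse=True)
)

# A number (optionally with a decimal part), optional whitespace, then a unit.
# The lookbehind avoids starting in the middle of a word or number; the
# lookahead avoids matching a unit that is only the prefix of a longer word.
_QUANTITY = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)\s*(" + _UNIT_ALTERNATION + r")(?![A-Za-z])"
)


def extract_measured_quantities(text: str) -> List[Tuple[float, str]]:
    """Return (value, unit) pairs found in text, in order of appearance."""
    return [(float(value), unit) for value, unit in _QUANTITY.findall(text)]


if __name__ == "__main__":
    sample = "a 1370 degrees C melting point, a BMI greater than 29.9 kg/m^2"
    print(extract_measured_quantities(sample))
```

Sorting the units by length before building the pattern is one simple way to resolve overlaps between units; a production extractor would presumably need a more principled unit grammar.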