Figures
Fig. 2, uploaded by Katharina Kaiser
Example of a complex table in a PDF file 

Source publication
Conference Paper
Full-text available
Tables are a common structuring element in many documents, such as PDF files. To reuse such tables, appropriate methods need to be developed that capture both the structure and the content information. We have developed several heuristics which together recognize and decompose tables in PDF files and store the extracted data in a structured data format...

Contexts in source publication

Context 1
... this procedure we have a table consisting of more than one column. For the first five columns of our example in Fig. 2 we get the resulting columns presented in Table 1. Finally, we have to identify neighboring cells with the same content and merge them. In our case the four cells with the content "Families having stock holdings direct or indirect" are merged into a single cell with a column span of four. These are the main steps of our approach to extracting table information from PDF files.

Because of the complexity of the task, and because the heuristics cannot cover all possible table structures, one cannot assume that the approach always returns correct results. For example, our approach cannot distinguish between hidden tables (i.e., tables that are not labeled as such in the original file) and real tables. Further, tables that are positioned vertically on a page cannot be captured. There are also several possible errors: text chunks that do not belong together may be merged, multi-line block objects that belong together may not be merged, data cells may be assigned to the wrong columns, and so forth. It is also possible that areas that are not tables are identified as such; this is the case, for example, with bulleted lists. To overcome these limitations we also implemented a graphical user interface that lets the user adjust the extracted data. The user can make adjustments at the cell level (e.g., delete cells, merge cells, edit a cell's content) or at the table level (e.g., delete a table, merge tables, delete or insert rows or columns). The main limitation of the tool is that it is based on the results of the pdftohtml tool. If this tool returns wrong information or no information at all, our approach cannot be applied. For example, PDF files sometimes contain only the image of a table rather than text chunks inserted by an author. In such a case, pdftohtml returns no useful information. We state this as the main limitation because the user cannot do anything about it; the graphical user interface will not help, either.

The evaluation of an information extraction system is a non-trivial issue. The MUCs' scoring program therefore represented an important first step in the right direction [6]. These conferences served as an evaluation platform where systems from different sites were evaluated using different measures. Over the years the recall and precision measures established themselves and are widely accepted as a means of giving evidence about a system's performance. Currently, some research goes in the direction of finding new, more suitable measures for evaluating table-processing approaches [7]. However, it is hard to predict how well a measure reflects the real situation for a given approach. Our approach, for example, consists of several iterative steps, and a failure in the first step would affect the end result to an unpredictable extent, yet it would be very hard to evaluate the performance of each heuristic separately. Thus, we decided to evaluate the end result using the most established measures in the IE community, namely recall and precision [2]. We evaluated the table recognition and table decomposition tasks separately and transformed the recall and precision formulas according to each task. The formula for the table recognition task is as ...
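The recall and precision formulas themselves are cut off in this excerpt. As a rough illustration only, the sketch below applies the standard information-extraction definitions (correctly recognized tables over ground-truth tables, and over returned tables); the function name and counts are assumptions, not the paper's exact formulas.

```python
# Minimal sketch of the standard recall/precision computation used in IE
# evaluation. The paper's exact per-task formulas are truncated above, so
# these definitions and names are illustrative assumptions.
def recall_precision(correct: int, in_ground_truth: int, returned: int) -> tuple[float, float]:
    recall = correct / in_ground_truth if in_ground_truth else 0.0   # found / all true tables
    precision = correct / returned if returned else 0.0              # found / all reported tables
    return recall, precision

# Example: 45 of 50 ground-truth tables recognized, 55 regions reported as tables.
r, p = recall_precision(45, 50, 55)
print(f"recall={r:.2f} precision={p:.2f}")  # recall=0.90 precision=0.82
```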
Context 2
... the appropriate column for a text object requires a heuristic itself, which is described by Algorithm 5. After all these processes we have a list of columns which together compose the tables. Now, the only thing left to do is to merge cells with the same content into one cell with a greater spanning value.

In the following we give an example to illustrate several steps of our approach. Assume that the input is a PDF file with a page like the one in Fig. 2. Of course, the PDF file contains not only the table but also text paragraphs, footnotes, etc. After getting the results from the pdftohtml tool we can proceed with our approach. Our first step is to sort all the text elements with respect to their top attributes. Assume that we have already identified the text elements before the table and let us begin with the text elements in the table (see Fig. 3). In Fig. 3, after the sorting process we have the following ordering: "Median value among families", "Families having stock holdings", "with holdings", "direct or indirect", "Family", "(thousands of 1998 dollars)", "characteristic", "1989", "1992", "1995", "1998", "1989", and so on. Now, Algorithm 1 is applied to create the line objects. Based on this ordering, the first text element saved in a line object is "Median value among families". Thus, a new line object is created and its top and bottom values are updated with respect to the added text element. The next element is "Families having stock holdings", and we must check whether this text can be placed in an existing line object. The first dashed line (see Fig. 3) marks the bottom of the line object we just created. As can be seen, the current text element's top value lies between the top and bottom values of our first line object, so it can be added to this line. After adding it, the line object's top and bottom values are updated. This procedure is applied until every text object has been placed in a line object (the last text object in our example is "1989"). The text elements in this line object are still sorted according to their top values. This ordering is of no further use, because we want to obtain the text chunks that semantically belong together; for example, we want "Family" and "characteristic" merged. Thus, we next sort the text elements in the line object according to their left values. After that we have the desired ordering, namely "Family characteristic", "Families having stock holdings direct or indirect", and so on (see Fig. 3).

After building all line objects on this page we have to classify each line object as a single-line or multi-line object. Algorithm 2 marks successive multi-line objects as multi-line block objects. Because we have no other multi-line block object in this example, we do not have to merge anything (see Algorithm 3). The next step is to create columns and assign the text objects to their corresponding columns. This step is done by Algorithm 4 and Algorithm 5. For the first text object in the first line object ("Family characteristic") we have to build a new column. For every text object in the line objects we check whether there is an existing column to which it can be assigned. If so, we simply add the text object to this column; if not, we create a new column and add the text object to it. In both cases, we update the column's horizontal boundaries according to the newly added text element. A text object can be assigned to a column if one of the following four possibilities appears (see Fig. ...
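As a reading aid, the following sketch restates the line-building heuristic from this excerpt in code. It assumes pdftohtml-style text chunks with top, bottom, and left coordinates; the class and function names are illustrative and are not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TextChunk:
    text: str
    top: float
    bottom: float
    left: float

@dataclass
class LineObject:
    top: float
    bottom: float
    chunks: list = field(default_factory=list)

def build_lines(chunks: list[TextChunk]) -> list[LineObject]:
    """Group text chunks into line objects, as described for Algorithm 1 above."""
    lines: list[LineObject] = []
    for chunk in sorted(chunks, key=lambda c: c.top):          # sort by top attribute
        for line in lines:
            if line.top <= chunk.top <= line.bottom:           # fits an existing line
                line.chunks.append(chunk)
                line.top = min(line.top, chunk.top)            # update vertical span
                line.bottom = max(line.bottom, chunk.bottom)
                break
        else:                                                  # no matching line found
            lines.append(LineObject(chunk.top, chunk.bottom, [chunk]))
    for line in lines:
        line.chunks.sort(key=lambda c: c.left)                 # restore left-to-right order
    return lines
```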

Similar publications

Article
Full-text available
We present a suite of applications used for the Italian Treebank which share their linguistic processor and culminate in a higher-level annotation tool called "FILES". The first application, "FILES" (Fully Integrated Linguistic Environment for Syntactic and Functional Annotation), is a prototype for a fully integrated linguistic environment for s...

Citations

... Table structure recognition is the problem of identifying the structural layout of a table [27]. Yildiz et al. [37] proposed a technique for recognizing and reforming columns and rows by using the horizontal and vertical overlaps present between text blocks. Tensmeyer et al. [32] used dilated convolutions but depend upon heuristic approaches during post-processing [13]. ...
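The overlap criterion mentioned in this snippet can be illustrated with a minimal check on horizontal extents; the names and example coordinates below are assumptions for illustration.

```python
def overlaps_horizontally(block_left: float, block_right: float,
                          col_left: float, col_right: float) -> bool:
    """True if a text block's horizontal extent intersects a column's extent."""
    return block_left <= col_right and col_left <= block_right

# Example: a text block spanning x=120..180 overlaps a column spanning x=150..210.
print(overlaps_horizontally(120, 180, 150, 210))  # True
```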
Preprint
Full-text available
Scientific documents contain tables that list important information in a concise fashion. Structure and content extraction from tables embedded within PDF research documents is a very challenging task due to the existence of visual features like spanning cells and content features like mathematical symbols and equations. Most existing table structure identification methods tend to ignore these academic writing features. In this paper, we adapt the transformer-based language modeling paradigm for scientific table structure and content extraction. Specifically, the proposed model converts a tabular image to its corresponding LaTeX source code. Overall, we outperform the current state-of-the-art baselines and achieve exact match accuracies of 70.35% and 49.69% on table structure and content extraction, respectively. Further analysis demonstrates that the proposed models efficiently identify the number of rows and columns, the alphanumeric characters, the LaTeX tokens, and symbols.
... They first used rule-based methods to detect table structures and to locate tables using bounding boxes. There has been a great deal of research in this direction, such as the T-Res system presented by Kieninger [1], the PDF2table method proposed by Yildiz [2], and the method combining OCR released by Tupaj et al. [3]. ...
Article
Full-text available
Using deep learning networks to recognize tables has attracted a lot of attention. However, due to the lack of high-quality table datasets, the performance of deep learning networks is limited. Therefore, TableRobot has been proposed, an automatic annotation method for heterogeneous tables. To be more specific, the annotations of a table consist of the coordinates of the item blocks and the mapping relationship between item blocks and table cells. In order to transform the task, we successfully design an algorithm based on a greedy approach to find the optimum solution. To evaluate the performance of TableRobot, we checked the annotation data of 3,000 tables collected from LaTeX documents on arXiv.com, and the result shows that TableRobot can generate table annotation datasets with an accuracy of 93.2%. Besides, when the table annotation data is fed into GraphTSR, a state-of-the-art table recognition graph neural network, the F1 value of the network increases by nearly 10% compared with before.
... The T-Recs system proposed by Kieninger et al. used a bottom-up clustering approach to detect the word segments within the image and then combined them according to some predefined rules to obtain the conceptual text blocks (3). Yildiz et al. developed the pdf2table system, which employs multiple heuristics to identify tables in PDF files (4). Koci et al. adopted a graphic model to represent the layout and spatial features of the potential forms within a page and then identified the form as a subgraph using a genetic algorithm (5). ...
Article
Full-text available
Background: Complete electronic health records (EHRs) are often not available, because of information barriers caused by differences in the level of informatization and in the type of EHR system. Therefore, we aimed to develop a deep learning system for structured recognition of text images from unstructured paper-based medical reports (UPBMRs), called DeepSSR, to help physicians solve the data-sharing problem. Methods: UPBMR images were first preprocessed through binarization, image correction, and image segmentation. Next, the table area was detected with a lightweight network (the proposed YOLOv3-MobileNet model). The text in the table area was then detected and recognized with a model based on differentiable binarization (DB) and a convolutional recurrent neural network (CRNN). Finally, the recognized text was structured according to its row and column coordinates. DeepSSR was trained and validated on our dataset of 4,221 UPBMR images, which were randomly split into training, validation, and testing sets in a ratio of 8:1:1. Results: DeepSSR achieved a high accuracy of 91.10% and a speed of 0.668 s per image. Within the system, the proposed YOLOv3-MobileNet model for table detection achieved a precision of 97.8% and a speed of 0.006 s per image. Conclusions: DeepSSR achieves high accuracy and fast speed in the structured recognition of text from UPBMR images. This system may help solve the data-sharing problem caused by information barriers between hospitals with different EHR systems.
... This was done by detecting horizontal and vertical lines on the document to find the area enclosed by the lines and then using that area as the candidate area of the table. A few years later, Yildiz et al. [12] proposed a table detection method for PDF documents based on heuristic rules. The method first extracts text lines from the PDF and then merges them into a table according to the rules. ...
Article
Full-text available
As financial document automation becomes more widespread, table detection, as an important part of document automation, is receiving more and more attention. Disclosure documents contain both bordered and borderless tables of varying lengths, and there is currently no model that performs well on these types of documents. To solve this problem, we propose a table detection model, YOLO-table. We introduce involution into the backbone of the network to improve its ability to learn table spatial layout features, and we design a simple Feature Pyramid Network to improve model effectiveness. In addition, this paper proposes a table-based augmentation method. We experiment on a disclosure document dataset, and the results show that the F1-measure of YOLO-table reaches 97.3%. Compared with YOLOv3, our method improves accuracy by 2.8% and speed by 1.25 times. We also evaluate on the ICDAR2013 and ICDAR2019 Table Competition datasets and achieve state-of-the-art performance.
... Table extraction (TE) [1,2] is the problem of inferring the presence, structure, and, to some extent, meaning of tables in documents or other unstructured presentations. In its presented form, a table is typically expressed as a collection of cells organized over a two-dimensional grid [3,4]. ...
Preprint
Full-text available
In this paper, we propose a new class of evaluation metric for table structure recognition, grid table similarity (GriTS). Unlike prior metrics, GriTS evaluates the correctness of a predicted table directly in its natural form as a matrix. To create a similarity measure between matrices, we generalize the two-dimensional largest common substructure (2D-LCS) problem, which is NP-hard, to the 2D most similar substructures (2D-MSS) problem and propose a polynomial-time heuristic for solving it. We validate empirically using the PubTables-1M dataset that comparison between matrices exhibits more desirable behavior than alternatives for table structure recognition evaluation. GriTS also unifies all three subtasks of cell topology recognition, cell location recognition, and cell content recognition within the same framework, which simplifies the evaluation and enables more meaningful comparisons across different types of structure recognition approaches. Code will be released at https://github.com/microsoft/table-transformer.
... For PDF documents, heuristics-based approaches are popular and have achieved promising performance. Typically, heuristic methods [26,34] need to define various rules and use the metadata of the documents, which means that this type of approach cannot process document images. Besides, the generalization ability of heuristics-based methods is often limited because of variations in table structures. ...
Preprint
Full-text available
Tabular data in digital documents is widely used to express compact and important information for readers. However, it is challenging to parse tables from unstructured digital documents, such as PDFs and images, into a machine-readable format because of the complexity of table structures and the absence of meta-information. The Table Structure Recognition (TSR) problem aims to recognize the structure of a table and transform unstructured tables into a structured, machine-readable format so that the tabular data can be further analysed by downstream tasks such as semantic modeling and information retrieval. In this study, we hypothesize that a complicated table structure can be represented by a graph whose vertices and edges represent the cells and the associations between cells, respectively. We then define the table structure recognition problem as a cell association classification problem and propose a conditional attention network (CATT-Net). The experimental results demonstrate the superiority of our proposed method over the state-of-the-art methods on various datasets. Besides, we investigate whether the alignment of a cell bounding box or a text-focused approach has more impact on model performance. Due to the lack of public dataset annotations based on these two approaches, we further annotate the ICDAR2013 dataset with both types of bounding boxes, which can serve as a new benchmark dataset for evaluating methods in this field. Experimental results show that the alignment of a cell bounding box can help improve the micro-averaged F1 score from 0.915 to 0.963, and the macro-averaged F1 score from 0.787 to 0.923.
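The graph formulation described in this abstract can be pictured with a toy example: cells become vertices and every pair of cells receives an association label. This is only an illustration of the data representation, not CATT-Net itself; the cell IDs and labels are invented.

```python
from itertools import combinations

# Toy 2x2 table: (cell_id, row_index, column_index)
cells = [("A", 0, 0), ("B", 0, 1), ("C", 1, 0), ("D", 1, 1)]

edges = []
for (id1, r1, c1), (id2, r2, c2) in combinations(cells, 2):
    if r1 == r2:
        label = "same-row"
    elif c1 == c2:
        label = "same-column"
    else:
        label = "no-association"
    edges.append((id1, id2, label))

print(edges)
# [('A', 'B', 'same-row'), ('A', 'C', 'same-column'), ('A', 'D', 'no-association'),
#  ('B', 'C', 'no-association'), ('B', 'D', 'same-column'), ('C', 'D', 'same-row')]
```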
... This project utilizes computer vision techniques based on heuristics for table decomposition to detect and extract data from PDF files. The project was motivated by document analysis ideas found in the academic papers [17], [37]. The process contains five steps: (1) import all libraries, (2) convert the PDF file to text format and read the data, (3) apply regular expressions to extract keywords, (4) save the list of extracted keywords in a DataFrame, and (5) save the DataFrame results to a CSV file. ...
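The five steps listed in this snippet could be sketched roughly as follows. The PDF-to-text library (pdfminer.six) and the keyword pattern are assumptions made for illustration; the cited project does not specify them in this excerpt.

```python
import re                                         # step 1: imports
import pandas as pd
from pdfminer.high_level import extract_text

KEYWORDS = ["table", "heuristic", "extraction"]   # placeholder keywords, not the project's

def extract_keywords_to_csv(pdf_path: str, csv_path: str) -> pd.DataFrame:
    text = extract_text(pdf_path)                                      # step 2: PDF -> text
    pattern = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)
    matches = pattern.findall(text)                                    # step 3: regex extraction
    df = pd.DataFrame({"keyword": matches})                           # step 4: keep results in a DataFrame
    df.to_csv(csv_path, index=False)                                  # step 5: persist as CSV
    return df
```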
Article
Full-text available
Designing a database cost model is one of the main research topics related to the physical design phase. It follows the evolution of database technology in order to evaluate and quantify performance metrics (e.g., response time, energy consumption, etc.) and therefore makes the research community sensitive to the generated results. However, reusing and comparing database cost models requires extracting the related information manually from research publications. This process is error-prone and time-consuming. Unfortunately, many researchers report the difficulty of surveying and reproducing cost models already published in journal articles and/or reports. This difficulty is due to the absence of a process that formally describes the cost model itself as well as the context of its utilization. This article presents an approach enabling the extraction of cost model information (context, parameters, features, etc.) as a set of orchestrated services. These services are implemented using natural language processing and machine-learning techniques via a workflow pipeline inspired by DevOps practices. We illustrate our approach on a case study to stress the feasibility and benefits of our proposal by emphasizing its reproduction and automation facilities.
... Another aspect to consider is based on the type of document. For example, Correa and Zander [7] analyzed a group of methods and tools focused on extracting tabular content from PDF files, based on two main characteristics, ease of use and output results, and categorized the tools as theoretical proposals, free, or commercial. In [8], several heuristics were developed that together recognize and decompose tables in PDF files and store the extracted data in a structured data format (XML) to facilitate their use; these heuristics are divided into two groups: table recognition and table decomposition. Other techniques were presented in [9] for extracting tabular data from PDF documents in order to identify table boundaries, where the authors describe a methodology that applies two machine learning algorithms, CRF and support vector machines (SVM). ...
... Tables can be distinguished according to their structure and orientation. A relational or horizontal table [8], such as the one illustrated in Table 1, has rows that provide data about specific objects called entities and columns that represent attributes describing the entities. There are more complex tables, such as those where the attributes describing the entities are placed vertically and the entities horizontally, or other kinds of structures such as those shown in Table 2 and Table 3. • The data nodes are organized as an (n, m) matrix consisting of n rows and m columns: ...
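The relational (horizontal) layout described in this excerpt, with rows as entities and columns as attributes, amounts to an n x m grid of data nodes. The toy values below are invented purely to show the shape of that representation.

```python
# Attributes (columns) and entities (rows) of a toy relational/horizontal table.
columns = ["characteristic", "1995", "1998"]
rows = [
    ["Group A", "1.0", "2.0"],
    ["Group B", "3.0", "4.0"],
]

n, m = len(rows), len(columns)                 # n rows (entities) x m columns (attributes)
assert all(len(row) == m for row in rows)      # every data node fits the (n, m) matrix
print(f"{n} entities x {m} attributes")
```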
Article
Full-text available
Tables are a very common way to organize and publish data. For example, the Web holds an enormous number of tables published in HTML, embedded in PDF documents, or simply available for download from Web pages. However, tables are not always easy to interpret, since they have a great variety of characteristics and are organized in different formats. In fact, a large number of methods and tools have been developed for table interpretation. This work presents the implementation of an algorithm, based on Conditional Random Fields (CRF), to classify the rows of a table as header rows, data rows, or metadata rows. The implementation is complemented with two algorithms for recognizing tables in spreadsheets, based respectively on rules and on region detection. Finally, the work describes the results and benefits obtained by applying the algorithm to HTML tables obtained from the Web and to spreadsheet tables downloaded from the website of the National Petroleum Agency of Brazil.
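As an illustration of the row-classification idea in this abstract (not the authors' implementation), the sketch below labels table rows as header, data, or metadata with a linear-chain CRF using the sklearn-crfsuite package; the features and the tiny training sequence are invented.

```python
import sklearn_crfsuite

def row_features(cells: list[str]) -> dict:
    """Hand-crafted, illustrative features for one table row."""
    filled = [c for c in cells if c.strip()]
    return {
        "num_filled": len(filled),
        "all_numeric": all(c.replace(".", "", 1).isdigit() for c in filled),
        "has_year": any(c.isdigit() and len(c) == 4 for c in filled),
    }

# One training sequence per table: per-row feature dicts and their labels.
X_train = [[row_features(["Characteristic", "1995", "1998"]),
            row_features(["Group A", "12.1", "13.4"]),
            row_features(["Source: toy example", "", ""])]]
y_train = [["header", "data", "metadata"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))   # e.g. [['header', 'data', 'metadata']]
```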
... Tabula was created by Manuel Aristaran et al., with the first release made available in early 2013 as an open-source project. The developers stated that they were inspired by academic papers [13,43] about the analysis and extraction of tabular content. Tabula is available as a Java library. ...