TableNet: Deep Learning model for end-to-end
Table detection and Tabular data extraction from
Scanned Document Images
Shubham Paliwal, Vishwanath D, Rohit Rahul, Monika Sharma, Lovekesh Vig
TCS Research, New Delhi, India,
Email : {shubham.p3, vishwanath.d2, monika.sharma1, rohit.rahul, lovekesh.vig}@tcs.com
Abstract—With the widespread use of mobile phones and
scanners to photograph and upload documents, the need for
extracting the information trapped in unstructured document
images such as retail receipts, insurance claim forms and financial
invoices is becoming more acute. A major hurdle to this objective
is that these images often contain information in the form of
tables and extracting data from tabular sub-images presents a
unique set of challenges. This includes accurate detection of the
tabular region within an image, and subsequently detecting and
extracting information from the rows and columns of the detected
table. While some progress has been made in table detection,
extracting the table contents is still a challenge since this involves
more fine-grained table structure (rows & columns) recognition.
Prior approaches have attempted to solve the table detection and
structure recognition problems independently using two separate
models. In this paper, we propose TableNet: a novel end-to-
end deep learning model for both table detection and structure
recognition. The model exploits the interdependence between the
twin tasks of table detection and table structure recognition to
segment out the table and column regions. This is followed by
semantic rule-based row extraction from the identified tabular
sub-regions. The proposed model and extraction approach was
evaluated on the publicly available ICDAR 2013 and Marmot
Table datasets obtaining state of the art results. Additionally,
we demonstrate that feeding additional semantic features further
improves model performance and that the model exhibits transfer
learning across datasets. Another contribution of this paper is to
provide additional table structure annotations for the Marmot
data, which currently only has annotations for table detection.
Index Terms—Table Detection, Table Structure Recognition,
Scanned Documents, Information Extraction
I. INTRODUCTION
With the proliferation of mobile devices equipped with
cameras, an increasing number of customers are uploading
documents via these devices, making the need for information
extraction from these images more pressing. Currently, these
document images are often manually processed resulting in
high labour costs and inefficient data processing times. Fre-
quently, these documents contain data stored in tables with
multiple variations in layout and visual appearance. A key
component of information extraction from these documents
therefore involves digitizing the data present in these tabular
sub-images. The variation in the table structure, and in the
graphical elements used to visually separate the tabular com-
ponents make extraction from these images a very challenging
problem. Most existing approaches to tabular information ex-
traction divide the problem into the two separate sub-problems
of 1) table detection and 2) table structure recognition, and
attempt to solve each sub-problem independently. While table
detection involves detection of the image pixel coordinates
containing the tabular sub-image, tabular structure recognition
involves segmentation of the individual rows and columns in
the detected table.
In this paper, we propose TableNet, a novel end-to-end
deep learning model that exploits the inherent interdependence
between the twin tasks of table detection and table structure
identification. The model utilizes a base network that is
initialized with pre-trained VGG-19 features. This is followed
by two decoder branches for 1) Segmentation of the table
region and 2) Segmentation of the columns within a table
region. Subsequently, rule based row extraction is employed
to extract data in individual table cells.
A multi-task approach is used for the training of the deep
model. The model takes a single input image and produces
two different semantically labelled output images for tables
and columns. The model shares the encoding layer of VGG-
19 for both the table and column detectors, while the decoders
for the two tasks are separate. The shared common layers are
repeatedly trained from the gradients received from both the
table and column detectors while the decoders are trained
independently. Semantic information about elementary data
types is then utilized to further boost model performance. The
utilization of the VGG-19 as a base network, which is pre-
trained on the ImageNet dataset allows for exploitation of prior
knowledge in the form of low level features learnt via training
over ImageNet.
We have evaluated TableNet’s performance on the ICDAR-
2013 dataset, demonstrating that our approach marginally
outperforms other deep models as well as other state-of-the-
art methods in detecting and extracting tabular information in
image documents. We further demonstrate that the model can
generalize to other datasets with minimal fine tuning, thereby
enabling transfer learning. Furthermore, the Marmot dataset
which has previously been annotated for table detection was
also manually annotated for column detection, and these new
annotations will be publicly released to the community for
future research.
In summary, the primary contributions made in this paper
are as follows:
1) We propose TableNet: a novel end-to-end deep multi-
task architecture for both table detection and structure
recognition yielding state of the art performance on the
public benchmark ICDAR and Marmot datasets.
2) We demonstrate that adding additional spatial semantic
features to TableNet during training further boosts model
performance.
3) We show that using a pre-trained TableNet model and
fine-tuning it on another new dataset will boost the
performance of the model on the new dataset, thereby
allowing for transfer learning.
4) We have manually annotated the Marmot dataset for table
data extraction and will release the annotations to the
community.
The rest of the paper is organized as follows: Section II
provides an overview of the related work on tabular infor-
mation extraction. Section III provides a detailed description
of the TableNet model. Section IV outlines the extraction
process with TableNet. Section V provides details about the
datasets, preprocessing steps and training. Section VI outlines
the experiment details and results. Finally, the conclusions and
future work are presented in Section VII.
II. RELATED WORK
There is significant prior work on identifying and extracting
the tabular data inside a document. Most of these have reported
results on table detection and data extraction separately [1].
Before the advent of deep learning, most of the work on
table detection was based on heuristics or metadata. TINTIN
[2] exploited structural information to identify tables and their
component fields. [3] used hierarchical representations based
on the MXY tree for table detection and was the first attempt at
using Machine Learning techniques for this problem. T Kasar
et al. [4] identified intersecting horizontal, vertical lines and
low-level features and used an SVM classifier to classify an
image region as a table region or not.
Probabilistic graphical models were also used to detect ta-
bles; Silva et al. [5] modelled the joint probability distribution
over sequential observations of visual page elements and the
hidden state of a line (HMM) to merge potential table lines
into tables, resulting in a high degree of completeness. Jing Fang
et al. [6] used the table header as a starting point to detect
the table region and decompose its elements. Raskovic et al.
[7] made an attempt to detect borderless tables. They utilized
whitespaces as a heuristic rather than content for detection.
Recently, DeepDeSRT [8] was proposed which uses deep
learning for both table detection and table structure recog-
nition, i.e. identifying rows, columns, and cell positions in
the detected tables. This work achieves state-of-the-art per-
formance on the ICDAR 2013 table competition dataset.
After this, [9] combined deep convolutional neural networks,
graphical models and saliency concepts for localizing tables
and charts in documents. This technique was applied on an
extended version of ICDAR 2013 table competition dataset
and outperforms existing models. [10] locates the text compo-
nents and extracts text blocks. After that, the height of each
text block is compared with the average height and, if it satisfies
a series of rules, the ROI is regarded as a table.
T-Recs [11] was one of the earliest works to extract tabular
data based on clustering of given word segments and overlap
of the text inside the table. Y. Wang et al. [12] estimate prob-
abilities from geometric measurements made on the various
entities in a given document.
Ashwin et al. [13] exploit the formatting cues from semi-
structured HTML tables to extract data from web pages. Here
the cells are already demarcated by tags since they are in
HTML tables. Singh et al. [14] use object detection techniques
for Document Layout understanding.
III. TABLENET: DEEP MODEL FOR TABLE AND COLUMN DETECTION
In all prior deep learning based approaches, table detection
and column detection are considered separately as two differ-
ent problems, which can be solved independently. However, intuitively, if all the columns present in a document are known a priori, the table region can be determined easily. At the same time, columns are by definition vertically aligned word/numerical blocks, so searching for columns independently can produce many false positives; knowledge of the tabular region greatly improves results for column detection, since tables and columns share common regions. Therefore, if the convolutional filters used to detect tables can be reinforced by column-detecting filters, the performance of the model should improve significantly. Our proposed model exploits this intuition and is based on the encoder-decoder model of Long et al. [15] for semantic segmentation. The encoder of the
model is common across both tasks, but the decoder emerges
as two different branches for tables and columns. Concretely,
we train the encoding layers using the ground truth of both the tables and the columns of the document, while the decoding layers are separated into table and column branches. Thus, there are two computational graphs to train.
The input image is first transformed into an RGB image and then resized to 1024 × 1024 resolution. This modified image is processed using Tesseract OCR [16] as described in Section V-B. Since a single model produces both output masks for the table and column regions, these two independent outputs have binary target pixel values, depending on whether a pixel belongs to the table/column region or to the background.
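For illustration, a minimal sketch of this input preprocessing step is given below. It assumes OpenCV and pytesseract as the imaging and OCR interfaces (the paper only specifies the RGB conversion, the 1024 × 1024 resize and the use of Tesseract OCR); the helper name and return format are illustrative.

```python
# Minimal preprocessing sketch (assumptions: OpenCV + pytesseract; the paper
# only specifies RGB conversion, resizing to 1024x1024 and Tesseract OCR [16]).
import cv2
import pytesseract
from pytesseract import Output

def preprocess(path):
    img = cv2.imread(path)                       # source image (BGR)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # transform into an RGB image
    img = cv2.resize(img, (1024, 1024))          # fixed model input resolution
    # Word-level OCR; the bounding boxes are reused later for row extraction
    # and for colour-coding word patches (Section V).
    ocr = pytesseract.image_to_data(img, output_type=Output.DICT)
    words = [(ocr['left'][i], ocr['top'][i], ocr['width'][i],
              ocr['height'][i], ocr['text'][i])
             for i in range(len(ocr['text'])) if ocr['text'][i].strip()]
    return img, words
```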
The problem of detecting tables in documents is similar to
the problem of detecting objects in real world images. Similar
to the generic object detection problem, visual features of the
tables can be used to detect tables and columns. The difference
is that the tolerance for noise in table/column detection is much
smaller than in object detection. Therefore, instead of regress-
ing for the boundaries of tables and columns, we employed a
method to predict table and column regions pixel-wise. Recent
work on semantic segmentation based on pixel wise prediction,
has been very successful. FCN architecture, proposed by Long
et al. [15], has demonstrated the accuracy of encoder-decoder
network architectures for semantic segmentation.

Fig. 1: (a) Sample training image from the Marmot dataset, with highlighted text. (b) TableNet: the proposed model uses pre-trained VGG-19 layers as the base network. Layers conv1 to pool5 serve as common encoder layers for both the table and column graphs. Two decoder branches, conv7_column and conv7_table, emerge after the encoder layers and generate separate table and column predictions.

The FCN architecture uses the skip-pooling technique to combine the
low-resolution feature maps of the decoder network with
the high-resolution features of encoder networks. VGG-16 is
used as the base layer in their model and fractionally-strided
convolution layers are used to upscale the found low-resolution
semantic map which is then combined with high resolution
encoding layers.
Our model uses the same intuition for the encoder/decoder
network as the FCN architecture. Our proposed model as
shown in Figure 1, uses a pre-trained VGG-19 layer as the base
network. The fully connected layers (layers after pool5) of
VGG-19 are replaced with two (1x1) convolution layers. Each
of these convolution layers (conv6) uses a ReLU activation followed by a dropout layer with probability 0.8 (conv6
+ dropout as shown in Figure 1). Following this layer, two
different branches of the decoder network are appended. This
is according to the intuition that the column region is a subset
of the table region. Thus, the single encoding network can
filter out the active regions with better accuracy using features
of both table and column regions. The output of the (conv6 +
dropout) layer is distributed to both decoder branches. In each
branch, additional layers are appended to filter out the respec-
tive active regions. In the table branch of the decoder network,
an additional (1x1) convolution layer, conv7 table is used,
before using a series of fractionally strided convolution layers
for upscaling the image. The output of the conv7 table layer
is also up-scaled using fractionally strided convolutions, and is
appended with the pool4 pooling layer of the same dimension.
Similarly, the combined feature map is again up-scaled and
the pool3 pooling is appended to it. Finally, the final feature
map is upscaled to meet the original image dimension. In
the other branch for detecting columns, there is an additional
convolution layer (conv7 column) with a ReLU activation
function and a dropout layer with the same dropout probability.
The feature maps are up-sampled using fractionally strided
convolutions after a (1x1) convolution (conv8 column) layer.
The up-sampled feature maps are combined with the pool4
pooling layer and the combined feature map is up-sampled and
combined with the pool3 pooling layer of the same dimension.
After this layer, the feature map is up-scaled to the original
image. In both branches, multiple (1x1) convolution layers are
used before the transposed layers. The intuition behind using (1x1) convolutions is to reduce the number of feature-map channels used for pixel-wise class prediction, since these output maps must have channels equal to the number of classes (the channel with the maximum probability is assigned to the corresponding pixel) before being up-sampled.

Fig. 2: Sample document image and the output masks generated by TableNet: (a) the raw document image, (b) the generated table mask, and (c) the generated column mask.
Therefore, the outputs of the two branches of computational
graphs yield the mask for the table and column regions.
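To make the topology concrete, a minimal Keras-style sketch of the two-branch encoder-decoder is given below. Only the overall structure (a shared VGG-19 encoder with conv6 + dropout, and two decoder branches with pool4/pool3 skip connections and fractionally strided convolutions) follows the description above; the filter counts, the use of concatenation for the skip connections, the sigmoid mask outputs and the Keras layer names are assumptions made for illustration.

```python
# Sketch of the TableNet topology (assumptions: Keras VGG19 layer names,
# filter counts, concatenation-based skip connections, sigmoid mask outputs).
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_tablenet(input_shape=(1024, 1024, 3), dropout_rate=0.8):
    base = tf.keras.applications.VGG19(include_top=False, weights='imagenet',
                                       input_shape=input_shape)
    pool3 = base.get_layer('block3_pool').output   # 128 x 128
    pool4 = base.get_layer('block4_pool').output   #  64 x  64
    pool5 = base.get_layer('block5_pool').output   #  32 x  32

    # conv6: two (1x1) convolutions replacing the fully connected layers,
    # each followed by dropout (the paper states a probability of 0.8).
    x = layers.Conv2D(512, 1, activation='relu', name='conv6_1')(pool5)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Conv2D(512, 1, activation='relu', name='conv6_2')(x)
    x = layers.Dropout(dropout_rate)(x)

    def decoder(feat, prefix):
        # (1x1) conv7, then fractionally strided convolutions combined with
        # the pool4 / pool3 maps, finally upscaled to the input resolution.
        f = layers.Conv2D(256, 1, activation='relu',
                          name='conv7_' + prefix)(feat)
        f = layers.Dropout(dropout_rate)(f)
        f = layers.Conv2DTranspose(256, 4, strides=2, padding='same')(f)  # 64x64
        f = layers.Concatenate()([f, pool4])
        f = layers.Conv2DTranspose(128, 4, strides=2, padding='same')(f)  # 128x128
        f = layers.Concatenate()([f, pool3])
        f = layers.Conv2DTranspose(64, 16, strides=8, padding='same')(f)  # 1024x1024
        return layers.Conv2D(1, 1, activation='sigmoid',
                             name=prefix + '_mask')(f)

    return Model(inputs=base.input,
                 outputs=[decoder(x, 'table'), decoder(x, 'column')])
```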
IV. TABLE ROW EXTRACTION
After processing the documents using TableNet, masks for
table and column regions are generated. These masks are used
to filter out the table and its column regions from the image.
Since all word positions in the document are already known
(using Tesseract OCR), only the word patches lying inside
table and column regions are filtered out. Now, using these
filtered words, a row can be defined as the collection of words
from multiple columns, which lie at a similar horizontal
level. However, a row is not necessarily confined to a single
line, and depending upon the content of a column or line
demarcations, a row can span multiple lines. Therefore, to cover the different possibilities, we formulate three rules for row segmentation (a simplified sketch of the resulting grouping logic is given after the list):
1) In most tables for which line demarcations are present, the lines segment the rows in each column. To detect a possible line demarcation (for rows), every space between two vertically placed words in a column is tested for the presence of a line via a Radon transform. The presence of a horizontal line demarcation clearly segments out the row.
2) In case a row spans multiple lines, the row of the table which has the maximum number of non-blank entries is marked as the starting point of a new row. For example, in a multi-column table, some of the columns can have entries spanning just one line (like quantity), while others can have multi-line entries (like description). Thus, each new row begins when all the entries in each column are filled.
3) In tables where all the columns are completely filled and there are no line demarcations, each line (level) can be seen as a unique row.
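A simplified sketch of this grouping logic is given below. It covers rules 2 and 3 only (words already restricted to the table region and assigned a column index are grouped into rows); the Radon-transform line test of rule 1 is omitted, and the tolerance value, helper name and input format are assumptions.

```python
# Illustrative sketch of rules 2 and 3: words filtered to the table region and
# assigned a column index are grouped into rows. The Radon-transform line test
# of rule 1 is omitted; the tolerance and input format are assumptions.
def group_rows(words, n_columns, y_tol=10):
    """words: iterable of (x, y_center, column_index, text)."""
    rows, current, filled = [], [], set()
    for x, y, col, text in sorted(words, key=lambda w: w[1]):
        new_level = current and abs(y - current[-1][1]) > y_tol
        # A new row starts at a new horizontal level once every column of the
        # current row already holds an entry (rules 2 and 3).
        if new_level and len(filled) == n_columns:
            rows.append(current)
            current, filled = [], set()
        current.append((x, y, col, text))
        filled.add(col)
    if current:
        rows.append(current)
    return rows
```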
V. DATASET PREPARATION
Deep-learning based approaches are data-intensive and re-
quire large volumes of training data for learning effective
representations. Unfortunately, there are very few datasets like
Marmot [17], UW3 [18], etc., for table detection, and even
these contain only a few hundred images. There are even
fewer datasets for table structure identification such as the
ICDAR 2013 table competition dataset for both table detection
and its structural analysis [19]. This creates a constraint for
deep learning models to solve both table detection and table
structural analysis.
For training our model, we have used the Marmot table recognition dataset. This is the largest publicly available dataset for table detection, but it unfortunately had no annotations for table columns or rows. Since the dataset had ground truth only for table detection, we manually annotated it for table structure recognition by labeling bounding boxes around each of the columns inside the tabular region. The manually annotated modified dataset is publicly released under the name Marmot Extended for table structure recognition (see footnote 1).
A. Providing Semantic Information
Intuitively, any table has common data types in the same
row/column depending on whether the table is in row major or
column major form. For example, a name column will contain
strings, while a quantity header will contain numbers. To pro-
vide this semantic information to the deep model, text regions
with similar data types are color coded. This modified image
is then fed to the network resulting in improved performance.
We have included spatial semantic features by highlighting
the words with patches as shown in Figure 1 (a) and this
dataset is also made publicly available. The document images
are first processed with histogram equalization. After pre-
processing, the word blocks are extracted using Tesseract OCR.
These word patches are colored depending upon their basic
data type. The resulting modified images are used as input to
the network. The TableNet model takes the input image and
generates the binary mask image of both table and columns
separately. The result is then filtered using the row-extraction rules outlined in Section IV, on the basis of the detected table and column masks. An
example of the generated output is shown in Figure 2.
B. Training Data Preparation for TableNet
To provide the basic semantic type information to the
model, the word patches are color coded. The image is first
Footnote 1: https://drive.google.com/drive/folders/1QZiv5RKe3xlOBdTzuTVuYRxixemVIODp
TABLE I: Results on Table Detection

Model                                              | Recall | Precision | F1-Score
TableNet + Semantic Features (fine-tuned on ICDAR) | 0.9628 | 0.9697    | 0.9662
TableNet + Semantic Features                       | 0.9621 | 0.9547    | 0.9583
TableNet                                           | 0.9501 | 0.9547    | 0.9547
DeepDeSRT [8]                                      | 0.9615 | 0.9740    | 0.9677
Tran et al. [10]                                   | 0.9636 | 0.9521    | 0.9578
TABLE II: Results on Table Structure Recognition & Data Extraction

Model                                              | Recall | Precision | F1-Score
TableNet + Semantic Features (fine-tuned on ICDAR) | 0.9001 | 0.9307    | 0.9151
TableNet + Semantic Features                       | 0.8994 | 0.9255    | 0.9122
TableNet                                           | 0.8987 | 0.9215    | 0.9098
DeepDeSRT [8]                                      | 0.8736 | 0.9593    | 0.9144
processed with Tesseract OCR to get all the word patches in the image document. Then the words are processed via regular
expressions to determine their data-type. The intuition is to
color the word bounding boxes to impart both the semantic and
spatial information to the network. Each data type is given a
unique color and, bounding-boxes of words with similar data-
types are shaded in the same color. Word bounding boxes
are filtered out to remove spurious detections. However, since
word detection and extraction from the OCR will produce
some noise, the model needs to learn to be resilient to these
cases. Therefore to simulate the case of incomplete word
detection in the training image, a few randomly selected word
patches are dropped deliberately. The resulting color-coded image could be used for training on its own, but many visual features like line demarcations, corners and color highlights would be lost if only the word-annotated document image were used. Therefore, to
retain those important visual features in the training data, the
word highlighted image is pixel-wise added to the original
image. These modified document images are then used for
training.
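A minimal sketch of this preparation step is given below. The regular expressions, colour palette, drop probability and blending weights are illustrative assumptions; only the overall procedure (regex-based typing of OCR word boxes, colour coding, random word dropping, and pixel-wise addition of the highlight image to the original) follows the description above.

```python
# Sketch of training-image preparation: colour-code word boxes by data type,
# randomly drop a few boxes to simulate incomplete OCR detection, and add the
# highlight image pixel-wise to the original. Regexes, colours, probabilities
# and blending weights are assumptions.
import re
import random
import numpy as np
import cv2

TYPE_COLOURS = {                 # assumed palette: one colour per data type
    'number':   (255, 0, 0),
    'alpha':    (0, 255, 0),
    'alphanum': (0, 0, 255),
}

def word_type(text):
    if re.fullmatch(r'[\d.,%-]+', text):
        return 'number'
    if re.fullmatch(r'[A-Za-z]+', text):
        return 'alpha'
    return 'alphanum'

def prepare_training_image(image, words, drop_prob=0.05):
    """image: HxWx3 uint8 array; words: list of (x, y, w, h, text) boxes."""
    highlight = np.zeros_like(image)
    for (x, y, w, h, text) in words:
        if random.random() < drop_prob:          # deliberately drop some words
            continue
        colour = TYPE_COLOURS[word_type(text)]
        cv2.rectangle(highlight, (x, y), (x + w, y + h), colour, thickness=-1)
    # Weighted pixel-wise addition retains line demarcations and other
    # visual features of the original document.
    return cv2.addWeighted(image, 1.0, highlight, 0.5, 0)
```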
VI. EXPERIMENTS AND RESULTS
This section describes the different experiments performed
on the ICDAR 2013 table competition dataset [19]. The model
performance is evaluated based on the Recall, Precision & F1-
score. These measures are computed for each document and
their average is taken across all the documents.
(a) Table Detection: Completeness and Purity are the two
standard measures used in page segmentation [20]. A
region is complete if it includes all sub-objects present in
the ground-truth. A region is pure if it does not include
any sub-objects which are not in the ground-truth. Sub-
objects are created by dividing the given region into
meaningful parts like heading of a table, body of a table
etc. But these measures do not discriminate between
minor and major errors. So, individual characters in each
region are treated as sub-objects. Precision and recall
measures are calculated on these sub-objects in each
region and the average is taken across all the regions
in a given document.
(b) Table Data Extraction: Each entry inside a table is defined
as a cell. For each cell, adjacency relations are gener-
ated with the nearest horizontal and vertical neighbours.
Thus, the adjacency relations of a given cell form a 1D-tuple
containing the text information of its neighboring cells.
The content in each cell was normalized; white spaces
were removed, special characters were replaced with
underscores and lowercase letters with uppercase. This
1D-tuple can then be compared with the ground truth by
using precision and recall.
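As a small illustration, the cell-content normalisation described above might look as follows; the exact set of characters treated as "special" is an assumption.

```python
# Sketch of cell-content normalisation for adjacency-relation comparison:
# white spaces removed, special characters replaced with underscores, and
# lowercase letters replaced with uppercase. Which characters count as
# "special" is an assumption.
import re

def normalise_cell(text):
    text = re.sub(r'\s+', '', text)              # remove white spaces
    text = re.sub(r'[^A-Za-z0-9]', '_', text)    # special characters -> '_'
    return text.upper()                          # lowercase -> uppercase

# Example: normalise_cell("Net amount: $1,200") == 'NETAMOUNT__1_200'
```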
TableNet requires both table and structure annotated data
for training. We used the Marmot table detection data and
manually annotated the structure information. There are a total
of 1016 documents containing tables including both Chinese
and English documents, out of which 509 English documents
are annotated and used for training. The proposed deep model has been implemented in TensorFlow and run on a system with an Intel(R) Xeon(R) Silver CPU (32 cores), 128 GB of RAM and a Tesla V100-PCIE GPU with 16 GB of GPU memory. In Experiment 1, we trained our model on
all positive samples of Marmot and tested on the ICDAR
2013 table competition dataset for both table and structure
detection. There are two computation graphs which require
training. Each training sample is a tuple of a document
image, table mask and column mask. With each training tuple,
the two graphs are computed at least twice. In the initial
phase of training, the table branch and column branch are
computed in the ratio of 2:1. With each training tuple, the
table branch of the computational graph is computed twice,
and then the column branch of the model is computed. It
is worth noting that, although the table branch and column
branch are different, the encoder is the same for both. During
initial iterations of training, the learning is more focused on
training the model to generate big active tabular regions which
on subsequent training specializes to column regions. After
around 500 iterations with a batch size of 2, when the training losses of both the table and column detectors are comparable, this training
scheme is modified. However, it should be noted that the
table classifier at this stage must exhibit a converging trend
(otherwise, training is extended with the same 2:1 scheme).
The model is then trained in the ratio of 1:1 for both branches
until convergence. Using the same training pattern, the model
is trained for 5000 iterations with a batch size of 2 and
learning rate of 0.0001. The Adam optimizer is used for
improving and optimizing training with parameters beta1=0.9,
beta2=0.999 and epsilon=1e-08. The convergence and over-
fitting behavior was monitored by observing the performance
over the validation set (taken from ICDAR 2013 data-set).
During testing, 0.99 is taken as the threshold probability for
pixel-wise prediction. The results are compiled in Table I and
Table II.
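A schematic sketch of this alternating 2:1 / 1:1 training schedule is given below. It assumes the two-output model of the earlier architecture sketch, a binary cross-entropy loss and a tf.data pipeline, none of which is specified at this level of detail above; the switch to the 1:1 scheme is shown as a fixed iteration count rather than the convergence check described in the text.

```python
# Sketch of the alternating training schedule: the table branch is updated
# twice per column-branch update for roughly the first 500 iterations, then
# the two branches are updated 1:1. Loss choice, the fixed switch-over point
# and the dataset interface are assumptions.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9,
                                     beta_2=0.999, epsilon=1e-8)
bce = tf.keras.losses.BinaryCrossentropy()

def train_step(model, image, table_mask, column_mask, branch):
    """One gradient step on a single branch ('table' or 'column')."""
    with tf.GradientTape() as tape:
        pred_table, pred_column = model(image, training=True)
        loss = bce(table_mask, pred_table) if branch == 'table' \
            else bce(column_mask, pred_column)
    grads = tape.gradient(loss, model.trainable_variables)
    # The shared encoder receives gradients from both branches; variables of
    # the other branch's decoder get no gradient for this step.
    optimizer.apply_gradients(
        (g, v) for g, v in zip(grads, model.trainable_variables)
        if g is not None)
    return loss

def train(model, dataset, total_iters=5000, warmup_iters=500):
    """dataset: tf.data.Dataset yielding (image, table_mask, column_mask)."""
    for step, (image, table_mask, column_mask) in enumerate(dataset.take(total_iters)):
        schedule = (['table', 'table', 'column'] if step < warmup_iters
                    else ['table', 'column'])                 # 2:1 then 1:1
        for branch in schedule:
            train_step(model, image, table_mask, column_mask, branch)
```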
Similarly, in Experiment 2, we used the modified Marmot dataset, where the words in each document were highlighted to provide semantic context as described in Section V-B. All the parameters were identical to Experiment 1. There was a slight improvement in the results when this spatial semantic information was added to the images (see Tables I and II for comparison). The output of the model is shown in Figure 2. Additionally,
Experiment 3 was carried out to compare TableNet with
the closest deep-learning based solution, DeepDeSRT [8]. In DeepDeSRT, separate models are built for table detection and
structure recognition, which were trained on different datasets
such as Marmot for table detection, and the ICDAR 2013 table
dataset for table structure recognition. To generate comparable
results, we fine-tuned our Marmot trained TableNet model,
on the ICDAR train and test data splits. As done in DeepDeSRT,
we also randomly chose 34 images for testing and used the
rest of the data images for fine-tuning our model. Our model
was fine-tuned, with the same parameters, in the ratio of 1:1
for both branches for 3000 iterations with a batch size of
2. The performance of our model further improved after fine-tuning, possibly due to exposure to the domain of ICDAR documents. The results of this experiment are also
compiled in Tables I and II. The average time taken by our system per document image is 0.3765 seconds; however, this could not be compared with DeepDeSRT as their model was not publicly available. While the results are not conclusively better than DeepDeSRT, they are certainly comparable, and the
fact that our model is end-to-end means further improvements
can be made with richer semantic knowledge, and additional
branches for learning row based segmentation.
VII. CONCLUSION
This paper presents TableNet, a novel deep learning model
trained on dual tasks of table detection and structure recog-
nition in an end-to-end fashion. Existing approaches to infor-
mation extraction treat detection and structure recognition as
two distinct problems to be solved independently. TableNet
is the first model to jointly address both tasks simultane-
ously, by exploiting the inherent interdependence between
table detection and table structure identification. TableNet
utilizes the knowledge from previously learned tasks and can
transfer that knowledge to newer, related ones demonstrating
transfer learning. This is particularly useful when the training
data is sparse. We also show that highlighting the text to
provide data-type information improves the performance of
the model. In the future, we intend to train TableNet with a third branch to identify rows; however, this will require a manual annotation exercise, as current datasets do not provide row-based annotations. Another question that arises from these experiments is what other semantic knowledge could be provided for better model performance; perhaps the use of more abstract data types, such as currency, country or city, might be useful.
REFERENCES
[1] D. W. Embley, M. Hurst, D. Lopresti, and G. Nagy, “Table-processing
paradigms: a research survey,” International Journal of Document Anal-
ysis and Recognition (IJDAR), vol. 8, no. 2-3, pp. 66–86, 2006.
[2] P. Pyreddy and W. B. Croft, “Tintin: A system for retrieval in text tables,”
in Proceedings of the second ACM international conference on Digital
libraries. ACM, 1997, pp. 193–200.
[3] F. Cesarini, S. Marinai, L. Sarti, and G. Soda, “Trainable table location
in document images,” in Pattern Recognition, 2002. Proceedings. 16th
International Conference on, vol. 3. IEEE, 2002, pp. 236–240.
[4] T. Kasar, P. Barlas, S. Adam, C. Chatelain, and T. Paquet, “Learning
to detect tables in scanned document images using line information,” in
ICDAR, 2013, pp. 1185–1189.
[5] A. C. e Silva, “Learning rich hidden markov models in document
analysis: Table location,” in Document Analysis and Recognition, 2009.
ICDAR’09. 10th International Conference on. IEEE, 2009, pp. 843–
847.
[6] J. Fang, P. Mitra, Z. Tang, and C. L. Giles, “Table header detection and
classification.” in AAAI, 2012, pp. 599–605.
[7] M. Raskovic, N. Bozidarevic, and M. Sesum, “Borderless table detection
engine,” Jun. 5 2018, uS Patent 9,990,347.
[8] S. Schreiber, S. Agne, I. Wolf, A. Dengel, and S. Ahmed, “Deepdesrt:
Deep learning for detection and structure recognition of tables in
document images,” in Document Analysis and Recognition (ICDAR),
2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp.
1162–1167.
[9] I. Kavasidis, S. Palazzo, C. Spampinato, C. Pino, D. Giordano, D. Giuf-
frida, and P. Messina, “A saliency-based convolutional neural network
for table and chart detection in digitized documents,” arXiv preprint
arXiv:1804.06236, 2018.
[10] D. N. Tran, T. A. Tran, A. Oh, S. H. Kim, and I. S. Na, “Table
detection from document image using vertical arrangement of text
blocks,” International Journal of Contents, vol. 11, no. 4, pp. 77–85,
2015.
[11] T. Kieninger and A. Dengel, “The t-recs table recognition and analysis
system,” in International Workshop on Document Analysis Systems.
Springer, 1998, pp. 255–270.
[12] Y. Wang, I. T. Phillips, and R. M. Haralick, “Table structure understand-
ing and its performance evaluation,” Pattern Recognition, vol. 37, no. 7,
pp. 1479–1497, 2004.
[13] A. Tengli, Y. Yang, and N. L. Ma, “Learning table extraction from
examples,” in Proceedings of the 20th international conference on
Computational Linguistics. Association for Computational Linguistics,
2004, p. 987.
[14] P. Singh, S. Varadarajan, A. N. Singh, and M. M. Srivastava, “Multido-
main document layout understanding using few shot object detection,”
arXiv preprint arXiv:1808.07330, 2018.
[15] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” CoRR, vol. abs/1411.4038, 2014. [Online].
Available: http://arxiv.org/abs/1411.4038
[16] R. Smith, “An overview of the Tesseract OCR engine,” in Document
Analysis and Recognition, 2007. ICDAR 2007. Ninth International
Conference on, vol. 2. IEEE, 2007, pp. 629–633.
[17] J. Fang, X. Tao, Z. Tang, R. Qiu, and Y. Liu, “Dataset, ground-truth and
performance metrics for table detection evaluation,” in 2012 10th IAPR
International Workshop on Document Analysis Systems, March 2012,
pp. 445–449.
[18] I. Guyon, R. M. Haralick, J. J. Hull, and I. T. Phillips, “Data sets for ocr
and document image understanding research,” in In Proceedings of the
SPIE - Document Recognition IV. World Scientific, 1997, pp. 779–799.
[19] M. Göbel, T. Hassan, E. Oro, and G. Orsi, “ICDAR 2013 table competition,” in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013, pp. 1449–1453.
[20] A. C. e Silva, “Metrics for evaluating performance in document analysis:
application to tables,” International Journal on Document Analysis and
Recognition (IJDAR), vol. 14, no. 1, pp. 101–109, 2011.