DeepLayout: A Semantic Segmentation
Approach to Page Layout Analysis
Yixin Li, Yajun Zou, and Jinwen Ma
Department of Information Science, School of Mathematical Sciences
and LMAM, Peking University, Beijing 100871, China
Abstract. In this paper, we present DeepLayout, a new approach to page layout
analysis. Previous work divides the problem into unsupervised segmentation and
classification. Instead of a step-wise method, we adopt semantic segmentation
which is an end-to-end trainable deep neural network. Our proposed segmenta-
tion model takes only document image as input and predicts per pixel saliency
maps. For the post-processing part, we use connected component analysis to
restore the bounding boxes from the prediction map. The main contribution is
that we successfully bring RLSA into our post-processing procedures to specify
the boundaries. The experimental results on ICDAR2017 POD competition
dataset show that our proposed page layout analysis algorithm achieves a good
mAP score, outperforming most of the other competition participants.
Keywords: Page layout analysis · Document segmentation ·
Document image understanding · Semantic segmentation and deep learning
1 Introduction
Page layout analysis, also known as document image understanding and document
segmentation, plays an important role in massive document image analysis applications
such as OCR systems. Page layout analysis algorithms take document images or PDF
files as input, and the goal is to understand the documents by decomposing the images
into several structural and logical units, for instance, text, figure, table and formula.
This procedure is critical in document image processing applications, for it usually
brings better recognition results. For example, once we get the structural and semantic
information of the document images, we can feed only the text regions into the OCR
system to recognize text while the figures are saved directly. Thus, page layout analysis
has become a popular research topic in the computer vision community.
Most of the conventional methods [1–5] have two steps: segmentation and
classification. Firstly, the document images are divided into several regions, and then a
classifier is trained to assign them to a certain logical class. The major weakness of
these methods is that the unsupervised segmentation involves lots of parameters that
rely on experience, and one set of parameters can hardly fit all document layouts.
To tackle this problem, the most straightforward way is supervised localization or
segmentation. The parameters in supervised learning can be tuned automatically during
©Springer International Publishing AG, part of Springer Nature 2018
D.-S. Huang et al. (Eds.): ICIC 2018, LNAI 10956, pp. 266–277, 2018.
the training, which avoids the large number of human-defined rules and hand-crafted
parameters. On the other hand, supervised segmentation provides semantic information,
which means we can perform segmentation and classification simultaneously.
Since the final output of page layout analysis is a number of bounding boxes
and their corresponding labels, this problem can be framed as an object detection or
localization problem. Unfortunately, state-of-the-art object detection approaches
such as Faster R-CNN [6] and Single-Shot Multibox Detector (SSD) [7] have been
proven not to work very well in the page layout analysis case [1]. This is because
common object detection methods are designed to localize objects in real life
such as dogs, cars and humans; most of these objects have a specific boundary, unlike text
and formula regions, and are not likely to have an extreme aspect ratio like a text line in
the page layout analysis case. Also, the error caused by bounding box regression,
which is usually the final stage of common object detection networks, is inevitable.
To address this issue, we adopt semantic segmentation approach to classify each
pixel into their semantic meaning like text, formula, figure or table. The semantic
segmentation model is a deep neural network trained under supervised information
where parameters are learned automatically during training. Moreover, the pixel level
understanding from semantic segmentation is more precise than the bounding box level
understanding from the conventional object detection methods.
In this paper, we propose a page layout analysis algorithm based on semantic
segmentation. The pipeline of our proposed algorithm contains two parts: the semantic
segmentation stage and the post-processing stage. Our semantic segmentation model is
modified from DeepLab [8] to fit our problem. As for the post-processing part, we get the
bounding box locations along with their confidence scores and labels by analyzing the
connected components on the probability map generated from our segmentation model,
and adopt the run length smoothing algorithm (RLSA) locally on the original image to
refine the bounding boxes. It is demonstrated in the experiments that our proposed
algorithm achieves both reasonable visualization results and good quantitative results
on the ICDAR2017 POD competition dataset [9].
The main contribution of this paper is threefold. First, we propose a powerful and
efficient approach to page layout analysis. Also, we successfully construct a coarse-to-
fine structure by combining a supervised learning algorithm (DeepLab) and an unsu-
pervised algorithm (RLSA). Finally, though the optimizations targeted at the POD dataset
may not directly apply to other datasets, the underlying ideas generalize well and
have reference value.
The rest of the paper is organized as follows: we briefly review page layout
analysis and semantic segmentation algorithms in Sect. 2. The methodology of our
page layout analysis algorithm is then presented in the next section. In Sect. 4, the
datasets we used and the experiments we conducted are described in detail, and
discussions of the limitations and running time analysis are also given.
Finally, we conclude the whole paper in the last section.
2 Related Work
2.1 Page Layout Analysis
Page layout analysis has been studied for decades and a great number of algorithms
have been established and developed to solve this problem. Most of the page layout
analysis approaches can be roughly divided into two categories by the type of seg-
mentation algorithm. One uses unsupervised segmentation with hand-crafted features
and human-defined rules, and the other uses supervised segmentation with a learning-
based model and supervised training data.
Most of the conventional methods [1–5] adopt a two-step strategy: an unsupervised
segmentation model and a learning-based classification model. An unsupervised seg-
mentation method either starts from the pixels and merges them into high-level
regions (bottom-up) [2, 3] or segments the page into candidate regions by projections
or connected components (top-down) [4, 5]. Both need a large number of experience-
dependent parameters. As for the classification step, hand-crafted features are extracted to
train a classifier [3, 4] or a CNN is trained to classify the segmented regions [1]. Also,
some algorithms are proposed to detect specific types of boxes or regions like
equations [10, 11] and tables [12] in PDF files or document images. As we men-
tioned before, the two-step algorithms mostly contain lots of human-defined rules and
hand-crafted features which involve a large parameter set.
In recent years, supervised segmentation is introduced to solve the page layout case
[13,14]. Supervised segmentation provides semantic information which allows us to
perform segmentation and classification at the same time. Oyebade et al. [13] extract
textural features from small patches and train a fully connected neural network to
classify them to get the document segmentation result. Due to the patch classification,
there is a so-called "mosaic effect" where the segmentation boundary is rough and
inaccurate. Yang et al. [14] first introduced semantic segmentation to document seg-
mentation, with an additional tool (Adobe Acrobat) adopted to specify the segmen-
tation boundary. By the power of deep learning, this type of method is normally faster
and stronger than two-step ones and is easier to generalize to other types of documents.
2.2 Semantic Segmentation
Semantic segmentation is a computer vision task that aims to understand the image in
pixel level by performing pixel-wise classification. A demonstration of the semantic
segmentation task is shown in Fig. 1. Semantic segmentation is a deeper under-
standing of images than image classification. Instead of only recognizing the objects
as in image classification, we also have to assign each pixel to a certain object class or
background to delineate the boundary of each object. In industry, semantic seg-
mentation is widely used in a variety of computer vision scenarios such as image
matting, medical image analysis and self-driving.
Before deep learning, the commonly used solutions were random forest based
algorithms [15], which are inaccurate and extremely slow. With CNNs taking over
computer vision, one of the first attempts at semantic segmentation by CNN
was patch classification [16], where the pixel class is assigned based on the classifi-
cation result on a small image patch around it. The size of the image patches needs to be
fixed due to the fully connected layers used in the network structure. And to keep the
parameter size acceptable, the patch window, which equals the receptive field, needs to
be small. Thus the segmentation results were still not ideal.
In 2014, Fully Convolutional Network (FCN) was proposed by Long et al. [17]
which is a milestone of semantic segmentation. This model allows us to feed
images of any size to the segmentation model because no fully connected layer is
used in the network structure. Also, the end-to-end trainable structure makes it much
faster and much stronger than the patch classification methods.
The next big problem of using CNNs for segmentation is the pooling layers. Pooling
layers increase the receptive field and robustness while weakening the spatial infor-
mation. Spatial information is unimportant in image classification but essential in
segmentation tasks. Two ways to overcome this problem are using short-cut connections
to restore the spatial information [18] and using dilated (atrous) convolution layers to
increase the receptive field while keeping the spatial information [8]. The
segmentation model we use in our proposed page layout analysis algorithm is
based on the latter solution.
3 Methodology
3.1 Overview
In this paper, we propose a semantic segmentation based page layout analysis algo-
rithm. The pipeline can be divided into two major parts: the semantic segmentation part
and the post-processing part. The segmentation model we use, along with some training
details, and the post-processing procedures are introduced in this section.
The whole pipeline of our proposed algorithm is shown in Fig. 2. First, a saliency
map is generated by our semantic segmentation model. Connected component analysis
is applied to the generated saliency map to restore the bounding boxes. Then the run
length smoothing algorithm is applied to the local document image of each detected
logical box to specify the boundaries and obtain the final detection results.
Fig. 1. Semantic segmentation task.
Note that our proposed deep learning based page layout analysis algorithm takes
only the document image as input, unlike previous work taking benefit from structural
information in PDF files [11] or applying additional commercial software to localize the
logical units [14].
3.2 Semantic Segmentation Model
Fully Convolutional Network (FCN) [17] represents a milestone in deep learning based
semantic segmentation. The end-to-end convolutional structure is first introduced and
deconvolutional layers are used to upsample the feature maps. However, the loss of spatial
information during the pooling stage makes the upsampling produce coarse segmentation
results, which leaves a lot of room for improvement.
DeepLab [8] was proposed to overcome this problem and achieves state-of-the-art
performance. So we choose the DeepLab v2 structure as our segmentation model, and take
ResNet-101 [19] as the backbone network. The key point in the network structure is
that we use a set of dilated convolution layers to increase the receptive field without
losing spatial information or increasing the number of parameters.
DeepLab v2. As we mentioned before, pooling layers are used in deep neural net-
works to increase the receptive field, but they cause a loss of "where" information, which is
exactly what we need in semantic segmentation. Dilated (also known as atrous) convolution
[20] is one of the solutions to this problem. Holes in the dilated convolution kernels
give them the field of view of a large convolution kernel with the number of
parameters of a small convolution kernel. Also, there is no decrease of spatial
dimensions if the stride is set to 1 in dilated convolution.
Atrous Spatial Pyramid Pooling (ASPP) is proposed in DeepLab [8] to aggregate
parallel atrous (dilated) convolutional layers. In our model, dilated convolutional layers
with multiple sampling rates are designed to extract features in a multi-scale manner
and are fused by the ASPP layer.
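As a toy illustration of these two ideas, the following NumPy sketch implements a 1-d dilated convolution (stride 1, zero padding, so the output keeps the input length) and an ASPP-style fusion that sums parallel branches at several sampling rates. The actual model uses 2-D convolutions with learned kernels; the kernel values and rates below are illustrative assumptions only.

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """1-d dilated convolution with zero padding and stride 1, so the
    output keeps the input length (spatial resolution is preserved)."""
    k = len(kernel)
    span = (k - 1) * rate + 1        # effective field of view of the dilated kernel
    pad = span // 2
    xp = np.pad(x, pad)
    out = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(k):
            out[i] += kernel[j] * xp[i + j * rate]
    return out

def aspp_fuse(x, kernel, rates=(1, 2, 4)):
    """ASPP-style fusion: run parallel dilated branches at several
    sampling rates and sum their responses."""
    return sum(dilated_conv1d(x, kernel, r) for r in rates)

x = np.arange(8, dtype=float)
y = dilated_conv1d(x, np.array([1.0, 1.0, 1.0]), rate=2)
z = aspp_fuse(x, np.array([1.0, 1.0, 1.0]))
# both outputs have the same spatial size as the input
```

Raising the rate widens the field of view (span 5 for a 3-tap kernel at rate 2) while the kernel still has only three parameters and the output length is unchanged, which is exactly the trade-off the text describes.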
Fig. 2. Pipeline of the proposed page layout analysis algorithm.
Consistency Loss. The semantic segmentation model is designed to capture objects
of any shape, but the logical units in document images are all rectangles. Inspired by
Yang et al. [14], we implement a consistency loss to penalize object shapes other
than rectangles. The training loss is the segmentation loss combined with the consis-
tency loss. The consistency loss is defined as follows:
$$L_{con} = \frac{1}{|gt|} \sum_{p_i \in gt} \left( p_i - \bar{p} \right)^2$$

where $gt$ is the ground truth bounding box, $|gt|$ is the number of pixels in the ground
truth box, $p_i$ is the probability given by the segmentation softmax output, and $\bar{p}$ is the
mean value over all the pixels in the bounding box.
Training Details. The segmentation model we use is based on DeepLab v2 [8]. All
the layers except the last prediction layer are restored from a model pretrained on the
MSCOCO dataset [21]. The last layer is randomly initialized to predict four classes:
background, figure, formula, and table. Parameters like the learning rate, weight decay
and momentum are inherited from the TensorFlow implementation of DeepLab v2.
The ground truth mask is given by the ground truth bounding boxes; that is, the
label of each pixel inside a bounding box is set to the positive class number of that box,
and the label of each pixel not covered by any bounding box is set to zero, which
represents background.
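The mask construction described above can be sketched as follows (the (x0, y0, x1, y1, class_id) box layout and the particular class-id mapping are assumptions for illustration):

```python
import numpy as np

def boxes_to_mask(h, w, boxes):
    """Rasterize ground-truth boxes into a per-pixel label mask:
    0 = background, positive ids = object classes (e.g. figure,
    formula, table). `boxes` is a list of (x0, y0, x1, y1, class_id)."""
    mask = np.zeros((h, w), dtype=np.uint8)
    for x0, y0, x1, y1, cls in boxes:
        mask[y0:y1, x0:x1] = cls    # later boxes overwrite earlier ones
    return mask

# a 6x8 page with one class-2 box and one class-3 box
mask = boxes_to_mask(6, 8, [(1, 1, 4, 3, 2), (5, 0, 8, 2, 3)])
```

Pixels inside a box take that box's class number and everything else stays zero, matching the ground-truth definition in the text.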
We randomly scale the input document images and the corresponding ground truth
masks during training to improve robustness to multi-scale input. The model is
optimized by adaptive moment estimation (Adam) to minimize the cross-entropy loss
over pixels between the prediction masks and the ground truth masks.
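The per-pixel objective can be written out explicitly. This NumPy sketch computes the mean softmax cross-entropy over pixels; the (H, W, C) logits layout is an assumption, and in practice the model is trained with TensorFlow's built-in loss rather than this reference implementation.

```python
import numpy as np

def pixel_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over all pixels.
    logits: (H, W, C) class scores; labels: (H, W) integer class ids."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    picked = log_probs[np.arange(h)[:, None], np.arange(w), labels]
    return float(-picked.mean())

# uniform logits over 4 classes give a loss of ln 4 regardless of labels
loss = pixel_cross_entropy(np.zeros((2, 2, 4)), np.array([[0, 1], [2, 3]]))
```

Each pixel contributes the negative log-probability of its ground-truth class, and the mean over the mask is what the optimizer minimizes.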
3.3 Post-processing Procedures
At inference time, a pixel-wise prediction mask is generated by the segmentation
model, and the post-processing step restores the bounding boxes from the pre-
diction mask to get the final layout analysis result. The main contribution of this part
is that we adopt the local Run Length Smoothing Algorithm (RLSA) on the original
document image along with connected component analysis on the prediction mask
to specify the region boundaries.
Conditional Random Field (CRF). CRF [22] is a standard post-processing step in
deep learning based semantic segmentation models to improve the final results. CRF
assumes that pixels with similar intensity tend to belong to the same class, and constructs
a graphical model to smooth the boundary of each object. Usually, CRF can improve the
segmentation mean IoU by 1–2%. On the contrary, CRF degrades the segmentation
results in our page layout case, which is shown by the experiments in Sect. 4. This is
due to the differences between natural images and document images.
Connected Component Analysis (CCA). To restore the bounding boxes from the
saliency map predicted by our segmentation model, we extract connected components
of each class and take their bounding rectangles as candidate bounding boxes. The label of
each candidate bounding box is the same as its connected component, and the confi-
dence score is calculated as the average of the pixel segmentation softmax scores.
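A pure-NumPy sketch of this restoration step, using a simple 4-connected flood fill in place of a library labeling routine (the (x0, y0, x1, y1, class, score) box format and the connectivity choice are assumptions):

```python
import numpy as np

def boxes_from_mask(pred_mask, prob_map):
    """Restore candidate boxes from the predicted label mask:
    label 4-connected components per class, take each component's
    bounding rectangle, and score it by the mean softmax probability
    over the component's pixels."""
    h, w = pred_mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            cls = pred_mask[y, x]
            if cls == 0 or seen[y, x]:
                continue
            stack, pixels = [(y, x)], []        # flood fill one component
            seen[y, x] = True
            while stack:
                cy, cx = stack.pop()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny, nx] \
                            and pred_mask[ny, nx] == cls:
                        seen[ny, nx] = True
                        stack.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            score = float(np.mean([prob_map[p] for p in pixels]))
            boxes.append((min(xs), min(ys), max(xs) + 1, max(ys) + 1, int(cls), score))
    return boxes

mask = np.zeros((5, 5), dtype=int)
mask[0:2, 0:2] = 1                  # one class-1 component
mask[3:5, 2:5] = 2                  # one class-2 component
boxes = boxes_from_mask(mask, np.full((5, 5), 0.8))
```

Each component becomes one candidate box carrying its class label and a confidence score averaged over its own pixels, as the text describes.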
Run Length Smoothing Algorithm (RLSA). RLSA has been widely used in document
segmentation for the last few decades to aggregate pixels belonging to the same logical
unit [2, 3]. The input of RLSA is a binary image or array where 0s represent black pixels
and 1s represent white pixels. The aggregation procedure follows the rule that 1s are
changed to 0s if the length of an adjacent run of 1s is less than a predefined threshold C.
An example of RLSA on a 1-d array (with threshold C set to 3) is shown in Fig. 3(a) and
an example of the RLSA document segmentation procedure is shown in Fig. 3(b).
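The 1-d smoothing rule can be sketched directly. This is a minimal version following the usual convention that only interior runs of 1s lying between black pixels are smoothed; whether boundary runs are also smoothed is a variant choice, so that detail is an assumption here.

```python
def rlsa_1d(bits, c):
    """1-d run length smoothing: runs of 1s (white) shorter than the
    threshold c are flipped to 0s (black), merging nearby black pixels
    into one run. Leading/trailing white runs are left untouched."""
    out = list(bits)
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 1:
            j = i
            while j < n and out[j] == 1:
                j += 1               # j is one past the end of this run of 1s
            # smooth only interior gaps between black pixels
            if i > 0 and j < n and (j - i) < c:
                out[i:j] = [0] * (j - i)
            i = j
        else:
            i += 1
    return out

# with C = 3, the run of two 1s is closed; the run of four 1s survives
smoothed = rlsa_1d([0, 1, 1, 0, 0, 1, 1, 1, 1, 0], 3)
```

Applied row-wise and column-wise, this is exactly the aggregation rule stated above: short white gaps between black pixels of the same region get filled in.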
RLSA is based on the assumption that the horizontal or vertical distance between
black pixels in the same logical region is less than C, while the distance between different
logical regions is larger than C. But there are several cases that do not meet this
assumption. For example, image captions are usually very close to the image, but they
are different logical regions. Thus the determination of the threshold C can be very hard
and experience-dependent.
In our case, since we have the semantic meaning of each pixel from our segmentation
model, we are able to apply RLSA on each connected component, where all pixels share
the same logical meaning. For the caption case, the figure and its caption are processed
separately, so they won't be aggregated together no matter how close they are.
The semantic segmentation model gives us the logical class label of each pixel, but the
boundary is rough, as shown in Fig. 2, so local RLSA is adopted to give us the
exact boundary of each logical region.
Some Other Processing Steps. We investigate the ground truth of the POD competition
dataset [9] and design several rules to improve the mAP score. Note that this part is
Fig. 3. (a) A 1d example of RLSA procedure. (b) A document segmentation example by RLSA.
designed specifically for the POD competition dataset, so it may not generalize
to all types of document images. We briefly introduce the problems instead of
the solutions, since this part is not the focus of this paper.
We noticed that each subfigure is annotated separately in the POD dataset while the
segmentation model tends to predict a single figure region, so we set a series of rules to
split figure regions into several subfigures. Tables normally have a clear boundary, so
besides removing small regions, there is no additional processing step for tables.
The standard of equation annotation is unclear in the POD dataset. Most equations are
annotated as "equation lines", where a multiline equation is labeled as
multiple equation annotations, but some equations are annotated as "equation blocks".
Also, the equation number is annotated in the same box as the corresponding
equation. The equation number may be very far away from the equation, which leaves a
large blank space in the annotated box. Therefore, some human-defined rules are designed
to split equations into single lines and to aggregate each equation number with the
equation itself. The result is still far from ideal, for the splitting procedure creates a
new problem (the start and stop indexes of Σ and Π) which will be discussed in Sect. 4.
4 Experiments
4.1 Datasets
Since our segmentation model is deep learning based, we need annotated data for
training. We choose the open dataset of the ICDAR2017 Page Object Detection Com-
petition [9]. The competition dataset contains 2,400 annotated document images (1,600
for training and 800 for testing), and the extended dataset, which was not open during
the competition, contains about 10,000 annotated images. The document images are
annotated with bounding boxes of three classes: figure, table and formula. The com-
petition dataset can be downloaded from the official website.
4.2 Experimental Results
The evaluation metric we use to quantify the segmentation performance is mean IoU.
Mean IoU is a standard metric in semantic segmentation [8, 17, 18] which calculates
the mean of the intersection-over-union metric over all classes. Also, mean average
precision (mAP) and average F1 measure are calculated to evaluate the final page layout
analysis results. Mean average precision is a rough estimation of the area under the
precision-recall curve and is the most commonly used evaluation metric in object
detection [6, 7]. The F1 measure is the harmonic mean of precision and recall. In
particular, mAP and F1 are also the evaluation metrics used by the POD competition [9].
The segmentation models are trained on POD dataset and POD extended dataset to
show the performance gain from more training data. And post-processing methods are
applied to prediction maps generated from the segmentation model. The results of
different training datasets and different post-processing steps are shown in Table 1.
The results are evaluated by the official evaluation tool of the POD competition, and
the IoU threshold is set to 0.8 in the mAP and average F1 calculation.
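The matching criterion can be made concrete with a small IoU helper. The (x0, y0, x1, y1) box layout is an assumption, and the official tool's exact matching and tie-breaking rules are not reproduced here; this only shows the threshold test itself.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1).
    Under the evaluation protocol described above, a detection counts as a
    match when its IoU with a ground-truth box is at least 0.8."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def is_match(det, gt, threshold=0.8):
    """POD-style hit test at the 0.8 IoU threshold used in the paper."""
    return iou(det, gt) >= threshold
```

A 0.8 threshold is strict: a detection shifted by half a box width (IoU 1/3) is far from matching, which is why the boundary refinement by RLSA matters so much for the final scores.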
In Table 1, "base" represents our segmentation model, "RLSA" represents our
whole post-processing pipeline, "(+)" means the extended dataset is used to train the
segmentation model, "CRF" represents the segmentation model with fully connected CRF
layers [22], and "CL" represents the consistency loss considered during training.
From Table 1 we can see that the best result comes from the segmentation model
trained on the extended dataset, followed by our proposed post-processing procedure
including RLSA. The most significant performance gain comes from our proposed RLSA
post-processing, which boosts the average F1 by 0.21 and the mAP by 0.12. In the seg-
mentation network, a larger training dataset gives us a 4% gain, for the deep learning
structure we use heavily relies on a large amount of training data. The fully con-
nected CRF, which usually improves segmentation results on real-life objects, does
not work well in the page layout case. The reason is that objects in natural images have a
clear boundary while logical units in document images have holes and blanks, which is
unsuitable for CRF post-processing. Also, the consistency loss is supposed to penalize
predictions with shapes other than rectangles. But in our experiments, some pre-
dictions vanished to achieve a zero consistency loss, thus the segmentation and final
results are not what we expected.
Then we compare our best method with the top results submitted to the POD com-
petition [9]. The overall performance and the performances on the three specific classes
(figure, table and formula) are shown in Table 2. Note that we achieve a good place,
though not the best of all. The methods in the POD competition [9] have not been presented
in academic papers, so the details of their approaches and the authenticity of the
results are unknown.
Most of the figures, equations and tables in document images can be correctly
recognized and localized by our proposed page layout algorithm. Some visualization
results are shown in Fig. 4. Green boxes represent equations detected by our proposed
page layout analysis approach, red boxes represent figures and blue boxes represent
tables. The numbers under the bounding boxes are the confidence scores.
Table 1. The segmentation and final detection performance of our proposed methods.

Method                Mean IoU  Average F1  Mean AP
Base                  0.869     0.518       0.498
Base + RLSA           0.869     0.763       0.666
Base(+) + RLSA        0.908     0.801       0.690
Base(+) + CRF + RLSA  0.886     0.745       0.611
Base(+) + CL + RLSA   0.897     0.776       0.662
4.3 Limitations
There are still several scenarios where our proposed algorithm might fail. As we can see
in Table 2, the F1 measure and average precision score of equations are much lower than
those of tables and figures. After analyzing the visualization results, we found that the
evaluation scores are crippled by equations with Σ, Π, and equation numbers, for the
reasons that the segmentation model tends to predict an equation and its equation
number as two separate equation regions, and RLSA separates the start and stop
indexes of Σ and Π into three lines.
To tackle the equation number problem in future work, one could increase the
receptive field of the semantic segmentation model to merge the equations and equation
numbers in the prediction map. As for the start and stop index problem, we trained a
classification model to recognize Σ and Π, and then merged the indexes. This
procedure did bring us a precision gain on equations but is still not perfect. Therefore,
there is still some room for improvement on the equation issue.
Table 2. Results on the POD competition.

               F1-measure                     Average precision
Team           Formula Table  Figure Mean    Formula Table  Figure Mean
PAL            0.902   0.951  0.898  0.917   0.816   0.911  0.805  0.844
Ours           0.716   0.911  0.776  0.801   0.506   0.893  0.672  0.690
HustVision     0.042   0.062  0.132  0.096   0.293   0.796  0.656  0.582
FastDetectors  0.639   0.896  0.616  0.717   0.427   0.884  0.365  0.559
Vislint        0.241   0.826  0.643  0.570   0.117   0.795  0.565  0.492
SOS            0.218   0.796  0.656  0.557   0.109   0.737  0.518  0.455
UTTVN          0.200   0.635  0.619  0.485   0.061   0.695  0.554  0.437
Matiai-ee      0.065   0.776  0.357  0.399   0.005   0.626  0.134  0.255
Fig. 4. Visualization results of our proposed page layout analysis algorithm. (Color figure online)
4.4 Running Time
Our proposed algorithm consists of two main parts: semantic segmentation and post-
processing. Our segmentation model is a deep neural network and a single inference takes
0.25 s on GPU (a single GTX 1080). It should be at least twice as fast on a
more powerful GPU such as a Titan X. Our post-processing step can be efficiently done on
CPUs (56 cores, E5-2680 v4) in 100 ms. In general, our whole system can process
approximately 3 document images (~1300 × 1000) per second.
5 Conclusion
We have proposed a deep learning algorithm for page layout analysis, DeepLayout,
which is capable of recognizing and localizing semantic and logical regions directly
from document images, without any help from PDF structural information or commercial
software. We treat page layout analysis as a semantic segmentation problem, and a
deep neural network is trained to understand the document image at the pixel level. Then
connected component analysis is adopted to restore bounding boxes from the prediction
map. And we successfully bring the local run length smoothing algorithm into our post-
processing step, which significantly improves the performance on both average F1 and
mAP scores. Our semantic segmentation model is trained and experiments are con-
ducted on the ICDAR2017 POD dataset. It is demonstrated by the experimental results on
the POD competition evaluation metrics that our proposed algorithm achieves a 0.801
average F1 and a 0.690 mAP score, which outperforms the second place of the POD
competition. The running speed of the whole system is approximately 3 fps.
Acknowledgement. This work was supported by the Natural Science Foundation of China
under Grant 61171138.
References

1. Yi, X., Gao, L., Liao, Y., et al.: CNN based page object detection in document images. In:
Proceedings of the International Conference on Document Analysis and Recognition,
pp. 230–235. IEEE (2017)
2. Cesarini, F., Lastri, M., Marinai, S., et al.: Encoding of modified X-Y trees for document
classification. In: Proceedings of the International Conference on Document Analysis and
Recognition, pp. 1131–1136. IEEE (2001)
3. Priyadharshini, N., Vijaya, M.S.: Document segmentation and region classication using
multilayer perceptron. Int. J. Comput. Sci. Issues 10(2 part 1), 193 (2013)
4. Lin, M.W., Tapamo, J.R., Ndovie, B.: A texture-based method for document segmentation
and classification. S. Afr. Comput. J. 36, 49–56 (2006)
5. Chen, K., Yin, F., Liu, C.L.: Hybrid page segmentation with efficient whitespace rectangles
extraction and grouping. In: International Conference on Document Analysis and
Recognition, pp. 958–962. IEEE Computer Society (2013)
6. Ren, S., He, K., Girshick, R., et al.: Faster R-CNN: towards real-time object detection with
region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
7. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N.,
Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016).
8. Chen, L.C., Papandreou, G., Kokkinos, I., et al.: DeepLab: semantic image segmentation
with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans.
Pattern Anal. Mach. Intell. 40(4), 834–848 (2018)
9. Gao, L., Yi, X., Jiang, Z., et al.: Competition on page object detection. In: Proceedings of the
International Conference on Document Analysis and Recognition, ICDAR 2017,
pp. 1417–1422. IEEE (2017)
10. Chu, W.T., Liu, F.: Mathematical formula detection in heterogeneous document images. In:
Technologies and Applications of Artificial Intelligence, pp. 140–145. IEEE (2014)
11. Gao, L., Yi, X., Liao, Y., et al.: A deep learning-based formula detection method for PDF
documents. In: Proceedings of the International Conference on Document Analysis and
Recognition, pp. 553–558. IEEE (2017)
12. Hassan, T., Baumgartner, R.: Table recognition and understanding from PDF files. In:
International Conference on Document Analysis and Recognition, pp. 1143–1147. IEEE
13. Oyedotun, O.K., Khashman, A.: Document segmentation using textural features summa-
rization and feedforward neural network. Appl. Intell. 45(1), 198–212 (2016)
14. Yang, X., Yumer, E., Asente, P., et al.: Learning to extract semantic structure from
documents using multimodal fully convolutional neural networks. arXiv preprint
arXiv:1706.02337 (2017)
15. Shotton, J., Fitzgibbon, A., Cook, M., et al.: Real-time human pose recognition in parts from
single depth images. In: Computer Vision and Pattern Recognition, pp. 1297–1304. IEEE
16. Ciresan, D., Giusti, A., Gambardella, L.M., et al.: Deep neural networks segment neuronal
membranes in electron microscopy images. Adv. Neural Inf. Process. Syst., pp. 2843–2851
17. Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmen-
tation. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 640 (2017)
18. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image
segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015.
LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
19. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
20. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: International
Conference on Learning Representations (2016). arXiv:1511.07122
21. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T.,
Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740755. Springer,
Cham (2014).
22. Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: International Conference on Neural Information Processing Systems, pp. 109–117. Curran Associates Inc. (2011)
DeepLayout: A Semantic Segmentation Approach 277
... It used an ANN and had two inputs. The works [18,35] adopted a semantic segmentation approach to classify each pixel by its semantic meaning and used ANNs and RNNs to detect document elements according to their character and vision features. With the development of object detection methods, researchers tried to use R-CNN approaches to solve this problem. ...
... For document images, each part is structured and contains multiple spatial information. Therefore, when processing document images, dilated convolution is used to increase the receptive field without losing spatial information [18]. Hence, this study replaces the traditional convolution with dilated convolution. ...
... This study compared the results with the state-of-the-art methods [18,31] and [1,5,10,21,30,32,38]. The ICDAR 2017, UW-III (image), ICDAR 2009, and ICDAR 2013 (table) datasets are tested to evaluate the accuracy of the model. ...
Document layout analysis is a critical step in optical character recognition. Traditional handcrafted-feature-based methods cannot handle various formats with high accuracy. Although deep-learning-based methods obtain satisfactory accuracy, they are not memory-efficient on low-memory devices such as mobile phones. To alleviate these problems, a memory-efficient approach to layout analysis with the Lightweight Dilated Network (LD-Net) is proposed in this study. The initial document page image is segmented into blocks of content via the Otsu algorithm and RLSA. Each block is sent into the LD-Net, which classifies it into one of four common classes: figure, table, text, and formula. The main structure of the LD-Net is a shallow network, which performs better than deeper networks for the layout analysis task. Each convolution layer is composed of depthwise separable convolution and a residual structure. In addition, dilated convolution is also employed in the LD-Net to improve the accuracy of the detection results. Experimental results on benchmarks show that the proposed approach achieves better accuracy and memory usage: the model reaches 95.7% accuracy on the ICDAR dataset and occupies 39.7 MB of memory, outperforming existing methods.
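The horizontal RLSA pass that both DeepLayout and LD-Net rely on for merging nearby content can be sketched in a few lines. This is a minimal illustration only, assuming a list-of-lists binary image with 1 = foreground; the function name and threshold are hypothetical choices, not either paper's implementation:

```python
def rlsa_horizontal(binary, threshold):
    """Run-Length Smoothing Algorithm, horizontal pass.

    Fills runs of background (0) pixels shorter than `threshold`
    with foreground (1), merging nearby components on each row.
    Only runs bounded by foreground on both sides are filled.
    """
    result = [row[:] for row in binary]  # do not mutate the input
    for row in result:
        run_start = None  # start index of the current background run
        for x, pixel in enumerate(row):
            if pixel == 0:
                if run_start is None:
                    run_start = x
            else:
                # close the run; fill it only if it is short and does
                # not touch the left image border
                if run_start is not None and run_start > 0 and x - run_start < threshold:
                    for i in range(run_start, x):
                        row[i] = 1
                run_start = None
    return result
```

In the classic formulation a vertical pass is applied in the same way down the columns, and the two smoothed maps are typically combined (e.g., by logical AND) before connected component analysis.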
... These methods take the document image as an input and then output the bounding boxes of objects with corresponding labels. Moreover, there are some methods based on deep semantic segmentation networks in which each pixel is classified into one semantic type [6–9]. Pixel-level understanding is more precise than the bounding-box level. ...
... And they only complete the pixel-wise segmentation task. DeepLayout [8] does not distinguish the text type from the background. It chooses the DeepLab v2 structure [17] to segment the pixels that belong to the table, figure, and formula types. ...
... Each formula line of a multi-line formula is labelled in a similar way. In our experiments, we use the POD extended training dataset as in [8], which contains about 10,000 training images, and all text line regions are appended to the ground truth. ...
Semantic page segmentation of document images is a basic task in document layout analysis, which is key to document reconstruction and digitalization. Previous work usually considers only a few semantic types in a page (e.g., text and non-text) and operates mainly on English document images, so finer semantic segmentation of Chinese and English document pages remains challenging. In this paper, we propose a deep-learning-based method for semantic page segmentation in Chinese and English documents such that a document page can be decomposed into regions of four semantic types: text, table, figure, and formula. Specifically, a deep semantic segmentation neural network is designed to achieve pixel-wise segmentation, where each pixel of an input document page image is labeled as background or one of the four categories above. We can then obtain the accurate locations of regions of different types by applying the Connected Component Analysis algorithm to the prediction mask. Moreover, a Non-Intersecting Region Segmentation Algorithm is further designed to generate a series of regions that do not overlap each other, which improves the segmentation results and avoids possible location conflicts in the page reconstruction. For the training of the neural network, we manually annotate a dataset whose documents are from Chinese and English language sources and contain various layouts. The experimental results on our collected dataset demonstrate the superiority of our proposed method over other existing methods. In addition, we utilize transfer learning on the public POD dataset and obtain promising results in comparison with the state-of-the-art methods.
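The Connected Component Analysis step used to recover region locations from a prediction mask can be sketched as a plain flood fill. This is an illustrative pure-Python version with a hypothetical function name; real systems usually call an optimized labeling routine such as OpenCV's connectedComponents:

```python
from collections import deque

def connected_component_boxes(mask):
    """Return bounding boxes (x0, y0, x1, y1) of 4-connected
    foreground components in a binary mask (list of lists, 1 = foreground),
    in row-major discovery order."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 1 and not seen[y][x]:
                # breadth-first flood fill of one component,
                # tracking its bounding box as we go
                x0 = x1 = x
                y0 = y1 = y
                queue = deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    x0, x1 = min(x0, cx), max(x1, cx)
                    y0, y1 = min(y0, cy), max(y1, cy)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes
```

In a full pipeline this would be run once per predicted class map, yielding labeled region candidates for the post-processing stage.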
... The segmentation-based method is intended to label each pixel with a semantic class by producing per-class binary prediction maps for the input document image in a bottom-up manner. [12] adopted DeepLab [4] to perform a semantic segmentation procedure with connected component analysis in the post-processing part. However, as mentioned in the work [12], the connected component analysis made some merging errors owing to its weakness in splitting close adjacent regions. Therefore, a separate segmentation-based method cannot be directly applied to historical newspapers with a close content layout. ...
In this paper, we introduce a novel historical newspaper layout analysis model named Panoptic-DLA. Different from previous works that regard layout analysis as a separate object detection or semantic segmentation problem, we define the layout analysis task as proposal-free panoptic segmentation, assigning a unique value to each pixel in the document image that encodes both a semantic label and an instance id. The model consists of two branches: a semantic segmentation branch and an instance segmentation branch. First, the pixels are separated into “things” and “stuff” by semantic classification, taking the background as “stuff” and content objects, such as images and paragraphs, as “things”. The predicted “things” are then further grouped into their instance ids by instance segmentation. The semantic segmentation branch adopts DeepLabV3+ to predict pixel-wise class labels. In order to split adjacent regions well, the instance segmentation branch produces a mountain-like soft score map and a center-direction map to represent content objects. The method is trained and tested on a dataset of historical European newspapers with complex content layouts. The experiments show that the proposed method achieves competitive results against popular layout analysis methods. We also demonstrate the effectiveness and superiority of the method compared to previous methods.
... In this work, we focus on character-level separation techniques. To process a document, some existing related technologies, such as layout analysis [27–29] and connected-component analysis [30], can help locate character positions. Therefore, we assume that there are front-end modules that can roughly segment printed characters from a complete document. ...
Separating printed or handwritten characters from a noisy background is valuable for many applications, including test paper autoscoring. The complex structure of Chinese characters makes this goal difficult to achieve, because fine details and the overall structure are easily lost in reconstructed characters. This paper proposes a method for separating Chinese characters based on a generative adversarial network (GAN). We used ESRGAN as the basic network structure and applied dilated convolution and a novel loss function to improve the quality of reconstructed characters. Four popular Chinese fonts (Hei, Song, Kai, and Imitation Song) were tested on a real data collection, and the proposed design was compared with other semantic segmentation approaches. The experimental results showed that the proposed method effectively separates Chinese characters from noisy backgrounds. In particular, our method achieves better results in terms of Intersection over Union (IoU) and optical character recognition (OCR) accuracy.
Analyzing the layout of a document to identify headers, sections, tables, figures, etc. is critical to understanding its content. Deep-learning-based approaches for detecting the layout structure of document images have been promising. However, these methods require a large number of annotated examples during training, which are both expensive and time-consuming to obtain. We describe here a synthetic document generator that automatically produces realistic documents with labels for the spatial positions, extents, and categories of the layout elements. The proposed generative process treats every physical component of a document as a random variable and models their intrinsic dependencies using a Bayesian network graph. Our hierarchical formulation using stochastic templates allows parameter sharing between documents to retain broad themes, and yet the distributional characteristics produce visually unique samples, thereby capturing complex and diverse layouts. We empirically illustrate that a deep layout detection model trained purely on the synthetic documents can match the performance of a model that uses real documents.
Layout analysis of a document image plays an important role in document content understanding and information extraction systems. While many existing methods focus on learning knowledge with convolutional networks directly from color channels, we argue for the importance of high-frequency structures in document images, especially edge information. In this paper, we present a novel document layout analysis framework with the Explicit Edge Embedding Network (E3Net). Specifically, the proposed network contains an edge embedding block and a dynamic skip connection block to produce detailed features, as well as a lightweight fully convolutional subnet as the backbone for the effectiveness of the framework. The edge embedding block is designed to explicitly incorporate the edge information from the document images. The dynamic skip connection block aims to learn both color and edge representations with learnable weights. In contrast to previous methods, we train the model using a synthetic document approach to overcome data scarcity. The combination of data augmentation and edge embedding is important for obtaining a more compact representation than directly using the training images with only color channels. We conduct experiments with the proposed framework on three document layout analysis benchmarks and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300 × 300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on a Nvidia Titan X, and for 512 × 512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.
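The default-box grid described above can be illustrated with a small sketch. The scale and aspect-ratio values below are hypothetical, and SSD additionally adds an extra scale for the ratio-1 box at each location, which is omitted here for brevity:

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios):
    """Generate SSD-style default boxes (cx, cy, w, h) in normalized
    [0, 1] coordinates for a square feature map of side fmap_size.

    Each cell contributes one box per aspect ratio; width and height
    are chosen so their product stays scale**2 (same area per scale).
    """
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            # box center at the middle of cell (i, j)
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes
```

The network then predicts, for each of these boxes, class scores plus four offsets that adjust the box toward the ground-truth shape.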
State-of-the-art models for semantic segmentation are based on adaptations of convolutional networks that had originally been designed for image classification. However, dense prediction and image classification are structurally different. In this work, we develop a new convolutional network module that is specifically designed for dense prediction. The presented module uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution. The architecture is based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage. We show that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems. In addition, we examine the adaptation of image classification networks to dense prediction and show that simplifying the adapted network can increase accuracy.
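The exponential expansion of the receptive field claimed above can be checked with simple arithmetic. The helper below is illustrative only, assuming stride-1 convolutions with a fixed kernel size:

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in pixels, along one axis) of a stack of
    stride-1 dilated convolutions applied in sequence.

    A layer with dilation d and kernel size k widens the receptive
    field by (k - 1) * d; a single pixel has receptive field 1.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf
```

With dilations doubling per layer (1, 2, 4, ...), the receptive field grows exponentially in the number of layers, whereas a plain stack (all dilations 1) grows only linearly, which is the core argument for the context module.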
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
There is broad consensus that successful training of deep networks requires many thousands of annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC), we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast: segmentation of a 512×512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at .
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.