What do we expect from comic panel extraction?
Nhu-Van Nguyen, Christophe Rigaud, Jean-Christophe Burie
Laboratoire L3i, SAIL joint Laboratory
Université de La Rochelle
17042 La Rochelle CEDEX 1, France
{nhu-van.nguyen, christophe.rigaud, jean-christophe.burie}@univ-lr.fr
Abstract—Among comic elements such as panels, balloons, comic characters and texts, panels play an important role in content adaptation and story animation for small devices such as mobile phones or tablets. Different panel extraction techniques have been investigated over the last ten years; most existing approaches rely on the assumption that a comic panel is either a simple rectangle or a more complex polygon with 4 edges. In this paper, we re-examine the definition of comic panels: is a 4-edge polygon really sufficient to represent the integral information of a comic panel? We suggest a modern definition of comic panels, together with a strong panel extraction method that can serve as a baseline for this approach.
Keywords-Panel extraction; Story board extraction; Comic
book image analysis; Pixel classification; Deep learning;
I. INTRODUCTION
The image analysis community has investigated comic
book element extraction for almost ten years, and methods
vary from low-level analysis such as text recognition to
high-level analysis such as style recognition. Among these studies, the extraction of comic book content (e.g., panels, balloons, texts, comic characters) is one of the most investigated. Among these elements, panels are particularly important for storing and displaying comic books on devices with small screens, such as phones or tablets. Displaying each panel of the comic individually helps readers to easily read and navigate a comic book.
Most recent methods for panel extraction use either image-processing techniques or deep learning models [1], [2], [3], [4], [5], [6]. [3] uses connected component analysis and outermost contours to detect the panels. [1] incorporates three types of visual patterns extracted from the comic image at different levels, and a tree conditional random field framework is used to label each visual pattern by modeling its contextual dependencies. [4], [5], [6] consider panel extraction as an object detection task and adapt popular deep learning object detection models to this problem. [2] goes further with a deep learning regression network to detect quadrilateral panels. However, image-processing techniques have trouble detecting complex or overlapping panels. Recent deep learning approaches can detect more types of panels, but not all of them.
All current methods are based on the assumption that
a panel in a comic is a 4-edge polygon (quadrilateral
shape). While this assumption works for many comic books, especially older ones, it does not hold for many modern comics or Japanese manga. In our project, we are working on modern comics from America, Europe and Japan. We have found that many comic and manga books do not follow the current assumption about panels. This finding led us to study a new definition of the comic panel.
In this paper, we confirm that a 4-edge polygon is not sufficient to represent the integral information of a comic panel. We propose another definition of comic panels which treats them as free-shape elements. Moreover, we propose a deep learning model which extracts panels with state-of-the-art performance and can be reproduced easily to serve as a baseline method for this new approach.
In the next section, we answer the question “What do
we expect from panel extraction?” and propose a better
assumption about comic panels. In Section III, we introduce
a deep learning model which can extract the newly defined
type of panels. In Section IV, we present and discuss the
results of our model and conclude this work in Section V.
II. WHAT DO WE EXPECT FROM PANEL EXTRACTION?
The most obvious purpose of comic panel extraction,
together with speech balloon extraction which can help users
read small text on mobile devices, is to be able to display
comic books on the screen of mobile phones or tablets so
that readers can clearly see the details of the comic (see
Fig. 1). For this purpose, there are two important conditions
when we identify comic panels: high border accuracy is
needed so that the display is not affected too much by parts
that do not belong to the panel; extracted panels need to
contain all the contents of the panel so readers will not miss
anything.
Current approaches are based on the assumption that a
comic frame is a rectangle or a polygon with 4 edges. One
of the frequently asked questions is whether the assumption
meets the requirements we expect from panel extraction. It
is undeniable that many comics are completely consistent with this hypothesis, but when working with recent comics we found many other cases in which this assumption no longer holds. In the following, we present examples which show that the current hypothesis is not always reasonable. Hence, we need a better definition of the comic panel, which means we need a new approach to identifying comic panels that satisfies our needs for the panel extraction task.

Figure 1. Panel extraction to display on a small-size device. The order of the panels is also computed to help users navigate through the comic book.
In the first example (Fig. 2), the red lines show the definition of this panel in the eBDtheque dataset [7], as well as in other works [1], [2], [5], [4]. We can see that part of the panel content is lost. The second example (Fig. 3) shows that the panel of this comic cannot be represented by a 4-edge polygon: parts of the text and of the character are lost if we only use the rectangular panel (in red). The third example (Fig. 7) shows a rounded panel, which the most recent methods [1], [2] cannot detect because they consider a panel as a quadrilateral. The panel in Fig. 5 does not have borders, so the methods in [1], [2], [3] cannot detect it because these approaches need solid borders to detect panels. The fifth example (Fig. 6) shows two attaching panels, where the rectangular representation of the panel does not contain all of its text, which is a difficult case for all methods.
From all of the examples we have discovered, we can see that the conventional assumption is not sufficient to represent comic panels. We propose a different approach to extract panels. Instead of extracting bounding boxes as in [5], [4], or 4-edge polygons as in [2], [1], we propose to segment the panels as free shapes which contain all the content of the panel. That means, in our definition, a panel is represented by an approximated polygon (contour) which contains all of its content. For example, in Fig. 2, the panel is represented by the blue contour. The method in [3] can extract panels by their contours; however, it cannot detect overlapping, attaching, or no-border panels. In the next section, we describe our proposed method to extract panels, which not only can extract full-content panels but can also deal with attaching, overlapping and no-border panels.

Figure 2. Example of a missing balloon in the detected panel (red rectangle).

Figure 3. Example of a cut-off object (balloon) and character in the detected panel (red rectangle).

Figure 4. Example of a panel with more than 4 edges (the blue polygon).
Figure 5. Example of a panel without borders.
Figure 6. Example of a panel which is attached to another panel.
III. PROPOSED MODEL
Figure 7. Example of a no-edge panel.

In order to extract free-shape panels, we treat the problem as an image segmentation task, i.e., a pixel classification task. In the deep learning approach, image segmentation is formulated as pixel classification, where each pixel of the image is classified into one of the object classes or the background. Hence, the output of the neural network architecture for a binary segmentation task is an energy map where each point represents the probability that the pixel at the same position belongs to the object class. Most image segmentation architectures (such as U-Net or Mask R-CNN) optimize the model by minimizing the sum of the Cross-Entropy loss between the ground-truth label distribution and the predicted label distribution over all pixels. For binary segmentation, the loss function is:
$$ -\sum_{i=1}^{M}\sum_{j=1}^{N} \left( y_j^{(i)} \log P\big(x_j^{(i)}\big) + \big(1 - y_j^{(i)}\big) \log\big(1 - P\big(x_j^{(i)}\big)\big) \right) \qquad (1) $$
where we have a training set $x^{(1)}, \ldots, x^{(M)}$ consisting of $M$ independent examples, $y_j^{(i)} \in \{0, 1\}$ being the label of the pixel at position $j$ (of $N$ pixels in total) in the image $x^{(i)}$. The probability $P(x_j^{(i)}) = \frac{1}{1 + e^{-f(x_j^{(i)})}}$ is the sigmoid activation of the value $f(x_j^{(i)})$ at pixel $j$ in the last feature map of the neural network.
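For concreteness, a minimal sketch of this per-pixel binary cross-entropy (our own illustration in NumPy; the function and variable names are not from the paper):

import numpy as np

def binary_segmentation_loss(logits, labels):
    # logits: (M, H, W) raw scores f(x) from the last feature map.
    # labels: (M, H, W) ground-truth masks with values in {0, 1}.
    probs = 1.0 / (1.0 + np.exp(-logits))        # sigmoid P(x)
    probs = np.clip(probs, 1e-7, 1.0 - 1e-7)     # numerical stability
    loss = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    return loss.sum()                            # sum over all pixels and all images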
We have adapted the popular U-Net architecture [8] to our binary problem: classifying pixels into two classes, background and panel. However, U-Net is not an instance segmentation model, which means that it can classify a pixel into a class but it cannot separate attaching or overlapping objects. Even if the model classifies the panel pixels of the input image well, when two panels are attached, the final segmentation mask merges them into a single region which can only be counted as one detected panel. In order to overcome this issue, we classify each pixel of the input comic image into three classes: background, panel, and border. The two attaching panels are then separated by a border, so we can detect both panels if the border is well classified. The loss we use to train the model is the Categorical Cross-Entropy loss with a Softmax activation.
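As an illustration only (not the authors' exact implementation), this three-class objective corresponds to a standard per-pixel categorical cross-entropy; a minimal PyTorch-style sketch, assuming the network outputs one logit map per class:

import torch.nn.functional as F

def three_class_loss(logits, targets):
    # logits:  (B, 3, H, W) raw class scores from the U-Net decoder.
    # targets: (B, H, W) integer labels: 0 = background, 1 = panel, 2 = border.
    # F.cross_entropy combines the softmax over the class dimension with the
    # negative log-likelihood, averaged over all pixels of the batch.
    return F.cross_entropy(logits, targets)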
A. Training details
Figure 8. Two detected panels are attached. Morphological operators are used to remove this issue.

We follow the U-Net architecture in [8], using the convolutional part of the popular VGG-16 model. We leverage transfer learning and train our model from a network pre-trained on the ImageNet dataset [9]. We trained the model using the Adam optimizer, with a learning rate of 0.001, a momentum of 0.9 and a weight decay of 0.001, for a total of 50 epochs.
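Purely as a hedged sketch of this training configuration (a PyTorch-style setup is assumed; `model` and `train_loader` are hypothetical, `three_class_loss` is the sketch above, and we interpret the reported momentum of 0.9 as Adam's first-moment coefficient beta1):

import torch

# `model` is assumed to be the VGG-16-based U-Net described above,
# initialized from ImageNet pre-trained weights.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,               # learning rate 0.001
    betas=(0.9, 0.999),    # beta1 = 0.9 (our reading of the reported momentum)
    weight_decay=1e-3,     # weight decay 0.001
)

for epoch in range(50):                       # 50 epochs in total
    for images, targets in train_loader:      # assumed (image, label-mask) batches
        optimizer.zero_grad()
        loss = three_class_loss(model(images), targets)
        loss.backward()
        optimizer.step()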
B. Post processing
The output of our model consists of three energy maps of the same size as the input image, representing the probability that each pixel belongs to each of the three classes. Taking the index of the maximum value at each pixel, we obtain the segmentation mask in which each pixel is assigned a class id. In our case, pixels take the values 0, 1, 2, where 0 means background, 1 means panel, and 2 means border. In order to obtain the panels from the output of our model, we set the value of the border pixels to 0 and obtain a binary image where 1 represents the panels. Next, the panel coordinates can be identified by any algorithm that finds contours in a binary image. In our work, we use the find contours method from the scikit-image library (https://scikit-image.org).
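A minimal sketch of this post-processing step (our own code; it assumes the three probability maps are stacked in a single NumPy array):

import numpy as np
from skimage import measure

def extract_panel_contours(prob_maps, level=0.5):
    # prob_maps: (3, H, W) energy maps for background, panel and border.
    labels = np.argmax(prob_maps, axis=0)        # per-pixel class id: 0, 1 or 2
    binary = (labels == 1).astype(float)         # keep panel pixels only
    # Each closed contour around a connected panel region approximates one panel polygon.
    return measure.find_contours(binary, level)  # list of (N, 2) arrays of (row, col) points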
In order to boost performance, we have used transfer learning. However, the pre-trained weights from the ImageNet [9] classification task are obtained from natural images, which are very different from comic images. Hence, our model's performance still suffers from limited training data, especially for attaching and overlapping panels. One of the issues we encounter is that two panels are well detected and separated, but their long shared border is not fully detected: some pixels remain attached (see Fig. 8). This issue may be solved by adding more training data, but in our case we have found that simple morphological operators can help. We apply a morphological erosion operator to separate attached segmented panels, then a morphological dilation operator to recover the original shape of the segmented panels. The morphological operators shrink (erosion) or grow (dilation) the image regions.
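A straightforward reading of this erosion-then-dilation step (a morphological opening) with scikit-image; the structuring-element radius is illustrative, not a value from the paper:

from skimage.morphology import binary_erosion, binary_dilation, disk

def separate_attached_panels(panel_mask, radius=3):
    # panel_mask: boolean (H, W) mask where True marks panel pixels.
    footprint = disk(radius)                          # illustrative structuring element
    eroded = binary_erosion(panel_mask, footprint)    # breaks thin connections between panels
    return binary_dilation(eroded, footprint)         # grows regions back towards their original shape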
Table I
PANEL EXTRACTION COMPARISON

Method    Sequencity612                  eBDtheque
          Recall   Precision   F1        Recall   Precision   F1
[10]      -        -           -         78       79          78.5
[1]       -        -           -         70       84          76.36
[5]       -        -           -         44       68          53.43
Ours      82.04    86.94       84.42     78.47    82.15       80.27
C. Training, validation and test sets
In order to evaluate the proposed model, we have divided
the Sequencity612 dataset into 3 sets: the training, validation
and test sets containing 512, 50 and 50 images respectively.
Because the eBDtheque dataset is small (99 images, as we
removed the noisy image "WARE ACME 024.jpg", which contains more than 100 very small panels), we initialized the model with the weights pre-trained on Sequencity612 and then fine-tuned it for 10 epochs with a learning rate of 0.0001, a weight decay of 0.001 and a momentum of 0.9.
We ran the cross-validation tests on 5 different training and
testing sets. Each training set contains 70 images and each
testing set contains the 29 remaining images. The reported
result on the eBDtheque dataset is the average of these 5
validations.
IV. RESULTS
We have evaluated our model on 2 datasets which contain modern comic books. The first dataset is Sequencity612 [11]; it is extracted from the online comic book library Sequencity¹ and contains mostly modern comics. This private dataset contains 612 pages. The second dataset is the popular public eBDtheque dataset [7], which contains 100 pages. All existing works report Precision, Recall, and F1-score, but there are different definitions of a correctly detected panel within a page. Three definitions are used in previous works. In [4], [5], [3], the authors used the IoU (intersection over union) metric, that is, the IoU between a detected panel and the ground-truth panel should be higher than t = 0.5. In [1], [2], the authors also used the IoU metric, but with a different threshold t = 0.9. In [2], the authors used the distance of the four endpoints between detected panels and ground-truth panels, with the threshold set to 50. In our opinion, using the IoU metric with t = 0.9 or the distance-based metric is better. The IoU metric with t = 0.5 is sufficient for the general object detection task [12] but not for panel segmentation, because the borders of the detected panels are not tight at t = 0.5. In our evaluation, we used the IoU metric with t > 0.9.
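For reference, a small sketch of this IoU check for mask-shaped panels (our own illustration, not the evaluation code of the cited works):

import numpy as np

def panel_iou(pred_mask, gt_mask):
    # Intersection over union between a predicted and a ground-truth panel mask.
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return np.logical_and(pred, gt).sum() / union

def is_correct_detection(pred_mask, gt_mask, t=0.9):
    # A panel counts as correctly detected when its IoU exceeds the threshold t.
    return panel_iou(pred_mask, gt_mask) > t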
A. Evaluation
In our experiments, we compare the results of our model
with the results of existing methods on the eBDtheque
dataset. We also report our results on the Sequencity612 dataset. Next, we show qualitative results on difficult cases from the eBDtheque dataset (the Sequencity612 images are protected by copyright, so we are not allowed to show or share them).

¹ http://sequencity.com

Figure 9. Some sample results of our model.
Table I shows that our model reaches state-of-the-art performance. The best existing performance on the eBDtheque dataset comes from a traditional algorithm [1]; this method has better precision than ours (by 1.85%) but our model has better recall (by 8.47%). The better recall means that our model can detect more panels. For the Sequencity612 dataset, we achieve good performance (F1-score of 84.42%). This performance is not as good as that of the method in [1] (F1-score higher than 90%) on their private datasets, which are different from Sequencity612. But note that our model aims at detecting free-shape panels with their integral content, which is more difficult than the task in previous works.
B. Qualitative Results
Fig. 9 shows sample results of our model. One can see that our model works well on typical panels (rectangular panels) and detects all the content of difficult panels, including irregular polygons (upper left), attaching panels (upper right), a panel with a balloon outside of its border (lower left), and a non-rectangular 4-edge polygon panel (lower right).
Fig. 10 shows some failure cases of our model. The most common issue we have encountered is attaching/overlapping panels. The model cannot separate all of these cases, and we need to investigate them in more detail to find a solution. The easiest option would be to add more samples to the training data. Another solution would be to use augmentation techniques, which is possible because we can move existing non-attaching panels over other panels to generate new samples containing attaching/overlapping panels, as sketched below.
Figure 10. Failure results of our model.
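A possible sketch of such an augmentation (entirely hypothetical, not part of our current pipeline): paste the crop of a non-attaching panel onto another page so that it overlaps an existing panel, and update the label mask accordingly:

import numpy as np

def paste_panel(page, mask, panel_crop, panel_mask_crop, top, left):
    # page: (H, W, 3) comic page image; mask: (H, W) class labels (0/1/2).
    # panel_crop / panel_mask_crop: image and label crop of a source panel.
    # top, left: chosen so the pasted panel overlaps or touches an existing one.
    page = page.copy()
    mask = mask.copy()
    h, w = panel_mask_crop.shape
    region = panel_mask_crop > 0                  # pixels belonging to the pasted panel
    page[top:top + h, left:left + w][region] = panel_crop[region]
    mask[top:top + h, left:left + w][region] = panel_mask_crop[region]
    return page, mask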
V. CONCLUSION
In this paper, we propose to approach comic panel extraction differently from the traditional approach, aiming at extracting free-shape panels from comic book images. We proposed a three-class segmentation model based on the U-Net architecture. The experimental results on the modern comics dataset Sequencity612 and the public dataset eBDtheque demonstrate that the proposed method achieves state-of-the-art performance and can detect difficult panels. However, there are still some cases where the model does not perform well, as discussed above. Besides the data augmentation technique, a weighting scheme for important pixels (such as border-class pixels), based on a deeper analysis of the failure cases, may help.
ACKNOWLEDGMENT
This work is supported by the French National Research Agency (ANR) in the framework of the 2017 LabCom program (ANR 17-LCV2-0006-01), and by the CPER NUMERIC program funded by the Region Nouvelle Aquitaine, the Charente-Maritime French Department, the La Rochelle conurbation authority (CDA) and the European Union through FEDER funding.
REFERENCES
[1] Y. Wang, Y. Zhou, and Z. Tang, “Comic frame extraction
via line segments combination,” in 2015 13th International
Conference on Document Analysis and Recognition (ICDAR),
Aug 2015, pp. 856–860.
[2] Z. He, Y. Zhou, Y. Wang, S. Wang, X. Lu, Z. Tang, and
L. Cai, “An end-to-end quadrilateral regression network for
comic panel extraction,” in ACM Multimedia, 2018.
[3] C. Rigaud, “Segmentation and indexation of complex objects
in comic book images,” Thesis, Université de La Rochelle,
Dec 2014. [Online]. Available: https://tel.archives-ouvertes.
fr/tel-01221308
[4] T. Ogawa, A. Otsubo, R. Narita, Y. Matsui, T. Yamasaki, and
K. Aizawa, “Object detection for comics using manga109
annotations,” CoRR, vol. abs/1803.08670, 2018. [Online].
Available: https://arxiv.org/abs/1803.08670
[5] N.-V. Nguyen, C. Rigaud, and J.-C. Burie, “Digital comics
image indexing based on deep learning,” Journal of Imaging,
vol. 4, no. 7, p. 89, Jul 2018.
[6] ——, “Multi-task model for comic book image analysis,” in
MultiMedia Modeling, I. Kompatsiaris, B. Huet, V. Mezaris,
C. Gurrin, W.-H. Cheng, and S. Vrochidis, Eds. Cham:
Springer International Publishing, 2019, pp. 637–649.
[7] C. Guérin, C. Rigaud, A. Mercier, F. Ammar-Boudjelal,
K. Bertet, A. Bouju, J. C. Burie, G. Louis, J. M. Ogier, and
A. Revel, “eBDtheque: A representative database of comics,”
in 2013 12th International Conference on Document Analysis
and Recognition, Aug 2013, pp. 1145–1149.
[8] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolu-
tional networks for biomedical image segmentation,” in Med-
ical Image Computing and Computer-Assisted Intervention –
MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and
A. F. Frangi, Eds., pp. 234–241.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
“ImageNet: A Large-Scale Hierarchical Image Database,” in
CVPR09, 2009.
[10] C. Rigaud, N. L. Thanh, J.-C. Burie, J.-M. Ogier, M. Iwata,
E. Imazu, and K. Kise, “Speech balloon and speaker as-
sociation for comics and manga understanding,” in 2015
13th International Conference on Document Analysis and
Recognition (ICDAR), Aug 2015, pp. 351–355.
[11] N. V. Nguyen, C. Rigaud, and J. Burie, “Comic characters
detection using deep learning,” in 2nd International Workshop
on coMics Analysis, Processing, and Understanding, MANPU
2017, Kyoto, Japan, November 9-15, 2017, 2017, pp. 41–46.
[12] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams,
J. Winn, and A. Zisserman, “The pascal visual object classes
challenge: A retrospective,” Int. J. Comput. Vision, vol. 111,
no. 1, pp. 98–136, Jan 2015.