
What do We Expect from Comic Panel Extraction?

Nhu-Van Nguyen, Christophe Rigaud, Jean-Christophe Burie
Laboratoire L3i, SAIL joint Laboratory
Université de La Rochelle
17042 La Rochelle CEDEX 1, France
{nhu-van.nguyen, christophe.rigaud, jean-christophe.burie}
Abstract—Among comic book elements (panels, balloons, comic characters, texts, etc.), panels play an important role in content adaptation and story animation for small devices such as mobile phones and tablets. Various panel extraction techniques have been investigated over the last ten years; most existing approaches rely on the assumption that a comic panel is either a simple rectangle or a more complex 4-edge polygon. In this paper, we re-examine the definition of comic panels: is a 4-edge polygon really sufficient to represent the integral information of a comic panel? We suggest a modern definition of comic panels, together with a strong panel extraction baseline method for this new approach.
Keywords-Panel extraction; Story board extraction; Comic
book image analysis; Pixel classification; Deep learning;
The image analysis community has investigated comic book element extraction for almost ten years, with methods ranging from low-level analysis such as text recognition to high-level analysis such as style recognition. Within this research, the extraction of comic book content (e.g., panels, balloons, texts, comic characters, etc.) is one of the most studied tasks. Among these content elements, panels are particularly important for storing and displaying comic books on devices with small screens, such as phones or tablets. Displaying each panel of the comic on the phone helps readers to easily read and navigate a comic book.
Most recent methods for panel extraction use either image-processing techniques or deep learning models [1], [2], [3], [4], [5], [6]. [3] uses connected component analysis and outermost contours to detect the panels. [1] incorporates three types of visual patterns extracted from the comic image at different levels, and a tree conditional random field framework is used to label each visual pattern by modeling its contextual dependencies. [4], [5], [6] consider panel extraction as an object detection task and adapt popular deep learning object detection models to this problem. [2] goes further with a deep learning regression network to detect quadrilateral panels. However, image-processing techniques have trouble detecting complex or overlapping panels. Recent deep learning approaches can detect more types of panels, but not all of them.
All current methods are based on the assumption that a panel in a comic is a 4-edge polygon (quadrilateral shape). While this assumption holds for many comic books, especially older ones, it is irrelevant for many modern comics or Japanese mangas. In our project, we are working on modern comics from America, Europe and Japan. We have found that many comic or manga books do not follow the current assumption about panels. This finding led us to study a new definition of the comic panel.
In this paper, we confirm that a 4-edge polygon is not sufficient to represent the integral information of a comic panel. We propose another definition of comic panels which treats them as free-shape elements. Moreover, we propose a deep learning model which extracts panels with state-of-the-art performance; this method can be reproduced easily to serve as a baseline for the new approach.
In the next section, we answer the question “What do we expect from panel extraction?” and propose a better assumption about comic panels. In Section III, we introduce
a deep learning model which can extract the newly defined
type of panels. In Section IV, we present and discuss the
results of our model and conclude this work in Section V.
The most obvious purpose of comic panel extraction,
together with speech balloon extraction which can help users
read small texts in mobile devices, is to be able to display
comic books on the screen of mobile phones or tablets so
that readers can clearly see the details of the comic (see
Fig. 1). For this purpose, there are two important conditions when identifying comic panels: high border accuracy is needed so that the display is not affected too much by parts that do not belong to the panel, and extracted panels need to contain all the contents of the panel so that readers will not miss any part of the story.
Current approaches are based on the assumption that a
comic frame is a rectangle or a polygon with 4 edges. One
of the frequently asked questions is whether the assumption
meets the requirements we expect from panel extraction. It
is undeniable that there are many comics that are completely
consistent with this hypothesis but when working with recent
comics we found that there are many other cases in which
this assumption is no longer true. In the following section,
we present examples which prove that the current hypothesis
is not always reasonable. Hence, we need a better definition of the comic panel, which means we need a new approach to identify comic panels that satisfies our needs for the panel extraction task.

Figure 1. Panel extraction to display on a small-size device. The order of the panels is also computed to help users navigate through the comic book.
In the first example (Fig. 2), the red lines show the
definition of this panel in the eBDtheque dataset [7], as well
as other works [1], [2], [5], [4]. We can see that we will lose
a part of the panel content. The second example (Fig. 3)
shows that the panel of this comic cannot be represented by a 4-edge polygon: parts of the text and the character will be lost if we only use the rectangular panel (in red). The third example (Fig. 7) shows a rounded panel, which the most recent methods in [1], [2] cannot detect because they consider a panel as a quadrilateral. The panel in Fig. 5 does not have borders, so the methods in [1], [2], [3] cannot detect it because these approaches need solid borders to detect panels. The fifth example (Fig. 6) shows two attaching panels, where the rectangular representation of the panel does not contain all of its text; this is a difficult case for all methods.
From all of the examples we have discovered, we can see
that the conventional assumption is not enough to represent
the comic panels. We propose to use a different approach
to extract panels. Instead of extracting the bounding boxes
such as [5], [4], or 4-edge polygons as [2], [1], we propose
to segment the panels as free shapes which contain all the content of the panel.

Figure 2. Example of a missing balloon in the detected panel (red rectangle).

Figure 3. Example of a cut-off object (balloon) and a character in the detected panel (red rectangle).

Figure 4. Example of a panel with more than 4 edges (the blue polygon).

That means, in our definition, a
panel is represented by an approximated polygon (contour)
which contains all of its content. For example, in Fig. 2,
the panel is represented by the blue contour. The method
in [3] can extract panels by their contours; however, it cannot detect overlapping, attaching, or no-border panels. In the next section, we describe our proposed method to extract the panels, which not only extracts full-content panels but also deals with attaching panels, overlapping panels and no-border panels.
Figure 5. Example of a panel without borders.
Figure 6. Example of a panel which is attached to another panel.
In order to extract free-shape panels, we cast the problem as an image segmentation task (a pixel classification task). In the deep learning approach, image segmentation is formulated as pixel classification, where each pixel in the image is classified into one of the object classes or the background. Hence, the output of the neural network architecture for a binary segmentation task is an energy map where each point represents the probability that the pixel at the same position belongs to the object class. Most
image segmentation architectures (such as U-Net or Mask R-CNN) optimize the model by minimizing the sum of the Cross-Entropy loss between the ground-truth label distribution and the prediction label distribution over all pixels.

Figure 7. Example of a no-edge panel.

For binary segmentation, the loss function is:
$$\mathcal{L} = -\sum_{i=1}^{M} \sum_{j=1}^{N} \left[ y_j^{(i)} \log P(x_j^{(i)}) + (1 - y_j^{(i)}) \log\left(1 - P(x_j^{(i)})\right) \right] \quad (1)$$

where we have a training set $x^{(1)}, \dots, x^{(M)}$ consisting of $M$ independent examples, with $y_j^{(i)} \in \{0, 1\}$ the label of the pixel at position $j$ (of $N$ total pixels) in image $x^{(i)}$. The probability $P(x_j^{(i)}) = \frac{1}{1 + e^{-f(x_j^{(i)})}}$ is the sigmoid activation of the value $f(x_j^{(i)})$ at pixel $j$ in the last feature map of the neural network.
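As a sanity check of Eq. (1), the per-pixel binary cross-entropy can be sketched in a few lines of NumPy (a minimal illustration with function names of our own, not the actual training code):

```python
import numpy as np

def sigmoid(z):
    # P(x_j) = 1 / (1 + exp(-f(x_j))), applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

def binary_seg_loss(logits, labels):
    """Sum of per-pixel binary cross-entropy over a batch of M images.

    logits: (M, H, W) raw network outputs f(x_j)
    labels: (M, H, W) ground-truth masks with values in {0, 1}
    """
    p = np.clip(sigmoid(logits), 1e-7, 1.0 - 1e-7)  # avoid log(0)
    per_pixel = -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))
    return per_pixel.sum()

# Tiny 2x2 "image": one object pixel, confidently predicted
logits = np.array([[[4.0, -4.0], [-4.0, -4.0]]])
labels = np.array([[[1.0, 0.0], [0.0, 0.0]]])
loss = binary_seg_loss(logits, labels)  # small, since predictions match labels
```

Flipping the labels makes every prediction wrong and the same formula returns a much larger loss, as expected.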
We have adopted the popular U-Net architecture [8] for our binary problem: classifying the pixels into two classes, background and panel. However, the U-Net model is not an instance segmentation model, which means that it can classify a pixel into a class but cannot separate attaching or overlapping objects. Even if the model classifies the panel pixels in the input image correctly, two attaching panels merge into a single region in the final segmentation mask and are counted as one detected panel. In order to overcome this issue, we
aim at classifying each pixel in the input comic image into
three classes: background, panels, and borders. Now, the two
attaching panels are separated by a border, so we can detect
two panels if the border is well classified. The loss we use to
train the model is the Categorical Cross-Entropy Loss with
the Softmax activation.
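The three-class loss described above can be sketched in NumPy as follows (an illustrative re-implementation under our own naming, not the framework code used for training):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last (class) axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_ce(logits, labels, num_classes=3):
    """Mean per-pixel categorical cross-entropy.

    logits: (H, W, C) raw class scores per pixel
    labels: (H, W) integer class ids (0=background, 1=panel, 2=border)
    """
    probs = softmax(logits)
    onehot = np.eye(num_classes)[labels]               # (H, W, C)
    per_pixel = -(onehot * np.log(probs + 1e-7)).sum(axis=-1)
    return per_pixel.mean()

labels = np.array([[0, 1], [2, 1]])
good_logits = 10.0 * np.eye(3)[labels]  # confident, correct predictions
loss = categorical_ce(good_logits, labels)
```

With confident correct scores the loss is near zero, while all-zero logits (a uniform prediction over the three classes) give the expected log(3) per pixel.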
A. Training details
We follow the U-Net architecture in [8], using the convolutional part of the popular VGG-16 model. We leverage transfer learning to train our model from a network pre-trained on the ImageNet dataset [9]. We trained the model
Figure 8. Two detected panels are attached. Morphological operators are used to remove this issue.
using the Adam optimizer with a learning rate of 0.001, a momentum of 0.9, and a weight decay of 0.001, for a total of 50 epochs.
B. Post processing
The output of our model is three energy maps having the same size as the input image, representing the probability of each pixel belonging to each of the three classes. Taking the index of the maximum value at each pixel, we obtain the segmentation mask in which a class id is assigned to each pixel. In our case, pixels take the values 0, 1, 2, where 0 means the background, 1 the panel, and 2 the border. In order to obtain the panels from the output of our model, we set the value of border pixels to 0 and obtain a binary image where 1 represents the panels. Next, the panel coordinates can be identified by any algorithm that finds the contours in a binary image. In our work, we use the find contours method of the Skimage library.
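The post-processing steps can be sketched as follows. This is a NumPy-only illustration in which a simple 4-connected flood fill stands in for Skimage's contour finding; the function names are ours, not from the paper's code:

```python
import numpy as np

def panels_from_energy_maps(energy):
    """Turn the model's three energy maps into a binary panel image.

    energy: (3, H, W) probabilities for background (0), panel (1), border (2).
    Border pixels are mapped to 0, so only panel pixels remain as 1.
    """
    mask = energy.argmax(axis=0)            # per-pixel class id in {0, 1, 2}
    return (mask == 1).astype(np.uint8)     # border (2) and background (0) -> 0

def label_panels(binary):
    """4-connected component labelling (a stand-in for contour finding)."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    count = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not labels[sy, sx]:
                count += 1
                stack = [(sy, sx)]
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and binary[y, x] and not labels[y, x]:
                        labels[y, x] = count
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return labels, count

# Two panels separated by a one-pixel border column
h, w = 5, 7
panel = np.zeros((h, w)); panel[:, :3] = 0.9; panel[:, 4:] = 0.9
border = np.zeros((h, w)); border[:, 3] = 0.9
background = 1.0 - np.maximum(panel, border)
binary = panels_from_energy_maps(np.stack([background, panel, border]))
regions, n_panels = label_panels(binary)   # n_panels == 2
```

Because the border column is zeroed out before labelling, the two panels are recovered as separate regions.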
In order to boost performance, we have used transfer
learning. However, the pre-trained weights from ImageNet
[9] classification task are obtained from natural images, which are very different from comic images. Hence, our model's performance still suffers from limited training data, especially for attaching and overlapping panels. One issue we have encountered is that two panels are well detected and separated, but their long shared border is not well detected: some attaching pixels remain (see Fig. 8). This issue may be solved
by adding more training data. But in our case, we have found
that simple morphology operators can help. We have applied
the morphological erosion operator to separate attaching
segmented panels, then the morphological dilation operator
to recover the original shape of the segmented panels. The
morphological operators help to shrink (erosion) or grow
(dilation) the image regions.
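The erosion/dilation step can be illustrated with a NumPy-only sketch using a 3x3 structuring element (this is our illustration of the idea; the exact operators and element size used in the experiments may differ):

```python
import numpy as np

def erode(mask):
    """Binary erosion with a 3x3 structuring element (shrinks regions)."""
    padded = np.pad(mask, 1, constant_values=0)
    out = np.ones_like(mask)
    for dy in range(3):
        for dx in range(3):
            out = out & padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

def dilate(mask):
    """Binary dilation with a 3x3 structuring element (grows regions)."""
    padded = np.pad(mask, 1, constant_values=0)
    out = np.zeros_like(mask)
    for dy in range(3):
        for dx in range(3):
            out = out | padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

# Two panels wrongly joined by a 1-pixel bridge of "attaching pixels"
two = np.zeros((5, 9), np.uint8)
two[1:4, 1:4] = 1   # left panel
two[1:4, 5:8] = 1   # right panel
two[2, 4] = 1       # spurious bridge
separated = dilate(erode(two))  # bridge removed, panel bodies recovered
```

Erosion deletes the thin bridge (and shrinks the panels), and the subsequent dilation grows the panel bodies back without reconnecting them.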
Table I

Method   Sequencity612                eBDtheque
         Recall  Precision  F1        Recall  Precision  F1
[10]     -       -          -         78      79         78.5
[1]      -       -          -         70      84         76.36
[5]      -       -          -         44      68         53.43
Ours     82.04   86.94      84.42     78.47   82.15      80.27
C. Training, validation and test sets
In order to evaluate the proposed model, we have divided
the Sequencity612 dataset into 3 sets: the training, validation
and test sets containing 512, 50 and 50 images respectively.
Because the eBDtheque dataset is small (99 images, as we removed the noisy image "WARE ACME 024.jpg" which contains more than 100 very small panels), we initialized the model with weights pre-trained on Sequencity612, then trained it for 10 epochs with a learning rate of 0.0001. We used a weight decay of 0.001 and a momentum of 0.9. We ran cross-validation tests on 5 different training and testing sets. Each training set contains 70 images and each testing set contains the 29 remaining images. The reported result on the eBDtheque dataset is the average of these 5 runs.
We have evaluated our model on 2 datasets which contain modern comic books. The first dataset is Sequencity612 [11]; it is extracted from the online comic book library Sequencity and contains mostly modern comics. This private dataset contains 612 pages. The second dataset is the popular public eBDtheque dataset [7], which contains 100 pages. All existing works report Precision, Recall, and F1-score, but there are different definitions of a correctly detected panel within a page. Three definitions are used in previous works. In [4], [5], [3], the authors used the IoU (intersection over union) metric: the IoU between a detected panel and the ground-truth panel should be higher than t = 0.5. In [1], [2], the authors also used the IoU metric, but with a different threshold t = 0.9. In [2], the authors additionally used the distance between the four endpoints of detected and ground-truth panels, with the threshold set to 50. In our opinion, using the IoU metric with t = 0.9 or the distance-based metric is better. The IoU metric with t = 0.5 is enough for the general object detection task [12], but it is not enough for panel segmentation because the borders of detected panels are not tight at t = 0.5. In our evaluation, we used the IoU metric with t = 0.9.
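A sketch of this evaluation protocol: mask IoU with a threshold t, and greedy matching of detections to ground truths to count true positives (our simplified reading of the metric; the matching details in the cited works may differ):

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union between two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def prf(detections, ground_truths, t=0.9):
    """Precision/recall/F1, counting a detection correct if IoU > t."""
    matched = set()
    tp = 0
    for det in detections:
        for k, gt in enumerate(ground_truths):
            if k not in matched and mask_iou(det, gt) > t:
                matched.add(k)
                tp += 1
                break
    precision = tp / len(detections) if detections else 0.0
    recall = tp / len(ground_truths) if ground_truths else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gt = np.zeros((10, 10), bool); gt[2:8, 2:8] = True
det_good = gt.copy()                                   # perfect detection
det_bad = np.zeros((10, 10), bool); det_bad[0:3, 0:3] = True
p, r, f = prf([det_good, det_bad], [gt], t=0.9)        # p=0.5, r=1.0
```

The spurious detection lowers precision to 0.5 while the single ground-truth panel is still found, so recall stays at 1.0.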
A. Evaluation
In our experiments, we compare the results of our model
with the results of existing methods on the eBDtheque
dataset. We also report our results on the Sequencity612
Figure 9. Some sample results of our model
dataset. Next, we show qualitative results on difficult cases from the eBDtheque dataset (the Sequencity612 images are protected by copyright, so we are not allowed to show or share them).

Table I shows that our model reaches state-of-the-art performance. The best existing performance on the eBDtheque
dataset comes from a traditional algorithm [1], this method
has better precision than ours (1.85% better) but our model
has better recall (8.47% better). Our better recall means that
our model can detect more panels. For the Sequencity612
dataset, we achieve good performance (F1-score at 84.42%).
This performance is not as good as that of the method in [1] (F1-score higher than 90%) on their private datasets, which differ from Sequencity612. Note, however, that our model aims at detecting free-shape panels with their integral content, which is more difficult than the task in previous works.
B. Qualitative Results
Fig. 9 shows sample results of our model. One can see that our model works well on typical panels (rectangular panels) and that it detects all the content of difficult panels, including irregular polygons (upper left), attaching panels (upper right), a panel with a balloon outside of its border (lower left), and a non-rectangular 4-edge polygon panel (lower right).

Fig. 10 shows some failure results of our model. The most
frequent issue we have encountered is attaching/overlapping panels. The model cannot separate all these cases, and we need to investigate further to find a solution. The easiest fix would be to add more samples to the training data. Another solution would be to use data augmentation techniques; this is possible because we can move existing non-attaching panels over other panels to create new samples containing attaching/overlapping panels.
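This augmentation idea could be sketched as a copy-paste of an existing panel region to a new location (purely hypothetical code; the function names and box format are our own, not an implemented pipeline):

```python
import numpy as np

def paste_panel(page, mask, panel_box, target_yx):
    """Create a synthetic attaching/overlapping sample by copying an
    existing panel crop to a new location on the same page.

    page: (H, W) grayscale image, mask: (H, W) panel mask,
    panel_box: (y0, y1, x0, x1) source box, target_yx: top-left paste point.
    """
    y0, y1, x0, x1 = panel_box
    crop_img = page[y0:y1, x0:x1]
    crop_msk = mask[y0:y1, x0:x1]
    ty, tx = target_yx
    h, w = crop_img.shape
    new_page, new_mask = page.copy(), mask.copy()
    new_page[ty:ty + h, tx:tx + w] = crop_img
    new_mask[ty:ty + h, tx:tx + w] = crop_msk
    return new_page, new_mask

page = np.zeros((6, 8), np.uint8)
mask = np.zeros((6, 8), np.uint8)
page[1:3, 1:4] = 200    # pixels of an existing panel
mask[1:3, 1:4] = 1
new_page, new_mask = paste_panel(page, mask, (1, 3, 1, 4), (3, 4))
```

Pasting the crop adjacent to (or overlapping) an existing panel yields new training samples with attaching panels, with the mask updated consistently.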
Figure 10. Failure results of our model.
In this paper, we propose to approach comic panel extraction differently from the traditional approach, aiming at extracting free-shape panels from comic book images. We proposed a three-class segmentation model based on the U-Net architecture. The experimental results on the modern comics dataset Sequencity612 and the public dataset eBDtheque demonstrate that the proposed method achieves state-of-the-art performance and can detect difficult panels. However, there are still some cases where the model does not perform well, as discussed above. Besides the data augmentation technique, a weighting scheme for important pixels (such as border-class pixels), based on a deeper analysis of the failure cases, may help.
This work is supported by the French National Research Agency (ANR) in the framework of the 2017 LabCom program (ANR 17-LCV2-0006-01), the CPER NUMERIC program funded by the Region Nouvelle Aquitaine, CDA, the Charente-Maritime French Department, the La Rochelle conurbation authority (CDA), and the European Union through FEDER funding.
[1] Y. Wang, Y. Zhou, and Z. Tang, “Comic frame extraction
via line segments combination,” in 2015 13th International
Conference on Document Analysis and Recognition (ICDAR),
Aug 2015, pp. 856–860.
[2] Z. He, Y. Zhou, Y. Wang, S. Wang, X. Lu, Z. Tang, and
L. Cai, “An end-to-end quadrilateral regression network for
comic panel extraction,” in ACM Multimedia, 2018.
[3] C. Rigaud, “Segmentation and indexation of complex objects in comic book images,” Thesis, Université de La Rochelle, Dec 2014. [Online]. Available: https://tel.archives-ouvertes.
[4] T. Ogawa, A. Otsubo, R. Narita, Y. Matsui, T. Yamasaki, and K. Aizawa, “Object detection for comics using manga109 annotations,” CoRR, vol. abs/1803.08670, 2018.
[5] N.-V. Nguyen, C. Rigaud, and J.-C. Burie, “Digital comics image indexing based on deep learning,” Journal of Imaging, vol. 4, no. 7, p. 89, Jul 2018.
[6] ——, “Multi-task model for comic book image analysis,” in
MultiMedia Modeling, I. Kompatsiaris, B. Huet, V. Mezaris,
C. Gurrin, W.-H. Cheng, and S. Vrochidis, Eds. Cham:
Springer International Publishing, 2019, pp. 637–649.
[7] C. Guérin, C. Rigaud, A. Mercier, F. Ammar-Boudjelal, K. Bertet, A. Bouju, J.-C. Burie, G. Louis, J.-M. Ogier, and A. Revel, “eBDtheque: A representative database of comics,” in 2013 12th International Conference on Document Analysis and Recognition, Aug 2013, pp. 1145–1149.
[8] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolu-
tional networks for biomedical image segmentation,” in Med-
ical Image Computing and Computer-Assisted Intervention –
MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and
A. F. Frangi, Eds., pp. 234–241.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
“ImageNet: A Large-Scale Hierarchical Image Database,” in
CVPR09, 2009.
[10] C. Rigaud, N. L. Thanh, J.-C. Burie, J.-M. Ogier, M. Iwata, E. Imazu, and K. Kise, “Speech balloon and speaker association for comics and manga understanding,” in 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Aug 2015, pp. 351–355.
[11] N. V. Nguyen, C. Rigaud, and J. Burie, “Comic characters
detection using deep learning,” in 2nd International Workshop
on coMics Analysis, Processing, and Understanding, MANPU
2017, Kyoto, Japan, November 9-15, 2017, 2017, pp. 41–46.
[12] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal visual object classes challenge: A retrospective,” Int. J. Comput. Vision, vol. 111, no. 1, pp. 98–136, Jan 2015.
... Advances in computer vision could also be implemented within the application of regions, such as automatic selection of comic panels (Nguyen, Rigaud, & Burie, 2019), human figures (Imaizumi, Yamanishi, Nishihara, & Ozawa, 2021;Nguyen, Rigaud, Revel, & Burie, 2021), or human faces (Kumar, Kaur, & Kumar, 2019;Ogawa et al., 2018). ...
Full-text available
Multimodal combinations of writing and pictures have become ubiquitous in contemporary society, and scholars have increasingly been turning to analyzing these media. Here we present a resource for annotating these complex documents: the Multimodal Annotation Software Tool (MAST). MAST is an application that allows users to analyze visual and multimodal documents by selecting and annotating visual regions, and to establish relations between annotations that create dependencies and/or constituent structures. By means of schema publications, MAST allows annotation theories to be citable, while evolving and being shared. Documents can be annotated using multiple schemas simultaneously, offering more comprehensive perspectives. As a distributed, client-server system MAST allows for collaborative annotations across teams of users, and features team management and resource access functionalities, facilitating the potential for implementing open science practices. Altogether, we aim for MAST to provide a powerful and innovative annotation tool with application across numerous fields engaging with multimodal media.
Speaker estimation in a manga is one component that needs to be recognized in conducting research using manga. To identify the speaker of a text line in a manga, a dataset of who speaks the lines is needed. In order to construct such a dataset easily, we proposed a method to annotate who speaks a text line based on characteristics of information design and the human factor. Then, we developed a prototype system, constructed a dataset that mapped between text lines and speakers in the Manga109 dataset, and distributed the dataset on the Web. Then, we analyzed the dataset and showed that the perfect match rate was about 80% when there were five annotators. In addition, we found that variation in annotation occurred even with human judgment and that this was partly due to lines requiring reference to other frames. We also found that it was difficult for annotators to map speakers in scenes involving science fiction and battles by calculating the Evaluation Consistency Indicators.KeywordsComicMangaText LineSpeaker-Line Dataset
Comic panel detection is the task of identifying panel regions from a given comic image. Many comic datasets provide the borders of the panel lines as its panel region annotations, expressed in formats such as bounding boxes. However, since such panel annotations are usually not aware of the contents of the panel, they do not capture objects that extend outside of the panels, causing such objects to be partially discarded when panels are cropped along the annotations. In such applications, a content-aware annotation that contains all of the contents in each panel is suitable. In this paper, we assess the problem of content-aware comic panel detection using two types of annotations. We first create a small dataset with bounding box annotations where each region contains the entire contents of each panel, and train a detection model. We also explore training a pixel-wise instance segmentation model using synthetic data.KeywordsComic panel detectionObject detectionInstance segmentation
The purpose of this research is to analyze the textual source attributes of explanations and reviews about comics. Comics are difficult to process in terms of the intended story because they are primarily composed of pictures and text. One of the processing methods is to analyze comics text on the Web, particularly the description of characters and reviews including the reader’s impression about the comic. Sources of textual information, such as explanations or reviews, are selected according to the application of the study. However, differences among textual sources regarding comics are not taken into consideration in the analysis. This paper classifies words appearing frequently in the text semantically, with results showing that explanations include words that express the story, for example, the family structure, physical information, and sex of the characters for describing the characters. Conversely, the review frequently uses words that provide meta-information about comics, such as illustrations and style. The proposed method revealed that explanations of comics are more useful as textual sources for analyzing story information than reviews.KeywordsDifferences in Data SourcesReview Sentences of ComicsExplanation Texts of CharactersCharacteristics of Comic Story
The researchers of the graphic novels face challenges coping with the graphic novel images as they are associated with the range of designs, layout, text, and actions. These challenges make the content learning and object detection task much more difficult. To overcome this, deep learning approaches are incorporated in several domains through the use of machine learning techniques. A graphic novel is the composition of a text and graphic. To fully analyze the content of the graphic novel, understanding of the story, dialogs, line drawings, characters, and their location is required. Especially in comic analysis, detection of comic characters has been an interesting area as it inculcates adequate understanding of comics. The comparisons between graphic communication and languages are standardized in visual language theory (VLT). The visual language consists of signs that are highly conventionalized, but they vary according to the distribution of where they are positioned inside the panel of the graphic novel strip. The visual morphology uses semiotic references such as motion lines, scopic lines, radial lines, focal lines, spikes, twirls, spirals, and the shapes such as heart and stars. Depending on the location they are placed, the meaning varies. The research studies are focused to identify these conventions and how these signs interact and modify others. In this work, to identify the semiotics at different locations from a graphic novel strip, a custom YOLOv3 detector model is trained followed by the panel extraction. The individual panels are extracted using contour analysis. The trained model could detect the semiotics from the graphic novel images when they are placed around the character of interest with the mean average precision of 75.8%. 
The proposed method SPEGYOLO extracts the semiotics in the panels of graphic novels and further analysis based on their location and orientation will help the users to apprehend the meaning associated with it.KeywordsVisual language theorySemiotic referencesYOLOv3 detector modelPanel extractionSPEGYOLO
Full-text available
In current scenario, we have multiple technologies are available to transmit the data in wireless ad hoc network (WANET).But, compare to other existing models, we have proposed a new model where not only this model helps to reduce the energy consumption of nodes while transmitting the data but also our model focus on security aspect where it will help to solve some of the common attacks such as collusion attack, Dos attack, etc., our model consist of 3 stages/phases, namely as (i) registration, (ii) clustering, and (iii) transmission. In registration phase, each node in the network will identified by unique id, and nodes will registered themselves to the network, and in second phase, nearby nodes will form clustered based on the threshold value and transmitting power. By using multi-hop transmission after finding the route with the help of relay node search optimization (RESO), where appropriate relay node will be selected based on the parameters which will reduce the energy computation by the node and by the blockchain-based transaction from node of one cluster to another cluster. This will deminise the security threat in the network. However, in our model, we expecting the result better than existing models based on the theoretical knowledge.KeywordsBlockchainWANETClusteringRelay nodeDos attack
Advances in technology have propelled the growth of methods and methodologies that can create the desired multimedia content. “Automatic image synthesis” is one such instance that has earned immense importance among researchers. In contrast, audio-video scene synthesis, especially from document images, remains challenging and less investigated. To bridge this gap, we propose a novel framework, Comic-to-Video Network (C2VNet), which evolves panel-by-panel in a comic strip and eventually creates a full-length video (with audio) of a digitized or born-digital storybook. This step-by-step video synthesis process enables the creation of a high-resolution video. The proposed work’s primary contributions are; (1) a novel end-to-end comic strip to audio-video scene synthesis framework, (2) an improved panel and text balloon segmentation technique, and (3) a dataset of a digitized comic storybook in the English language with complete annotation and binary masks of the text balloon. Qualitative and quantitative experimental results demonstrate the effectiveness of the proposed C2VNet framework for automatic audio-visual scene synthesis.
Full-text available
The digital comic book market is growing every year now, mixing digitized and digital-born comics. Digitized comics suffer from a limited automatic content understanding which restricts online content search and reading applications. This study shows how to combine state-of-the-art image analysis methods to encode and index images into an XML-like text file. Content description file can then be used to automatically split comic book images into sub-images corresponding to panels easily indexable with relevant information about their respective content. This allows advanced search in keywords said by specific comic characters, action and scene retrieval using natural language processing. We get down to panel, balloon, text, comic character and face detection using traditional approaches and breakthrough deep learning models, and also text recognition using LSTM model. Evaluations on a dataset composed of online library content are presented, and a new public dataset is also proposed. © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (
Full-text available
In this thesis, we review, highlight and illustrate the challenges related to comic book image analysis in order to give to the reader a good overview about the last research progress in this field and the current issues. We propose three different approaches for comic book image analysis that are composed by several processing. The first approach is called "sequential'' because the image content is described in an intuitive way, from simple to complex elements using previously extracted elements to guide further processing. Simple elements such as panel text and balloon are extracted first, followed by the balloon tail and then the comic character position in the panel. The second approach addresses independent information extraction to recover the main drawback of the first approach : error propagation. This second method is called “independent” because it is composed by several specific extractors for each elements of the image without any dependence between them. Extra processing such as balloon type classification and text recognition are also covered. The third approach introduces a knowledge-driven and scalable system of comics image understanding. This system called “expert system” is composed by an inference engine and two models, one for comics domain and another one for image processing, stored in an ontology. This expert system combines the benefits of the two first approaches and enables high level semantic description such as the reading order of panels and text, the relations between the speech balloons and their speakers and the comic character identification.
Full-text available
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at .
Comic panel extraction, i.e., decomposing a comic page image into panels, has become a fundamental technique for meeting many practical needs of mobile comic reading, such as comic content adaptation and comic animation. Most existing approaches are based on handcrafted low-level visual patterns and heuristic rules, and thus have limited ability to deal with irregular comic panels. Only one existing method is based on deep learning and achieves better experimental results, but its architecture is redundant and its time efficiency is poor. To address these problems, we propose an end-to-end, two-stage quadrilateral regression network architecture for comic panel detection, which inherits the architecture of Faster R-CNN. At the first stage, we propose a quadrilateral region proposal network for generating panel proposals, based on a newly proposed quadrilateral regression method. At the second stage, we classify the proposals and refine their shapes with the proposed quadrilateral regression method again. Extensive experimental results demonstrate that the proposed method significantly outperforms the existing comic panel detection methods on multiple datasets in terms of F1-score and page accuracy.
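The abstract does not spell out the regression parameterisation. One plausible way to regress a non-rectangular panel, sketched here purely for illustration (all names and the specific encoding are ours, not the paper's), is to predict each of the four corners as a normalised offset from the matching corner of the panel's axis-aligned bounding box:

```python
def bounding_box(quad):
    """Axis-aligned box (x_min, y_min, x_max, y_max) of a
    quadrilateral given as four (x, y) corners."""
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return min(xs), min(ys), max(xs), max(ys)

def encode(quad):
    """Express each corner as an offset from the matching corner of
    the axis-aligned box, normalised by the box size."""
    x0, y0, x1, y1 = bounding_box(quad)
    w, h = x1 - x0, y1 - y0
    anchors = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return [((x - ax) / w, (y - ay) / h)
            for (x, y), (ax, ay) in zip(quad, anchors)]

def decode(offsets, box):
    """Invert `encode` given the axis-aligned box."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    anchors = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return [(ax + dx * w, ay + dy * h)
            for (dx, dy), (ax, ay) in zip(offsets, anchors)]

# A slanted panel: the top edge is shifted right of the bottom edge.
panel = [(12, 0), (72, 4), (68, 64), (8, 60)]
assert decode(encode(panel), bounding_box(panel)) == panel
```

A rectangular panel under this encoding has all-zero offsets, so the quadrilateral head degenerates gracefully to ordinary bounding-box detection on regular pages.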
With the growth of digitized comics, image understanding techniques are becoming important. In this paper, we focus on object detection, which is a fundamental task of image understanding. Although convolutional neural network (CNN)-based methods achieved good performance in object detection for naturalistic images, there are two problems in applying these methods to the comic object detection task. First, there is no large-scale annotated comics dataset, and CNN-based methods require large-scale annotations for training. Second, the objects in comics overlap heavily compared to naturalistic images; this overlap causes the assignment problem in the existing CNN-based methods. To solve these problems, we propose a new annotated dataset and a new CNN model. We annotated an existing image dataset of comics and created the largest annotation dataset, named Manga109-annotations. For the assignment problem, we proposed a new CNN-based detector, SSD300-fork. We compared SSD300-fork with other detection methods using Manga109-annotations and confirmed that our model outperformed them based on the mAP score.
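The assignment problem can be made concrete with plain IoU arithmetic: when one ground-truth box sits inside another (a face drawn on a body, a balloon over a panel), a single anchor can exceed the usual 0.5 matching threshold for both at once, so the matcher cannot decide which object the anchor should be trained on. A small sketch (the boxes and threshold are illustrative, not taken from the paper):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes
    given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    return inter / (area(a) + area(b) - inter)

# A face box nested inside a body box: one anchor overlaps both
# ground truths above the common 0.5 matching threshold.
face = (40, 40, 80, 80)
body = (30, 30, 90, 100)
anchor = (35, 35, 85, 90)
assert iou(anchor, face) > 0.5 and iou(anchor, body) > 0.5
```

Forking the detection head per object category, as SSD300-fork does, sidesteps this ambiguity by letting the same spatial location be assigned to different objects in different branches.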
The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008–2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community’s progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
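For reference, the core quantity behind the VOC detection ranking is average precision over a ranked list of detections. The sketch below implements a simplified, non-interpolated variant (VOC itself interpolates the precision/recall curve, and the paper's normalised AP further reweights classes); the example data are invented:

```python
def average_precision(scores, labels, n_positive):
    """Area under the precision/recall curve: rank detections by
    confidence, then average the precision measured at each rank
    where a true positive is recovered."""
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = 0
    ap = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            tp += 1
            ap += tp / rank  # precision at this recall point
    return ap / n_positive

# Three correct detections out of four ground-truth objects,
# with one false positive ranked second.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [True, False, True, True]
print(average_precision(scores, labels, n_positive=4))
```

Dividing by the number of ground-truth positives rather than the number of detections is what penalises missed objects as well as false alarms.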