
# Text/Non-text Image Classification in the Wild with Convolutional Neural Networks

## Abstract

Text in natural images is an important source of information, which can be utilized for many real-world applications. This work focuses on a new problem: distinguishing images that contain text from a large volume of natural images. To address this problem, we propose a novel convolutional neural network variant, called Multi-scale Spatial Partition Network (MSP-Net). The network decides whether an image contains text by predicting text existence in all image blocks, which are spatial partitions of the input image at multiple scales. The whole image is classified as a text image (an image containing text) as long as one of the blocks is predicted to contain text. The network classifies images very efficiently by predicting all blocks simultaneously in a single forward propagation. Through experimental evaluations and comparisons on public datasets, we demonstrate the effectiveness and robustness of the proposed method.
Author’s Accepted Manuscript
Xiang Bai, Baoguang Shi, Chengquan Zhang, Xuan
Cai, Li Qi
PII: S0031-3203(16)30392-2
DOI: http://dx.doi.org/10.1016/j.patcog.2016.12.005
Reference: PR5977
To appear in: Pattern Recognition
Revised date: 5 December 2016
Accepted date: 8 December 2016
Cite this article as: Xiang Bai, Baoguang Shi, Chengquan Zhang, Xuan Cai and
Li Qi, Text/Non-text Image Classification in the Wild with Convolutional Neural
Networks, Pattern Recognition, http://dx.doi.org/10.1016/j.patcog.2016.12.005
Xiang Bai^a, Baoguang Shi^a, Chengquan Zhang^a, Xuan Cai^b, Li Qi^b
^a School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China 430074
^b The Third Research Institute of the Ministry of Public Security, Shanghai, China
Keywords: Natural images, Text/non-text image classification, Convolutional neural network, Multi-scale spatial partition
1. Introduction
Scene text is an important source of information that is helpful for many real-world applications, including image retrieval, human-computer interaction, blind assistance systems, and transportation navigation. Therefore, scene text reading, which includes text detection and recognition, has attracted much attention in the community [1, 2, 3]. However, in a large volume of natural images and video data, typically only a small portion contains text. In our estimation on an image dataset collected from social networks, only 10%-15% of the images contain text. Directly applying scene text reading algorithms to mine textual information tends to be inefficient, as most existing text reading algorithms are time-consuming. To precisely localize text in an image, algorithms like [4, 5, 6, 7, 8] typically require searching a large set of text-line or character candidates, or dense image patches. The search would be meaningless if an image contains no text at all. Therefore, an efficient preprocessing algorithm that quickly determines whether an image contains text is desirable, and can be utilized as an essential stage of systems for text reading.

Email addresses: xbai@hust.edu.cn (Xiang Bai), shibaoguang@gmail.com (Baoguang Shi), zchengquan@gmail.com (Chengquan Zhang), caixuanfire@126.com (Xuan Cai), quick.qi@foxmail.com (Li Qi)
Preprint submitted to Pattern Recognition, December 10, 2016

In this work, we address a relatively new problem: text/non-text image classification in the wild. We define a text image (or text-positive image) as an image that contains text, regardless of the scale or location of the text, and a non-text image (or text-negative image) as an image that contains no text at all. Although some previous works have addressed text/non-text classification, their focus is mainly on video frames [10, 11], document images [12], or handwriting images [13, 14]. In contrast, we focus on discriminating text/non-text natural images, which has seldom been studied.

Unlike scene text detection, text/non-text image classification requires neither finding precise text locations nor recognizing text contents. Instead, computational efficiency is important. A text/non-text image classification algorithm should classify a large number of images in a short period of time, while achieving high precision and recall.

Figure 1: Examples of text/non-text images. (a) Text images contain at least one piece of scene text, regardless of scale and location; (b) non-text images contain no text at all.

We argue that the proposed problem is challenging in four aspects. First, scene text exhibits large variations in font, scale, color, orientation, illumination, and language type. The examples shown in Fig. 1 demonstrate some of these variations. Second, it is difficult to distinguish scene text from background objects that resemble text, such as windows, grass, and fences. Third, the locations of scene text are not known in advance; text may appear at any position in an image. Last, a text/non-text image classification algorithm should be efficient enough to process a large amount of data in a reasonable period of time.

Essentially, text/non-text image classification is a binary classification problem. A straightforward solution is to fine-tune a well-trained image classifier, such as the Convolutional Neural Network (CNN) model proposed in [15]. However, due to the above-mentioned challenges, general image classification algorithms may not work well for this problem. In particular, conventional CNN models do not explicitly handle the large scale and location variations exhibited by scene text.

In this paper, we propose a novel variant of CNN, named Multi-scale Spatial Partition Network (MSP-Net), which is specially designed for the problem of text/non-text image classification. The main idea is to classify all image blocks, which are regions produced by multi-scale spatial partition of an input image. If at least one of the blocks is classified as a text block, the whole image is recognized as a text image; otherwise it is a non-text image. Since blocks have various sizes and positions, the proposed block-level classification scheme allows us to detect text at multiple scales and locations. Moreover, as a by-product, the proposed MSP-Net predicts coarse locations and scales of text.

MSP-Net can be evaluated and trained efficiently. During testing, all blocks of an image are classified simultaneously in a single forward propagation. Together with our optimized GPU implementation, this lets the proposed network classify text/non-text images very efficiently. MSP-Net is end-to-end trainable, because every layer of it can back-propagate error differentials. It can be easily trained using images and the corresponding block-level annotations.

The contributions of this paper are summarized as follows: (1) We propose a new scheme for text/non-text image classification based on block-level classification, rather than whole-image-level classification; (2) We propose a novel variant of CNN, called MSP-Net, which efficiently classifies text/non-text images and is robust to large variations in the scale, location and language type of scene text; (3) As a by-product, we show that MSP-Net is also capable of coarsely localizing scene text.

The rest of this paper is organized as follows. In Sec. 2 we review related work. In Sec. 3, we describe and explain the architecture of MSP-Net. Experimental evaluations, comparisons with other methods, and discussions are presented in Sec. 4. Sec. 5 concludes our work.

2. Related work
Scene text reading. Scene text reading has been extensively studied in recent years. Scene text detection and scene text recognition are the two major topics in this area, and most previous works focus on them [4, 5, 7, 16, 17, 18]. As mentioned, text/non-text image classification can be handled by a scene text detection algorithm. Epshtein et al. [19] utilized the stroke width transform to seek candidate character components. Neumann and Matas [20] extracted maximally stable extremal regions (MSERs) as candidate character regions to set up a novel and robust pipeline for text localization in real-world images. Instead of using single characters or strokes, Zhang et al. exploited the symmetry property of character groups to directly extract text-line candidates. However, most of these methods are designed for precisely localizing text, which requires a lot of time to search and filter text/character candidates, whereas text/non-text image classification only aims at finding whether a natural image contains text.

Image classification. In essence, text image discrimination is a subtask of image classification. Existing methods can be summarized into three categories: feature-encoding-based methods, deep-learning-based methods, and hybrid methods. The Bag of Words (BoW) framework is a typical feature-encoding-based method. Local descriptors such as HOG [21], SIFT [22], and LBP [23] are extracted from regions of interest (ROIs) and aggregated by feature encoding methods such as the vector of locally aggregated descriptors (VLAD) [24] or locality-constrained linear coding (LLC) [25]. After that, an image can be represented by a compact and discriminative vector, which is effective in image classification or retrieval. Recently, convolutional neural networks have achieved high performance in image classification. Equipped with many convolutional layers, rectified units, sampling layers, fully-connected layers, etc., a CNN can learn features and perform image classification in an end-to-end manner. The learned CNN features have demonstrated effectiveness and robustness for image classification [26], object detection [27], contour detection [28], etc. However, most existing CNN models require a fixed-size input image. He et al. [29] proposed the SPP-net model to generate a fixed-length representation regardless of image size/scale. In our approach, we also take advantage of spatial pyramid pooling to generate fixed-length representations for image blocks.

Text/non-text image classification. Several works address the problem of text image discrimination in document images or video data, but most of them are not suitable for natural images. In [30], Alessi et al. proposed a method to detect the potential text blocks of a document image and set a threshold value to distinguish text and non-text documents. Vidya et al. [31] proposed a system to classify the text and non-text regions in handwritten documents, which cannot deal with natural images either. To our knowledge, our previous work [32] first proposed a suitable method, which combines three mature techniques (MSERs, BoW, and CNN) for text/non-text image classification. We also released a large dataset that can serve as a benchmark for evaluating text/non-text image classification algorithms. Another important related work is [10], in which Shivakumara et al. proposed a method for video text frame classification based on fixed-size block partition. The text block can indicate the coarse position of text. Inspired by this idea, our work proposes multi-scale spatial partition for natural text/non-text image classification, motivated by the large variation of text scale and location in natural scenes. Unlike the simple features adopted in [10], we adopt a convolutional neural network to make the block-level prediction in an end-to-end manner by moving the multi-scale spatial partition operation from image space to feature-map space. The multi-scale spatial partition plays the same role as the ROI layer designed in Fast R-CNN [33], which can extract the CNN features for each region efficiently. Furthermore, in our method an image block is classified as a text block by considering text scale and area together, so that a text block can also predict the position and scale of text at a coarse level.

3. The Proposed Methodology
3.1. Overview
Figure 2: The overall architecture of MSP-Net. The diagram labels image-level feature generation (conv-1 to conv-5 with max-pooling, followed by deconv-3, deconv-4 and deconv-5), the multi-level feature maps, the multi-scale spatial partition, the block-level representation, and the classification sub-network, which outputs text blocks and non-text blocks.

As introduced in Sec. 1, our starting point is to classify text/non-text images by examining images at the block level. However, unlike [10], which uses hand-crafted features on pre-partitioned image blocks, our method combines spatial partition, feature extraction and text/non-text block classification into a single network (MSP-Net). MSP-Net consists of four major parts: image-level feature generation, multi-scale spatial partition, block-level representation generation, and a text/non-text block classification sub-network. The overall structure of MSP-Net is illustrated in Fig. 2; the network requires only the whole image as input and examines all the image blocks in an end-to-end manner.

Given an input image, the network outputs block-level classification results in a single forward propagation. Inside the network, an image is first fed into the convolutional layers, whose structure is derived from the VGG-16 CNN structure [26], to generate a hierarchy of feature maps. The feature maps are then upsampled to the same size by deconvolutional layers and concatenated in depth, resulting in a representation that comprises equally sized feature maps. Next, the maps are spatially partitioned into blocks of different sizes. An adaptive max-pooling layer, equivalent to a spatial pyramid pooling layer [29] with only one pyramid level, is applied to each block, producing feature vectors of the same length. Following the pooling, the feature vector of each block is fed into the fully-connected layers, which make the binary classification for that block. The final classification of the whole image is the logical OR of the individual block classifications, i.e., as long as one block is classified as containing text, the image is considered a text image; otherwise it is a non-text image.

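The image-level decision rule can be illustrated with a minimal sketch. The function name `classify_image` and the 0.5 decision threshold are assumptions made for illustration; the text only specifies that the image label is the logical OR of the block labels.

```python
def classify_image(block_probs, threshold=0.5):
    """Return True (text image) if any block is predicted to contain text.

    block_probs: per-block probabilities of containing text.
    threshold:   hypothetical per-block decision threshold (an assumption).
    """
    block_is_text = [p > threshold for p in block_probs]
    # Logical OR over all block-level decisions.
    return any(block_is_text)


print(classify_image([0.1, 0.7, 0.2]))  # one text block is enough
print(classify_image([0.1, 0.2, 0.3]))  # no text block
```

With a single shared threshold, the OR rule is the same as thresholding the maximum block probability.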
3.2. Image-level feature generation
Recently, feature maps from different convolutional layers have been combined successfully for pixel-level prediction tasks [34, 35, 36], as they carry rich and hierarchical information. In our implementation, all images are scaled to a fixed height (500 pixels in our case), keeping their aspect ratios. The feature generation part of MSP-Net consists of five convolutional stages derived from the VGG-16 model [26], which has achieved superior performance on image classification. Given the scaled input images, the convolutional stages produce a hierarchy of feature maps, where the map sizes produced by different stages vary. Three deconvolutional layers are connected to the third, fourth and fifth convolutional stages (abbreviated as conv-3, conv-4, and conv-5), respectively. Via deconvolution, the maps are upsampled to the same size. The feature representation is then the concatenation in depth of these upsampled maps, which is a hierarchical representation of the whole image.

In a CNN, each convolutional layer has a particular receptive field size [37], indicating the size of the image region to which every node on the feature maps is path-connected. Smaller receptive field sizes lead to finer feature granularity, while larger sizes lead to coarser granularity. In our network settings, the receptive field size of conv-3 is 40, which favors lower-level and local features. For conv-5, the size is 192, which enables it to describe higher-level global context. As shown in Fig. 3, the (upsampled) feature maps of conv-3 have higher sensitivities to text strokes and edges, while the feature maps of conv-4 and conv-5 favor whole text regions.

Figure 3: Feature maps of different layers. (a) is an input image; (b), (c) and (d) are feature maps randomly selected from conv-3, conv-4 and conv-5, respectively.

The deconvolutional layers perform strided convolution on feature maps [35]. They upsample the input maps by ratios that are roughly equal to the deconvolution strides. With proper strides, the output feature maps have identical width and height, so that they can be concatenated in depth.

3.3. Multi-scale spatial partition
Similar to the ROI pooling layer designed for fast feature extraction for each proposal in Fast R-CNN [33], we move the operation of multi-scale spatial partition from image-level space to feature-level space, in order to efficiently obtain the features of each image block. In the partition step, the generated feature maps are spatially partitioned into blocks with respect to several block sizes. We use block sizes of $\frac{w}{N} \times \frac{h}{N}$, where $w$ and $h$ are the width and height of the input feature maps, and $N$ is an integer. Each block size uniformly partitions the maps into $N^2$ equally sized blocks. Mathematically, the partition is formulated by:

$$F_{ij}(x, y) = F\left(x + i\frac{w}{N},\; y + j\frac{h}{N}\right), \quad 0 \le x < \frac{w}{N},\; 0 \le y < \frac{h}{N}, \tag{1}$$

where $F(x, y)$ denotes the generated feature maps and $F_{ij}$ denotes the block at row $j$, column $i$ ($i$ and $j$ are the column and row indices, each ranging from 0 to $N-1$).

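As a minimal sketch of the partition in Eq. (1), the function below splits a single-channel 2-D map into N×N blocks. The name `partition` and the use of integer (floor) boundaries when the map size is not divisible by N are assumptions; the text does not say how non-integer block sizes are rounded.

```python
def partition(feature_map, n):
    """Split a 2-D map (list of rows) into n x n blocks, following Eq. (1).

    Returns a dict keyed by (i, j) = (column index, row index).
    Floor-based boundaries are an assumption for sizes not divisible by n.
    """
    h, w = len(feature_map), len(feature_map[0])
    blocks = {}
    for j in range(n):  # row index
        y0, y1 = j * h // n, (j + 1) * h // n
        for i in range(n):  # column index
            x0, x1 = i * w // n, (i + 1) * w // n
            blocks[(i, j)] = [row[x0:x1] for row in feature_map[y0:y1]]
    return blocks


# A 6x6 map whose entries encode their own position (value = 6 * row + column).
fm = [[r * 6 + c for c in range(6)] for r in range(6)]
blocks = partition(fm, 3)
print(len(blocks))     # number of blocks for N = 3
print(blocks[(1, 0)])  # block in column 1, row 0
```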
Following [29, 33], each block on the feature maps is associated with a region on the input image:

$$I_{ij}(x, y) = I\left(x + i\frac{W}{N},\; y + j\frac{H}{N}\right), \quad 0 \le x < \frac{W}{N},\; 0 \le y < \frac{H}{N}, \tag{2}$$

where $I(x, y)$ is an input image whose size is $W \times H$. We let the feature block describe its corresponding image region. Although this results in a redundant description, since the receptive field of the feature block is larger than the region we define, it simplifies our formulation and works well in practice [29, 33]. Furthermore, we perform multi-scale spatial partition by choosing different values for $N$ (e.g., 1, 3, 5 and 7), resulting in feature blocks of different sizes. The feature blocks describe local image regions of different sizes, and they are all used in the following adaptive pooling.

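To make the multi-scale partition concrete, the sketch below enumerates the image regions of Eq. (2) for N = 1, 3, 5 and 7. The function name `image_regions`, the tuple layout, and the 700×500 example size are illustrative assumptions.

```python
def image_regions(width, height, scales=(1, 3, 5, 7)):
    """List the image regions of Eq. (2) for every partition scale.

    Each entry is (n, i, j, x0, y0, region_width, region_height), where
    i is the column index and j is the row index of the block.
    """
    regions = []
    for n in scales:
        for j in range(n):
            for i in range(n):
                x0 = i * width / n
                y0 = j * height / n
                regions.append((n, i, j, x0, y0, width / n, height / n))
    return regions


regions = image_regions(700, 500)  # hypothetical 700x500 input image
print(len(regions))                # 1 + 9 + 25 + 49 blocks in total
```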
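In a neural network, all operations need to back-propagate error differentials. The back-propagation of the multi-scale spatial partition operation is formulated by:

$$\frac{\partial L}{\partial F(x, y)} = \sum_{N} \frac{\partial L}{\partial F_{ij}\left(x - i\frac{w}{N},\; y - j\frac{h}{N}\right)}, \quad i = \left\lfloor \frac{xN}{w} \right\rfloor,\; j = \left\lfloor \frac{yN}{h} \right\rfloor, \quad 0 \le x < w,\; 0 \le y < h, \tag{3}$$

where $L$ denotes the loss and the sum runs over the partition scales $N$. In other words, the gradient at each position of the feature maps is the sum of the gradients $\partial L / \partial F_{ij}$ of the feature blocks that contain that position, one block per scale.
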
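The gradient accumulation described above can be sketched as follows: each block gradient is added back at the position its block was cut from, and contributions from different scales are summed. The name `partition_backward`, the dictionary layout, and the floor-based block offsets are assumptions for illustration.

```python
def partition_backward(block_grads, h, w):
    """Accumulate block gradients into a gradient for the full h x w map.

    block_grads: {n: {(i, j): 2-D gradient of block (i, j) at scale n}},
                 with i the column index and j the row index.
    """
    grad = [[0.0] * w for _ in range(h)]
    for n, blocks in block_grads.items():
        for (i, j), g in blocks.items():
            x0, y0 = i * w // n, j * h // n  # top-left corner of the block
            for dy, row in enumerate(g):
                for dx, v in enumerate(row):
                    grad[y0 + dy][x0 + dx] += v
    return grad


# Toy 2x2 map partitioned at scales N = 1 and N = 2.
grads = {
    1: {(0, 0): [[1, 1], [1, 1]]},
    2: {(0, 0): [[1]], (1, 0): [[2]], (0, 1): [[3]], (1, 1): [[4]]},
}
print(partition_backward(grads, 2, 2))
```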
3.4. Block-level representation and classiﬁcation
Since multiple scales are used in the multi-scale spatial partition (we use 4 scales to partition the feature maps into 1×1, 3×3, 5×5 and 7×7 feature blocks), the output feature blocks, and the image blocks they represent, are of different sizes, as illustrated in Fig. 2. Hence, we normalize the representation of each image block to the same size before feeding it into the classification sub-network. To generate a fixed-length feature representation, an adaptive max-pooling layer is adopted. As illustrated in Fig. 4 for one partition scale, a block is equally divided into $N_s \times N_s$ sub-blocks ($N_s \times N_s$ denotes the block number partitioned under the $s$-th scale; $s = 1$ and $N_s \times N_s = 3 \times 3$ here), in a way similar to how an image is divided into blocks. Then, max-pooling is applied to every sub-block to generate a feature vector whose length is $N_{map}$, the depth of the feature map. Last, the feature vectors generated from all sub-blocks are concatenated into one vector, whose length is $N_s^2 N_{map}$.

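A minimal sketch of adaptive max-pooling on one channel of a block is given below. The bin boundaries (floor for the start, ceiling for the end) follow the common convention for adaptive pooling and are an assumption, as is the function name `adaptive_max_pool`. Applying the same pooling to every channel and concatenating the results gives a vector of length bins² × N_map regardless of the block size.

```python
import math


def adaptive_max_pool(block, bins):
    """Max-pool a 2-D block (list of rows) into bins x bins values.

    Bin boundaries use floor/ceil, so bins may overlap by one row or
    column when the block size is not divisible by `bins` (an assumption).
    """
    h, w = len(block), len(block[0])
    out = []
    for r in range(bins):
        y0 = (r * h) // bins
        y1 = math.ceil((r + 1) * h / bins)
        for c in range(bins):
            x0 = (c * w) // bins
            x1 = math.ceil((c + 1) * w / bins)
            out.append(max(v for row in block[y0:y1] for v in row[x0:x1]))
    return out


small = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
large = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
# Blocks of different sizes yield vectors of the same length.
print(adaptive_max_pool(small, 2), adaptive_max_pool(large, 2))
```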
The spatial partition within a block is similar to the partition of the feature maps described in Sec. 3.3. However, the purpose of dividing blocks into sub-blocks is to capture the spatial relationships within a block, in order to improve the discriminative power of the resulting block-level representation. Essentially, the sub-network that generates the block-level representation is a special case of the spatial pyramid pooling layer used in SPP-Net [29]. The spatial pyramid pooling layer consists of several pyramid levels of pooling layers, where each pooling layer is an adaptive layer that outputs a fixed-size feature by dividing the feature map into a fixed number of bins. In fact, our spatial partition operation is equivalent to a spatial pyramid pooling layer with a single pyramid level whose number of bins is $N_s \times N_s$.

After feature extraction for all blocks of an image, we classify the blocks using a single classification sub-network. The classification sub-network is a part of MSP-Net and consists of three fully-connected layers. Since a fixed-length representation of each image block is generated by adaptive max-pooling, all feature vectors can be fed into the classification sub-network as one batch to make the text/non-text block classification.

Besides, the numbers of dimensions of all block descriptors are the same, so the classification sub-network accepts blocks of arbitrary sizes. Recall that the other parts of the network, namely the convolutional layers, deconvolutional layers, and spatial partition layers, also accept arbitrarily sized input maps. Consequently, MSP-Net classifies input images of arbitrary sizes. This property allows us to directly feed original images into the network during testing, without any cropping or resizing that may cause loss of information.

Figure 4: Block-level feature generation. Each of the 3×3 feature blocks (one feature block of arbitrary size is shown) passes through spatial pyramid pooling with only one level (6×6), which yields a fixed-length representation of 36 × N_map dimensions.

3.5. Network training
Ground truth. An image block is defined as a text block only if it meets two constraints, on text area and text scale. Let $r_1$ denote the ratio of the block area occupied by text, and $r_2$ the ratio of the text-line height to the image block height. In our experiments, $r_1$ must be over 0.05 and $r_2$ must be over 0.5. As the dataset provides not only image-level labels but also bounding boxes of text lines, we can easily infer the ground truth of all image blocks. In the example illustrated in Fig. 5, the yellow bounding boxes in Fig. 5(b) are the ground truth of text lines, which indicate the text area and scale (or height) of the text lines. Each image block generated by multi-scale spatial partition in Fig. 5(c)-(f) is defined as positive if it meets the two constraints above, and as negative otherwise. Besides, if an image block is classified as a text block, it not only means that the whole image should be considered a text image, but also indicates the coarse position and scale of the text.

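One possible reading of the two constraints is sketched below. The exact definitions are assumptions: here `r1` is the summed overlap area between the block and the text-line boxes divided by the block area, and `r2` is the height of the tallest overlapping text line divided by the block height. The function name `block_label` and the box format are hypothetical.

```python
def block_label(block, text_lines, r1_min=0.05, r2_min=0.5):
    """Label a block as text (True) or non-text (False).

    block and each text line are axis-aligned boxes (x0, y0, x1, y1) in pixels.
    r1: fraction of the block area covered by text lines (assumed definition).
    r2: tallest overlapping text-line height / block height (assumed definition).
    """
    bx0, by0, bx1, by1 = block
    block_area = (bx1 - bx0) * (by1 - by0)
    block_height = by1 - by0
    text_area = 0.0
    max_line_height = 0.0
    for tx0, ty0, tx1, ty1 in text_lines:
        overlap_w = min(bx1, tx1) - max(bx0, tx0)
        overlap_h = min(by1, ty1) - max(by0, ty0)
        if overlap_w > 0 and overlap_h > 0:
            text_area += overlap_w * overlap_h
            max_line_height = max(max_line_height, ty1 - ty0)
    r1 = text_area / block_area
    r2 = max_line_height / block_height
    return r1 > r1_min and r2 > r2_min


block = (0, 0, 100, 100)
print(block_label(block, [(10, 20, 90, 80)]))  # large text line
print(block_label(block, [(10, 40, 90, 50)]))  # text too small for this block
```

Under this reading, small text fails the height constraint in large blocks but can still be matched by the finer partition scales.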
Figure 5: Ground truth of image blocks at different scales. (a) is a natural text image; the yellow bounding boxes in (b) show the text lines. The image is partitioned at multiple scales of 1×1, 3×3, 5×5 and 7×7 in (c), (d), (e), and (f), respectively. White blocks are positive and black blocks are negative.

Loss definition. Since the output of MSP-Net is binary, we use the cross-entropy loss as the objective function. Suppose a training image $I$ is partitioned into $N$ image blocks, whose labels are denoted by $\{l_i\}_{i=1}^{N}$. The objective is to minimize the summed cross-entropy loss of all image blocks:

$$L = -\sum_{i=1}^{N} \left( l_i \log p_i + (1 - l_i) \log(1 - p_i) \right), \tag{4}$$

where $p_i$ is the probability that the $i$-th image block is classified as a text block and $l_i$ is the label of the $i$-th image block.

We use the VGG-16 model pre-trained on ImageNet [15] to initialize the 5 convolutional stages (the first 13 convolutional layers) of MSP-Net. Then, the whole network is fine-tuned, updating its parameters by the back-propagation algorithm. Since the number of text blocks is much smaller than the number of non-text blocks, we use a class-balancing weight as a simple way to offset this imbalance between text and non-text blocks. Thus, we replace Eq. (4) with the following formulation:

$$L = -\sum_{i=1}^{N} \left( \lambda\, l_i \log p_i + (1 - \lambda)(1 - l_i) \log(1 - p_i) \right), \tag{5}$$

where $\lambda$ denotes the class-balancing weight, whose value is 2/3 in the training stage.

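The class-balanced loss of Eq. (5) can be sketched in a few lines. The loss is written as a negative log-likelihood so that minimizing it is meaningful; the clamping constant `eps` and the function name `balanced_cross_entropy` are assumptions added for numerical safety and illustration.

```python
import math


def balanced_cross_entropy(labels, probs, lam=2 / 3, eps=1e-12):
    """Class-balanced cross-entropy summed over all blocks (Eq. 5).

    labels: 1 for text blocks, 0 for non-text blocks.
    probs:  predicted probabilities of being a text block.
    lam:    class-balancing weight applied to text blocks.
    eps:    clamp to avoid log(0) (an implementation assumption).
    """
    total = 0.0
    for l, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= lam * l * math.log(p) + (1 - lam) * (1 - l) * math.log(1 - p)
    return total


# A missed text block costs twice as much as an equally wrong non-text block.
print(balanced_cross_entropy([1], [0.5]))
print(balanced_cross_entropy([0], [0.5]))
```

Setting `lam=0.5` recovers the unweighted loss of Eq. (4) up to a factor of 1/2.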
4. Experiments
In this section, we first evaluate the proposed method on several public benchmarks, including the TextDis benchmark [32], the ICDAR2003 dataset [38] and Hua's dataset [39]. Then we compare our method with some existing methods, which are either text/non-text image classification methods or general image classification methods. Last, in the discussion part, we evaluate the effects of some parameters in our design.

4.1. Datasets
TextDis benchmark. This dataset was introduced in [32] and contains 7302 text images and 8000 non-text images. The benchmark randomly selects 2000 images of each class to build the testing set, and the remaining images are used for training. To our knowledge, this is the first dataset for the discrimination of text and non-text natural images. Due to the large variation in the fonts, scales, colors, languages and orientations of the text in the images, this dataset is quite challenging. Precision, recall and F-measure are used as the evaluation protocol for measuring the results of different algorithms.

ICDAR2003 dataset. This dataset consists of 251 camera images collected and released for evaluating scene text detection methods. Since all images are taken from natural scenes, there is still large variation in the fonts, scales and colors of the text. The most significant differences from TextDis are that the text is in English only and that its orientation is horizontal or nearly horizontal.

Hua's dataset. This dataset is a small video text detection benchmark, which contains 42 text frames and 3 non-text frames. Unlike in natural images, text appearing in the text frames usually has regular formats in terms of fonts, scales and positions.

4.2. Implementation details

Table 1: The details of MSP-Net. Each convolutional stage has 2 or 3 convolutional layers. 'k', 's' and 'p' denote the kernel size, stride, and padding size in convolutional layers, and 'ws' denotes the window size of a pooling layer.

| Layers | Configurations |
|---|---|
| conv-1 | 2×{#map:64, k:3×3, s:1, p:1} |
| maxpooling | ws:2×2, s:2 |
| conv-2 | 2×{#map:128, k:3×3, s:1, p:1} |
| maxpooling | ws:2×2, s:2 |
| conv-3 | 3×{#map:256, k:3×3, s:1, p:1} |
| maxpooling | ws:2×2, s:2 |
| conv-4 | 3×{#map:512, k:3×3, s:1, p:1} |
| maxpooling | ws:2×2, s:2 |
| conv-5 | 3×{#map:512, k:3×3, s:1, p:1} |
| deconv-3 | #map:128, k:1×1, s:1 |
| deconv-4 | #map:256, k:4×4, s:2 |
| deconv-5 | #map:256, k:8×8, s:4 |
| multi-scale spatial partition | #bin:{1×1, 3×3, 5×5, 7×7} |
| fc-1 | #unit:4096 |
| fc-2 | #unit:4096 |
| output | #unit:2 |

Architecture details. The details of our proposed network (MSP-Net) are listed in Table 1. The first 5 convolutional stages are derived from the VGG-16 model. The feature maps from conv-3, conv-4 and conv-5 are followed by up-sampling layers, implemented as deconvolutional layers with different strides so that the resulting feature maps have the same size. The multi-scale spatial partition with 4 scales (1×1, 3×3, 5×5, 7×7) is applied in the feature-map space to efficiently generate features for 84 image blocks. After the spatial pyramid pooling layer with only one level (i.e., 6×6), the feature size of each block is (128 + 256 + 256) × 6 × 6. Finally, the 84 feature blocks together form a batch input to the classification sub-network for the final text/non-text block classification. The classification sub-network consists of three fully-connected layers. Naturally, if at least one block is classified as a text block, the whole image is treated as a text image.

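The block count and the per-block feature length quoted above follow directly from Table 1, as this short calculation shows. The constant names are illustrative only.

```python
SCALES = (1, 3, 5, 7)           # partition scales from Table 1
DECONV_MAPS = (128, 256, 256)   # output maps of deconv-3, deconv-4, deconv-5
POOL_BINS = 6                   # single-level 6x6 spatial pyramid pooling

num_blocks = sum(n * n for n in SCALES)       # 1 + 9 + 25 + 49
depth = sum(DECONV_MAPS)                      # depth of the concatenated maps
feature_len = depth * POOL_BINS * POOL_BINS   # length of each block descriptor

print(num_blocks, depth, feature_len)  # 84 640 23040
```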
Data preparation. We apply rotation and flipping operations to each training image, and randomly crop 10 image regions with the same aspect ratio for data augmentation. After that, all training image regions are resized to a fixed height (500 pixels). Since 4 different scales are used in the multi-scale spatial partition layer, the heights of the image blocks at the 4 partition scales are 500, 167, 100 and 71 pixels, respectively. Because $r_2$ (the minimal ratio of the text-line height to the image block height) is set to 0.5, an image block regarded as a text block must meet the minimal height values of 250, 83, 50, and 10 for the 4 partition scales.

Training details. We use stochastic gradient descent (SGD) to fine-tune MSP-Net, whose details are listed in Table 1, with the following parameters: the mini-batch size is 1 (due to the multi-scale spatial partition, the number of image blocks is 84), the learning rate is 1e-6 (divided by 10 after every 50K iterations), the momentum is 0.9, and the weight decay is 0.0002. Training takes about 10 hours on a single GPU (NVIDIA GTX TitanX). In the testing phase, an input image is also resized to the fixed height and fed into the trained network to output 84 block-level prediction results. Furthermore, MSP-Net is trained on the TextDis benchmark and then tested on all datasets.

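The step learning-rate schedule described above can be written as a small function. This is a sketch of the schedule only, assuming the drop happens exactly at multiples of 50K iterations; the function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(iteration, base_lr=1e-6, step=50_000, gamma=0.1):
    """Step schedule: divide the learning rate by 10 after every 50K iterations."""
    return base_lr * gamma ** (iteration // step)


for it in (0, 49_999, 50_000, 100_000):
    print(it, learning_rate(it))
```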
4.3. Comparison methods
Locality-constrained Linear Coding (LLC). LLC [25] is a useful coding method for image classification. In our experiments, we extract dense SIFT features at 3 different scales (8×8, 16×16, 24×24), and the size of the codebook clustered by k-means is set to 2048. Besides, the spatial pyramid matching is replaced by global max-pooling, which still achieves a comparable result.

Spatial Pyramid Pooling Network (SPP-Net). The spatial pyramid pooling layer proposed in [29] can generate fixed-size, hierarchical features for images or regions of arbitrary sizes, and achieves quite competitive performance on object detection and recognition. In our comparison experiments, SPP-Net adopts the same convolutional stages as our proposed method, and the pyramid levels are at 3 scales (1×1, 3×3, 5×5). However, the output of SPP-Net is an image-level classification, which is different from our method.

CNN Coding. In our previous work [32], we proposed a method that combines maximally stable extremal regions (MSER), a convolutional neural network (CNN) and bag of words (BoW) for text image discrimination. This method utilizes MSER to extract text candidates and feeds them into a trained CNN model to generate visual features; all features are then aggregated by BoW to obtain the final representation of the natural image. The same parameters as in [32] are used for this comparison experiment.

Among the above comparison methods, LLC and SPP-Net use only the image label,
while CNN Coding uses both the image label and text-line bounding box
information to classify an image. Therefore, the comparison between MSP-Net
and CNN Coding is fairer and more representative.
4.4. Experimental results
4.4.1. Experiments on TextDis benchmark
Table 2: Results of the different comparison methods. Precision, recall,
F-measure and time cost are reported.

| Methods | Precision | Recall | F-Measure | Time Cost |
| --- | --- | --- | --- | --- |
| LLC | 0.839 | 0.774 | 0.805 | 0.30s |
| SPP-Net | 0.841 | 0.839 | 0.840 | 0.16s |
| CNN Coding | 0.898 | 0.903 | 0.901 | 0.46s |
| MSP-Net | 0.937 | 0.954 | 0.946 | 0.13s |

In Table 2, the quantitative classification results of the different methods
on the TextDis benchmark are listed. The proposed method (MSP-Net) outperforms
CNN Coding by 3.9% in precision, 5.1% in recall and 4.5% in F-measure, and it
is more than 3 times faster than CNN Coding (0.13s versus 0.46s). The
comparison between MSP-Net and SPP-Net shows that it is hard to achieve
satisfactory performance if we directly use an existing convolutional network
framework for text/non-text image classification. To illustrate the
performance of MSP-Net intuitively, we also plot the precision-recall curves
of the different methods. Note that the MSP-Net outputs only the confidence
that each image block is a text block, so we use the maximum confidence over
all image blocks to approximate the score of the whole image being classified
as a text image. The curve of MSP-Net in Fig. 6 shows that our method keeps
rather high precision even in the high-recall range.
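The image-level score used for the precision-recall curves is the maximum of the block-level confidences, which matches the rule that an image is a text image as long as one block is predicted as text. A minimal sketch follows; the function names and the 0.5 threshold are assumptions made for illustration, and the confidence values are invented.

```python
def image_score(block_confidences):
    """Approximate the image-level text score by the most confident block."""
    if not block_confidences:
        raise ValueError("at least one block confidence is required")
    return max(block_confidences)


def is_text_image(block_confidences, threshold=0.5):
    """An image counts as a text image if any block reaches the threshold."""
    return image_score(block_confidences) >= threshold


# Toy example with 5 block confidences instead of 84.
confidences = [0.02, 0.10, 0.91, 0.33, 0.05]
print(image_score(confidences), is_text_image(confidences))  # 0.91 True
```

Sweeping `threshold` over the range of scores would trace out a precision-recall curve of the kind shown in Fig. 6.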
In addition, an important advantage of our proposed method is that text
blocks indicate the coarse position and scale of the text that appears in a
text image. To show this advantage, we keep all pixels of text blocks and
remove all non-text blocks. As shown in Fig. 7, text images are successfully
classified, and their candidate text blocks, highlighted with red bounding
boxes in the second row, are kept. Meanwhile, the majority of the text in text
images is kept, and the scale (or height) of a text line is comparable to the
height of the block to which it belongs. Unlike the other comparison methods,
which obtain only an image-level confidence for a text image, our method can
provide richer information.
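Keeping the pixels of text blocks and removing non-text blocks, as done for the visualizations in Fig. 7, can be sketched for a single n×n partition as below. The sketch assumes equal, non-overlapping grid cells and a 2-D list as the image; the helper names, the fill value, and the toy image are hypothetical.

```python
def block_bounds(index, n, size):
    """Pixel range [start, stop) of cell `index` when `size` pixels form n cells."""
    return index * size // n, (index + 1) * size // n


def keep_text_blocks(image, n, text_cells, fill=0):
    """Return a copy of `image` in which only the cells listed in `text_cells`
    ((row, column) pairs on an n x n grid) keep their pixels."""
    height, width = len(image), len(image[0])
    masked = [[fill] * width for _ in range(height)]
    for row, col in text_cells:
        top, bottom = block_bounds(row, n, height)
        left, right = block_bounds(col, n, width)
        for y in range(top, bottom):
            masked[y][left:right] = image[y][left:right]
    return masked


image = [[1] * 6 for _ in range(6)]  # toy 6 x 6 image of ones
masked = keep_text_blocks(image, n=3, text_cells=[(0, 0), (1, 2)])
for row in masked:
    print(row)
```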
4.4.2. Experiments on ICDAR2003 dataset
The ICDAR2003 dataset is a publicly available scene text dataset in which the
text is focused. We test our proposed method on ICDAR2003 to show that it
works well on focused text images. To obtain an intuitive and fair comparison
with the methods proposed in [10, 11], we use the classification rate and
the average processing time (APT) as the metrics.
Figure 6: The precision-recall curves of comparison methods.
The results of the different methods are listed in Table 3, which shows that
our method outperforms the video text frame classification methods [10, 11].
Moreover, the average processing time of MSP-Net is much lower (0.132s versus
1.23s for [11]). Some examples from the ICDAR2003 dataset are shown in Fig. 8.
4.4.3. Experiments on Hua’s dataset
To examine how well our proposed method generalizes to video frames, we
test it on Hua’s dataset. The same metrics used in Sec. 4.4.2 are used to
evaluate the performance of the different methods. The results in Table 4 show
that our method obtains the highest classification rates. Moreover, its
average processing time (APT) per frame is much lower than that of the other
two methods [11, 10], which are specially designed for text frame
classification.
Figure 7: Classification results on the TextDis benchmark. (a) Samples of text
images from the TextDis benchmark; (b) red bounding boxes mark the text blocks
detected by MSP-Net; (c) all pixels of the text blocks are kept.
Table 3: Classification rates of the proposed method and existing methods on
ICDAR2003.

| Methods | Text(%) | Error(%) | APT |
| --- | --- | --- | --- |
| Proposed method | 89.2 | 10.8 | 0.132s |
| Shivakumara et al. [11] | 80.97 | 19.03 | 1.23s |
| Shivakumara et al. [10] | 81.12 | 18.88 | N.A |
In Fig. 9, we show some results of our method on Hua’s dataset. Most text in
Hua’s dataset is in the form of captions, which are easily captured, for
example in the video frames in the first, second and third columns of Fig. 9.
In addition, some scene text in video frames is also well captured by our
proposed method, as in the video frames in the fourth and fifth columns of
Fig. 9.
Figure 8: Classification results on the ICDAR2003 dataset. (a) Samples of text
images from the ICDAR2003 dataset; (b) red bounding boxes mark the text blocks
detected by MSP-Net; (c) all pixels of the text blocks are kept.

4.5. Discussion
4.5.1. Effect of feature combination
In our proposed method, features from different convolutional layers are
concatenated after up-sampling to generate richer and more hierarchical
features. To examine the effect of different groups of feature concatenation,
we change which convolutional layers contribute feature maps and keep the
other settings of the network fixed. Table 5 lists three settings of feature
concatenation and their performance on the TextDis benchmark. The comparison
between Variant-1 and Variant-2 (or MSP-Net) demonstrates that feature maps
representing information at different levels can be concatenated to form a
rich and hierarchical representation for text/non-text images. The more
feature maps from different convolutional stages are concatenated, the better
the final performance: the F-measure rises from 0.905 (conv-5 only) to 0.936
(conv-4 + conv-5) and 0.946 (conv-3 + conv-4 + conv-5). Since the feature maps
at the conv-1 and conv-2 stages are large, and concatenating them would
require more memory and time, we do not use feature maps from these two
convolutional stages.
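The idea of concatenating feature maps from several convolutional stages after up-sampling can be sketched as follows. The sketch assumes nearest-neighbour up-sampling and a factor of 2 between consecutive stages; both are illustrative assumptions, since this section does not specify the up-sampling operator, and all names and values are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour up-sampling of a 2-D map by an integer factor."""
    return [[value for value in row for _ in range(factor)]
            for row in fmap for _ in range(factor)]


def concat_channels(*feature_stacks):
    """Concatenate lists of equally sized 2-D maps along the channel axis."""
    channels = [fmap for stack in feature_stacks for fmap in stack]
    shapes = {(len(f), len(f[0])) for f in channels}
    if len(shapes) != 1:
        raise ValueError("all feature maps must share one spatial size")
    return channels


conv3 = [[[1, 2, 3, 4] for _ in range(4)]]  # 1 channel, 4 x 4
conv4 = [[[5, 6], [7, 8]]]                  # 1 channel, 2 x 2
conv5 = [[[9]]]                             # 1 channel, 1 x 1

combined = concat_channels(
    conv3,
    [upsample_nearest(f, 2) for f in conv4],
    [upsample_nearest(f, 4) for f in conv5],
)
print(len(combined), len(combined[0]), len(combined[0][0]))  # 3 4 4
```

Dropping `conv3` from the call corresponds to the Variant-2 setting, and keeping only `conv5` corresponds to Variant-1.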
Table 4: Classification rates of the proposed method and existing methods on
Hua’s dataset.

| Methods | Text(%) | Non-text(%) | APT |
| --- | --- | --- | --- |
| Proposed method | 100 | 100 | 0.127s |
| Shivakumara et al. [11] | 97.62 | 100 | 1.05s |
| Shivakumara et al. [10] | 75.54 | 24.46 | 2.04s |

Figure 9: Classification results on Hua’s dataset. (a) Samples of text images
from Hua’s dataset; (b) red bounding boxes mark the text blocks detected by
MSP-Net; (c) all pixels of the text blocks are kept.

4.5.2. Effect of multiple scales for spatial partition
Because natural text varies greatly, especially in scale and area, we
demonstrate the importance of multi-scale spatial partition through comparison
experiments with several single-layer (single-scale) spatial partitions. In
practice, we change only the spatial partition layer, varying the number and
scales of its partitions, and keep the configuration of the other layers
unchanged. As shown in Table 6, multi-scale spatial partition outperforms
every single-scale spatial partition. Although the single-layer 7×7 partition
already achieves a considerable result (F-measure 0.922), the multi-scale
partition brings an obvious improvement (F-measure 0.946). These comparisons
suggest that a convolutional neural network can learn richer and more
discriminative features for text block discrimination if the range of text
scales is proper.
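The multi-scale spatial partition with scales 1×1, 3×3, 5×5 and 7×7 produces 1 + 9 + 25 + 49 = 84 image blocks, which is the number of block-level predictions mentioned in the training details. The sketch below enumerates these blocks under the assumption of equal, non-overlapping grid cells; the function name and the 700×500 image size are hypothetical.

```python
def multi_scale_blocks(width, height, scales=(1, 3, 5, 7)):
    """Return (scale, left, top, right, bottom) for every cell of every n x n grid."""
    blocks = []
    for n in scales:
        for row in range(n):
            for col in range(n):
                blocks.append((n,
                               col * width // n, row * height // n,
                               (col + 1) * width // n, (row + 1) * height // n))
    return blocks


blocks = multi_scale_blocks(width=700, height=500)
print(len(blocks))  # 84
```

Passing a single scale, for example `scales=(7,)`, corresponds to one of the single-scale settings compared in Table 6.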
Table 5: Results of different settings of feature combination. Variant-1 uses
only the feature maps from the 5th convolutional stage, and Variant-2 combines
the feature maps from the 4th and 5th stages.

| Variants | Settings | Precision | Recall | F-Measure | Time Cost |
| --- | --- | --- | --- | --- | --- |
| Variant-1 | conv-5 | 0.915 | 0.890 | 0.905 | 0.106s |
| Variant-2 | conv-4 + conv-5 | 0.924 | 0.945 | 0.936 | 0.118s |
| MSP-Net | conv-3 + conv-4 + conv-5 | 0.937 | 0.954 | 0.946 | 0.130s |

Table 6: Effect of multiple scales for spatial partition.

| Scale | Precision | Recall | F-Measure |
| --- | --- | --- | --- |
| 1×1 | 0.825 | 0.819 | 0.822 |
| 3×3 | 0.870 | 0.864 | 0.867 |
| 5×5 | 0.892 | 0.921 | 0.906 |
| 7×7 | 0.931 | 0.914 | 0.922 |
| 1×1, 3×3, 5×5, 7×7 | 0.937 | 0.954 | 0.946 |
4.5.3. Comparing with text detection methods
Table 7: Classifying text/non-text images on the TextDis benchmark with
different text detection methods.

| Methods | Precision | Recall | F-Measure |
| --- | --- | --- | --- |
| MSP-Net | 0.937 | 0.954 | 0.946 |
| Zhang et al. [40] | 0.754 | 0.979 | 0.851 |
| Yao et al. [6] | 0.808 | 0.902 | 0.853 |
| Neumann et al. [20] | 0.525 | 0.984 | 0.685 |
In this section, we compare MSP-Net with some existing natural text detection
methods on the task of classifying text/non-text images, which shows the
effectiveness and efficiency of our proposed method. Similar to the
classification mechanism of MSP-Net, a text detection method classifies a
natural image as a text image as long as one text line is detected in it. The
results of the different text detection methods on the TextDis benchmark are
listed in Table 7. The MSP-Net obtains the highest precision and F-measure;
two of the text detection methods [40, 20] reach a higher recall (0.979 and
0.984), but at a much lower precision. The MSP-Net also takes the least time
(0.13s, versus 0.94s to 5.00s for the detection methods in Table 8).
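The F-measure values in Table 7 follow from precision and recall as their harmonic mean, and a text detector can be turned into an image classifier with the rule described above. A minimal sketch, with hypothetical function names; recomputed values may differ from the table in the last digit because the table entries are rounded.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def detector_says_text(detected_text_lines):
    """A detector labels an image as text if it returns at least one text line."""
    return len(detected_text_lines) > 0


# Precision and recall pairs copied from Table 7.
table_7 = {
    "MSP-Net": (0.937, 0.954),
    "Zhang et al. [40]": (0.754, 0.979),
    "Yao et al. [6]": (0.808, 0.902),
    "Neumann et al. [20]": (0.525, 0.984),
}
for name, (p, r) in table_7.items():
    print(f"{name}: {f_measure(p, r):.3f}")
```

The example makes the trade-off visible: a very high recall combined with a low precision still gives a low F-measure.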
Table 8: Time cost of Only Text Detection versus MSP-Net + Text Detection on
the TextDis benchmark.

| Methods | Only Text Detection | MSP-Net + Text Detection |
| --- | --- | --- |
| Zhang et al. [40] | 2.10s | 0.85s |
| Yao et al. [6] | 5.00s | 2.10s |
| Neumann et al. [20] | 0.94s | 0.46s |
In addition, we find an interesting phenomenon: the time cost of text
detection is greatly reduced if we first use the MSP-Net to eliminate the
non-text images. As shown in Table 8, the speeds of the text detection methods
on the TextDis benchmark are more than doubled.
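The speed-up reported in Table 8 can be checked with a few lines, and the pre-filtering idea itself amounts to running the detector only on images that the classifier keeps. In the sketch below the timings are copied from Table 8, while `detect_with_prefilter`, `is_text_image` and `detect_text` are hypothetical names standing in for the MSP-Net classifier and any text detector.

```python
# (only text detection, MSP-Net + text detection) in seconds, from Table 8.
TABLE_8 = {
    "Zhang et al. [40]": (2.10, 0.85),
    "Yao et al. [6]": (5.00, 2.10),
    "Neumann et al. [20]": (0.94, 0.46),
}


def speedups(timings):
    """Ratio of detection-only time to time with MSP-Net pre-filtering."""
    return {name: only / filtered for name, (only, filtered) in timings.items()}


def detect_with_prefilter(images, is_text_image, detect_text):
    """Run the slow detector only on images that the fast classifier keeps."""
    return {name: detect_text(image)
            for name, image in images.items() if is_text_image(image)}


for name, ratio in speedups(TABLE_8).items():
    print(f"{name}: {ratio:.2f}x")
```

All three ratios are above 2, in line with the statement that the speeds are more than doubled.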
4.6. Limitations of the proposed method
While our proposed method outperforms the other compared methods, some
failure cases remain. Text in difficult natural conditions may be
misclassified by our proposed method. For example, the text in Fig. 10(a) is
under low illumination, while the text in Fig. 10(b) is over-exposed. Some
regular curves, bricks or windows, as in Fig. 10(c) and Fig. 10(d), resemble
text and produce false positives. With the rigid spatial partition, the
majority of the text is kept after text/non-text block classification, but the
remaining text is sometimes fragmented when some text blocks are
misclassified, as shown in Fig. 10(e)(f). In addition, the proposed method is
based on a convolutional neural network, and therefore its speed depends on
a GPU.
Figure 10: Some failure cases. (a), (b) Text images in difficult conditions.
(c), (d) Some curves or objects are similar to text. (e), (f) Some true text
blocks are eliminated, which leaves the remaining text fragmented.

5. Conclusion
In this paper, we have proposed a novel convolutional neural network
architecture, named MSP-Net, for text/non-text image classification. The
MSP-Net takes a whole image as input and outputs block-level classification
results in an end-to-end manner. The results on several datasets demonstrate
the robustness and effectiveness of our proposed method. In addition, an image
block classified as a text block coarsely indicates the scale and position of
text, which is helpful for scene text reading. Combining text/non-text image
classification with a scene text reading system, in order to mine scene text
semantics from large-scale images and videos on the Internet, is worth
exploring in our future work.
6. Acknowledgements
This work was supported by National Natural Science Foundation of China
(NSFC) No. 61222308 and No. 61573160, and in part by Program for New
Century Excellent Talents in University (No. NCET-12-0217).
References
[1] Y. Zhu, C. Yao, X. Bai, Scene text detection and recognition: recent advances and future trends, Frontiers of Computer Science 10 (1) (2016) 19–36.
[2] Y. Y. Tang, S.-W. Lee, C. Y. Suen, Automatic document processing: a survey, Pattern Recognition 29 (12) (1996) 1931–1952.
[3] M. Khayyat, L. Lam, C. Y. Suen, Learning-based word spotting system for arabic handwritten documents, Pattern Recognition 47 (3) (2014) 1021–1030.
[4] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Reading text in the wild with convolutional neural networks, International Journal of Computer Vision 116 (1) (2016) 1–20.
[5] Z. Zhang, W. Shen, C. Yao, X. Bai, Symmetry-based text line detection in natural scenes, in: Proc. of CVPR, 2015, pp. 2558–2567.
[6] C. Yao, X. Bai, W. Liu, A unified framework for multioriented text detection and recognition, IEEE Transactions on Image Processing 23 (11) (2014) 4737–4749.
[7] X. C. Yin, X. Yin, K. Huang, H. Hao, Robust text detection in natural scene images, IEEE Transactions on PAMI 36 (5) (2014) 970–983.
[8] H. Hase, T. Shinokawa, M. Yoneda, C. Y. Suen, Character string extraction from color documents, Pattern Recognition 34 (7) (2001) 1349–1365.
[9] B. Shi, X. Bai, C. Yao, Script identification in the wild via discriminative convolutional neural network, Pattern Recognition 52 (2016) 448–458.
[10] P. Shivakumara, A. Dutta, T. Q. Phan, C. L. Tan, U. Pal, A novel mutual nearest neighbor based symmetry for text frame classification in video, Pattern Recognition 44 (8) (2011) 1671–1683.
[11] N. Sharma, P. Shivakumara, U. Pal, M. Blumenstein, C. L. Tan, Piece-wise linearity based method for text frame classification in video, Pattern Recognition 48 (3) (2015) 862–881.
[12] E. Indermuhle, H. Bunke, F. Shafait, T. Breuel, Text versus non-text distinction in online handwritten documents, in: Proc. of SAC, 2010, pp. 3–7.
[13] A. Delaye, C.-L. Liu, Text/non-text classification in online handwritten documents with conditional random fields, Pattern Recognition 321 (2012) 514–521.
[14] A. Delaye, C. Liu, Contextual text/non-text stroke classification in online handwritten notes with conditional random fields, Pattern Recognition 47 (3) (2014) 959–968.
[15] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
[16] P. R. Cavalin, R. Sabourin, C. Y. Suen, A. S. Britto Jr, Evaluation of incremental learning algorithms for hmm in the recognition of alphanumeric characters, Pattern Recognition 42 (12) (2009) 3241–3253.
[17] X. Bai, C. Yao, W. Liu, Strokelets: A learned multi-scale mid-level representation for scene text recognition, IEEE Transactions on Image Processing 25 (6) (2016) 2789–2802.
[18] X.-X. Niu, C. Y. Suen, A novel hybrid cnn–svm classifier for recognizing handwritten digits, Pattern Recognition 45 (4) (2012) 1318–1325.
[19] B. Epshtein, E. Ofek, Y. Wexler, Detecting text in natural scenes with stroke width transform, in: Proc. of CVPR, IEEE, 2010, pp. 2963–2970.
[20] L. Neumann, J. Matas, A method for text localization and recognition in real-world images, in: Proc. of ACCV, 2010, pp. 770–783.
[21] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proc. of CVPR, Vol. 1, 2005, pp. 886–893.
[22] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[23] T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with classification based on featured distributions, Pattern Recognition 29 (1) (1996) 51–59.
[24] H. Jégou, M. Douze, C. Schmid, P. Pérez, Aggregating local descriptors into a compact image representation, in: Proc. of CVPR, 2010, pp. 3304–3311.
[25] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: Proc. of CVPR, 2010, pp. 3360–3367.
[26] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proc. of ICLR, 2015.
[27] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proc. of CVPR, 2014, pp. 580–587.
[28] W. Shen, X. Wang, Y. Wang, X. Bai, Z. Zhang, Deepcontour: A deep convolutional feature learned by positive-sharing loss for contour detection, in: Proc. of CVPR, 2015, pp. 3982–3991.
[29] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, in: Proc. of ECCV, 2014, pp. 346–361.
[30] N. G. Alessi, S. Battiato, G. Gallo, M. Mancuso, F. Stanco, Automatic discrimination of text images, in: Proc. of SPIE, 2003, pp. 351–359.
[31] V. Vidya, T. R. Indhu, V. K. Bhadran, Classification of handwritten document image into text and non-text regions, in: Proc. of ICSIP, 2012, pp. 103–112.
[32] C. Zhang, C. Yao, B. Shi, X. Bai, Automatic discrimination of text and non-text natural images, in: Proc. of ICDAR, 2015, pp. 886–890.
[33] R. Girshick, Fast r-cnn, in: Proc. of ICCV, 2015, pp. 1440–1448.
[34] B. Hariharan, P. Arbeláez, R. Girshick, J. Malik, Hypercolumns for object segmentation and fine-grained localization, in: Proc. of CVPR, 2015, pp. 447–456.
[35] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proc. of CVPR, 2015, pp. 3431–3440.
[36] S. Xie, Z. Tu, Holistically-nested edge detection, in: Proc. of ICCV, 2015, pp. 1395–1403.
[37] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[38] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, Icdar 2003 robust reading competitions, in: Proc. of ICDAR, 2003, p. 682.
[39] X.-S. Hua, L. Wenyin, H.-J. Zhang, An automatic performance evaluation protocol for video text detection algorithms, IEEE Transactions on Circuits and Systems for Video Technology 14 (4) (2004) 498–507.
[40] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, X. Bai, Multi-oriented text detection with fully convolutional networks, in: Proc. of CVPR, 2016.
... This approach requires working on vertical or curved text lines. Bai et al. [27] proposed a novel CNN variant for the problem of text/non-text image classification called Multi-scale Spatial Partition Network (MSP-Net). Wang et al. [28] recently used an optical flow-based approach to optimize text positions in corresponding frames. ...
Article
Full-text available
Detecting and recognizing text in natural scene videos and images has brought more attention to computer vision researchers due to applications like robotic navigation and traffic sign detection. In addition, Optical Character Recognition (OCR) technology is applied to detect and recognize text on the license plate. It will be used in various commercial applications such as finding stolen cars, calculating parking fees, invoicing tolls, or controlling access to safety zones and aids in detecting fraud and secure data transactions in the banking industry. Much effort is required when scene text videos are in low contrast and motion blur with arbitrary orientations. Presently, text detection and recognition approaches are limited to static images like horizontal or approximately horizontal text. Detecting and recognizing text in videos with data dynamicity is more challenging because of the presence of multiple blurs caused by defocusing, motion, illumination changes, arbitrarily shaped, and occlusion. Thus, we proposed a combined DeepEAST (Deep Efficient and Accurate Scene Text Detector) and Keras OCR model to overcome these challenges in the proffered DEFUSE (Deep Fused) work. This two-combined technique detects the text regions and then deciphers the result into a machine-readable format. The proposed method has experimented with three different video datasets such as ICDAR 2015, Road Text 1K, and own video Datasets. Our results proved to be more effective with precision, recall, and F1-Score.
... This approach requires working on vertical or curved text lines. Bai et al. [27] proposed a novel CNN variant for the problem of text/non-text image classification called Multi-scale Spatial Partition Network (MSP-Net). Wang et al. [28] recently used an optical flow-based approach to optimize text positions in corresponding frames. ...
Article
Detecting and recognizing text in natural scene videos and images has brought more attention to computer vision researchers due to applications like robotic navigation and traffic sign detection. In addition, Optical Character Recognition (OCR) technology is applied to detect and recognize text on the license plate. It will be used in various commercial applications such as finding stolen cars, calculating parking fees, invoicing tolls, or controlling access to safety zones and aids in detecting fraud and secure data transactions in the banking industry. Much effort is required when scene text videos are in low contrast and motion blur with arbitrary orientations. Presently, text detection and recognition approaches are limited to static images like horizontal or approximately horizontal text. Detecting and recognizing text in videos with data dynamicity is more challenging because of the presence of multiple blurs caused by defocusing, motion, illumination changes, arbitrarily shaped, and occlusion. Thus, we proposed a combined DeepEAST (Deep Efficient and Accurate Scene Text Detector) and Keras OCR model to overcome these challenges in the proffered DEFUSE (Deep Fused) work. This two combined technique detects the text regions and then deciphers the result into a machine readable format. The proposed method has experimented with three different video datasets such as ICDAR 2015, Road Text 1K, and own video Datasets. Our results proved to be more effective with precision, recall, and F1-Score.
... As a more realistic application, we consider the classification of character and non-character. It is known that this classification task is a part of scene text detection task and still a tough recognition task (e.g., [38]) because of the ambiguity between characters and non-characters. In this experiment, as a scene character image dataset, we use Chars74k dataset 5 , which contains large number of character images in the natural scene. ...
Preprint
Full-text available
In this paper, we propose an optimal rejection method for rejecting ambiguous samples by a rejection function. This rejection function is trained together with a classification function under the framework of Learning-with-Rejection (LwR). The highlights of LwR are: (1) the rejection strategy is not heuristic but has a strong background from a machine learning theory, and (2) the rejection function can be trained on an arbitrary feature space which is different from the feature space for classification. The latter suggests we can choose a feature space that is more suitable for rejection. Although the past research on LwR focused only on its theoretical aspect, we propose to utilize LwR for practical pattern classification tasks. Moreover, we propose to use features from different CNN layers for classification and rejection. Our extensive experiments of notMNIST classification and character/non-character classification demonstrate that the proposed method achieves better performance than traditional rejection strategies.
... They used a multi-layer perceptron (MLP) cascade classifier to separate text from non-text. Bai et al. [6] introduced a new type of cyclic neural network called multiscale spatial partition network (MSP-Net). This network classifies text/non-text images with different scales. ...
Article
Full-text available
Scene text detection and recognition have been given a lot of attention in recent years and have been used in many vision-based applications. In this field, there are various types of challenges, including images with wavy text, images with text rotation and orientation, changing the scale and variety of text fonts, noisy images, wild background images, which make the detection and recognition of text from the image more complex and difficult. In this article, we first presented a comprehensive review of recent advances in text detection and recognition and described the advantages and disadvantages. The common datasets were introduced. Then, the recent methods compared together and analyzed the text detection and recognition systems. According to the recent decade studies, one of the most important challenges is curved and vertical text detection in this field. We have expressed approaches for the development of the detection and recognition system. Also, we have described the methods that are robust in the detection and recognition of curved and vertical texts. Finally, we have presented some approaches to develop text detection and recognition systems as the future work.
... Convolutional Neural Networks are a popular deep learning model that has achieved state-of-the-art performance in many different tasks [130][131][132][133]. A convolutional layer is the major building block of a CNN architecture where the convolution operation is performed. ...
Article
Full-text available
This paper presents a critical approach to the non-intrusive load monitoring (NILM) problem, by thoroughly reviewing the experimental framework of both legacy and state-of-the-art studies. Some of the most widely used NILM datasets are presented and their characteristics, such as sampling rate and measurements availability are presented and correlated with the performance of NILM algorithms. Feature engineering approaches are analyzed, comparing the hand-made with the automatic feature extraction process, in terms of complexity and efficiency. The evolution of the learning approaches through time is presented, making an effort to assess the contribution of the latest state-of-the-art deep learning models to the problem. Performance evaluation methods and evaluation metrics are demonstrated and it is attempted to define the necessary requirements for the conduction of fair evaluation across different methods and datasets. NILM limitations are highlighted and future research directions are suggested.
... The font style that can recognize Chinese characters is of great significance to text-related work such as text recognition, artistic font style design, and handwriting identification [1]. For readability and aesthetics considerations, multiple font styles are used when formatting articles. ...
Article
Full-text available
Chinese characters have been created into many font styles in the long history, such as official script, running script, regular script and other standard computer font styles. Some famous calligraphers such as Ouyang Xun and Yan Zhenqing have produced many beautiful calligraphic works. Being able to detect and recognize these font styles quickly and accurately has many important applications in different use cases. In this paper, we present a sword-like model based on convolutional neural network with a sword structure to recognize font styles for Chinese characters. This model includes 15 convolutional layers. For each layer, we gradually increase the number of convolutional kernels to better extract the classification features of the input image.We use 4 downsampling layers in the model. For each downsampling operation, the length and width of the image become half of their original values while the number of channels gradually increases, leading to a sword-like shape. As a result, we name our model as SwordNet. We also created a Chinese font dataset called Nankai Chinese Font Style dataset, and made it available on Github. Using the above dataset, we compared the accuracy of our model with six other state-of-the-art network models. The experiments showed that SwordNet can achieve an average recognition accuracy of 99.03% in multiple experiments, while the other six models can only achieve accurary up to 94.91%. We concluded that SwordNet can perform better in font style recognition than other models.
Article
Land use/land cover classification of remote sensing images provide information to take efficient decisions related to resource monitoring. There exists several algorithms for remote sensing image classification. In the recent years, Deep learning models like convolution neural networks (CNNs) are widely used for remote sensing image classification. The learning and generalization ability of CNN, results in better performance in comparison with similar type of models. The functional behavior of CNNs is unexplainable because of its multiple layers of convolution and pooling operations. This results in black box characteristics of CNNs. Motivated with this factor, a CNN model with functional transparency is proposed in the present study. The model is named as Knowledge Based Morphological Deep Transparent Neural Networks (KBMDTNN) for remote sensing image classification. The architecture of KBMDTNN model provides functional transparency due to application of morphological operators, convolutional and pooling layers, and transparent neural network. In KBMDTNN model, the morphological operator preserve the shape/size information of the objects through efficient image segmentation. Convolution and pooling layers are used to produce minimal number of features from the image. The operational transparency of proposed model is coined based on the mathematical understanding of each layer in the model instead of randomly adding layers to the architecture of model. The transparency of proposed model is also because of assigning the initial weights of NN in output layer of model with computed values instead of random values. The proposed KBMDTNN model outperformed similar type of models as tested with multispectral and hyperspectral remote sensing images. 
The performance of KBMDTNN model is evaluated with the metrics like overall accuracy (OA), overall accuracy standard deviation ($OA_{\text{STD}}$), producer’s accuracy (PA), user’s accuracy (UA), dispersion score (DS), and kappa coefficient (KC).
Article
Full-text available
We develop a new edge detection algorithm that addresses two important issues in this long-standing vision problem: (1) holistic image training and prediction; and (2) multi-scale and multi-level feature learning. Our proposed method, holistically-nested edge detection (HED), performs image-to-image prediction by means of a deep learning model that leverages fully convolutional neural networks and deeply-supervised nets. HED automatically learns rich hierarchical representations (guided by deep supervision on side responses) that are important in order to resolve the challenging ambiguity in edge and object boundary detection. We significantly advance the state-of-the-art on the BSDS500 dataset (ODS F-score of 0.790) and the NYU Depth dataset (ODS F-score of 0.746), and do so with an improved speed (0.4 s per image) that is orders of magnitude faster than some CNN-based edge detection algorithms developed before HED. We also observe encouraging results on other boundary detection benchmark datasets such as Multicue and PASCAL-Context.
Conference Paper
Full-text available
Article
Full-text available
In this paper, we propose a novel approach for text detec- tion in natural images. Both local and global cues are taken into account for localizing text lines in a coarse-to-fine pro- cedure. First, a Fully Convolutional Network (FCN) model is trained to predict the salient map of text regions in a holistic manner. Then, text line hypotheses are estimated by combining the salient map and character components. Fi- nally, another FCN classifier is used to predict the centroid of each character, in order to remove the false hypotheses. The framework is general for handling text in multiple ori- entations, languages and fonts. The proposed method con- sistently achieves the state-of-the-art performance on three text detection benchmarks: MSRA-TD500, ICDAR2015 and ICDAR2013.
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Article
In this paper, we are concerned with the problem of automatic scene text recognition, which involves localizing and reading characters in natural images. We investigate this problem from the perspective of representation and propose a novel multi-scale representation, which leads to accurate, robust character identification and recognition. This representation consists of a set of mid-level primitives, termed strokelets, which capture the underlying substructures of characters at different granularities. The strokelets possess four distinctive advantages: 1) usability: automatically learned from character level annotations; 2) robustness: insensitive to interference factors; 3) generality: applicable to various languages; and 4) expressivity: effective at describing characters. Extensive experiments on standard benchmarks verify the advantages of the strokelets and demonstrate the effectiveness of the text recognition algorithm built upon the strokelets. Moreover, we show how to incorporate the strokelets to improve the performance of scene text detection.