Detecting text in natural scene images with conditional
clustering and convolution neural network
Anna Zhu,a Guoyou Wang,a,* Yangbo Dong,a and Brian Kenji Iwanab
aHuazhong University of Science and Technology, State Key Lab for Multispectral Information Processing Technology, School of Automation,
1037 Luoyu Road, Wuhan 430074, China
bKyushu University, Human Interface Laboratory, Information Science and Electrical Engineering, Nishi-ku, Fukuoka-shi 8190395, Japan
Abstract. We present a robust method of detecting text in natural scenes. The work consists of four parts. First, we automatically partition the images into different layers based on conditional clustering. The clustering operates in two sequential ways: one has a constrained clustering center and a conditionally determined number of clusters, which generates small-size subregions; the other has a fixed number of clusters, which generates full-size subregions. After the clustering, we obtain a set of connected components (CCs) in each subregion. In the second step, a convolutional neural network (CNN) is used to classify those CCs as character or noncharacter components. The output score of the CNN can be transferred to the posterior probability of characters. Then we group the candidate characters into text strings based on probability and location. Finally, we use a verification step. We choose a multichannel strategy to evaluate the performance on the public datasets ICDAR2011 and ICDAR2013. The experimental results demonstrate that our algorithm achieves superior performance compared with state-of-the-art text detection algorithms. © 2015 SPIE and IS&T [DOI: 10.1117/1.JEI.24.5.053019]
Keywords: scene text detection; conditional clustering; convolutional neural network; character posterior probability.
Paper 15180 received Mar. 18, 2015; accepted for publication Sep. 8, 2015; published online Sep. 29, 2015.
1 Introduction
Human eyes are sensitive to important information in natural
scene images. Text is intuitive and contains vital information
that may provide the location, meaning, instructions, etc., of
scenes. In the human-computer interaction field, correctly detecting text in natural scene images is the primary task preceding text recognition or optical character recognition (OCR). It answers the question, "where is the text?" However, detecting text in natural scene images has many difficulties,1,2 for instance, complex backgrounds, geometrical distortions, and various text orientations and appearances. A lot of work has been done in this area; for instance, yearly competitions and conferences3–5 are held to solve these problems and improve accuracy and computational efficiency.
Text detection involves generating candidate bounding
boxes that are likely to contain lines of text. Based on the
way bounding boxes are generated, we classify text detection algorithms into three categories: generating text regions from connected component groups, from region-based patch groups, or from a hybrid of the two. The first uses distinctive features at the connected component (CC) level to find the candidate characters. It groups spatial pixels into CCs conditionally and then groups the characters into words. Yi and Tian6 designed a gradient-based partition and a color-based
partition algorithm using local gradient features and uniform
colors of text characters to obtain a set of CCs; then heuristic
grouping and structural analysis of text strings were per-
formed to distinguish text and noise. The maximally stable
extremal region (MSER) algorithm7was adopted to extract
components, then a trained AdaBoost classifier was used to
classify text and nontext parts. Shivakumara et al.8 performed K-means clustering in the Fourier-Laplacian domain
to extract CCs and used text straightness and edge density to
remove false positives. Because these kinds of methods
extract text on the pixel level, the detected result can be
used directly for text recognition.
The second category of methods attempts to classify
regional-based patches by textual features. The patches con-
taining text are merged to the text blocks. Usually, this
method uses a multiscale strategy and sliding windows.
Shivakumara et al.9used six different gradient edge features
(mean, standard deviation, energy, entropy, inertia, and local
homogeneity) over image blocks to capture the texture prop-
erty of the candidate text regions. Wang et al.10 classified
patches by using a random ferns' classifier trained on HOG
features in a sliding window. Wang et al.11 showed that a
deep feature based convolutional neural network (CNN) is
effective for character patch classification. These kinds of
methods treat text detection as an object detection task. They can capture the interrelationship of the text, which enables accurate detection even in noisy images.
The third category combines the advantages of the above
two methods. In the work of Pan et al.,12 a region-based
method first estimated the confidence of detecting the
existing text and then filtered out nontext components by
a conditional random field. Huang et al.13 used MSERs
by incorporating a text saliency map generated by a strong
CNN classifier to efficiently prune the trees of extremal
regions. These hybrid methods first roughly select candidate
text regions with one category and then use other strategies to
confirm the confidence of text regions and filter out nontext regions.
*Address all correspondence to: Guoyou Wang, E-mail: email@example.com
1017-9909/2015/$25.00 © 2015 SPIE and IS&T
Journal of Electronic Imaging 053019-1 Sep∕Oct 2015 •Vol. 24(5)
Journal of Electronic Imaging 24(5), 053019 (Sep∕Oct 2015)
Our framework operates in four steps which can generate
high detection precision and recall. In the first stage, images
are partitioned based on conditional clustering. A lot of CCs
are generated in the subimages. Those CCs in different sub-
regions have different relations. In the second stage, a CNN
is trained to classify all of the CCs. Based on the output
scores of candidate characters and the position relationship
in each subregion, we group the candidate characters into
words. Finally, a verification step is given to classify candi-
date words to text or nontext. The whole procedure begins
with the entire image, continues to the individual parts, and
then to integration. We consider the text as an integral struc-
ture, which is opposite to the MSER method. To get a high
recall, we also extract CCs that exhibit nontext features in the partition step. After classification by the CNN, we group the candidate characters into text based on the former two
steps. The main contribution of our work is threefold.
First, in the image partition stage, the whole image is parti-
tioned in two ways. One is with the automatically computed
number of clusters, which generates many text candidate
subregions. The second is processed in the rest regions by
K-means with a fixed number of clusters. This two-way clus-
tering can avoid missing text candidates and also keep the
relationship of the detected text candidates in subregions.
Second, in the grouping stage, we use the posterior probability obtained by the CNN and the position obtained in the image partition stage of each CC to group them into words.
Thus, the combined probability can also aid in filtering
out some nontext regions. Third, in the framework, we ana-
lyze both the inter- and the intra-features among the text and
nontext. We consider the text as a whole and analyze the fea-
tures of individual parts in the text and integrate the individual parts in the last step. The experiment on three datasets shows
that our method can keep a high precision and also a high
recall in the text detection task.
In the rest of the paper, we present the proposed method in detail in Sec. 2. The parameter selection is discussed in Sec. 3. In Sec. 4, we give the experimental results. Finally, Sec. 5 summarizes and concludes.
2 Proposed Method
There are four stages in our proposed method. In the first
stage, we partition images to different subregions. The
size of each subregion is different depending on the different
clustering methods. If the clustering is based on a fixed num-
ber of clusters, the subregion is the same size as the image.
Otherwise, it is variable. Then all the CCs in those subre-
gions are input into a trained CNN, and the output is the can-
didate characters with the probability and the noncharacters.
The candidate characters are grouped into words based on
the probability and the position in each subregion. A verifi-
cation process is given by a trained support vector machine
(SVM) classifier to classify text and nontext.
2.1 Conditional Clustering for Image Partition
In this step, we partition the image to several subregions by
assigning pixels with similar intensity and spatial positions
to a certain layer. There are two sequential methods for clus-
tering. In the first clustering, many small-size subregions
containing text with a high probability are produced. In
the sequential clustering, we use K-means to get full-size
subregions. We follow the steps below to partition images
in the first clustering method.
1. The Canny edge detector14 is used to get the edge map due to its continuity and integrity. It retains the edge points whose orientations belong to [π/3, π/2] or [−π/2, −π/3] and removes the rest from the edge map.
2. Scan each row and extract the points in the middle of two neighboring edge points. Those points are marked as Gi = {pi}, where g(p,i) is the intensity of the extracted point p in the i'th row.
3. If the intensity difference of two neighboring points in Gi is less than Tg, that is, |g(p,i) − g(p+1,i)| < Tg, we mark g(p+1,i) as a nonrepresentative point and delete these points after all the points in Gi are scanned.
4. If a point in Gi has an intensity difference of less than 5 with any other point in Gi, it is preserved and considered a representative point (RP). Otherwise, it is removed.
5. Count the intensity distribution of the RPs in each row
to form an RP distribution map. Then search the dis-
tribution map with an r×rwindow to find the loca-
tion and intensity of potential text. Since 5 is the tolerance for the difference of two points' intensities in step 4, we set the window size r to 2 × 5 + 1 = 11.
If there is no other point in the window excluding the
center one, the center point is set to 0.
6. In each row, if the difference between two RPs' gray values is <5, the points on the path between them are marked as 1. The same operation is performed in each column. We obtain clusters in this way. The left and right boundary values of each cluster compose the intensity range of potential text.
7. Rebuild the distribution of all ridge points of vertical
parallel edges and search neighboring points vertically
from the top and bottom boundary of each original
cluster separately to find the new boundaries. If two clusters have a similar color range and are close to each other vertically, they are connected in the vertical direction to form new top and bottom boundaries.
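The row-wise steps 2 to 4 above can be sketched as follows. The function names are ours, not the paper's; the Canny step (1) and the 2-D window search (steps 5 to 7) are omitted, and a constant threshold stands in for the adaptive Tg of Sec. 3.1.

```python
import numpy as np

def extract_midpoints(edge_cols, intensity_row):
    """Step 2: take the midpoint of each pair of neighboring edge
    points in a row and record its intensity."""
    pts = []
    for a, b in zip(edge_cols, edge_cols[1:]):
        mid = (a + b) // 2
        pts.append((mid, int(intensity_row[mid])))
    return pts

def prune_rps(points, t_g=10, tol=5):
    """Steps 3-4: drop a point whose intensity is within t_g of the
    previous kept point, then keep only points that have at least one
    other point within `tol` gray levels (representative points, RPs).
    t_g is adaptive in the paper (Sec. 3.1); a constant is used here."""
    kept = [points[0]] if points else []
    for p in points[1:]:
        if abs(p[1] - kept[-1][1]) >= t_g:
            kept.append(p)
    return [p for p in kept
            if any(q is not p and abs(p[1] - q[1]) < tol for q in kept)]
```

For example, midpoints with intensities 100, 150, and 103 yield two RPs (the 100/103 pair) while the isolated 150 is discarded.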
The examples are shown in Fig. 1(a) and RPs are marked
as green dots. The RPs are obtained from the ridge of parallel
vertical edges. As text regions have more complex texture
features than smooth invariant regions, most of the text
ridge points are contained in the RPs. The statistical analysis
of the dataset of training text images15 shows that the x-derivatives tend to be larger in the central (i.e., text) region and
smaller in the regions above and below the text. Based on
this fact, we extract RPs between the two vertical direction
edges. The RPs in text regions have similar intensities and
neighboring spatial positions, so they can form clusters. In
the background regions without text, they may disperse to
different intensities. Figure 1(b) shows the RPs' intensity,
and the position distribution of the images is shown in
Fig. 1(a). The clusters of Fig. 1(a) images are shown in
Fig. 1(c). If the height of a cluster is <15 or its width is >100, it is not partitioned. The height of a cluster corresponds to the possible height of the text. There are many clusters with a height <15, but they are background noise. To retain
a higher precision, we remove those clusters whose height is
<15. The subregion’s size is decided by the cluster’s height
and image width. In case some text regions are missed in the
first-stage clustering, we use K-means to cluster the unprocessed regions. If most of the regions of an image are processed, less information is contained in the remaining regions. Otherwise, more information needs to be segmented by second-stage clustering. In the K-means algorithm, the number of clusters is set manually, and sometimes it is fixed.
Here, we use the region ratio to decide it. We first count the
ratio of a partitioned subregion’s size to the image size. If the
ratio is >0.5, we set K = 3. Otherwise, K is set to 5. The
color feature is used for the K-means clustering. The
partition result is shown in Fig. 2. In total, there are seven
subregions. Some background CCs appear after the second-stage clustering, which can be filtered out by geometric
features. Thus, we filter out the CCs whose width or height is
the same as that of the image or the ratio is >0.8. Then we
can get some new potential text CCs in this clustering. This
process generates low precision but high recall.
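The two second-stage decisions above, picking K from the region ratio and filtering oversized CCs, can be sketched as follows; the function names are ours and the K-means call itself is omitted.

```python
def choose_k(partitioned_area, image_area):
    """K = 3 when more than half of the image was already partitioned
    in the first stage, otherwise K = 5 (more information remains)."""
    return 3 if partitioned_area / image_area > 0.5 else 5

def keep_cc(cc_w, cc_h, img_w, img_h, max_ratio=0.8):
    """Filter out background CCs whose width or height equals the
    image's, or whose width/height ratio to the image exceeds 0.8."""
    if cc_w == img_w or cc_h == img_h:
        return False
    return cc_w / img_w <= max_ratio and cc_h / img_h <= max_ratio
```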
2.2 Connected Components Classification by
Convolutional Neural Network
With the excellent performance of CNNs on object classification16 and character classification,17,18 they have become more popular in the computer vision field. A CNN can also be used to solve the text detection problem, which has two
classes: text or nontext. The subregions contain many CCs
which are characters and background. In this section, we use
a trained CNN, as shown in Fig. 3, to classify those CCs. The
whole CNN structure is composed of three convolutional
layers and two fully connected layers. We use average pool-
ing and rectified units. The settings of the CNN architecture
used in our work are shown in Table 1. In this classification
system, all the stride sizes are set to 1. The input image
region is normalized to 32 × 32 pixels, and we can get 1 × 1 pixel feature maps after the first three convolutional layers under a special design. The kernel numbers we
used are 48, 64, and 128, respectively. After each convolu-
tion, we perform average pooling with window sizes of 5, 2,
and 0. Then two fully connected layers with 118 and 2 perceptron units follow. An SVM is used in the last layer; we can
obtain a vector of two dimensions representing the scores of
text and nontext.
In the first layer of the CNN, we train the filters with an unsupervised learning algorithm.11 From a given set of 32 × 32 pixel training images, we randomly select eight patches of size 8 × 8 pixels, then perform contrast normalization and zero-phase component analysis whitening19 on all these patches to form 64-dimension input vectors of each
Fig. 1 (a) The processed original image with representative points marked in green, (b) distribution of
gray level and horizontal position, and (c) clusters for image partition.
Fig. 2 Partitioned subregions: (a) small-size subregions—the subregion’s size is image width ×cluster
height and (b) full-size subregions—the subregion’s size is the same as the original image. We rescale
them for display.
Journal of Electronic Imaging 053019-3 Sep∕Oct 2015 •Vol. 24(5)
Zhu et al.: Detecting text in natural scene images with conditional clustering and convolution neural network
variant. A K-means clustering method is used to learn a set of low-level filters D ∈ R^(64×n). Here, we set n = 48, which is equal to the number of filters in the first layer. The filter bank in the first layer is shown in Fig. 4. For a normalized and whitened 8 × 8 pixel patch x, we compute its first-layer responses by forming an inner product with the filter bank followed by a scalar activation function: Response = max{0, |D^T x| − α}, where α = 0.5 is a hyperparameter. In the training phase,
we use SVM classification and backpropagation to optimize
all parameters, while keeping the filters unchanged in the
first convolution layer (learned from K-means). After classi-
fication by CNN, the CCs are labeled with the probability of
text. We filter out some nontext CCs by setting a threshold T on the output score of the CNN.
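As a minimal sketch, the first-layer response Response = max{0, |D^T x| − α} with α = 0.5 can be written directly in NumPy. The filter bank used here is a stand-in, not the K-means-learned D of the paper.

```python
import numpy as np

def first_layer_response(D, x, alpha=0.5):
    """D: (64, n) filter bank; x: (64,) whitened patch.
    Returns the (n,) vector max(0, |D^T x| - alpha)."""
    return np.maximum(0.0, np.abs(D.T @ x) - alpha)
```

With an identity-like stand-in bank, a patch component of 2.0 produces a response of 1.5 after the α = 0.5 shift, while weak responses are clipped to zero.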
2.3 Connected Components Grouping
In this section, we group the candidate text CCs into words in
two stages. In the first stage, each CC is grouped with neigh-
boring CCs in the same subregion. In the second stage, we
consider all the CCs in different subregions to recall some of
the missing parts. This operation can retrieve some missing
characters in a word. Text strings in natural scene images
usually appear in alignment, and the characters inside are
close to each other, namely, each text character in a text
string must possess character siblings at adjacent positions.
The structure features among neighboring characters can be
used to determine whether the CCs belong to a word or are
unexpected noise. We search in the left and right directions
of double height distance regions of the connected compo-
nent as shown in Fig. 5, which is labeled as gray. If there
exist connected components in the two regions, we use
four criteria as defined below to decide whether they are
neighboring to each other. Here, we label the top, bottom, left, and right boundary coordinates of a candidate character block x as XxT, XxB, XxL, and XxR, and similarly for the coordinates of its neighbors.
1. The average color distance of two regions should be less than Tcolor.
2. Considering uppercase and lowercase characters, the height ratio should fall in [1/Th, Th].
3. The font of characters in a word usually has a similar occupation ratio, so the ratio of their occupation ratios should fall in [1/Toccp, Toccp].
4. Their center difference and top and bottom boundary differences should satisfy Eq. (1).
Tcolor is the color difference threshold, and we set it to 40. The height ratio threshold Th and occupation ratio threshold Toccp are set to 2.3 and 2.5, respectively. Th1 and Th2 are
Fig. 3 Convolutional neural network (CNN) architecture used in our work.
Table 1 The parameters used in the convolutional neural network
Layers Kernel size Pooling size Nodes
Layer 1 8 5 48
Layer 2 2 2 64
Layer 3 2 0 128
Full 1 – – 118
Full 2 – – 2
Fig. 4 The filter bank in the first layer of CNN.
parameters of the grouping module. Th1 is a threshold for the height difference of two CCs' centers, and Th2 is a threshold for the top and bottom boundary differences of the two CCs. We set Th1 = 0.3 × min{h(ci), h(cj)} and

|XiT − XjT| < Th2, |XiB − XjB| < Th2. (1)
These criteria are applied to all pairs of neighboring can-
didate characters. In the search of regions, there may exist
more than one neighboring CC; thus, we take the closest right and left CCs as the adjacent candidate characters. Meanwhile, we label the CCs that have adjacent candidate characters on only one side.
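The four pairwise criteria above can be sketched as a single predicate. The dict fields and the default Th2 value are our assumptions for illustration; the text leaves Th2 to the grouping module.

```python
def are_neighbors(a, b, t_color=40, t_h=2.3, t_occp=2.5, t_h2=10):
    """a, b: CC dicts with 'color', 'top', 'bottom', 'occ' fields
    (hypothetical representation). Returns True if all four
    neighboring criteria hold."""
    h_a, h_b = a['bottom'] - a['top'], b['bottom'] - b['top']
    # 1. average color distance below Tcolor
    if abs(a['color'] - b['color']) >= t_color:
        return False
    # 2. height ratio within [1/Th, Th]
    if not (1 / t_h <= h_a / h_b <= t_h):
        return False
    # 3. occupation-ratio relationship within [1/Toccp, Toccp]
    if not (1 / t_occp <= a['occ'] / b['occ'] <= t_occp):
        return False
    # 4. center difference below Th1 = 0.3 * min(h), and
    #    top/bottom boundary differences below Th2 (Eq. 1)
    c_a, c_b = (a['top'] + a['bottom']) / 2, (b['top'] + b['bottom']) / 2
    if abs(c_a - c_b) >= 0.3 * min(h_a, h_b):
        return False
    return abs(a['top'] - b['top']) < t_h2 and abs(a['bottom'] - b['bottom']) < t_h2
```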
After the first-stage grouping, CCs can be classified as leftmost, rightmost, middle, and non-neighboring candidate characters. For the middle ones, we do not search further. The above four criteria are then applied to the leftmost CCs on the left side of the search region and to the rightmost CCs on the right side of the search region.
After this process, each text string can be mapped into an
adjacent character group. However, some adjacent character
groups correspond to unexpected false positives rather than real text strings. We use the text probability to filter out the nontext groups.
In the former stage, we get the score of all the candidate
text CCs in one image, then we normalize these scores to the
text probability constrained in [0, 1]. We count the maximum score Maxs and minimum score Mins of the CCs in the image. Then each CC's probability is computed by Eq. (2), which represents its text probability in the image.
Examples are shown in Fig. 6.
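Eq. (2) is not reproduced in this excerpt; given that only Maxs and Mins enter it, a plausible min-max normalization reads as follows (the handling of the degenerate all-equal case is our own choice, not the paper's).

```python
def normalize_scores(scores):
    """Map raw CNN scores of one image to [0, 1] via
    (s - Mins) / (Maxs - Mins)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # all scores equal: treat every CC as maximally probable
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```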
If the group has more than three candidate characters, we count the average probability Pavg of all the CCs that belong to it. If Pavg > Tp, they are considered text regions. For groups with only two CCs, if they satisfy this condition and their occupation ratio is <0.8, they are considered text.
Otherwise, we search their left and right sides within a width of 3 × Wcc, where Wcc is the width of the centered CC. If there
exist text regions, they are considered as text. This two-stage
grouping method can retrieve characters in text strings in dif-
ferent image partitions.
2.4 Text/Nontext Verification
In the first step, we use text string features to partition
images. Then the deep feature20 based CNN classification
is used to classify the CCs. After grouping, the CCs form
either words or background. For the grouped CCs of an
uncertain size subregion, they have a very high probability
of being text by using both inter- and intra-features. For the
grouped CCs in full-size subregions, only character features
are analyzed. Thus, in this step, we classify these grouped
candidate text regions by an SVM classifier with a modified
gradient-based HOG descriptor.21 We perform a distance
transformation to obtain the distance map D. Based on
the distance map, the gradient is computed by Eq. (3).
∇I(i, j) = (1/2)[I(i + 1, j) − I(i − 1, j), I(i, j + 1) − I(i, j − 1)], if D(i, j) ≤ 3; no operation, otherwise. (3)
Large gradients arise close to edges, so we only compute the
gradients of pixels near the contours. The gradient orientation θ∇I(i, j) and magnitude ρ∇I(i, j) are obtained from the valid ∇I(i, j). θ∇I(i, j) is defined in the range [0, 2π].
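A direct sketch of Eq. (3): central differences are computed only where the distance map D is at most 3 (near contours). Representing the "no operation" branch as NaN is our choice for illustration.

```python
import numpy as np

def masked_gradient(I, D, d_max=3):
    """I: (H, W) gray image; D: (H, W) distance map.
    Returns (gy, gx) per Eq. (3), NaN where D > d_max."""
    gy = np.full_like(I, np.nan, dtype=float)
    gx = np.full_like(I, np.nan, dtype=float)
    for i in range(1, I.shape[0] - 1):
        for j in range(1, I.shape[1] - 1):
            if D[i, j] <= d_max:
                gy[i, j] = 0.5 * (I[i + 1, j] - I[i - 1, j])
                gx[i, j] = 0.5 * (I[i, j + 1] - I[i, j - 1])
    return gy, gx
```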
Reference 22 points out that the gradient distribution is
largely independent of the horizontal position, so a single
cell in the horizontal plane performs better. Hence, we design
five blocks and mainly use horizontal cuts, as shown in
Fig. 7. The HOG in each cell has 12 bins; each bin is π/6 radians wide, centered at orientation k × (π/6) for k = 0, 1, ..., 11. For feature normalization, the final descriptor is divided by the sum of all features plus a constant η. The input of the SVM is a 15 × 12 normalized HOG feature vector.
A sliding window strategy is used to decompose candi-
date regions to several blocks and fuse all of the SVM scores
of these samples for classification. First, the text block in the
original color space is converted to grayscale, since the
human visual system is only sensitive to the brightness chan-
nel for character shape recognition.23 To solve the text
candidate region’s size variations, normalization is imple-
mented. A large normalized size will cost more time to com-
pute its features, but the features in a small normalized size
are indiscriminate. We normalize the text candidate to a
height of 30 pixels. Correspondingly, its width is resized
with equal proportions and then scaled to approximately 30 × n pixels (n ≥ 2, n ∈ N+). The gray-scale text
candidate region and the binary region obtained from the
grouping result are both normalized. Then we use a sliding 30 × 60 pixel window, with each window overlapping the previous one by 30 pixels, to generate several samples. These samples are classified by the trained SVM classifier, and all the output results in Eq. (4) are averaged to
determine the final result, where K is the kernel24 computed as the dot product of the input feature vector and the N fixed support vectors Φ(xi). αi are real weights and b is the bias. The two parameters are obtained from the training
step. These SVM output scores of the decomposed samples
are averaged with Gaussian weights as used in Ref. 25. The
fused SVM score is the final candidate region’s score. If it is
Fig. 5 The left and right search regions (gray) of a connected component.
larger than a certain threshold, then this region is regarded as
a text region; otherwise, it is regarded as a false alarm. In the
experiment, we set the threshold to −2.8 to balance the false
acceptance rate and false rejection rate.
f(x) = Σ_{i=1}^{N} αi K(x, xi) + b. (4)
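A hedged sketch of the SVM decision function described above together with the Gaussian-weighted fusion of per-window scores. The dot-product kernel follows the text; the Gaussian weighting parameters are our assumption, since Ref. 25's exact weights are not reproduced here.

```python
import numpy as np

def svm_score(x, support_vectors, alphas, b):
    """f(x) = sum_i alpha_i * K(x, sv_i) + b with K = dot product."""
    return float(sum(a * np.dot(x, sv)
                     for a, sv in zip(alphas, support_vectors)) + b)

def fuse_scores(scores, sigma=1.0):
    """Average per-window SVM scores with Gaussian weights centered
    on the middle window (one plausible reading of the fusion step)."""
    scores = np.asarray(scores, dtype=float)
    pos = np.arange(len(scores)) - (len(scores) - 1) / 2
    w = np.exp(-0.5 * (pos / sigma) ** 2)
    return float(np.sum(w * scores) / np.sum(w))
```

The fused score would then be compared against the acceptance threshold (−2.8 in the experiments) to label the region text or nontext.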
In different channels, we get a set of text regions, but
some of the regions overlap when projected to one map.
Therefore, the text regions that share similar positions should
be compared and only one is selected as the real text region.
We label small-size subregions as Rs-text and text regions in
full-size subregions as Rf-text . In our method, we first analyze
the grouped text regions in the small-size subregions. If the
overlap size of two Rs-text exceeds 90%, they are considered
as the same text region and we get their average. If it is <30%
or there is no overlap, the two regions are considered as sep-
arate detected text regions. Otherwise, we select the larger
one. Then we analyze two overlapped Rf -text if their overlap
size is >20%. Motivated by Kim’s method26 that transforms
the output of SVM to a probability scaled in an interval [0, 1]
using a sigmoidal activation function, if the overlap is <50%,
we determine that the region that has the larger probability of
the transformed SVM output is considered a text region.
Otherwise, we select the outermost boundaries of both text regions. For overlapped Rf-text and Rs-text, we consider Rs-text as the base text region and expand its boundaries only when their heights are similar but Rf-text extends beyond it.
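The Rs-text decision rule above can be sketched as follows. The text does not state which area the overlap percentage is measured against; the sketch assumes intersection area over the smaller box, and regions are (x1, y1, x2, y2) tuples of our choosing.

```python
def overlap_ratio(a, b):
    """Intersection area over the smaller box (an assumption)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    smaller = min((a[2] - a[0]) * (a[3] - a[1]),
                  (b[2] - b[0]) * (b[3] - b[1]))
    return inter / smaller if smaller else 0.0

def resolve_rs(a, b):
    """Apply the small-subregion rule: >90% overlap -> average the
    boxes; <30% -> keep both; otherwise keep the larger box."""
    r = overlap_ratio(a, b)
    if r > 0.9:
        return [tuple((p + q) / 2 for p, q in zip(a, b))]
    if r < 0.3:
        return [a, b]
    area = lambda x: (x[2] - x[0]) * (x[3] - x[1])
    return [a if area(a) >= area(b) else b]
```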
3 Parameter Discussion
Three important parameters in our framework are discussed in this section: the intensity difference threshold Tg of the conditional clustering step, the threshold T of the CNN's output, and the grouping threshold Tp used to filter out nontext in the grouping step.
3.1 Intensity Difference Threshold Tg
The response of the eye to changes in the intensity of illumination is known to be nonlinear. The visual phenomenon depicted in Fig. 8 is the profile of the just noticeable difference (JND)27 in psychophysics. The JND is the minimum amount by which stimulus intensity must be changed in order to produce a noticeable variation in a sensory experience. Over a wide range of intensities, the expression is (ΔI/I) = ξ.
ΔI/I is called the Weber fraction, where the original intensity is I, ΔI is the additional amount required for the difference to be perceived (the JND), and ξ is a constant called the Weber constant. Outside this range, the JND changes markedly with the intensity I. The intensity difference threshold is then Tg = ξ × I, where I represents the lower intensity value of the two points. In the experiment, we set ξ to a variant value based on the range of I.
1. ξ = 0.15, if I ∈ [0, 100) or (220, 255].
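The adaptive threshold Tg = ξ × I can be sketched as below. Only ξ = 0.15 for the low and high intensity ranges is given in this excerpt; XI_MID for the middle range is a placeholder of ours, not the paper's value.

```python
XI_MID = 0.05  # hypothetical mid-range Weber constant (not from the paper)

def weber_threshold(i1, i2):
    """Tg = xi * I, using the lower intensity of the two points;
    xi = 0.15 when I falls in [0, 100) or (220, 255]."""
    lo = min(i1, i2)
    xi = 0.15 if (lo < 100 or lo > 220) else XI_MID
    return xi * lo
```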
3.2 Convolutional Neural Network Output Threshold T
The CNN output threshold T is used to decide whether the input connected components are characters or noise. If T is very large, no character components are detected. If it is very small, all CCs are
Fig. 6 Converted probability of candidate characters.
Fig. 7 Five blocks for extracting HOG features.
Fig. 8 Contrast sensitivity measurements.
classified to characters which dramatically affects the group-
ing result. To evaluate the effect, we calculate two quantities: the character detection precision p and the recall r. The character detection precision p is the ratio of the number of detected real characters to the number of all detected candidate characters. The recall r is the ratio of the number of detected real characters to the number of all ground-truth characters. We combine the two datasets4,5 with segmentation
ground truth files to evaluate the performance. The output
CCs belong to real characters only when their area overlaps
over 50% of the segmented ground truth character’s area. In
Fig. 9, we observe that the recall increases with the decrease
of T, while the precision increases in the beginning but then
decreases. The reason for this is that with the decrease of T,
we can detect more real characters, while the nontext CCs are
also increased. The recall is always increasing since it is only
related to the number of real characters; but the precision is
related to both of them. With the decrease of T, the real char-
acter’s increasing rate is faster than the nontext connected
component’s at the beginning, but is eventually overtaken
by the latter.
Therefore, we define a score F as a trade-off between the two quantities, combining the character detection precision and the recall. Score F is defined in Eq. (5). To get the highest score F, we set T = 0.
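Eq. (5) is not reproduced in this excerpt; a common choice that trades off precision p and recall r, consistent with the way the F-score is used here, is the harmonic mean, sketched below as an assumption.

```python
def f_score(p, r):
    """Harmonic-mean F-score of precision p and recall r
    (assumed form of Eq. (5))."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```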
3.3 Grouping Threshold Tp
The threshold Tp in the grouping step decides whether the grouped regions belong to text or nontext. We also illustrate the text precision-recall curves for this threshold among all the scene images of the two datasets.4,5 The precision is the
ratio of area of the successfully extracted text regions to the
area of the whole detected region, and the recall is the ratio of
the area of the successfully extracted text regions to the area
of the ground truth regions. The area of a region is the num-
ber of interior pixels. The same F-score measure in Eq. (5) is
used. If Tp → 0, all the grouped regions are preserved, so the recall is high; if Tp → 1, no region is detected.
In Fig. 10, we test nine values of Tp in [0.1, 0.5]. The highest F-score is 77.4% when Tp = 0.25, so it is set to 0.25 in our experiments.
4 Experimental Results
4.1 Data Collection
In our system, two classifiers require training: the convolutional neural network for character classification and the SVM for text and nontext classification. Thus, we collect two training sets. For the character training set, we obtain the dataset from Ref. 11. It
consists of examples from the ICDAR 2003 training images,3
the English subset of the Chars74k dataset,28 and synthetically generated examples. Examples of the training images
are shown in Fig. 11(a). There are 18,500 character images in
total. We randomly select 500 images to train the first layer
filters. The rest are used to train the parameters in CNN.
For training the SVM classifier in the verification step, we
collect 1697 text regions of training scene images from the
ICDAR 2011 and 2013 datasets. First, these training scene images are converted to gray-scale images and each of them is binarized by a clustering method.29 For less cost and higher
accuracy, we normalize both the binary text region images and
gray-scale text region images to a height of 30 with a fixed
width and height ratio. Then we divide each of them into sev-
eral 30 ×60 pixel samples and make sure that each gray-scale
sample corresponds to its own binary image. In total, we
collect 3000 text strings (containing 3000 gray-scale text
region images and corresponding to 3000 binary images).
Examples of the SVM training samples are shown in
Fig. 11(b). For nontext component samples, we apply our
algorithm to the training scene images of both datasets and manually select 3620 nontext samples (single CCs and grouped nontext regions).
We then test our method on two benchmark datasets: the ICDAR 20114 and 20135 text localization datasets. In addition,
Fig. 10 The text detection precision-recall curves with different grouping thresholds Tp. The numbers above the labeled points are the values of Tp.
Fig. 9 The character precision-recall curves with different T. The
numbers above the labeled points are the values of T.
in the ICDAR 2015 competition, a new challenge on the inci-
dental scene text task is held. The images of this dataset are
acquired with wearable devices and the text appearing in the
scene is low resolution. No specific prior action is taken to
change its appearance or improve its positioning or quality. It is quite different from previous scene text detection datasets, in which the text is the expected input for applications
such as translation on demand. This new dataset covers
another wide range of applications linked to wearable cam-
eras or massive urban captures where the capture is difficult
or undesirable to control. We also tested our work on this dataset; however, we only give the precision, recall, and F-score for our method, with no comparison.
4.2 Color Channel Selection
Algorithms using single channel or gray-scale information to
detect text regions suffer from the loss of information when
the character's colors are not consistent. We exploit multichannel information to recall color-inconsistent characters in cases of varying illumination and multicolored text. To evaluate the performance of multichannel information, we use the traditional equations5 to measure recall, precision, and F-measure. Four combinations of channels are evaluated on the
two ICDAR datasets. The result is shown in Fig. 12.
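For concreteness, the area-overlap matching behind such measures can be sketched as below. This follows one common reading of the ICDAR 2003-style evaluation; the actual 2011/2013 competition protocols (DetEval) are more elaborate, so treat this only as an illustration.

```python
def area(r):
    """Area of an axis-aligned rectangle (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = r
    return max(0, x1 - x0) * max(0, y1 - y0)

def inter(a, b):
    """Area of the intersection of two rectangles."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return area((x0, y0, x1, y1))

def best_match(r, others):
    """m(r; R) = max over r' of 2|r ∩ r'| / (|r| + |r'|)."""
    return max((2 * inter(r, o) / (area(r) + area(o)) for o in others),
               default=0.0)

def evaluate(dets, gts):
    """Precision, recall, and F-measure from best-match scores."""
    precision = sum(best_match(d, gts) for d in dets) / len(dets)
    recall = sum(best_match(g, dets) for g in gts) / len(gts)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```

A perfect detection scores 1.0 on all three measures; partial overlaps are credited proportionally rather than counted as hard hits or misses.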
Fig. 11 Training samples: (a) training samples for the CNN and (b) training samples for the SVM.
Fig. 12 Performance evaluation of the four combination channels on the Robust Reading Dataset 2011.
We also tested the datasets on a single channel but obtained poor results. This arises from the fact that text may have different colors in one image and strong illumination in text regions, which also affects the partition and grouping results. Having
fewer channels could speed up the framework, but it also
reduces the precision and recall. If more channels are
added, the results change subtly, but there is a greater time
cost. Since the combination of Y-Cr-Cb and ab channels
show the best performance on all aspects, they are selected
for further comparison.
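The Y, Cr, and Cb planes of this channel combination can be obtained with the standard BT.601 full-range conversion, sketched below; the paper does not specify its exact conversion, and the a and b planes would additionally require a CIELAB conversion (e.g., from a color library), omitted here for brevity.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Full-range BT.601 RGB -> (Y, Cb, Cr) planes for an H x W x 3 array."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr
```

Each returned plane can then be partitioned and clustered independently, which is what allows characters with inconsistent colors in RGB to remain coherent in at least one channel.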
4.3 Experimental Results on Three Natural Scene Datasets
We use the combined channels, Y, Cr, Cb, a, and b, as the input to our proposed framework. Low precision means overestimation, while low recall means underestimation. In
this system, the first step uses two sequential clustering meth-
ods to get the CCs that operate with high recall but low pre-
cision. Then the CNN classification and probability based
grouping are followed to increase the precision. Overall,
our method can get high recall and high precision detection
results. The average running time for our algorithm on
MATLAB® platform is 272 s when using a standard PC
with a 12 GHz Intel processor. Since we do not use a GPU to accelerate the CNN, the time cost is high; GPU acceleration could speed it up by more than 20 times. Tables 2 and 3, respectively, show the comparison of our framework with the top-ranked winners of the ICDAR 2011 and ICDAR 2013 robust
reading competition. Our method achieves a recall of 72%,
precision of 81%, and F-measure of 76% in the ICDAR
2011 text location dataset and recall of 73%, precision of
86%, and F-measure of 78% in the dataset of ICDAR
2013, which outperforms the state-of-the-art methods.
Some examples of detected text regions in a natural scene
are illustrated in Fig. 13 with red bounding boxes.
Fig. 13 Examples of detected text in natural scene images.

Table 2 Comparison results on the ICDAR 2011 dataset.

Method name          Precision  Recall  F-score
Huang's method13     0.88       0.71    0.78
Our method           0.81       0.72    0.76
Yin's method30       0.86       0.68    0.76
Kim's method4        0.83       0.62    0.71
Yi's method31        0.67       0.58    0.62
TH-TextLoc system4   0.67       0.58    0.62

Table 3 Comparison results on the ICDAR 2013 dataset.

Method name      Precision  Recall  F-score
Xu's method32    0.85       0.78    0.81
Our method       0.86       0.73    0.78
Liu's method33   0.85       0.69    0.76
Yin's method30   0.88       0.66    0.76
TextSpotter      0.87       0.65    0.75
CASIANLPR        0.79       0.68    0.73
Baseline         0.61       0.35    0.44

Additionally, we evaluate the performance on the new incidental scene text detection dataset.34 It contains 1500 images, of which we tested only the 1000 that have ground truth. The result is unsatisfactory, with 43% recall and 54% precision. The main reason is the low resolution and quality of the images: most of the text regions are missed in the first-step process. We show some successfully detected incidental scene text images at the bottom of Fig. 13.
5 Conclusion
In this paper, we presented a robust four-step text detection
algorithm, which achieved a high precision and recall. In the
first step, we used the intrinsic characteristic of text to par-
tition images in two sequential clustering modes and gener-
ated small-size subregions and full-size subregions. Then all
CCs in the partitioned images were classified by a five-layer
convolutional neural network. The output probability was
normalized to filter out nontext regions after a two-stage
grouping. Finally, a verification step was given to classify
candidate text regions in full-size subregions while those
detected in small-size subregions were directly considered
as text regions. The framework started from the subregions
to individual characters and then to integral text regions, and
used both the intrinsic characteristic of text and the individ-
ual properties of characters. The detection result showed that
our proposed method was highly effective on natural scene
text detection. Since our work was based on connected com-
ponent analysis, the detected results were binary text. As
future work, we hope to design a robust CNN architecture
to realize text recognition from the detection results.
References
1. K. Jung, K. In Kim, and A. K. Jain, “Text information extraction in images
and video: a survey,”Pattern Recognit. 37(5), 977–997 (2004).
2. H. Zhang et al., “Text extraction from natural scene image: a survey,”
Neurocomputing 122, 310–323 (2013).
3. S. M. Lucas et al., “ICDAR 2003 robust reading competitions,”in
Seventh Int. Conf. on Document Analysis and Recognition, pp. 682–
687, IEEE (2003).
4. A. Shahab, F. Shafait, and A. Dengel, “ICDAR 2011 robust reading
competition challenge 2: reading text in scene images,”in Int. Conf.
on Document Analysis and Recognition, pp. 1491–1496, IEEE (2011).
5. D. Karatzas et al., “ICDAR 2013 robust reading competition,”in 12th
Int. Conf. on Document Analysis and Recognition, pp. 1484–1493,
6. C. Yi and Y. Tian, “Text string detection from natural scenes by struc-
ture-based partition and grouping,”IEEE Trans. Image Process. 20(9),
7. H. I. Koo and D. H. Kim, “Scene text detection via connected compo-
nent clustering and nontext filtering,”IEEE Trans. Image Process.
22(6), 2296–2305 (2013).
8. P. Shivakumara, T. Q. Phan, and C. L. Tan, “A Laplacian approach to
multi-oriented text detection in video,”IEEE Trans. Pattern Anal.
Mach. Intell. 33(2), 412–419 (2011).
9. P. Shivakumara et al., “Accurate video text detection through classifi-
cation of low and high contrast images,”Pattern Recognit. 43(6), 2165–
10. K. Wang, B. Babenko, and S. Belongie, “End-to-end scene text recogni-
tion,”in IEEE Int. Conf. on Computer Vision, pp. 1457–1464, IEEE (2011).
11. T. Wang et al., “End-to-end text recognition with convolutional neural
networks,”in 21st Int. Conf. on Pattern Recognition, pp. 3304–3308,
12. Y.-F. Pan, X. Hou, and C.-L. Liu, “A hybrid approach to detect and
localize texts in natural scene images,”IEEE Trans. Image Process.
20(3), 800–813 (2011).
13. W. Huang, Y. Qiao, and X. Tang, “Robust scene text detection with
convolution neural network induced MSER trees,”Lec. Notes
Comput. Sci. 8692, 497–511 (2014).
14. J. Canny, “A computational approach to edge detection,”IEEE Trans.
Pattern Anal. Mach. Intell. PAMI-8(6), 679–698 (1986).
15. X. Chen and A. L. Yuille, “Detecting and reading text in natural scenes,”
in Proc. of the 2004 IEEE Computer Society Conf. on Computer Vision
and Pattern Recognition, Vol. 2, II-366–II-373, IEEE (2004).
16. K. He et al., “Delving deep into rectifiers: surpassing human-level per-
formance on imagenet classification,”arXiv preprint arXiv:1502.01852
17. A. Bissacco et al., “PhotoOCR: reading text in uncontrolled conditions,”
in IEEE Int. Conf. on Computer Vision, pp. 785–792, IEEE (2013).
18. M. Jaderberg et al., “Reading text in the wild with convolutional neural
networks,”Int. J. Comput. Vis. 1–20 (2014).
19. A. Hyvärinen and E. Oja, “Independent component analysis: algorithms
and applications,”Neural Netw. 13(4), 411–430 (2000).
20. M. Jaderberg, A. Vedaldi, and A. Zisserman, “Deep features for text
spotting,”in European Conf. on Computer Vision, pp. 512–528,
21. N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,”in IEEE Computer Society Conf. on Computer Vision and
Pattern Recognition, Vol. 1, pp. 886–893, IEEE (2005).
22. R. Minetto et al., “T-hog: an effective gradient-based descriptor for
single line text regions,”Pattern Recognit. 46(3), 1078–1090 (2013).
23. M. D. Fairchild, Color Appearance Models, John Wiley & Sons, United
24. D. Picard and P.-H. Gosselin, “Improving image similarity with vectors
of locally aggregated tensors,”in 18th IEEE Int. Conf. on Image
Processing, pp. 669–672, IEEE (2011).
25. C. Jung, Q. Liu, and J. Kim, “Accurate text localization in images based
on SVM output scores,”Image Vis. Comput. 27(9), 1295–1301 (2009).
26. K. I. Kim, K. Jung, and J. H. Kim, “Texture-based approach for text
detection in images using support vector machines and continuously
adaptive mean shift algorithm,”IEEE Trans. Pattern Anal. Mach.
Intell. 25(12), 1631–1639 (2003).
27. J. Shen, “On the foundations of vision modeling: I. Weber's law and
Weberized TV restoration,”Phys. D 175(3), 241–251 (2003).
28. T. E. de Campos, B. R. Babu, and M. Varma, “Character recognition in
natural images,”in Proc. of the Int. Conf. on Computer Vision Theory
and Applications, pp. 273–280 (2009).
29. C. Mancas-Thillou and B. Gosselin, “Color text extraction with selec-
tive metric-based clustering,”Comput. Vis. Image Underst. 107(1), 97–
30. X.-C. Yin et al., “Robust text detection in natural scene images,”IEEE
Trans. Pattern Anal. Mach. Intell. 36(5), 970–983 (2014).
31. C. Yi and Y. Tian, “Localizing text in scene images by boundary clus-
tering, stroke segmentation, and string fragment classification,”IEEE
Trans. Image Process. 21(9), 4256–4268 (2012).
32. H. Xu and F. Su, “Robust seed localization and growing with deep con-
volutional features for scene text detection,”in Proc. of the 5th ACM on
Int. Conf. on Multimedia Retrieval, pp. 387–394, ACM (2015).
33. J. Liu et al., “Robust text detection via multi-degree of sharpening and
blurring,”Signal Process. (2015).
34. “Incidental Scene Text,”http://rrc.cvc.uab.es/?ch=4com=downloads/.
Anna Zhu received her BS degree from the Department of Electronic
and Information Engineering, Huazhong University of Science and
Technology, Wuhan, China, in 2011. She is currently pursuing her
PhD in the School of Automation. She is now visiting the Human Interface Laboratory at Kyushu University, Japan, as a visiting student. Her current research interests include object detection, image
processing, and machine learning.
Guoyou Wang received his BS and MS degrees from Huazhong
University of Science and Technology, Wuhan, China, in 1988 and
1992, respectively. He is a faculty member with the China Society
of Image and Graphics. He is currently a professor of the Institute
for Pattern Recognition and Artificial Intelligence. His research inter-
ests include image modeling, computer vision, fractals, and guidance.
Yangbo Dong received his BS degree in engineering from Wuhan
University of Technology, Wuhan, China, in 2013. He is currently pur-
suing his MS degree in Institute for Pattern Recognition and Artificial
Intelligence, Huazhong University of Science and Technology,
Wuhan, China. His current research interests include text segmenta-
tion and recognition in natural scene images and video, object recog-
nition, and image processing.
Brian Kenji Iwana received his BS degree from the Department of
Electrical Engineering and Computer Science, University of California
Irvine, Irvine, California, in 2005. He is currently pursuing his PhD in
the Human Interface Laboratory, Kyushu University. His current
research interests include machine learning, temporal pattern recog-
nition, and dissimilarity space embedding.