All content in this area was uploaded by Mikhail Kharinov on May 23, 2019

Object Detection in Color Image

M. Kharinov 1), A. Buslavsky 1)

1) St. Petersburg Institute for Informatics and Automation of RAS,

14_liniya Vasil’evskogo ostrova 39, St. Petersburg, 199178 Russia,

khar@iias.spb.su, www.spiiras.nw.ru

Abstract: This paper treats the problem of automated object detection in a color image. A solution based on classical pixel clustering methods is developed. A parameter measuring the heterogeneity of image areas is introduced. A method for marking up an image with automatically produced object names is proposed. The data structure is described. The computational complexity of the clustering algorithms is estimated.

Keywords: pixel clustering, standard deviation, quasi-optimal approximations, hierarchy.

1. INTRODUCTION

A few decades ago, the creation of software and

hardware image processing systems was mainly limited to

the development of the user interface, which most

programmers of each firm were engaged in. The situation

has significantly changed with the advent of the Windows

operating system, when the majority of developers

switched to solving the problems of image processing

itself. However, this has not yet led to fundamental progress in solving typical tasks such as recognizing faces, license plates and road signs, analyzing remote sensing and medical images, etc.

Each of these "eternal" problems is solved by trial and

error by the efforts of numerous groups of engineers and

scientists. As modern technical solutions turn out to be

excessively expensive, the task of automating the creation

of software tools for solving intellectual problems is

formulated and intensively solved abroad [1]. In the field

of image processing, the required toolkit should support

the analysis and recognition of images of previously

unknown content and ensure the effective development of

applications by ordinary programmers, just as the Windows toolkit supports the creation of interfaces for solving various applied problems.

2. PROBLEM STATEMENT

The processing of images of arbitrary content is considered in the context of detecting the most noticeable areas in the image. In the original works, this image processing domain is referred to as “salient region detection” [2,3]. In these works, a so-called saliency map of visible areas is constructed for an image. The saliency map is then usually converted to a black-and-white object/background mask by thresholding. In the pixel clustering model discussed here, the problem statement of [2,3] is generalized and developed. The problem considered is that of ordering sets of pixels according to a heterogeneity (local nonhomogeneity) parameter, which serves as a quantitative measure. Like the number of pixels, the heterogeneity decreases, or at least does not increase, when a set of pixels is split into subsets.

Usually, a heterogeneity-type parameter, which expresses the contrast or complexity of an image region, is determined for a given pixel by evaluating its characteristics with respect to the remaining pixels of the image [2,3]. In this paper, to introduce the target heterogeneity parameter, a special sequence of piecewise-constant image approximations is calculated. It is assumed that the approximations constitute a binary hierarchy and have the special properties necessary for the correct interpretation of the trivially defined heterogeneity parameter. Up to some refinements, the hierarchical sequence of approximations is produced by classical cluster analysis methods [4,5].

Figuratively speaking, the task is to model the visual perception of a newborn living creature (for example, a human or a fly) that has no experience yet but is able to see the scene in ordered colors and to distinguish objects, which are represented by clusters of pixels and ranked by size. In this case, the computer simulation does not imitate biological visual perception but relies on optimizing piecewise-constant approximations of the image by the standard deviation $\sigma$ or, equivalently, by the approximation error, i.e. the total squared error $E = 3N\sigma^2$, where $N$ is the number of pixels in the image.

3. HETEROGENEITY PARAMETER

It is known that the sequence of optimal piecewise-constant approximations of an image, depending on the number $g$ of pixel clusters, is described by a monotonically increasing sequence of non-positive increments $\Delta E \le 0$ of the approximation error, $\Delta E_2 \le \Delta E_3 \le \dots \le \Delta E_{N-1} \le 0$, or by a convex sequence of the values $E$ themselves:

$$E_g \le \frac{E_{g-1} + E_{g+1}}{2}, \quad g = 2, 3, \dots, N-1. \qquad (1)$$

Since, in the general case, the sequence of optimal approximations is not hierarchical, the problem arises of approximating the sequence of optimal approximations by a hierarchical sequence of quasi-optimal approximations. The quasi-optimal approximations are described by a convex sequence of $E_g$ values and contain an optimal approximation with a fixed number of clusters $g_0$ (Fig. 1).

When the approximating curve is convex, the approximation error $E_g$ is guaranteed to stay within a certain threshold (the downward dotted straight line in Fig. 1):

$$E_g \le E_1 \, \frac{N - g}{N - 1}. \qquad (2)$$
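As a small illustration, the convexity condition (1) and the chord bound (2) can be checked numerically for a given error sequence. The sketch below is in Python, which the paper itself does not use; the function names and the sample sequences are illustrative assumptions, and (2) is read here as $E_g \le E_1 (N - g)/(N - 1)$ with $E_N = 0$.

```python
def is_convex(errors, tol=1e-9):
    """Check condition (1) for errors = [E_1, E_2, ..., E_N]."""
    return all(
        errors[k] <= (errors[k - 1] + errors[k + 1]) / 2 + tol
        for k in range(1, len(errors) - 1)
    )


def within_chord(errors, tol=1e-9):
    """Check bound (2): E_g <= E_1 * (N - g) / (N - 1) for g = 1..N (N >= 2)."""
    n = len(errors)
    return all(
        errors[g - 1] <= errors[0] * (n - g) / (n - 1) + tol
        for g in range(1, n + 1)
    )


# Hypothetical error sequences for N = 4 pixels (E_N = 0 in both).
convex_errors = [10.0, 4.0, 1.0, 0.0]
non_convex_errors = [10.0, 9.0, 1.0, 0.0]

print(is_convex(convex_errors), within_chord(convex_errors))
print(is_convex(non_convex_errors), within_chord(non_convex_errors))
```

The second sequence violates (1) at $g = 2$, so it also rises above the straight line that joins $(1, E_1)$ and $(N, 0)$.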

The heterogeneity parameter $\Delta E_{split}(i \cup j)$ is defined for each cluster $i \cup j$ as the absolute value of the approximation error increment caused by the division of cluster $i \cup j$ into the pair of nested clusters $i$ and $j$ that is specified in the considered binary hierarchy.

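Fig. 1 – Approximation of optimal approximations (gray curve) by quasi-optimal approximations (solid black curve). Horizontal axis: number of clusters $g$, from 1 to $N$, with $g_0$ marked; vertical axis: approximation error $E$, from 0 to $E_1$.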

From the convexity of the curve for quasi-optimal approximations in Fig. 1 it follows that the heterogeneity parameter does not increase as clusters are divided into smaller ones:

$$\Delta E_{split}(i \cup j) \ge \Delta E_{split}(i), \ \Delta E_{split}(j). \qquad (3)$$

Therefore, it can be considered a quantitative measure of the heterogeneity of pixel clusters.

The numerical value of the heterogeneity parameter $\Delta E_{split}(i \cup j)$ is expressed via the numbers of pixels $n_i$, $n_j$ and the three-component pixel values $I_i$, $I_j$ averaged within the clusters $i$, $j$ as:

$$\Delta E_{split}(i \cup j) = \Delta E_{merge}(i, j) = \frac{n_i n_j}{n_i + n_j} \, \| I_i - I_j \|^2, \qquad (4)$$

where $\Delta E_{merge}(i, j) \ge 0$ is the non-negative increment of the approximation error caused by merging the two clusters $i$ and $j$ into the cluster $i \cup j$.
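Formula (4), read as $\Delta E_{merge}(i, j) = \frac{n_i n_j}{n_i + n_j} \| I_i - I_j \|^2$, can be verified against a direct computation of the squared error before and after a merge. The following Python sketch does this for two small, made-up clusters of RGB pixels; the helper names and the pixel values are assumptions introduced only for illustration.

```python
def mean_color(pixels):
    """Average three-component value of a list of RGB pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def squared_error(pixels):
    """Total squared error of replacing every pixel by the cluster mean."""
    m = mean_color(pixels)
    return sum((p[c] - m[c]) ** 2 for p in pixels for c in range(3))


def merge_increment(n_i, mean_i, n_j, mean_j):
    """Formula (4): n_i * n_j / (n_i + n_j) * ||I_i - I_j||^2."""
    dist2 = sum((a - b) ** 2 for a, b in zip(mean_i, mean_j))
    return n_i * n_j / (n_i + n_j) * dist2


# Two hypothetical clusters of RGB pixels.
cluster_i = [(10, 20, 30), (12, 22, 28), (11, 19, 33)]
cluster_j = [(200, 180, 90), (190, 185, 95)]

# Increment measured directly on the pixel values ...
direct = (squared_error(cluster_i + cluster_j)
          - squared_error(cluster_i) - squared_error(cluster_j))
# ... and the same increment from cluster sizes and mean colors only.
by_formula = merge_increment(len(cluster_i), mean_color(cluster_i),
                             len(cluster_j), mean_color(cluster_j))
print(direct, by_formula)
```

The point of (4) in practice is that only the pixel counts and mean colors of the two clusters are needed, so the increment can be evaluated without revisiting the pixels.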

4. HIERARCHY OF PIXEL CLUSTERS

A posteriori, any convex sequence $E_{g=1}, E_{g=2}, \dots, E_{g=N}$ (Fig. 1) can be obtained by enlarging clusters of pixels using Ward's method [4-6]. In this method, each pixel is initially treated as an independent cluster. Then, at each step, the pair of clusters $i, j$ that provides the minimum increment of the approximation error $\Delta E_{merge}(i, j)$ is merged:

$$i, j \to i \cup j: \quad (i, j) = \arg\min_{i, j = 0, 1, 2, \dots, g-1} \Delta E_{merge}(i, j), \qquad (5)$$

where the number of clusters $g$ decreases from $N$ to 1.
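The merging rule of Ward's method can be sketched in a few lines of Python. This is a deliberately naive version (every pair of clusters is examined at every step, so it is only suitable for a handful of pixels), not the accelerated scheme discussed later in the paper; the function names and the sample pixels are assumptions.

```python
from itertools import combinations


def merge_increment(n_i, mean_i, n_j, mean_j):
    """Increment of the squared error when clusters i and j are merged."""
    dist2 = sum((a - b) ** 2 for a, b in zip(mean_i, mean_j))
    return n_i * n_j / (n_i + n_j) * dist2


def ward_hierarchy(pixels):
    """Naive Ward clustering of a short list of RGB pixels.

    Returns (errors, merges): errors[g] is the approximation error E for g
    clusters; merges lists (parent, child_a, child_b, dE_merge) in merge order.
    """
    clusters = {k: (1, tuple(map(float, p))) for k, p in enumerate(pixels)}
    next_id = len(pixels)
    error = 0.0
    errors = {len(pixels): error}      # E_N = 0: every pixel is a cluster
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the minimum increment of the error.
        increment, a, b = min(
            (merge_increment(*clusters[a], *clusters[b]), a, b)
            for a, b in combinations(sorted(clusters), 2)
        )
        (n_a, m_a), (n_b, m_b) = clusters.pop(a), clusters.pop(b)
        n = n_a + n_b
        mean = tuple((n_a * x + n_b * y) / n for x, y in zip(m_a, m_b))
        clusters[next_id] = (n, mean)
        merges.append((next_id, a, b, increment))
        next_id += 1
        error += increment
        errors[len(clusters)] = error
    return errors, merges


# Six hypothetical RGB pixels forming three tight pairs.
pixels = [(0, 0, 0), (1, 0, 0), (10, 10, 10), (11, 10, 10),
          (100, 100, 100), (100, 101, 100)]
errors, merges = ward_hierarchy(pixels)
for g in sorted(errors):
    print(g, round(errors[g], 3))
```

The merge log contains $N - 1$ entries, and each entry records the two children of the new cluster, which is the information needed later to undo a merge.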

For grayscale images, along with Ward's iterative pixel clustering, the hierarchical Otsu method [7] of iteratively dividing pixel clusters in two is also applicable. In combination with the multi-threshold Otsu method [8,9], it provides the simplest software implementation of the target pixel clustering according to Fig. 1.

At the output, an algorithm of iterative merging or splitting of pixel sets generates a binary cluster hierarchy. It contains $N$ image approximations, which together contain only $2N - 1$ different pixel clusters. Of these, $N$ clusters are indivisible, since they consist of individual pixels. For each of the other $N - 1$ clusters, the split operation is maintained, which restores the pair of pixel clusters 1 and 2 that were merged to form cluster 3 in Ward's method: $3 = 1 \cup 2: \ 3 \to 1, 2$.

5. SEQUENCE OF OBJECTS

A sequence of geometrically non-intersecting objects is detected by a threshold value of heterogeneity or of area, i.e. the number of pixels in a cluster. In this case, a subsequence of approximations consisting of pixel clusters with heterogeneity values above and below the established threshold is selected from the full hierarchical sequence of approximations. Thus, the image field is divided into regions of objects and background. The sequence of objects revealed in the image field is encoded in the object rating map by numbers in the order of their detection (Fig. 2).

Fig. 2 – Encoding of revealed objects in the standard color Lenna image.

Fig. 2 shows the original image on the left and the object rating map in 5 tones with 14132 segments on the top right. This map is calculated for a $\Delta E_{split}$ threshold equal to 1% of the maximum value $E_1$ marked in Fig. 1. The bottom-right representation shows the image in 13151 colors, which is obtained by averaging the pixels inside the segments of the object rating map.

Objects revealed first are marked in black on the top-right object rating map, and objects revealed last are marked in white. The heterogeneity values and the areas of the marked regions increase with increasing intensity of the gray tones. The resulting pixel intensities for a given threshold are treated as automatically calculated object identifiers.

Note that to get the top-right representation of the image in several colors, it is enough to specify a single threshold value of the heterogeneity parameter or of the area.

Object rating maps obtained for several thresholds of

heterogeneity and/or area describe image points marked

by vectors of object sequence numbers, which are

analyzed as automatically generated names of objects and

intended for further image recognition.
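The threshold-based selection described in this section can be sketched as a cut through the binary hierarchy. The Python example below is only one possible reading: the dictionary representation of the hierarchy, the toy heterogeneity values, and the rule of ranking the resulting clusters by ascending heterogeneity are all assumptions, not the procedure used to produce Fig. 2.

```python
def cut_hierarchy(root, children, heterogeneity, threshold):
    """Split clusters from the root down while heterogeneity exceeds threshold.

    Returns the clusters of the resulting partition ranked by heterogeneity
    (ascending, ties broken by cluster id).
    """
    regions, stack = [], [root]
    while stack:
        node = stack.pop()
        if node in children and heterogeneity[node] > threshold:
            stack.extend(children[node])   # too heterogeneous: divide in two
        else:
            regions.append(node)           # kept as one region of the partition
    return sorted(regions, key=lambda c: (heterogeneity.get(c, 0.0), c))


# Hypothetical hierarchy over four pixels 0..3: 4 = 0 U 1, 5 = 2 U 3, 6 = 4 U 5.
children = {4: (0, 1), 5: (2, 3), 6: (4, 5)}
# Toy heterogeneity values; parents are not smaller than their children.
heterogeneity = {4: 0.5, 5: 2.0, 6: 100.0}

regions = cut_hierarchy(6, children, heterogeneity, threshold=1.0)
rating = {cluster: rank for rank, cluster in enumerate(regions)}
print(regions, rating)
```

Because the heterogeneity of a cluster is never smaller than that of its parts, one threshold determines the whole partition, and repeating the cut for several thresholds yields a vector of ranks for every pixel.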

6. COMPUTATIONAL COMPLEXITY

The computational complexity of the discussed image processing is determined by the iterative generation of a hierarchy of image approximations that satisfies the condition (3) of order preservation in the dichotomous division of clusters. Although, as indicated, this generation can be performed by the classical Ward's method [4-6], the peculiarities of image data do not allow one to predict the result of the calculations accurately. Images are characterized by repeated minimum values of $\Delta E_{merge}$ at the initial iterations of pixel merging. Therefore, the result of the calculations is affected by the order in which cluster pairs are merged, and the target hierarchical sequence of approximations satisfying (3) is not constructed uniquely.

The original Ward's method is used extremely rarely in image processing because of its high computational complexity, which grows quadratically with the number $N$ of pixels in the image. However, obtaining image approximations in ordered colors (3) can be significantly accelerated by applying Ward's method to parts of an image containing a limited number of pixels. In this case, the processing is divided into three stages.

At the first stage, the image is divided into $g_0$ clusters of pixels, in particular, into $g_0$ segments, which are processed by Ward's method as independent images. At this stage, a hierarchical sequence of approximations is constructed for each cluster.

At the second stage, the quality of the image partition into $g_0$ clusters is optimized with respect to the approximation error $E$. As a result, the image is subdivided into $g_0$ superpixels (elementary pixel clusters), which are characterized by a minimized value of $E$.

At the final stage, Ward's clustering of the superpixels is performed. The processing finishes when all $g_0$ superpixels have merged into one cluster and the complete hierarchical sequence of $N$ image approximations has been calculated.

The total computational complexity of the first and third stages of processing by Ward's method, depending on the number $N$ of image pixels, is estimated in order of magnitude as the function $f(N)$:

$$f(N) \sim \frac{N^2}{g_0} + g_0^2, \qquad (6)$$

which has a minimum at $g_0$:

$$g_0 = \sqrt[3]{\frac{N^2}{2}}. \qquad (7)$$

Then, when the number of superpixels $g_0$ is chosen according to (7), the computational complexity $f(N)$ is expressed as:

$$f(N) \sim \frac{3}{2} N \sqrt[3]{2N}. \qquad (8)$$

Thus, as the number of pixels $N$ increases, the computational complexity $f(N)$ grows as $N^{4/3}$, which ensures the applicability of Ward's method to images of practical size.
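The estimate can be checked numerically. Reading (6) as $f(N) \sim N^2/g_0 + g_0^2$, setting the derivative $-N^2/g_0^2 + 2 g_0$ to zero gives $g_0 = \sqrt[3]{N^2/2}$, and substituting back gives $\frac{3}{2} N \sqrt[3]{2N}$. The Python sketch below evaluates these expressions for a 512 x 512 image; the function names and the image size are assumptions chosen only for illustration.

```python
def ward_cost(n_pixels, g0):
    """Order-of-magnitude cost of stages one and three: N^2 / g0 + g0^2."""
    return n_pixels ** 2 / g0 + g0 ** 2


def optimal_g0(n_pixels):
    """Number of superpixels that minimizes ward_cost: (N^2 / 2)^(1/3)."""
    return (n_pixels ** 2 / 2) ** (1 / 3)


N = 512 * 512                              # hypothetical image size
g0 = optimal_g0(N)
minimum = ward_cost(N, g0)
closed_form = 1.5 * N * (2 * N) ** (1 / 3)  # (3/2) * N * cbrt(2N)

print(round(g0), minimum, closed_form)
print(minimum / N ** 2)                     # fraction of the plain quadratic cost
```

Moving $g_0$ away from the optimum in either direction increases the estimated cost, which is the sense in which the three-stage scheme balances the first and the third stage.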

If the hierarchical Ward's pixel clustering in each of the $g_0$ clusters of pixels at the first stage of processing results in an image approximation for which the minimum increment of the approximation error $\Delta E_{merge}$ is not less than the maximum heterogeneity value $\Delta E_{split}$:

$$\min \Delta E_{merge} \ge \max \Delta E_{split}, \qquad (9)$$

then the second stage of generating ordered colors is omitted. In this case, the estimate (8) describes the computational complexity of the algorithm as a whole. This is so because (9) is the criterion for preserving the color order when the hierarchy of $g_0$ superpixels is joined with the internal hierarchies of the pixel sets of each superpixel.

If criterion (9) is violated for the $g_0$ structured sets of pixels after the first stage, all violations are eliminated at the intermediate stage of superpixel formation by the so-called SI method, which consists in iteratively dividing one cluster while simultaneously merging another pair of clusters [10,11].

The idea of the SI method is straightforward. If condition (9) is violated, a cluster is sought that can be divided into two subclusters with the maximum decrease in the approximation error, $\max \Delta E_{split}$. Along with the division of the found cluster, a pair of clusters with the minimum increment $\min \Delta E_{merge}$ is found and merged.

If, instead of simply merging the clusters, the sequence of enlarged approximations is also updated after the merging, the efficiency of the approximation error minimization increases. This is because, in this case, the maximum drop of the approximation error $\max \Delta E_{split}$ is calculated over the set of maximal values for each cluster.
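A single step of this kind can be illustrated on one-dimensional (grayscale) data. The Python sketch below is a simplified reading, not the implementation of [10,11]: the best division of a cluster is found by exhaustive search over cuts of its sorted values, the cheapest merge is sought after the division, and the step is applied only if it strictly reduces the total error. All names and the sample values are assumptions.

```python
def sse(values):
    """Squared error of replacing the values by their mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(values):
    """Largest error drop from cutting the sorted values in two parts."""
    s = sorted(values)
    best = (0.0, s, [])
    for k in range(1, len(s)):
        drop = sse(s) - sse(s[:k]) - sse(s[k:])
        if drop > best[0]:
            best = (drop, s[:k], s[k:])
    return best


def si_step(clusters):
    """Divide one cluster and merge one pair; None if the error cannot drop."""
    if len(clusters) < 2:
        return None
    # Cluster whose division gives the maximum decrease of the error.
    drop, idx = max((best_split(c)[0], i) for i, c in enumerate(clusters))
    _, left, right = best_split(clusters[idx])
    if not right:
        return None
    candidate = clusters[:idx] + clusters[idx + 1:] + [left, right]
    # Pair of clusters whose merging gives the minimum increase of the error.
    cost, i, j = min(
        (sse(a + b) - sse(a) - sse(b), i, j)
        for i, a in enumerate(candidate)
        for j, b in enumerate(candidate) if i < j
    )
    if cost >= drop:
        return None                        # min merge >= max split: stop
    merged = candidate[i] + candidate[j]
    return [c for k, c in enumerate(candidate) if k not in (i, j)] + [merged]


# Hypothetical poor partition of four gray values into two clusters.
partition = [[0, 1, 100], [101]]
improved = si_step(partition)
print(improved, sum(sse(c) for c in partition), sum(sse(c) for c in improved))
```

The number of clusters stays the same after the step, while the total error decreases; when no division gains more than the cheapest merge costs, the function returns `None`.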

7. DATA STRUCTURE

Fast operations on hierarchically structured clusters of image pixels are supported in terms of trees. In this case, it is more convenient to use Sleator-Tarjan dynamic trees [12,13] than conventional trees, in particular dendrograms.

In the conventional interpretation of a tree, a new node is generated when sets of pixels are merged. In the Sleator-Tarjan interpretation, the merging of pixel sets is described by establishing an arc between the root nodes of trees, so the sets of pixels themselves are ordered in the tree structure (Fig. 3).

Fig. 3 explains the difference in the interpretations of

trees by the example of an image consisting of four

pixels.

A characteristic feature of the software toolbox under development for forming and ordering averaged colors in an image is the use of reversible merging of pixel clusters. In reversible merging, for each cluster containing more than one pixel, the two clusters that were united to obtain it are memorized. Iterative merging of cluster pairs is performed “from pixels” in some calculated order. The merging order is stored and is reversed when the clusters are split.
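The idea of merging by an arc between root nodes and undoing merges in reverse order can be sketched with a plain parent-pointer forest built directly on the pixel indices. This Python example is a strong simplification and not the Sleator-Tarjan dynamic tree structure itself; the class and method names are hypothetical.

```python
class ReversibleForest:
    """Parent-pointer forest over pixel indices with undoable merges."""

    def __init__(self, n_pixels):
        self.parent = list(range(n_pixels))  # parent[p] == p marks a root
        self.history = []                    # roots attached, in merge order

    def root(self, p):
        # No path compression: the arcs must stay intact to be breakable.
        while self.parent[p] != p:
            p = self.parent[p]
        return p

    def merge(self, p, q):
        """Merge the clusters of p and q by an arc between their roots."""
        rp, rq = self.root(p), self.root(q)
        if rp == rq:
            return False
        self.parent[rq] = rp
        self.history.append(rq)
        return True

    def split_last(self):
        """Undo the most recent merge by breaking its arc."""
        rq = self.history.pop()
        self.parent[rq] = rq
        return rq

    def clusters(self):
        groups = {}
        for p in range(len(self.parent)):
            groups.setdefault(self.root(p), []).append(p)
        return sorted(groups.values())


# An image of four pixels merged into one cluster, then split back.
forest = ReversibleForest(4)
forest.merge(0, 1)
forest.merge(2, 3)
forest.merge(0, 2)
merged = forest.clusters()
forest.split_last()
after_one_split = forest.clusters()
forest.split_last()
after_two_splits = forest.clusters()
print(merged, after_one_split, after_two_splits)
```

No extra nodes are created for merged clusters: each cluster is identified by its root pixel, and the stored order of arcs is what makes the merging reversible.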

Fig. 3 – Formation of a binary hierarchy of pixel clusters in terms of conventional trees (above) and dynamic Sleator-Tarjan trees (below).

In addition to reversing the order of cluster merging,

in the process of cluster dividing an arbitrary choice of

one or another cluster to be divided in two is also

supported. In this case, the modification of the initial

order of cluster merging is performed. Modification of the

cluster merging order is supported automatically.

Thus, reversible calculations are not limited to simple

data recovery at any step, as in [14,15], but are

implemented in a generalized sense. And it becomes

possible to reduce the approximation error and improve

the quality of image approximations due to the

combination of operations of merging and division of

pixel clusters in two. For generalized reversible

calculations Sleator-Tarjan dynamic trees (acyclic graphs)

are supplemented with cycles (cyclic graphs) in the form

of linked lists. Sleator-Tarjan dynamic trees together with

linked lists make up a network connecting the image

pixels (Fig. 4).

Fig. 4 illustrates an image matrix of 25 pixels

interconnected by arcs of the Sleator-Tarjan tree, which is

shown by solid lines. The tree has a single root node that

coincides with the first image pixel and treated as an

identifier for combining all the pixels of the image into a

single cluster. When breaking the arc incident to the root

node, the whole tree splits into two trees, and the whole

set of image pixels is divided into two clusters, which are

further considered as separate images. For each node of the tree in Fig. 4, incoming arcs are combined into cycles,

shown by thin dashed lines. Cycles define the order in

which the arcs have been established, which is provided

by an additional indication for each cycle of either a start

or end node by means of pointers, indicated by bold

dashed lines. The merging of clusters is determined by the

establishment of arcs between root nodes, and the inverse

operation of dividing the cluster into two is provided by

the breaking of arcs. When reversing the process of

cluster merging for a given root node, the arcs are broken

in the reverse order.

Fig. 4 – The scheme of reversible cluster merging (label in the figure: $\Delta E_{split}$).

The data structure illustrated by the scheme in Fig. 4 is given by three arrays: an array of dynamic trees, an array of cycles, and an array of pointers to the initial or end nodes of cycles. To speed up calculations, the real data structure includes a number of additional arrays, which are described by less complex schemes.

The Sleator-Tarjan dynamic tree and the cycles in Fig. 4 constitute a typical network in which the incoming arcs of a given node are indexed by the values of heterogeneity (the drop of the approximation error) $\Delta E_{split}$ caused by the division of the pixel cluster specified by this node. The considered network is called dynamic, since it is dynamically rebuilt in the course of computations, and algebraic, since it is obtained by merging trees and cycles according to established rules. In this case, condition (3) provides a hierarchical sequence of image partitions, which is described by a convex sequence of approximation error values. In the network, this condition means that the weights of the arcs decrease weakly monotonically when traversing the cycles from the end element to the initial one and from the root to the periphery of the Sleator-Tarjan dynamic tree (Fig. 4).

To obtain a hierarchical sequence of image partitions corresponding to a convex sequence of approximation error values, the pixel clusters are calculated by breaking, at each step, the arc that provides the maximum value of $\Delta E_{split}$. On the other hand, if the original image is divided into $g$ independent images, each containing more than one pixel, then there are $g$ options for dividing it into $g + 1$ independent images. Thus, in addition to the single binary hierarchy of image partitions for object detection in descending order, a variety of different sequences of partitions of the original image into independent structured images is provided, which represent objects in various combinations.

Unlike conventional trees, Sleator-Tarjan dynamic

trees are built directly on the set of image pixels, without

specifying additional nodes. Contrary to conventional

trees, the binary hierarchy of pixel clusters in terms of

Sleator-Tarjan dynamic trees is specified by an irregular

tree structure. But the visual interpretation of the

calculations is clearly preserved (Fig. 4).

Compared to conventional trees, the main features of

the Sleator-Tarjan dynamic trees are that:

1. Metadata describing the hierarchy of pixel clusters and

the clusters themselves are supported on the original

set of coordinates.

2. To minimize $\sigma$ or $E = 3N\sigma^2$, a generalized mode of reversible computations is implemented, in which the state of the computing system at a given step is not necessarily restored exactly. Due to this fact, minimization of $E$ is carried out not only in the direct but also in the reverse course of calculations.

Thus, Sleator-Tarjan dynamic trees provide all the

capabilities of conventional trees with minimal memory

costs. Thanks to the first of listed properties, Sleator-

Tarjan dynamic trees are quite convenient to provide the

simplest implementation of reversible calculations. In the

developed data structure, dynamic trees (acyclic graphs)

are constructed in several forms and supplemented with

cycles (cyclic graphs). In combination, they form a

dynamic algebraic network that supports high-speed

computation, storage and conversion of millions of sets of

pixels in the computer’s RAM. At the same time, experience shows that mastering the toolbox of dynamic networks in order to solve image processing problems of a particular type is rather difficult for programmers, which hinders its introduction into image processing practice. A feature of the software implementation of the model is the repeated calculation of extreme values of array elements while the data are being modified, which requires time-consuming acceleration of the algorithms by routine and special programming techniques. Therefore, freely available ready-made programs are preferable for implementing the computational model.

8. CONCLUSION

In the field of image recognition, a key problem is the creation of software tools for developing solutions to specific engineering problems, e.g. face recognition, road conditions, estimation of distances to objects from stereo pairs, etc. The urgency of the problem is currently dictated primarily by financial considerations.

Moreover, solving the problem requires a model of the image (a model of objects in the image) implemented in a package of application programs that are freely available and are intended to automate the creation of specific applications.

The development of the required package has long been under way in the United States (the PPAML project of the DARPA agency, 2013-2017 [1]). In Russia, such projects have not yet been carried out, and the theory of image processing is mainly developed by generalizing solutions to specific applied problems [16].

Perhaps, in addition to the inductive development of image processing software toolboxes, special attention should be paid to the deductive development of application development toolboxes [17], including the discussed model of quasi-optimal image approximations, which is being developed at SPIIRAS [7,13].

9. REFERENCES

[1] PPAML (Probabilistic Programming for Advanced

Machine Learning), DARPA project, 2013-2017,

https://galois.com/project/probabilistic-programming-for-

advanced-machine-learning/.

[2] Achanta R., Hemami S., Estrada F., Susstrunk S.

Frequency-tuned salient region detection, Computer

vision and pattern recognition (CVPR), IEEE

conference, 2009. pp. 1597-1604.

[3] Cheng M.M., Mitra N.J., Huang X., Torr P.H., Hu

S.M. Global contrast based salient region detection,

IEEE Transactions on Pattern Analysis and Machine

Intelligence, 2015. Vol. 37. № 3. pp. 569-582.

[4] Ajvazyan S.A., Buhshtaber V.M., Enyukov I.S.,

Meshalkin L.D., Applied Statistics: Classification and

Dimension Reduction, Moscow: Finance and

Statistics, 1989. 607 pp.

[5] Mandel I.D. Cluster Analysis, Moscow: Finance and

Statistics, 1988. 176 pp.

[6] Ward J.H., Jr. Hierarchical grouping to optimize an

objective function, J. Am. Stat. Assoc. 1963. Vol. 58,

Issue 301. pp. 236-244.

[7] Kharinov M.V. Pixel Clustering for Color Image

Segmentation, Programming and Computer Software,

2015, Vol. 41, № 5, pp. 258–266, DOI:

10.1134/S0361768815050047

[8] Otsu N. A Threshold Selection Method from Gray-Level Histograms, IEEE Trans. on Systems, Man, and Cybernetics, January 1979. Vol. SMC-9, № 1. pp. 62-66.

[9] Ping-Sung Liao, Tse-Sheng Chen, Pau-Choo Chung A

Fast Algorithm for Multilevel Thresholding, J. Inf.

Sci. Eng., 2001. № 17, pp. 713-727.

[10] Kharinov M.V., Khanykov I.G. Optimization of

a piecewise constant approximation of a segmented

image, Proceedings of SPIIRAS, Vol. 3(40), 2015.

pp. 183-202.

[11] Kharinov M.V. Reversible Image Merging for

Low-level Machine Vision, arXiv preprint,

arXiv: 1604.03832, 2016. 5 pp.

[12] Nock, R., Nielsen F, Statistical Region Merging,

IEEE Trans. Pattern Anal. Mach. Intell. 2004. 26(11),

pp. 1452–1458.

[13] Kharinov M.V. Data structures of learning

system for automatic image recognition, PhD thesis,

St.-Pt. Institute for Informatics and Automation of

Russian Academy of Sciences, 1993. 20 pp.

[14] Toffoli T. Reversible computing, In International

Colloquium on Automata, Languages, and

Programming, Springer Berlin Heidelberg. 1980. –

pp. 632-644.

[15] Zongxiang Yan Reversible Three-Dimensional

Image Segmentation. US Patent № 20110158503 A1.

2009. 10 pp.

[16] Chochia P.A. Theory and methods of processing

video information on the basis of a two-scale image

model, Dis. Doct. technical. Sciences, Moscow: IPPI

RAS, 2016. 302 pp.

[17] Gurevich I.B., Zhuravlev Y.I. Computer science:

Subject, fundamental research problems,

methodology, structure, and applied problems, Pattern

recognition and image analysis, 2014. Vol. 24, № 3.

pp. 333-346.