Content uploaded by Thomas E. Portegys

Author content

All content in this area was uploaded by Thomas E. Portegys on Apr 16, 2014

Content may be subject to copyright.

910

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 17, NO. 9, SEPTEMBER 1995

A Search Technique for Pattern Recognition

Using Relative Distances

Thomas E. Portegys

Abstract-A

technique for creating and searching a tree of patterns

using relative distances is presented. The search is conducted to find pat-

terns which are nearest neighbors of a given test pattern. The structure

of the tree is such that the search time is proportional to the distance he-

tween the test pattern and its nearest neighbor, which suggests the

anomalous possibility that a larger tree, which can be expected on aver-

age to contain closer neighbors, can be searched faster than a smaller

tree. The technique has been used to recognize OCR digit samples de-

rived from NUT data at an accuracy rate of 97% using a tree of 7,000

patterns.

Index Terms-Pattern recognltlon, optical character recognition, near-

est neighbor, distance metric, branch and bound, NIST digit samples.

I. INTRODLJCTI~N

A technique for creating and searching a tree of patterns using

relative distances is presented. The search is conducted to find pat-

terns which are nearest neighbors of a given test pattern. The struc-

ture of the tree is such that the search time is proportional to the dis-

tance between the test pattern and its nearest neighbor, which sug-

gests the anomalous possibility that a larger tree, which can be ex-

pected on average to contain closer neighbors, can be searched faster

than a smaller tree. The technique has been used to recognize Optical

Character Recognition (OCR) digit samples derived from National

Institute of Standards and Technologies (NISI’) data [I I ] at an accu-

racy rate of 97% using a tree of 7,000 patterns.

The task of recognizing handwritten characters is an active area of

research, especially in neural networks [4], [S], [lo]. This paper is an

investigation of a how a particular memory-based, nearest neighbor

search technique behaves when applied to a large number of patterns.

As a nearest neighbor scheme, the search technique attempts to ad-

dress problems encountered in dealing with many-dimensional ob-

jects 131, [51,[9], which in this case are represented by OCR patterns.

The technique is intended to be applicable not only to character rec-

ognition, but to pattern recognition tasks in general.

The paper is organized as follows: First, a description of the tech-

nique is presented, including the definition of distance, the insertion

algorithm, and the search algorithm. The results of the NIST and

other tests are then presented. Finally, a proposal is made for a device

to improve the speed of searching.

II. DESCRIPTION

A. Distance Formula

The distance between patterns Pl and P2 was chosen to be the

city-block distance:

where

&(A, P2) = ilpl, - p2il

(1)

i=l

N = number of pixel in the patterns

pl , p2 = pixel values

The city-block distance, which is the Hamming distance for binary

pixel values, is possibly not the best choice; it was chosen for the

Manuscript received Dec. 20,1993; revised Mar. 17.1995.

The author is with AT&T Bell Laboratories, Naperville. IL 60566; e-mail:

t.e.pertegys@att.com.

IEEECS Log Number P95091.

value of its fast computation. Any distance formula is suitable if it

conforms to these conditions:

dist(P1, F2) 10

(0

dist(P1, P2) =

dist(P2,

Pl) (ii)

dist(P1, P3) I dist(P1, P2) + dist(P2, F3)

(iii)

Conditions (i) and (ii) hold for the city-block distance due to the ab-

solute value operator. Condition (iii) is the well-known triangle ine-

quality [6]. This condition must hold for patterns containing single

pixels since it must be true for the distances between any three scalar

numbers. It must then hold for multiple pixel patterns since inequal-

ity relationships are preserved when summing.

B. Pattern Insertion

Patterns are stored in a tree structure according to their relative

distances for efficient searching. They are inserted into the tree by a

recursive procedure starting at the root which is the first prospective

parent pattern. A decision is made whether to link the new pattern di-

rectly to the parent pattern or to pass it on to the first child of the par-

ent to which it “fits.” An inserted pattern fits a child pattern if the

distance between it and the child is less than the distance between the

parent and the child multiplied by a constant. If the constant is S, for

example, the node is passed to the child if it is within a radius of half

the distance between the parent and the child patterns. Once passed

to a child, the child becomes the parent for the next iteration of link

checking.

The algorithm used for insertion, written in C, using RADIUS as

the link control constant, is given in Appendix A.

The purpose of the RADIUS constant is to control the degree

(bmshiness) of the tree. At the extremes, when RADIUS is set to 0,

all children are linked to the root pattern; when set to 2, no pattern

will have more than one child, i.e., the tree will be a linked list. For

the NIST tests described later, RADIUS was set to .7, which an aver-

age node degree of 3.6.

One feature of the algorithm is that the order of a pattern’s subtree

branches is important: A new pattern is always inserted in the first tit-

ting branch (even though there can be more than one such branch). This

features allows the search of a tree which contains a duplicate of a given

pattern to proceed with maximum efficiency: The duplicate will always

be found on the first branch which fits the pattern.

Another feature is a tree reorganization procedure which prevents

excessive children from accumulating on a parent pattern. When a

new child is linked to a parent pattern, every other previously in-

serted child is checked to determine if it should be linked to the new

child instead of the parent. For each such child, the child subtree‘is

severed from the parent and each pattern in the subtree is inserted at

the current parent pattern. Note that they are not inserted into the new

child since not all patterns in the subtree necessarily fit there.

C. Pattern Searching

The purpose of the pattern search algorithm is to efficiently find

patterns in the tree which are nearest neighbors of a given test pat-

tern. Searching is done in a best-first manner, that is, the pattern

whose subtree could contain the pattern least distant from the search

pattern is searched next.

The essence of the search decision procedure is illustrated in

Fig. 1. When the search pattern S arrives at pattern A, it must deter-

mine whether to search either pattern B or D next. It does this by cal-

culating the least distances between S and the patterns which can

potentially appear within the subtrees of B and D. These are labeled

B’ and D’, respectively. This distance, called the search distance, is

defined as follows for S and B:

0162-8828/95$04.00 Q 1995 IEEE

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 17, NO. 9, SEPTEMBER 1995

911

seurch(S, B) = dist(b, B) - (dist(A, B) * RADIUS) = dist(b, B’) (2)

RADIUS is the link control constant (see the pattern insertion algo-

rithm). If (2) results in a value less than zero, the search distance is

set to zero. After calculating the search distances for B and D, the

search proceeds to the pattern having the least search distance.

Fig. 1. Search distance.

The search algorithm is given in Appendix B. The SEARCH-

WORK structures contain temporary data and are configured into a

copy of the searched portion of the pattern tree during the search. To

prevent searching the entire tree, a maximum parameter can be pro-

vided. This value of this parameter can be changed dynamically to

allow for more or less searching under various circumstances, such as

time constraints. A list of similar patterns is returned by the algo-

rithm.

The algorithm features a branch and bound capability which al-

lows portions of the search tree to be cut-off during the search: As

less distant patterns are found, greater cut-off is achieved. This is be-

cause a pattern does not have to be searched if its search distance is

greater than the distance of a previously found pattern.

III. DIGIT RECOGNITION TEST RESULTS

The patterns were derived from MST digits O-9, and were obtained

through an internal company source. They were size normalized to fit in

a 20

x

20 pixel box, and were then centered to fit in a 28

x

28 image

using center of gravity. The pixels were scaled to four levels of gray

value. Fig. 2 shows an example of a pattern for the digit ‘6.’ The com-

parisons were done without any translation or rotation.

A. Test 1

For the first test, approximately 7,000 digit samples were inserted

into a tree, and a different set of 1,000 patterns was selected as test

patterns. 97.1% of the test samples were identified correctly, which is

comparable with other pattern recognizers. In addition, on the aver-

age, the nearest neighbor was found after searching 337 patterns in

the search tree, and 6,000 patterns were not searched due to cut-off

conditions.

B. Test 2

The next series of tests were an attempt to simulate a search tree

large enough to presumably contain patterns that are very similar to a

set of search patterns. The question of what happens to the reliability

and extent of the search of such trees in relation to their size was the

focus of these tests.

Fig. 2. Example NIST digit.

The initial search tree contained 1,000 patterns, and these same

1,000 patterns also comprised the search set, meaning the search was

to find identical patterns in the tree. Following this, an additional

1,000 patterns were inserted into the tree, and a random set of 1,000

out of the 2,000 accumulated patterns was selected as the search set.

This procedure was repeated until 10,000 patterns were inserted into

the tree.

Fig. 3 shows the average number of patterns searched before

Ending the identical pattern as a function of search tree size. In all

cases, the identical pattern was found. It can be seen that only about

25 additional patterns were searched as the tree grew by 9,000 pat-

terns. The data also roughly conforms to the function logs.a(x)*lO,

where 3.6 was found to be the average degree of the search tree. This

suggests that the search effort is proportional to some logarithmic

function of the tree size.

100

40

tl

1oal2m3ow4aoo5Lm6ooo7m8m~1~

heafke

Fig. 3. Patterns searched to find identical pattern.

C. Test 3

As a check on the validity of using identical search patterns instead

of similar ones, noise was randomly introduced into the search patterns

such that they were closely similar, but not identical to, patterns in the

tree. ‘lle noise was chosen such that the average distance between a

stored pattern and a modified search pattern was 10% of the average

distance between the stored pattern and a random pattern. Searching a

tree containing 10,000 patterns resulted in a 100% identification rate,

and an average search of 196 patterns to find the most similar. In addi-

tion, 9,065 patterns were not searched due to cut-off conditions.

912

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. VOL. 17. NO. 9, SEPTEMBER 1995

A forced cut-off at 200 patterns was introduced in the above

search to determine the effect of limiting the extent of the search.

This resulted in an identification rate of 96%, with the most similar

pattern being found after an average search of 79 patterns. 9,801

patterns were not searched due to cut-off conditions. The primary

reason for the effectiveness of the limit was that it curtailed the rela-

tively few searches of excessive extent. In most cases, these excessive

searches were not necessary since a correctly identifying pattern was

found early on, even though it was not the most similar one.

D. Test 4

To look further at the effect of increasing the distance between the

search patterns and the stored patterns in a larger tree, 217( 131,072)

patterns were created by generating all possible combinations of the

pixel values of 0 and 10, thus ensuring that the stored patterns have a

minimum distance of 10 between them. These were then added to the

search tree in random order. A set of search patterns was created by

randomly selecting 100 stored patterns. This formed the distance 0

search set. The distance 1 search set was formed by copying the dis-

tance 0 set, randomly selecting a pixel in each pattern and modifying

its value by 1. Distance 2-5 search sets were created in successive

manner.

Fig. 4 plots the number of patterns searched to find the most

similar stored pattern as a function of the distance of the search pat-

terns. The function appears to be a linear one, especially looking at

the distance 2-5 points.

.

0 I 2 3 4 5

dismmcfrom~p

Fig. 4. Patterns searched to find similar pattern in 131072 stored patterns.

IV. A PATTERN COMPARATOR DEVICE

The search mechanism demands the repetitive computation of(l),

which for the 784 pixel NIST pattern size requires a little over 5 ms

per invocation on a SUN SPARC workstation. This is the time to

compare two patterns during a search which involves multiple com-

parisons. In approximately the same time other systems are able to

complete a recognition computation using specialized hardware [ 121.

It seems clear that there is a need for a faster means of computing the

distance. A vector processor would be able to compute the difference

terms in parallel, but the summation of these terms would remain as a

bottleneck.

An optical device may hold the answer as a means of performing

the summation. Consider the device shown in Fig. 5. The difference

terms are transduced into optical signals whose intensities are pro-

portional to the sizes of the differences. A lens is used to focus these

signals onto a detector which is capable of responding in proportion

to the sum of the signal intensities, thus achieving a naturally parallel

summation of the terms.

It is likely that this device could be made to compute a sum in a

few microseconds, given the known performance of optical devices

VI9

171.

Fig. 5. Optical summation.

V. CONCLUSION

Much like chess playing machines, which have gained high rank-

ings largely by relying on speed [2], the findings presented here sug-

gest that a “brute force” approach, in the form of storing a large

number of patterns, may be effective for pattern recognition.