Content uploaded by Berthold K. P. Horn on May 26, 2015.

1 Introduction

In this chapter we discuss what a machine vision system is, and what

tasks it is suited for. We also explore the relationship of machine vision

to other ﬁelds that provide techniques for processing images or symbolic

descriptions of images. Finally, we introduce the particular view of machine

vision exploited in this text and outline the contents of subsequent chapters.

1.1 Machine Vision

Vision is our most powerful sense. It provides us with a remarkable amount

of information about our surroundings and enables us to interact intelli-

gently with the environment, all without direct physical contact. Through

it we learn the positions and identities of objects and the relationships be-

tween them, and we are at a considerable disadvantage if we are deprived of

this sense. It is no wonder that attempts have been made to give machines

a sense of vision almost since the time that digital computers ﬁrst became

generally available.

Vision is also our most complicated sense. The knowledge we have ac-

cumulated about how biological vision systems operate is still fragmentary

and conﬁned mostly to the processing stages directly concerned with sig-

nals from the sensors. What we do know is that biological vision systems

are complex. It is not surprising, then, that many attempts to provide

machines with a sense of vision have ended in failure. Signiﬁcant progress

has been made nevertheless, and today one can ﬁnd vision systems that

successfully deal with a variable environment as parts of machines.

Figure 1-1. A machine vision system can make a robot manipulator much more

versatile by allowing it to deal with variations in part position and orientation. In

some cases simple binary image-processing systems are adequate for this purpose.

Most progress has been made in industrial applications, where the vi-

sual environment can be controlled and the task faced by the machine vision

system is clear-cut. A typical example would be a vision system used to

direct a robot arm to pick parts oﬀ a conveyor belt (ﬁgure 1-1).

Less progress has been made in those areas where computers have been

called upon to extract ill-deﬁned information from images that even people

ﬁnd hard to interpret. This applies particularly to images derived by other

than the usual optical means in the visual spectrum. A typical example of

such a task is the interpretation of X-rays of the human lung.

It is of the nature of research in a diﬃcult area that some early ideas

have to be abandoned and new concepts introduced as time passes. While

frustrating at times, it is part of the excitement of the search for solutions.

Some believed, for example, that understanding the image-formation pro-

cess was not required. Others became too enamored of speciﬁc computing

methods of rather narrow utility. No doubt some of the ideas presented

here will also be revised or abandoned in due course. The ﬁeld is evolving

too rapidly for it to be otherwise.

We cannot at this stage build a “universal” vision system. Instead,

we address ourselves either to systems that perform a particular task in a

controlled environment or to modules that could eventually become part of

a general-purpose system. Naturally, we must also be sensitive to practical

considerations of speed and cost. Because of the enormous volume of data

and the nature of the computations required, it is often diﬃcult to reach a

satisfactory compromise between these factors.

Figure 1-2. The purpose of a machine vision system is to produce a symbolic

description of what is being imaged. This description may then be used to direct

the interaction of a robotic system with its environment. In some sense, the

vision system’s task can be viewed as an inversion of the imaging process.

1.2 Tasks for a Machine Vision System

A machine vision system analyzes images and produces descriptions of what

is imaged (ﬁgure 1-2). These descriptions must capture the aspects of the

objects being imaged that are useful in carrying out some task. Thus we

consider the machine vision system as part of a larger entity that interacts

with the environment. The vision system can be considered an element of

a feedback loop that is concerned with sensing, while other elements are

dedicated to decision making and the implementation of these decisions.

The input to the machine vision system is an image, or several images,

while its output is a description that must satisfy two criteria:

• It must bear some relationship to what is being imaged.

• It must contain all the information needed for the given task.

The ﬁrst criterion ensures that the description depends in some way on the

visual input. The second ensures that the information provided is useful.

An object does not have a unique description; we can conceive of de-

scriptions at many levels of detail and from many points of view. It is

impossible to describe an object completely. Fortunately, we can avoid this

potential philosophical snare by considering the task for which the descrip-

tion is intended. That is, we do not want just any description of what is

imaged, but one that allows us to take appropriate action.

A simple example may help to clarify these ideas. Consider again the

task of picking parts from a conveyor belt. The parts may be randomly

oriented and positioned on the belt. There may be several diﬀerent types of

parts, with each to be loaded into a diﬀerent ﬁxture. The vision system is

provided with images of the objects as they are transported past a camera

mounted above the belt. The descriptions that the system has to produce

in this case are simple. It need only give the position, orientation, and

type of each object. The description could be just a few numbers. In other

situations an elaborate symbolic description may be called for.
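For the conveyor-belt example, the description really can be just a few numbers per part. A minimal sketch in Python may make this concrete; the type name and fields here are illustrative, not taken from the text:

```python
from dataclasses import dataclass

@dataclass
class PartDescription:
    """One detected part: everything the arm controller needs to know."""
    part_type: str   # determines which fixture the part goes into
    x: float         # position on the belt, in belt coordinates
    y: float
    theta: float     # orientation about the vertical axis, in radians

# The vision system's whole output for one image might be a short list:
parts = [PartDescription("bracket", 0.42, 0.10, 1.57),
         PartDescription("flange", 0.80, 0.25, 0.00)]
```

The point is that the output is symbolic and task-directed, not another image.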

There are cases where the feedback loop is not closed through a ma-

chine, but the description is provided as output to be interpreted by a

human. The two criteria introduced above must still be satisﬁed, but it

is harder in this case to determine whether the system was successful in

solving the vision problem presented.

1.3 Relation to Other Fields

Machine vision is closely allied with three ﬁelds (ﬁgure 1-3):

• Image processing.

• Pattern classiﬁcation.

• Scene analysis.

Figure 1-3. Three ancestor paradigms of machine vision are image processing,

pattern classiﬁcation, and scene analysis. Each contributes useful techniques, but

none is central to the problem of developing symbolic descriptions from images.

Image processing is largely concerned with the generation of new im-

ages from existing images. Most of the techniques used come from linear

systems theory. The new image may have noise suppressed, blurring re-

moved, or edges accentuated. The result is, however, still an image, usually

meant to be interpreted by a person. As we shall see, some of the tech-

niques of image processing are useful for understanding the limitations of

image-forming systems and for designing preprocessing modules for ma-

chine vision.
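The flavor of image processing in this sense can be conveyed by one of the simplest linear operations, local averaging, which suppresses noise at the cost of some blurring. A sketch in plain Python, with an image represented as a nested list of brightness values (a real system would of course use optimized array code):

```python
def box_filter(img):
    """Replace each interior pixel by the mean of its 3x3 neighborhood.
    Border pixels are left unchanged in this simplified version."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = sum(img[i + di][j + dj]
                            for di in (-1, 0, 1)
                            for dj in (-1, 0, 1)) / 9.0
    return out
```

The input is an image and the output is again an image, meant for further viewing or processing, which is exactly what distinguishes image processing from machine vision.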

Pattern classiﬁcation has as its main thrust the classiﬁcation of a “pat-

tern,” usually given as a set of numbers representing measurements of an

object, such as height and weight. Although the input to a classiﬁer is not

an image, the techniques of pattern classiﬁcation are at times useful for

analyzing the results produced by a machine vision system. To recognize

an object means to assign it to one of a number of known classes. Note,

however, that recognition is only one of many tasks faced by the machine

vision system. Researchers concerned with classiﬁcation have created sim-

ple methods for obtaining measurements from images. These techniques,

however, usually treat the images as a two-dimensional pattern of bright-

ness and cannot deal with objects presented in an arbitrary attitude.
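The character of such classification methods can be suggested by a nearest-neighbor sketch: a "pattern" of measurements is assigned the class of the closest labeled example in measurement space. This toy version is only an illustration (the example measurements and class names are invented); classification is treated properly in chapter 14:

```python
def nearest_class(sample, labeled_examples):
    """Assign a measurement vector the class of its nearest labeled
    example, using squared Euclidean distance in measurement space."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(labeled_examples, key=lambda ex: dist2(sample, ex[0]))
    return best[1]

# Hypothetical (length, width) measurements labeled with two classes:
examples = [((2.0, 1.0), "bolt"), ((10.0, 8.0), "bracket")]
```

Note that the classifier never sees an image, only a short vector of numbers extracted from one.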

Figure 1-4. In scene analysis, a low-level symbolic description, such as a line

drawing, is used to develop a high-level symbolic description. The result may

contain information about the spatial relationships between objects, their shapes,

and their identities.

Scene analysis is concerned with the transformation of simple descrip-

tions, obtained directly from images, into more elaborate ones, in a form

more useful for a particular task. A classic illustration of this is the in-

terpretation of line drawings (ﬁgure 1-4). Here a description of the image

of a set of polyhedra is given in the form of a collection of line segments.

Before these can be used, we must ﬁgure out which regions bounded by

the lines belong together to form objects. We will also want to know how

objects support one another. In this way a complex symbolic description

of the image can be obtained from the simple one. Note that here we do

not start with an image, and thus once again do not address the central

issue of machine vision:

• Generating a symbolic description from one or more images.

1.4 Outline of What Is to Come

The generation of descriptions from images can often be conveniently bro-

ken down into two stages. The ﬁrst stage produces a sketch, a detailed but

undigested description. Later stages produce more parsimonious, struc-

tured descriptions suitable for decision making. Processing in the ﬁrst

stage will be referred to as image analysis, while subsequent processing of

the results will be called scene analysis. The division is somewhat arbi-

trary, except insofar as image analysis starts with an image, while scene

analysis begins with a sketch. The ﬁrst thirteen chapters of the book are

concerned with image analysis, also referred to as early vision, while the

remaining ﬁve chapters are devoted to scene analysis.

The development of methods for machine vision requires some under-

standing of how the data to be processed are generated. For this reason we

start by discussing image formation and image sensing in chapter 2. There

we also treat measurement noise and introduce the concept of convolution.

Figure 1-5. Binary images have only two brightness levels: black and white.

While restricted in application, they are of interest because they are particularly

easy to process.

The easiest images to analyze are those that allow a simple separation

of an “object” from a “background.” These binary images will be treated

ﬁrst (ﬁgure 1-5). Some industrial problems can be tackled by methods that

use such images, but this usually requires careful control of the lighting.

There exists a fairly complete theory of what can and cannot be accom-

plished with binary images. This is in contrast to the more general case of

gray-level images. It is known, for example, that binary image techniques

are useful only when possible changes in the attitude of the object are con-

ﬁned to rotations in a plane parallel to the image plane. Binary image

processing is covered in chapters 3 and 4.
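A taste of what chapters 3 and 4 develop: with controlled lighting, a gray-level image can be thresholded into a binary one, and simple moment sums over the object region then yield its position. This sketch assumes a single bright object on a dark background; the full treatment in those chapters also recovers orientation:

```python
def threshold(image, t):
    """Binary image: 1 where brightness exceeds t ('object'), else 0."""
    return [[1 if v > t else 0 for v in row] for row in image]

def area_and_centroid(binary):
    """Zeroth and first moments of the object region: its area and its
    center of area (a natural definition of the object's 'position')."""
    area = sum_x = sum_y = 0
    for y, row in enumerate(binary):
        for x, v in enumerate(row):
            if v:
                area += 1
                sum_x += x
                sum_y += y
    return area, (sum_x / area, sum_y / area)
```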

Many image-analysis techniques are meant to be applied to regions of

an image corresponding to single objects, rather than to the whole image.

Because typically many surfaces in the environment are imaged together,

the image must be divided up into regions corresponding to separate entities

in the environment before such techniques can be applied. The required

segmentation of images is discussed in chapter 5.

In chapters 6 and 7 we consider the transformation of gray-level im-

ages into new gray-level images by means of linear operations. The usual

intent of such manipulations is to reduce noise, accentuate some aspect of

the image, or reduce its dynamic range. Subsequent stages of the machine

vision system may ﬁnd the processed images easier to analyze. Such ﬁlter-

ing methods are often exploited in edge-detection systems as preprocessing

steps.

Figure 1-6. In order to use images to recover information about the world, we

need to understand image formation. In some cases the image formation process

can be inverted to extract estimates of the permanent properties of the surfaces

of the objects being imaged.

Complementary to image segmentation is edge ﬁnding, discussed in

chapter 8. Often the interesting events in a scene, such as a boundary where

one object occludes another, lead to discontinuities in image brightness or

in brightness gradient. Edge-ﬁnding techniques locate such features. At

this point, we begin to emphasize the idea that an important aspect of

machine vision is the estimation of properties of the surfaces being imaged.

In chapter 9 the estimation of surface reﬂectance and color is addressed

and found to be a surprisingly diﬃcult task.
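The simplest edge-finding idea can be sketched with first differences: approximate the brightness gradient at each pixel and flag locations where its magnitude is large. Chapter 8 develops far better operators; this toy version serves only to fix the idea:

```python
def gradient_magnitude(image):
    """Approximate the brightness gradient by first differences;
    large output values mark candidate edge locations."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h - 1):
        for x in range(w - 1):
            gx = image[y][x + 1] - image[y][x]   # horizontal difference
            gy = image[y + 1][x] - image[y][x]   # vertical difference
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out
```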

Finally, we confront the central issue of machine vision: the generation

of a description of the world from one or more images. A point of view

that one might espouse is that the purpose of the machine vision system is

to invert the projection operation performed by image formation. This is

not quite correct, since we want not to recover the world being imaged, but

to obtain a symbolic description. Still, this notion leads us to study image

formation carefully (ﬁgure 1-6). The way light is reﬂected from a surface

becomes a central issue. The apparent brightness of a surface depends on

three factors:

Figure 1-7. The appearance of the image of an object is greatly inﬂuenced by

the reﬂectance properties of its surface. Perfectly matte and perfectly specular

surfaces present two extreme cases.

Figure 1-8. The appearance of the image of a scene depends a lot on the lighting

conditions. To recover information about the world from images we need to un-

derstand how the brightness patterns in the image are determined by the shapes

of surfaces, their reﬂectance properties, and the distribution of light sources.

• Microstructure of the surface.

• Distribution of the incident light.

• Orientation of the surface with respect to the viewer and the light sources.

In ﬁgure 1-7 we see images of two spherical surfaces, one covered with a

paint that has a matte or diﬀuse reﬂectance, the other metallic, giving rise

to specular reﬂections. In the second case we see a virtual image of the

world around the spherical object. It is clear that the microstructure of

the surface is important in determining image brightness.

Figure 1-8 shows three views of Place Ville-Marie in Montreal. The

three pictures were taken from the same hotel window, but under diﬀerent

lighting conditions. Again, we easily recognize that the same objects are

depicted, but there is a tremendous diﬀerence in brightness patterns be-

tween the images taken with direct solar illumination and those obtained

under a cloudy sky.

In chapters 10 and 11 we discuss these issues and apply the understand-

ing developed to the recovery of surface shape from one or more images.

Representations for the shape of a surface are also introduced there. In

developing methods for recovering surface shape, we often consider the

surface broken up into tiny patches, each of which can be treated as if it

were planar. Light reﬂection from such a planar patch is governed by three

angles if it is illuminated by a point source (ﬁgure 1-9).
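In the simplest case, an ideal matte (Lambertian) patch lit by a distant point source, brightness depends only on the incident angle i: it is proportional to cos i, the dot product of the unit surface normal N and the unit direction S toward the source. A sketch of this one special case (the albedo parameter and the clamping of self-shadowed patches to zero are conventional additions, not from this chapter):

```python
def dot(a, b):
    """Dot product of two vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))

def matte_brightness(n, s, albedo=1.0):
    """Brightness of an ideal matte patch: albedo * cos i, where i is
    the incident angle between the unit normal n and the unit direction
    s toward the source; clamped to zero for self-shadowed patches."""
    return albedo * max(0.0, dot(n, s))
```

A patch facing the source directly is brightest; one whose normal is perpendicular to the source direction receives no light at all.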

The same systematic approach, based on an analysis of image bright-

ness, is used in chapters 12 and 13 to recover information from time-varying

images and images taken by cameras separated in space. Surface shape,

object motion, and other information can be recovered from images us-

ing the methods developed in these two chapters. The relations between

various coordinate systems, either viewer-centered or object-centered, are

uncovered in the discussion of photogrammetry in chapter 13, along with

an analysis of the binocular stereo problem. In using a machine vision

system to guide a mechanical manipulator, measurements in the camera’s

coordinate system must be transformed into the coordinate system of the

robot arm. This topic naturally ﬁts into the discussion of this chapter also.

At this point, we turn from image analysis to scene analysis. Chapter

14 introduces methods for classifying objects based on feature measure-

ments. Line drawings obtained from images of polyhedral objects are an-

alyzed in chapter 15 in order to recover the spatial relationships between

the objects.

Figure 1-9. The reﬂection of light from a point source by a patch of an object’s

surface is governed by three angles: the incident angle i, the emittance angle

e, and the phase angle g. Here N is the direction perpendicular, or normal, to

the surface, S the direction to the light source, and V the direction toward the

viewer.

The issue of how to represent visually acquired information is of great

importance. In chapter 16 we develop in detail the extended Gaussian

image, a representation for surface shape that is useful in recognition and

allows us to determine the attitude of an object in space. Image sequences

can be exploited to recover the motion of the camera. As a by-product,

we obtain the shapes of the surfaces being imaged. This forms the topic

of chapter 17. (The reader may wonder why this chapter does not directly

follow the one on optical ﬂow. The reason is that it does not deal with image

analysis and so logically belongs in the part of the book dedicated to scene

analysis.) Finally, in chapter 18 we bring together many of the concepts

developed in this book to build a complete hand–eye system. A robot

arm is guided to pick up one object after another out of a pile of objects.

Visual input provides the system with information about the positions of

the objects and their attitudes in space. In this chapter we introduce some

new topics, such as methods for representing rotations in three-dimensional

space, and discuss some of the diﬃculties encountered in building a real-

world system.

Throughout the book we start by discussing elementary issues and well-

established techniques, progress to more advanced topics, and close with

less certain matters and subjects of current research. In the past, machine

vision may have appeared to be a collection of assorted heuristics and ad

hoc tricks. To give the material coherence we maintain a particular point

of view here:

• Machine vision should be based on a thorough understanding of image formation.

This emphasis allows us to derive mathematical models of the image-

analysis process. Algorithms for recovering a description of the imaged

world can then be based on these mathematical models.

An approach based on the analysis of image formation is, of course,

not the only one possible for machine vision. One might start instead from

existing biological vision systems. Artiﬁcial systems would then be based

on detailed knowledge of natural systems, provided these can be adequately

characterized. We shall occasionally discuss alternate approaches to given

problems in machine vision, but to avoid confusion we will not dwell on

them.

Figure 1-10. In many cases, the development of a symbolic description of a

scene from one or more images can be broken down conveniently into two stages.

The ﬁrst stage is largely governed by our understanding of the image-formation

process; the second depends more on the needs of the intended application.

The transformation from image to sketch appears to be governed

mostly by what is in the image and what information we can extract di-

rectly from it (ﬁgure 1-10). The transformation from a crude sketch to

a full symbolic description, on the other hand, is mostly governed by the

need to generate information in a form that will be of use in the intended

application.

1.5 References

Each chapter will have a section providing pointers to background reading,

further explanation of the concepts introduced in that chapter, and recent

results in the area. Books will be listed ﬁrst, complete with authors and

titles. Papers in journals, conference proceedings, and internal reports of

universities and research laboratories are listed after the books, but without

title. Please note that the bibliography has two sections: the ﬁrst for books,

the second for papers.

There are now numerous books on the subject of machine vision. Of

these, Computer Vision by Ballard & Brown [1982] is remarkable for its

broad coverage. Also notable are Digital Picture Processing by Rosenfeld

& Kak [1982], Computer Image Processing and Recognition by Hall [1979],

and Machine Perception [1982], a short book by Nevatia. A recent addition

is Vision in Man and Machine [1985] by Levine, a book that has a biological

vision point of view and emphasizes applications to biomedical problems.

Many books concentrate on the image-processing side of things, such

as Computer Techniques in Image Processing by Andrews [1970], Digi-

tal Image Processing by Gonzalez & Wintz [1977], and two books dealing

with the processing of images obtained by cameras in space: Digital Image

Processing by Castleman [1979] and Digital Image Processing: A Systems

Approach by Green [1983]. The ﬁrst few chapters of Digital Picture Pro-

cessing by Rosenfeld & Kak [1982] also provide an excellent introduction

to the subject. The classic reference on image processing is still Pratt’s

encyclopedic Digital Image Processing [1978].

One of the earliest signiﬁcant books in this ﬁeld, Pattern Classiﬁcation

and Scene Analysis by Duda & Hart [1973], contains more on the sub-

ject of pattern classiﬁcation than one typically needs to know. Artiﬁcial

Intelligence by Winston [1984] has an easy-to-read, broad-brush chapter

on machine vision that makes the connection between that subject and

artiﬁcial intelligence.

A number of edited books, containing contributions from several re-

searchers in the ﬁeld, have appeared in the last ten years. Early on there

was The Psychology of Computer Vision, edited by Winston [1975], now

out of print. Then came Digital Picture Analysis, edited by Rosenfeld

[1976], and Computer Vision Systems, edited by Hanson & Riseman [1978].

Several papers on machine vision can be found in volume 2 of Artiﬁcial In-

telligence: An MIT Perspective, edited by Winston & Brown [1979]. The

collection Structured Computer Vision: Machine Perception through Hier-

archical Computation Structures, edited by Tanimoto & Klinger, was pub-

lished in 1980. Finally there appeared the ﬁne assemblage of papers Image

Understanding 1984, edited by Ullman & Richards [1984].

The papers presented at a number of conferences have also been col-

lected in book form. Gardner was the editor of a book published in 1979

called Machine-aided Image Analysis, 1978. Applications of machine vi-

sion to robotics are explored in Computer Vision and Sensor-Based Robots,

edited by Dodd & Rossol [1979], and in Robot Vision, edited by Pugh [1983].

Stucki edited Advances in Digital Image Processing: Theory, Application,

Implementation [1979], a book containing papers presented at a meeting

organized by IBM. The notes for a course organized by Faugeras appeared

in Fundamentals in Computer Vision [1983].

Because many of the key papers in the ﬁeld were not easily accessible,

a number of collections have appeared, including three published by IEEE

Press, namely Computer Methods in Image Analysis, edited by Aggarwal,

Duda, & Rosenfeld [1977], Digital Image Processing, edited by Andrews

[1978], and Digital Image Processing for Remote Sensing, edited by Bern-

stein [1978].

The IEEE Computer Society’s publication Computer brought out a

special issue on image processing in August 1977, the Proceedings of the

IEEE devoted the May 1979 issue to pattern recognition and image pro-

cessing, and Computer produced a special issue on machine perception for

industrial applications in May 1980. A special issue (Volume 17) of the

journal Artiﬁcial Intelligence was published in book form under the title

Computer Vision, edited by Brady [1981]. The Institute of Electronics and

Communication Engineers of Japan produced a special issue (Volume J68-

D, Number 4) on machine vision work in Japan in April 1985 (in Japanese).

Not much is said in this book about biological vision systems. They

provide us, on the one hand, with reassuring existence proofs and, on the

other, with optical illusions. These startling eﬀects may someday prove

to be keys with which we can unlock the secrets of biological vision sys-

tems. A computational theory of their function is beginning to emerge, to

a great extent due to the pioneering work of a single man, David Marr.

His approach is documented in the classic book Vision: A Computational

Investigation into the Human Representation and Processing of Visual In-

formation [1982].

Human vision has, of course, always been a subject of intense curios-

ity, and there is a vast literature on the subject. Just a few books will

be mentioned here. Gregory has provided popular accounts of the subject

in Eye and Brain [1966] and The Intelligent Eye [1970]. Three books by

Gibson—The Perception of the Visual World [1950], The Senses Consid-

ered as Perceptual Systems [1966], and The Ecological Approach to Visual

Perception [1979]—are noteworthy for providing a fresh approach to the

problem. Cornsweet’s Visual Perception [1971] and The Psychology of Vi-

sual Perception by Haber & Hershenson [1973] are of interest also. The

work of Julesz has been very inﬂuential, particularly in the area of binoc-

ular stereo, as documented in Foundations of Cyclopean Perception [1971].

More recently, in the wonderfully illustrated book Seeing, Frisby [1982] has

been able to show the crosscurrents between work on machine vision and

work on biological vision systems. For another point of view see Perception

by Rock [1984].

Twenty years ago, papers on machine vision were few in number and

scattered widely. Since then a number of journals have become preferred

repositories for new research results. In fact, the journal Computer Graph-

ics and Image Processing, published by Academic Press, had to change

its name to Computer Vision, Graphics and Image Processing (CVGIP)

when it became the standard place to send papers in this ﬁeld for review.

More recently, a new special-interest group of the Institute of Electrical and

Electronic Engineers (IEEE) started publishing the Transactions on Pat-

tern Analysis and Machine Intelligence (PAMI). Other journals, such as

Artiﬁcial Intelligence, published by North-Holland, and Robotics Research,

published by MIT Press, also contain articles on machine vision. There are

several journals devoted to related topics, such as pattern classiﬁcation.

Some research results ﬁrst see the light of day at an “Image Under-

standing Workshop” sponsored by the Defense Advanced Research Projects

Agency (DARPA). Proceedings of these workshops are published by Science

Applications Incorporated, McLean, Virginia, and are available through

the Defense Technical Information Center (DTIC) in Alexandria, Virginia.

Many of these papers are later submitted, possibly after revision and ex-

tension, to be reviewed for publication in one of the journals mentioned

above.

The Computer Society of the IEEE organizes annual conferences on

Computer Vision and Pattern Recognition (CVPR) and publishes their

proceedings. Also of interest are the proceedings of the biennial International Joint Conference on Artiﬁcial Intelligence (IJCAI) and the national

conferences organized by the American Association for Artiﬁcial Intelli-

gence (AAAI), usually in the years in between.

The thorough annual surveys by Rosenfeld [1972, 1974, 1975, 1976,

1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984a, 1985] in Computer Vi-

sion, Graphics and Image Processing are extremely valuable and make it

possible to be less than complete in providing references here. The most

recent survey contained 1,252 entries! There have been many analyses of

the state of the ﬁeld or of particular views of the ﬁeld. An early survey

of image processing is that of Huang, Schreiber, & Tretiak [1971]. While

not really a survey, the inﬂuential paper of Barrow & Tenenbaum [1978]

presents the now prevailing view that machine vision is concerned with the

process of recovering information about the surfaces being imaged. More

recent surveys of machine vision by Marr [1980], Barrow & Tenenbaum

[1981a], Poggio [1984], and Rosenfeld [1984b] are recommended particu-

larly. Another paper that has been inﬂuential is that by Binford [1981].

Once past the hurdles of early vision, the representation of information

and the modeling of objects and the physical interaction between them

become important. We touch upon these issues in the later chapters of this

book. For more information see, for example, Brooks [1981] and Binford

[1982].

There are many papers on the application of machine vision to in-

dustrial problems (although some of the work with the highest payoﬀ is

likely not to have been published in the open literature). Several papers in

Robotics Research: The First International Symposium, edited by Brady

& Paul [1984], deal with this topic. Chin [1982] and Chin & Harlow [1982]

have surveyed the automation of visual inspection.

The inspection of printed circuit boards, both naked and stuﬀed, is a

topic of great interest, since there are many boards to be inspected and since

it is not a very pleasant job for people, nor one that they are particularly

good at. For examples of work in this area, see Ejiri et al. [1973], Daniels-

son & Kruse [1979], Danielsson [1980], and Hara, Akiyama, & Karasaki

[1983]. There is a similar demand for such techniques in the manufacture

of integrated circuits. Masks are simple black-and-white patterns, and

their inspection has not been too diﬃcult to automate. The inspection of

integrated circuit wafers is another matter; see, for example, Hsieh & Fu

[1980].

Machine vision has been used in automated alignment. See Horn

[1975b], Kashioka, Ejiri, & Sakamoto [1976], and Baird [1978] for ex-

amples in semiconductor manufacturing. Industrial robots are regularly

guided using visually obtained information about the position and orienta-

tion of parts. Many such systems use binary image-processing techniques,

although some are more sophisticated. See, for example, Yachida & Tsuji

[1977], Gonzalez & Safabakhsh [1982], and Horn & Ikeuchi [1984]. These

techniques will not ﬁnd widespread application if the user has to program

each application in a standard programming language. Some attempts have

been made to provide tools speciﬁcally suited to vision applications; see,

for example, Lavin & Lieberman [1982].

Papers on the application of machine vision methods to the vectoriza-

tion of line drawings are mentioned at the end of chapter 4; references on

character recognition may be found at the end of chapter 14.

1.6 Exercises

1-1 Explain in what sense one can consider pattern classiﬁcation, image pro-

cessing, and scene analysis as “ancestor paradigms” to machine vision. In what

way do the methods from each of these disciplines contribute to machine vision?

In what way are the problems addressed by machine vision diﬀerent from those

to which these methods apply?

2 Image Formation & Image Sensing

In this chapter we explore how images are formed and how they are

sensed by a computer. Understanding image formation is a prerequisite for

full understanding of the methods for recovering information from images.

In analyzing the process by which a three-dimensional world is projected

onto a two-dimensional image plane, we uncover the two key questions of

image formation:

•What determines where the image of some point will appear?

•What determines how bright the image of some surface will be?

The answers to these two questions require knowledge of image projection

and image radiometry, topics that will be discussed in the context of simple

lens systems.

A crucial notion in the study of image formation is that we live in a

very special visual world. It has particular features that make it possi-

ble to recover information about the three-dimensional world from one or

more two-dimensional images. We discuss this issue and point out imaging situations where these special constraints do not apply, and where it is consequently much harder to extract information from images.


We also study the basic mechanism of typical image sensors, and how

information in diﬀerent spectral bands may be obtained and processed.

Following a brief discussion of color, the chapter closes with a discussion of

noise and reviews some concepts from the ﬁelds of probability and statis-

tics. This is a convenient point to introduce convolution in one dimension,

an idea that will be exploited later in its two-dimensional generalization.

Readers familiar with these concepts may omit these sections without loss

of continuity. The chapter concludes with a discussion of the need for

quantization of brightness measurements and for tessellations of the image

plane.

2.1 Two Aspects of Image Formation

Before we can analyze an image, we must know how it is formed. An image

is a two-dimensional pattern of brightness. How this pattern is produced

in an optical image-forming system is best studied in two parts: ﬁrst, we

need to ﬁnd the geometric correspondence between points in the scene and

points in the image; then we must ﬁgure out what determines the brightness

at a particular point in the image.

2.1.1 Perspective Projection

Consider an ideal pinhole at a ﬁxed distance in front of an image plane

(ﬁgure 2-1). Assume that an enclosure is provided so that only light coming

through the pinhole can reach the image plane. Since light travels along

straight lines, each point in the image corresponds to a particular direction

deﬁned by a ray from that point through the pinhole. Thus we have the

familiar perspective projection.

We deﬁne the optical axis, in this simple case, to be the perpendic-

ular from the pinhole to the image plane. Now we can introduce a con-

venient Cartesian coordinate system with the origin at the pinhole and

z-axis aligned with the optical axis and pointing toward the image. With

this choice of orientation, the z components of the coordinates of points

in front of the camera are negative. We use this convention, despite the

drawback, because it gives us a convenient right-hand coordinate system

(with the x-axis to the right and the y-axis upward).

We would like to compute where the image P′ of the point P on some object in front of the camera will appear (figure 2-1). We assume that no other object lies on the ray from P to the pinhole O. Let r = (x, y, z)^T be the vector connecting O to P, and r′ = (x′, y′, f′)^T be the vector connecting O to P′. (As explained in the appendix, vectors will be denoted by boldface letters. We commonly deal with column vectors, and so must take the transpose, indicated by the superscript T, when we want to write them in terms of the equivalent row vectors.)

Figure 2-1. A pinhole camera produces an image that is a perspective projection

of the world. It is convenient to use a coordinate system in which the xy-plane is

parallel to the image plane, and the origin is at the pinhole O. The z-axis then

lies along the optical axis.

Here f′ is the distance of the image plane from the pinhole, while x′ and y′ are the coordinates of the point P′ in the image plane. The two vectors r and r′ are collinear and differ only by a (negative) scale factor. If the ray connecting P to P′ makes an angle α with the optical axis, then the length of r is just

r = −z sec α = −(r · ẑ) sec α,

where ẑ is the unit vector along the optical axis. (Remember that z is negative for a point in front of the camera.)

The length of r′ is

r′ = f′ sec α,

and so

(1/f′) r′ = (1/(r · ẑ)) r.

In component form this can be written as

x′/f′ = x/z   and   y′/f′ = y/z.

Sometimes image coordinates are normalized by dividing x′ and y′ by f′ in order to simplify the projection equations.
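As a concrete illustration, the projection equations x′/f′ = x/z and y′/f′ = y/z can be sketched in a few lines of Python. This is our own sketch, not from the text; the function name and parameters are invented for illustration.

```python
# Illustrative sketch (not from the text; names are our own): perspective
# projection under the book's convention that the pinhole is at the origin
# and scene points in front of the camera have negative z.

def perspective_project(x, y, z, f_prime):
    """Map a scene point (x, y, z), with z < 0, to image coordinates
    (x', y') using x'/f' = x/z and y'/f' = y/z."""
    if z >= 0:
        raise ValueError("points in front of the camera have z < 0")
    return f_prime * x / z, f_prime * y / z

# A point 10 units in front of the camera and 1 unit to the right:
xp, yp = perspective_project(1.0, 0.0, -10.0, f_prime=0.05)
print(xp, yp)   # note the negative x': the image is inverted
```

Note how the negative scale factor mentioned above appears directly: dividing by the negative z inverts the image.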

Figure 2-2. When the scene depth is small relative to the average distance

from the camera, perspective projection can be approximated by orthographic

projection. In orthographic projection, rays from a point in the scene are traced

parallel to the projection direction until they intercept the image plane.

2.1.2 Orthographic Projection

Suppose we form the image of a plane that lies parallel to the image plane at z = z₀. Then we can define m, the (lateral) magnification, as the ratio of the distance between two points measured in the image to the distance between the corresponding points on the plane. Consider a small interval (δx, δy, 0)^T on the plane and the corresponding small interval (δx′, δy′, 0)^T in the image. Then

m = √((δx′)² + (δy′)²) / √((δx)² + (δy)²) = f′/(−z₀),

where −z₀ is the distance of the plane from the pinhole. The magnification is the same for all points in the plane. (Note that m < 1, except in the case of microscopic imaging.)

A small object at an average distance −z₀ will give rise to an image that is magnified by m, provided that the variation in z over its visible surface is not significant compared to −z₀. The area occupied by the image of an object is proportional to m². Objects at different distances from the imaging system will, of course, be imaged with different magnifications.

Let the depth range of a scene be the range of distances of surfaces from

the camera. The magniﬁcation is approximately constant when the depth

range of the scene being imaged is small relative to the average distance of

the surfaces from the camera. In this case we can simplify the projection

equations to read

x=−mx and y=−my,

where m=f/(−z0) and −z0is the average value of −z. Often the scaling

factor mis set to 1 or −1 for convenience. Then we can further simplify

the equations to become

x=xand y=y.

This orthographic projection (ﬁgure 2-2), can be modeled by rays parallel

to the optical axis (rather than ones passing through the origin). The

diﬀerence between perspective and orthographic projection is small when

the distance to the scene is much larger than the variation in distance

among objects in the scene.
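The quality of the orthographic approximation can be checked numerically. The following sketch (our own illustration, with made-up numbers) compares the perspective projection x′ = f′x/z with the orthographic form x′ = −mx for a scene whose depth range is small relative to the average distance −z₀:

```python
# Sketch (our own illustration): when the depth range is small compared
# to the average distance -z0, perspective projection x' = f'x/z is well
# approximated by the orthographic form x' = -m*x with m = f'/(-z0).

f_prime = 0.05
z0 = -100.0                     # average depth of the scene
m = f_prime / (-z0)             # lateral magnification

errors = []
for z in (-99.0, -100.0, -101.0):   # depth range small relative to |z0|
    x = 1.0
    x_perspective = f_prime * x / z
    x_orthographic = -m * x
    errors.append(abs(x_perspective - x_orthographic))
    print(z, x_perspective, x_orthographic)
```

For a 2% depth range the two projections agree to about 1% of the image coordinate; widening the depth range makes the discrepancy grow.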

The ﬁeld of view of an imaging system is the angle of the cone of

directions encompassed by the scene that is being imaged. This cone of

directions clearly has the same shape and size as the cone obtained by

connecting the edge of the image plane to the center of projection. A “normal” lens has a field of view of perhaps 25° by 40°. A telephoto lens is one that has a long focal length relative to the image size and thus a narrow field of view. Conversely, a wide-angle lens has a short focal length relative to the image size and thus a wide field of view. A rough rule of thumb is that perspective effects are significant when a wide-angle lens is used, while images obtained using a telephoto lens tend to approximate orthographic projection. We shall show in exercise 2-11 that this rule is not exact.

Figure 2-3. (a) Irradiance is the power per unit area falling on a surface. (b)

Radiance is the power emitted per unit area into a cone of directions having unit

solid angle. The term brightness is used informally for both concepts.

2.2 Brightness

The more diﬃcult, and more interesting, question of image formation is

what determines the brightness at a particular point in the image. Bright-

ness is an informal term used to refer to at least two diﬀerent concepts:

image brightness and scene brightness. In the image, brightness is related

to energy ﬂux incident on the image plane and can be measured in a num-

ber of ways. Here we introduce the term irradiance to replace the informal

term image brightness. Irradiance is the power per unit area (W·m−2—

watts per square meter) of radiant energy falling on a surface (ﬁgure 2-3a).

In the ﬁgure, Edenotes the irradiance, while δP is the power of the radiant

energy falling on the inﬁnitesimal surface patch of area δA. The blackening

of a ﬁlm in a camera, for example, is a function of the irradiance. (As we

24 Image Formation & Image Sensing

shall discuss a little later, the measurement of brightness in the image also

depends on the spectral sensitivity of the sensor.) The irradiance at a par-

ticular point in the image will depend on how much light arrives from the

corresponding object point (the point found by following the ray from the

image point through the pinhole until it meets the surface of an object).

In the scene, brightness is related to the energy ﬂux emitted from a

surface. Diﬀerent points on the objects in front of the imaging system will

have diﬀerent brightnesses, depending on how they are illuminated and

how they reﬂect light. We now introduce the term radiance to substitute

for the informal term scene brightness. Radiance is the power per unit foreshortened area emitted into a unit solid angle (W·m⁻²·sr⁻¹, watts per square meter per steradian) by a surface (figure 2-3b). In the figure, L is the radiance and δ²P is the power emitted by the infinitesimal surface patch of area δA into an infinitesimal solid angle δω. The apparent complexity

of the deﬁnition of radiance stems from the fact that a surface emits light

into a hemisphere of possible directions, and we obtain a ﬁnite amount

only by considering a ﬁnite solid angle of these directions. In general the

radiance will vary with the direction from which the object is viewed. We

shall discuss radiometry in detail later, when we introduce the reﬂectance

map.

We are interested in the radiance of surface patches on objects because

what we measure, image irradiance, turns out to be proportional to scene

radiance, as we show later. The constant of proportionality depends on the

optical system. To gather a ﬁnite amount of light in the image plane we

must have an aperture of ﬁnite size. The pinhole, introduced in the last

section, must have a nonzero diameter. Our simple analysis of projection

no longer applies, though, since a point in the environment is now imaged

as a small circle. This can be seen by considering the cone of rays passing

through the circular pinhole with its apex at the object point.

We cannot make the pinhole very small for another reason. Because

of the wave nature of light, diﬀraction occurs at the edge of the pinhole

and the light is spread over the image. As the pinhole is made smaller and

smaller, a larger and larger fraction of the incoming light is deﬂected far

from the direction of the incoming ray.

2.3 Lenses

In order to avoid the problems associated with pinhole cameras, we now

consider the use of a lens in an image-forming system. An ideal lens pro-

duces the same projection as the pinhole, but also gathers a ﬁnite amount


of light (ﬁgure 2-4). The larger the lens, the larger the solid angle it sub-

tends when seen from the object. Correspondingly it intercepts more of

the light reﬂected from (or emitted by) the object. The ray through the

center of the lens is undeﬂected. In a well-focused system the other rays

are deﬂected to reach the same image point as the central ray.

Figure 2-4. To obtain ﬁnite irradiance in the image plane, a lens is used instead

of an ideal pinhole. A perfect lens generates an image that obeys the same

projection equations as that generated by a pinhole, but gathers light from a

ﬁnite area as well. A lens produces well-focused images of objects at a particular

distance only.

An ideal lens has the disadvantage that it only brings to focus light from points at a distance −z given by the familiar lens equation

1/z′ + 1/(−z) = 1/f,

where z′ is the distance of the image plane from the lens and f is the focal length (figure 2-4). Points at other distances are imaged as little circles. This can be seen by considering the cone of light rays passing through the lens with apex at the point where they are correctly focused. The size of the blur circle can be determined as follows: A point at distance −z̄ is imaged at a point z̄′ from the lens, where

1/z̄′ + 1/(−z̄) = 1/f,

and so

(z̄′ − z′) = (f/(z̄ + f)) (f/(z + f)) (z̄ − z).

If the image plane is situated to receive correctly focused images of objects at distance −z, then points at distance −z̄ will give rise to blur circles of diameter

(d/z̄′) |z̄′ − z′|,

where d is the diameter of the lens. The depth of field is the range of distances over which objects are focused “sufficiently well,” in the sense that the diameter of the blur circle is less than the resolution of the imaging device. The depth of field depends, of course, on what sensor is used, but in any case it is clear that the larger the lens aperture, the less the depth of field. Clearly also, errors in focusing become more serious when a large aperture is employed.
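The blur-circle formula lends itself to a quick numerical sketch. This is our own illustration, not from the text; the function names and lens parameters are invented. It uses the thin-lens equation 1/z′ + 1/(−z) = 1/f to locate the image of each depth, then computes the blur diameter (d/z̄′)|z̄′ − z′|.

```python
# Illustrative sketch (not from the text; names are our own) of the
# blur-circle computation: a point at depth z-bar focuses at
# z-bar' = f*z-bar/(z-bar + f); with the image plane at z' (focused for
# depth z), the blur circle has diameter (d/z-bar')*|z-bar' - z'|.

def image_distance(z, f):
    """Solve the thin-lens equation 1/z' + 1/(-z) = 1/f for z'."""
    return f * z / (z + f)

def blur_circle_diameter(z_focus, z_point, f, d):
    """Blur-circle diameter for a point at depth z_point (negative) when
    the image plane is focused for depth z_focus; d is the lens diameter."""
    z_image = image_distance(z_focus, f)        # image plane position z'
    z_point_image = image_distance(z_point, f)  # where the point focuses
    return (d / z_point_image) * abs(z_point_image - z_image)

# Hypothetical lens: f = 50 mm, aperture 25 mm, focused at 2 m;
# a point at 1.5 m is imaged as a small disk:
wide = blur_circle_diameter(-2.0, -1.5, 0.05, 0.025)
# Halving the aperture halves the blur, increasing the depth of field:
narrow = blur_circle_diameter(-2.0, -1.5, 0.05, 0.0125)
print(wide, narrow)
```

The blur scales linearly with the lens diameter d, which is the quantitative content of the remark that a larger aperture gives less depth of field.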

Figure 2-5. An ideal thick lens provides a reasonable model for most real lenses.

It produces the same perspective projection that an ideal thin lens does, except

for an additional oﬀset, the lens thickness t, along the optical axis. It can be un-

derstood in terms of the principal planes and the nodal points at the intersections

of the principal planes and the optical axis.

Simple ray-tracing rules can help in understanding simple lens combi-

nations. As already mentioned, the ray through the center of the lens is

undeﬂected. Rays entering the lens parallel to the optical axis converge to


a point on the optical axis at a distance equal to the focal length. This fol-

lows from the deﬁnition of focal length as the distance from the lens where

the image of an object that is inﬁnitely far away is focused. Conversely,

rays emitted from a point on the optical axis at a distance equal to the focal

length from the lens are deﬂected to emerge parallel to the optical axis on

the other side of the lens. This follows from the reversibility of rays. At an

interface between media of diﬀerent refractive indices, the same reﬂection

and refraction angles apply to light rays traveling in opposite directions.

A simple lens is made by grinding and polishing a glass blank so that

its two surfaces have shapes that are spherical. The optical axis is the line

through the centers of the two spheres. Any such simple lens will have

a number of defects or aberrations. For this reason one usually combines

several simple lenses, carefully lining up their individual optical axes, so as

to make a compound lens with better properties.

A useful model of such a system of lenses is the thick lens (ﬁgure 2-5).

One can deﬁne two principal planes perpendicular to the optical axis, and

two nodal points where these planes intersect the optical axis. A ray arriv-

ing at the front nodal point leaves the rear nodal point without changing

direction. This deﬁnes the projection performed by the lens. The distance

between the two nodal points is the thickness of the lens. A thin lens is

one in which the two nodal points can be considered coincident.

It is theoretically impossible to make a perfect lens. The projection

will never be exactly like that of an ideal pinhole. More important, exact

focusing of all rays cannot be achieved. A variety of aberrations occur. In

a well-designed lens these defects are kept to a minimum, but this becomes

more diﬃcult as the aperture of the lens is increased. Thus there is a

trade-oﬀ between light-gathering power and image quality.

A defect of particular interest to us here is called vignetting. Imagine

several circular diaphragms of diﬀerent diameter, stacked one behind the

other, with their centers on a common line (ﬁgure 2-6). When you look

along this common line, the smallest diaphragm will limit your view. As

you move away from the line, some of the other diaphragms will begin to

occlude more, until ﬁnally nothing can be seen. Similarly, in a simple lens,

all the rays that enter the front surface of the lens end up being focused

in the image. In a compound lens, some of the rays that pass through

the ﬁrst lens may be occluded by portions of the second lens, and so on.

This will depend on the inclination of the entering ray with respect to the

optical axis and its distance from the front nodal point. Thus points in

the image away from the optical axis beneﬁt less from the light-gathering


power of the lens than does the point on the optical axis. There is a falloﬀ

in sensitivity with distance from the center of the image.

Figure 2-6. Vignetting is a reduction in light-gathering power with increasing

inclination of light rays with respect to the optical axis. It is caused by apertures

in the lens system occluding part of the beam of light as it passes through the

lens system. Vignetting results in a smooth, but sometimes quite large, falloﬀ in

sensitivity toward the edges of the image region.

Another important consideration is that the aberrations of a lens in-

crease in magnitude as a power of the angle between the incident ray and

the optical axis. Aberrations are classiﬁed by their order, that is, the power

of the angle that occurs in this relationship. Points on the optical axis may

be quite well focused, while those in a corner of the image are smeared out.

For this reason, only a limited portion of the image plane is usable. The

magnitude of an aberration defect also increases as a power of the distance

from the optical axis at which a ray passes through the lens. Thus the

image quality can be improved by using only the central portion of a lens.

One reason for introducing diaphragms into a lens system is to im-

prove image quality in a situation where it is not necessary to utilize fully

the light-gathering power of the system. As already mentioned, ﬁxed di-

aphragms ensure that rays entering at a large angle to the optical axis do

not pass through the outer regions of any of the lenses. This improves

image quality in the outer regions of the image, but at the same time

greatly increases vignetting. In most common uses of lenses this is not

an important matter, since people are astonishingly insensitive to smooth


spatial variations in image brightness. It does matter in machine vision,

however, since we use the measurements of image brightness (irradiance)

to determine the scene brightness (radiance).

2.4 Our Visual World

How can we hope to recover information about the three-dimensional world

using a mere two-dimensional image? It may seem that the available in-

formation is not adequate, even if we take several images. Yet biological

systems interact intelligently with the environment using visual informa-

tion. The puzzle is solved when we consider the special nature of our usual

visual world. We are immersed in a homogeneous transparent medium, and

the objects we look at are typically opaque. Light rays are not refracted

or absorbed in the environment, and we can follow a ray from an image

point through the lens until it reaches some surface. The brightness at

a point in the image depends only on the brightness of the corresponding

surface patch. Surfaces are two-dimensional manifolds, and their shape can

be represented by giving the distance z(x′, y′) to the surface as a function of the image coordinates x′ and y′.

This is to be contrasted with a situation in which we are looking into

a volume occupied by a light-absorbing material of varying density. Here

we may specify the density ρ(x, y, z) of the material as a function of the

coordinates x, y, and z. One or more images provide enough constraint to

recover information about a surface, but not about a volume. In theory,

an inﬁnite number of images is needed to solve the problem of tomography,

that is, to determine the density of the absorbing material.

Conditions of homogeneity and transparency may not always hold ex-

actly. Distant mountains appear changed in color and contrast, while in

deserts we may see mirages. Image analysis based on the assumption that

conditions are as stated may go awry when the assumptions are violated,

and so we can expect that both biological and machine vision systems will

be misled in such situations. Indeed, some optical illusions can be ex-

plained in this way. This does not mean that we should abandon these

additional constraints, for without them the solution of the problem of re-

covering information about the three-dimensional world from images would

be ambiguous.

Our usual visual world is special indeed. Imagine being immersed

instead in a world with varying concentrations of pigments dispersed within

a gelatinous substance. It would not be possible to recover the distributions

of these absorbing substances in three dimensions from one view. There


just would not be enough information. Analogously, single X-ray images

are not useful unless there happens to be sharp contrast between diﬀerent

materials, like bone and tissue. Otherwise a very large number of views

must be taken and a tomographic reconstruction attempted. It is perhaps

a good thing that we do not possess Superman’s X-ray vision capabilities!

By and large, we shall conﬁne our attention to images formed by con-

ventional optical means. We shall avoid high-magniﬁcation microscopic

images, for instance, where many substances are eﬀectively transparent,

or at least translucent. Similarly, images on a very large scale often show

the eﬀects of absorption and refraction in the atmosphere. Interestingly,

other modalities do sometimes provide us with images much like the ones

we are used to. Examples include scanning electron microscopes (SEM)

and synthetic-aperture radar systems (SAR), both of which produce im-

ages that are easy to interpret. So there is some hope of analyzing them

using the methods discussed here.

In view of the importance of surfaces, we might hope that a machine

vision system could be designed to recover the shapes of surfaces given one

or more images. Indeed, there has been some success in this endeavor, as

we shall see in chapter 10, where we discuss the recovery of shape from

shading. Detailed understanding of the imaging process allows us to re-

cover quantitative information from images. The computed shape of a

surface may be used in recognition, inspection, or in planning the path of

a mechanical manipulator.

2.5 Image Sensing

Almost all image sensors depend on the generation of electron–hole pairs

when photons strike a suitable material. This is the basic process in bi-

ological vision as well as photography. Image sensors diﬀer in how they

measure the ﬂux of charged particles. Some devices use an electric ﬁeld

in a vacuum to separate the electrons from the surface where they are lib-

erated (ﬁgure 2-7a). In other devices the electrons are swept through a

depleted zone in a semiconductor (ﬁgure 2-7b).

Not all incident photons generate an electron–hole pair. Some pass

right through the sensing layer, some are reﬂected, and others lose energy

in diﬀerent ways. Further, not all electrons ﬁnd their way into the detect-

ing circuit. The ratio of the electron ﬂux to the incident photon ﬂux is

called the quantum eﬃciency, denoted q(λ). The quantum eﬃciency de-

pends on the energy of the incident photon and hence on its wavelength λ.

It also depends on the material and the method used to collect the liber-


ated electrons. Older vacuum devices tend to have coatings with relatively

low quantum eﬃciency, while solid-state devices are near ideal for some

wavelengths. Photographic ﬁlm tends to have poor quantum eﬃciency.

Figure 2-7. Photons striking a suitable surface generate charge carriers that

are collected and measured to determine the irradiance. (a) In the case of a

vacuum device, electrons are liberated from the photocathode and attracted to

the positive anode. (b) In the case of a semiconductor device, electron–hole pairs

are separated by the built-in ﬁeld to be collected in an external circuit.

2.5.1 Sensing Color

The sensitivity of a device varies with the wavelength of the incident light.

Photons with little energy tend to go right through the material, while

very energetic photons may be stopped before they reach the sensitive

layer. Each material has its characteristic variation of quantum eﬃciency

with wavelength.

For a small wavelength interval δλ, let the flux of photons with wavelength equal to or greater than λ, but less than λ + δλ, be b(λ) δλ. Then the number of electrons liberated is

∫_{−∞}^{∞} b(λ) q(λ) dλ.


If we use sensors with diﬀerent photosensitive materials, we obtain diﬀerent

images because their spectral sensitivities are diﬀerent. This can be helpful

in distinguishing surfaces that have similar gray-levels when imaged with

one sensor, yet give rise to diﬀerent gray-levels when imaged with a diﬀer-

ent sensor. Another way to achieve this eﬀect is to use the same sensing

material but place ﬁlters in front of the camera that selectively absorb dif-

ferent parts of the spectrum. If the transmission of the i-th filter is f_i(λ), the effective quantum efficiency of the combination of that filter and the sensor is f_i(λ) q(λ).

How many diﬀerent ﬁlters should we use? The ability to distinguish

among materials grows as more images are taken through more ﬁlters.

The measurements are correlated, however, because most surfaces have a

smooth variation of reﬂectance with wavelength. Typically, little is gained

by using very many ﬁlters.

The human visual system uses three types of sensors, called cones, in daylight conditions. Each of these cone types has a particular spectral

sensitivity, one of them peaking in the long wavelength range, one in the

middle, and one in the short wavelength range of the visible spectrum,

which extends from about 400 nm to about 700 nm. There is considerable

overlap between the sensitivity curves. Machine vision systems often also

use three images obtained through red, green, and blue ﬁlters. It should

be pointed out, however, that the results have little to do with human

color sensations unless the spectral response curves happen to be linear

combinations of the human spectral response curves, as discussed below.

One property of a sensing system with a small number of sensor types

having diﬀerent spectral sensitivities is that many diﬀerent spectral distri-

butions will produce the same output. The reason is that we do not measure

the spectral distributions themselves, but integrals of their product with

the spectral sensitivity of particular sensor types. The same applies to bio-

logical systems, of course. Colors that appear indistinguishable to a human

observer are said to be metameric. Useful information about the spectral

sensitivities of the human visual system can be gained by systematically

exploring metamers. The results of a large number of color-matching ex-

periments performed by many observers have been averaged and used to

calculate the so-called tristimulus or standard observer curves. These have

been published by the Commission Internationale de l’Eclairage (CIE) and

are shown in ﬁgure 2-8. A given spectral distribution is evaluated as fol-

lows: The spectral distribution is multiplied in turn by each of the three

functions x(λ), y(λ), and z(λ). The products are integrated over the visible wavelength range. The three results X, Y, and Z are called the tristimulus values. Two spectral distributions that result in the same values for these three quantities appear indistinguishable when placed side by side under controlled conditions. (By the way, the spectral distributions used here are expressed in terms of energy per unit wavelength interval, not photon flux.)
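The structure of the tristimulus computation can be sketched as below. This is our own illustration: the Gaussian curves are crude made-up stand-ins for the published CIE functions, used only to show the multiply-and-integrate pattern.

```python
# Sketch (our own illustration): tristimulus values are obtained by
# multiplying a spectral energy distribution by each observer curve and
# integrating. The Gaussian curves below are made-up stand-ins for the
# CIE functions, not the real ones.

import math

def tristimulus(spectrum, xbar, ybar, zbar, lo=400.0, hi=700.0, n=300):
    """Integrate spectrum(lam) against each of the three curves (lam in nm)."""
    dlam = (hi - lo) / n
    X = Y = Z = 0.0
    for k in range(n):
        lam = lo + (k + 0.5) * dlam
        s = spectrum(lam)
        X += s * xbar(lam) * dlam
        Y += s * ybar(lam) * dlam
        Z += s * zbar(lam) * dlam
    return X, Y, Z

# Made-up stand-in curves peaking at long, middle, and short wavelengths:
xbar = lambda lam: math.exp(-((lam - 600.0) / 50.0) ** 2)
ybar = lambda lam: math.exp(-((lam - 550.0) / 50.0) ** 2)
zbar = lambda lam: math.exp(-((lam - 450.0) / 50.0) ** 2)

X, Y, Z = tristimulus(lambda lam: 1.0, xbar, ybar, zbar)
print(X, Y, Z)
```

Two different spectral distributions that yield the same (X, Y, Z) under this computation would be metameric: the three integrals are all the sensor (or observer) gets to see.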

The actual spectral response curves of the three types of cones cannot

be determined in this way, however. There is some remaining ambiguity.

It is known that the tristimulus curves are ﬁxed linear transforms of these

spectral response curves. The coeﬃcients of the transformation are not

known accurately.

We show in exercise 2-14 that a machine vision system with the same

color-matching properties as the human color vision system must have sen-

sitivities that are linear transforms of the human cone response curves. This

in turn implies that the sensitivities must be linear transforms of the known

standard observer curves. Unfortunately, this rule has rarely been observed

when color-sensing systems were designed in the past. (Note that we are

not addressing the problem of color sensations; we are only interested in

having the machine confuse the same colors as the standard observer.)

2.5.2 Randomness and Noise

It is diﬃcult to make accurate measurements of image brightness. In this

section we discuss the corrupting inﬂuence of noise on image sensing. In

order to do this, we need to discuss random variables and the probability

density distribution. We shall also take the opportunity to introduce the

concept of convolution in the one-dimensional case. Later, we shall en-

counter convolution again, applied to two-dimensional images. The reader

familiar with these concepts may want to skip this section.

Measurements are aﬀected by ﬂuctuations in the signal being mea-

sured. If the measurement is repeated, somewhat diﬀering results may be

obtained. Typically, measurements will cluster around the “correct” value.

We can talk of the probability that a measurement will fall within a certain

interval. Roughly speaking, this is the limit of the ratio of the number of

measurements that fall in that interval to the total number of trials, as the

total number of trials tends to inﬁnity. (This deﬁnition is not quite ac-

curate, since any particular sequence of experiments may produce results

that do not tend to the expected limit. It is unlikely that they are far oﬀ,

however. Indeed, the probability of the limit tending to an answer that is

not the desired one is zero.)


Figure 2-8. The tristimulus curves allow us to predict which spectral distri-

butions will be indistinguishable. A given spectral distribution is multiplied by

each of the functions x(λ), y(λ), and z(λ), in turn, and the products integrated.

In this way we obtain the tristimulus values, X, Y, and Z, that can be used

to characterize the spectral distribution. Spectral distributions that lead to the

same tristimulus values appear the same when placed next to one another.


Figure 2-9. (a) A histogram indicates how many samples fall into each of a

series of measurement intervals. If more and more samples are gathered, these

intervals can be made smaller and smaller while maintaining the accuracy of the

individual measurements. (b) In the limit the histogram becomes a continuous

function, called the probability distribution.

Now we can deﬁne the probability density distribution, denoted p(x).

The probability that a random variable will be equal to or greater than

x, but less than x+δx, tends to p(x)δx as δx tends to zero. (There

is a subtle problem here, since for a given number of trials the number

falling in the interval will tend to zero as the size of the interval tends

to zero. This problem can be sidestepped by considering the cumulative


probability distribution, introduced below.) A probability distribution can

be estimated from a histogram obtained from a ﬁnite number of trials

(ﬁgure 2-9). From our deﬁnition follow two important properties of any

probability distribution p(x):

p(x) ≥ 0 for all x, and

$$\int_{-\infty}^{\infty} p(x)\,dx = 1.$$

Often the probability distribution has a strong peak near the “correct,” or

“expected,” value. We may accordingly define the mean as the center of area, μ, of this peak, given by the equation

$$\mu \int_{-\infty}^{\infty} p(x)\,dx = \int_{-\infty}^{\infty} x\,p(x)\,dx.$$

Since the integral of p(x) from minus inﬁnity to plus inﬁnity is one,

$$\mu = \int_{-\infty}^{\infty} x\,p(x)\,dx.$$

The integral on the right is called the ﬁrst moment of p(x).

Next, to estimate the spread of the peak of p(x), we can take the second

moment about the mean, called the variance:

$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2\,p(x)\,dx.$$

The square root of the variance, called the standard deviation, is a useful

measure of the width of the distribution.
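The moment definitions above translate directly into estimates computed from samples. The following sketch is only illustrative; the Gaussian parameters are arbitrary choices, not values from the text.

```python
import random

random.seed(0)

# Draw samples from a Gaussian with known parameters, then estimate the
# first moment (mean) and the second moment about the mean (variance).
mu_true, sigma_true = 5.0, 2.0
samples = [random.gauss(mu_true, sigma_true) for _ in range(100_000)]

n = len(samples)
mean = sum(samples) / n                               # first moment
variance = sum((x - mean) ** 2 for x in samples) / n  # second moment about the mean
std_dev = variance ** 0.5

print(mean, std_dev)  # close to 5.0 and 2.0
```

With 100,000 samples the estimates land within a few hundredths of the true parameters, as the σ/√N argument later in this section predicts.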

Another useful concept is the cumulative probability distribution,

$$P(x) = \int_{-\infty}^{x} p(t)\,dt,$$

which tells us the probability that the random variable will be less than

or equal to x. The probability density distribution is just the derivative of

the cumulative probability distribution. Note that

$$\lim_{x \to \infty} P(x) = 1.$$

One way to improve accuracy is to average several measurements, assuming

that the “noise” in them will be independent and tend to cancel out. To

understand how this works, we need to be able to compute the probability

distribution of a sum of several random variables.


Suppose that x is a sum of two independent random variables x₁ and x₂ and that p₁(x₁) and p₂(x₂) are their probability distributions. How do we find p(x), the probability distribution of x = x₁ + x₂? Given x₂, we know that x₁ must lie between x − x₂ and x + δx − x₂ in order for the sum to lie between x and x + δx (figure 2-10). The probability that this will happen is p₁(x − x₂) δx. Now x₂ can take on a range of values, and the probability that it lies in a particular interval x₂ to x₂ + δx₂ is just p₂(x₂) δx₂. To find the probability that the sum lies between x and x + δx we must integrate the product over all x₂. Thus

$$p(x)\,\delta x = \int_{-\infty}^{\infty} p_1(x - x_2)\,\delta x\;p_2(x_2)\,dx_2,$$

or

$$p(x) = \int_{-\infty}^{\infty} p_1(x - t)\,p_2(t)\,dt.$$

Figure 2-10. The probability distribution of the sum of two independent ran-

dom variables is the convolution of the probability distributions of the two vari-

ables. This can be shown by integrating the product of the individual probability

distributions over the narrow strip between x₁ + x₂ = x and x₁ + x₂ = x + δx.


By a similar argument one can show that

$$p(x) = \int_{-\infty}^{\infty} p_2(x - t)\,p_1(t)\,dt,$$

in which the roles of x₁ and x₂ are reversed. These correspond to two ways of integrating the product of the probabilities over the narrow diagonal strip (figure 2-10). In either case, we talk of a convolution of the distributions p₁ and p₂, written as

$$p = p_1 \otimes p_2.$$

We have just shown that convolution is commutative.
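The convolution integral has a direct discrete analogue, which also lets us confirm commutativity numerically. The two input distributions below are arbitrary illustrative choices.

```python
# Discrete analogue of p(x) = integral of p1(x - t) p2(t) dt: the
# distribution of the sum of two independent discrete random variables.

def convolve(p1, p2):
    out = [0.0] * (len(p1) + len(p2) - 1)
    for i, a in enumerate(p1):
        for j, b in enumerate(p2):
            out[i + j] += a * b   # probability that the two indices sum to i + j
    return out

p1 = [0.25, 0.25, 0.25, 0.25]    # a uniform distribution
p2 = [0.1, 0.2, 0.4, 0.2, 0.1]   # a peaked distribution

p = convolve(p1, p2)
q = convolve(p2, p1)
# p and q agree term by term, and the result still sums to one.
```

The agreement of `p` and `q` is the discrete counterpart of the commutativity shown above.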

We show in exercise 2-16 that the mean of the sum of several random

variables is equal to the sum of the means, and that the variance of the

sum equals the sum of the variances. Thus if we compute the average of

N independent measurements,

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i,$$

each of which has mean μ and standard deviation σ, the mean of the result is also μ, while the standard deviation is σ/√N, since the variance of the sum is Nσ². Thus we obtain a more accurate result, that is, one less affected by “noise.” The relative accuracy only improves with the square root of the number of measurements, however.
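This √N behavior is easy to confirm by simulation. The sketch below uses arbitrary parameter values (μ = 10, σ = 1, N = 16); none of them come from the text.

```python
import random

random.seed(1)

# Average N noisy measurements, repeat many times, and measure the
# spread of the resulting averages.
mu, sigma, N, trials = 10.0, 1.0, 16, 20_000

averages = []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(N)]
    averages.append(sum(xs) / N)

m = sum(averages) / trials
s = (sum((a - m) ** 2 for a in averages) / trials) ** 0.5

print(m, s)  # m close to 10.0; s close to sigma / sqrt(N) = 0.25
```

The spread of the averages is about a quarter of the spread of a single measurement, as σ/√16 predicts.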

A probability distribution that is of great practical interest is the normal or Gaussian distribution

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2},$$

with mean μ and standard deviation σ. The noise in many measurement processes can be modeled well using this distribution.

So far we have been dealing with random variables that can take on

values in a continuous range. Analogous methods apply when the possible

values are in a discrete set. Consider the electrons liberated during a ﬁxed

interval by photons falling on a suitable material. Each such event is inde-

pendent of the others. It can be shown that the probability that exactly n are liberated in a time interval T is

$$P_n = \frac{e^{-m} m^n}{n!}$$


for some m. This is the Poisson distribution. We can calculate the average number liberated in time T as follows:

$$\sum_{n=1}^{\infty} n\,\frac{e^{-m} m^n}{n!} = m\,e^{-m} \sum_{n=1}^{\infty} \frac{m^{n-1}}{(n-1)!}.$$

But

$$\sum_{n=1}^{\infty} \frac{m^{n-1}}{(n-1)!} = \sum_{n=0}^{\infty} \frac{m^n}{n!} = e^m,$$

so the average is just m. We show in exercise 2-18 that the variance is also

m. The standard deviation is thus √m, so that the ratio of the standard

deviation to the mean is 1/√m. The measurement becomes more accurate

the longer we wait, since more electrons are gathered. Again, the ratio of

the “signal” to the “noise” only improves as the square root of the average

number of electrons collected, however.
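The mean-m, standard-deviation-√m relationship can be checked by simulation. The sampler below uses Knuth's classic method, which is not described in the text; m = 100 is an arbitrary illustrative value.

```python
import math
import random

random.seed(2)

def poisson(m):
    # Knuth's method: multiply uniform random numbers together and count
    # how many factors are needed before the product drops below e^(-m).
    limit = math.exp(-m)
    k, prod = 0, 1.0
    while True:
        prod *= random.random()
        if prod <= limit:
            return k
        k += 1

m = 100.0
samples = [poisson(m) for _ in range(20_000)]
n = len(samples)
mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n

print(mean, var ** 0.5, mean / var ** 0.5)  # mean near 100, std near 10, SNR near sqrt(m) = 10
```

The ratio of mean to standard deviation comes out near √m = 10, matching the square-root law above.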

To obtain reasonable results, many electrons must be measured. It can

be shown that a Poisson distribution with mean m is almost the same as a Gaussian distribution with mean m and variance m, provided that m is large. The Gaussian distribution is often easier to work with. In any case, to obtain a standard deviation that is one-thousandth of the mean, one must wait long enough to collect a million electrons. This is still a small charge, since one electron carries only

$$e = 1.602192\ldots \times 10^{-19}\ \text{Coulomb}.$$

Even a million electrons have a charge of only about 160 fC (femto-Coulomb). (The prefix femto- denotes a multiplier of 10⁻¹⁵.) It is not

easy to measure such a small charge, since noise is introduced in the mea-

surement process.

The number of electrons liberated from an area δA in time δt is

$$N = \delta A\,\delta t \int_{-\infty}^{\infty} b(\lambda)\,q(\lambda)\,d\lambda,$$

where q(λ) is the quantum eﬃciency and b(λ) is the image irradiance in

photons per unit area. To obtain a usable result, then, electrons must be

collected from a ﬁnite image area over a ﬁnite amount of time. There is

thus a trade-oﬀ between (spatial and temporal) resolution and accuracy.
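The electron-count integral can be sketched numerically. The flux b(λ) and quantum efficiency q(λ) below are made-up constants, chosen so the answer is checkable by hand; real curves would come from measured sensor data.

```python
# Midpoint-rule approximation of N = dA * dt * integral of b(lambda) * q(lambda).

def electrons(b, q, lam_lo, lam_hi, dA, dt, steps=1000):
    h = (lam_hi - lam_lo) / steps
    total = 0.0
    for i in range(steps):
        lam = lam_lo + (i + 0.5) * h   # midpoint of the i-th wavelength bin
        total += b(lam) * q(lam) * h
    return dA * dt * total

# Constant flux of 2.0 photons per unit area per unit wavelength and a
# constant quantum efficiency of 0.5 over the band 400-700 nm:
N = electrons(lambda lam: 2.0, lambda lam: 0.5, 400e-9, 700e-9, dA=1e-10, dt=1e-2)
print(N)  # close to 1e-10 * 1e-2 * 2.0 * 0.5 * 300e-9 = 3e-19
```

Doubling either the collecting area `dA` or the integration time `dt` doubles N, which is exactly the resolution-versus-accuracy trade-off described above.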

A measurement of the number of electrons liberated in a small area

during a ﬁxed time interval produces a result that is proportional to the

irradiance (for ﬁxed spectral distribution of incident photons). These mea-

surements are quantized in order to read them into a digital computer.


This is done by analog-to-digital (A/D) conversion. The result is called a

gray-level. Since it is diﬃcult to measure irradiance with great accuracy,

it is reasonable to use a small set of numbers to represent the irradiance

levels. The range 0 to 255 is often employed—requiring just 8 bits per

gray-level.
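A minimal sketch of such a conversion follows. The linear sensor response and the calibration constant `e_max` are assumptions made for illustration, not details from the text.

```python
# Map a continuous irradiance measurement onto an 8-bit gray-level.

def to_gray_level(irradiance, e_max=1.0, levels=256):
    e = min(max(irradiance, 0.0), e_max)          # clamp to the measurable range
    return min(int(e / e_max * levels), levels - 1)  # linear map onto 0..levels-1

print([to_gray_level(e) for e in (0.0, 0.25, 0.5, 1.0)])  # -> [0, 64, 128, 255]
```

Anything outside the calibrated range saturates at 0 or 255, which is also what a real A/D converter does.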

2.5.3 Quantization of the Image

Because we can only transmit a ﬁnite number of measurements to a com-

puter, spatial quantization is also required. It is common to make mea-

surements at the nodes of a square raster or grid of points. The image is

then represented as a rectangular array of integers. To obtain a reason-

able amount of detail we need many measurements. Television frames, for

example, might be quantized into 450 lines of 560 picture cells, sometimes

referred to as pixels.

Each number represents the average irradiance over a small area. We

cannot obtain a measurement at a point, as discussed above, because the

ﬂux of light is proportional to the sensing area. At ﬁrst this might appear

as a shortcoming, but it turns out to be an advantage. The reason is that

we are trying to use a discrete set of numbers to represent a continuous

distribution of brightness, and the sampling theorem tells us that this can

be done successfully only if the continuous distribution is smooth, that

is, if it does not contain high-frequency components. One way to make a

smooth distribution of brightness is to look at the image through a ﬁlter

that averages over small areas.

What is the optimal size of the sampling areas? It turns out that

reasonable results are obtained if the dimensions of the sampling areas are

approximately equal to their spacing. This is fortunate because it allows us

to pack the image plane eﬃciently with sensing elements. Thus no photons

need be wasted, nor must adjacent sampling areas overlap.

We have some latitude in dividing up the image plane into sensing

areas. So far we have been discussing square areas on a square grid. The

picture cells could equally well be rectangular, resulting in a diﬀerent res-

olution in the horizontal and vertical directions. Other arrangements are

also possible. Suppose we want to tile the plane with regular polygons. The

tiles should not overlap, yet together they should cover the whole plane.

We shall show in exercise 2-21 that there are exactly three tessellations,

based on triangles, squares, and hexagons (ﬁgure 2-11).
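The count of three can be checked numerically: a regular k-gon has interior angle 180(k − 2)/k degrees, and copies of it can meet around a vertex without gaps or overlap only if that angle divides 360 exactly, that is, only if 2k/(k − 2) is an integer. (This is a quick sanity check, not the proof asked for in exercise 2-21.)

```python
# Enumerate the regular k-gons whose interior angle divides 360 degrees,
# i.e. those for which 2k/(k - 2) is an integer.
tilings = [k for k in range(3, 100) if (2 * k) % (k - 2) == 0]
print(tilings)  # -> [3, 4, 6]: triangle, square, hexagon
```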

It is easy to see how a square sampling pattern is obtained simply by

taking measurements at equal intervals along equally spaced lines in the


image. Hexagonal sampling is almost as easy, if odd-numbered lines are

oﬀset by half a sampling interval from even-numbered lines. In television

scanning, the odd-numbered lines are read out after all the even-numbered

lines because of ﬁeld interlace, and so this scheme is particularly easy to

implement. Hexagons on a triangular grid have certain advantages, which

we shall come to later.

Figure 2-11. The plane can be tiled with three regular polygons: the triangle,

the square, and the hexagon. Image tessellations can be based on these tilings.

The gray-level of a picture cell is the quantized value of the measured power

falling on the corresponding area in the image.

2.6 References

There are many standard references on basic optics, including Principles of

Optics: Electromagnetic Theory of Propagation, Interference and Diﬀrac-

tion of Light by Born & Wolf [1975], Handbook of Optics, edited by Driscoll

& Vaughan [1978], Applied Optics: A Guide to Optical System Design by

Levi [volume 1, 1968; volume 2, 1980], and the classic Optics by Sears

[1949]. Lens design and aberrations are covered by Kingslake in Lens De-

sign Fundamentals [1978]. Norton discusses the basic workings of a large

variety of sensors in Sensor and Analyzer Handbook [1982]. Barbe edited

Charge-Coupled Devices [1980], a book that includes some information on

the use of CCDs in image sensors.

There is no shortage of books on probability and statistics. One such

is Drake’s Fundamentals of Applied Probability Theory [1967].

Color vision is not treated in detail here, but is mentioned again in

chapter 9 where we discuss the recovery of lightness. For a general discus-

sion of color matching and tristimulus values see the ﬁrst few chapters of

Color in Business, Science, and Industry by Judd & Wyszecki [1975].


Some issues of color reproduction, including what constitutes an ap-

propriate sensor system, are discussed by Horn [1984a]. Further references

on color vision may be found at the end of chapter 9.

Straight lines in the three-dimensional world are projected as straight

lines into the two-dimensional image. The projections of parallel lines inter-

sect in a vanishing point. This is the point where a line parallel to the given

lines passing through the center of projection intersects the image plane.

In the case of rectangular objects, a great deal of information can be re-

covered from lines in the images and their intersections. See, for example,

Barnard [1983].

When the medium between us and the scene being imaged is not per-

fectly transparent, the interpretation of images becomes more complicated.

See, for example, Sjoberg & Horn [1983]. The reconstruction of absorbing

density in a volume from measured ray attenuation is the subject of to-

mography; a book on this subject has been edited by Herman [1979].

2.7 Exercises

2-1 What is the shape of the image of a sphere? What is the shape of the

image of a circular disk? Assume perspective projection and allow the disk to lie

in a plane that can be tilted with respect to the image plane.

2-2 Show that the image of an ellipse in a plane, not necessarily one parallel to

the image plane, is also an ellipse. Show that the image of a line in space is a line

in the image. Assume perspective projection. Describe the brightness patterns

in the image of a polyhedral object with uniform surface properties.

2-3 Suppose that an image is created by a camera in a certain world. Now

imagine the same camera placed in a similar world in which everything is twice

as large and all distances between objects have also doubled. Compare the new

image with the one formed in the original world. Assume perspective projection.

2-4 Suppose that an image is created by a camera in a certain world. Now

imagine the same camera placed in a similar world in which everything has half

the reﬂectance and the incident light has been doubled. Compare the new image

with the one formed in the original world. Hint: Ignore interﬂections, that is,

illumination of one part of the scene by light reﬂected from another.

2-5 Show that in a properly focused imaging system the distance f′ from the lens to the image plane equals (1 + m)f, where f is the focal length and m is the magnification. This distance is called the effective focal length. Show that the distance between the image plane and an object must be

$$\left(m + 2 + \frac{1}{m}\right) f.$$

How far must the object be from the lens for unit magniﬁcation?

2-6 What is the focal length of a compound lens obtained by placing two thin lenses of focal length f₁ and f₂ against one another? Hint: Explain why an object at a distance f₁ on one side of the compound lens will be focused at a distance f₂ on the other side.

2-7 The f-number of a lens is the ratio of the focal length to the diameter of

the lens. The f-number of a given lens (of ﬁxed focal length) can be increased

by introducing an aperture that intercepts some of the light and thus in eﬀect

reduces the diameter of the lens. Show that image brightness will be inversely

proportional to the square of the f-number. Hint: Consider how much light is

intercepted by the aperture.

Figure 2-12. To determine the focal length and the positions of the principal planes, a number of measurements are made. Here, an object lying in a plane a distance x from an arbitrary reference point on one side of the lens is properly in focus in a plane on the other side at a distance y from the reference point. The two principal planes lie at distances a and b on either side of the reference point.

2-8 When a camera is used to obtain metric information about the world, it

is important to have accurate knowledge of the parameters of the lens, including

the focal length and the positions of the principal planes. Suppose that a pattern

in a plane at distance x on one side of the lens is found to be focused best on a plane at a distance y on the other side of the lens (figure 2-12). The distances x and y are measured from an arbitrary but fixed point in the lens. How many paired measurements like this are required to determine the focal length and the position of the two principal planes? (In practice, of course, more than the minimum required number of measurements would be taken, and a least-squares procedure would be adopted. Least-squares methods are discussed in the appendix.)

Suppose that the arbitrary reference point happens to lie between the two principal planes and that a and b are the distances of the principal planes from the reference point (figure 2-7). Note that a + b is the thickness of the lens, as defined earlier. Show that

$$(ab + bf + fa) - x_i(f + b) + y_i(f + a) + x_i y_i = 0,$$

where xᵢ and yᵢ are the measurements obtained in the ith experiment. Suggest a way to find the unknowns from a set of nonlinear equations like this. Can a closed-form solution be obtained for f, a, and b?

2-9 Here we explore a restricted case of the problem tackled in the previous

exercise. Describe a method for determining the focal length and positions of the

principal planes of a lens from the following three measurements: (a) the position

of a plane on which a scene at inﬁnity on one side of the lens appears in sharp

focus; (b) the position of a plane on which a scene at inﬁnity on the other side of

the lens appears in sharp focus; (c) the positions of two planes, one on each side

of the lens, such that one plane is imaged at unit magniﬁcation on the other.

2-10 Here we explore what happens when the image plane is tilted slightly.

Show that in a pinhole camera, tilting the image plane amounts to nothing

more than changing the place where the optical axis pierces the image plane

and changing the perpendicular distance of the projection center from the image

plane. What happens in a camera that uses a lens? Hint: Is a camera with an

(ideal) lens diﬀerent from a camera with a pinhole as far as image projection is

concerned?

How would you determine experimentally where the optical axis pierces the

image plane? Hint: It is diﬃcult to ﬁnd this point accurately.

2-11 It has been stated that perspective effects are significant when a wide-angle lens is used, while images obtained using a telephoto lens tend to approximate orthographic projection. Explain why these are only rough rules of thumb.

2-12 Straight lines in the three-dimensional world are projected as straight

lines into the two-dimensional image. The projections of parallel lines intersect


in a vanishing point. Where in the image will the vanishing point of a particular

family of parallel lines lie? When does the vanishing point of a family of parallel

lines lie at inﬁnity?

In the case of a rectangular object, a great deal of information can be recov-

ered from lines in the images and their intersections. The edges of a rectangular

solid fall into three sets of parallel lines, and so give rise to three vanishing points.

In technical drawing one speaks of one-point, two-point, and three-point perspec-

tive. These terms apply to the cases in which two, one, or none of three vanishing

points lie at inﬁnity. What alignment between the edges of the rectangular object

and the image plane applies in each case?

2-13 Typically, imaging systems are almost exactly rotationally symmetric

about the optical axis. Thus distortions in the image plane are primarily ra-

dial. When very high precision is required, a lens can be calibrated to determine

its radial distortion. Commonly, a polynomial of the form

$$\Delta r = k_1 r + k_3 r^3 + k_5 r^5 + \cdots$$

is fitted to the experimental data. Here $r = \sqrt{x^2 + y^2}$ is the distance of a point in the image from the place where the optical axis pierces the image plane. Explain why no even powers of r appear in the polynomial.

2-14 Suppose that a color-sensing system has three types of sensors and that

the spectral sensitivity of each type is a sum of scaled versions of the human cone

sensitivities. Show that two metameric colors will produce identical signals in

the sensors.

Now show that a color-sensing system will have this property for all metamers

only if the spectral sensitivity of each of its three sensor types is a sum of scaled

versions of the human cone sensitivities. Warning: The second part of this prob-

lem is much harder than the ﬁrst.

2-15 Show that the variance can be calculated as

$$\sigma^2 = \int_{-\infty}^{\infty} x^2\,p(x)\,dx - \mu^2.$$

2-16 Here we consider the mean and standard deviation of the sum of two

random variables.

(a) Show that the mean of x = x₁ + x₂ is the sum μ₁ + μ₂ of the means of the independent random variables x₁ and x₂.

(b) Show that the variance of x = x₁ + x₂ is the sum σ₁² + σ₂² of the variances of the independent random variables x₁ and x₂.


2-17 Suppose that the probability distribution of a random variable is

$$p(x) = \begin{cases} 1/(2w), & \text{if } |x| \le w; \\ 0, & \text{if } |x| > w. \end{cases}$$

What is the probability distribution of the average of two independent values from this distribution?

2-18 Here we consider some properties of the Gaussian and the Poisson distributions.

(a) Show that the mean and variance of the Gaussian distribution

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

are μ and σ² respectively.

(b) Show that the mean and the variance of the Poisson distribution

$$p_n = \frac{e^{-m} m^n}{n!}$$

are both equal to m.

2-19 Consider the weighted sum of independent random variables

$$\sum_{i=1}^{N} w_i x_i,$$

where xᵢ has mean m and standard deviation σ. Assume that the weights wᵢ add up to one. What are the mean and standard deviation of the weighted sum? For fixed N, what choice of weights minimizes the variance?

2-20 A television frame is scanned in 1/30 second. All the even-numbered lines

in one ﬁeld are followed by all the odd-numbered lines in the other ﬁeld. Assume

that there are about 450 lines of interest, each to be divided into 560 picture cells.

At what rate must the conversion from analog to digital form occur? (Ignore time

intervals between lines and between successive frames.)

2-21 Show that there are only three regular polygons with which the plane can

be tiled, namely (a) the equilateral triangle, (b) the square, and (c) the hexagon.

(By tiling we mean covering without gaps or overlap.)