A Two-Dimensional, Object-Based Analog VLSI
Visual Attention System
Charles S. Wilson, Tonia G. Morris†, and Stephen P. DeWeerth
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, GA 30332-0250
A two-dimensional object-based analog VLSI model of selective attentional processing has been
implemented using a standard 1.2µm CMOS process. This chip extends previous work modeling
object-based selection and scanning by incorporating the circuitry and architectural changes
necessary for two-dimensional focal plane processing. To balance the need for closely spaced,
large photodetectors with the space requirements of complex in-pixel processing, the chip
implements a multiresolution architecture. The system has the ability to group pixels into
objects; this grouping is dynamic, driven solely by the segmentation criterion at the input. In the
demonstration system, image intensity has been chosen for the input saliency map and the segmentation is based on spatial lowpass filtering followed by an intensity threshold. We present experimental results.

1: Introduction
One of the major hurdles in solving visual processing problems is the sheer quantity of infor-
mation that must be processed in real time. The amount of processing circuitry and wiring necessary to process this information completely in parallel is prohibitive. In both engineered and
biological systems, such computational resources are rarely available and are costly in terms of
power, space, and reliability. One option when dealing with this tradeoff is to subdivide the
image data in both space and time, thus greatly reducing the computational resources necessary.
Biological vision systems demonstrate an excellent example of balancing processing perfor-
mance needs and implementation limitations. Vertebrate visual systems have a hybrid architec-
ture, where low-level processing is performed across the entire visual field in parallel, and
higher-level processing is performed on selected subregions in sequence. Low-level tasks
computed in parallel are called preattentive. The higher-level, serially-computed tasks are post-
attentive. It is the transition from preattentive to postattentive processing that comprises selec-
tive attention. Selective attention is not limited to visual processing alone, although vision
provides the most obvious application. Koch and Ullman introduced a theory of selective shifts of attention based upon a saliency map. The saliency map provides a measure of lev-
els of interest across the visual field. According to this theory, the levels of interest are com-
puted through the integration of features that are detected preattentively, such as color,
orientation, spatial derivatives, temporal derivatives, and motion. Selective visual attention is
one method utilized by biological systems to focus limited processing resources strategically
while maintaining full coverage of the visual scene. The saliency map interpretation provides a convenient framework for implementation.

†. Presently with Intel Corporation, MS CH6-438, 5000 W. Chandler Blvd., Chandler, AZ 85226-3699.
In a general sense, the preattentive and postattentive processing pathways can be physically
traced within the brain and are often referred to as the “where” (magnocellular or dorsal) path-
way, and the “what” (parvocellular or ventral) pathway. The “where” pathway is used to
identify the locations of various features and objects within the visual field, and operates on the
entire field in parallel, at a lower resolution. The “what” pathway is used to determine the iden-
tity of various objects at the locations determined by the “where” pathway. It operates sequen-
tially on selected subregions of the visual field, usually at the higher resolution provided by the
foveal area of the retina. A simple view of how the two pathways interact is that the “where”
pathway localizes regions of interest so that the corresponding information can be routed within
the “what” pathway to perform discrimination and recognition tasks. Thus, biological vision
systems provide an example not only of the algorithms of selective attention, but also of the
architecture to be used.
Visual processing tasks often require that the selected region of attention correspond to one
or more objects, since the visual scene is constructed from objects. Therefore, the selective
attention processing should perform object-based rather than pixel-based computation. Many of
our previously published systems performed all computation from a pixel-based perspective. We have more recently presented a one-dimensional object-based selective attention system. We have now extended this object-based processing paradigm to a two-dimensional system.
Each of these systems is implemented using subthreshold analog VLSI circuits. In
addition to the similarity of the underlying physics of neural systems and subthreshold MOS
operation, analog VLSI has a number of advantages over other technologies from a purely engi-
neering standpoint. The dense integration of current MOS technology allows on-chip imple-
mentation of massively parallel architectures, while subthreshold operation lends itself to very
low power dissipation. Ordinarily, with subthreshold systems, speed limitations are a major
concern, but due to the parallel operation of many processing elements, or PEs, we are able to
achieve continuous, real-time operation.
In this paper, we present a two-dimensional object-based selective attention system realized
as a focal plane array. It demonstrates the ability to localize the most salient regions in an image
and perform a biologically inspired scanning algorithm that sequentially selects areas for further
processing, based on relative saliency. It is comprised of an array of processing elements, in
which each PE contains circuits implementing a winner-take-all selection with object segmenta-
tion, excitatory feedback, and inhibitory feedback. The winner-take-all computation selects a
single region of interest for further processing; this spotlight of attention enables the visual pro-
cessing system to perform complex computations on only a small region of the visual field. The
excitatory feedback provides hysteresis (for noise immunity) in the selection, and the inhibitory
feedback forces a shift of attention even when the visual field is unchanging.
Many focal-plane processing arrays developed in analog VLSI use local, pixel-based computation, while others compute a single global measure for the entire visual field. Our system occupies the middle ground: we use a dynamic wires approach to sepa-
rate regions of the visual field into multiple-pixel sections, so that global processing techniques
can be applied to smaller regions of the visual field. This enables our system to segment objects
from each other and from the background, so that all pixels within an object can cooperate,
while each object competes against the others for the spotlight of attention. This object segmentation is dynamic, and is driven solely by the segmentation criterion at the input.
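The dynamic-wires grouping can be sketched in software as connected-component labeling over the binary segmentation map. This is an illustrative model only (the function and variable names are ours); on the chip itself, the grouping is performed by analog switches between neighboring processing elements rather than by any explicit labeling pass.

```python
# Software sketch of the "dynamic wires" grouping idea: pixels whose
# segmentation bit is set are merged with set neighbors, so that each
# contiguous group can be treated as a single processing unit.

def group_pixels(mask):
    """Return a label grid: 4-connected regions of truthy cells share a label."""
    rows, cols = len(mask), len(mask[0])
    labels = [[None] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and labels[r][c] is None:
                # Flood-fill one contiguous object.
                stack = [(r, c)]
                labels[r][c] = next_label
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and labels[ny][nx] is None):
                            labels[ny][nx] = next_label
                            stack.append((ny, nx))
                next_label += 1
    return labels

mask = [[1, 1, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 1]]
labels = group_pixels(mask)
```

With this mask, the L-shaped group in the upper left forms one object and the two pixels in the right-hand column form another; each group can then cooperate internally while competing against the other.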
A system diagram appears in Figure 1, along with the effect of each stage of processing on a
hypothetical two-dimensional array of inputs.
2: Algorithms and architecture
In the design of two-dimensional focal plane processing arrays, there are at least three com-
peting design goals. First, the physical dimensions of each processing element define the lowest
possible sampling period, which should be minimized to account for higher spatial frequencies
in the input image. Second, the ratio of sensing area to processing area (fill factor) is critical; if
the processing area is significantly greater than the sensing area, then important image features are more likely to be imaged onto areas that contain no photodetectors. Both of these factors mitigate against the third: complex processing typically requires many transistors and large areas of silicon. This system implements a fairly complex algorithm that requires much more area for processing than for light gathering. To minimize the deleterious effects of this low fill factor on the sampling period, we implemented a multiresolution architecture, combining higher-resolution sampling of the input image with lower-resolution processing.

Figure 1 Description of the object-based selection system architecture and operation. A single processing element (pixel) is shown in (a). The input to each computation is an analog current. The computations include aggregation, normalization, filtering, segmentation, and object-based selection. Communication to the neighboring elements is necessary for the spatial lowpass filtering, segmentation-based saliency assignment, and the object-based selection. The digital output from the thresholding operation is used to control the communication (dynamic wires) for the saliency assignment and object-based selection operations. (As implemented, both the saliency assignment and object-based selection steps share the same set of dynamic wires to perform both functions, not two separate sets as drawn.) The effect of each stage of processing on a hypothetical two-dimensional array of inputs is shown in (b).
The selective attention algorithm implemented by this system includes six major compo-
nents: (1) saliency map generation, (2) object segmentation, (3) object-based selection, (4)
recurrent excitation or hysteresis of selection, (5) inhibition of return or automatic scanning, and
(6) output encoding. Each component will be discussed in turn.
2.1: Saliency map generation
Incident light is first transduced by a regular array of photodetectors, at a higher spatial reso-
lution than that of the processing elements. Therefore, each processing element aggregates the
output of a number of photodetectors. This multiresolution architecture allows us to achieve an
acceptable sampling rate and fill factor. The higher resolution sampling of the input image is
necessary so that once an object is selected, enough detailed information exists to allow mean-
ingful post-processing, such as pattern recognition or discrimination. The selective attention
system is considered part of the “where” pathway, and does not require such high resolution.
Thus, each processing element contains four photodetectors, but only a single selective atten-
tion pixel. The outputs of these four photodetectors are aggregated for selective attention processing, halving the resolution for attention processing.

Figure 2 Multiresolution architecture. The design of the two-dimensional selective attention system uses a multiresolution scheme. The attention circuitry relies on a lower-resolution image consisting of four photodetector inputs per processing element. The scanning circuits, which occupy two edges of the array, access both the higher-resolution photodetector values and the lower-resolution selection values. The position encoding circuits occupy the other two edges of the array.

A diagram of this arrangement appears
in Figure 2. The aggregate values are then normalized across the entire visual field, to generate
a two-dimensional array of values representing incoming saliency at each processing element.
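As a software sketch, the aggregation and normalization stages might look as follows. It assumes the stated four photodetectors (one 2×2 block) per processing element; the normalization target of 1.0 is an arbitrary illustrative choice, since the chip normalizes analog currents rather than abstract values.

```python
# Sketch of the aggregation and normalization stages: each processing
# element sums a 2x2 block of photodetector outputs, and the resulting
# lower-resolution map is normalized across the entire field.

def aggregate_2x2(photo):
    """Sum non-overlapping 2x2 blocks of a photodetector array."""
    return [[photo[2 * r][2 * c] + photo[2 * r][2 * c + 1] +
             photo[2 * r + 1][2 * c] + photo[2 * r + 1][2 * c + 1]
             for c in range(len(photo[0]) // 2)]
            for r in range(len(photo) // 2)]

def normalize(grid):
    """Scale all values so they sum to 1.0 across the whole field."""
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]

photo = [[1, 1, 0, 0],
         [1, 1, 0, 2],
         [0, 0, 3, 3],
         [0, 0, 3, 3]]
saliency = normalize(aggregate_2x2(photo))
```

The 4×4 photodetector array collapses to a 2×2 saliency map whose entries are relative, not absolute, intensities; normalization is what lets objects compete on relative saliency independent of overall illumination.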
In general, a saliency map is constructed from several different feature maps, where each fea-
ture map is sensitive to image features that are important in determining saliency. These features
include image intensity, temporal derivatives, spatial derivatives/spatial frequencies, and color,
among others. The choice of features and their relative weighting when constructing the
saliency map is task dependent. In this implementation, however, incident image intensity alone
determines saliency†. The addition of the necessary circuitry to detect more complicated fea-
tures would have reduced the fill factor of the system to an unreasonable level.
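The general construction described above amounts to a pointwise weighted sum of feature maps. The feature names and weights below are hypothetical stand-ins; as noted, the fabricated chip uses image intensity alone.

```python
# General saliency-map construction as a task-dependent weighted sum of
# feature maps. Feature names and weights here are hypothetical; this
# chip's saliency map is driven by image intensity only.

def combine_features(feature_maps, weights):
    """Pointwise weighted sum of equally sized 2D feature maps."""
    first = next(iter(feature_maps.values()))
    rows, cols = len(first), len(first[0])
    out = [[0.0] * cols for _ in range(rows)]
    for name, fmap in feature_maps.items():
        w = weights[name]
        for r in range(rows):
            for c in range(cols):
                out[r][c] += w * fmap[r][c]
    return out

maps = {"intensity": [[1.0, 0.0], [0.0, 1.0]],
        "motion":    [[0.0, 2.0], [0.0, 0.0]]}
salience = combine_features(maps, {"intensity": 1.0, "motion": 0.5})
```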
2.2: Object segmentation
The saliency map is then passed through a spatial lowpass filter, which is used to suppress
outliers in the saliency map and emphasize larger contiguous regions. After filtering, each pixel
value is compared to a global threshold. This thresholding results in an array of binary values
that are passed to later stages to control dynamic connections between neighboring PEs. Those
pixels whose value is above the threshold qualify as candidates for inclusion within an object;
an object is defined as a contiguous group of object candidates. Each object is assigned a per-
pixel saliency value equal to the average of all candidate pixels within it. Individual candidate pixels that do not adjoin any other candidate are treated as “objects of one pixel,” with the option that their saliency
value can be suppressed. This can be used to ensure that only multi-pixel objects may compete
for the spotlight of attention.
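A software sketch of the filtering and thresholding steps follows, using a 3×3 box average as a stand-in for the chip's analog spatial lowpass filter; the threshold value and input grid are illustrative. Note how the isolated bright pixel is suppressed while the contiguous region largely survives.

```python
# Sketch of the segmentation front end: a spatial lowpass filter (a 3x3
# box average here, standing in for the analog filter network) followed
# by a global threshold. The resulting binary map marks the object
# candidates that control the dynamic wires in later stages.

def box_filter(grid):
    """3x3 box average; border pixels average over in-bounds cells only."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc, n = 0.0, 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += grid[rr][cc]
                        n += 1
            out[r][c] = acc / n
    return out

def candidates(grid, threshold):
    """Binary map: pixels whose filtered value exceeds the threshold."""
    return [[v > threshold for v in row] for row in box_filter(grid)]

saliency = [[9.0, 0.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 6.0, 6.0],
            [0.0, 0.0, 0.0, 6.0, 6.0]]
mask = candidates(saliency, 3.0)
```

The outlier at the upper left is averaged down below the threshold, while the 2×2 region on the right remains above it; per-object mean saliency would then be assigned over each contiguous group of `True` cells.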
2.3: Object-based selection
Once the objects have been segmented, all pixels vie for attention in a combined cooperative/
competitive manner. Within each object, the results of the threshold computation are used to
control dynamic connections, coupling specific nodes in the pixel-based selection system, such
that the collection of processing elements included in an object act as a single selection process-
ing element. In this way, all of the pixels within an object act cooperatively as a single unit.
These collections then compete against each other and against the lone, non-object pixels. Usu-
ally, the lone pixels cannot win this competition — the post-filtering value of most lone pixels is
lower than the threshold, while the value of all pixels within an object group (both before and
after averaging) is above that threshold. Time-dependent effects described below can affect this,
however, and allow the occasional lone pixel to obtain the spotlight of attention.
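Abstractly, the object-based competition reduces to choosing the group with the greatest per-pixel (mean) saliency. The sketch below models each object as the list of its candidate-pixel saliencies; this bookkeeping is ours, whereas on the chip the coupled pixels of an object compete directly as a single winner-take-all node.

```python
# Sketch of object-based winner-take-all: each contiguous group of
# candidate pixels competes with one value (its mean saliency), and
# the group with the largest value obtains the spotlight of attention.

def winner_take_all(objects):
    """objects maps an object id to the saliency values of its pixels;
    returns the id whose mean saliency is largest."""
    return max(objects, key=lambda k: sum(objects[k]) / len(objects[k]))

objects = {
    "A": [5.0, 6.0, 7.0],  # mean 6.0
    "B": [8.0, 5.0],       # mean 6.5 -> wins
    "C": [1.0],            # lone pixel with low post-filter value
}
selected = winner_take_all(objects)
```

Because each object competes with its mean rather than its total, a small bright object can beat a large dim one; lone pixels like "C" typically lose because filtering leaves them below the values of grouped pixels.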
2.4: Recurrent excitation and hysteresis
Small differences in the saliency values of each object, coupled with noise inherent in real-
world systems, can cause unpredictable oscillations between closely matched objects. Since we
wish to provide a stable selection to allow later processing stages adequate time to process a
selected object, we have included a local positive feedback element at each pixel. When an
object is selected, this recurrent excitation of each pixel within the object is activated. Thus, for
a competing and closely matched object to obtain the spotlight of attention, its saliency must
exceed that of the currently selected object plus the added feedback. This provides the necessary
hysteresis for stability in the presence of noise. Additionally, we have the capability to spatially
†. Second-order effects due to the use of adaptive photoreceptors introduce some spatiotemporal dependency.