A Two-Dimensional, Object-Based Analog VLSI
Visual Attention System
Charles S. Wilson, Tonia G. Morris†, and Stephen P. DeWeerth
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, GA 30332-0250
A two-dimensional object-based analog VLSI model of selective attentional processing has been
implemented using a standard 1.2µm CMOS process. This chip extends previous work modeling
object-based selection and scanning by incorporating the circuitry and architectural changes
necessary for two-dimensional focal plane processing. To balance the need for closely spaced,
large photodetectors with the space requirements of complex in-pixel processing, the chip
implements a multiresolution architecture. The system has the ability to group pixels into
objects; this grouping is dynamic, driven solely by the segmentation criterion at the input. In the
demonstration system, image intensity has been chosen for the input saliency map and the seg-
mentation is based on spatial lowpass filtering followed by an intensity threshold. We present
experimental results demonstrating the operation of the fabricated system.

1: Introduction

One of the major hurdles in solving visual processing problems is the sheer quantity of infor-
mation that must be processed in real time. The amount of processing circuitry and wiring neces-
sary to process this information completely in parallel is prohibitive. In both engineered and
biological systems, such computational resources are rarely available and are costly in terms of
power, space, and reliability. One option when dealing with this tradeoff is to subdivide the
image data in both space and time, thus greatly reducing the computational resources necessary.
Biological vision systems demonstrate an excellent example of balancing processing perfor-
mance needs and implementation limitations. Vertebrate visual systems have a hybrid architec-
ture, where low-level processing is performed across the entire visual field in parallel, and
higher-level processing is performed on selected subregions in sequence. Low-level tasks
computed in parallel are called preattentive. The higher-level, serially-computed tasks are post-
attentive. It is the transition from preattentive to postattentive processing that comprises selec-
tive attention. Selective attention is not limited to visual processing alone, although vision
provides the most obvious application. Koch and Ullman introduced a theory of selective
shifts of attention based upon a saliency map. The saliency map provides a measure of lev-
els of interest across the visual field. According to this theory, the levels of interest are com-
puted through the integration of features that are detected preattentively, such as color,
orientation, spatial derivatives, temporal derivatives, and motion. Selective visual attention is
one method utilized by biological systems to focus limited processing resources strategically
while maintaining full coverage of the visual scene. The saliency map interpretation provides a
convenient framework for implementation.
†. Presently with Intel Corporation, MS CH6-438, 5000 W. Chandler Blvd., Chandler, AZ 85226-3699.
In a general sense, the preattentive and postattentive processing pathways can be physically
traced within the brain and are often referred to as the “where” (magnocellular or dorsal) path-
way, and the “what” (parvocellular or ventral) pathway. The “where” pathway is used to
identify the locations of various features and objects within the visual field, and operates on the
entire field in parallel, at a lower resolution. The “what” pathway is used to determine the iden-
tity of various objects at the locations determined by the “where” pathway. It operates sequen-
tially on selected subregions of the visual field, usually at the higher resolution provided by the
foveal area of the retina. A simple view of how the two pathways interact is that the “where”
pathway localizes regions of interest so that the corresponding information can be routed within
the “what” pathway to perform discrimination and recognition tasks. Thus, biological vision
systems provide an example not only of the algorithms of selective attention, but also of the
architecture to be used.
Visual processing tasks often require that the selected region of attention correspond to one
or more objects, since the visual scene is constructed from objects. Therefore, the selective
attention processing should perform object-based rather than pixel-based computation. Many of
our previously published systems performed all computation from a pixel-based perspective.
We have more recently presented a one-dimensional object-based selective attention system.
We have now extended this object-based processing paradigm to a two-dimensional system.
Each of these systems is implemented using subthreshold analog VLSI circuits. In
addition to the similarity of the underlying physics of neural systems and subthreshold MOS
operation, analog VLSI has a number of advantages over other technologies from a purely engi-
neering standpoint. The dense integration of current MOS technology allows on-chip imple-
mentation of massively parallel architectures, while subthreshold operation lends itself to very
low power dissipation. Ordinarily, with subthreshold systems, speed limitations are a major
concern, but due to the parallel operation of many processing elements, or PEs, we are able to
achieve continuous, real-time operation.
In this paper, we present a two-dimensional object-based selective attention system realized
as a focal plane array. It demonstrates the ability to localize the most salient regions in an image
and perform a biologically inspired scanning algorithm that sequentially selects areas for further
processing, based on relative saliency. It comprises an array of processing elements, in
which each PE contains circuits implementing a winner-take-all selection with object segmenta-
tion, excitatory feedback, and inhibitory feedback. The winner-take-all computation selects a
single region of interest for further processing; this spotlight of attention enables the visual pro-
cessing system to perform complex computations on only a small region of the visual field. The
excitatory feedback provides hysteresis (for noise immunity) in the selection, and the inhibitory
feedback forces a shift of attention even when the visual field is unchanging.
Many focal-plane processing arrays developed in analog VLSI use local, pixel-based compu-
tation, while others compute a single global measure for the entire visual field. Our system
occupies the middle ground: we use a dynamic wires approach to sepa-
rate regions of the visual field into multiple-pixel sections, so that global processing techniques
can be applied to smaller regions of the visual field. This enables our system to segment objects
from each other and from the background, so that all pixels within an object can cooperate,
while each object competes against the others for the spotlight of attention. This object segmen-
tation is dynamic, and is driven solely by the segmentation criteria at the input.
A system diagram appears in Figure 1, along with the effect of each stage of processing on a
hypothetical two-dimensional array of inputs.
2: Algorithms and architecture
In the design of two-dimensional focal plane processing arrays, there are at least three com-
peting design goals. First, the physical dimensions of each processing element define the lowest
possible sampling period, which should be minimized to account for higher spatial frequencies
in the input image. Second, the ratio of sensing area to processing area (fill factor) is critical; if
Figure 1 Description of the object-based selection system architecture and
operation. A single processing element (pixel) is shown in (a). The input to
each computation is an analog current. The computations include aggrega-
tion, normalization, filtering, segmentation, and object-based selection.
Communication to the neighboring elements is necessary for the spatial
lowpass filtering, segmentation-based saliency assignment, and the object-
based selection. The digital output from the thresholding operation is used
to control the communication (dynamic wires) for the saliency assignment
and object-based selection operations. (As implemented, both the saliency
assignment and object-based selection steps share the same set of
dynamic wires to perform both functions, not two separate sets as drawn).
The effect of each stage of processing on a hypothetical two-dimensional
array of inputs is shown in (b).
the processing area is significantly greater than the sensing area, then important image features
are more likely to be imaged onto areas that contain no photodetectors. Both of these factors
militate against the third: complex processing typically requires many transistors and large
areas of silicon. This system implements a fairly complex algorithm that requires much more
area for processing than for light gathering. To minimize the deleterious effects of this low fill
factor on the sampling period, we implemented a multiresolution architecture, combining higher
resolution sampling of the input image with lower resolution processing.
The selective attention algorithm implemented by this system includes six major compo-
nents: (1) saliency map generation, (2) object segmentation, (3) object-based selection, (4)
recurrent excitation or hysteresis of selection, (5) inhibition of return or automatic scanning, and
(6) output encoding. Each component will be discussed in turn.
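The six components can be sketched, in highly simplified software form, as one iteration of a selection loop. The sketch below is our own illustration, not the chip's continuous analog dynamics: the function names, the smoothing kernel, and the numeric constants are all assumptions chosen for clarity.

```python
import numpy as np

def lowpass(s):
    # Center-weighted 4-neighbor smoothing (a stand-in for the chip's
    # resistive-network spatial filter).
    p = np.pad(s, 1, mode="edge")
    n4 = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
    return (4.0 * s + n4) / 8.0

def attention_step(saliency, excite, inhibit, thresh=0.5, gain=0.2, leak=0.02):
    smoothed = lowpass(saliency)                 # (2) segmentation filter
    mask = smoothed > thresh                     # binary object candidates
    score = smoothed + excite - inhibit          # (1) input, (4) and (5) feedback
    score = np.where(mask, score, -np.inf)       # only candidates compete
    winner = np.unravel_index(np.argmax(score), score.shape)  # (3) selection
    new_excite = np.zeros_like(saliency)
    new_excite[winner] = gain                    # (4) hysteresis for next step
    inhibit = inhibit * (1.0 - leak)             # losers recover
    inhibit[winner] += leak                      # (5) inhibition of return
    return winner, new_excite, inhibit           # (6) winner position as output
```

Repeated calls to `attention_step` with a static input produce the scanning behavior discussed in Section 2.5.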
2.1: Saliency map generation
Incident light is first transduced by a regular array of photodetectors, at a higher spatial reso-
lution than that of the processing elements. Therefore, each processing element aggregates the
output of a number of photodetectors. This multiresolution architecture allows us to achieve an
acceptable sampling rate and fill factor. The higher resolution sampling of the input image is
necessary so that once an object is selected, enough detailed information exists to allow mean-
ingful post-processing, such as pattern recognition or discrimination. The selective attention
system is considered part of the “where” pathway, and does not require such high resolution.
Thus, each processing element contains four photodetectors, but only a single selective atten-
tion pixel. The outputs of these four photodetectors are aggregated for selective attention pro-
cessing, halving the resolution for attention processing. A diagram of this arrangement appears
in Figure 2. The aggregate values are then normalized across the entire visual field, to generate
a two-dimensional array of values representing incoming saliency at each processing element.

Figure 2 Multiresolution architecture. The design of the two-dimensional
selective attention system uses a multiresolution scheme. The attention cir-
cuitry relies on a lower resolution image consisting of four photodetector
inputs per processing element. The scanning circuits, which occupy two
edges of the array, access both the higher-resolution photodetector values
and the lower-resolution selection values. The position encoding circuits
occupy the other two edges of the array.
In general, a saliency map is constructed from several different feature maps, where each fea-
ture map is sensitive to image features that are important in determining saliency. These features
include image intensity, temporal derivatives, spatial derivatives/spatial frequencies, and color,
among others. The choice of features and their relative weighting when constructing the
saliency map is task dependent. In this implementation, however, incident image intensity alone
determines saliency†. The addition of the necessary circuitry to detect more complicated fea-
tures would have reduced the fill factor of the system to an unreasonable level.
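As a concrete software illustration of this construction, a saliency map can be formed as a weighted sum of normalized feature maps. The function and weighting scheme below are our own sketch, not part of the chip; the chip's special case, where intensity alone determines saliency, is shown at the end.

```python
import numpy as np

def saliency_map(feature_maps, weights):
    """Weighted sum of feature maps, each normalized to [0, 1].
    The choice of features and weights is task dependent."""
    total = np.zeros_like(next(iter(feature_maps.values())), dtype=float)
    for name, w in weights.items():
        f = feature_maps[name].astype(float)
        rng = f.max() - f.min()
        if rng > 0:
            f = (f - f.min()) / rng   # normalize this feature map
        total += w * f
    return total

# The chip's special case: an intensity-only saliency map.
intensity = np.array([[0.1, 0.9],
                      [0.2, 0.4]])
s = saliency_map({"intensity": intensity}, {"intensity": 1.0})
```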
2.2: Object segmentation
The saliency map is then passed through a spatial lowpass filter, which is used to suppress
outliers in the saliency map and emphasize larger contiguous regions. After filtering, each pixel
value is compared to a global threshold. This thresholding results in an array of binary values
that are passed to later stages to control dynamic connections between neighboring PEs. Those
pixels whose value is above the threshold qualify as candidates for inclusion within an object;
an object is defined as a contiguous group of object candidates. Each object is assigned a per-
pixel saliency value equal to the average saliency of all candidate pixels within it. Individual, non-
object candidate pixels are treated as “objects of one pixel,” with the option that their saliency
value can be suppressed. This can be used to ensure that only multi-pixel objects may compete
for the spotlight of attention.
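The segmentation step described above — threshold, contiguous grouping, per-object averaging, and optional lone-pixel suppression — can be mimicked in software as follows. This is an illustrative sketch, not the chip's analog implementation: it assumes the input has already been spatially lowpass filtered, and it uses 4-connectivity to define contiguity.

```python
import numpy as np
from collections import deque

def segment(saliency, thresh, suppress_lone=True):
    """Threshold, group contiguous candidates into objects (4-connectivity),
    and assign each object the mean saliency of its pixels."""
    mask = saliency > thresh
    labels = np.zeros(mask.shape, dtype=int)
    obj_saliency = np.zeros_like(saliency, dtype=float)
    next_label = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue                       # already assigned to an object
        next_label += 1
        labels[seed] = next_label
        q, members = deque([seed]), []
        while q:                           # flood fill one contiguous object
            r, c = q.popleft()
            members.append((r, c))
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                        and mask[nr, nc] and not labels[nr, nc]):
                    labels[nr, nc] = next_label
                    q.append((nr, nc))
        value = np.mean([saliency[m] for m in members])
        if suppress_lone and len(members) == 1:
            value = 0.0                    # optional "object of one pixel" suppression
        for m in members:
            obj_saliency[m] = value
    return labels, obj_saliency
```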
2.3: Object-based selection
Once the objects have been segmented, all pixels vie for attention in a combined cooperative/
competitive manner. Within each object, the results of the threshold computation are used to
control dynamic connections, coupling specific nodes in the pixel-based selection system, such
that the collection of processing elements included in an object act as a single selection process-
ing element. In this way, all of the pixels within an object act cooperatively as a single unit.
These collections then compete against each other and against the lone, non-object pixels. Usu-
ally, the lone pixels cannot win this competition — the post-filtering value of most lone pixels is
lower than the threshold, while the value of all pixels within an object group (both before and
after averaging) is above that threshold. Time-dependent effects described below can affect this,
however, and allow the occasional lone pixel to obtain the spotlight of attention.
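In software terms, the cooperative/competitive step reduces to a winner-take-all over the per-object averaged saliencies produced by segmentation. The sketch below is an idealization of ours: exact ties would select multiple objects here, whereas the analog circuit resolves ties through noise and hysteresis.

```python
import numpy as np

def winner_take_all(object_saliency):
    """Select the single most salient object: every pixel of the winning
    object is marked, and all other pixels are suppressed."""
    winner_value = object_saliency.max()
    if winner_value <= 0:
        return np.zeros(object_saliency.shape, dtype=bool)
    # All pixels of an object share one averaged value, so the whole
    # object wins or loses as a single unit.
    return object_saliency == winner_value

obj = np.array([[0.8, 0.8, 0.0],
                [0.0, 0.0, 0.6],
                [0.0, 0.0, 0.6]])
selected = winner_take_all(obj)
```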
2.4: Recurrent excitation and hysteresis
Small differences in the saliency values of each object, coupled with noise inherent in real-
world systems, can cause unpredictable oscillations between closely matched objects. Since we
wish to provide a stable selection to allow later processing stages adequate time to process a
selected object, we have included a local positive feedback element at each pixel. When an
object is selected, this recurrent excitation of each pixel within the object is activated. Thus, for
a competing and closely matched object to obtain the spotlight of attention, its saliency must
exceed that of the currently selected object plus the added feedback. This provides the necessary
hysteresis for stability in the presence of noise. Additionally, we have the capability to spatially
lowpass filter the local positive feedback. This filtering has the effect of enhancing the potential
for selection of pixels that are not part of the currently selected object but are near its borders,
which can improve tracking of objects in motion.
†. Second-order effects due to the use of adaptive photoreceptors introduce some spatiotemporal dependency.
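The hysteresis rule amounts to a simple inequality: a challenger takes over only when its saliency exceeds the current winner's saliency plus the feedback bonus. A minimal sketch, where the function name and the bonus value are our own illustrative choices:

```python
def select_with_hysteresis(saliencies, current_winner, feedback=0.1):
    """The currently selected object keeps the spotlight unless a challenger
    exceeds its saliency plus the recurrent-excitation bonus."""
    boosted = list(saliencies)
    if current_winner is not None:
        boosted[current_winner] += feedback   # recurrent excitation of winner
    return max(range(len(boosted)), key=boosted.__getitem__)
```

A closely matched challenger (0.55 vs. 0.50) cannot unseat the winner, but one that clears the bonus (0.65) can.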
2.5: Inhibition of return
If our system implemented only the object-based selection with hysteresis, it would be possi-
ble for a single object to monopolize the spotlight of attention, causing the rest of the visual
field to be neglected. Once an object has been attended for a sufficient time, we should force the
selection to shift to another object. To accomplish this, we have included a time-delayed nega-
tive feedback element within each pixel. However, once an object becomes unselected, after a
sufficient time it should be allowed to regain the spotlight of attention. In biological systems,
this type of behavior is observed in the visual system, and is called inhibition of return. The two
time constants, which relate to maximum dwell time and minimum avoidance time, together
develop an interesting emergent behavior: automatic, or smart, scanning among objects in the
visual field, even if the visual field is unchanging.
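The interaction of the two time constants can be illustrated with a toy discrete-time simulation; the rate constants below are arbitrary choices of ours, not chip values. Even with a static input, the rising inhibition eventually forces the selection off the strongest object:

```python
def scan(saliencies, steps=200, feedback=0.05, rise=0.02, decay=0.01):
    """Leaky-integrator inhibition of return: the winner's inhibition rises
    until it loses (maximum dwell time); losers' inhibition decays until
    they can win again (minimum avoidance time)."""
    inhibition = [0.0] * len(saliencies)
    winner, visits = None, []
    for _ in range(steps):
        score = [s - i for s, i in zip(saliencies, inhibition)]
        if winner is not None:
            score[winner] += feedback              # hysteresis from Sec. 2.4
        winner = max(range(len(score)), key=score.__getitem__)
        visits.append(winner)
        for k in range(len(inhibition)):
            if k == winner:
                inhibition[k] += rise              # bounds the dwell time
            else:
                inhibition[k] = max(0.0, inhibition[k] - decay)  # recovery
    return visits

visits = scan([1.0, 0.8, 0.6])   # a static scene with three objects
```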
2.6: Output encoding
At the final stage of processing, the position within the visual field of the selected object is
encoded by calculating its centroid. Thus, we are able to perform an immense data reduction
task: from a large array of analog saliency values, we compute, in real time, the X and Y coordi-
nates of the centroid of the selected object. With the assumption that most objects can be con-
tained within a bounding circle of moderate radius, these two data values are enough to specify
the region of the visual field which must be processed further, at any given time. Of course, this
assumption fails for objects which occupy a large percentage of the visual field, but limitations
in the amount of hardware available, whether electronic or neural, preclude processing such a
large object all at once in any case. The assumption also fails for objects with extreme aspect
ratios (“tall and skinny” or “short and wide”). In these cases, it would be better to pass along the
pixel information of all pixels within the given object, rather than the coordinates of just the
center of the object.
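The data reduction from a selected-region mask to a single (X, Y) pair is an ordinary centroid; a minimal numpy sketch of our own:

```python
import numpy as np

def centroid(selected):
    """Reduce a selected-object mask to one (x, y) position."""
    rows, cols = np.nonzero(selected)
    return cols.mean(), rows.mean()   # X from column indices, Y from row indices

mask = np.zeros((5, 5), dtype=bool)
mask[1:3, 2:4] = True                 # a 2x2 selected object
x, y = centroid(mask)
```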
We also encode the output of each pixel using a dual-scanning technique based on the multi-
sync scanners developed by Mead and Delbruck. By scanning both the high resolution
photodetector values and the object selection values from each PE on two different channels, we
are able to view the operation of this system in real time on a multisync monitor.
3: Circuits

The processing elements contain analog, current-mode, subthreshold circuits that perform the
object-based attentive selection processing. There are three major processing circuits within
each PE: four adaptive photoreceptors, an object segmentation circuit, and an object-based
selection circuit. The adaptive photoreceptors facilitate operation under a wide range of
ambient light intensity. The object segmentation circuit combines the output from the four pho-
toreceptors, and generates a preliminary segmented saliency map. The object-based selection
circuit modulates this preliminary saliency with excitatory and inhibitory feedback, performs
the actual selection, and contains the circuitry necessary for interfacing to the position encoding
and video scanning output systems.
3.1: Object segmentation
The circuit used for segmentation appears in Figure 3. The currents from the four photode-
tectors are summed, and this aggregated current is used as the input to the normalization com-
putation. The normalization is necessary to ensure that the input current to later processing
stages is within a set range of subthreshold values. Transistors within each pixel are used to
compute the normalized value, where a single transistor for the entire array sets the sum of the
normalized currents via a global bias voltage. This circuit is merged with a current-mode
resistive network that performs spatial lowpass filtering. Transistors in each pixel implement
the lateral resistance of this network, and a global voltage controls the amount of smoothing
by changing the lateral resistance.

Figure 3 Object segmentation circuit. The linear normalization computation
is performed through the combination of transistors in each pixel and a glo-
bal current set by a bias voltage. The filtering is implemented by a current-
mode resistive network. A high-gain stage is used to compare the output of
the filter to a constant threshold value, which is set by a global voltage.
Additional transistors serve to implement masking: pixels whose values are
below the threshold optionally can be blocked from participating in the
object selection process. This behavior is controlled by a global masking
voltage.

Two copies of the lowpass filtered and normalized current I_lpf(m,n) are mirrored out of the
filter. A high-gain comparator stage compares the first copy of I_lpf(m,n) to the threshold value
set by the global voltage. A global signal allows optional masking of local saliency values that
are less than the threshold. If the masking signal is high, then the masking transistor is controlled
by the result of the comparison: when I_lpf(m,n) is less than the threshold, the binary output
V_bin(m,n) is low, the transistor is off, and no current is passed on; if I_lpf(m,n) is greater than
the threshold (V_bin(m,n) is high), then the transistor is on and the second copy is passed on as
the pixel saliency current I_pix(m,n). If the masking signal is low, then the transistor is always
on regardless of the result of the comparison. In any case, I_pix(m,n) and V_bin(m,n) are sent
as inputs to the next stage of processing.

The combination of filtering and thresholding directly affects the spatial extent and average
saliency of the segmented objects. An increase in the amount of smoothing will emphasize
objects of larger spatial extent, while also reducing the average post-segmentation saliency of
those objects. This result arises because objects of small extent will be blurred below the thresh-
old, thus emphasizing larger objects, but those larger objects will have a correspondingly
smaller volume left above the threshold, thus having a smaller average post-segmentation
saliency.
3.2: Object-based selection
Since the object-based selection circuit contains several functional blocks, we will describe
these blocks each in turn. The first block includes the basic selection circuitry, the positive feed-
back network, and the output interface circuitry, while the second block encompasses the time-
dependent negative feedback system. We will first describe these blocks in terms of single pixel
operation, and will describe the object-based extensions during the discussion of the third block,
which implements the object-based processing. A diagram of the first block appears in Figure 4.
The basic selection is performed by a winner-take-all network formed by a pair of transistors
in each processing element, and by a single bias transistor for the entire array. The bias transistor
provides the bias current for the winner-take-all network, and is controlled by a global voltage.
The common node of the network is shared by all pixels in the array. When a particular pixel is
winning, the voltage at its local node goes high, while the corresponding node at the losing
pixels goes low. This node is not a digital signal; in this context, “high” means that the voltage
is about two threshold voltages above ground, while “low” means that the voltage is low enough
to force the output transistor into the triode region.

The positive feedback is implemented by several additional transistors in each pixel. When
the local pixel is winning and its node is high, a feedback transistor turns on. This allows a
current, set by a global bias voltage, to flow through a current mirror. Two further transistors
form another current-mode resistive network, which diffuses this feedback current to neighbor-
ing pixels. The characteristic length of this resistive network is set by a global voltage; depend-
ing on its value, a substantial fraction of the current is mirrored to form the positive feedback
current I_e(m,n). Three more copies of I_e(m,n) form the three outputs of each pixel: the cur-
rents flowing through the first two are used to compute the X and Y coordinates of the centroid
of the selected object, respectively, and the current flowing through the third is used to generate
the video signal for observing the pattern of selected pixels on a multisync monitor. This video
current is gated according to the multisync scanning algorithm described earlier. We chose to
use the hysteretic feedback current I_e(m,n) for the output signals; by adding an extra transistor,
we could have used the output of the winner-take-all computation directly.

Figure 4 Positive (excitatory) feedback portion of the object-based selection
circuit. Two transistors in each pixel form a winner-take-all network biased
by a single transistor for the entire array. When a pixel wins, its internal
node goes high, allowing current to flow through the feedback transistor.
This current is spatially lowpass filtered by a current-mode resistive net-
work, and the result is mirrored to create the excitatory positive feedback
current I_e(m,n). Three other copies of this current are created, to be used
in computing the centroid of the selected object and for scanning out to a
multisync monitor. The positive feedback portion of the circuit is darkened
for clarity.

The second block, which implements the time-dependent negative feedback system, appears
in Figure 5. When a particular pixel is winning, its local node goes high and turns on a transistor
that allows a current, set by a global voltage, to flow through a current mirror. This mirrored
charging current I_chg charges a capacitor C, raising the local inhibition control voltage
V_inh(m,n). The capacitor is also subject to a continual discharge current I_dis, set by another
global voltage; this discharge current must be smaller than the charging current for proper
operation. Using the approximation that the transistors act roughly like current sources, the
capacitor charges at the rate given in Eq. (1) when the pixel is winning,

dV_inh(m,n)/dt = (I_chg − I_dis) / C, (1)

and discharges at the rate given in Eq. (2) when the pixel is losing:

dV_inh(m,n)/dt = −I_dis / C. (2)

The local inhibition control voltage V_inh(m,n) drives a transistor that subtracts current from
the input node: as the control voltage rises, the current subtracted from the input node increases
exponentially. As the control voltage falls, the subtracted current decreases similarly. Thus,
when a particular pixel is winning, its local inhibition gradually grows until it is no longer the
winner, thus limiting the maximum time that any given pixel/object may hold the spotlight of
attention. Once the pixel is no longer winning, the local inhibition gradually falls, allowing the
pixel to recover; this recovery time sets the minimum time the spotlight of attention avoids a
previously winning pixel/object.
Figure 5 The negative (inhibitory) feedback portion of the object-based
selection circuit. A small network of transistors controls the voltage on the
capacitor. When the local node is high (this object is selected), the voltage
on the capacitor increases at a rate proportional to the charging current.
Otherwise, the voltage on the capacitor decreases at a rate proportional to
the discharge current. This voltage controls the inhibitory current I_inh(m,n)
through the inhibition transistor. The portion of the circuit implementing the
negative feedback is darkened for clarity.

The third block, which implements the object-based processing, is depicted in Figure 6. In
this block, the binary threshold output voltage V_bin(m,n) is used to couple the winner-take-all
node of a pixel to the corresponding node of its neighboring pixels. If a given pixel is part of an
object (i.e., its filtered and normalized input exceeds the threshold current), the node couples
with one or more of its neighbors only if that neighbor is also part of an object. This adaptively
controlled coupling is an example of the dynamic wires approach. Pairs of pass transistors
implement the coupling, where each transistor performs one half of a logical AND operation
with its neighbor pixel to make a composite switch. When the AND condition is true, both
transistors conduct and the dynamic wire is formed.
Figure 6 The object-based selection circuit. Pass transistors implement
the dynamic wires. They are extended to the four neighboring processing
elements in order to connect all processing elements within the same
object. By doing so, they allow the individual currents representing pixel
saliency to be averaged within a single object, and ensure that all pixels
within an object act as a single competitor in the winner-take-all. The por-
tion of the circuit implementing the dynamic wires is darkened for clarity.

The effective shorting of this node in one pixel to its neighbors causes the winner-take-all
input transistors of each pixel to be connected in parallel; the input transistors within an object
then operate as a single transistor with a larger aspect ratio. This allows each object, or
contiguous group of interconnected pixels, to be assigned a uniform saliency value equal to
the average of the component pixels’ I_pix(m,n) currents. By joining the input nodes, the
I_pix(m,n) values of all pixels in the object are summed. This current is then passed through the
winner-take-all input transistors in parallel. Thus, the input values are averaged over the extent
of each object. Similar reasoning applies to the output transistors: the winner-take-all output
transistors of all pixels within a single object are connected in parallel. Therefore, the output
current through each of the output transistors of the winning object equals the winner-take-all
bias current divided by the number of pixels in the object. Since both of the winner-take-all
transistors are connected in parallel with the matching transistors of all other pixels within a
single object, each object acts as a super-pixel: a single competitor in the winner-take-all
selection. The winner-take-all competition extends over the entire array, but the number of
competitors varies with the number of objects in the image.
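The averaging property follows from the subthreshold exponential law: N identical transistors sharing one gate node and sinking the summed current bias that node exactly where a single transistor would carry the mean current. A numeric check under an idealized saturation model, where the parameter values (I0, kappa, UT) are illustrative, not measured:

```python
import numpy as np

# Idealized subthreshold saturation model (assumed): I = I0 * exp(kappa*Vg/UT)
I0, kappa, UT = 1e-15, 0.7, 0.025

def gate_voltage(I_total, n_parallel):
    # Common gate voltage of the joined node when n identical transistors
    # in parallel must carry I_total.
    return UT / kappa * np.log(I_total / (n_parallel * I0))

pix = np.array([4e-9, 6e-9, 5e-9])   # per-pixel saliency currents in one object
Vg = gate_voltage(pix.sum(), len(pix))

# A single transistor biased at Vg carries the per-pixel *average* current:
I_eff = I0 * np.exp(kappa * Vg / UT)
```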
The operation of the positive and negative feedback circuits within this object-based frame-
work deserves some discussion. At each pixel within the object, all local winner-take-all nodes
are high, allowing a hysteretic current I_e(m,n) to flow. The feedback current is spread through
the resistive network, but since the hysteretic feedback currents flow from the object-common
node, they are averaged across the entire object along with the input currents. Due to the
spreading effect of the resistive network, adjacent unselected pixels receive a certain amount of
extra current. Within the negative feedback system, each capacitor in the selected object begins
charging at the same time, at the same rate. Within mismatch limits, and assuming the same
initial state, the local inhibition voltage V_inh(m,n) will be the same at any given time for all
pixels within the object; however, averaging compensates for both mismatch and differences in
initial state. All inhibition currents I_inh(m,n) within the object flow from the object-common
node, and are therefore averaged along with the input and hysteretic currents over the entire
object. When the total object input current, plus the total hysteretic current, minus the total
inhibition current for the currently selected object is no longer greater than a similar measure
for a different object, the currently selected object will lose, and a shift of the spotlight of
attention will occur.
3.3: Output encoding
The centroid computation for the selected object is separable in two dimensions, which
allows us to implement that computation using only M + N centroid computation elements and
M + N global communication wires, for an M × N array. Therefore, we compute the X coordi-
nate of the centroid using a linear array of centroid processors, arranged along a horizon-
tal edge of our selective attention processing array. Each pixel, if it is inside a selected object,
will contribute a fixed amount of current (via a transistor shown in Figure 6) to a wire that is
common to all PEs in a single column. The X centroid computation array will then receive inputs
whose values are equivalent to the number of selected PEs with the given X coordinate. There-
fore, the centroid of those inputs corresponds to the X centroid of the selected region. Similarly,
the Y coordinate of the centroid is computed using a linear array of centroid processors arranged
along a vertical edge of our selective attention processing array.
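In software, the separable computation looks like this: the 2-D mask collapses into per-column and per-row sums (the currents on the common wires), and each 1-D array then computes an ordinary centroid. A numpy sketch of our own:

```python
import numpy as np

def separable_centroid(selected):
    """Column/row sums feed two 1-D centroid arrays: M + N values replace
    M x N point-to-point communication."""
    col_sums = selected.sum(axis=0)   # one wire per column (X array input)
    row_sums = selected.sum(axis=1)   # one wire per row (Y array input)

    def centroid_1d(w):
        return (np.arange(len(w)) * w).sum() / w.sum()

    return centroid_1d(col_sums), centroid_1d(row_sums)

mask = np.zeros((4, 6), dtype=int)
mask[1:3, 2:5] = 1                    # the selected object
x, y = separable_centroid(mask)
```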
On the other two edges of the array, we implemented a dual-scanning framework so that
both the input image (individual photodetector values) and the selected object (outputs of each
processing element) can be viewed on a multisync monitor. Along each dimension an array of
shift registers is used to select one row and one column at a time. The horizontal shift registers
pass a bit from one register to the next according to the input clock frequency; in this system,
the clock frequency is 4MHz. As each horizontal position receives the active bit, the output cur-
rent for that column is switched onto a scanning line. This scanned current is sent to a sense
amplifier to generate the video signal. The output current for all other columns is switched to a
reference line that is used to hold the output of all pixels at a constant voltage. Extra shift regis-
ters at the end of the horizontal array allow for the horizontal blanking period, and are used to
generate the horizontal sync pulse. The horizontal sync pulse also controls the clocking of the
vertical shift registers, but we use a Johnson counter to divide this frequency. Therefore, each
row in the array is scanned a number of times before the vertical shift register is clocked. The
output of each vertical shift register is used to select an entire row, thus designating the single
current available for scanning at each column in the array. The vertical shift registers also
include a blanking period and the appropriate logic for generating a vertical sync pulse.
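The scan timing described above can be summarized with a little arithmetic. In the sketch below only the 4MHz pixel clock comes from the text; the column count, blanking length, and Johnson-counter divide ratio are assumed values chosen purely for illustration.

```python
# Rough timing sketch of the dual shift-register scanner described above.
# Each horizontal sync clocks the Johnson counter, which divides that rate
# before advancing the vertical shift register, so every row is scanned
# several times per vertical step.

PIXEL_CLOCK_HZ = 4_000_000   # horizontal shift-register clock (from text)
COLUMNS = 40                  # ASSUMED number of active columns
H_BLANK = 12                  # ASSUMED extra registers for horizontal blanking
JOHNSON_DIVIDE = 4            # ASSUMED: each row is scanned 4 times

line_rate_hz = PIXEL_CLOCK_HZ / (COLUMNS + H_BLANK)   # horizontal sync rate
row_clock_hz = line_rate_hz / JOHNSON_DIVIDE          # vertical shift clock
print(f"line rate: {line_rate_hz:.0f} Hz, row clock: {row_clock_hz:.0f} Hz")
```

The divide ratio trades vertical resolution on the monitor against the monitor's line rate, which is why the design repeats each row rather than clocking the vertical register on every sync.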
Additional circuitry is required to implement dual-scanning. The current switches along the
horizontal array are duplicated with an additional set of scanning and reference lines. Since the
two signals being scanned are of differing resolutions, we coordinate them by duplicating the
lower resolution (selection) signal to match the higher resolution (input image) frequency. The
number of horizontal shift registers is equal to the number of outputs of the higher resolution
signal, plus those necessary for the horizontal blanking period. The outputs of the vertical shift
registers are also duplicated; this is implemented by a single logic gate at each row of the pro-
cessing array. The video signal for both the input image and the selection output can be dis-
played on a monitor at the same time using two different colors, thus implementing a dual-
scanning system. Each video signal requires an external discrete amplifier .
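The resolution coordination described above amounts to sample replication. The sketch below assumes a replication factor of two, reflecting one processing element per 2×2 block of photodetectors; the data values are illustrative.

```python
# Matching a low-resolution scanned signal (the selection output, one value
# per processing element) to a high-resolution one (the photodetector image)
# by repeating each low-resolution sample, as described in the text above.

def replicate(low_res_row, factor=2):
    """Duplicate each low-resolution sample `factor` times so the two
    scanned signals share the same pixel clock."""
    return [v for v in low_res_row for _ in range(factor)]

selection_row = [0, 1, 0]        # one PE-resolution scan line
print(replicate(selection_row))  # [0, 0, 1, 1, 0, 0]
```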
4: Experimental results
The chip was implemented in a 1.2µm CMOS process through the MOSIS silicon brokerage
service. The system included 600 processing elements (2400 photodetectors) arranged in a
two-dimensional array, where each processing element contained four photodetectors and one
selection circuit cell.
Figure 7 The processing element layout for the object-based selective atten-
tion system. A single processing element includes four photoreceptor cells
and one selection circuit cell. The photodiodes are arranged on a rectangu-
lar sampling grid with a 133 λ (79.8µm) center-to-center spacing. The entire
cell is 266 λ (159.6µm) on a side.
Since each PE contained four photodetectors, the image sampling period was 133 λ (79.8µm).
The layout of a single processing element with its four photodetectors is shown in Figure 7. The
average power consumption of the two-dimensional processing array, the digital scanners, and
the pad circuitry was measured during operation. Access to intermediate signals at each stage of processing was not
provided due to the overhead required in terms of extra transistors and wiring. Since the only
available system outputs were the analog coordinates of the centroid of the selected object and
the video image, testing and analysis were problematic.
We present two experiments to demonstrate system performance. For each experiment a
static input image was used to determine processing performance under well-defined condi-
tions. The input was imaged onto the die using a standard 8mm focal length lens, under ordinary
room lighting conditions. The image itself was constructed from simple shaded shapes on a
black field. Each shape had a graded intensity profile that decreased from the center out to the
boundary. Output measurements were taken using a scan converter and video capture hardware
on an SGI O2 workstation, and using a digital oscilloscope to capture simultaneous trajectories
of the X and Y coordinates of the centroid of the selected object. The supply voltage was set to
5.0V for all testing. According to the MOSIS parametric test results, the threshold voltages for
the NMOS and PMOS devices on this fabrication run were V and V, respectively.
4.1: Two-dimensional saccade maps
In the first experiment, we wanted to demonstrate smart scanning behavior reminiscent of
biological saccades. An example of biological saccades appears in Figure 8 from Yarbus .
Figure 8a shows the image observed by the subject, and Figure 8b traces the subject’s eye
movements (overt attentional shifts). Using the input image depicted in Figure 9a, we measured
the analog X and Y position values generated by our selective attention system over a period of
10 seconds. Since the on-chip scanners allow us to view both the selected object on one color
channel, and the input image as perceived by the system on another channel, we captured this
input image and plotted the position of the centroid of the selected object on a map of the input
image. This plot appears in Figure 9b, and bears a striking resemblance to Figure 8b. One
difference between Figure 9b and Figure 8b is that the latter shows attention devoted to
interesting features such as the eyes and hairline, whereas the former shows attention allotted
only to bright areas. This is because our system uses intensity alone as the initial saliency map.
Figure 8 Biological saccades . A subject is told to study the image in (a),
while his eye movements are recorded and plotted in (b). Reprinted by permission.
Although it is clear from Figure 9b that most of the “saccades” are made to the center of the
target objects, there are a number of saccades to individual pixels. This occurs when all three
objects are inhibited by local negative feedback, and cannot be selected; the winner-take-all net-
work must choose a winner somewhere. A histogram (Figure 9c) of the frequency of visits to
each location in the array shows that the majority (98%) are made to the three salient objects, as
expected. Finally, a three-dimensional plot of the saccade trajectories over time (Figure 9d)
reveals that the saccades to isolated pixels occur, for the most part, toward the end of the
10-second trial. This is caused by an interesting interaction between the adaptive photoreceptors and
the normalization circuit.
Figure 9 Artificial saccade maps. Using the image in (a), measurements
were taken of the centroid of the selected object for ten seconds. These
attentional shifts are plotted against the image in (b). A histogram of the
number of visits to each location is plotted in (c) and shows that 98% of the
visits were to the three main targets. In (d), the trajectory of the attention
shifts is plotted versus time.
With a constant image, the adaptive photoreceptors slowly adapt their output to a steady,
neutral value: bright areas fade, and dark areas brighten. This behavior was observed on the
monitor. However, as the bright areas fade, the normalization circuit acts to boost the output of
pixels that were not previously part of an object. In some cases, new “objects” are detected or existing
objects grow beyond their actual borders. If these spurious objects are selected by the winner-
take-all circuit, then the centroid output no longer lies within the bounds of the three targets.
The adaptive photoreceptors can be adjusted to have relatively long adaptation times; in this
experiment, noticeable adaptation begins to appear at about 8 seconds, while the saccades to
isolated pixels begin at about 8.5 seconds.
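The interaction described above can be illustrated with a toy simulation. The adaptation rate, neutral level, and pixel values below are illustrative assumptions, not measurements from the chip.

```python
# Toy model of the adaptation/normalization interaction: adaptive
# photoreceptors decay a static scene toward a neutral value, while a
# normalization stage rescales all outputs to a constant total, eventually
# boosting formerly dark pixels that were never part of an object.

def step(levels, neutral=0.5, rate=0.02):
    """One adaptation step: every pixel decays toward the neutral value."""
    return [v + rate * (neutral - v) for v in levels]

def normalize(levels, total=2.0):
    """Normalization: rescale so the summed output stays constant."""
    s = sum(levels)
    return [v * total / s for v in levels]

pixels = [0.9, 0.9, 0.1, 0.1]         # two bright pixels, two dark pixels
for _ in range(200):                   # long exposure to a static image
    pixels = normalize(step(pixels))
print([round(v, 2) for v in pixels])   # dark pixels have been boosted
```

After many iterations the dark pixels approach the neutral level, so a fixed segmentation threshold can begin admitting background pixels as spurious "objects", consistent with the late-trial saccades to isolated pixels seen in Figure 9d.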
4.2: Segmentation tuning
In this experiment, we demonstrate the effect that changing the threshold value for object
segmentation has on the size of detected objects. In each case the scanned output was captured
by a video frame grabber. In Figure 10, we show some typical outputs from the chip. Depending
on the level of the thresholding operation, either the entire shape or only a part of the shape is
segmented as an object. Figure 10a and Figure 10d demonstrate the selection of the circle for
two different settings of the threshold voltage, while Figure 10b and Figure 10e demonstrate the
selection of the square for the same two settings, as do Figure 10c and Figure 10f for the
selection of the third shape. The first setting, 0.42V (Figure 10a, Figure 10b, and Figure 10c),
resulted in the entire shape being above threshold; the entire shape is selected as indicated by
the outline. (In actuality we are able to distinguish the selected area according to its different
color as it is scanned to the monitor. For clarity, we have indicated the outline of the selected
area for grayscale viewing.) The second value, 0.75V (Figure 10d, Figure 10e, and Figure 10f),
caused only a portion of each shape to be segmented as a single object.
Figure 10 Segmentation tuning. Using the same input image as in Figure
9(a), the output images shown in (a) and (d) demonstrate the change in the
detected object size for two different threshold values, (a) 0.42V and (d)
0.75V. Similar descriptions apply to (b) and (e), and also to (c) and (f). The
selected object is denoted by the outline.
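The segmentation path (spatial lowpass filtering followed by an intensity threshold) can be sketched in one dimension. The box filter and graded intensity profile below are illustrative simplifications; the two threshold values echo those in Figure 10.

```python
# Segmentation as lowpass filtering plus thresholding: because the test
# shapes have graded intensity profiles, raising the threshold shrinks the
# detected object, as demonstrated in Figure 10.

def lowpass(profile):
    """Simple 3-tap box filter with edge clamping."""
    n = len(profile)
    return [(profile[max(i - 1, 0)] + profile[i] + profile[min(i + 1, n - 1)]) / 3
            for i in range(n)]

def segment(profile, threshold):
    """Binary segmentation: 1 where filtered intensity exceeds threshold."""
    return [1 if v > threshold else 0 for v in lowpass(profile)]

# A shape whose intensity decreases from the center out, as in the images.
shape = [0.0, 0.3, 0.6, 0.9, 1.0, 0.9, 0.6, 0.3, 0.0]
print(sum(segment(shape, 0.42)))  # larger detected object
print(sum(segment(shape, 0.75)))  # smaller detected object
```

The lowpass stage also merges nearby bright pixels into a single connected object, which is what gives the grouping its dynamic, input-driven character.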
5: Conclusions and future work
We have presented an analog VLSI system that implements two-dimensional object-based
processing for selective attention. The segmentation of objects within an image is a critical pre-
processing stage for object-based selection. The implementation presented in this paper per-
forms the segmentation with a filtering and thresholding computation. This combination of
filtering and thresholding allows flexibility for emphasizing particular object characteristics in
the selective attention computation . The segmentation is used to define the granularity of
the selection operation. The implementation demonstrates an elegant method of modeling the
dynamic size of the attentional spotlight.
The segmentation and object-based selection circuits were implemented with object-based
excitatory and inhibitory feedback, which encompasses the remaining components of the selec-
tive-attention framework. The entire object-based system demonstrates the expected behavior,
and successfully implements a selective attention system with a dynamic spotlight size. In
addition to this 2400-detector implementation, the system has been scaled to a smaller process
and implemented as an array of nearly 8000 pixels and 2000 processing elements. We are
currently testing the device.
One of the difficulties in designing focal plane processors is that while transistor density
increases as process sizes shrink, photodetectors do not scale as well. In the original
implementation, the fill factor was less than 3%, yet gave satisfactory performance. Although
we have not yet tested the larger-array version, during the design phase we found it necessary
to keep the physical size of the photodetectors relatively unchanged, resulting in a fill factor
of 11%. Thus, the size of a single processing element, which was 266 λ (159.6µm) on a side in
the original implementation (see Figure 7), shrank in the scaled implementation. However, a
strict scaling between the two technologies predicts a higher resolution than was achieved; the
scaled implementation represents a 45% loss of resolution due to the scaling problems
associated with photodetectors.
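As a rough consistency check on the fill-factor figures above: the cell dimensions come from Figure 7, while the photodiode active size below is an assumption chosen to be consistent with the stated sub-3% figure, not a value from the layout.

```python
# Back-of-the-envelope fill-factor arithmetic for the original chip.
# Fill factor = total photosensitive area / total processing-element area.

CELL_SIDE_UM = 159.6        # processing-element side, 266 lambda (Figure 7)
PHOTODIODE_SIDE_UM = 13.0   # ASSUMED active photodiode side, for illustration
N_PHOTODIODES = 4           # photodiodes per processing element (Figure 7)

fill_factor = N_PHOTODIODES * PHOTODIODE_SIDE_UM**2 / CELL_SIDE_UM**2
print(f"fill factor: {fill_factor:.1%}")  # under 3%, consistent with the text
```

The same arithmetic shows why shrinking the process while holding the photodiode size fixed raises the fill factor: the denominator shrinks while the numerator stays constant.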
We would like to construct systems that not only have higher resolutions than those imple-
mented thus far, but also use more sophisticated saliency maps than the intensity-based one in
the system described in this paper. However, saliency maps that are more complicated than sim-
plistic intensity-based ones require more focal-plane area to be devoted to processing, rather
than to sensing, further reducing the fill factor.
To address these area and scaling issues, we have begun to investigate multichip systems in
which the selective attention processing is not located on the focal plane. By using a separate
focal plane saliency map generator such as a “traditional” silicon retina, we will be able to
achieve better resolution and more flexibility; the same selective attention processor may be
used with different saliency maps, without fabricating an entirely new chip. We will implement
these multichip selective attention systems using an address-event interchip communication
protocol that is reminiscent of the spike-based communication used in neural systems .
Address-event communication will allow a highly modular system, in which the “saliency map
generator” may actually be composed of several focal plane feature detectors, whose types and
relative importance may be changed according to the task. Building upon our current work,
these multichip systems will demonstrate behavior that is more similar to that observed in bio-
logical systems. Then, we will investigate the performance of these systems within specific
engineering contexts, such as target acquisition, tracking, and recognition.
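Address-event communication can be sketched in a few lines: each spike-like event travels over the interchip bus as the address of its source pixel rather than as part of a sampled frame. The array width, helper names, and event list below are illustrative assumptions.

```python
# Minimal sketch of address-event representation (AER): events are
# transmitted as source addresses, and more-active pixels simply emit
# events more often, so activity is encoded in event rate.

WIDTH = 32  # ASSUMED array width used to pack (x, y) into one address word

def encode(x, y):
    """Pack a pixel coordinate into a single address-event word."""
    return y * WIDTH + x

def decode(addr):
    """Recover the pixel coordinate at the receiving chip."""
    return addr % WIDTH, addr // WIDTH

events = [(3, 5), (17, 0), (3, 5)]        # repeated events encode intensity
bus = [encode(x, y) for x, y in events]
assert [decode(a) for a in bus] == events  # lossless round trip
print(bus)
```

Because the receiver only needs to decode addresses, several feature-detector chips can share one bus, which is what makes the multichip saliency-map architecture modular.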
6: References
 K.A. Boahen and A.G. Andreou, “A Contrast Sensitive Silicon Retina with Reciprocal Synapses,” Advances in
Neural Information Processing Systems 4, ed. J.E. Moody, Morgan Kauffman Publishers: San Mateo, CA, 1991.
 T. Delbruck, “Silicon Retina with Correlation-Based Velocity-Tuned Pixels,” IEEE Transactions on Neural Net-
works, Vol. 4, No. 3, pp. 529–41, May 1993.
 T. Delbruck and C.A. Mead, “Analog VLSI Phototransduction by Continuous Time, Adaptive, Logarithmic Pho-
toreceptor Circuits,” Caltech CNS Memo 30, California Institute of Technology Computation and Neural Sys-
tems Program, June 26, 1995.
 S.P. DeWeerth, T.G. Morris, and C.S. Wilson, “A VLSI Implementation of Visual Based Attentive Processing”.
Final Report, Army Research Laboratories Contract #DAKF11-94-D-0004/0045. November 30, 1996.
 S.P. DeWeerth, “Analog VLSI Circuits for Stimulus Localization and Centroid Computation,” International
Journal of Computer Vision, Kluwer Academic Publishers, Vol. 8, No. 3, pp. 191–202, 1992.
 B. Gilbert, “A 16-Channel Array Normalizer,” IEEE Journal of Solid-State Circuits, Vol. 19, pp 956–63, 1984.
 R. Harrison and C. Koch, “An Analog VLSI Model of the Fly Elementary Motion Detector,” Advances in Neu-
ral Information Processing Systems 10, M.I. Jordan, M.J. Stearns, S.A. Solla, eds. MIT Press: Cambridge, MA,
pp. 880–6, 1998.
 A.L. Yarbus, Eye Movements and Vision, translated by B. Haigh, L.A. Riggs, trans. ed., Plenum Press: New York, 1967.
 E.R. Kandel, “Perception of Motion, Depth, and Form,” Principles of Neural Science, third edition, eds. Kandel,
Schwartz, and Jessell, Elsevier: New York, pp. 440–65, 1991.
 C. Koch and S. Ullman, “Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry,”
Human Neurobiology, Vol. 4, pp. 219-227, 1985.
 J. Lazzaro, S. Ryckebusch, M.A. Mahowald, and C.A. Mead, “Winner-take-all networks of O(n) complexity,”
Advances in Neural Information Processing Systems 1, D.S. Touretzky, ed., Morgan Kaufmann: San Mateo,
CA, pp. 703–11, 1989.
 S.-C. Liu, and J. Harris, “Dynamic Wires: An Analog VLSI Model for Object-Based Processing,” International
Journal of Computer Vision, Kluwer Academic Publishers, Vol. 8, No. 3, pp. 231–9, 1992.
 M. Mahowald, VLSI Analogs of Neuronal Visual Processing: A Synthesis of Form and Function, Ph.D. Thesis,
California Institute of Technology, 1992.
 C.A. Mead, Analog VLSI and Neural Systems, Reading, MA, Addison-Wesley, 1989.
 C.A. Mead and T. Delbruck, “Scanners for Visualizing Activity of Analog VLSI Circuity,” Caltech CNS Memo
11, California Institute of Technology Computation and Neural Systems Program, June 27, 1991.
 T.G. Morris and S.P. DeWeerth, “Analog VLSI Excitatory Feedback Circuits for Attentional Shifts and Track-
ing,” Analog Integrated Circuits and Signal Processing, Kluwer Academic Publishers, Vol. 13, pp. 79-91, 1997.
 T.G. Morris and S.P. DeWeerth, “Analog VLSI Circuits for Covert Attentional Shifts,” Proceedings of the Fifth
International Conference on Microelectronics for Neural Networks and Fuzzy Systems, IEEE Computer Society
Press, Lausanne, Switzerland, pp. 30-37, February 1996.
 T.G. Morris, Analog VLSI Visual Attention Systems, Ph.D. Thesis, Georgia Institute of Technology, Atlanta, GA,
 T.G. Morris, T.K. Horiuchi, and S.P. DeWeerth, “Object-based Selection within an Analog VLSI Visual Atten-
tion System”, IEEE Circuits and Systems, in press.
 T.G. Morris, C.S. Wilson, and S.P. DeWeerth, “Analog VLSI circuits for sensory attentive processing,” MFI '96:
1996 IEEE/SICE/RSJ International Conference on Multisensor Fusion and Integration for Intelligent Systems.
IEEE Press: Piscataway, NJ. pp. 395-402. December 8, 1996.
 T.G. Morris, C.S. Wilson, and S.P. DeWeerth, “An Analog VLSI Focal-Plane Processing Array that Performs
Object-Based Attentive Selection”. Proc. 40th Midwest Symposium on Circuits and Systems. pp. 43-46. August 1997.
 M.I. Posner and S.E. Petersen, “The Attention System of the Human Brain,” Annual Review of Neuroscience, Vol. 13, pp. 25–42, 1990.
 C.S. Wilson, T.G. Morris, and S.P. DeWeerth, “Segmentation Coding for Object-Based Attentive Selection Sys-
tems”. Proc. IEEE International Symposium on Circuits and Systems. IEEE Press: Piscataway, NJ. May 31, 1998.