Visual Cortex on the GPU:
Biologically Inspired Classifier and Feature Descriptor for Rapid Recognition
Kris Woodbeck, Gerhard Roth
School of Information Technology and Engineering
University of Ottawa
Ottawa, Ontario, Canada
kris.woodbeck@alumni.uottawa.ca, gerhardroth@rogers.com
Huiqiong Chen
Faculty of Computer Science
Dalhousie University
Halifax, Nova Scotia, Canada
huiqiong@cs.dal.ca
Abstract
We present a biologically motivated classifier and feature descriptors that are designed for execution on Single Instruction Multiple Data hardware and are applied to high speed multiclass object recognition. Our feature extractor uses a cellular tuning approach to select the optimal Gabor filters to process a given input, followed by the computation of scale and rotation-invariant features that are sparsified with a lateral inhibition mechanism. Neighboring features are pooled into feature hierarchies whose resonant properties are used to select the most representative hierarchies for each object class. The feature hierarchies are used to train a novel form of Adaptive Resonance Theory classifier for multiclass object recognition. Our model has unprecedented biological plausibility at all stages and uses the programmable Graphics Processing Unit (GPU) for high speed feature extraction and object classification. We demonstrate the speedup achieved with the use of the GPU and test our model on the Caltech 101 and 15 Scene datasets, where our system achieves state-of-the-art performance.
1. Introduction
Computer vision systems have become quite adept at recognizing objects that they have already seen, but the more general task of recognizing a class from a limited number of instances has proven somewhat elusive. Since the primate visual system outperforms all existing computer vision systems at recognition, it is reasonable to look towards biology for inspiration. Biologically inspired work [21, 17, 26] has produced a number of models that compete with the top computer vision systems at object recognition. These models aim to replicate the lowest levels of the visual cortex and generally place a great deal of focus on the computation of visual features; much less focus has been placed on developing a biologically plausible classifier. Existing models [19, 21, 17] have been implemented on Single Instruction Single Data (SISD) processors, but given the highly parallel nature of the brain, a more parallel processor is required to accurately model the processing in the visual cortex. In recent years, the graphics industry has developed general-purpose, highly parallel programmable Graphics Processing Units (GPUs). In our work, we show that the GPU is better suited to the task of modeling the visual cortex and clearly outperforms its SISD counterpart, the CPU, at biologically motivated processing.
The lowest areas of the visual cortex are retinotopic [13], meaning that adjacent cells process adjacent regions of the visual field with identical operations in a Single Instruction Multiple Data (SIMD) manner. At the lowest areas, the cells perform a convolution operation, which gives way to various forms of cellular response pooling at higher levels. Most object classes are too complex to be recognized using only small, isolated regions of an image. Our model mimics the visual cortex by gradually pooling simpler features at lower layers into more complex features at higher layers [19]. The features used in our model take heavy cues from visual features indicated by cells in the cortex [14, 3].
Our model uses a novel filter parameter tuning strategy to select the optimal cells with which to process each visual field. This is followed by feature extraction, which uses three novel feature descriptors and comparison metrics of gradually increasing complexity. We present a biologically motivated classifier based on Adaptive Resonance Theory [5] for object recognition that is implemented as a series of SIMD processing stages on the GPU.
Our model has rapid recognition capabilities; it is able to perform feature extraction and object classification in 270 milliseconds on current hardware using over 250,000 feature descriptors in the classifier. It is tested on the Caltech 101 [8] and 15 Scene [16] datasets; the results are comparable to the state-of-the-art and surpass all other
biologically motivated models. We give timing results which show the speedup obtained by our system. Our results make a strong case for using both the GPU and biologically inspired methods for recognition.
Related Work. Due to the inherent incompatibility between most SISD and SIMD algorithms, our model could not be built from any existing models. Our model uses a similar methodology to existing models of the visual cortex, including the HMAX or 'Standard Model' [19], the 'Standard Model 2.0' [21] and later derived work [17]. In turn, these models have similarities with previous work, such as the Neocognitron [9]. All models start with V1 simple and complex cells, as discovered by Hubel and Wiesel [15]. Simple cells perform convolution operations with Gabor filters. Complex cells pool simple cell responses at a given orientation using a MAX operator over all scales and positions in the cell's receptive field. Simple and complex cells are alternated and their receptive fields gradually grow in size. Recent biological models [21, 17, 26] have used (non-biologically motivated) Support Vector Machines (SVMs) for classification. Non-biological approaches comprised of hierarchies of Gabor filters have also yielded some success at object recognition [18]. Our model distinguishes itself from previous models [19, 21, 17] in the following ways:
- our feature descriptors have rotation, scale and translation invariant properties,
- a filter tuning approach is used instead of static filterbanks,
- an inhibition mechanism is employed to reduce feature redundancy, as in other work [17], and
- a novel multiclass classifier, based on Adaptive Resonance Theory, is used for recognition.
2. Physiological Feature Descriptors
Our model consists of several classes of feature descriptors which progressively increase in complexity and are used in combination with a biologically motivated classifier. Figure 1 gives an overview of the process used to isolate features and classify an image. Our descriptors are built around key anatomical structures, the first of which are simple and complex cells, as used in other models [19, 21, 17]. We introduce the physiological cortical column structure [10]. This structure serves to group simple cells that vary only by orientation; it is used both to regulate our cellular tuning process and to aid in feature normalization. The hypercomplex cell [15] is introduced, which serves to pool complex cell features into higher order features that are both rotation and scale-invariant. We introduce a novel type of cell that pools hypercomplex cell features and links them in a hierarchical manner. We define comparison metrics for each descriptor and give a method of isolating the pertinent features for a given object class. Classification is done on the GPU using a biologically motivated classifier derived from Adaptive Resonance Theory (ART) [4]. Note that all cellular data, meaning all system input and output, is stored in texture format on the GPU during all stages of processing.
V1 Simple Cells (V1S Layer). V1 simple cells perform a convolution on the visual field using Gabor filters over a range of orientations. We use four orientations, similar to other models [21, 17]. A Gabor filter is described by:
C_s(x, y) = exp(−(x_1² + γ² y_1²) / (2σ²)) · cos(2π x_1 / λ + ψ)    (1)

where x_1 = x cos θ + y sin θ and y_1 = −x sin θ + y cos θ. σ and λ are related by the bandwidth parameter b from Equation 2.
σ / λ = (1/π) · √(ln 2 / 2) · (2^b + 1) / (2^b − 1)    (2)
The system uses parameters within the ranges shown in Table 1. The initial values of the parameters are ψ = 0.0, γ = 0.7, λ = 0.6 and b = 1.1, but the final filter values are obtained with the use of a tuning operation which dynamically selects the optimal simple cell parameters for each image. This procedure is explained below.
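As an illustration, a minimal numpy sketch of Equations 1 and 2 follows; it is a CPU reference for what each V1S fragment shader evaluates, not the paper's GLSL code, and the default arguments mirror the initial, pre-tuning values quoted above.

```python
import numpy as np

def gabor_filter(size, theta, psi=0.0, gamma=0.7, lam=0.6, b=1.1):
    """V1 simple cell filter (Eq. 1); sigma is derived from the
    bandwidth parameter b via Eq. 2. Defaults mirror the initial,
    pre-tuning parameter values quoted above."""
    # Eq. 2: sigma / lambda = (1/pi) * sqrt(ln 2 / 2) * (2^b + 1) / (2^b - 1)
    sigma = lam / np.pi * np.sqrt(np.log(2) / 2) * (2**b + 1) / (2**b - 1)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotated coordinates x_1, y_1
    x1 = x * np.cos(theta) + y * np.sin(theta)
    y1 = -x * np.sin(theta) + y * np.cos(theta)
    # Eq. 1: Gaussian envelope times a cosine carrier
    return (np.exp(-(x1**2 + (gamma * y1)**2) / (2 * sigma**2))
            * np.cos(2 * np.pi * x1 / lam + psi))

# One filter per preferred orientation, as in the V1S layer
bank = [gabor_filter(11, t) for t in (0, np.pi/4, np.pi/2, 3*np.pi/4)]
```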
V1 Complex Cells (V1C Layer). V1 complex cells perform a pooling operation over simple cells within the complex cell's receptive field, σ_RF. The pooling is modeled with a MAX operation [19, 21, 17, 26]. Physiologically, most complex cells pool the responses of simple cells with a similar preferred orientation θ. Let C_s(s_i, x, y) be a simple cell's response at (x, y) with orientation θ and scale s_i. The complex cell pooling operation is defined as:
C_c(θ, x, y) = max{ C_s(s_i, x + x′, y + y′) : (x′, y′) ∈ σ_RF, s_i ∈ [5, 31] }    (3)
Complex cells effectively perform a sub-sampling operation: a complex cell with a given receptive field σ_RF reduces a group of simple cells within its receptive field to a single cell. Note that the feature derived from a complex cell, C^i_c, has an orientation value α_i which comes from the orientation of the object in the visual field activating the underlying simple cell(s). α_i is closely related to the simple cell θ parameter. The α_i value is used to achieve rotation and scale invariance; it is further detailed in the V2HC layer.
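The following sketch shows the spirit of Equation 3 on the CPU; the square, non-overlapping pooling window and all names are our simplifications (in the model, receptive fields grow with layer depth and the operation runs as a fragment shader).

```python
import numpy as np

def complex_cell_pool(simple_resp, rf):
    """MAX pooling of Eq. 3 for one preferred orientation theta.
    simple_resp: (n_scales, H, W) responses C_s(s_i, x, y) over the
    pooled scales; rf: edge length of the (square, non-overlapping)
    receptive field used here in place of sigma_RF."""
    n_scales, H, W = simple_resp.shape
    scale_max = simple_resp.max(axis=0)          # MAX over scales s_i
    Hc, Wc = H // rf, W // rf
    windows = scale_max[:Hc * rf, :Wc * rf].reshape(Hc, rf, Wc, rf)
    return windows.max(axis=(1, 3))              # MAX over positions
```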
Parameter   | scale   | θ                   | ψ           | b          | γ          | λ
Value Range | [5, 31] | {0, π/4, π/2, 3π/4} | [−π/2, π/2] | [0.2, 3.0] | [0.0, 3.0] | [0.0, 6.0]

Table 1: Gabor parameter ranges used during tuning; all parameters begin at a default value and are tuned within this range.
Cellular Tuning. Selecting the best simple cells with which to process a given visual field is not trivial: the visual cortex has a wide range of simple cells available for visual processing. Other models [19, 21, 17] use static sets of Gabor filters for simple cells, but our model does away with this assumption. Simple cell parameters are selected using an iterative tuning model that serves to mimic intra-columnar communication in the cortex [10]. The result of this tuning process is an optimized set of simple cells with which to process the visual field. The tuning process is summarized as follows:

1. The cortex is initialized with default simple cell values
2. Tuning takes place over the following V1 simple cell parameters: γ, ψ, b, λ
3. There are N tuning steps, with M simple cell settings tested at each step
4. At each tuning step, a new series of simple cells is generated by altering the current simple cell values in a biologically plausible manner
5. All M newly generated simple cells are applied to the visual field
6. The results from the M simple cells are evaluated based on intra-columnar (corner) and inter-columnar (edge) activations
7. The parameters that create the optimal ratio of corners to edges while maximizing corners are selected as the winning parameters
8. The tuning process is repeated N times
The full details of the tuning process are covered in [26]. A number of physiological properties are taken into account when generating parameters during the tuning process, such as an increase of cellular receptive field size and, generally speaking, similar cellular setting ranges as in [21]. Because 4NM total convolutions are performed per visual field, the tuning is only practical on hardware capable of SIMD computation, such as the GPU.
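A skeleton of this loop is sketched below; the perturb, apply_cells and score callables stand in for the paper's biologically plausible parameter generation, the four-orientation convolution, and the corner/edge evaluation, all of which are detailed in [26].

```python
import numpy as np

def tune_simple_cells(image, init_params, perturb, apply_cells, score,
                      n_steps, m_settings):
    """Skeleton of the cellular tuning loop (steps 1-8 above).
    init_params: dict with gamma, psi, b, lambda; perturb generates M
    biologically plausible variants of a setting; apply_cells convolves
    the visual field with the four oriented filters; score rates the
    corner (intra-columnar) versus edge (inter-columnar) activations."""
    params = dict(init_params)                    # step 1: defaults
    for _ in range(n_steps):                      # N tuning steps
        candidates = perturb(params, m_settings)  # M new settings
        scores = [score(apply_cells(image, p)) for p in candidates]
        params = candidates[int(np.argmax(scores))]  # winning setting
    return params
```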
V2 Hypercomplex Cells (V2HC Layer). V2 has received much less focus from the neuroscientific research community than V1; electrophysiological studies are nearly 10 times more common for V1 than for V2 [3]. V2 is known to activate in response to illusory contours of objects [22]; it has a complex contrast ratio response and activates for relatively complex shapes and textures [14].

While a complex cell will pool simple cells at a common orientation, cells in V2 are known to pool cells at distinct orientations [1, 3]. To model this, we introduce the hypercomplex cell [15]. Hypercomplex cells are known to pool over complex cell orientation; they respond to end-stopped input (bars of a specific length) and also respond to stimuli with either direction-specific or general motion. Since our model's recognition component does not require motion, we use only end-stopped and orientation-pooling hypercomplex cells. These properties allow hypercomplex cells to activate for simple shapes, such as that of a chevron or star. An orientation-pooling hypercomplex cell, comprised of all active complex cells in its receptive field, is defined as follows:
C_h(x, y) = { C_c(θ_i, x, y) : θ_i ∈ [0, 2π), C_c(θ_i, x, y) > t_act }    (4)
Here, t_act is an activation threshold and is set to 10% of the maximum response value. A hypercomplex cell feature C^a_h is comprised of all active complex cell features C^a_c0 … C^a_cN, which are arranged in a clockwise manner using the cortical column structure. Hypercomplex cell features have properties which allow them to be normalized in both rotation and scale-invariant manners.
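A minimal sketch of Equation 4 follows, assuming a dictionary of complex cell response maps keyed by orientation; sorting by θ stands in for the cortical column's clockwise arrangement.

```python
def hypercomplex_pool(cc_maps, x, y, t_frac=0.10):
    """Orientation-pooling hypercomplex cell feature of Eq. 4.
    cc_maps: dict mapping orientation theta_i to the response map
    C_c(theta_i, ., .). t_act is 10% of the maximum response, as in
    the text; sorting by theta stands in for the clockwise cortical
    column ordering."""
    t_act = t_frac * max(m.max() for m in cc_maps.values())
    return [(theta, m[y, x])
            for theta, m in sorted(cc_maps.items())
            if m[y, x] > t_act]
```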
Let C^a_ci be an active complex cell with a receptive field σ_a. Rotation invariance is achieved by traversing all previously pooled simple cells in σ_a to detect both end-stopped and non-end-stopped regions of activation. The orientation of the traversal path is α_i and reflects the orientation of the underlying object that is activating the complex cells. The orientation difference between any two complex cell features is α_ij = abs(α_i − α_j); examples of this can be seen in Figure 2. Rotation invariance is achieved by determining the largest α_ij value and setting C^a_ci as the primary complex cell. The remainder of the cells are arranged according to their order in the cortical column, namely in a clockwise manner.

Scale invariance is achieved by normalizing each feature with respect to the length of the longest complex cell feature within the cortical column. Each complex cell C_c has a length associated with its activation, dictated by σ_a, s_i and the traversal path.

Figure 1: An overview of the model (Retina → V1S → V1C → V2HC → V2LI → V2C → HAR → ART): cellular tuning, complex cell response pooling, hypercomplex cell pooling and path traversal, feature reduction by lateral inhibition, hierarchical pooling of contour features, hierarchical feature resonance, and feature hierarchy classification. Each state is a GPU-resident processing step. Results are sent to the CPU only after the HAR Layer.

Once rotation and scale normalization have taken place, C^a_h is comprised of complex cells C^a_c0 … C^a_cN and has the following properties:
α_01 = max_{i<N} abs(α_i − α_{i+1})    (5)

∀ C_ci ∈ C_h : α_i − α_{i+1} < 0    (6)

∀ C_ci ∈ C_h : len(C_ci) ← len(C_ci) / max_j len(C_cj)    (7)
A metric for comparing two complex cell features (C_ci and C_cj) is given in Equation 8, and a similar metric for comparing two hypercomplex cells (C^i_h and C^j_h) is given in Equation 9.
||C_ci − C_cj|| = √( α_ij² + (len(C_ci) − len(C_cj))² )    (8)

||C^i_h − C^j_h|| = √( (1/N) Σ_{k=0}^{N} ||C^i_ck − C^j_ck||² )    (9)
These metrics allow the comparison of both complex cell features and hypercomplex features. The complex cell metric is chosen for its simplicity, but we believe that a more biologically motivated Mahalanobis metric could be derived in the future. These metrics raise the important issue of how to separate foreground from background, since hypercomplex cell features will inevitably contain complex cell features with both foreground and background information. Physiologically, having two eyes aids us greatly in determining object boundaries: depth is a very obvious cue. But in most computer vision problems, we do not have two views of a given scene, so other techniques must be employed to isolate an object of interest from its background. This problem is the focus of the HAR layer.
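The two metrics are straightforward to express in code; the sketch below assumes each complex cell feature reduces to an (α, length) pair after normalization, and reads Equation 9 as a root-mean-square of the pairwise distances.

```python
import numpy as np

def cc_dist(ci, cj):
    """Complex cell feature distance of Eq. 8; each feature is an
    (alpha, length) pair after rotation/scale normalization."""
    alpha_ij = abs(ci[0] - cj[0])
    return np.hypot(alpha_ij, ci[1] - cj[1])

def hc_dist(hi, hj):
    """Hypercomplex feature distance of Eq. 9, read here as the RMS
    of the pairwise complex cell distances over the N constituent
    cells (the features are assumed to share a child count)."""
    n = len(hi)
    return np.sqrt(sum(cc_dist(a, b) ** 2 for a, b in zip(hi, hj)) / n)
```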
V2 Lateral Inhibition (V2LI Layer). The visual field now consists of a set of hypercomplex features whose cells' receptive fields overlap one another, leaving redundant features in the visual field. Lateral inhibition is a neural mechanism whereby a given cell inhibits neighboring cells with lower activations. In our model, this inhibitory mechanism activates over the previously defined paths of complex cells. As in previous models [17, 18], lateral inhibition, or feature sparsification, effectively isolates the important features within the visual field. The inhibitory mechanism in our model uses the receptive fields of hypercomplex cells to inhibit the response of neighboring hypercomplex cells. The inhibition mechanism acts as a MAX operator and is similar in nature to the V1 complex cell.
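A simple CPU reading of this mechanism, assuming a scalar activation map and a square inhibition neighborhood (both our simplifications):

```python
import numpy as np

def lateral_inhibition(act, rf):
    """Sparsify an activation map: a hypercomplex cell survives only
    if it is the maximum within its (2*rf+1)^2 neighborhood, a MAX
    style inhibition as described above."""
    H, W = act.shape
    out = np.zeros_like(act)
    for yy in range(H):
        for xx in range(W):
            win = act[max(0, yy - rf):yy + rf + 1,
                      max(0, xx - rf):xx + rf + 1]
            if act[yy, xx] > 0 and act[yy, xx] == win.max():
                out[yy, xx] = act[yy, xx]
    return out
```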
V2 Hierarchy Building (V2C Layer). Once lateral inhibition has taken place, the visual field consists of a collection of hypercomplex cells pooled from complex cells. The next step is to introduce higher order features by pooling the hypercomplex cells one step further. We define another type of cell that has not directly been observed within the visual cortex, but its properties are quite similar to cells in the lower levels: it is yet another type of pooling operation. This pooling is geared towards activation by object contours.

This new type of cell pools the response of hypercomplex cells to form a hierarchy of hypercomplex features. Let (C^a_h, C^b_h) be two hypercomplex cells that are linked together by a shared complex cell. Each hypercomplex cell has its constituent complex cells' receptive fields traversed to determine its hypercomplex cell neighbors. The end result is the resonant cell feature C^a_r:
C^a_r = { (C^a_h, C^k_h) : k ∈ [0, N], ∃ (i, j, C^a_ci ∈ C^a_h, C^k_cj ∈ C^k_h) : ||C^a_ci − C^k_cj|| = 0 }    (10)
A sample resonant feature can be seen in Figure 2. The pooling operation links 2-tuples of hypercomplex cell features together into a resonant feature. Resonant features are designed to activate in response to illusory contours of objects, but one problem remains: features contain a good deal of background information. The ideal resonant feature will activate solely to an object's contour, whose elements are generally linked with a similar luminance value, as in the visual cortex [22]. The isolation of resonant features for an object class is the focus of the HAR layer. In practice, a given feature descriptor does not typically acquire more than 10 child feature descriptors.
3. Feature Isolation and Classification
Hierarchical Adaptive Resonance (HAR Layer). The visual field now comprises a series of hierarchical resonant features; however, every feature can consist either partially or completely of background information. In order to isolate background information, the HAR layer creates and adjusts a series of weight vectors for each resonant feature. The HAR layer allows two complete visual fields to be compared to one another in a resonant comparison operation. The HAR layer performs this resonance comparison operation over a collection of positive and negative (background) visual fields for a given object class. The result of this layer is a set of weights which rank the resonant features with an image-to-class metric, instead of the typical image-to-image metric used with SVMs [19, 21, 17, 26, 18].
Let C^a_r be a resonant feature composed of hypercomplex features C^a_h0 … C^a_hN. Let C^b_r be a resonant feature from a separate visual field. The pseudocode below shows an algorithm for comparing two resonant features to one another. The discardChildren function serves to discard the smallest hypercomplex child features until both features have the same child count.

Figure 2: A sample of a single resonant feature, C^a_r, comprised of 2 hypercomplex features (C^a_h and C^b_h) and 6 complex features (C^a_c0 … C^a_c2 and C^b_c0 … C^b_c2, where C^a_c2 = C^b_c2). The grid represents a simple cell field arranged in cortical columns, on top of which complex cell pooling occurs. Red areas represent active hypercomplex cells and orange areas represent active simple cells that have been pooled by complex cells (represented by lines). The primary complex cells, C^a_c0 and C^b_c0, are associated with the largest angles in their respective hypercomplex cells. The cortical column is used to arrange complex cells in a clockwise manner for rotation invariance and to normalize complex cell length for scale invariance. The grid is represented as a texture on the GPU.
if childCount(C^a_r) < childCount(C^b_r) then
    C^b_r ← discardChildren(C^b_r)
end if
if childCount(C^a_r) == childCount(C^b_r) then
    rootAct ← ||C^a_h0 − C^b_h0||
    if rootAct > ρ then
        for k = 1 to childCount(C^a_r) do
            weight[k] ← rootAct × ||C^a_hk − C^b_hk||
        end for
    end if
end if
The ρ parameter is identical to the one used in the classifier. The matching process is run on every feature combination in two visual fields. For each C^a_r in the source visual field, the best matching feature is selected (via the max operator) in the destination visual field. The process is repeated for a series of positive examples, and a set of background examples. When a feature activates for a negative class, weight[k] is subtracted from the total; otherwise it is added to the total. Once complete, all final weights and features are extracted and stored in a database for use by the classifier.
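The surrounding weighting procedure can be summarized as follows, with `match` standing in for the resonant comparison pseudocode above; the structure of the fields and the signed accumulation follow the text, everything else is our scaffolding.

```python
def har_weights(source_feats, pos_fields, neg_fields, rho, match):
    """Image-to-class weighting of resonant features (HAR layer).
    match(f, g, rho) is the resonant comparison sketched above and is
    assumed to return a scalar activation (0 when resonance fails the
    rho test). Positive fields add weight, background fields subtract."""
    weights = [0.0] * len(source_feats)
    for sign, fields in ((+1.0, pos_fields), (-1.0, neg_fields)):
        for field in fields:
            for i, feat in enumerate(source_feats):
                # Best match in the destination field (max operator)
                weights[i] += sign * max(match(feat, g, rho) for g in field)
    return weights
```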
Classifier (GPU ART Layer). Our model uses a classifier based on the Adaptive Resonance Theory (ART) [5] family of classifiers. The majority of biologically inspired visual cortex models use classifiers that are not biologically plausible, such as Support Vector Machines (SVMs) [21, 17, 11, 26]. While SVMs are popular, they are binary classifiers, which makes them inherently ill-suited for multiclass object recognition. Adaptive Resonance Theory is a form of biologically motivated classifier with several unique and highly desirable properties for object recognition. ART classifiers do not suffer from the train/test dichotomy that plagues most classifiers; they can switch between training and testing modes at any point. ART classifiers are also not limited to binary classification; they are well adapted to multiclass problems and have shown promise in visual classification tasks [24].
There are several ART-based classifiers available, the most popular of which is Fuzzy ARTMAP [6]. This uses a combination of an ART-1 layer and an ART-2 layer with an associative map field in between. The ART-1 layer receives a stream of input vectors and ART-2 receives a stream of correct predictions. When the two layers activate in resonance, the map field pairs their activation. The map field includes a feedback control mechanism which can trigger the search for another recognition category, in a process called match tracking. We have made the following adjustments to ARTMAP for our classifier, which we term GPU-ARTMAP:

- the classifier uses an 'image-to-class' metric, instead of the typical 'image-to-image' metrics,
- the classifier does not require bounded length input samples,
- the feature complement coding step is omitted, and
- the classifier is implemented on the graphics processor; all comparisons are done in a single render pass with O(MN) comparisons of M known features to N input features.
The final feature-set used for Caltech 101 includes over 250,000 features selected from the HAR Layer, all of which are stored on a series of 1024x1024 textures used during training and classification. Each feature's class identifier is discarded when loaded into the classifier (prior to training). Since all feature data is stored on a relatively small number of textures, training or classification can be done in a single render pass.
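To make the single-render-pass query concrete, here is a loose CPU sketch of an O(MN) all-pairs comparison gated by a vigilance test; the similarity function and the label voting are our placeholders and do not capture the full ARTMAP learning dynamics (category nodes, map field, match tracking).

```python
import numpy as np

def artmap_query(stored, labels, query_feats, rho):
    """Loose sketch of a GPU-ARTMAP query: an O(MN) all-pairs
    comparison (one render pass on the GPU) gated by the vigilance
    parameter rho. stored: (M, D) known features; labels: class per
    stored feature; query_feats: (N, D) input features. The inverse
    distance similarity and the voting are placeholders."""
    votes = {}
    for q in query_feats:
        act = 1.0 / (1.0 + np.linalg.norm(stored - q, axis=1))
        j = int(np.argmax(act))
        if act[j] >= rho:                        # vigilance test
            votes[labels[j]] = votes.get(labels[j], 0.0) + act[j]
    return max(votes, key=votes.get) if votes else None
```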
The Programmable Graphics Processing Unit. With current programmable graphics processors reaching well over 500 GFLOPS (billion FLoating point Operations Per Second), they are fast becoming the ideal platform for high performance computing [7]. We show that the GPU performs significantly better than the CPU at various SIMD processing tasks. Since the primate visual cortex operates in a retinotopic/topographic fashion [13], which is inherently SIMD in nature, the GPU is an ideal platform for modeling the visual cortex. All cellular processing in our model, from feature descriptor isolation and comparison to object classification, is done using OpenGL GLSL fragment shaders [20]. While development on the GPU has generally been quite challenging due to the lack of any high level method of debugging, hardware-aware debuggers have recently advanced significantly [23], helping to reduce GPU development time.
Our model represents each layer of the cortex as one or more fragment shaders. A fragment shader is a program that is executed (in parallel) on every fragment (pixel) of a texture, in what is known as a render pass. When developing algorithms for the GPU, coprocessor bandwidth can become a significant bottleneck: it is therefore crucial to minimize data transfer between the CPU and GPU. In our model, feature descriptors are read from the GPU only after the HAR layer has been applied. At this point, the features for a given object class have been isolated and a subset of these features is selected for use by the classifier.
Method            | [17] | [11] | [25] | [16] | [27] | [12] | [2]  | Our method
101 Accuracy      | 56.0 | 58.2 | 63.0 | 64.6 | 66.0 | 67.6 | 77.8 | 64.3
15 Scene Accuracy |  -   |  -   |  -   | 81.4 |  -   |  -   |  -   | 76.3

Table 2: Published results for the Caltech 101 & 15 Scene datasets. Results for our model are the average of 10 independent runs.
4. Results
4.1. Multiclass Classification
The Caltech 101 [8] and 15 Scene [16] datasets are collections of 9197 and 4485 images, divided into 101 object categories and 15 scene categories respectively. Caltech 101 includes a background category, which is also used for the HAR layer in the 15 Scene dataset. Our model was run on grayscale versions of the images, resized such that neither the width nor the height exceeded 256 pixels. We combine the "faces" and "faces easy" categories in Caltech 101, due to a high mismatch rate. Each result is an average of 10 runs. For each run we do the following:

1. For each class, we perform HAR tuning on all images within the class:
   (a) Run all steps in the model up to the HAR layer for all images in the target class
   (b) Run the HAR layer on all images within the class
   (c) Run the HAR layer on 50 random images from the background class
2. Select all features from HAR tuning whose accuracy is above 50%
3. Randomly select N training images from each class, placing all others in the test set
4. Train the ART classifier with the N training examples from each class
5. Perform classification on all remaining test images
Our result for the Caltech 101 dataset is 64.3% +/- 1.3% using 30 training images. Our result for the 15 Scene dataset is 76.3% +/- 1.8% using 100 training images. These results are comparable to the state-of-the-art for both datasets.
4.2. SIMD versus SISD Timing
We demonstrate the speedup obtained by our GPU implementation with two benchmarks. First, we compare the execution time for all cellular operations in feature extraction, including simple, complex and hypercomplex cells, on both the CPU and GPU. This operation is performed on a number of input texture sizes and the results are shown in Figure 3. Second, we take a feature comparison operator, effectively the HAR layer, and compare its execution on the CPU and GPU.

Figure 3: Execution time of SIMD versus SISD algorithms (log-scale time in seconds). 3a) Feature extraction: execution of area V1 and parts of V2 in processing 10,000 images with layers V1S through V2C, for texture sizes of 512 to 4096. 3b) Feature comparison: execution of HAR layer feature field comparisons; an O(N²) comparison operation is performed that consists of all features in two identical images, for feature counts of 100 to 1000. We vary the feature count in the images by adjusting lateral inhibition. A GeForce 8800 GTX Ultra GPU and a 2.2 GHz AMD64 CPU were used for the tests.
The GPU achieves a speedup of approximately four orders of magnitude in the first test and approximately three orders of magnitude in the second test. This is due largely to the design of the GPU: it is composed of 128 processors at 1.5 GHz each, with memory at 2.2 GHz and a memory bandwidth of 100 GB/s. The CPU tested was clocked at 2.2 GHz, with memory at 100 MHz and a memory bandwidth of 3.2 GB/s. This GPU is capable of approximately 500 GFLOPS, while the CPU is capable of approximately 6 GFLOPS. The increased processor count, memory speed and bandwidth easily account for the speedup observed; the GPU is clearly much better suited to the operations used in our model than the CPU. As CPUs become increasingly multi-core they will inevitably improve at these tasks, but they are unlikely to surpass the GPU in parallelism.
5. Discussion and Future Work
We have shown that a biologically motivated model can compete with state-of-the-art computer vision systems at object recognition. We have expanded the biological basis of existing cortex models to create both feature descriptors and a classifier that are closely aligned with the properties of the visual cortex. Using the graphics processor, our model is able to classify images quite rapidly compared to other models, which have been reported to require several seconds to classify an image [17]. Our setup involves only a single GPU per machine; a further speedup could be obtained with the use of Scalable Link Interface (SLI), which allows up to four GPUs to work in parallel with one another.
It is important to consider the influence of processor architecture on a system's design: one often cannot solve a problem in the same manner on SISD and SIMD processors. Throughout this model, we have detailed operations which are much too costly to perform in an SISD environment, but are in fact reasonable within the SIMD framework of the GPU. While there is certainly an overhead and a number of limitations that must be taken into account when developing on any new hardware, the speedup is noteworthy.
Future work involves exploring a Mahalanobis distance metric with a more biological basis for complex and hypercomplex feature comparison. It is also clear that biology uses a form of hierarchical classification to handle such a large variety of object classes; exploring methods of building hierarchies of classifiers could prove useful for scaling this model to handle much larger datasets.
References
[1] A. Anzai, X. Peng, and D. C. Van Essen. Neurons in monkey visual area V2 encode combinations of orientations. Nature Neuroscience, 10(10):1313–1321, 2007.
[2] A. Bosch, A. Zisserman, and X. Munoz. Representing shape
with a spatial pyramid kernel. In CVPR, 2007.
[3] G. M. Boynton and J. Hegdé. Visual cortex: The continuing puzzle of area V2. Current Biology, 14:523–524, 2004.
[4] G. Carpenter and S. Grossberg. ART 3: Hierarchical search
using chemical transmitters in self-organizing pattern recog-
nition architectures. Neural Networks, 3:129–152, 1990.
[5] G. Carpenter and S. Grossberg. Adaptive resonance theory.
MIT Press, second edition edition, 2003.
[6] G. Carpenter, S. Grossberg, N. Markuzon, J. Reynolds, and
D. Rosen. Fuzzy ARTMAP: A neural network architecture
for incremental supervised learning of analog multidimen-
sional maps. IEEE Trans. on Neural Networks, 3:698–713,
1992.
[7] Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover. Gpu
cluster for high performance computing. In ACM/IEEE SC
Conference, 2004.
[8] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative
visual models from few training examples: an incremental
bayesian approach tested on 101 object categories. In CVPR,
2004.
[9] K. Fukushima. Neocognitron: A self-organizing neu-
ral network model for a mechanism of pattern recognition
unaffected by shift in position. Biological Cybernetics,
36(4):193–202, 1980.
[10] G. Goodhill and M. Carreira-Perpiñán. Cortical columns. Encyclopedia of Cog. Science, 1:845–851, 2002.
[11] K. Grauman and T. Darrell. Pyramid match kernels: Dis-
criminative classification with sets of image features. Tech-
nical Report MIT-CSAIL-TR-2006-020, MIT, 2006.
[12] G. Griffin, A. Holub, and P. Perona. Caltech-256 object cat-
egory dataset. Technical Report 7694, California Institute of
Technology, 2007.
[13] K. Grill-Spector and R. Malach. The human visual cortex.
Ann. Rev. of Neuroscience, 27:649–77, 2004.
[14] J. Hegdé and D. C. Van Essen. Selectivity for complex shapes in primate visual area V2. J. of Neuroscience, 20:RC61:1–6, 2000.
[15] D. H. Hubel and T. N. Wiesel. Receptive fields and func-
tional architecture in two nonstriate visual areas (18 and 19)
of the cat. J. of Neurophysiology, 28:229–289, 1965.
[16] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of
features: Spatial pyramid matching for recognizing natural
scene categories. In CVPR, 2006.
[17] J. Mutch and D. Lowe. Multiclass object recognition with
sparse, localized features. In CVPR, 2006.
[18] M. Ranzato, F. Huang, Y. Boureau, and Y. LeCun. Unsuper-
vised learning of invariant feature hierarchies with applica-
tion to object recognition. In CVPR, 2007.
[19] M. Riesenhuber and T. Poggio. Hierarchical models of object
recognition in cortex. Nature Neuroscience, 2:1019–1025,
1999.
[20] M. Segal and K. Akeley. The OpenGL Graphics System: A
Specification (Version 2.1), 2006.
[21] T. Serre, L. Wolf, and T. Poggio. Object recognition with
features inspired by visual cortex. In CVPR, 2005.
[22] M. Soriano, L. Spillmann, and M. Bach. The abutting grating illusion. Vision Research, 36:109–116, 1996.
[23] M. Strengert, T. Klein, and T. Ertl. A hardware-aware de-
bugger for the opengl shading language. In ACM Siggraph
Workshop on Graphics Hardware, 2007.
[24] M. Uysal, E. Akbas, and F. Yarman-Vural. A hierarchical
classification system based on adaptive resonance theory. In
ICIP, 2006.
[25] G. Wang, Y. Zhang, and L. Fei-Fei. Using dependent re-
gions for object categorization in a generative framework. In
CVPR, 2006.
[26] K. Woodbeck. On neural processing in the ventral and dorsal
visual pathways using the programmable graphics process-
ing unit. Master’s thesis, University of Ottawa, 2007.
[27] H. Zhang, A. Berg, M. Maire, and J. Malik. SVM-KNN:
Discriminative nearest neighbor classification for visual cat-
egory recognition. In CVPR, 2006.