Accepted to 2020 Cold Spring Harbor Laboratory meeting: From Neuroscience to Artificially Intelligent Systems (NAISys)
INTEGRATION OF HUMAN-LIKE COVERT-OVERT ATTENTION WITH PROBABILISTIC
CONVOLUTIONAL NEURAL NETWORKS
Augustus Intelligence, R&D, NYC, NY
Attention in the human visual system is a mechanism of efficiency, focusing limited computational resources on the
relevant parts of a scene to save “bandwidth” and minimize complexity. There are distinct and complementary attentional
mechanisms: e.g., covert attention in the periphery, overt attention guiding fixation, feature-based attention (FBA)
identifying specific aspects such as color, and object-based attention (OBA).
Attention in artificial vision systems, on the other hand, aims to isolate “interesting” or salient regions of an image for
further processing by a convolutional neural network (e.g., R-CNN). These methods have been successful in image
classification tasks while significantly reducing the computational burden of CNNs. Yet we see room for
improvement by incorporating two aspects of human visual attention: efficiency via complementary attention processes, and
the use of valuable task-dependent information.
Our info-theoretic, sequential processing notion of saliency more closely resembles human fixation patterns than other
methods. We define an (unsupervised) partially observable Markov decision process (POMDP) atop a retinotopically
organized self-information map. For task-dependent attention, we can incorporate a supervisory signal as feedback at
each fixation step, yielding a mutual information map.
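The abstract does not specify the density model behind the self-information map, so the following is only a minimal sketch: it estimates each patch's probability from a histogram of quantized patch descriptors over the image (an assumption) and scores rare patches as salient via their self-information, -log p(patch).

```python
import numpy as np

def self_information_map(image, patch=8, bins=16):
    """Illustrative info-theoretic saliency: each patch's score is its
    self-information -log p(patch). Here p is estimated from a histogram
    of quantized mean-intensity descriptors, a simplifying assumption;
    the paper's actual density model is not described in the abstract."""
    h, w = image.shape
    gh, gw = h // patch, w // patch
    desc = np.empty((gh, gw), dtype=int)
    for i in range(gh):
        for j in range(gw):
            block = image[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            # Quantize the patch's mean intensity into one of `bins` codes.
            desc[i, j] = min(int(block.mean() * bins), bins - 1)
    counts = np.bincount(desc.ravel(), minlength=bins)
    p = counts / counts.sum()
    # Rare descriptors carry high self-information, hence high saliency.
    return -np.log(p[desc] + 1e-12)

rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[24:32, 40:48] = rng.random((8, 8))  # one rare textured patch
sal = self_information_map(img)          # peaks at the textured patch
```

In a full system the map would be retinotopically organized and updated per fixation; this sketch only shows the static scoring step.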
Our “retina-like” sensor does not see the environment in full, but extracts information only in a local region (or
narrow frequency band), similar to the Recurrent Attention Model (RAM). RAM defines a reinforcement learning agent that
receives a scalar reward at each fixation to learn a high-order policy, whereas we use a first-order POMDP with optional
supervision to guide the sensor; studies show humans do not integrate high-order sequence information across fixations.
We model a dual covert-overt attentional process: the covert mechanism analyzes the periphery for
maximally informative data, triggering an overt fixation to that next unseen location. To prevent fixations from
oscillating between the few regions of maximal interest, we maintain a fixation history map: a 2D representation, larger
than the visual field, containing the sequence of recent fixations, not unlike the human frontal eye fields.
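The covert-overt loop with inhibition of return can be sketched as follows; the decay and inhibition constants are illustrative assumptions, not values from the paper.

```python
import numpy as np

def fixation_sequence(saliency, n_fix=5, decay=0.9, inhibition=1e3):
    """Sketch of the dual attentional process: covert attention scores
    locations (here, the whole saliency map stands in for the periphery),
    overt attention fixates the maximum, and a fixation history map
    suppresses recently visited locations so the sensor does not
    oscillate between the top few regions of interest."""
    history = np.zeros_like(saliency)
    fixations = []
    for _ in range(n_fix):
        score = saliency - history                      # covert evaluation
        loc = np.unravel_index(score.argmax(), score.shape)
        fixations.append(loc)                           # overt fixation
        history *= decay                                # older fixations fade
        history[loc] += inhibition                      # inhibit return here
    return fixations

sal = np.zeros((4, 4))
sal[0, 0], sal[3, 3] = 2.0, 1.9   # two near-tied regions of interest
fixes = fixation_sequence(sal, n_fix=3)  # visits distinct locations
```

Without the history term, the sensor would revisit (0, 0) and (3, 3) indefinitely; the decaying penalty forces exploration, loosely mirroring the frontal-eye-field bookkeeping described above.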
For visual processing at each attended subset of the visual space, we implement deep kernel learning, combining the
non-parametric flexibility of Gaussian Processes with the inductive biases and feature extraction of a CNN.
We experiment on Multi-MNIST to contrast OBA with FBA: the former seeks full digits, the latter specific features. We
use two challenging human eye-fixation datasets, MIT300 and CAT2000, to validate the task-based attention paths.
Visual anomaly detection and localization is a natural application of our approach: object-based algorithms such as R-CNN
do not suffice, and end-to-end learning is generally a poor fit due to class imbalance and the unavailability of labeled
anomaly data. We demonstrate this on the benchmark Cement Crack dataset, yielding results competitive with state-of-the-art
visual anomaly detection methods while being more computationally efficient.