
Fractional Binding in Vector Symbolic Representations for Efficient Mutual Information Exploration

P. Michael Furlong

Centre for Theoretical Neuroscience

University of Waterloo

Waterloo, Canada

michael.furlong@uwaterloo.ca

Terrence C. Stewart

University of Waterloo Collaboration Centre

National Research Council of Canada

Waterloo, Canada

terrence.stewart@nrc-cnrc.gc.ca

Chris Eliasmith

Centre for Theoretical Neuroscience

University of Waterloo

Waterloo, Canada

celiasmith@uwaterloo.ca

Abstract—Mutual information (MI) is a standard objective function for driving exploration. The use of Gaussian processes to compute information gain is limited by time and memory complexity that grows with the number of observations collected. We present an efficient implementation of MI-driven exploration by combining vector symbolic architectures with Bayesian Linear Regression. We demonstrate regret performance equivalent to a GP-based approach, with memory and time complexity that is constant in the number of samples collected, $t$, as opposed to $O(t^2)$ and $O(t^3)$, respectively, enabling long-term exploration.

Index Terms—mutual information sampling, Bayesian optimization, vector symbolic architecture, fractional binding

I. INTRODUCTION

Mutual information (MI) is a standard objective function for quantifying curiosity when exploring [1], [2]. In this paper we use Bayesian optimization as a framework for curiosity, and present an algorithm for MI-driven exploration whose time and memory requirements are constant in the number of observations, improving on the $O(t^3)$ (time) and $O(t^2)$ (memory) requirements of Gaussian process approaches to computing MI.

A common approach to informative sampling is to compute information gain using Gaussian Processes (GPs), e.g. [3]–[6]. However, computing the variance, necessary to compute MI, requires inverting a matrix that grows with the square of the number of sampled data points, $t$. Unbounded growth of memory, and the concomitant increase in time to evaluate sampling locations, is not compatible with long-term operations using systems with limited memory capacity.

To overcome the limitations of GPs, researchers have improved occupancy grid methods [7] with efficient algorithms for computing information gain [8], [9]. The complexity of these approaches tends to grow linearly in the number of grid cells. However, occupancy grids have constraints that GPs do not: there is a fixed resolution, and only points in the grid are modelled. These shortcomings can be ameliorated with irregular and adaptive representations (e.g., triangulated meshes or KD-trees), but they require additional machinery to represent the function domain, and more memory to represent larger areas.

We present an algorithm that provides the benefits of both approaches. It has memory and time requirements that are constant with respect to the number of observations and linear in the number of candidate sampling locations, but it is still defined for all points in the function domain. We achieve this by combining the concept of fractional binding in Vector Symbolic Architectures (VSAs) with Bayesian Linear Regression (BLR) to model uncertainty over the function domain.

VSAs are used in modelling cognitive processes [10]–[14]. Symbols are represented as vectors, and cognition is conducted through operations on those vectors. Binding is a key operation, where a new symbol, $C$, is created by binding two existing symbols, $A$ and $B$, denoted $C = A \circledast B$, typically representing a slot-filler relationship between $A$ and $B$.

Cognitive Semantic Pointers are a neurally implemented VSA for which the binding operator is circular convolution¹. Integer quantities, atomic symbols (e.g., words), and structured representations (e.g., sentences) can be represented in Semantic Pointers through binding. To represent integer quantities, binding is iterated an integer number of times, denoted $S^k = S \circledast \dots \circledast S$, where $k \in \mathbb{N}$ and $S \in \mathbb{R}^d$ is a fixed semantic pointer of dimension $d$, which we call an “axis pointer” or “axis vector”.

¹Plate [13] originally suggested circular convolution for a purely algebraic VSA.

Spatial Semantic Pointers (SSPs) extend Semantic Pointers to represent real numbers through the process of fractional binding [13], [15]. Fractional binding is implemented with the Fourier transform, $S^x = \mathcal{F}^{-1}\{\mathcal{F}\{S\}^x\}$, where $x \in \mathbb{R}$ is the real value encoded through element-wise exponentiation of the Fourier transform of the axis pointer, $S$. Using SSPs allows us to make a connection between biological representation and information-theoretic models of curiosity. SSPs link to biology through modelling grid and place cells [16], [17], representations linked to an organism's location [18]. Further, the organization of spatial relationships may be used in other brain areas [19].
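To ground these operations, the following numpy sketch implements circular-convolution binding and fractional binding. The axis-pointer construction (unit-magnitude, conjugate-symmetric Fourier coefficients) is a common choice in the SSP literature; the function names and dimensions here are ours, not drawn from the paper's code.

```python
import numpy as np

def make_axis_pointer(d, rng):
    # Unit-magnitude Fourier coefficients with conjugate-symmetric random
    # phases, so the inverse FFT yields a real, unitary axis vector.
    phases = rng.uniform(-np.pi, np.pi, size=d)
    phases[0] = 0.0
    if d % 2 == 0:
        phases[d // 2] = 0.0          # Nyquist bin must be real
    half = (d - 1) // 2
    phases[d - half:] = -phases[1:half + 1][::-1]
    return np.fft.ifft(np.exp(1j * phases)).real

def bind(a, b):
    # Binding by circular convolution: element-wise product of spectra.
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

def fractional_bind(S, x):
    # S^x = F^{-1}{ F{S}^x }: element-wise exponentiation of the spectrum
    # extends repeated binding from integer to real-valued exponents.
    return np.fft.ifft(np.fft.fft(S) ** x).real

rng = np.random.default_rng(seed=0)
S = make_axis_pointer(256, rng)
# Nearby encoded values are similar; distant ones are nearly orthogonal,
# with similarity falling off like a sinc kernel [20].
print(fractional_bind(S, 1.3) @ fractional_bind(S, 1.5))
```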

SSPs connect to information-theoretic exploration through the kernel induced by the dot product over SSP vectors. SSPs induce a sinc kernel function [20], and sinc kernels have been found to be efficient kernels for kernel density estimators [21], [22]. Vector representations that induce kernels can be used to make memory- and time-efficient kernel density estimators, as in the EXPoSE algorithm [23]. But where Schneider et al. [23] made a KDE, we combine SSPs with Bayesian Linear Regression to approximate Gaussian Process regression.

Other approaches combine vector representations with BLR to improve computational efficiency. ALPaCA uses uncertainty over the vector space for meta-learning [24], [25]. Perrone et al. [26] project data into a vector space for more computationally efficient Bayesian optimization. However, these techniques require learning a projection from the input data into a vector space. The advantage of SSPs is that the representation does not need to be learned; it can be designed, further improving efficiency.

In this paper we compare the performance of our algorithm, Spatial Semantic Pointer Mutual Information Bayesian optimization (SSP-MI), to the Gaussian Process Mutual Information Bayesian optimization (GP-MI) algorithm developed in [6]. We empirically show that the regret achieved by our algorithm is at least as good as that of the GP algorithm, with time and memory complexity that is constant in the number of samples collected, $t$, as opposed to $O(t^3)$ and $O(t^2)$ for the GP-based method. The constant time and memory requirements of SSP-MI mean that it is feasible to deploy this algorithm on limited hardware for long-duration operations.

II. APPROACH

Our exploration algorithm uses the Bayesian Optimization framework of Contal et al. [6], given in Algorithm 1. The objective is to find the sampling location, $x^*$, that maximizes a function $f(\cdot)$. The function domain is sampled to provide a set of candidate function sampling points, $\mathcal{X}$. The algorithm computes an acquisition function, given by $\mu_t(x) + \sqrt{\gamma_t + \sigma_t^2(x)} - \sqrt{\gamma_t}$, where $\mu_t(x)$ is the current estimate of $f(x)$, $\sigma_t^2(x)$ is the predicted variance of $\mu_t(x)$, and $\gamma_t$ accumulates the predicted variance of previously observed locations. The highest-scoring candidate sample location is selected for follow-up observations, which are used to update the algorithm that predicts $\mu_t(x)$ and $\sigma_t^2(x)$.
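As a concrete rendering of this loop (the formal listing appears in Algorithm 1 below), here is a minimal Python sketch. The agent object, with its query and update methods, is our assumed interface for whatever regressor supplies $\mu_t(x)$ and $\sigma_t^2(x)$, and X stands for the representation the agent expects (raw points for GP-MI, cached SSP encodings for SSP-MI).

```python
import numpy as np

def mi_bayes_opt(budget, f, X, agent):
    # MI-driven Bayesian optimization over a finite candidate set X,
    # following Contal et al.'s acquisition rule.
    gamma = 0.0
    for t in range(budget):
        mu, var = agent.query(X)                     # mu_t(x), sigma_t^2(x) for all candidates
        phi = np.sqrt(gamma + var) - np.sqrt(gamma)  # information-gain bonus
        i = int(np.argmax(mu + phi))                 # maximize the acquisition function
        y = f(X[i])                                  # observe f at the chosen location
        agent.update(X[i], y)
        gamma += var[i]                              # accumulate variance at sampled points
    return agent
```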

In the baseline algorithm, GP-MI, $\mu_t(x)$ and $\sigma_t^2(x)$ are provided by a Gaussian Process regression using a radial basis kernel function, operating on the raw input vectors, $x \in \mathcal{X}$. In our algorithm, $\mu_t(x)$ and $\sigma_t^2(x)$ are provided by a BLR over the SSP representation of the points in $\mathcal{X}$. GP-MI was implemented using the GPy library [27]. For GP-MI, mean and variance are computed per [28, §6.4]:

$$\mu_t(x) = k_{t-1}^{T}\Sigma_{t-1}y_{t-1} \quad (1)$$
$$\sigma_t^2(x) = k(x, x) - k_{t-1}^{T}\Sigma_{t-1}k_{t-1} \quad (2)$$

where $k_t = (k(x, x_1), \dots, k(x, x_t))$ is the vector of kernel-function evaluations between the test point $x$ and all points in the data set collected up to observation $t \in \{0, \dots, T\}$.

Algorithm 1 Mutual information Bayesian optimization
 1: procedure MIBO(budget, $f(\cdot)$, $\mathcal{X}$)
 2:     $t \leftarrow 0$; $\gamma_0 \leftarrow 0$
 3:     while $t <$ budget do
 4:         $\mu_t(x), \sigma_t^2(x) \leftarrow$ agent.query($x$) $\forall x \in \mathcal{X}$
 5:         $\phi_t(x) \leftarrow \sqrt{\gamma_t + \sigma_t^2(x)} - \sqrt{\gamma_t}$
 6:         $x_t \leftarrow \arg\max_{x \in \mathcal{X}} \mu_t(x) + \phi_t(x)$
 7:         $y_t \leftarrow f(x_t)$    ▷ Observe $f(\cdot)$ at $x_t$
 8:         agent.update($x_t$, $y_t$)
 9:         $\gamma_{t+1} \leftarrow \gamma_t + \sigma_t^2(x_t)$
10:         $t \leftarrow t + 1$
11:     end while
12: end procedure

$y_t = (y_1, \dots, y_t)$ is the vector of previously collected function-value observations, where each $y_i = f(x_i)$. After the observation $(x_t, y_t)$ is collected, $\Sigma_{t-1}$, $k_{t-1}$, and $y_{t-1}$ are updated.
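To make Eqs. (1) and (2) concrete, here is a small numpy sketch of the GP posterior, under our reading that $\Sigma_{t-1}$ denotes the inverse Gram matrix over observed points and that the RBF kernel has unit variance; the jitter term is our addition for numerical stability (the paper uses GPy rather than this code). The explicit inversion is where the $O(t^3)$ cost appears.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # Radial basis kernel between the rows of a and b; k(x, x) = 1.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gp_predict(X_test, X_obs, y_obs, ell=1.0, jitter=1e-9):
    K = rbf(X_obs, X_obs, ell) + jitter * np.eye(len(X_obs))
    Sigma = np.linalg.inv(K)       # O(t^3) in the number of observations
    k = rbf(X_test, X_obs, ell)    # k_{t-1} for every test point
    mu = k @ Sigma @ y_obs                               # eq. (1)
    var = 1.0 - np.einsum('ij,jk,ik->i', k, Sigma, k)    # eq. (2)
    return mu, var
```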

For SSP-MI, candidate sample locations are transformed into SSPs, which are row vectors denoted $\psi(x)$. To fractionally bind vector-valued variables, we select a random axis pointer, $S_i$, for each dimension of the vector, as in [15]. We then fractionally bind each axis pointer using the corresponding vector element of $x$, and bind together all of the vectors representing the individual axes:
$$\psi(x) = \circledast_{i=1}^{n} S_i^{x_i/l} \quad (3)$$

Here $l$ is a length-scale parameter. In this work we use the same length-scale parameter for all vector elements, although using one length scale per dimension would be reasonable. The length-scale parameter is optimized by minimizing the $L_2$ error in the predicted observations:
$$\arg\min_{l \in \mathbb{R}^+} \sum_{i=1}^{t} \left\| y_i - m_t^{T}\psi(x_i/l) \right\|^2 \quad (4)$$
The choice to minimize the prediction error instead of maximizing the log likelihood of the observations was arbitrary.
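A sketch of the encoding in Eq. (3): all fractional powers are accumulated in the Fourier domain and a single inverse transform produces $\psi(x)$. Here axis_pointers is assumed to hold one axis vector per input dimension, built as in the earlier axis-pointer sketch.

```python
import numpy as np

def encode(x, axis_pointers, length_scale):
    # psi(x) = bind_{i=1..n} S_i^(x_i / l), eq. (3): multiply the
    # exponentiated spectra, then take one inverse FFT.
    F = np.ones(axis_pointers.shape[1], dtype=complex)
    for S_i, x_i in zip(axis_pointers, x):
        F *= np.fft.fft(S_i) ** (x_i / length_scale)
    return np.fft.ifft(F).real
```

In the experiments below, the 100 × 100 candidate grid would be pushed through such an encoder once and the results cached.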

The BLR parameters were updated online, per [28, §3.3]:
$$\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + \beta\,\psi(x_t)^{T}\psi(x_t) \quad (5)$$
$$m_t = \Sigma_t\Sigma_{t-1}^{-1}m_{t-1} + \beta\,\Sigma_t\,\psi(x_t)^{T}y_t \quad (6)$$
The predicted mean and variance were computed:
$$\mu_t(x) = m_{t-1}^{T}\psi(x)^{T} \quad (7)$$
$$\sigma_t^2(x) = \frac{1}{\beta} + \psi(x)\,\Sigma_{t-1}\,\psi(x)^{T} \quad (8)$$
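Eqs. (5)–(8) admit a compact online implementation. The sketch below keeps the precision matrix $\Sigma_t^{-1}$ and the running statistic $\beta\sum_i \psi(x_i)^T y_i$, from which the mean $m_t$ of Eq. (6) is recovered; the prior precision alpha and observation precision beta are placeholder values, not the paper's settings.

```python
import numpy as np

class SSPBLRAgent:
    # Online Bayesian linear regression over SSP features, eqs. (5)-(8).
    # Memory is O(d^2) in the SSP dimension d, independent of t.
    def __init__(self, d, alpha=1.0, beta=100.0):
        self.Sinv = alpha * np.eye(d)   # Sigma_0^{-1}: isotropic prior precision
        self.b = np.zeros(d)            # running beta * sum_i psi(x_i)^T y_i
        self.beta = beta                # observation precision

    def update(self, psi_x, y):
        # eq. (5): rank-one precision update. Unrolling eq. (6) shows that
        # Sigma_t^{-1} m_t equals the running statistic b (with m_0 = 0).
        self.Sinv += self.beta * np.outer(psi_x, psi_x)
        self.b += self.beta * psi_x * y

    def query(self, Psi):
        # Psi: (n, d) matrix of encoded candidates psi(x).
        S = np.linalg.inv(self.Sinv)    # O(d^3), constant in t
        m = S @ self.b                  # eq. (6)
        mu = Psi @ m                    # eq. (7)
        var = 1.0 / self.beta + np.einsum('ij,jk,ik->i', Psi, S, Psi)  # eq. (8)
        return mu, var
```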

Both algorithms are initialized with ten observations that are used to optimize hyperparameters. Initialization points were selected by randomly shuffling the candidate locations and using the first ten points. For both algorithms the hyperparameters were optimized only on the initial ten samples and not modified afterwards.
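One possible realization of the length-scale fit of Eq. (4) on the initialization samples, reusing the encode and SSPBLRAgent sketches above; the bounded 1-D search and its bounds are illustrative, since the paper does not specify the optimizer.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_length_scale(X0, y0, axis_pointers, bounds=(0.01, 10.0)):
    # Eq. (4): choose l minimizing the squared error of the BLR
    # predictions on the ten initialization observations.
    def loss(l):
        Psi = np.stack([encode(x, axis_pointers, l) for x in X0])
        agent = SSPBLRAgent(d=Psi.shape[1])
        for psi_x, y in zip(Psi, y0):
            agent.update(psi_x, y)
        mu, _ = agent.query(Psi)
        return float(np.sum((y0 - mu) ** 2))
    return minimize_scalar(loss, bounds=bounds, method='bounded').x
```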

A. Experiment

We tested the algorithms on the Himmelblau, Branin-Hoo, and Goldstein-Price functions, which were used as benchmarks in [6]. We scaled the functions to make the problem a maximization, to ensure GP hyperparameter fitting converged, and to get regret numbers similar to those reported in [6]. The functions were evaluated, without noise, over a restricted domain, with points spaced evenly along each axis to form a $100 \times 100$ grid. Agents were given a budget of 250 samples. The domains and scale factors are given in Table I.

TABLE I: The tested functions, the evaluation domains, and the scale factors applied.

Function        | Domain             | Scale
Himmelblau      | [−5, 5] × [−5, 5]  | −1/100
Branin-Hoo      | [−5, 10] × [0, 15] | −1
Goldstein-Price | [−2, 2] × [−2, 2]  | −1/10^5

Algorithm performance was measured with regret averaged over the samples taken. The regret at each time point is the difference between the function value at the sampling location, $x$, and the maximum value of the function over the candidate sampling locations, $x^* = \arg\max_{x \in \mathcal{X}} f(x)$. The average regret at time $t$ is $R_t = \frac{1}{t}\sum_{i=1}^{t}\left(f(x^*) - f(x_i)\right)$. We also recorded the running total of the time it took to predict $\mu_t(x)$ and $\sigma_t^2(x)$ over the set of candidate sampling points, including, for SSP-MI, the one-time encoding of the points in $\mathcal{X}$ as SSPs.
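The average regret reduces to a cumulative sum over the observed function values; a minimal sketch, where f_star is $f(x^*)$ over the candidate grid:

```python
import numpy as np

def average_regret(y_samples, f_star):
    # R_t = (1/t) * sum_{i=1..t} (f(x*) - f(x_i)), for every t.
    y = np.asarray(y_samples, dtype=float)
    return np.cumsum(f_star - y) / np.arange(1, len(y) + 1)
```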

III. RESULTS

Fig. 1 shows the evolution of the algorithms' regret on the left, and the cumulative time spent evaluating sampling locations on the right. Shaded regions represent a 95% confidence interval for the $N = 30$ trials.

Except for some initial samples, the algorithms' average regret is largely indistinguishable. Table II reports the differences in the means and in the standard deviations of $R_t$ between the algorithms. Positive values mean the SSP-MI algorithm has lower regret or standard deviation. A Bayesian hypothesis test at samples 125 and 250 finds that the performance of the SSP-MI algorithm is either better than or statistically indistinguishable from that of the GP-MI algorithm with 95% probability. Where there is a statistical difference, the effect size (Cohen's d) is moderate to large. For regret performance, under the tested scenarios, there is no reason to choose GP-MI over SSP-MI.

The benefits of SSP-MI become apparent in the time to select sampling locations. At each iteration the algorithm re-computes the acquisition function for the candidate sampling locations. For GP-MI this time grows as a function of the number of samples collected. For SSP-MI the time to compute the acquisition function is constant in the number of samples collected, hence the linear trend observed in Fig. 1.

Of note is the large constant offset in SSP-MI's initial processing time. This is due to the time it takes to encode the candidate sampling locations as SSPs, $\{\psi(x) : x \in \mathcal{X}\}$. These values were cached so the encoding was performed only once.

[Fig. 1 panels: (a) Himmelblau average regret; (b) Himmelblau processing time; (c) Branin-Hoo average regret; (d) Branin-Hoo processing time; (e) Goldstein-Price average regret; (f) Goldstein-Price processing time. Each panel plots GP-MI and SSP-MI against Sample Number (0–250); left-column y-axes show Average Regret (a.u.), right-column y-axes show Cumulative Time (sec).]

Fig. 1: Graphs on the left show the average regret, $R_t$, and graphs on the right the total accumulated processing time. Shaded regions are 95% confidence intervals for $N = 30$ trials. In all cases the $R_t$ for SSP-MI is either the same as or statistically significantly better than that of the GP-MI algorithm. SSP-MI shows a substantial improvement in performance with respect to accumulated processing time.

IV. DISCUSSION

We have demonstrated, using biologically plausible representations, MI-driven exploration that has fixed limits on memory and computation time while still being defined over continuous spaces. Combining Spatial Semantic Pointers and Bayesian Linear Regression enables operation with limited memory and supports long-term operation in arbitrary spaces.

Our empirical regret performance is either statistically indistinguishable from, or better than, the baseline GP approach on three standard optimization targets. The time to evaluate candidate sampling locations is constant in the number of samples collected, unlike the GP-based approaches, and the memory requirement is $O(d^2)$ in the dimensionality of the SSP, compared to $O(t^2)$ in the number of observations for GP-MI. Note also that our estimate of compute time in Fig. 1 conservatively includes the one-time encoding cost. While it takes over 100 samples to amortize this cost, the encoding could be done prior to the onset of operations, which would further favour SSP-MI. Like occupancy grid methods, evaluation time grows linearly with the number of candidate locations, but SSP-MI retains its definition over the continuum.

TABLE II: The difference in regret measured at samples $t = 125$ and $t = 250$. SSP-MI regret is either better than or statistically indistinguishable from GP-MI at the selected sample points. We report the differences in the average regret and the standard deviations, as well as the effect size (Cohen's d), for 30 trials, with 95% high-density intervals (HDI). Positive values mean the SSP algorithm has a lower regret or standard deviation. Results use an unpaired Bayesian hypothesis test [29].

Function        | t   | E_GP[R_t] − E_SSP[R_t] | SD_GP[R_t] − SD_SSP[R_t] | Effect size
                |     | µ     95% HDI          | µ     95% HDI            | µ     95% HDI
Himmelblau      | 125 | 1.01  [ 0.50, 1.61]    | 0.09  [−0.11, 0.34]      | 0.93  [ 0.44, 1.48]
                | 250 | 1.06  [ 0.51, 1.60]    | 0.09  [−0.10, 0.35]      | 0.98  [ 0.45, 1.48]
Branin-Hoo      | 125 | 2.42  [−0.07, 4.82]    | 3.54  [ 1.72, 5.74]      | 0.67  [ 0.02, 1.35]
                | 250 | 2.83  [ 0.94, 4.70]    | 3.03  [ 1.68, 4.47]      | 0.79  [ 0.25, 1.34]
Goldstein-Price | 125 | 0.02  [−0.39, 0.41]    | 0.00  [−0.11, 0.11]      | 0.01  [−0.35, 0.42]
                | 250 | 0.02  [−0.37, 0.43]    | 0.00  [−0.12, 0.12]      | 0.02  [−0.37, 0.41]

Our algorithm represents an initial proof of concept for curiosity-guided exploration using vector representations. If SSPs are a unifying tool for modelling cognition, as in Eliasmith et al. [14], then our approach could also model curiosity in conceptual spaces. And while we encoded data with SSPs, other vector encodings that induce kernels could be used.

There remain algorithmic refinements to explore. Hexagonal SSPs [17] could improve the efficiency of encoding candidate sample locations and further connect this work to neural models of spatial representation. Performance degradation in response to noise also remains to be examined.

Because computing the acquisition function is efficient, it should be practical to find sample collection points, $x$, that maximize the acquisition function directly, instead of selecting from a finite set, avoiding regret due to arbitrary sampling of the function domain.

Further, we may be able to evaluate entire trajectories, not just individual sample locations. Single SSPs can represent trajectories and regions of a space, facts that may be exploited to efficiently evaluate trajectories for informativeness and, through computing dot products, feasibility in a configuration space. Because our curiosity model is goal-driven, via the $\mu(x)$ term in the acquisition function, a variable weighting of expected reward and information gain could allow switching between task-driven exploration and something akin to play.
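As a sketch of that idea, with the weighting parameter lam entirely hypothetical: lam = 1 recovers Algorithm 1's acquisition, lam = 0 is purely task-driven, and large lam approaches pure information-seeking, something akin to play.

```python
import numpy as np

def weighted_acquisition(mu, var, gamma, lam=1.0):
    # Hypothetical variable weighting of expected reward and information gain.
    return mu + lam * (np.sqrt(gamma + var) - np.sqrt(gamma))
```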

We have presented an efficient implementation of curiosity that may be of use in memory- and time-limited contexts. While preliminary, this work is a jumping-off point for efficient autonomous exploration.

ACKNOWLEDGEMENT

The authors would like to thank Nicole Sandra-Yaffa Dumont for discussions that helped improve this paper. This work was supported by CFI and OIT infrastructure funding as well as the Canada Research Chairs program, NSERC Discovery grant 261453, and NUCC NRC File A-0028850.

REFERENCES

[1] D. V. Lindley, “On a measure of the information provided by an experiment,” The Annals of Mathematical Statistics, pp. 986–1005, 1956.
[2] T. Loredo, “Bayesian adaptive exploration in a nutshell,” Statistical Problems in Particle Physics, Astrophysics, and Cosmology, vol. 1, p. 162, 2003.
[3] A. Singh, A. Krause, C. Guestrin, W. J. Kaiser, and M. A. Batalin, “Efficient planning of informative paths for multiple robots,” in IJCAI, vol. 7, 2007, pp. 2204–2211.
[4] D. R. Thompson and D. Wettergreen, “Intelligent maps for autonomous kilometer-scale science survey,” 2008.
[5] K. Yang, S. Keat Gan, and S. Sukkarieh, “A Gaussian process-based RRT planner for the exploration of an unknown and cluttered environment with a UAV,” Advanced Robotics, vol. 27, no. 6, pp. 431–443, 2013.
[6] E. Contal, V. Perchet, and N. Vayatis, “Gaussian process optimization with mutual information,” in International Conference on Machine Learning. PMLR, 2014, pp. 253–261.
[7] F. Bourgault, A. A. Makarenko, S. B. Williams, B. Grocholsky, and H. F. Durrant-Whyte, “Information based adaptive robotic exploration,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 1. IEEE, 2002, pp. 540–545.
[8] B. Charrow, S. Liu, V. Kumar, and N. Michael, “Information-theoretic mapping using Cauchy-Schwarz quadratic mutual information,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 4791–4798.
[9] Z. Zhang, T. Henderson, S. Karaman, and V. Sze, “FSMI: Fast computation of Shannon mutual information for information-theoretic mapping,” The International Journal of Robotics Research, vol. 39, no. 9, pp. 1155–1177, 2020.
[10] P. Kanerva, Sparse Distributed Memory. MIT Press, 1988.
[11] ——, “Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors,” Cognitive Computation, vol. 1, no. 2, pp. 139–159, 2009.
[12] T. Plate, “Holographic reduced representations: Convolution algebra for compositional distributed representations,” in IJCAI, 1991, pp. 30–35.
[13] T. A. Plate, “Holographic reduced representations,” IEEE Transactions on Neural Networks, vol. 6, no. 3, pp. 623–641, 1995.
[14] C. Eliasmith, “How to build a brain: From function to implementation,” Synthese, vol. 159, no. 3, pp. 373–388, 2007.
[15] B. Komer, “Biologically inspired spatial representation,” 2020.
[16] E. P. Frady, P. Kanerva, and F. T. Sommer, “A framework for linking computations and rhythm-based timing patterns in neural firing, such as phase precession in hippocampal place cells,” 2018.
[17] N. S.-Y. Dumont and C. Eliasmith, “Accurate representation for spatial cognition using grid cells,” in 42nd Annual Meeting of the Cognitive Science Society. Toronto, ON: Cognitive Science Society, 2020, pp. 2367–2373.
[18] E. I. Moser, E. Kropff, and M.-B. Moser, “Place cells, grid cells, and the brain's spatial representation system,” Annu. Rev. Neurosci., vol. 31, pp. 69–89, 2008.
[19] T. E. Behrens, T. H. Muller, J. C. Whittington, S. Mark, A. B. Baram, K. L. Stachenfeld, and Z. Kurth-Nelson, “What is a cognitive map? Organizing knowledge for flexible behavior,” Neuron, vol. 100, no. 2, pp. 490–509, 2018.
[20] A. R. Voelker, “A short letter on the dot product between rotated Fourier transforms,” arXiv preprint arXiv:2007.13462, 2020.
[21] I. K. Glad, N. L. Hjort, and N. G. Ushakov, “Correction of density estimators that are not densities,” Scandinavian Journal of Statistics, vol. 30, no. 2, pp. 415–427, 2003.
[22] I. K. Glad, N. L. Hjort, and N. Ushakov, “Density estimation using the sinc kernel,” Preprint Statistics, vol. 2, p. 2007, 2007.
[23] M. Schneider, W. Ertel, and G. Palm, “Expected similarity estimation for large scale anomaly detection,” in 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, 2015, pp. 1–8.
[24] J. Harrison, A. Sharma, and M. Pavone, “Meta-learning priors for efficient online Bayesian regression,” in International Workshop on the Algorithmic Foundations of Robotics. Springer, 2018, pp. 318–337.
[25] S. Banerjee, J. Harrison, P. M. Furlong, and M. Pavone, “Adaptive meta-learning for identification of rover-terrain dynamics,” arXiv preprint arXiv:2009.10191, 2020.
[26] V. Perrone, R. Jenatton, M. Seeger, and C. Archambeau, “Multiple adaptive Bayesian linear regression for scalable Bayesian optimization with warm start,” arXiv preprint arXiv:1712.02902, 2017.
[27] GPy, “GPy: A Gaussian process framework in Python,” http://github.com/SheffieldML/GPy, since 2012.
[28] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[29] J. K. Kruschke, “Bayesian estimation supersedes the t test,” Journal of Experimental Psychology: General, vol. 142, no. 2, p. 573, 2013.