A Deeper Look at Bongard Problems
Xinyu Yun, Tanner Bohn, and Charles Ling
Western University, London ON N6A 3K7, Canada
{xyun,tbohn,charles.ling}@uwo.ca
Abstract. Machine learning, especially deep learning, has been successfully applied to a wide array of computer vision classification tasks in recent years. Infamous for requiring massive amounts of data to perform well at image classification problems, deep learning has so far been unable to solve Bongard problems (BPs), a set of abstract visual reasoning tasks invented in the 1960s. Each BP can be seen as a supervised learning task, with few training samples (6 for positive and 6 for negative), and often requiring highly abstract features to learn well. Automatically solving Bongard problems directly from images remains an ambitious goal, with very little machine learning literature on the topic. In this paper, we discuss several special properties of BPs as well as what it means to solve a BP. Making use of an expanded set of BP-like tasks to allow for a more careful evaluation of automated solvers, we develop and benchmark a deep learning based approach to solve these problems. To encourage work on this interesting problem, we also make freely available a dataset of over 200 BPs¹.

Keywords: Bongard Problems · Convolutional Neural Networks · Feature Extraction · Few-Shot Learning.
1 Introduction
Despite recent successes in machine learning on many problems previously considered beyond the reach of artificial intelligence, tasks requiring divergent thinking, abstraction, and few-shot learning continue to be a challenge. While other tasks requiring one or more of these properties have seen recent attention and progress [12, 9], Bongard problems (BPs), which appear to require the solver to possess all three of these skills, continue to be largely unstudied. Created in the 1960s by Mikhail Bongard, these problems were designed to demonstrate the inadequacy of the standard pattern recognition tools of the day for achieving human-level visual cognition [1].
A typical Bongard problem consists of 12 tiles evenly divided into a left and a right class. To gauge the cognitive abilities of a test subject, the subject is shown the 12 tiles and then asked to provide a rule which distinguishes the tiles appearing on one side from the tiles on the other. For example, the intended rule for the second problem in Figure 1 is 'clockwise spirals on the left, counterclockwise spirals on the right'.

¹ https://github.com/XinyuYun/bongard-problems

Fig. 1. Examples of easy, intermediate, and difficult Bongard problems.
As a classification task, BPs possess several properties which make them both interesting and difficult with respect to machine learning. A few of these properties are shared with other well-studied tasks; however, other properties also establish Bongard problems as uniquely difficult².
Divergent thinking. The three Bongard problems in Figure 1, ranging from easy to difficult, demonstrate the typical variation, both visually and in terms of solutions. Since there is a very large number of potential features to consider and many ways these features can be combined to define different rules, divergent thinking is required to perform well at Bongard problems. This property is also partially shared by the Raven's Progressive Matrices (RPM) task [11], where deciding upon the tile which best completes the matrix requires considering many alternative hypotheses to find the one requiring the simplest justification using a fixed set of visual features and sequence progression relations. There is considerably more diversity in the visual elements and rule types in Bongard problems.
Abstract thinking. To solve the second problem in Figure 1, recognizing that the shapes have the characteristic of spiraling requires abstract thinking, because the property of spiraling is not physically present, but exists as a non-trivial relationship between points on a curve. The patterns that must be identified to solve a problem often are not directly visible, but exist as complex relationships between other abstract features. For example, finding the intended rule for the third problem in Figure 1 likely requires observing that the individual shapes of a particular type should be grouped together to form the outlines of larger shapes.
Few-shot learning. To recognize that all objects on a given side share one potentially complex property among innumerable alternatives given only six samples per class requires few-shot learning. In contrast, datasets for image classification problems often have orders of magnitude more samples per class. This few-shot learning property is shared with both the popular Omniglot task (concerned with classification of hand-written characters) [9] and Raven's Progressive Matrices (matrix completion) [11].

² A description of what does and does not make for a valid BP can be found here: http://www.foundalis.com/res/invalBP.html.
For most of these properties, machine learning has had some success on associated problems. However, when multiple properties are present, as in the case of BPs, learning to automatically solve the tasks becomes much more difficult. Due to this large performance gap and the unique challenges of BPs, we believe studying BPs is an efficient route towards reaching human-level performance across a variety of tasks.
Towards this end, the main contributions of the present work are as follows.

– We adapt a deep learning based approach to solve Bongard problems and overcome weaknesses in previous approaches (Section 3).
– We consider the set of properties which make BPs uniquely difficult and propose a set of metrics for automatic evaluation of BP solvers, which interprets BPs as few-shot classification tasks (Section 4).
– We evaluate our deep learning based approaches on the BPs while examining the effects of pre-training and feature extraction methods (Section 5).
2 Related Work
Due to the difficulty of automatically solving BPs, or a lack of awareness of them, few attempts at the task have been made.

Motivated by the appearance of Bongard problems in Gödel, Escher, Bach [6], Hofstadter's own graduate student, Harry Foundalis, decided to approach the problem of automatically solving them in his dissertation [5]. Foundalis' approach consists of a cognitive architecture for visual pattern recognition called Phaeaco, which tries to solve BPs with the following process. First, working at the pixel level, Phaeaco attempts to explicitly extract the geometric primitives contained in each of the 12 tiles of a problem. Next, features shared among the tiles for each side are identified. This is repeated either until a rule is found or some stopping criterion is reached. The Phaeaco model is capable of finding solutions for up to 15 problems out of 200³. Due to the non-deterministic nature of the program, the success rate for each of these problems varies dramatically, between 6% and 100%.

³ Phaeaco results can be found here: http://www.foundalis.com/res/solvprog.htm.
A more recent approach to solving Bongard problems is provided by [3]. Similar to Phaeaco, their pipeline begins with explicit extraction of visual features. These visual features are then translated into a symbolic visual vocabulary. Candidate rules which split the 12 tiles are scored based on the prior probabilities assigned to the grammar's production rules that generated each candidate, in such a way that shorter, less complex rules are preferred. Under this restriction, only 39 BPs are considered. The approach solves 35 of the 39 problems.
A recent approach utilizing deep learning to solve BPs was proposed in an intriguing blog post by [7]⁴. While this approach does not entirely avoid manually defining the type of visual features that are important to consider, it comes close, and is the inspiration for the model we present in Section 3. Kharagorgiev's approach works roughly as follows: first, an image dataset of simple shapes is automatically constructed and used to train a convolutional neural network (CNN), providing domain knowledge. Second, for each of the 12 tiles, a feature vector is extracted with the CNN by taking globally-averaged feature maps, as proposed in [10], and binarized with a manually chosen threshold. Finally, finding a solution to a BP is reduced to locating a feature where all tiles from each side have the same value, unique to that side. Of the 232 problems assembled by Foundalis⁵, 47 problems are considered solved, 41 of which are correctly solved.

⁴ https://k10v.github.io/2018/02/25/Solving-Bongard-problems-with-deep-learning/
⁵ The set of original BPs by Mikhail Bongard as well as those proposed by others can be found here: http://www.foundalis.com/res/bps/bpidx.htm.
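As a rough illustration of this final rule-search step (a sketch based on our reading of the blog post; the function and variable names are our own, and the threshold value is an assumption), one could look for a binarized feature that separates the two sides as follows:

```python
import numpy as np

def find_separating_features(left_feats, right_feats, threshold=0.5):
    """Search for feature indices whose binarized value is constant within
    each side and differs between sides.

    left_feats, right_feats: arrays of shape (6, n_features) holding
    globally-averaged CNN activations for the 6 tiles of each side.
    """
    left_bin = left_feats > threshold      # (6, n_features) booleans
    right_bin = right_feats > threshold
    # A feature solves the problem if all left tiles agree on one value
    # and all right tiles agree on the other value.
    left_const = np.all(left_bin == left_bin[0], axis=0)
    right_const = np.all(right_bin == right_bin[0], axis=0)
    differs = left_bin[0] != right_bin[0]
    return np.flatnonzero(left_const & right_const & differs)
```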
3 Our Models
Due to the uniqueness (both visually and in terms of solutions) of Bongard problems and the small size of the problem set compiled over the years (currently around 300), training a meta-learning model on a subset of the problems and applying it to new problems is difficult without overfitting to the specific rule types present in the training data. These properties make recent state-of-the-art approaches for few-shot classification problems [14] ill-suited for Bongard problems. In an attempt to overcome these hurdles, we apply transfer learning, a common deep learning based approach to learning with small data. The approach we take is to pre-train a convolutional neural network on synthetic images that contain visual features commonly present in BP tiles, and then train a simple classifier on feature vectors for the 12 tiles produced from the CNN feature maps. Figure 2 provides a high-level view of this process.
Fig. 2. Bongard problem solver pipeline (the figure also shows example pre-training samples).
Pre-training. Pre-training for image classification, as described in [4], popularized the insight that rather than learning to perform a new classification task from scratch, one can take advantage of knowledge coming from previously learned categories. By training a machine learning model to perform one task, it may implicitly discover features useful for learning to perform another, similar task. Compared to past approaches to solving BPs, which extracted visual and abstract features using hard-coded feature detectors and routines [3, 5], we can influence what patterns a CNN discovers by simply augmenting the training task to require discovering those patterns, a much easier task than manually writing algorithms to detect those particular features. To ensure that the features we extract from the feature maps are relevant to visual patterns present in BPs, we pre-train the CNN on a related task: shape classification. Figure 2 shows some pre-training samples as well. In Section 5.2, we examine the effect of increasing the variety of shapes on final BP solver performance.
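To give a sense of what such a pre-training set looks like, the following sketch draws a few of the simpler shape classes procedurally. It is an illustrative generator of our own, not the exact one used in our experiments.

```python
import numpy as np
from PIL import Image, ImageDraw

def make_tile(shape_class, size=96, rng=None):
    """Render one 96x96 grayscale tile containing a single random shape.
    Illustrative generator covering only a few of the simpler classes."""
    rng = rng if rng is not None else np.random.default_rng()
    img = Image.new("L", (size, size), color=255)        # white background
    draw = ImageDraw.Draw(img)
    if shape_class == "line":                            # single-segmented line
        x0, y0, x1, y1 = (int(v) for v in rng.integers(10, size - 10, size=4))
        draw.line((x0, y0, x1, y1), fill=0, width=2)
    elif shape_class == "dot":
        x, y = (int(v) for v in rng.integers(10, size - 10, size=2))
        draw.ellipse((x - 2, y - 2, x + 2, y + 2), fill=0)
    elif shape_class == "circle":
        x, y = (int(v) for v in rng.integers(25, size - 25, size=2))
        r = int(rng.integers(8, 20))
        draw.ellipse((x - r, y - r, x + r, y + r), outline=0, width=2)
    elif shape_class == "triangle":                      # 3-gon
        pts = [tuple(int(v) for v in rng.integers(10, size - 10, size=2))
               for _ in range(3)]
        draw.polygon(pts, outline=0)
    return np.asarray(img, dtype=np.float32)[..., None] / 255.0   # 96x96x1

# Example usage: tiles drawn uniformly over the chosen shape classes.
# rng = np.random.default_rng(0)
# classes = ["line", "dot", "circle", "triangle"]
# tiles = np.stack([make_tile(rng.choice(classes), rng=rng) for _ in range(1000)])
```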
Feature Extraction. To extract features for a given BP tile, we use globally-averaged feature map activations, which compute the spatially averaged activation value for each kernel [10]. The magnitude of a globally-averaged value can be interpreted as the prevalence of a particular feature in the input image, with features in earlier layers often corresponding to simple visual features and later layers detecting features corresponding to more abstract concepts specific to the dataset and task [15]. In Section 5.2, we examine the effects of extracting features from layers of different depths in the pre-trained CNNs.
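As a minimal sketch of this step (assuming a Keras CNN whose convolutional layers are named conv_0 through conv_3 as in Figure 4; those names are ours), the globally-averaged feature maps of a chosen layer can be computed as follows:

```python
import tensorflow as tf

def tile_embedding(pretrained_cnn, tiles, layer_name="conv_3"):
    """Return one feature vector per tile: the spatial average of every
    feature map produced by the chosen convolutional layer.

    tiles: float array of shape (n_tiles, 96, 96, 1).
    """
    feature_model = tf.keras.Model(
        inputs=pretrained_cnn.input,
        outputs=pretrained_cnn.get_layer(layer_name).output)
    maps = feature_model(tiles)                        # (n_tiles, H, W, 64)
    return tf.reduce_mean(maps, axis=[1, 2]).numpy()   # (n_tiles, 64)
```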
Classification. After calculating feature vectors for each tile in a BP, we train a classifier to distinguish between the two classes. While any classifier may be used, careful consideration is needed, since the choice of classifier influences the type of rule it can learn. In Section 4, we discuss the different types of solutions and rules for Bongard problems. In Section 5.2, we also observe the effects of the classifier on performance.
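For illustration, the two classifiers we experiment with in Section 5 could be fit to the 12 tile embeddings of a single problem with scikit-learn roughly as follows; the helper name and the particular regularization value shown are our own choices.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def fit_rule(tile_vectors, labels, single_feature=True):
    """Fit a candidate rule on the 12 tiles of one Bongard problem.

    tile_vectors: (12, n_features) embeddings from the pre-trained CNN.
    labels: 12 entries, 0 for left-side tiles and 1 for right-side tiles.
    single_feature=True mimics PT+SF (a depth-1 decision tree picks one
    feature and a threshold); False mimics PT+LR (logistic regression
    linearly combines many features).
    """
    if single_feature:
        clf = DecisionTreeClassifier(max_depth=1)
    else:
        # C is one value from the grid listed in Table 3.
        clf = LogisticRegression(penalty="l2", C=8, max_iter=1000)
    clf.fit(tile_vectors, labels)
    # The proposed rule is valid if it reproduces the original left/right split.
    is_valid = clf.score(tile_vectors, labels) == 1.0
    return clf, is_valid
```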
4 Evaluating Bongard Problem Solvers
To understand how to automatically evaluate a BP solver, it helps to understand what properties a solution may possess. In the present work, we consider proposed solutions to possess (or lack) the following properties.

Validity. We consider a proposed rule to be valid if it is able to correctly split (classify) the original 12 tiles, and invalid otherwise. We consider a rule to be a condition that can categorize tiles into left or right (either correctly or incorrectly), whereas a solution is a rule which is valid and can thus correctly categorize the 12 original tiles.
Robustness. We consider a solution to be robust if it is able to not only
classify the original 12 tiles, but also additional ones which are classified left or
right according to the intended rule, defined by the author of the problem.
Simplicity. An intuitive definition, although often impractical to use for
evaluation, is that a simple solution takes few words to state. The opposite of a
simple rule is a complex rule.
Figure 3 illustrates how valid rules (solutions) to a given problem may vary in robustness and simplicity. In Section 4.1 we discuss how to evaluate a solver with respect to validity, and in Sections 4.2 and 4.3, we discuss evaluation with respect to robustness and simplicity.

Fig. 3. Bongard #5 and various valid solutions (assuming each tile is 100px by 100px).
4.1 Measuring Validity
A model is said to produce a valid solution for a BP if the proposed rule correctly
splits the 12 tiles into two groups. This corresponds to the evaluation method
used by [5] and [3] (without the next step of subjective analysis). To condense
the validity performance of a model into a single value, we average the validity
scores across a set of Bongard problems:
\[
\text{validity} = \frac{1}{\#\mathrm{BPs}} \sum_{p \in \mathrm{BPs}} p_{CC}
\tag{1}
\]

where p_CC is 1 if the rule proposed for problem p correctly classifies all 12 of its tiles (i.e., the rule is valid for p), and 0 otherwise.
To accompany the validity score, we consider the average problem number at which a valid solution is found. This allows us to observe whether our models have a bias, similar to humans, towards solving easy problems more often than difficult ones. This works due to the trend of problem difficulty increasing with problem number in the set of 200 BPs compiled by [5].
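For concreteness, a sketch of computing validity and the average problem number from per-problem results (the notation is ours):

```python
import numpy as np

def validity_metrics(solved_flags):
    """solved_flags: dict mapping problem number -> True if the proposed rule
    correctly classifies all 12 original tiles of that problem."""
    solved = sorted(p for p, ok in solved_flags.items() if ok)
    validity = len(solved) / len(solved_flags)
    avg_bp_number = float(np.mean(solved)) if solved else float("nan")
    return validity, avg_bp_number
```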
4.2 Measuring Robustness
Since the only way a solution can be robust is with respect to the intended
solution, we use a functional definition of robustness. If a rule is able to correctly
classify unseen samples from each class then it can be considered robust.
Here we define BPs(v) as the subset of Bongard problems for which our model found a valid solution, and #BPs(v) as the number of such problems. We average the robustness score over BPs(v):

\[
\text{robustness} = \frac{1}{\#\mathrm{BPs}(v)} \sum_{p \in \mathrm{BPs}(v)} \mathrm{newTilesCC}_p
\tag{2}
\]

where newTilesCC_p is the fraction of the new tiles for problem p that are correctly classified.
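A corresponding sketch for robustness, averaged only over the problems with valid solutions:

```python
def robustness_score(new_tile_accuracy, solved_flags):
    """new_tile_accuracy: dict mapping problem number -> fraction of the
    additional test tiles correctly classified by the proposed rule.
    Only problems with a valid solution contribute to the average."""
    solved = [p for p, ok in solved_flags.items() if ok]
    if not solved:
        return float("nan")
    return sum(new_tile_accuracy[p] for p in solved) / len(solved)
```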
As noted by [3], Bongard problems are unlike usual classification problems in that the small number of examples for each class are often carefully chosen to have a single property in common while ruling out as many alternatives as possible. Leaving out even one or two tiles opens up the possibility of finding many non-intended solutions. Additionally, this interpretation of robustness ignores the case where a rule acts unexpectedly when presented with tiles that do not clearly belong to either side. If left vs. right is circles vs. squares, what does it mean if a picture of a lamp is classified left? We therefore only consider robustness under the assumption that all tiles presented belong to either the left or the right.
4.3 Measuring Simplicity
Measuring the simplicity of a tile classification rule learned by an automated solver may be extremely difficult. The problem of interpreting how a deep learning model works is well studied with regard to image classification, and is often approached with saliency maps, which show the parts of the input image which most influence the classification results [15, 13]. In the present work, we do not attempt to define a rule simplicity measure; however, in Section 5.3, we consider visualizing activation maps to gain insight into the types of rules discovered by our models.
5 Experiments and Results
In this section we analyze the performance of two variations of our problem-solving model and the effects of their hyperparameters. The first model, PT+SF, uses pre-training and single-feature classification (a decision tree of depth 1). The second, PT+LR, also utilizes pre-training, but can propose rules combining many features using a logistic regression classifier.

First we discuss the experimental setup in Section 5.1, then we discuss observations made in Section 5.2, and in Section 5.3 we produce visualizations of the rules implicitly learned by a solver and examine their utility.
5.1 Setup
To observe the effect of feature abstraction on BP solver performance with as
few confounding variables as possible, we use the same hyperparameters for each
of the convolutional layers (architecture shown in Figure 4):

– 64 kernels of size 3x3 with stride 1, ReLU non-linearity, followed by 2x2 max-pooling with stride 2.
– For the PT+SF and PT+LR models, the output of the last convolutional layer is mapped to shape class probabilities with a dense layer and softmax activation.
– The models are trained with categorical cross-entropy loss and the Adam optimizer [8] with the default hyperparameters defined by Keras [2].

Fig. 4. Neural network architecture used. An input tile of size 96x96x1 passes through four convolutional layers (conv_0 to conv_3) with the same hyperparameters, each followed by 2x2 max pooling, and then through a flatten operation and a dense layer with softmax output.
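A minimal Keras sketch of this architecture is given below; the layer names and the choice of 'same' padding are assumptions we make for illustration, since padding is not constrained by the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_pretraining_cnn(n_shape_classes):
    """Four identical convolutional blocks (64 3x3 kernels, stride 1, ReLU,
    2x2 max pooling with stride 2), then a dense softmax over shape classes."""
    inputs = tf.keras.Input(shape=(96, 96, 1))
    x = inputs
    for i in range(4):
        # padding="same" is an assumption; the paper does not specify padding.
        x = layers.Conv2D(64, 3, strides=1, padding="same",
                          activation="relu", name=f"conv_{i}")(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(n_shape_classes, activation="softmax", name="output")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```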
We use 100,000 tiles (80/20 train/validation split) of size 96x96x1. Table 1 contains the details of the five pre-training data types of increasing complexity we experiment with and the number of training epochs we found to produce stable validation scores without overfitting. The final validation accuracy ranged from 100% for the easiest pre-training set to 93% for the most complex set.
Table 1. Pre-training data details.

Type  Shape classes                                               # Shape classes  Training epochs
1     Single-segmented lines, dots, curves                        3                3
2     #1, circles, ellipses                                       7                6
3     #2, 3-gons, equilateral 3-gons                              11               20
4     #3, 2- and 3-segmented lines, 4-gons, equilateral 4-gons    17               20
5     #4, 5- and 6-gons, equilateral 5- and 6-gons                25               20
To evaluate the overall validity of our models, we incrementally combine and keep all useful features from each convolutional layer, including the output layer, whose small number of shape-class probabilities may still carry simple shape information useful for solving certain BPs, in order to obtain consistent evaluation results. The feature sets evaluated are therefore:

output, output+CL3, output+CL3+CL2, ...
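A sketch of this incremental feature combination, again assuming the layer names conv_0 through conv_3 from Figure 4:

```python
import numpy as np
import tensorflow as tf

def combined_features(cnn, tiles, layer_names):
    """Concatenate the softmax output with globally-averaged feature maps from
    the listed convolutional layers, e.g. layer_names = [], ["conv_3"],
    ["conv_3", "conv_2"], and so on."""
    parts = [cnn(tiles).numpy()]                 # the 'output' features
    for name in layer_names:
        maps = tf.keras.Model(cnn.input, cnn.get_layer(name).output)(tiles)
        parts.append(tf.reduce_mean(maps, axis=[1, 2]).numpy())
    return np.concatenate(parts, axis=1)         # (n_tiles, combined dim)
```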
5.2 Results and Observations
First, we evaluate the validity scores and the average values of BP#(v), as defined in Section 4.1, on the 200 BPs. We then manually created two additional test tiles for each problem (one for each side) in order to estimate robustness on the problems for which a model finds valid solutions. All experimental results for PT+SF and PT+LR are averaged across 5 trials.

Tables 2 and 3 show the validity and robustness of the proposed PT+SF and PT+LR models, with results broken down by pre-training type and the combined layers used for feature extraction.
Table 2. Effects of pre-training and CNN combined layers used for feature extraction on PT+SF performance, averaged over 5 trials. CLi refers to the ith convolutional layer. The highest scores for each metric are bolded, and the second and third highest underlined.

Validity (avg BP#(v)), by pre-training type 1 to 5:
  output                     1.2% (89)    3.3% (50)    3% (21)      2.6% (49)    1.9% (66)
  output+CL3                 14.3% (78)   18.6% (84)   19.5% (79)   22.1% (85)   19.2% (84)
  output+CL3+CL2             18.8% (83)   22.1% (86)   25% (84)     27% (90)     24.7% (90)
  output+CL3+CL2+CL1         21.5% (87)   25.1% (87)   26.4% (85)   28.6% (90)   28.0% (91)
  output+CL3+CL2+CL1+CL0     23.3% (90)   26.4% (88)   27.7% (88)   30.2% (93)   28.6% (91)

Robustness, by pre-training type 1 to 5:
  output                     97.50%   81.90%   64.94%   66.16%   84.16%
  output+CL3                 61.18%   60.58%   62.92%   64.74%   66.30%
  output+CL3+CL2             62.50%   62.36%   61.84%   62.66%   65.64%
  output+CL3+CL2+CL1         62.30%   63.30%   62.50%   64.58%   67.52%
  output+CL3+CL2+CL1+CL0     63.86%   60.84%   63.36%   63.34%   66.26%
Table 3. Effects of pre-training type and CNN layers used for feature extraction on PT+LR performance. For the results shown, the logistic regression penalty is fixed to l2 and the inverse regularization strength is chosen from C = [1, 2, 4, 8, 16, 32, 64, 128]. The highest scores for each metric are bolded, and the second and third highest underlined.

Validity (avg BP#(v)), by pre-training type 1 to 5:
  output                     1% (78)      2.7% (67)    1.8% (49)    2.5% (22)    3.7% (43)
  output+CL3                 71.9% (96)   94.7% (96)   96.4% (97)   98.1% (98)   99.7% (99)
  output+CL3+CL2             78.8% (99)   95.9% (97)   97.9% (98)   98.6% (98)   99.8% (99)
  output+CL3+CL2+CL1         79.2% (100)  95.9% (97)   98.1% (98)   98.7% (98)   99.8% (99)
  output+CL3+CL2+CL1+CL0     78% (100)    95.9% (97)   98.1% (98)   98.7% (98)   99.8% (99)

Robustness, by pre-training type 1 to 5:
  output                     100.00%  84.28%   58.00%   74.32%   60.56%
  output+CL3                 57.88%   54.64%   56.74%   54.74%   55.76%
  output+CL3+CL2             56.08%   54.52%   56.68%   55.08%   55.92%
  output+CL3+CL2+CL1         56.40%   54.34%   56.86%   55.42%   56.72%
  output+CL3+CL2+CL1+CL0     57.00%   54.34%   56.86%   55.22%   56.92%
Effects of pre-training complexity. For both PT+SF and PT+LR, increasing the variation of the shape set led to an improvement in both validity and robustness, with the effect being stronger when using the logistic regression classifier. The only exception is when the output class distributions are the only extracted features, in which case the robustness scores may be unusually high due to the small size of BPs(v).
Effects of layer combination. For both pre-trained models, it appears that including more convolutional layers produces better features when measuring validity, but robustness is less affected. This may be due to the deeper convolutional layers learning features specific to the shape classification task and thus less applicable to other tasks [15]. We can also observe that the PT+LR model can be seen as over-fitting when measuring validity (as indicated by the low corresponding robustness).

The output layer consistently performs poorly for both PT+SF and PT+LR in terms of validity, likely due to the small number of shape classes listed in Table 1. This also suggests that just knowing what basic shapes are present in the image is helpful for solving only a small set of simple Bongard problems.
Effects of classifier. In the PT+SF model, we used a decision tree of depth 1 to choose a single visual feature to serve as a rule for each Bongard problem. From the results, this very simple classifier has generally lower validity scores compared with the PT+LR model, but is more robust. This observation matches the nature of BPs: they are often designed to be solved with only one abstract rule or feature as the intended solution. Thus, PT+SF may score higher in simplicity than PT+LR. The PT+LR model, using logistic regression, linearly combines many features. Not surprisingly, this more expressive classifier is capable of producing much higher validity.
Overall performance. While direct performance comparisons should not be drawn to previous approaches due to differences in the types of rules automatically produced, our approaches are capable of finding valid solutions to a greater fraction of problems than previous approaches. Our PT+SF model finds valid solutions for up to 30.2% of the problems (60/200) and correctly classifies the two new test tiles for 66.3% of those problems (38/60). The PT+LR model achieves almost 100% validity, but at the cost of more complex solution rules. In contrast, [5] reports 7.5%, and the previous work most similar to ours, [7], without further test set validation, reported 18% of 232 problems solved (19% of the 200 problems we use).
5.3 Rule Visualization
In Figure 5 we present 8 problems for which a valid solution was found by a
PT+SF model which used pre-training set #4 and tile embeddings from the
feature maps in the last convolutional layer. Highlighted areas indicate higher
values in the activation map chosen by the BP solver for that problem. The
intended rules are provided for each problem.
Problems (a) to (d) in Figure 5 have arguably interpretable rules. The intended rule for (a) is 'small shapes present on the left but not the right', and as expected, small figures are highlighted by the activation map of the automatically chosen filter. In both (b) and (c), the shapes clearly associated with the intended rules are highlighted. However, for (d), it appears that a valid, although non-intended, solution was identified: there is more empty space around the corners on the right than on the left. The intended solution for this problem is 'same curvature close to the middle vs. change of curvature close to the middle'. Problems (e) to (h) have also had valid solutions identified, but serve to demonstrate that a standard method of visualizing what a CNN has learned is frequently not well-suited for Bongard problems, as it is not always clear what part of the tiles should be highlighted to make the discovered rule more visible.

(a) BP #21: small figure present vs. no small figure present. (b) BP #25: filled figure is a triangle vs. filled figure is a circle. (c) BP #94: filled circle not at endpoint vs. filled circle at endpoint. (d) BP #183: same curvature close to the middle vs. change of curvature close to the middle. (e) BP #8: on the right side vs. on the left side. (f) BP #17: angle directed inwards vs. no inward angle. (g) BP #101: parallel dents vs. perpendicular dents. (h) BP #164: number of objects is one less than sides vs. number of objects is more than sides.

Fig. 5. Examples of interpretable (a to d) and non-interpretable (e to h) visualizations of valid rules found by the PT+SF model for Bongard problems.
6 Conclusions
Bongard problems are a kind of visual puzzle which require skills central to
human intelligence: divergent thinking, abstract thinking, and the ability to learn
from little data. To solve these problems given raw images, we train a CNN to
perform the related task of shape classification and use the globally-averaged
feature maps to produce feature vectors for the tiles of a BP. We observed that
increasing the shape variation of the pre-training data as well as extracting
features from deeper convolutional layers tended to improve the quality of the
extracted feature vectors, increasing the number of problems for which valid and
robust solutions could be discovered.
The present work hints at many promising avenues. While the author of a problem may have a particular rule in mind, an automated solver may identify many valid solutions. Adding an active learning component to Bongard problems, requiring automated solvers to strategically test highly abstract hypotheses, may be interesting. Developing a visualization technique capable of conveying the abstract rules learned by an automated solver is another task which may prove to be important.
References
1. Bongard, M.M.: The problem of recognition. Fizmatgiz, Moscow (1967)
2. Chollet, F.: keras. https://github.com/fchollet/keras (2015)
3. Depeweg, S., Rothkopf, C.A., Jäkel, F.: Solving bongard problems with a visual
language and pragmatic reasoning. arXiv preprint arXiv:1804.04452 (2018)
4. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 594–611 (2006). https://doi.org/10.1109/TPAMI.2006.79
5. Foundalis, H.: Phaeaco: A cognitive architecture inspired by Bongard's problems. Ph.D. thesis, Indiana University, Bloomington (2006)
6. Hofstadter, D.R.: Gödel, Escher, Bach. Vintage Books, New York (1980)
7. Kharagorgiev, S.: Solving Bongard problems with deep learning (Feb 2018), https://k10v.github.io/2018/02/25/Solving-Bongard-problems-with-deep-learning/
8. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR
abs/1412.6980 (2014), http://arxiv.org/abs/1412.6980
9. Lake, B., Salakhutdinov, R., Gross, J., Tenenbaum, J.: One shot learning of simple
visual concepts. In: Proceedings of the Annual Meeting of the Cognitive Science
Society. vol. 33 (2011)
10. Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400
(2013)
11. Raven, J.C., et al.: Raven’s progressive matrices. Western Psychological Services
Los Angeles, CA (1938)
12. Santoro, A., Hill, F., Barrett, D., Morcos, A., Lillicrap, T.: Measuring abstract
reasoning in neural networks. In: International Conference on Machine Learning.
pp. 4477–4486 (2018)
13. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
14. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks
for one shot learning. In: Advances in Neural Information Processing Systems. pp.
3630–3638 (2016)
15. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In:
Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV
2014. pp. 818–833. Springer International Publishing, Cham (2014)