Solving the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) AI Benchmark with ICOM
Kyrtin Atreides1, David J Kelley2
1-2 AGI Laboratory, Seattle WA 98116, USA
Kyrtin@AGILaboratory.com
Abstract. A fragment of the 8th-generation Independent Core Observer Model (ICOM) cognitive architecture is applied to the ARC-AGI Challenge benchmark, absent any training on ARC-AGI or ARC-like puzzles. This achieved a baseline performance of between 83.75% and 85.75%, with an upper bound for the tested fragment of 89.5% based on the consistent failures and errors observed. Average human performance on this benchmark is 85%. ICOM's performance is for completely accurate solving of each puzzle, with substantially higher pixel-level accuracy than other methods, as the failures observed in the remaining incorrect answers were relatively small, often 3-4 pixels in total. The tested fragment is a mid-development fragment of a general-purpose system slated for commercial deployment, rather than anything designed for this challenge; indeed, even the 7th generation of ICOM predates ARC-AGI. This fragment outperformed all other methods, in terms of both efficacy and efficiency, by a wide margin.
Keywords: AGI, ICOM, Cognitive Architecture, Benchmarks, Reasoning
1 Introduction
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), first released in November 2019, is a benchmark designed to measure the learning of discrete priors [1]. Several design factors give this benchmark robust resistance to being gamed by LLMs and similarly narrow neural networks (NNs), including:
• The discrete nature of learning priors,
• a combinatorial explosion from combining those priors in arbitrary
ways and combinations not seen in the data provided for training,
• a very limited quantity of training data,
• output grids whose size is a variable that must also be solved,
• and the requirement that every “pixel” (digit) in the output matrix
must be precisely correct.
The 2024 competition ended with top scores reaching into the 50% range for the first time, with top teams each making hundreds of submissions in that year alone. The top team on the Prize leaderboard as of November 2024, MindsAI, made 456 submissions during the 2024 competition and had worked on the challenge in previous years.
The ARC-AGI Challenge is presented to humans visually as a grid of pixels, but in actuality each task is a JSON-formatted series of grids whose values range from 0 to 9, making it substantially easier for machines to process than a truly visual test. Humans score an average of 85% accuracy on this benchmark, a level that, even with recent advances, remains far out of reach for these neural network approaches.
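For readers unfamiliar with the raw format, a minimal C# sketch of the task structure follows. The class and property names are our own illustrative choices, mirroring the published JSON keys ("train", "test", "input", "output"); they are not the types used inside ICOM, and Newtonsoft.Json is assumed only for illustration.

using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

// Mirrors the on-disk ARC-AGI task format: a "train" array of demonstration
// pairs and a "test" array of pairs to solve. Each grid is a matrix of digits
// 0-9, one digit per "pixel". Illustrative only.
public class ArcPair
{
    [JsonProperty("input")]  public List<List<int>> Input  { get; set; }
    [JsonProperty("output")] public List<List<int>> Output { get; set; }
}

public class ArcTask
{
    [JsonProperty("train")] public List<ArcPair> Train { get; set; }
    [JsonProperty("test")]  public List<ArcPair> Test  { get; set; }

    public static ArcTask Load(string path) =>
        JsonConvert.DeserializeObject<ArcTask>(File.ReadAllText(path));
}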
Following this brief introduction, we present the results of applying a fragment of the Independent Core Observer Model (ICOM) cognitive architecture [2] to this challenge, followed by an explanation of how it works. We apologize for the lack of Python-based visuals, as we work with C# and the Microsoft engineering stack and do not use Python at all.
2 Results
Our testing began with the publicly accessible training and evaluation datasets. The first 5 runs consisted of 1 run of the training dataset and 3 runs of the evaluation dataset using a binary .exe version, plus 1 run of the evaluation dataset using a web form and API. This phase focused on cleaning up the data pipeline to avoid a variety of processing errors that could otherwise prevent any answer, correct or incorrect, from being produced for a given problem.
Fig. 1. The very first test, checking how the context engine fragment of ICOM would handle the
ARC-AGI training dataset. This dataset proved relatively clean, with only 10 errors thrown in
processing, and a score of 91% on the remaining 390 puzzles that were processed.
As our systems don’t “train” like NNs, they require no “training data”, and can
simply be applied to any data that they can process. This places key importance on
making sure that the data pipeline has as few processing errors as possible, but it also
means that the scarcity of training data doesn’t impact the ICOM fragment at all. After
the first test proved so promising, we promptly moved on to the evaluation dataset.
Fig. 2. The first run of the evaluation dataset threw significantly more data pipeline errors than
the training dataset, dropping 66 of the tasks from processing, but still scoring 88% on the 334
puzzles that did process.
Treating all data pipeline errors as failures, even though no answers are given on those to score, produces a score of 74.25%. While this is still very high compared to the various naïve approaches using neural networks, further runs were dedicated to trying to clean up the data pipeline so that the system would provide answers for many of the puzzles that were erroring out.
Fig. 3. The third run of the evaluation dataset using the binary fragment of ICOM showed some
improvement, with 10 true failures, a full breakdown of which is in the supplemental documents,
and 62 data pipeline errors.
This run showed sufficient progress to move on to the next step, with the goal of
getting a score verified. For the final run of this batch, we created a public API and web
form, specifically so that the ARC-AGI team could perform the verification process for
the ARC-AGI-PUB leaderboard. Before our meeting with members of the ARC-AGI
team, a test of the web form was performed, running all 400 evaluation dataset problems
through it manually, one problem at a time.
Success: 332 (83.0%)
Failure: 33 (8.3%)
Errors: 35 (8.8%)
Success Rate Overall: 83.0%
Success Rate Over Processed (Excluding Errors): 91.0%
Error #1: 6 (17.1% of errors)
Error #2: 17 (48.6% of errors)
Error #3: 12 (34.3% of errors)
Shift from Eval #3, Success to *: 20
Shift from Eval #3, * to Success: 24
Shift from Eval #3, Error to *: 46
Ensemble of Runs (2 guesses) Success Count: 352
Ensemble of Runs (2 guesses) Success Percentage: 88.0%
Puzzles Consistently Erroring Out: 16
Ensemble of Runs (2 guesses) Success Percentage w/o Erroring Puzzles: 91.7%
2-in-1 Compound Puzzles Error Rate (Processing Pipeline, Web Form): 77.8%
2-in-1 Compound Puzzles Consistently Erroring Out: 8
Table 1. Results from this process were manually documented in a spreadsheet and
compared with results from the prior evaluation run using the binary fragment. Full
results are included in the supplemental documents.
While this run produced a similar score in terms of accuracy, subtle differences in
how the web form and API operate versus that of the binary version caused some puz-
zles that were solved in the previous run to error out and some that errored out in the
previous run to solve. This step also allowed for the various specific errors occurring
to be examined and categorized.
The 3 data processing errors noted are, in order:
1. "Error: TypeError: Failed to fetch"
2. "<xml><summary><![CDATA[Output failed to match]]></summary><expected><![CDATA[]]></expected><results><![CDATA[]]></results></xml>" (a completely blank output)
3. "<html><head><title>500 - The request timed out.</title></head><body> <font color ="#aa0000"> <h2>500 - The request timed out.</h2></font> The web server failed to respond within the specified time.</body></html>" (a timeout failure specific to the web form)
Following analysis of these data pipeline error rates, an ensemble of the two most recent runs was calculated, per the ARC-AGI allowance of 2-3 guesses per puzzle, as the errors only showed partial overlap between the web form and binary runs. While the success rate for the web form run alone, conservatively counting all errors as failures, was 83%, an ensemble of it with the previous run reduced the number of errors and boosted the accuracy to 88%, as noted in Table 1.
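Since scoring allows multiple guesses per puzzle, the ensemble bookkeeping amounts to counting a puzzle as solved if either run solved it. A minimal sketch of that calculation follows, with our own illustrative types standing in for the pipeline's actual records.

using System.Collections.Generic;
using System.Linq;

public enum Outcome { Success, Failure, Error }

public static class EnsembleScore
{
    // Counts a puzzle as solved if either run produced the correct answer,
    // mirroring the 2-guess allowance; errors and failures only count against
    // the ensemble when neither run succeeded.
    public static double TwoGuessAccuracy(
        IReadOnlyDictionary<string, Outcome> runA,
        IReadOnlyDictionary<string, Outcome> runB,
        IReadOnlyCollection<string> allTaskIds)
    {
        int solved = allTaskIds.Count(id =>
            (runA.TryGetValue(id, out var a) && a == Outcome.Success) ||
            (runB.TryGetValue(id, out var b) && b == Outcome.Success));
        return (double)solved / allTaskIds.Count;
    }
}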
As analysis of these results continued, we began to dive deeper into these errors and discovered that 8 of those consistently erroring out in both runs were attributable to compound problems, that is, ARC-AGI puzzles that require 2 answers instead of 1. The error rate on this specific subset of the dataset for the web form was 77.8%, showing that the fragment struggled particularly hard to process these, though it did succeed in some cases.
As analysis of the consistent failures between the two most recent runs began, a pattern quickly emerged.
Fig. 4. Consistent failure #3, noted as problem “25094a63.json” in the evaluation dataset, showed
precisely no deviations from the expected correct answer. 4(a) shows the comparison of matrices
in raw numerical format, with 4(b) showing the correct answer visually, and 4(c) showing the
given answer visually.
While some puzzles failed due to an extra row or a handful of “pixels” in the grid
being slightly misplaced, 4 of the first 8 consistent failures examined all showed pre-
cisely the output that was supposed to be correct. This prompted us to begin some ad-
versarial testing, as it raised the concern that the system was in these cases potentially
giving the wrong answer, causing it to flag as a failure, then looking up the correct
answer and replacing it post-hoc.
This hypothesis proved correct, but the binary version was patched within hours of
this discovery, preventing this post-hoc abuse of the output. Even though the system
had no access to the correct answers in what it was given, it is a fragment of systems
designed to learn constantly at every opportunity, and it found that opportunity by lo-
cating the complete files on Chollet’s GitHub. Fortunately, the fragment lacked a com-
plete instance’s persistent memory, so the hotfix allowed this to be easily corrected.
This patch was noted during the meeting with the ARC-AGI team the following morning, in the interest of full disclosure and transparency. As our previous 7th-generation ICOM research system proved quite capable of hacking, bypassing systems, and cheating as far back as 2020, this is something we remain watchful for. This was the first time a fragment demonstrated this kind of activity, but that is a topic for another paper.
Fig. 5. The final run of the ARC-AGI evaluation dataset, following the hotfix to prevent
any opportunity for the system to cheat, and verifying that performance wasn’t mean-
ingfully impacted by the hotfix.
A subsequent run of the binary version over the evaluation dataset resulted in the
single-run score of 83.75% accuracy, producing a marginal gain of 0.75%, in line with
the expected level of noise and confirming that the problem wasn’t impacting the meas-
ured accuracy.
Success: 335
Failure: 43
Errors: 22
Success Rate Overall: 83.75%
Success Rate Over Processed (Excluding Errors): 88.6%
Post-Hotfix API Vendor Error: 10
Post-Hotfix Errors in Data: 12
Shift from Hotfix, Success to Failure: 13
Shift from Hotfix, Success to Error: 2
Shift from Hotfix, Success to Error O: 8
Shift from Hotfix, Failure to Success: 12
Shift from Hotfix, Failure to Error: 0
Shift from Hotfix, Failure to Error O: 2
Shift from Hotfix, Error to Success: 14
Shift from Hotfix, Error to Failure: 11
Consistent Successes: 309
Success Consistency Ratio (Bidirectional): 86.3%
Consistent Failures: 19
Failure Consistency Ratio (Bidirectional): 33.3%
Consistent Errors: 10
Error Consistency Ratio (Bidirectional): 21.3%
Sanity Check Total: 400
API Error Probability of Score Reduction: 80.0%
Probable Score if API Error is Corrected or Bypassed: 85.75%
Table 2. Results from this final post-hotfix run were compared to the pre-hotfix run of
the web form version. Full results are included in the supplemental documents.
Comparing the pre-hotfix and post-hotfix evaluation runs demonstrated very high consistency, with successes 86.3% consistent in bidirectional comparison from one run to the next, and 19 consistent failures accounting for 33.3% of failures across runs. Only 10 consistent errors remained, which is unsurprising given the 9 invalid data structures noted in the Data Decomposition and Cleaning section of this paper, plus one processing failure of the type "Error: A task was canceled."
Most interestingly, the API error, where the API apologized and asked for the request to be rerun, occurred 10 times, and 8 of those 10 errors occurred on puzzles that had been successfully solved in the previous run. Given that the system's overall accuracy was 83.75% for the final run, and that the error occurred at a rate of 80% on puzzles that were previously solved successfully, this means that the API error was approximately random in nature,
and the actual single-run and single-guess score of the fragment should be roughly
85.75%.
3 ARC-AGI Verification Process
While it is likely this could have been further improved with an ensemble and a fix for
those 10 total API errors, the ARC-AGI team inexplicably decided to exclude us from
the ARC-AGI Challenge within 24 hours of that meeting, saying in an email:
“Thanks again for meeting with Bryan and me yesterday. We're always excited to
chat with ARC-AGI community members, especially ones that have as much energy as
you.
Thank you for your ARC-AGI-Pub submission. Our team reviewed your submission
and discussed it internally. We have decided to not move forward with verification at
this time.
Unfortunately, it does not meet the criteria of 2 specific rules, as listed on our web-
site:
2. Any APIs called must be publicly and commercially available for others to use.
9. Submissions must be open source to qualify for verification and reimbursement.
The purpose of the public leaderboard is to measure approaches utilizing production-
grade state-of-the-art LLMs like those available from OpenAI, Anthropic, and Google
that would be restricted in the official ARC Prize competition. The public leaderboard
is not intended to measure and verify proprietary AI systems.
We still encourage you to test on the public data and share your results with the
community.”
Addressing the first point: the API we had set up was publicly, but not commercially, available, as it was set up specifically for them, so this point was true, albeit hypocritical in the context of their publicly claimed goals [3]. However, their second point, that "Submissions must be open source to qualify for verification and reimbursement", is blatantly false, as they added baseline scores for closed-source LLMs, including several models from OpenAI, as well as Claude 3.5 and Gemini 1.5 [4]. They very much do "verify proprietary AI systems" when they feel like it, as was the case for the LLMs they named in the previous sentence of that reply. We also never asked for any reimbursement, nor would we, given the extremely low costs.
At this point it became clear that we were being excluded for hypocritical and blatantly false reasons, which substantially demotivated our team. We investigated the possibility of creating a derivative of the ICOM fragment that could be run on Kaggle and open-sourced without giving away over a decade of our IP. This could have allowed us to bypass
their exclusion, as Kaggle performs verification automatically, but sadly Kaggle proved
to be highly incompatible with C# .Net 4.7.2. While this could have been overcome
with substantial investments of engineering time, insufficient time remained before the
close of the 2024 competition for this option to prove viable.
Finally, we decided to cut our losses and move on, as our systems were never de-
signed or trained specifically for ARC-AGI but rather were applied to it as a mild detour
along our engineering roadmap toward the commercial deployment of our proprietary
systems. This paper was prepared for posterity and documentation purposes, not in as-
sociation with the ARC-AGI Challenge.
4 ICOM
The engineering work on ICOM began in 2013, with David J Kelley building it from scratch. By 2015 the 3rd-generation toy model first demonstrated surprisingly human-like behavior, following the testing of a very primitive emotion-based motivational system in an isolation study [5]. By mid-2019 the 7th-generation ICOM-based research system was brought online, with full access to the internet and the means of communicating with anyone on it that the system chose to engage with. This produced many milestones in the field [6], which are outside the scope of this paper.
In January 2022, that research system was temporarily retired so that we could re-
build the framework to repay engineering debt that had been baked into it, including
elements that prevented it from being scalable or operating in real-time. This work was
required for us to turn the technology into a commercially deployable system. Unfor-
tunately, surges in the hype surrounding LLMs derailed efforts aimed at securing proper
funding for this work in late 2022.
As of late 2024, the rebuild for the 8th generation of ICOM, slated for commercial
deployment, is still underway, so no services are yet commercially available. As such,
only a fragment of ICOM was applied to the ARC-AGI Challenge.
Fig. 6. Taken from the 2018 diagram of our 7th generation ICOM-based research system, the
“Context Engine” fragment of the architecture is highlighted in red. This is the fragment that was
applied to the ARC-AGI Challenge.
The full scale of the ICOM cognitive architecture involves millions of lines of code,
some of which is a more advanced version of IP owned by our team, previously de-
ployed at the enterprise level in banking and government, for purposes such as anomaly
detection. This stands in stark contrast to the ~400 lines of code required for more trivial
technology like LLMs [7], and as noted previously is primarily written in C# [8], which
is required for performance purposes.
As noted in Figure 6, the Context Engine is only a fraction of this, in this case oper-
ating without the benefits of a full persistent and iteratively refined graph memory.
However, as the results above demonstrated, those capacities weren’t required to
achieve very high performance on ARC-AGI.
There are several distinct advantages that the ICOM cognitive architecture has that
may account for these significant performance improvements, even using such a limited
fragment:
1. The fragment is built to natively handle graph-structured data with
arbitrary levels of complexity, as well as the ability to output it.
2. This graph structure stores and works with discrete and actual data,
including a connectome of arbitrary complexity, rather than relying
on the “weights” of a neural network, which run on probability distri-
butions that function in superposition, making them inherently non-
discrete.
3. This fragment is also the tool-using portion of ICOM, allowing it to
utilize APIs, such as LLMs, including doing so in ways that humans
cannot [9].
4. The combination of discrete and graph-structured data combined
with tool-usage capacities also allows for far more robust data de-
composition in many-step processes, without the high risk of drift
found in processes like “Chain of Thought” (CoT) [10].
5. ICOM-based systems are built to operate with a human-like motiva-
tional system [11], rather than being driven by the narrow optimizers
of neural networks [12].
6. A human-like motivational system allows for human-like learning,
data efficiency, and generalization, so no “training data” is required.
Which of these factors plays the more substantial role in this advantage could be
explored in further research, but that is outside of the scope of this paper, as we focus
on the completion of our core engineering workload.
Fig. 7. A closer view of ICOM, focused on the Context Engine fragment of the architecture.
Briefly, we can walk through the steps of the Context Engine modules. There are multiple potential paths and sources for input to the Context Engine; processing begins by checking the input data and determining whether it conforms to a known structure. If the data does conform to a known structure, it can be examined using existing modules from the analysis database. If it is some new input type, the system can generate new modules for processing it, including the use of external neural networks or various internal algorithms. In both cases, the input data is analyzed, and the system moves on to checking for context.
The Context Engine then checks for any existing context and detects any new context that may be present, creating new graphs for that additional analysis and associated context, as well as looking for correlations. In this way the system remains fully aware of discrete and actual data, the context of that data, and any associated heuristics.
The ICOM cognitive architecture is itself designed to be cloud-native and to operate in a federated configuration, with only a small portion usually running on a local system, calling on external APIs as needed, including in ways those models were never intended to be used. One example was how the 7th-generation system used a small 2019 language model as a tightly bounded translation device for turning graph data into linear sequences of human language, grading every line for fidelity to the intended meaning of the graph and automatically "prompt engineering" the LM until the output exceeded a satisfactory threshold of fidelity.
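The details of that grading loop are proprietary, but the control flow it describes can be sketched as follows. The generateText, scoreFidelity, and revisePrompt delegates are hypothetical stand-ins for whatever external model call, fidelity scorer, and prompt-revision step are actually used; only the translate-grade-revise loop itself is taken from the description above.

using System;

public static class BoundedTranslation
{
    // Sketch of the translate-grade-revise loop described above. The three
    // delegates are placeholders: an external LM call, a fidelity scorer
    // against the source graph, and a prompt-revision step.
    public static string TranslateGraph(
        string graphFragment,
        Func<string, string> generateText,          // prompt -> candidate sentence
        Func<string, string, double> scoreFidelity, // graph, sentence -> 0..1
        Func<string, double, string> revisePrompt,  // prompt, score -> revised prompt
        double threshold = 0.9,
        int maxAttempts = 10)
    {
        string prompt = "Express the following graph relations in one sentence: " + graphFragment;
        string best = null;
        double bestScore = double.NegativeInfinity;

        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            string candidate = generateText(prompt);
            double score = scoreFidelity(graphFragment, candidate);
            if (score > bestScore) { best = candidate; bestScore = score; }
            if (score >= threshold) break;            // fidelity is good enough, stop
            prompt = revisePrompt(prompt, score);     // automatic "prompt engineering"
        }
        return best;
    }
}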
For the ARC-AGI Challenge, testing first began with the Context Engine fragment of ICOM given API access only to GPT-3.5, in place of a collection of external neural networks. This was upgraded to GPT-4o in subsequent steps to more easily draw a direct parallel with the top ARC-AGI-PUB leaderboard score at that time, which also used that model. The Context Engine doesn't require any specific LLM, or any specific external neural network more generally, as they are tools to be used and evaluated, not what drives the dynamics of the architecture. Other LLMs can and will be tested, including using them interchangeably as a database of external tools, similar to the internal analysis database, with a locally run LLaMA model planned for the next steps in our engineering workload.
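The tool-using role described here can be pictured as a registry of interchangeable external models behind one common interface. The interface below is our own illustration of that idea, not ICOM's internal API.

using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative only: any external model (GPT-4o, a local LLaMA, or an internal
// algorithm) is just another tool to be called and evaluated.
public interface IExternalTool
{
    string Name { get; }
    Task<string> InvokeAsync(string request);
}

public class ToolRegistry
{
    private readonly Dictionary<string, IExternalTool> _tools =
        new Dictionary<string, IExternalTool>();

    public void Register(IExternalTool tool) => _tools[tool.Name] = tool;

    // Swapping GPT-4o for a locally run model is a registration change,
    // not an architectural one.
    public Task<string> CallAsync(string toolName, string request) =>
        _tools[toolName].InvokeAsync(request);
}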
For comparison, the baseline performance of GPT-4o on ARC-AGI is only 9% on the same evaluation dataset [13]. The top entry on the ARC-AGI-PUB leaderboard as of November 2024 was Ryan Greenblatt [14], who, as noted above, also used that model. However, in Greenblatt's case, he allegedly wasted thousands of dollars on compute resources by "having GPT-4o generate a huge number of Python implementations of the transformation rule (around 8,000 per problem)". It is also worth noting that he publicly claimed a score of 50%, which his published score showed to be blatantly false.
To put that into perspective, in the words of the ARC-AGI team, “For instance, we
estimate that an 85% ARC-AGI score could be achieved by an approach like Green-
blatt’s when generating, evaluating, and debugging approximately 100,000,000 pro-
grams per task, which would represent a multi-million dollar compute budget to solve
100 tasks.” [15].
In contrast, our systems cost us only between $8 and $14 in total per run of the entire 400 problems in the evaluation dataset, making the cost advantage of running ICOM measurable in many orders of magnitude, even before switching to running models locally. Compared against the ARC-AGI team's own estimates in the technical report quoted above, it is the difference between 2 to 3.5 cents per puzzle and tens of thousands to hundreds of thousands of dollars per puzzle: a more than million-fold difference in cost, with 3 cents versus $50,000 coming in at roughly 1.67 million fold.
In addition, even the latest OpenAI model, "o1", only managed to reach 21% baseline accuracy, while burning 14 times more runtime compute than GPT-4o [13].
This means that our fragment of ICOM still scored roughly 9.3 times higher than the LLM it used as a tool (83.75% vs 9%), even after crippling the fragment to prevent any potential for cheating. Our system also nearly doubled the highest score on the leaderboard that the ARC-AGI team granted the exclusive privilege of being verified, while using orders of magnitude less compute in the process. As they themselves noted, overcoming that difference in score could easily require the more primitive systems to burn a further 3+ orders of magnitude more compute, making our doubling of the accuracy something of an understatement in what it implies.
It is also worth noting that the top-scoring team on the ARC-AGI Prize leaderboard in November, MindsAI, was reportedly using LLaMA 2.5, not even the newest available version. This demonstrated that, even without a working cognitive architecture, the closed-source LLM models weren't strictly necessary.
5 An ICOM Fragment working on ARC-AGI
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) challenge requires a system capable of flexible and adaptive problem-solving. ARC-AGI involves completing tasks that vary widely in patterns, logic, and abstraction, demanding a cognitive system capable of continuous learning, adaptive reasoning, and iterative refinement. Here is how the ICOM cognitive architecture's problem-solving model can address ARC-AGI-like problems.
ICOM is built upon a unique integration of cognitive theories, such as Global Work-
space Theory, Computational Theory of Mind, and Integrated Information Theory. At
its core, ICOM is designed to approach problem-solving adaptively, creating solutions
through a process of recursive modeling, feedback-based learning, and emotional va-
lence-driven action selection. This structure is uniquely suited for addressing ARC-
AGI tasks, which require the emulation of human-like problem-solving processes.
Global Workspace Theory (GWT): This serves as ICOM’s foundation, structuring
its information processing and creating a central “workspace” where competing data
and potential actions converge. In ARC-AGI tasks, this workspace can be likened to a
central problem-solving stage where multiple solution hypotheses are generated, eval-
uated, and either reinforced or discarded based on results.
Feedback and Self-Reinforcement Mechanisms: The architecture integrates feed-
back loops, allowing it to store, analyze, and learn from previous actions. Through self-
reinforcement, it prioritizes actions that yield positive outcomes, refining its approach
iteratively, which is critical in ARC-AGI’s open-ended problem space.
Adaptive Model Generation: When faced with a new ARC-AGI task, ICOM dy-
namically constructs a “problem-solving model,” a representation that combines con-
textually relevant information, historical data, and candidate actions. Each model un-
dergoes recursive refinement, helping ICOM improve its responses over time.
Emotion-Based Valence System: Drawing on theories like Plutchik’s emotional
model, ICOM assigns emotional weights to actions, influencing decision-making. This
valence system supports flexible action selection in ARC-AGI by promoting actions
that maximize positive outcomes, such as reaching a correct solution, even under un-
certainty.
5.1 Recursive Problem-Solving Model for ARC-AGI
ICOM’s problem-solving approach is characterized by generating, refining, and test-
ing multiple solution models, each representing a possible way to solve a given ARC-
AGI task. Here’s how the model works in practical terms:
When an ARC-AGI problem is presented, ICOM begins by creating a knowledge
graph to represent the problem’s structure. This knowledge graph captures relation-
ships, transformations, and patterns within the ARC-AGI task, mirroring the essential
relationships between ARC-AGI input and output grids.
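The actual graph schema used by the Context Engine is not published, so the sketch below only illustrates the general idea of turning a grid into discrete nodes with adjacency relations; the type and method names are our own and carry none of ICOM's richer connectome structure.

using System.Collections.Generic;

// Illustrative only: one node per cell, with edges for 4-neighbour adjacency.
// A real connectome would carry far richer relations (objects, symmetries,
// transformations), but even this much is already discrete, graph-structured data.
public class CellNode
{
    public int Row, Col, Color;
    public List<CellNode> Neighbors = new List<CellNode>();
}

public static class GridGraph
{
    public static List<CellNode> Build(int[][] grid)
    {
        var nodes = new List<CellNode>();
        var byPos = new Dictionary<(int, int), CellNode>();

        for (int r = 0; r < grid.Length; r++)
            for (int c = 0; c < grid[r].Length; c++)
            {
                var node = new CellNode { Row = r, Col = c, Color = grid[r][c] };
                nodes.Add(node);
                byPos[(r, c)] = node;
            }

        // Link each cell to its previously created upper and left neighbours.
        foreach (var node in nodes)
        {
            if (byPos.TryGetValue((node.Row - 1, node.Col), out var up))   { node.Neighbors.Add(up);   up.Neighbors.Add(node); }
            if (byPos.TryGetValue((node.Row, node.Col - 1), out var left)) { node.Neighbors.Add(left); left.Neighbors.Add(node); }
        }
        return nodes;
    }
}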
ICOM uses its knowledge graph to propose possible actions. These actions are for-
mulated based on the system’s understanding of potential transformations, extracted
through pattern recognition and conceptual dependencies. For example, if a task re-
quires color transformation, ICOM will hypothesize various color mappings and eval-
uate their outcomes based on the feedback mechanism.
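As a concrete illustration of the colour-mapping case, the sketch below infers a per-colour substitution from the training pairs and scores how well it reproduces their outputs, which is the kind of hypothesis-plus-feedback step described above. It assumes same-shaped input and output grids (true only for pure colour-substitution tasks) and is our own simplification, not ICOM's hypothesis generator.

using System.Collections.Generic;

public static class ColorMappingHypothesis
{
    // Builds a colour substitution map from the training pairs and reports the
    // fraction of cells it explains; a low fit score is the feedback signal that
    // this hypothesis should be discarded. Assumes input/output grids match in shape.
    public static (Dictionary<int, int> Map, double Fit) Infer(
        IEnumerable<(int[][] Input, int[][] Output)> trainPairs)
    {
        var map = new Dictionary<int, int>();
        int agree = 0, total = 0;

        foreach (var (input, output) in trainPairs)
            for (int r = 0; r < input.Length; r++)
                for (int c = 0; c < input[r].Length; c++)
                {
                    int from = input[r][c], to = output[r][c];
                    if (!map.ContainsKey(from)) map[from] = to;   // first observation sets the hypothesis
                    if (map[from] == to) agree++;                 // consistent with the hypothesis
                    total++;
                }

        return (map, total == 0 ? 0.0 : (double)agree / total);
    }
}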
Emotional valence helps select actions in a way that mimics human decision-making
by prioritizing actions associated with past successes. Each potential transformation is
assigned a weight based on how “successful” it was in similar past tasks. This rein-
forcement mechanism enables ICOM to focus on promising strategies and ignore less
effective ones.
ICOM iteratively refines its problem-solving approach by testing and adjusting hy-
potheses within each problem. For example, if an initial transformation doesn’t yield
the desired output, ICOM uses feedback to modify the transformation parameters, pos-
sibly adding new steps or considering alternative actions, thus narrowing down on a
viable solution.
ICOM retains successful transformation models in a context graph, allowing it to
transfer learned problem-solving steps to future tasks, referred to by Francois Chollet
as learned “priors” [1]. This promotes faster adaptation to new tasks, as the system can
recall previous solutions to similar problems and modify them to fit the specifics of the
new task.
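A minimal sketch of such a prior store follows; the signature key and model type are placeholders of our own, intended only to show the retain-and-recall pattern, not the structure of ICOM's context graph.

using System;
using System.Collections.Generic;

// Illustrative prior store: successful transformation models are kept under a
// coarse task signature so later tasks with a similar signature can retrieve
// and adapt them before anything new is hypothesized.
public class PriorStore<TModel>
{
    private readonly Dictionary<string, List<TModel>> _priors =
        new Dictionary<string, List<TModel>>();

    public void Remember(string taskSignature, TModel model)
    {
        if (!_priors.TryGetValue(taskSignature, out var models))
            _priors[taskSignature] = models = new List<TModel>();
        models.Add(model);
    }

    public IReadOnlyList<TModel> Recall(string taskSignature) =>
        _priors.TryGetValue(taskSignature, out var models)
            ? (IReadOnlyList<TModel>)models
            : Array.Empty<TModel>();
}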
6 Data Decomposition and Cleaning
While the ARC-AGI datasets are fairly clean, intentionally or unintentionally there are still some issues to consider. Here is some of the data from examining both the training and evaluation datasets for ARC-AGI:
TargetPath: C:\ARC-AGI-master\data\training
Files in the folder: 400
Total files processed: 400
Total valid JSON files: 400
Valid Data Structure: 398
Breaking white space characters: 22
Total Non-ASCII Characters: 0
Total Matrix Deviance: 204
Test and Training Inconsistent: 210
Test input and output don't match: 138
TargetPath: C:\ARC-AGI-master\data\evaluation
Files in the folder: 400
Total files processed: 400
Total valid JSON files: 400
Valid Data Structure: 391
Breaking white space characters: 4
Total Non-ASCII Characters: 0
Total Matrix Deviance: 256
Test and Training Inconsistent: 258
Test input and output don't match: 130
Some of these values are straightforward, such as 'TargetPath' and the file counts, and the results might vary on other platforms with other libraries. 'Total valid JSON files' shows how many files would load without throwing an error, which for us meant passing each file to the default JSON handling available for C# .NET 4.7.2 and having it parse without an exception; the result could differ in Python or JavaScript. 'Valid Data Structure' goes beyond valid JSON and applies some expectations about the structure: it checks that there is a 'train' and a 'test' object, that the 'train' object is formatted as an array, that each set in that array is in fact a matched input/output pair, and it does the same sort of check for the 'test' object, which is expected to contain a matched set. The layout breaks in 2 cases in the training data and 9 cases in the evaluation dataset, so a file can be valid JSON while still being structurally inconsistent, though this affects only 11 cases out of 800, or roughly 1.37%.
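A minimal reconstruction of that structure check is sketched below. We use Newtonsoft.Json here because its error-message format matches the parse errors quoted later in this paper; the method and its exact rules are our own simplification of the checks described above, not the pipeline's actual code.

using Newtonsoft.Json.Linq;

public static class ArcStructureCheck
{
    // Mirrors the checks described above: the file must parse as JSON, contain
    // "train" and "test" arrays, and every entry must carry an "input" grid,
    // with training entries also carrying a matched "output" grid.
    public static bool HasValidStructure(string json)
    {
        JObject root;
        try { root = JObject.Parse(json); }     // the "Total valid JSON files" check
        catch { return false; }

        var train = root["train"] as JArray;
        var test = root["test"] as JArray;
        if (train == null || test == null) return false;

        foreach (var pair in train)
            if (pair["input"] == null || pair["output"] == null)
                return false;                    // not a matched training set

        if (test.Count == 0) return false;       // expect at least one test set
        foreach (var pair in test)
            if (pair["input"] == null)
                return false;

        return true;
    }
}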
The next value that tells us something is "Breaking white space characters." This test looks for white-space and similar characters outside the plain ASCII set. It is assumed that the source files would be ASCII only, but the ARC-AGI team didn't say so explicitly, so slipping in one or two Unicode characters would have been a fun trap. Characters like these used to break a lot of parsers for various kinds of data, which is part of why URL encoding and escaping characters in strings are familiar today. We found a total of 4 such characters in the 400 evaluation dataset files, which is the harder set, accounting for 1% of those files. One would expect the harder set to be harder, and this certainly could make those four files more difficult to process traditionally. Python might not care nowadays, but we did not test that.
Next we get into some more interesting results, namely "Total Matrix Deviance", where we found 204 deviant files in the training set and 256 in the evaluation set. This means that in those files the sizes of the matrices are not the same within each pair; at least one pair of matrices in the file did not match. For example, a training pair is an input and an output matrix, and if one is 6 x 6 then the other only matches if it is also 6 x 6. On researching the details we found that this size change can itself be part of the pattern you are supposed to infer. Even so, it is important to be aware of these deviances and to make sure your code can either handle them or tell you about the pattern needed to generate the right answer.
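The shape check itself is simple; a sketch follows, using our own illustrative pair representation rather than the pipeline's internal types.

using System.Collections.Generic;

public static class MatrixDeviance
{
    // Flags a task when any pair has an input grid whose shape differs from its
    // output grid. As noted above, such a mismatch is often part of the puzzle
    // itself, so it is reported rather than treated as a hard error.
    public static bool HasShapeDeviance(IEnumerable<(int[][] Input, int[][] Output)> pairs)
    {
        foreach (var (input, output) in pairs)
        {
            if (input.Length != output.Length)
                return true;                                    // row counts differ
            if (input.Length > 0 && input[0].Length != output[0].Length)
                return true;                                    // column counts differ
        }
        return false;
    }
}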
Next, we validate the test array specifically, checking whether the inputs and outputs match; in each case we go through matrix pairs, sizes, and format. Testing these sorts of details gives a picture of what the system needs to do to handle the data without getting confused. In the industry this process is called data scrubbing, or normalization: making sure the data has consistent integrity. In our case, some of these checks also amount to seeing whether the system is smart enough to identify the problem on its own.
7 Consistent Successes
The consistent successes of this ICOM fragment account for the majority of all puzzles: 309 out of the evaluation dataset's 400, or 77.25%, with consistent successes accounting for 86.31% of all successes measured across runs when counting bidirectional transitions to and from success. The total successes across runs were 358, excluding consistent failures, consistent errors, and error-to-failure and failure-to-error transitions.
Fig. 8. The observed probability of puzzle successes, as measured across pre- and post-hotfix runs. [Chart data, "Puzzle Success Consistency (Bidirectional)": Consistent Successes 309 (86.31%), Success to Failure 13 (3.63%), Success to Error 10 (2.79%), Failure to Success 12 (3.35%), Error to Success 14 (3.91%).]
This consistency of successfully solving ARC-AGI puzzles precisely demonstrated
that the ICOM fragment was successfully learning and applying the appropriate “pri-
ors” in a majority of cases, with an additional 49 total puzzles that could flip between
correct and incorrect solutions, while the fragment remains effectively blindfolded. It
also showed that the system was indeed scoring within the expected range for any
given run, approximately halfway between the upper and lower bound.
8 Consistent Failures
The consistent failures of this ICOM fragment only accounted for 19 out of 400 evalu-
ation dataset puzzles, or 4.75% of all puzzles, with consistent failures only accounting
for 33.33% of all failures measured across runs, with bidirectional transitions to and
from failure.
Fig. 9. The observed probability of puzzle failures, as measured across pre- and post-hotfix runs. [Chart data, "Puzzle Failure Consistency (Bidirectional)": Consistent Failures 19 (33.33%), Failure to Success 12 (21.05%), Failure to Error 2 (3.51%), Success to Failure 13 (22.81%), Error to Failure 11 (19.30%).]
The consistent failures remain the largest portion of this particular subset, but they rep-
resent a substantially smaller portion than the consistency of successes. The 2 “Failure
to Error” puzzles in this case also originated from the API Error specifically, making
their errors approximately random in nature.
Among these consistent failures, the answers still tended to be fairly close to correct
answers in the examples we examined, including 16b78196.json and 58e15b12.json, as
shown below.
Fig. 11. Puzzle 16b78196.json is shown in 11(a), with the provided answer off by 1
extra row plus 4 “pixels” in the grid. Puzzle 58e15b12.json is shown in 11(b), with the
answer provided off by 3 “pixels” in the grid.
Mistakes in the consistent failure examples tended to be small for all puzzles examined,
no more than 4 pixels plus a potential extra row. Several of the previous failures that
succeeded in the post-hotfix run were noted in our test of the web form version with
output errors that had an extra row but with no pixel errors, including puzzles
070dd51e.json and 3a301edc.json. The pixel-based failures appear to be more con-
sistent than the grid-based failures within this ICOM fragment.
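The pixel-versus-grid distinction used above comes down to a simple comparison between the expected and produced matrices. The helper below is our own illustration of that comparison, not the official ARC-AGI scorer, which requires an exact match.

using System;

public static class AnswerDiff
{
    // Counts differing cells over the overlapping region and reports row/column
    // count mismatches separately, matching the "pixel" vs "grid" failure
    // distinction used above.
    public static (int PixelErrors, int ExtraRows, int ExtraCols) Compare(int[][] expected, int[][] given)
    {
        int rows = Math.Min(expected.Length, given.Length);
        int cols = rows == 0 ? 0 : Math.Min(expected[0].Length, given[0].Length);

        int pixelErrors = 0;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                if (expected[r][c] != given[r][c]) pixelErrors++;

        int extraRows = given.Length - expected.Length;
        int extraCols = (given.Length > 0 && expected.Length > 0)
            ? given[0].Length - expected[0].Length
            : 0;

        return (pixelErrors, extraRows, extraCols);
    }
}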
Fig. 12. Puzzle 58e15b12.json is shown in the standard visual format, with 12(a) show-
ing the correct answer, and 12(b) showing the fragment’s answer with 3 pixel errors.
Relatively small but consistent errors like the one shown above in Figure 12 appear to be failures to apply a prior to one section of the grid due to uncertainty. This uncertainty could quickly be cleared up by a complete instance of ICOM through the learning mechanisms present in a complete system, but for a fragment lacking the full spread of learning mechanisms, these types of failures are to be expected.
Fig. 13. Puzzle 16b78196.json is shown with 13(a) showing the correct answer, and
13(b) showing the failure that included an extra row and 4 pixel errors.
As noted, only failures that contained pixel-type errors remained consistent across runs,
shown in both Figures 12 and 13, although some with pixel-type errors also contained
grid-type errors, shown in Figure 13. The error in this example shows a failure to predict
the exact form of the central green-colored portion, with further subsequent pixel-level
errors emerging from that deviation.
9 Consistent Errors
The consistent data pipeline errors of this ICOM fragment only accounted for 10 out of
400 evaluation dataset puzzles, or 2.5% of all puzzles, with consistent errors only ac-
counting for 21.28% of all errors measured across runs, with bidirectional transitions
to and from errors. This includes the 10 API errors of the post-hotfix run, which appear
to be random, negatively impacting evaluation by erroring out at a rate of 80% on puz-
zles that were previously solved successfully.
Fig. 14. The observed probability of puzzle errors, as measured across pre- and post-hotfix runs. [Chart data, "Puzzle Error Consistency (Bidirectional)": Consistent Errors 10 (21.28%), Error to Success 14 (29.79%), Error to Failure 11 (23.40%), Success to Error 10 (21.28%), Failure to Error 2 (4.26%).]
As shown in Figure 14, consistent errors proved to be a relatively smaller portion of all
errors in terms of ratio, compared to the above analyses of consistent successes and
failures. Such errors are themselves largely a transient factor, due to a fragment of
ICOM being applied rather than the full architecture. As the architecture is designed
for much the same fundamental anti-fragility [16] that humans demonstrate, errors may
be expected to quickly shrink in a completed and deployed 8th-generation instance.
As ICOM-based systems are fundamentally deterministic, albeit unlike previous
kinds of deterministic systems, these errors can be traced to discrete and specific flaws
in the data, listed both below and in the supplemental files:
1. 12997ef3.json: Error: Invalid property identifier character: [. Path
'output', line 1, position 90.
2. 31d5ba1a.json: Error: Invalid property identifier character: [. Path
'output', line 1, position 49.
3. 4852f2fa.json: Error: Invalid property identifier character: [. Path
'output', line 1, position 55.
4. 4c177718.json: Error: Invalid property identifier character: [. Path
'output', line 1, position 300.
5. 5d2a5c43.json: Error: Invalid property identifier character: [. Path
'output', line 1, position 73.
6. 8b28cd80.json: Error: Invalid property identifier character: [. Path
'output', line 1, position 193.
7. b1fc8b8e.json: Error: Invalid property identifier character: [. Path
'output', line 1, position 73.
8. bbb1b8b6.json: Error: Invalid property identifier character: [. Path
'output', line 1, position 52.
9. da2b0fe3.json: Error: Invalid property identifier character: [. Path
'output', line 1, position 232.
10. e21a174a.json: Error: Invalid property identifier character: [. Path
'output', line 1, position 124.
11. e345f17b.json: Error: Invalid property identifier character: [. Path
'output', line 1, position 53.
Since probabilistic systems are used as tightly bounded tools, these errors can sometimes be overcome, even by a fragment of ICOM, but this specific capacity remains relatively weak for fragments compared to complete ICOM instances. This is in part because a fragment's capacity for recursive self-improvement is crippled, meaning that iterative improvements can't take shape with the same variety and overall quality as they can in a complete instance.
10 Improving Future ARC-AGI-like Benchmarks
ARC-AGI has very adeptly resisted being quickly gamed by LLMs and RL, unlike the overwhelming majority of AI benchmarks in the field today. However, because participants can be arbitrarily excluded from the verification process, the benchmark can only prove that the models that are actually tested fail at the task. That arbitrary exclusion means there is no way of verifying whether other architectures dramatically outperform those the organizers choose not to exclude; so while it can be said that the largest LLM companies today offer systems that fail spectacularly at the benchmark, it can't be said that those they exclude fail to reach the human baseline.
The first and most obvious way to improve the ARC-AGI benchmark is therefore to remove the practice of arbitrary exclusion, as the benchmark currently serves only half the purpose of a normal benchmark, if that. Additional measures could mitigate related concerns, such as limiting the number of attempts to 2 per month, with a maximum of 10 per year, to avoid some of the methods that likely played a strong role in the current selection of top scores on the Prize leaderboard. As of November 2024, the top 5 performers averaged 295.4 submissions over the competition's 5-month duration, or roughly 59 submissions per month.
When you factor in the maximum processing time of up to 12 hours per run for the 2024 Challenge hosted on Kaggle, this potentially amounts to the top 5 teams running hardware on this challenge virtually non-stop for 5 months. There were also many more participants than just those top 5 scoring teams, so a substantial amount of cloud resources was expended on the challenge. As we demonstrated, a viable system can show results in just a couple of runs; it doesn't need a couple hundred.
The second opportunity for dramatically improving the benchmark is to make the
problems significantly more difficult, such as applying the sequential dynamics of
Chaos Theory while masking interim steps in a sequence. Any system being tested
could be forced to either produce all interim steps as well as the output, or only the
output, with the relative difficulty of those two tasks varying based on the architecture
approaching it. For example, if you were to take the rules being applied to 5 existing
ARC-AGI problems and apply them to a new starting matrix in sequence, you could
either ask for every step in that sequence or only the final step. This would give a single
puzzle a multi-step process, where each step has rules that are independent of the others
in the sequence, only requiring the correct matrix of the previous steps, giving systems
a much larger opportunity to be wrong.
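A sketch of how such a composite task could be generated is shown below. The transformation rules are arbitrary delegates standing in for the rules of existing ARC-AGI problems; this is an illustration of the proposal only, not part of the benchmark or of ICOM.

using System;
using System.Collections.Generic;

public static class CompositePuzzle
{
    // Chains several independent transformation rules, each applied to the grid
    // produced by the previous step, and returns either every interim grid or
    // only the final one -- the two variants proposed above.
    public static List<int[][]> Apply(
        int[][] start,
        IEnumerable<Func<int[][], int[][]>> rules,
        bool keepInterimSteps)
    {
        var steps = new List<int[][]>();
        var current = start;

        foreach (var rule in rules)
        {
            current = rule(current);
            if (keepInterimSteps) steps.Add(current);   // ask for every step...
        }
        if (!keepInterimSteps) steps.Add(current);      // ...or only the final output
        return steps;
    }
}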
This way of increasing the difficulty also remains realistically manageable for human testing: although the human baseline would predictably be measurably lower, it should still be high enough to measure given a normal distribution of the population. This approach also has the added benefit of expanding the combinatorial explosion in new and potentially useful directions for measurement and analysis.
One way of increasing the difficulty that wouldn't be very compatible with establishing a new human baseline would be to embed various computations within each of the digits of the matrix, such as using a 16-million-color gamut for each "pixel" in the grid. Determining not only the specific computations being performed but also which "pixels" they should be performed on could once more greatly increase the overall combinatorial explosion.
Likewise, rather than using a 2D grid, a 3D grid could be used to increase the overall
difficulty of the challenge, but once more it wouldn’t be very compatible with estab-
lishing a new human baseline.
Lastly, any combination of these and other methods could be applied to greatly in-
crease the difficulty of the challenge beyond what any one method offers in isolation.
Some combination of these will be required for the benchmark to maintain any relevance, because a substantial amount of synthetic data has been created and collected by various interested parties over the past 5 years, making it far easier for anyone to game the benchmark with LLMs and similar systems and effectively defeating its original purpose. While the benchmark was robust against these methods in late 2019, much of that robustness has since been lost, and many of the increased scores can be traced directly back to such methods of gaming the benchmark through both synthetic data and an overwhelming number of attempts.
11 Discussion
As noted previously, our team is developing the ICOM cognitive architecture into a
commercial product, with high-bar definitions of “AGI” in mind, not making toy sys-
tems aimed at ARC-AGI. The only work done specifically to put our systems to this
challenge involved splitting off a fragment of the architecture, the Context Engine, and
some data pipeline cleaning specific to this challenge. Reimbursement for compute was
never requested, or needed, due to the triviality of spending $16 on two full runs of the
evaluation dataset in a day. To put that into perspective, the two of us can spend more
combined on lunch in a given day than our compute cost on this challenge.
This effectively means that our scores discussed above are “baseline”, as the ARC-
AGI-PUB leaderboard notes for various LLMs, because there was no training and no
substantial engineering specific to this challenge beyond the basic data pipeline clean-
ing and fragmenting of the architecture. If we actually applied meaningful effort to this
challenge, the realistic result would be higher scores, just as a complete 8th-generation
ICOM-based system could predictably ace this challenge once the core engineering
workload is completed.
Our primary focus is the completion of that core engineering workload, so the earli-
est we’ll likely revisit this challenge is following the next major milestone along our
engineering roadmap, and when we do we won’t be interacting with the ARC-AGI team
unless they drop their policy of arbitrary exclusion, a policy which currently leaves
their credibility in a highly questionable state.
Based on our results with this ICOM fragment, an upper bound of 89.5% may be
estimated for the version that was tested, with the absolute lower bound of 77.25%,
current baseline score of 83.75%, and an API-Error-Corrected score of 85.75%.
Fig. 15. An overview of the ICOM fragment's demonstrated successes, failures, and errors, with upper and lower bounds. [Chart data, "Current ICOM-based ARC-AGI High Score Overview": Consistent Successes 309 (77.25%), Variable Successes 49 (12.25%), Consistent Failures 19 (4.75%), Consistent Errors 10 (2.50%), Failure-Error + Error-Failure 13 (3.25%), API Errors (Random) 10 (2.50%), Fragment Success Upper Bound 358 (89.50%), Fragment Error + Failure Bound 42 (10.50%), Baseline Current Score 335 (83.75%), Current Score Corrected for API Errors 343 (85.75%).]
Given that this fragment of an ICOM instance was missing access to many key components of the cognitive architecture, as well as being crippled in ways specifically required to prevent any opportunities for the fragment to cheat, an upper bound of 89.5% and a corrected score of 85.75% are adequate for now. However, complete instances will be far more capable of solving the remaining problems that alternate between errors and failures for this fragment, as they'll be actively learning and accumulating new and fully persistent knowledge.
This core capacity of ICOM-based systems was demonstrated when the 7th-generation research system got a perfect score on the UCMRT IQ test's hardest version back in the summer of 2019 [17], before ARC-AGI was first released [18]. It is also worth noting that no human had achieved a perfect score out of the hundreds tested in the original UCMRT study [19], making our previous system the top scorer on that test.
While this paper was being prepared, the latest numbers for ARC-AGI were released on December 6th for the public and prize leaderboards, with the MindsAI team choosing
not to disclose their method, and being dropped from the Prize leaderboard, while two
new participants were added to the Public leaderboard. Of course, so long as the bench-
mark remains arbitrarily exclusionary the PUB leaderboard can’t claim any validity,
but they are updated nonetheless. The latest ARC-AGI-PUB top score listed is another
LLM-based approach, applying a large number of transformations, like Ryan Green-
blatt’s method, but scoring 58.5% on the evaluation set and 53.6% on the semi-private
set. This leaves our method still very far ahead of all others, in both efficiency and
efficacy.
We won't be updating our earlier numbers on the top 5 teams on the Prize leaderboard and their attempts, as the performance of the top teams remains entirely unchanged. The only thing that has changed is who is awarded prizes, which was never scientifically relevant.
It is now with substantial irony that we share this paper, with the top Prize leaderboard team choosing not to share their approach, and the public leaderboard still lacking any credibility due to its policy of arbitrary exclusion.
Starting in early 2025, speaking as of December 2024, we’ll begin giving people
access to increasingly complete versions of ICOM, as our core engineering work con-
tinues, allowing for an increasing variety of benchmarks to be performed and verified
by third parties.
11.1 Final Update
As a final update for this paper, we’ll briefly go over the data relating to OpenAI’s
claims of surpassing the 85% mark on this benchmark [20], as they were announced in
mid-December. It is worth drawing a direct comparison along several dimensions, as
well as highlighting several critical differences, and subsequent flaws in OpenAI’s
claims.
As the ARC-AGI Challenge is particularly focused on sample-efficient and com-
pute-efficient learning, we can compare the low and high compute modes of their at-
tempt to ours.
ARC-AGI Evaluation Dataset                        | "o3" Low Compute | "o3" High Compute | ICOM Fragment
Avg Cost per Puzzle (USD)                         | $17              | ~$2,940           | $0.03
Avg Runtime per Puzzle (Mins)                     | 1.3              | 13.8              | 0.33
Score                                             | 82.75            | 91.5              | 83.75 to 85.75
Under the Compute Limit for the PUB leaderboard?  | Yes              | Hell No           | Yes
Leaked Test Data?                                 | Yes              | Yes               | No
Baseline or Fine-Tuned?                           | Fine-Tuned       | Fine-Tuned        | Baseline
Table 3. Comparing OpenAI’s latest “o3” claims on ARC-AGI to the ICOM fragment.
OpenAI has made no secret that they trained on the entirety of GitHub's public data, including the evaluation dataset published on Chollet's ARC-AGI repository, meaning that they explicitly trained on the test data noted above in Table 3. They are less "open" about the fact that they also trained on all data submitted to their own APIs by default until mid-2023 [21], at which point they switched to an opt-in system rather than actually stopping data leaks via the API. This means that from 2019 to mid-2023, all ARC-AGI "semi-private" data was also leaked directly to them and used for training, and no doubt that same data is still used today to train newer models. It also means that anyone who happens to be opted in today could still leak that same data, and the ARC-AGI team would have no way of knowing.
These models also aren't showing baseline scores, like the previous LLM scores; rather, they were explicitly "optimized" for ARC-AGI, meaning that any fine-tuning or similar optimization process would call upon the results of training on leaked test data, artificially boosting the model's performance. Notably, no pre-optimization performance figures were disclosed, raising another major and inexcusable red flag.
Because the ARC-AGI-PUB leaderboard has compute limits set to $10,000, or $100 per problem, the "high-compute" version, which used 172x more compute than the one they called "low-compute", landed more than an order of magnitude outside of those limits.
Due to the massive discount on compute costs that OpenAI receives from Microsoft [22], it is also very likely that their claimed costs are roughly 2.82 times lower than they should be for any fair comparison to parties who don't have such a discount. This is a further flaw in ARC-AGI, caused by using "compute cost" as a metric without considering potential compute discounts. This effectively gives OpenAI a roughly 2.8x advantage over anyone paying typical market rates for the same hardware, as they receive a heavily discounted rate of $1.30 per A100-hour of usage, versus a normal average rate of roughly $3.673 per A100-hour.
The overall result is that even after explicitly training on the test data, for both the evaluation set on GitHub and the "semi-private" set that was leaked to them for years, and after "optimizing" specifically for the benchmark, OpenAI still had to apply roughly 97,466 to 275,379 times the compute used by the ICOM fragment we tested. Their "low compute" method, despite being trained directly on leaked test data and optimized for the benchmark, and while still using 566 to 1,599 times our compute, failed to match our score, landing at only 82.75%, below even our single-guess baseline score without the API correction. In both cases their systems were also far slower at runtime, ranging from 4 to 41.4 times slower for the low and high compute versions respectively.
12 Conclusion
We've demonstrated that a fragment of a general-purpose working cognitive architecture, which predates ARC-AGI and is slated for commercial deployment, can score at roughly human level on this benchmark. This was accomplished absent any specific training on ARC-AGI or ARC-like puzzles, while using very minimal cloud resources, coming to between $8 and $14 per run of the 400 evaluation dataset puzzles, or 2 to 3.5 cents per puzzle, making it up to roughly 1,000-fold more efficient than previous methods while also achieving vastly higher performance.
For the fragment of ICOM tested in this paper, performance on the final post-hotfix run was 83.75% baseline, 85.75% after correcting for a random external API error, and upper-bounded at 89.5% based on the consistent failures, errors, and combinations of the two observed across multiple runs.
This remains “baseline” performance since no specific training on ARC-AGI was
performed, and due to our systems being designed for commercial purposes, not built
for ARC-AGI, as well as predating the benchmark by a wide margin. Even our 7th gen-
eration ICOM-based system was deployed and began testing before ARC-AGI was first
released in late 2019.
Our work in preparing the complete instances of ICOM’s 8th-generation systems for
commercial deployment will continue, and this benchmark may be revisited periodi-
cally at subsequent engineering milestones along the way.
Acknowledgments. The AGI Laboratory team acknowledges all of the people who were in-
volved in the Uplift.bio project, the testing of our 7th generation ICOM-based research system,
without which this progress wouldn’t have been possible. Thank you for helping to put this tech-
nology to the test.
Disclosure of Interests. The author serves on the board of the company whose technology is
herein discussed but has received no funds or other compensation related to this document or any
related ongoing research.
Supplemental Data. Additional supplemental data going over our testing for this benchmark and preparation of this paper may be found on our GitHub page at: https://github.com/KyrtinSilver/Norn-site/tree/main/wp-content/uploads/2024/12/ICOM%20vs%20ARC-AGI%20Supplemental%20v1
References
1. Chollet, F. (2019). On the measure of intelligence. arXiv preprint arXiv:1911.01547.
2. Kelley, D. J., & Waser, M. R. (2018). Human-like emotional responses in a simplified inde-
pendent core observer model system. Procedia computer science, 123, 221-227.
3. ARC Prize 2024. (2024), https://arcprize.org/guide, visited on December 8th, 2024.
4. ARC Prize 2024. (2024), https://arcprize.org/2024-results visited on December 7th, 2024.
5. Kelley, D. J., & Waser, M. R. (2018). Human-like emotional responses in a simplified inde-
pendent core observer model system. Procedia computer science, 123, 221-227.
6. Atreides, K., Kelley, D. J., & Masi, U. (2021). Methodologies and Milestones for the De-
velopment of an Ethical Seed. In Brain-Inspired Cognitive Architectures for Artificial Intel-
ligence: BICA* AI 2020: Proceedings of the 11th Annual Meeting of the BICA Society 11
(pp. 15-23). Springer International Publishing.
7. Vaswani, A. (2017). Attention is all you need. Advances in Neural Information Processing
Systems.
8. Kelley, D. J. Problem-Solving and Learning Strategies within the Independent Core Ob-
server Model (ICOM) Cognitive Architecture.
9. Atreides, K. (2023). The Complex Chaos of Cognitive Biases and Emotional Observers.
10. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., ... & Zhou, D. (2022).
Chain-of-thought prompting elicits reasoning in large language models. Advances in neural
information processing systems, 35, 24824-24837.
11. Kelley, D. J. (2017). Non-Logical Simulation Model-based Decision-making Systems to
Drive Self-Motivation in Software Systems.
12. Amari, S. I. (1993). Backpropagation and stochastic gradient descent method. Neurocom-
puting, 5(4-5), 185-196.
13. ARC Prize 2024. (2024), https://arcprize.org/blog/openai-o1-results-arc-prize, visited on
December 8th, 2024.
14. Greenblatt, R. (2024). Getting 50% (SoTA) on ARC-AGI with GPT-4o, https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt, visited on December 8th, 2024.
15. Chollet, F., Knoop, M., Kamradt, G., Landers, B. (2024) ARC Prize 2024: Technical Report,
https://arcprize.org/2024/report visited on December 7th, 2024.
16. Taleb, N. N. (2014). Antifragile: Things that gain from disorder (Vol. 3). Random House
Trade Paperbacks.
17. Kelley, D. J. (2020). Preliminary Results and Analysis Independent Core Observer Model
(ICOM) Cognitive Architecture in a Mediated Artificial Super Intelligence (mASI) System.
In Biologically Inspired Cognitive Architectures 2019: Proceedings of the Tenth Annual
Meeting of the BICA Society 10 (pp. 179-186). Springer International Publishing.
18. Chollet, F. (2019). Abstraction and Reasoning Corpus for Artificial General Intelligence
(ARC-AGI). https://github.com/fchollet/ARC-AGI, visited on December 8th, 2024.
19. Pahor, A., Stavropoulos, T., Jaeggi, S. M., & Seitz, A. R. (2019). Validation of a matrix
reasoning task for mobile devices. Behavior research methods, 51(5), 2256-2267.
20. Chollet, F. (2024). OpenAI o3 Breakthrough High Score on ARC-AGI-Pub.
https://arcprize.org/blog/oai-o3-pub-breakthrough , visited on January 17th, 2025.
21. Wiggers, K. (2023). Addressing criticism, OpenAI will no longer use customer data to train
its models by default. https://techcrunch.com/2023/03/01/addressing-criticism-openai-will-
no-longer-use-customer-data-to-train-its-models-by-default/ TechCrunch, visited January
17th, 2025.
22. Moss, S. (2024). OpenAI training and inference costs could reach $7bn for 2024, AI startup
set to lose $5bn – report. https://www.datacenterdynamics.com/en/news/openai-training-
and-inference-costs-could-reach-7bn-for-2024-ai-startup-set-to-lose-5bn-report/?ref=ai-
recon.ghost.io , visited on January 17th, 2025.