Consciousness in Artificial Systems:
Bridging Global Workspace and Sensorimotor Theory in In-Silico Models
Nicolas Kuske1,2 & Rufin VanRullen1
1) CerCo, CNRS UMR5549, Artificial and Natural Intelligence Toulouse Institute
Université de Toulouse. 2) Faculty of Computer Science, Chemnitz University of Technology
Abstract
In the wake of the success of attention-based transformer networks, the debate over the potential and
role of consciousness in artificial systems has intensified. Prominently, the Global Neuronal Workspace
Theory emerges as a front-runner in the endeavor to model consciousness in computational terms. A
recent advance toward mapping the theory onto state-of-the-art machine learning tools is the Global
Latent Workspace model. It introduces a central latent representation around which
multiple modules are constructed. Leveraging dedicated encoder-decoder structures, content from the
central representation or any individual module, integrated via the latent space, can be translated to any
other module and back with minimal loss. This paper presents a thought experiment involving a minimal
setup with one deep sensory and one deep motor module, which illustrates the emergence of “globally”
accessible sensorimotor representations in the central latent space connecting both modules. In the human
brain, neuronally enacted knowledge of laws relating changes in sensory information to changes in motor
output, or to the corresponding efference copy information, has been proposed to constitute the biological
correlates of phenomenal conscious experience. The underlying Sensorimotor Contingency Theory
encompasses a rich mathematical framework. Yet, the implementation of intelligent systems based on this
framework has thus far been confined to proof-of-concept and basic prototype applications. Here, the
natural appearance of global latent sensorimotor representations links two major neuroscientific theories
of consciousness in a powerful machine learning setup. A remaining question is whether this artificial
system is conscious.
Keywords: Consciousness, Global Latent Workspace, Sensorimotor Representations, Deep Learning
1. Introduction
Is there a “ghost in the machine”—a form of awareness within artificial systems? This question has
fascinated humanity since the development of the first automata, dating back at least to Hellenistic steam-
powered pigeons. The notion gained popularity with later mechanical devices, often resembling animals
or humans, that appeared at royal courts in the 18th century (Price and Bedini, 1964). Indeed, the
possibility of (phenomenally) conscious machines has been explored in the arts across cultures (Bowen et
al., 2019). However, the thought experiment of whether there is something it is like to be an artificial system and
what that experience would entail is not merely an inspiring curiosity.
The long-term goal of AI researchers in industry and academia is to create software and robotic
agents designed to support basic human rights, well-being, and dignity (Dwivedi et al., 2021; Eynon and
Young, 2021). Along this ambitious path, machine awareness becomes relevant for two major reasons:
one ethical, the other pragmatic. Ethically, if we create artificial systems that can “feel,” we must ensure
they do not suffer in performing our tasks, but rather experience contentment in fulfilling them (Agarwal
and Edelman, 2020). Moreover, if artificial consciousness can include the capability to experience
valence, it holds potential to introduce dimensions of positive feeling beyond current human capacity.
This potential carries a moral and ethical obligation to realize it, so that supporting well-being extends to
all sentient machines.
The pragmatic reason for emphasizing machine awareness in AI research is the connection
between consciousness and general intelligence in biological systems. Most tasks we consider markers of
intelligence in humans and animals cannot be performed unconsciously (Kaufman, 2011). In other
words, phenomenal awareness may be essential for true, general intelligence to emerge in biological
systems. Conceivably, this relationship could carry over—at least to some extent—to artificial systems.
Both the ethical and pragmatic arguments for the importance of consciousness research in AI
have been valid since the early days of machine learning in the 20th century. Arguably, however, during
the era of expert systems, the advent of true general artificial intelligence seemed so distant that
discussions of machine awareness sounded more like fiction than science (Russell and Norvig, 2016). Even
after the success of forward and recurrent deep learning models in the last decade, such topics were
largely overlooked (Chella, 2023). But with exponential advancements, today’s vision, language, and
reinforcement learning models now showcase human-like generalization capabilities, meeting Turing’s
criteria for intelligence (Lund and Wang, 2023; Adaptive Agent Team, 2023; Gallouédec et al., 2024).
Transformer networks are revolutionizing machine learning. While deep learning architectures
are inspired by neural network structures in biological brains, transformers specifically draw on cognitive
attention mechanisms. Following this analogy, consciousness could be the next biological cognitive
property to enhance the development of intelligent artificial systems. Current leading research in
foundational AI models emphasizes multimodality—a strength of conscious information processing in
humans and animals. Thus, it may now be essential for machine learning researchers to begin exploring
architectures that incorporate elements of awareness.
In the following sections, this paper explores two major neuroscientific theories of consciousness:
the Global Workspace Theory and the Sensorimotor Contingency Theory. It then introduces a recently
proposed multimodal deep learning architecture called the Global Latent Workspace, based on the Global
Workspace Theory. After presenting initial experimental results, a thought experiment illustrates the
emergence of global sensorimotor representations and compares these to the predictions of the
Sensorimotor Contingency Theory. The theoretical findings suggest that this novel artificial system can
integrate both neuroscientific theories of consciousness. The paper concludes with a discussion on the
possible functional consequences and the potential for this model to exhibit a form of phenomenal
consciousness if instantiated in an artificial system.
2. Neuroscientific Perspectives on Consciousness
To build potentially conscious artificial systems, it is sensible to examine how consciousness might be
implemented in the brain. However, there is currently no universally accepted neuroscientific theory of
consciousness (Seth and Bayne, 2022). Although several theories exist, the limited understanding of the
brain’s functional architecture, combined with the inherent challenges of conceptualizing consciousness,
leaves the scientific community without a consensus.
A recent interdisciplinary effort involving philosophers, psychologists, neuroscientists, cognitive
scientists, and machine learning researchers produced a white paper that leverages several prominent
theories of consciousness to develop indicators for conscious artificial systems (Butlin et al., 2023). The
following sections detail two influential theories relevant to understanding consciousness in artificial
systems: the Global Workspace Theory and the Sensorimotor Contingency Theory.
2.1 Sensorimotor Contingency Theory
This theory uniquely addresses the Hard Problem of consciousness (Buhrmann, Di Paolo, and
Barandiaran, 2013; Müller, 2024; Silverman, 2017). It seeks to eliminate the need for a homunculus in a
Cartesian Theatre and explains why different modalities yield distinct perceptions (but see Dung, 2022).
The foundational concept of sensorimotor contingency suggests that the conscious experience of a
specific property of an object arises from enacted knowledge of a specific law governing the relationship
between changes in sensory information structure and the corresponding sequence of motor outputs
(O’Regan, 2011; O’Regan and Noë, 2001). Here, ‘law’ refers to any invariant involving a predictable and
consistent relationship between sensory inputs and the resulting motor outputs or the respective efference
copy information. For instance, as we move our eyes across a visual scene, changes in visual input are
systematically related to eye and head movements and are distinct from the patterns governing auditory
input.
The empirical foundations of the sensorimotor theory rest on a range of experiments, from the
perception of space through sensory substitution and augmentation devices (Lenay, Canu, and Villon,
1997; Kaspar et al., 2014) to developmental deficits in sensory cortex and perception due to lack of motor
correlation (Attinger, Wang, and Keller, 2017; Held and Hein, 1963). Over the past two decades, the
theory has evolved from an initial ambiguous formulation to a rigorous mathematical characterization,
leveraging tools from dynamical systems theory. Here, laws governing changes in sensory flux and motor
activity define differentiable manifolds embedded in sensorimotor space (Buhrmann et al., 2013; Philipona
and O'Regan, 2006, 2011; Stober, 2015). Perceptions of color, geometric shapes, and behavioral habits
align within this framework.
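Schematically, and with notation that is illustrative rather than taken from the cited works: if a motor configuration $m \in \mathbb{R}^p$ deterministically yields a sensory state $s = \phi(m) \in \mathbb{R}^q$ for a given object, then the graph

$$\mathcal{M}_\phi = \{\, (m, \phi(m)) \mid m \in \mathbb{R}^p \,\} \subset \mathbb{R}^{p+q}$$

is, for smooth $\phi$, a $p$-dimensional differentiable manifold embedded in sensorimotor space, and the corresponding law is captured locally by the relation $ds = J_\phi(m)\, dm$ between small motor changes and the sensory changes they induce.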
Despite these theoretical advances, applications in intelligent artificial systems have so far been
limited to proof-of-concept and basic prototypes (Angulo and Acevedo-Valle, 2017; Egbert and
Barandiaran, 2022; Maye and Engel, 2011; Stober, Miikkulainen, and Kuipers, 2011).
2.2 Global Workspace Theory
The Global Workspace Theory (GWT) is widely regarded as the leading neuroscientific model of
conscious processing (Seth and Bayne, 2022) and emerges as a front-runner in efforts to model
consciousness computationally (Butlin et al., 2023; Dehaene, Lau, and Kouider, 2021). The theory posits
that the brain consists of specialized modules connected by long-distance neuronal links. Based on
sensory input salience and task context, content from one module is broadcast and shared with others
through a process known as global ignition. This shared content, termed the global neuronal workspace,
constitutes our conscious awareness (Baars, 1993; Sergent and Dehaene, 2004; Mashour et al., 2020).
The workspace model is based on three neurophysiological assumptions: the presence of
recurrent thalamocortical connections, the existence of specialized modules (vision, language, arithmetic,
motor, etc.) involving interconnected thalamocortical regions, and recurrent connections among these
modules via frontoparietal corticocortical axons. This network of networks forms the neural substrate for
the theory.
The theory further asserts that information processed in a feedforward or locally recurrent manner
within any one module remains subconscious until it is integrated into the globally recurrent workspace
via an ignition process. This ‘ignition’ functions as an attractor state, allowing neuronal patterns to remain
stable over hundreds of milliseconds (Sergent et al., 2021), a feature often associated with working
memory processes. For information to be broadcast as global working memory, it must be both salient and
contextually resonant, a characteristic linked to cognitive control.
Not all signals that reach sensory neurons or are computed in specialized modules should become
conscious, even if they are salient. GWT suggests that information enters the global workspace through a
competitive process, where unconscious modular processors vie for access. The most relevant or
important data is selected for conscious broadcasting, allowing it to influence other modules.
An advancement in mapping this theory to modern machine learning is the Global Latent
Workspace model. This model introduces a central latent representation around which multiple
specialized modules are constructed (VanRullen and Kanai, 2021). Content from any one module can be
translated through the central global workspace to any other module and back with minimal loss. The
following section outlines this multimodal deep learning architecture in more detail, discusses initial
experimental results, and presents a thought experiment linking the Global Latent Workspace model with
the Sensorimotor Contingency Theory of consciousness.
3 Global Latent Workspace
The Global Workspace Theory centers on a model where a central information space, or “global
workspace,” can both broadcast to and integrate content from specialized modules. This central hub, in
computational terms, resembles a multimodal read-and-write memory buffer, capable of handling data
from various sensory and cognitive modules. However, since each module processes information in its
specialized format, the global workspace faces two primary integration challenges: it could either
broadcast information in each module’s unique format or unify these formats into one central, universal
representation.
The first option, involving parallel distribution of modality-specific formats, demands
considerable bandwidth and assumes each module can decode information from all other modules—a
prohibitive requirement. The alternative, encoding each module’s content into a shared central format, is
both technically and phenomenologically more plausible. The unified nature of consciousness—our sense
that objects “look,” “sound,” and “feel” as part of a single coherent experience—suggests that the global
workspace indeed processes content in a single, central information format.
In modern machine learning, input and output modules interact via hidden or latent
representations, and the Global Latent Workspace (GLW) leverages this concept to unify multimodal data.
Here, a central latent hub integrates and disseminates information from specialized modules within a
global latent format, creating a cohesive central representation (VanRullen and Kanai, 2021). Each
modality connects to this latent hub through its own deep encoder and decoder, which are optimized
together to:
1) Minimize a contrastive loss that aligns modality-specific data mapped to the central representation.
2) Minimize translation loss between modalities via this central hub.
3) Enforce cycle-consistency—where data can translate from one modality to the workspace and
back (demi-cycle) as well as from one modality via the workspace to another and then back to its
original identity (full-cycle, as in Gorti and Ma, 2018; Zhu et al., 2017).
This central representation is thus optimized to retain only information necessary for meaningful cross-
modal interaction, aligning with Occam’s razor in its efficient encoding.
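To make these objectives concrete, the sketch below expresses the four resulting losses for two domains in illustrative Python. It is a minimal sketch, not the published implementation: a mean-squared-error alignment term stands in for the contrastive loss used in practice, and the module names, sizes, and loss forms are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainHead(nn.Module):
    """Encoder/decoder pair linking one domain's latent space to the workspace (illustrative)."""
    def __init__(self, domain_dim: int, gw_dim: int):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(domain_dim, gw_dim), nn.ReLU(), nn.Linear(gw_dim, gw_dim))
        self.decode = nn.Sequential(nn.Linear(gw_dim, gw_dim), nn.ReLU(), nn.Linear(gw_dim, domain_dim))

def glw_losses(head_a: DomainHead, head_b: DomainHead, z_a: torch.Tensor, z_b: torch.Tensor):
    """z_a, z_b: paired unimodal latent vectors (batch, dim) from domains A and B."""
    gw_a, gw_b = head_a.encode(z_a), head_b.encode(z_b)
    # 1) Alignment: paired samples should map to the same workspace representation.
    align = F.mse_loss(gw_a, gw_b)
    # 2) Translation: A -> workspace -> B should reproduce the paired B latent (and vice versa).
    trans = F.mse_loss(head_b.decode(gw_a), z_b) + F.mse_loss(head_a.decode(gw_b), z_a)
    # 3) Demi-cycle: A -> workspace -> A, requiring no paired data.
    demi = F.mse_loss(head_a.decode(gw_a), z_a) + F.mse_loss(head_b.decode(gw_b), z_b)
    # 4) Full cycle: A -> workspace -> B -> workspace -> A, also unsupervised.
    cycle = F.mse_loss(head_a.decode(head_b.encode(head_b.decode(gw_a))), z_a)
    return align, trans, demi, cycle
```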
Furthermore, access to this global latent space is regulated by attention mechanisms similar to
those used in transformer models. Key-query pairs control access, with each modality possessing a unique
key and the global workspace a query. The similarity between keys and the query determines the degree
to which each module’s latent representation is integrated into the global workspace, effectively filtering
noise and prioritizing relevant modalities.
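One plausible form of this gating is sketched below; the scaled dot-product similarity and softmax fusion are assumptions made for illustration and do not reproduce the reported mechanism (VanRullen, 2024).

```python
import torch
import torch.nn.functional as F

def fuse_modalities(gw_query: torch.Tensor, keys: torch.Tensor, gw_inputs: torch.Tensor) -> torch.Tensor:
    """
    gw_query:  (gw_dim,)              query owned by the global workspace
    keys:      (n_modalities, gw_dim) one key per modality
    gw_inputs: (n_modalities, gw_dim) encoded unimodal content, already in workspace format
    """
    scores = keys @ gw_query / keys.shape[-1] ** 0.5   # key-query similarity (scaled dot product)
    weights = F.softmax(scores, dim=0)                 # one attention weight per modality
    return (weights.unsqueeze(-1) * gw_inputs).sum(0)  # weighted fusion into a single workspace state
```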
3.1 Experimental Results
Initial experiments with this multimodal framework reveal promising capabilities. In one
instance, a GLW-based bimodal architecture, translating between vision and language, required five times
fewer paired samples to achieve comparable performance when trained with an unsupervised cycle-
consistency objective, compared to a traditional supervised setup (Devillers et al., 2024). By reducing
reliance on direct supervision, the model demonstrates a marked efficiency in data usage, underscoring
the benefits of its latent encoding and cycle-consistency approach.
In a reinforcement learning context, the GLW structure facilitates zero-shot transfer across
modalities. When trained to align, translate, and cycle between image and vector data, the model learns a
global latent representation capable of guiding an action policy based on a single modality (e.g., vision).
Intriguingly, this same policy, which operates on the current workspace content, also works when
information from the opposite modality (e.g., the vector input) enters the workspace, without any further training.
This capacity to transfer knowledge between modalities surpasses that of traditional multimodal models,
such as CLIP, by a significant margin, highlighting the GLW’s potential for adaptive, cross-modal
generalization (Maytie et al., 2024).
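In schematic form, the transfer works because the policy reads only the workspace representation, wherever it came from. The sketch below is hypothetical and does not reproduce the setup of Maytié et al. (2024); the encoders, dimensions, and discrete action head are illustrative assumptions.

```python
import torch
import torch.nn as nn

gw_dim = 64
policy_head = nn.Linear(gw_dim, 4)             # action logits computed from workspace content only

def act_from_vision(image_latent: torch.Tensor, vis_encoder: nn.Module) -> torch.Tensor:
    gw = vis_encoder(image_latent)             # vision -> workspace (modality used during RL training)
    return policy_head(gw).argmax(-1)

def act_from_vector(attr_vector: torch.Tensor, vec_encoder: nn.Module) -> torch.Tensor:
    gw = vec_encoder(attr_vector)              # attributes -> workspace (never seen by the policy)
    return policy_head(gw).argmax(-1)          # same policy, applied zero-shot, no further training
```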
Attentional mechanisms, inspired by transformers, further enhance the GLW’s utility by filtering
out irrelevant information. Each modality’s key vector, compared against the global workspace query,
modulates the unimodal representations before fusing them. This selective scaling allows the system to
prioritize essential information and effectively ignore noisy inputs, improving performance in tasks where
certain modalities may introduce irrelevant data (VanRullen, 2024).
In addition, adding a sequential attention mechanism—through a recurrent LSTM network—
extends the GLW’s capabilities in managing temporally sensitive tasks. When applied to an arithmetic
addition task, the attentional model can prioritize access to modules sequentially, facilitating the
workspace’s ability to solve numerical problems. Remarkably, this model generalizes to unseen numbers
within the training range, a behavior analogous to the transitive numerical ordering observed in deep
neural networks trained on arithmetic tasks, potentially hinting at an internal representational structure
akin to a “number line” (Mistry et al., 2023).
4 Thought Experiment
The results discussed above were obtained using Global Latent Workspace architectures without
incorporating a motor module. While ongoing experimental work is moving in this direction, a
theoretical understanding of the expected outcomes when an action-oriented motor module is connected
to the workspace can guide analysis and accelerate development. This section presents a thought
experiment that logically examines the types of global latent representations that might arise in an
artificial agent “embrained” with a minimal, two-module sensorimotor model. We consider how
sensorimotor interactions might lay a foundation for conscious-like representations in a technological
system.
4.1 Agent, Environment and Task
In this thought experiment, we consider an artificial agent, referred to as “AG,” equipped with
two primary modules: a deep neural network vision input module and a deep neural network motor output
module. These modules are respectively connected to visual sensors and robotic arm effectors, where,
depending on its degrees of freedom, AG’s motor space may be as complex as its visual space. The vision
module’s output domain is linked to a global latent space (layer) via a deep encoder, and a decoder
connects the global latent information back to the vision module. Similarly, the global representation is
linked to the motor module’s latent input space through a deep decoder, and the motor latent space is
projected back toward the global latents via a deep encoder. In summary, visual sensors connect to a deep
vision module, which is bidirectionally connected to the global workspace, and this workspace is, in turn,
bidirectionally connected to a deep motor module that controls a robotic arm. Figure 1 illustrates this
structure schematically. Apart from the specific sensors and effectors, this setup aligns with the standard
approach of implementing two distinct domain modules within the global workspace architecture
(Devillers et al., 2024; Maytie et al., 2024).
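The module chain described above can be summarized in a few lines of illustrative code. This is a sketch under simplifying assumptions: the deep vision and motor modules are reduced to small MLPs, and the layer sizes and seven-joint arm are invented for concreteness.

```python
import torch
import torch.nn as nn

class AG(nn.Module):
    """Minimal stand-in for the agent: vision module <-> GLW <-> motor module."""
    def __init__(self, img_dim=64 * 64 * 3, vis_dim=256, gw_dim=64, mot_dim=128, n_joints=7):
        super().__init__()
        self.vision_module = nn.Sequential(nn.Linear(img_dim, vis_dim), nn.ReLU())
        self.vis_to_gw = nn.Linear(vis_dim, gw_dim)   # vision-to-global-latent encoder
        self.gw_to_vis = nn.Linear(gw_dim, vis_dim)   # global-latent-to-vision decoder
        self.gw_to_mot = nn.Linear(gw_dim, mot_dim)   # global-latent-to-motor decoder
        self.mot_to_gw = nn.Linear(mot_dim, gw_dim)   # motor-to-global-latent encoder
        self.motor_module = nn.Sequential(nn.ReLU(), nn.Linear(mot_dim, n_joints))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        """Acting pass (red arrows in Figure 1): pixels -> vision latents -> GLW -> motor command."""
        z_vis = self.vision_module(image.flatten(1))
        gw = self.vis_to_gw(z_vis)        # write vision content into the workspace
        z_mot = self.gw_to_mot(gw)        # broadcast to the motor domain
        return self.motor_module(z_mot)   # joint commands for the robotic arm
```

The remaining two connections (gw_to_vis and mot_to_gw) are unused during acting but are required for the workspace training described in Section 4.2.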
AG is situated in a fixed position within its environment, where its task is to reach out and grasp
objects. These objects appear in various egocentric spatial relations within AG’s visual field, all within its
motor reach. Each specific object location (e.g., to the left or above) implies a unique egocentric visual
representation, although in our example each pertains to one of only two kinds of objects: teddy bears and
plush rabbits. Naturally, each position also determines a particular direction for the grasping movement,
which remains independent of the object type. For instance, a teddy bear to AG’s left generates a visual
representation of the bear on the left side of AG’s egocentric visual field and corresponds to a grasping
movement directed leftward. Likewise, a teddy bear in the center prompts a grasp toward the center, and
so forth. This pattern holds similarly for plush rabbits, with a distinct visual representation specific to
rabbits.
Figure 1: A schematic depiction of the agent AG described in the thought experiment. The Global Latent
Workspace is abbreviated GLW, and the domain module encoders and decoders are denoted E and D,
respectively. The transparent red arrows illustrate the directional information flow during a forward pass
from the sensory input vision module via the GLW to the motor output action module. Picture adapted
with permission from VanRullen (2024).
4.2 Training
In this thought experiment, AG learns the correct sensorimotor mappings, or policies, by training
its “brain” through two distinct yet parallel learning procedures: (1) reinforcement learning across the
entire network and (2) semi-supervised training within the Global Latent Workspace’s multimodal
encoder-decoder structure.
To begin with, a standard reinforcement learning algorithm, such as Proximal Policy
Optimization (PPO; Schulman et al., 2017), shapes the network’s weights through end-to-end
backpropagation. This approach means that gradient descent affects all layers from the motor module
backward through the global-latent-to-motor decoder, the vision-to-global-latent encoder, and the vision
module. Notably, this training process excludes the motor-to-global encoder and the global-to-vision
decoder, as these are outside the current architecture’s feedforward pass from visual input to motor output
(see Figure 1). The training commences as AG interacts with its environment, performing actions based
on an initial, potentially random policy and collecting experiences as state-action-reward-next-state
tuples. These experiences are used to compute advantage estimates, which in turn guide updates to the
policy network, parameterized by AG’s complete neural network “brain,” including the workspace. The
goal of this training is to maximize rewards—specifically, to successfully reach and grasp the presented
objects (either a bunny or a teddy).
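As a rough sketch of which parameters the policy gradient reaches under this description (reusing the illustrative AG class from Section 4.1; the PPO objective itself is only indicated in comments, and the learning rate is an arbitrary placeholder):

```python
import itertools
import torch

ag = AG()  # illustrative agent sketched in Section 4.1

# Only the modules on the feedforward acting path receive policy gradients;
# ag.mot_to_gw and ag.gw_to_vis are deliberately excluded (see Figure 1).
policy_params = itertools.chain(
    ag.vision_module.parameters(),
    ag.vis_to_gw.parameters(),
    ag.gw_to_mot.parameters(),
    ag.motor_module.parameters(),
)
policy_optimizer = torch.optim.Adam(policy_params, lr=3e-4)

# Inside each PPO update (Schulman et al., 2017), only these parameters change:
#   loss = -torch.min(ratio * advantage, clipped_ratio * advantage).mean()
#   loss.backward(); policy_optimizer.step()
```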
Parallel to this reinforcement learning process, the semi-supervised training of the latent
workspace, along with its encoder-decoder structure, operates based on three principles: alignment,
translation, and cycle-consistency. These principles collectively generate four distinct loss functions
(Devillers et al., 2024). Minimizing the two losses associated with alignment and translation requires AG
to learn from paired samples, rendering this aspect of training supervised. The translation loss reaches its
minimum when AG successfully maps representations between the vision and motor domains, as defined
by the paired samples. This allows some domain-specific information to be retained within the encoder-
decoder structure and the global latent representation during cross-domain translation.
Conversely, alignment loss becomes minimal when latent representations entering the workspace
from both modules converge into an identical form. This process counterbalances the translational
structure by constraining and shaping the global representation. Meanwhile, cycle-consistency is achieved
through minimizing two unsupervised loss functions, independent of paired data. The first function works
by transferring information from one domain to the latent space of the other domain and then back again,
striving to capture and retain all original content (Gorti and Ma, 2018; Zhu et al., 2017). The second
function, or “demi-cycle,” involves only the encoder and decoder within a single domain, routing data to
the global latents and back to its identity. In effect, cycle-consistency maximizes domain-specific
information preservation within the workspace and encoder-decoder structure bidirectionally, supporting
the translation goal. Opposing this, alignment seeks to preserve only the essential information needed to
form a coherent, optimally integrated global latent representation.
The reinforcement learning algorithm is task-specific, modifying the network structure—
including the workspace—based on rewarding state-action pairs. In contrast, the workspace learning
algorithm is only indirectly related to the task. It relies to some extent on paired data, and training can
only commence after AG has acquired a foundational set of experiences guided by reinforcement
learning. Thus, the workspace training data itself is shaped by AG’s behavior, which statistically reflects
an inherent bias toward actions and environmental features linked to reward.
The following section delves into the interaction between reinforcement learning and workspace
training in AG, exploring the anticipated outcomes, including the potential emergence of global latent
sensorimotor representations.
4.3 Global Latent Sensorimotor Representations
To grasp the outcomes of training AG using both reinforcement learning and global latent
workspace learning, it is informative to first consider reinforcement learning in isolation. In this case, the
reinforcement learning procedure shapes the entire network: the vision module, the motor module, and the
connecting global workspace. Through this training, AG’s network acquires the capacity to reach for
certain types of objects located anywhere in its egocentric visual field. Given our simple scenario, where
each unique object-location combination corresponds to one specific motor response, AG’s “network
brain” embodies a fundamental sensorimotor function.
The representations that emerge within this network are dependent on the reward structure and
the environment. For instance, if only reaching for teddy bears is rewarded, the resulting representational
structure can be relatively abstract, as AG’s network does not need to encode the visual details that would
differentiate a teddy bear from a plush rabbit. Instead, the network needs only to represent features
sufficient to distinguish the teddy bear from the environment. The network also learns to encode the
location of the teddy in a way that maps visual input to the corresponding motor action, such as reaching
left, right, or center. A similar pattern of representation emerges if only the plush rabbit is rewarded.
However, if both teddy bears and plush rabbits are equally rewarding, AG’s network will learn to
represent the shared features of both objects without differentiating between them, resulting in a kind of
“commonality” representation that does not signify the existence of two distinct object types. In contrast,
an oscillating reward function—one that alternates between rewarding the teddy bear and the plush
rabbit—encourages AG to develop distinct latent representations for each object type. In this scenario,
AG’s representations will encode shape and color differences to the extent necessary to distinguish
between the two objects, while retaining structural similarity regarding location, since reaching for a
teddy bear on the left requires the same motor action as reaching for a plush rabbit in the same position.
This outcome of the thought experiment draws inspiration from domain-adversarial training
(Ganin et al., 2016). If we reinterpret class labels as motor actions (e.g., reaching leftward or rightward)
and domain labels as object types (e.g., teddy bear or plush rabbit), then the central feature vector—
following adversarial training—optimally integrates visual and motor information. Here, the oscillating
reward function in our thought experiment serves as an analog to the adversarial component in domain-
adversarial training, influencing AG to maintain both specific and generalized features of sensorimotor
information.
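For concreteness, the sketch below shows the standard gradient-reversal construction of Ganin et al. (2016), with the labels reinterpreted as in the thought experiment: the "class" head predicts the reach direction and the adversarial "domain" head predicts the object type. Layer sizes and heads are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses the gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

features = nn.Sequential(nn.Linear(128, 64), nn.ReLU())   # shared sensorimotor feature vector
action_head = nn.Linear(64, 3)   # "class" labels: reach left / center / right
object_head = nn.Linear(64, 2)   # "domain" labels: teddy bear vs. plush rabbit (adversary)

def dann_loss(x, action_label, object_label):
    h = features(x)
    action_loss = F.cross_entropy(action_head(h), action_label)
    # Reversed gradients push the shared features to keep action information
    # while becoming invariant to the object type.
    object_loss = F.cross_entropy(object_head(GradReverse.apply(h)), object_label)
    return action_loss + object_loss
```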
Standard feedforward sensorimotor architectures trained with conventional PPO, however, do
not have a designated feature bifurcation layer. In such architectures, without additional constraints, there
is minimal pressure to construct optimally integrated multimodal representations. Reinforcement learning
alone may encourage a vertical division in representations, with bunnies processed along one subset of
network nodes and teddies along another. Even if network size is compressed to enforce representation
integration at each layer, ambiguity remains regarding which layers retain specific information. Since all
modules and layers contribute to the overall sensorimotor function, each module, layer, and node—with
its respective activations—encodes some aspects of the visual input and the subsequent motor output.
In conclusion, sensorimotor representations are pervasive across feedforward architectures and
throughout AG’s network brain.
4.3.1 Global Locality
The pervasiveness of sensorimotor representations brings to mind a recurring critique of the
Sensorimotor Contingency Theory of consciousness. This theory’s proposal to link specific conscious
content with neuronal activity representing object-specific, law-like sensorimotor invariances is both
elegant and compelling (O’Regan and Noë, 2001). However, as AG’s representational structure reveals,
an essential question about sensorimotor correlates of consciousness remains open: where exactly are
these law-like invariances located? For instance, lesions in sensory cortices disrupt subjective
phenomenal experience directly, while an inactive primary motor cortex does not impact immediate
awareness (Donoghue and Sanes, 1994; Snider et al., 2020).
As we have seen, the sensorimotor representations discussed above arose from training AG using
only reinforcement learning. These representations integrate to perform a particular sensorimotor task and
are optimized for this purpose. However, they lack a key property compared to multimodal
representations that emerge through the combined reinforcement and four-loss global workspace training:
locality.
In a feedforward network, the structure of sensorimotor knowledge is inherently distributed
across the entire artificial “brain.” Representations of teddy bears on the left and plush rabbits on the right
blend and divide within the network without any explicit mechanism for leveraging this information in
downstream processes. By contrast, the Global Latent Workspace approach encourages knowledge
localization. Here, each type of sensory input and its associated motor output must be represented within
the global workspace to satisfy the criteria of alignment, translation, and cycle consistency. While this
process only achieves a relative localization, it significantly enhances the ability to use multimodal
information for other connected modalities in downstream tasks.
In this context, it is worth noting that the term “downstream” in the global workspace framework
does not necessarily imply spatial positioning further from the input source, as it might in cognitive
science or neuroscience. Instead, much like its usage in machine learning, “downstream” here refers to
sequential order defined by task dependencies rather than spatial topography. For example, one of the
experimental results from the Global Latent Workspace architectures discussed above states that a
reinforcement learning policy trained on global latent representations from the vision module can transfer
effectively to global latents derived from vector descriptions, without further training (Maytie et al.,
2024). This means that if visual information is unavailable, the agent can still leverage structural
associations formed during training to continue performing tasks based on vector inputs.
Adding an additional sensory module to AG’s sensorimotor thought experiment could further
enrich its capabilities. For instance, if rewarding objects (teddy bears, plush rabbits, or otherwise) are
capable of producing sounds, a third module providing audio input could connect to the visuomotor
workspace via its own encoder-decoder structure. This addition would generate a multimodal global latent
representation, incorporating visual, auditory, and motor information. This scenario leads to two
predictions for future experiments:
1. Faster Learning: With an audio module integrated through four-loss semi-supervised training, AG
would require fewer steps for the reinforcement learning algorithm to associate an audio signal
with the correct motor command than would an audiomotor agent without such training.
2. Greater Flexibility: If the environment changes and an opaque obstacle blocks AG’s view, AG
would need to navigate around it to reach the plush animal, relying on sound for locating the
hidden object. By integrating directionally informative audio data in the global latent workspace,
AG could adapt its visuomotor representations, allowing it to reach around the obstacle and grasp,
for example, a teddy bear located behind it on the left.
While this multimodal interaction could theoretically be achieved in feedforward architectures without a
workspace, it would require specific knowledge of where representations are located in the network and
in what information format they exist. The locality of information and globality of format within a central
latent workspace likely facilitates such multimodal tasks. Additionally, this type of integration may
support grounding language and spatial reasoning more generally, laying a foundation for a form of
intelligence that large language and vision-language models are only beginning to approach (Chen et al.,
2024; Fu et al., 2024; Zhang et al., 2024).
At its core, the Global Workspace Theory of consciousness has always been a theory of cognition,
proposing a structural component of global connectivity within the brain’s network. Importantly, this
structure remains topologically distinct rather than being equated with the brain in its entirety. By
mapping the theory onto a model constrained by machine learning principles, the Global Latent
Workspace offers a new perspective. Comparing different levels of neuro-cognitive models is a critical
part of scientific inquiry into the inner workings of intelligent and conscious systems (Kuske et al., 2022;
Marr, 2010). Here, the emergence of global latent sensorimotor representations, structurally localized
within an abstract artificial system, provides an initial step toward unifying two major neuroscientific
theories of consciousness while harnessing their potential in the development of intelligent agents.
5. Conclusion
This paper examines the integration of two major neuroscientific theories of consciousness—
Sensorimotor Contingency Theory and Global Workspace Theory—within a novel machine learning
architecture, the Global Latent Workspace. By exploring both theoretical insights and initial experimental
results, it demonstrates how sensorimotor representations and global latent workspaces might effectively
converge to create a more comprehensive approach to artificial consciousness.
Yet, numerous questions remain. How can this technical integration be formalized
mathematically? The combined reinforcement and global latent workspace training approach awaits
comparison to manifold learning algorithms, such as embodied Isomap and Sensorimotor Embedding
(Philipona and O’Regan, 2011; Stober, 2015). What functional advantages might these representations
confer? For instance, a similar architecture applied in a virtual reality agent performed comparably to
conventional reinforcement learning models on the Obstacle Tower Challenge benchmark, while
exhibiting sparse latent representations conducive to resource-efficient computation (Clay et al., 2021,
2023). Additionally, integrating global latent sensorimotor representations may accelerate robotic tasks
like object picking by eliminating transfer overhead, reducing the need for supervised training, and
supporting zero-shot transfer learning—thereby potentially fostering more flexible, multimodal
intelligence (Devillers et al., 2024; Maytie et al., 2024). Could this architecture further integrate
additional neuroscientific theories of consciousness? Many such theories are not mutually exclusive, and
the recursive information flow involving sensorimotor representations aligns with contenders like
predictive processing theory (Seth, 2014). Ultimately, could a global latent sensorimotor representation in
silico suffice for the phenomenal perception of space? Skeptics advocating for neurobiological
implementations must substantiate their claims with functional advantages or foundational premises
(Doerig et al., 2019; Seth, 2024).
The journey toward artificial consciousness is undeniably complex, but this work provides an
essential step in exploring its feasibility and implications. Given the rapid advancements in AI and the
growing capabilities of multimodal models, it is uncertain when artificial consciousness might emerge.
Currently, theoretical foundations in this area are limited, possibly due to the longstanding perception of
consciousness research as unscientific. However, if we take seriously the possibility of intelligent
artificial systems, there is no reason to dismiss the potential for conscious artificial systems. As
emphasized earlier, the ethical implications of this question are significant, especially with respect to the
potential for suffering or pleasure in feeling machines. Moreover, a profound connection may exist
between certain forms of consciousness and intelligence, as observed in humans and animals. This paper
has illustrated one theoretical avenue toward this connection, utilizing a sensorimotor global latent
workspace model within a multimodal agent.
In conclusion, if AI research aims to improve life on Earth, it may now be crucial to address the
profound challenge of integrating awareness into artificial systems. Consciously engaging with this
research question could prevent unintended suffering in artificial entities and accelerate the development
of true AI capable of supporting biological life.
References
Adaptive Agent Team, Jakob Bauer, Kate Baumli, Satinder Baveja, Feryal Behbahani, Avishkar
Bhoopchand, Nathalie Bradley-Schmieg et al. 2023. "Human-timescale adaptation in an open-ended
task space." arXiv preprint arXiv:2301.07608.
Agarwal, Aman, and Shimon Edelman. 2020. "Functionally effective conscious AI without suffering."
Journal of Artificial Intelligence and Consciousness 7, 01: 39-50.
Angulo, Cecilio, and Juan M. Acevedo-Valle. 2017. "On Dynamical Systems for Sensorimotor
Contingencies. A First Approach from Control Engineering." In Recent Advances in Artificial
Intelligence Research and Development, eds. Isabel Aguilo, Rene Alquezar, Cecilio Angulo, Alberto
Ortiz and Joan Torrens. Amsterdam: IOS Press BV.
Attinger, Alexander, Bo Wang, and Georg B. Keller. 2017. "Visuomotor coupling shapes the functional
development of mouse visual cortex." Cell 169(7): 1291-1302.
Baars, Bernard J. 1993. A cognitive theory of consciousness. Cambridge University Press.
Bowen, J., Tula Giannini, Rachel Ara, Andy Lomas, and Judith Siefring. 2019. "Digital art, culture and
heritage: New constructs and consciousness." Electronic Visualisation and the Arts (EVA 2019).
Buhrmann, Thomas, Ezequiel Alejandro Di Paolo, and Xabier Barandiaran. 2013. "A dynamical systems
account of sensorimotor contingencies." Frontiers in psychology 4: 285.
Butlin, Patrick, Robert Long, Eric Elmoznino, Yoshua Bengio, Jonathan Birch, Axel Constant, George
Deane et al. 2023. "Consciousness in artificial intelligence: Insights from the Science of
consciousness." arXiv preprint arXiv:2308.08708.
Chella, Antonio. 2023. "Artificial consciousness: the missing ingredient for ethical AI?" Frontiers in
Robotics and AI 10.
Chen, Boyuan, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. 2024.
"Spatialvlm: Endowing vision-language models with spatial reasoning capabilities." In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14455-14465.
Chen, Zhi, Jingjing Li, Yadan Luo, Zi Huang, and Yang Yang. 2020. "Canzsl: Cycle-consistent adversarial
networks for zero-shot learning from natural language." In Proceedings of the IEEE/CVF winter
conference on applications of computer vision 874-883.
Clay, Viviane, Peter König, Kai-Uwe Kühnberger, and Gordon Pipa. 2021. "Learning sparse and
meaningful representations through embodiment." Neural Networks 134: 23-41.
Clay, Viviane, Peter König, Kai-Uwe Kühnberger, and Gordon Pipa. 2023. "Development of Few-Shot
Learning Capabilities in Artificial Neural Networks When Learning Through Self-Supervised
Interaction." IEEE Transactions on Pattern Analysis and Machine Intelligence 46(1): 209-219.
Dehaene, Stanislas, Hakwan Lau, and Sid Kouider. 2021. "What is consciousness, and could machines
have it?" Robotics, AI, and Humanity: Science, Ethics, and Policy: 43-56.
Devillers, Benjamin, Léopold Maytié, and Rufin VanRullen. 2024. "Semi-Supervised Multimodal
Representation Learning Through a Global Workspace." IEEE Transactions on Neural Networks and
Learning Systems: 1-15.
Doerig, Adrien, Aaron Schurger, Kathryn Hess, and Michael H. Herzog. 2019. "The unfolding argument:
Why IIT and other causal structure theories cannot explain consciousness." Consciousness and
cognition 72: 49-59.
Donoghue, John P., and Jerome N. Sanes. 1994. "Motor areas of the cerebral cortex." Journal of Clinical
Neurophysiology 11(4): 382-396.
Dwivedi, Yogesh K., Laurie Hughes, Elvira Ismagilova, Gert Aarts, Crispin Coombs, Tom Crick, Yanqing
Duan et al. 2021. "Artificial Intelligence (AI): Multidisciplinary perspectives on emerging challenges,
opportunities, and agenda for research, practice and policy." International Journal of Information
Management 57: 101994.
Dung, Leonard. 2022. "Against the explanatory argument for enactivism." Journal of Consciousness
Studies 29: 57-68.
Egbert, Matthew D., and Xabier E. Barandiaran. 2022. "Using enactive robotics to think outside of the
problem-solving box: How sensorimotor contingencies constrain the forms of emergent autonomous
habits." Frontiers in Neurorobotics 16: 847054.
Eynon, Rebecca, and Erin Young. 2021. "Methodology, legend, and rhetoric: The constructions of AI by
academia, industry, and policy groups for lifelong learning." Science, Technology, & Human Values
46: 166-191.
Fu, Xingyu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith,
Wei-Chiu Ma, and Ranjay Krishna. 2024. "Blink: Multimodal large language models can see but not
perceive." arXiv preprint arXiv:2404.12390.
Gallouédec, Quentin, Edward Beeching, Clément Romac, and Emmanuel Dellandréa. 2024. "Jack of All
Trades, Master of Some, a Multi-Purpose Transformer Agent." arXiv preprint arXiv:2402.09844.
Ganin, Yaroslav, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François
Laviolette, Mario Marchand, and Victor Lempitsky. 2016. "Domain-adversarial training of neural
networks." The journal of machine learning research 17(1): 2096-2030.
Gorti, Satya Krishna, and Jeremy Ma. 2018. "Text-to-image-to-text translation using cycle consistent
adversarial networks." arXiv preprint arXiv:1808.04538.
Held, Richard, and Alan Hein. 1963. “Movement-produced stimulation in the development of visually
guided behavior”. Journal of Comparative and Physiological Psychology 56 (5): 872.
Kaspar, Kai, Sabine König, Jessika Schwandt, and Peter König. 2014. "The experience of new
sensorimotor contingencies by sensory augmentation." Consciousness and cognition 28: 47-63.
Kaufman, Scott Barry. 2011. "Intelligence and the cognitive unconscious." The Cambridge handbook of
intelligence: 442-467.
Kuske, N., M. Ragni, F. Röhrbein, J. Vitay, and F. Hamker. 2022. "Demands and potentials of different
levels of neuro-cognitive models for human spatial cognition." In Proceedings of KogWiss2022, the 15th
Biannual Conference of the German Society for Cognitive Science, eds. E. Ferstl, L. Konieczny, and R.
Stülpnagel, 115-116. Albert-Ludwigs-Universität Freiburg.
Lenay, Charles, Stéphane Canu, and Pierre Villon. 1997. "Technology and perception: The contribution of
sensory substitution systems." In Proceedings Second International Conference on Cognitive
Technology Humanizing the Information Age 44-53.
Lund, Brady D., and Ting Wang. 2023. “Chatting about ChatGPT: how may AI and GPT impact academia
and libraries?” Library Hi Tech News 40(3): 26-29.
Marr, David. 2010. Vision: A computational investigation into the human representation and processing
of visual information. MIT press.
Mashour, George A., Pieter Roelfsema, Jean-Pierre Changeux, and Stanislas Dehaene. 2020. "Conscious
processing and the global neuronal workspace hypothesis." Neuron 105(5): 776-798.
Maye, Alexander, and Andreas K. Engel. 2011. "A discrete computational model of sensorimotor
contingencies for object perception and control of behavior." In 2011 IEEE International Conference
on Robotics and Automation 3810-3815.
Maytié, Léopold, Benjamin Devillers, Alexandre Arnold, and Rufin VanRullen. 2024. "Zero-shot cross-
modal transfer of Reinforcement Learning policies through a Global Workspace." arXiv preprint
arXiv:2403.04588.
Mistry, Percy K., Anthony Strock, Ruizhe Liu, Griffin Young, and Vinod Menon. 2023. "Learning-
induced reorganization of number neurons and emergence of numerical representations in a
biologically inspired neural network." Nature Communications 14(1): 3843.
Müller, Vincent C. 2024. “Philosophy of AI: A Structured Overview.” Forthcoming in Cambridge
Handbook on the Law, Ethics and Policy of Artificial Intelligence, ed. Nathalie Smuha. Cambridge:
Cambridge University Press.
O'Regan, Kevin J. 2011. Why red doesn't sound like a bell: Understanding the feel of
Consciousness. New York: Oxford University Press.
O'Regan, Kevin J., and Alva Noë. 2001. "A sensorimotor account of vision and visual consciousness."
Behavioral and Brain Sciences 24(5): 939-973.
Philipona, David L., and J. Kevin O'Regan. 2006. "Color naming, unique hues, and hue cancellation
predicted from singularities in reflection properties." Visual neuroscience 23(3-4): 331-339.
Philipona, David L., and J. Kevin O’Regan. 2011. "The sensorimotor approach in CoSy: The example of
dimensionality reduction." In Cognitive Systems, eds. Henrik Christensen, Jeremy Wyatt, and Geert-
Jan Kruijff. Berlin, Heidelberg: Springer-Verlag Berlin Heidelberg.
Price, Derek, and S. Bedini. 1964. "Automata in History." Technology and Culture 5: 9-23.
Russell, Stuart J., and Peter Norvig. 2016. Artificial intelligence: a modern approach. Pearson.
Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. "Proximal policy
optimization algorithms." arXiv preprint arXiv:1707.06347.
Seth, Anil K. 2014. "A predictive processing theory of sensorimotor contingencies: Explaining the puzzle
of perceptual presence and its absence in synesthesia." Cognitive neuroscience 5: 97-118.
Seth, Anil K., and Tim Bayne. 2022. "Theories of consciousness." Nature Reviews Neuroscience 23: 439-
452.
Seth, Anil K. 2024. "Conscious artificial intelligence and biological naturalism." PsyArXiv preprint, June 2024.
Sergent, Claire, Martina Corazzol, Ghislaine Labouret, François Stockart, Mark Wexler, Jean-Rémi King,
Florent Meyniel, and Daniel Pressnitzer. 2021. "Bifurcation in brain dynamics reveals a signature of
conscious processing independent of report." Nature communications 12: 1149.
Sergent, Claire, and Stanislas Dehaene. 2004. "Neural processes underlying conscious perception:
experimental findings and a global neuronal workspace framework." Journal of Physiology-Paris
98(4-6): 374-384.
Silverman, David. 2017. "Sensorimotor theory and the problems of consciousness." Journal of
Consciousness Studies 24(7-8): 189-216.
Snider, Samuel B., Joey Hsu, R. Ryan Darby, Danielle Cooke, David Fischer, Alexander L. Cohen, Jordan
H. Grafman, and Michael D. Fox. 2020. "Cortical lesions causing loss of consciousness are
anticorrelated with the dorsal brainstem." Human brain mapping 41(6): 1520-1531.
Stober, Jeremy Michael. 2015. Sensorimotor embedding: a developmental approach to learning
geometry. Dissertation. The University of Texas at Austin.
Stober, Jeremy, Risto Miikkulainen, and Benjamin Kuipers. 2011. "Learning geometry from sensorimotor
experience." In 2011 IEEE International Conference on Development and Learning (ICDL) 2: 1-6.
VanRullen, Rufin. 2024. "Neuroscience and AI meeting." Meeting of the Society of Neuroscience, May 23,
2024, University of Bordeaux, Bordeaux, France.
VanRullen, Rufin, and Ryota Kanai. 2021. "Deep learning and the global workspace theory." Trends in
Neurosciences 44(9): 692-704.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. "Attention is all you need." Advances in neural information
processing systems 30.
Zhang, Yizhe, He Bai, Ruixiang Zhang, Jiatao Gu, Shuangfei Zhai, Josh Susskind, and Navdeep Jaitly.
2024. "How Far Are We from Intelligent Visual Deductive Reasoning?." arXiv preprint
arXiv:2403.04732.
Zhu, Jun-Yan, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. "Unpaired image-to-image
translation using cycle-consistent adversarial networks." In Proceedings of the IEEE international
conference on computer vision 2223-2232.