IBM Research is engaged in a research program in symbiotic cognitive computing to investigate how to embed cognitive computing in physical spaces. This article proposes five key principles of symbiotic cognitive computing: context, connection, representation, modularity, and adaptation, along with the requirements that flow from these principles. We describe how these principles are applied in a particular symbiotic cognitive computing environment and in an illustrative application for strategic decision making. Our results suggest that these principles and the associated software architecture provide a solid foundation for building applications where people and intelligent agents work together in a shared physical and computational environment. We conclude with a list of challenges that lie ahead. Copyright © 2016, Association for the Advancement of Artificial Intelligence. All rights reserved.
In 2011, IBM’s Watson competed on the game show Jeop-
ardy! winning against the two best players of all time, Brad
Rutter and Ken Jennings (Ferrucci et al. 2010). Since this
demonstration, IBM has expanded its research program in
articial intelligence (AI), including the areas of natural lan-
guage processing and machine learning (Kelly and Hamm
2013). Ultimately, IBM sees the opportunity to develop cog-
nitive computing — a unied and universal platform for
computational intelligence (Modha et al. 2011). But how
might cognitive computing work in real environments —
and in concert with people?
In 2013, our group within IBM Research started to explore
how to embed cognitive computing in physical environ-
ments. We built a Cognitive Environments Laboratory (CEL)
(see gure 1) as a living lab to explore how people and cog-
nitive computing come together.
Our effort focuses not only on the physical and computa-
tional substrate, but also on the users’ experience. We envi-
sion a uid and natural interaction that extends through
time across multiple environments (ofce, meeting room,
living room, car, mobile). In this view, cognitive computing
systems are always on and available to engage with people in
the environment. The system appears to follow individual
users, or groups of users, as they change environments, seam-
lessly connecting the users to available input and output
devices and extending their reach beyond their own cogni-
tive and sensory abilities.
We call this symbiotic cognitive computing: computation that
takes place when people and intelligent agents come togeth-
er in a physical space to interact with one another. The intel-
ligent agents use a computational substrate of “cogs” for visu-
al object recognition, natural language parsing, probabilistic
decision support, and other functions. The term cog is from
Symbiotic Cognitive Computing
Robert Farrell, Jonathan Lenchner, Jeffrey Kephart, Alan Webb,
Michael Muller, Thomas Erickson, David Melville, Rachel Bellamy,
Daniel Gruen, Jonathan Connell, Danny Soroker, Andy Aaron,
Shari Trewin, Maryam Ashoori, Jason Ellis, Brian Gaucher, Dario Gil
the book The Society of Mind where Marvin Minsky
likened agents to “cogs of great machines” (Minsky
1988). These cogs are available to intelligent agents
through programmatic interfaces and to human par-
ticipants through user interfaces.
Our long-term goal is to produce a physical and
computational environment that measurably
improves the performance of groups on key tasks
requiring large amounts of data and signicant men-
tal effort, such as information discovery, situational
assessment, product design, and strategic decision
making. To date, we have focused specically on
building a cognitive environment, a physical space
embedded with cognitive computing systems, to sup-
port business meetings for strategic decision making.
Other applications include corporate meetings
exploring potential mergers and acquisitions, execu-
tive meetings on whether to purchase oil elds, and
utility company meetings to address electrical grid
outages. These meetings often bring together a group
of participants with varied roles, skills, expertise, and
points of view. They involve making decisions with a
large number of high-impact choices that need to be
evaluated on multiple dimensions taking into
account large amounts of structured and unstruc-
tured data.
While meetings are an essential part of business,
studies show that they are generally costly and
unproductive, and participants nd them too fre-
quent, lengthy, and boring (Romano and Nunamak-
er 2001). Despite this, intelligent systems have the
potential to vastly improve our ability to have pro-
ductive meetings (Shrobe et al. 2001). For example,
an intelligent system can remember every conversa-
tion, record all information on displays, and answer
questions for meeting participants. People making
high-stakes, high-pressure decisions have high expec-
tations. They typically do not have the time or desire
to use computing systems that add to their workload
or distract from the task at hand. Thus, we are aim-
ing for a “frictionless” environment that is always
available, knowledgeable, and engaged. The system
must eliminate any extraneous steps between
thought and computation, and minimize disruptions
while bringing important information to the fore.
The remainder of this article is organized as fol-
lows. In the next section, we review the literature
that motivated our vision of symbiotic cognitive
computing. We then propose ve fundamental prin-
ciples of symbiotic cognitive computing. We list
some of the key requirements for cognitive environ-
ments that implement these principles. We then pro-
vide a description of the Cognitive Environments
Lab, our cognitive environments test bed, and intro-
duce a prototype meeting-support application we
built for corporate mergers and acquisitions (M&A)
that runs in this environment. We wrap up by return-
ing to our basic tenets, stating our conclusions, and
listing problems for future study.
In his paper “Man-Computer Symbiosis,” J. C. R. Lick-
lider (1960) originated the concept of symbiotic com-
puting. He wrote,
Present-day computers are designed primarily to solve
preformulated problems or to process data according
to predetermined procedures. … However, many prob-
lems … are very difficult to think through in advance.
They would be easier to solve, and they could be
solved faster, through an intuitively guided trial-and-
error procedure in which the computer cooperated,
turning up flaws in the reasoning or revealing unex-
pected turns in the solution.
Licklider stressed that this kind of computer-sup-
ported cooperation was important for real-time deci-
sion making. He thought it important to “bring com-
puting machines effectively into processes of
thinking that must go on in real time, time that
moves too fast to permit using computers in conven-
Figure 1. The Cognitive Environments Lab.
CEL is equipped with movement sensors, microphones, cameras, speakers, and displays. Speech and gesture are used to run cloud-based
services, manipulate data, run analytics, and generate spoken and visual outputs. Wands and other devices enable users to move visual ele-
ments in three dimensions across displays and interact directly with data.
tional ways.” Licklider likely did not foresee the
explosive growth in data and computing power in
the last several decades, but he was remarkably pre-
scient in his vision of man-machine symbiosis.
Distributed cognition (Hutchins 1995) recognizes
that people form a tightly coupled system with their
environment. Cognition does not occur solely or
even mostly within an individual human mind, but
rather is distributed across people, the artifacts they
use, and the environments in which they operate.
External representations often capture the current
understanding of the group, and collaboration is
mediated by the representations that are created,
manipulated, and shared. In activity theory (Nardi
1996), shared representations are used for establish-
ing collective goals and for communication and coor-
dinated action around those goals.
Work on cognitive architectures (Langley, Laird,
and Rogers 2009; Anderson 1983; Laird, Newell, and
Rosenbloom 1987) focuses on discovering the under-
lying mechanisms of human cognition. For example,
the adaptive control of thought (ACT) family of cog-
nitive architectures includes semantic networks for
modeling long-term memory and production rules
for modeling reasoning, and learning mechanisms for
improving both (Anderson, Farrell, and Sauers 1984).
Work on multiagent systems (Genesereth and
Ketchpel 1994) has focused on building intelligent
systems that use coordination, and potentially com-
petition, among relatively simple, independently
constructed software agents to perform tasks that
normally require human intelligence. Minsky (1988)
explained that “each mental agent in itself can do
some simple thing that needs no mind or thought at
all. Yet when we join these agents in societies — in
certain very special ways — this leads to true intelli-
gence.”
Calm technology (Weiser and Brown 1996) sug-
gests that when peripheral awareness is engaged, peo-
ple can more readily focus their attention. People are
typically aware of a lot of peripheral information,
and something will move to the center of their atten-
tion, for example when they perceive that things are
not going as expected. They will then process the
item in focus, and when they are done it will fade
back to the periphery.
We have used these ideas as the basis for our vision
of symbiotic cognitive computing.
Principles of Symbiotic
Cognitive Computing
Our work leads us to propose ve key principles of
symbiotic cognitive computing: context, connection,
representation, modularity, and adaption. These
principles suggest requirements for an effective sym-
biosis between intelligent agents and human partici-
pants in a physical environment.
The context principle states that the symbiosis
should be grounded in the current physical and cog-
nitive circumstances. The environment should main-
tain presence, track and reect activity, and build and
manage context. To maintain presence, the environ-
ment should provide the means for intelligent agents
to communicate their availability and function, and
should attempt to identify people who are available
to engage with intelligent agents and with one
another. To track and reect activity, the environ-
ment should follow the activity of people and
between people and among people, and the physical
and computational objects in the environment or
environments. It should, when appropriate, commu-
nicate the activity back to people in the environ-
ment. At other times, it should await human initia-
tives before communicating or acting. To build and
manage context, the environment should create and
maintain active visual and linguistic contexts within
and across environments to serve as common ground
for the people and machines in the symbiosis, and
should provide references to shared physical and dig-
ital artifacts and to conversational foci.
The connection principle states that the symbiosis
should engage humans and machines with one
another. The environment should reduce barriers,
distractions, and interruptions and not put physical
(for example, walls) or digital barriers (for example,
pixels) between people. The environment should,
when appropriate, detect and respond to opportuni-
ties to interact with people across visual and audito-
ry modalities. The environment should provide mul-
tiple independent means for users to discover and
interact with agents. It should enable people and
agents to communicate within and across environ-
ments using visual and auditory modalities. The
environment should also help users establish joint
goals both with one another and with agents. Final-
ly, the environment should include agents that are
cooperative with users in all interactions, helping
users understand the users’ own goals and options,
and conveying relevant information in a timely fash-
ion (Grice 1975).
The representation principle states that the symbiosis
should produce representations that become the
basis for communication, joint goals, and coordinat-
ed action between and among humans and
machines. The environment should maintain inter-
nal representations based on the tracked users, joint
goals, and activities that are stored for later retrieval.
The environment should externalize selected repre-
sentations, and any potential misunderstandings, to
coordinate with users and facilitate transfer of repre-
sentations across cognitive environments. Finally,
the environment should utilize the internal and
external representations to enable seamless context
switching between different physical spaces and
between different activities within the same physical
space by retrieving the appropriate stored represen-
tations for the current context.
The modularity principle states that the symbiosis
should be driven by largely independent modular
composable computational elements that operate on
the representations and can be accessed equally by
humans and machines. The environment should
provide a means for modular software components
to describe themselves for use by other agents or by
people in the environment. The environment should
also provide a means of composing modular software
components, which perform limited tasks with a sub-
set of the representation, with other components, to
collectively produce behavior for agents. The envi-
ronment should provide means for modular software
components to communicate with one another inde-
pendently of the people in the environment.
Finally, the adaptation principle states that the sym-
biosis should improve with time. The environment
should provide adequate feedback to users and
accept feedback from users. Finally, the environment
should incrementally improve the symbiosis from
interactions with users and in effect, learn.
We arrived at these principles by reecting upon
the state of human-computer interaction with intel-
ligent agents and on our own experiences attempt-
ing to create effective symbiotic interactions in the
CEL. The context principle originates from our obser-
vation that most conversational systems operate
with little or no linguistic or visual context. Break-
downs often occur during human-machine dialogue
due to lack of shared context. The connection prin-
ciple arises out of our observation that today’s
devices are often situated between people and
become an impediment to engagement. The repre-
sentation principle was motivated by our observation
that people often resolve ambiguities, disagreements,
and diverging goals by drawing or creating other
visual artifacts. The use of external representations
reduces the domain of discourse and focuses parties
on a shared understanding. The modularity principle
arose from the practical considerations associated
with building the system. We needed ways of adding
competing or complementary cogs without reimple-
menting existing cogs. The adaptation principle was
motivated by the need to apply machine learning
algorithms to a larger range of human-computer
interaction tasks. Natural language parsing, multi-
modal reference resolution, and other tasks should
improve through user input and feedback.
One question we asked ourselves when designing
the principles was whether they apply equally to
human-human and human-computer interaction.
Context, connection, representation, and adaptation
all apply equally well to these situations. The modu-
larity principle may appear to be an exception, but
the ability to surface cogs to both human and com-
puter participants in the environment enables both
better collaboration and improved human-computer
interaction.
Cognitive environments that implement these
requirements enable people and intelligent agents to
be mutually aware of each others’ presence and activ-
ity, develop connections through interaction, create
shared representations, and improve over time. By
providing both intelligent agents and human partic-
ipants with access to the same representations and
the same computational building blocks, a natural
symbiosis can be supported.
It is impossible to argue that we have found a
definitive set of principles; future researchers may
find better ones or perhaps more self-evident ones
from which the ones we have articulated can be
derived. It may even be possible to create a better
symbiotic cognitive system than any we have created
and not obey one or more of our principles. We look
forward to hearing about any such developments.
We are starting to realize these principles and
requirements by building prototype cognitive envi-
ronments at IBM Research laboratories worldwide.
The Cognitive Environments Laboratory
The Cognitive Environments Laboratory is located at
the IBM T. J. Watson Research Center in Yorktown
Heights, New York. The lab is meant to be a test bed
for exploring what various envisioned cognitive envi-
ronments might be like. It is more heavily instru-
mented than the vast majority of our envisioned cog-
nitive environments, but the idea is that over time
we will see what instrumentation works and what
does not. The lab is focused on engaging users with
one another by providing just the right technology to
support this engagement.
Perhaps the most prominent feature of the CEL is
its large number of displays. In the front of the room
there is a four by four array of high denition moni-
tors (1920 x 1080 pixel resolution), which act like a
single large display surface. On either side of the
room are two pairs of high denition monitors on
tracks. These monitor pairs can be moved from the
back to the front of the room along tracks inlaid in
the ceiling, enabling fast and immediate recongura-
tion of the room to match many meeting types and
activities. In the back of the room there is an 84-inch
touch-enabled 3840 x 2160 pixel display. The moni-
tors are laid out around the periphery of the room.
Within the room, visual content can either be moved
programmatically or with the aid of special ultra-
sound-enabled pointing devices called “wands” or
with combinations of gesture and voice, from moni-
tor to monitor or within individual monitors.
In addition to the displays, the room is outtted
with a large number of microphones and speakers.
There are several lapel microphones, gooseneck
microphones, and a smattering of microphones
attached to the ceiling. We have also experimented
with array microphones that support “beam form-
ing” to isolate the speech of multiple simultaneous
speakers without the need for individual micro-
phones.
An intelligent agent we named Celia (cognitive
environments laboratory intelligent agent) senses the
conversation of the room occupants and becomes a
supporting participant in meetings. With the aid of a
speech-to-text transcription system, the room can
document what is being said. Moreover, the text and
audio content of meetings is continuously archived.
Participants can ask, for example, to recall the tran-
script or audio of all meetings that discussed “graph
databases” or the segment of the current meeting
where such databases were discussed. Transcribed
utterances are parsed using various natural language
processing technologies, and may be recognized as
commands, statements, or questions. For example,
one can dene a listener that waits for key words or
phrases that trigger commands to the system to do
something. The listener can also test whether certain
preconditions are satised, such as whether certain
objects are being displayed. Commands can invoke
agents that retrieve information from the web or
databases, run structured or unstructured data ana-
lytics, route questions to the Watson question-
answering system, and produce interactive visualiza-
tions. Moreover, with the aid of a text-to-speech
system, the room can synthesize appropriate respons-
es to commands. The system can be congured to use
the voice of IBM Watson or a different voice.
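As a concrete illustration of the listener mechanism just described, consider the following minimal sketch; the Listener class, trigger phrases, and precondition shown here are our own illustrative inventions, not the system's actual API.

```python
# Minimal sketch of a keyword listener: it watches transcribed utterances
# for trigger phrases and fires a command only when its preconditions hold.
class Listener:
    def __init__(self, triggers, command, precondition=lambda: True):
        self.triggers = [t.lower() for t in triggers]
        self.command = command            # callable invoked on a match
        self.precondition = precondition  # e.g., "is the object displayed?"

    def on_utterance(self, text):
        text = text.lower()
        if any(t in text for t in self.triggers) and self.precondition():
            return self.command(text)
        return None

# Usage: trigger a recall command only while a transcript panel is shown.
panel_visible = True
recall = Listener(
    triggers=["recall the transcript"],
    command=lambda text: "recalling transcript",
    precondition=lambda: panel_visible,
)
result = recall.on_utterance("Celia, recall the transcript of the last meeting")
```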
In addition to the audio and video output sys-
tems, the room contains eight pan-tilt-zoom cam-
eras, four of which are Internet Protocol (IP) cam-
eras, plus three depth-sensing devices, one of which
is gimbal mounted with software-controllable pan
and tilt capability. The depth-sensing systems are
used to detect the presence of people in the room
and track their location and hand gestures. The cur-
rent set of multichannel output technologies (that
is, including screens and speakers) and multichannel
input technologies (that is, keyboard, speech-to-text,
motion) provide an array of mixed-initiative possi-
bilities.
People in the CEL can simultaneously gesture and
speak to Celia to manipulate and select objects and
operate on those objects with data analytics and serv-
ices. The room can then generate and display infor-
mation and generate speech to indicate objects of
interest, explain concepts, or provide affordances for
further interaction. The experience is one of inter-
acting with Celia as a gateway to a large number of
independently addressable components, cogs, many
of which work instantly to augment the cognitive
abilities of the group of people in the room.
Dependence on a single modality in a complex
environment generally leads to ineffective and
inconvenient interactions (Oviatt 2000). Environ-
ments can enable higher level and robust interac-
tions by exploiting the redundancy in multimodal
inputs (speech, gesture, vision). The integration of
speech and gesture modalities has been shown to
provide both exibility and convenience to users
(Krum 2002). Several prototypes have been imple-
mented and described in the literature. Bolt (1980)
used voice and gesture inputs to issue commands to
display simple shapes on a large screen. Sherma
(2003) and Carbini (2006) extended this idea to a
multiuser interaction space. We have built upon the
ideas in this work in creating the mergers and acqui-
sitions application.
Mergers and Acquisitions
In 2014 and 2015 we built a prototype system, situ-
ated in the Cognitive Environments Laboratory, for
exploring how a corporate strategy team makes deci-
sions regarding potential mergers and acquisitions.
As depicted in gure 2, one or more people can use
speech and gestures to interact with displayed objects
and with Celia. The system has proven useful for
exploring some of the interaction patterns between
people and intelligent agents in a high-stakes deci-
sion-making scenario, and for suggesting architec-
tural requirements and research challenges that may
apply generally to symbiotic cognitive computing
systems.
In the prototype, specialists working on mergers
and acquisitions try to nd reasonable acquisition
targets and understand the trade-offs between them.
They compare companies side by side and receive
guidance about which companies are most aligned
with their preferences, as inferred through repeated
interactions. The end result is a small set of compa-
nies to investigate with a full-edged “due diligence”
analysis that takes place following the meeting.
When the human collaborators have interacted
with the prototype to bring it to the point depicted
in gure 2, they have explored the space of mergers
and acquisitions candidates, querying the system for
companies with relevant business descriptions and
numeric attributes that fall within desired ranges,
such as the number of employees and the quarterly
revenue. As information revealed by Celia is inter-
leaved with discussions among the collaborators,
often triggered by that information, the collaborators
develop an idea of which company attributes matter
most to them. They can then invoke a decision table
to nish exploring the trade-offs.
Figure 3 provides a high-level view of the cognitive
environment and its multiagent software architec-
ture. Agents communicate with one another through
a publish-and-subscribe messaging system (the mes-
sage broker) and through HTTP web services using
the Representational State Transfer (REST) software
design pattern. The system functions can be divided
into command interpretation, command execution,
agent management, decision making, text and data
analysis, text-to-speech, visualization, and manage-
ment. We explain each of these functions in the sec-
tions that follow.
Command Interpretation
The system enables speech and gesture to be used
together as one natural method of communication.
Utterances are captured by microphones, rendered
into text by speech recognition engines, and pub-
lished to a “transcript” message channel managed by
the message broker. The message broker supports
high-performance asynchronous messaging suitable
for the real-time concurrent communication in the
cognitive environment.
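The broker's publish-and-subscribe pattern can be sketched with a minimal in-process implementation; the real system uses a networked broker, and the class and channel names here are illustrative only.

```python
from collections import defaultdict

# Minimal in-process publish-and-subscribe broker. Agents subscribe to
# named channels; publishing a message invokes every subscriber callback.
class MessageBroker:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self._subscribers[channel]:
            callback(message)

# Usage: a speech recognizer publishes utterances to the "transcript"
# channel, which downstream agents consume.
broker = MessageBroker()
received = []
broker.subscribe("transcript", received.append)
broker.publish("transcript", {"speaker": "Bob", "text": "Celia, show IBM"})
```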
We have tested and customized a variety of speech-
recognition engines for the cognitive environment.
The default engine is IBM Attila (Soltau, Saon, and
Kingsbury 2010). It has two modes: a rst that ren-
ders the transcription of an utterance once a break is
detected on the speech channel (for example, half a
second of silence), and a second that renders a word-
by-word transcription without waiting for a pause. In
the former mode there is some probability that sub-
sequent words will alter the assessment of earlier
words. We run both modes in parallel to enable
agents to read and publish partial interpretations
Position and motion tracking uses output from the
position and motion sensors in combination with
visual object recognition using input from the cam-
eras to locate, identify, and follow physical objects in
the environment. The user identity tracking agent
maps recognized people to unique users using
acoustic speaker identication, verbal introduction
(“Celia, I am Bob”), facial recognition upon entry, or
other methods. The speaker’s identity, if known, is
added to each message on the transcript channel,
making it possible to interleave dialogues to some
degree, but further research is needed to handle com-
plex multiuser dialogues. The persistent session infor-
mation includes persistent identities for users,
including name, title, and other information collect-
ed during and across sessions.
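The identity-tracking behavior described above can be sketched as follows; the class, method names, and evidence labels are hypothetical, standing in for the acoustic, verbal, and facial recognizers of the actual system.

```python
# Sketch of user identity tracking: evidence from several recognizers
# (acoustic, verbal introduction, facial) maps to persistent users, and
# the current speaker's identity is attached to transcript messages.
class IdentityTracker:
    def __init__(self):
        self.users = {}          # user id -> persistent profile
        self.active_speaker = None

    def register(self, user_id, **profile):
        self.users.setdefault(user_id, {}).update(profile)

    def observe(self, user_id, method):
        # method: "acoustic", "introduction", or "facial" (illustrative)
        self.active_speaker = user_id

    def tag(self, message):
        # Add the speaker's identity, if known, to a transcript message.
        return dict(message, speaker=self.active_speaker)

tracker = IdentityTracker()
tracker.register("bob", name="Bob", title="Analyst")
tracker.observe("bob", method="introduction")   # "Celia, I am Bob"
tagged = tracker.tag({"text": "show me the candidates"})
```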
The natural language parsing agent subscribes to
the transcript channel, processes text transcriptions
of utterances containing an attention word (for
example, “Celia” or “Watson”) into a semantic repre-
sentation that captures the type of command and
any relevant parameters, and publishes the represen-
tation to a command channel. Our default parser is
Figure 2. The Mergers and Acquisitions Prototype Application.
People work with one another and with Celia to discover companies that match desired criteria, obtain detailed information about like-
ly candidates, and winnow the chosen companies down to a small number that are most suitable.
based on regular expression matching and template
filling. It uses a hierarchical, composable set of func-
tions that match tokens in the text to grammatical
patterns and outputs a semantic representation that
can be passed on to the command executor. Most
sentences have a subject-verb-object structure, with
“Celia” as the subject, a verb that corresponds to a
primitive function of one of the agents or a more
complex domain-specic procedure, and an object
that corresponds to one or more named entities, as
resolved by the named entity resolution agent, or
with modiers that operate on primitive types such
as numbers or dates. Another parser we have running
in the cognitive environment uses a semantic gram-
mar that decomposes top-level commands into ter-
minal tokens or unconstrained dictation of up to ve
words. The resulting parse tree is then transformed
into commands, each with a set of slots and llers
(Connell 2014). A third parser using Slot Grammar
(SG) is from the Watson System (McCord, Murdock,
and Boguraev 2012). It is a deep parser, producing
both syntactic structure and semantic annotations
on input sentences. Interrogatives can be identied
by the SG parser and routed directly to a version of
the IBM Watson system, with the highest condence
answer generated back to the user through text-to-
An important issue that arose early in the devel-
opment of the prototype was the imperfection of
speech transcription. We targeted a command com-
pletion rate of over 90 percent for experienced users,
but with word-recognition accuracies in the low to
mid 90 percent range and commands ranging from 5
to 20 or more words in length, we were not achiev-
ing this target. To address this deciency, we devel-
oped several mechanisms to ensure that speech-
based communication between humans and Celia
would work acceptably in practice. First, using a half
hour of utterances captured from user interaction
with the prototype, we trained a speech model and
enhanced this with a domain-specic language mod-
el using a database of 8500 company names extract-
ed from both structured and unstructured sources.
The named entity resolution agent was extended to
resolve acronyms and abbreviations and to match
both phonetically and lexicographically. To provide
Figure 3. Architecture of the Mergers and Acquisitions Prototype.
[The figure shows microphones, cameras, displays, and speakers connected to position and motion sensing, visual object recognition, speech transcript, user identity, named entity, and gestural and linguistic reference agents; a command executor; analysis agents (query processor, concept analyzer, documents); decision agents (probabilistic decision support); visualization agents (decision table, cog browser, information formatter, and graph viewer); and text-to-speech.]
better feedback to users, we added a speech tran-
script. Celia’s utterances are labeled with “Celia,” and
a command expression is also shown to reect the
command that the system processed. If the system
doesn’t respond as expected, users can see whether
Celia misinterpreted their request, and if so, reissue
it. Finally, we implemented an undo feature to sup-
port restoration of the immediately prior state when
the system misinterprets the previous command.
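The default parser's regular expression matching and template filling, described earlier, might look roughly like this; the patterns, attention-word handling, and semantic representation are simplified illustrations rather than the system's actual grammar.

```python
import re

# Simplified illustration of regex-and-template command parsing: utterances
# addressed to the attention word are matched against subject-verb-object
# patterns, and named groups fill a semantic representation.
PATTERNS = [
    (re.compile(r"^celia,?\s+show\s+(?:me\s+)?(?P<entity>.+)$", re.I),
     "display"),
    (re.compile(r"^celia,?\s+compare\s+(?P<entity>.+?)\s+and\s+(?P<other>.+)$", re.I),
     "compare"),
]

def parse(utterance):
    for pattern, command in PATTERNS:
        match = pattern.match(utterance.strip())
        if match:
            return {"command": command, **match.groupdict()}
    return None  # utterance not recognized as a command

parsed = parse("Celia, compare Acme Analytics and Beta Cloud")
```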
The gestural and linguistic reference agent is
responsible for fusing inputs from multiple modes
into a single command. It maintains persistent refer-
ents for recently displayed visual elements and
recently mentioned named entities for the duration
of a session. When Celia or a user mentions a partic-
ular company or other named entity, this agent cre-
ates a referent to the entity. Likewise, when the dis-
play manager shows a particular company or other
visual element at the request of a user or Celia, a ref-
erent is generated. Using the referents, this agent can
find relevant entities for users or Celia based on lin-
guistic references such as pronouns, gestures such as
pointing, or both. Typically the referent is either
something the speaker is pointing toward or is some-
thing that has recently been mentioned. In the event
that the pronoun refers to something being pointed
at, the gestural and linguistic reference agent may
need to use the visual object recognition’s output, or
if the item being pointed at resides on a screen, the
agent can request the virtual object from the appro-
priate agent.
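The resolution policy described above — prefer a pointed-at object, otherwise fall back to the most recent mention or display — can be sketched as follows (the class and field names are our own illustration, not the prototype's implementation):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Referent:
    entity: str       # e.g., a company name
    source: str       # "speech" or "display"
    timestamp: float  # when it was mentioned or shown

@dataclass
class ReferenceAgent:
    referents: list = field(default_factory=list)

    def add(self, entity: str, source: str, timestamp: float) -> None:
        # Called whenever Celia or a user mentions a named entity,
        # or the display manager shows a visual element.
        self.referents.append(Referent(entity, source, timestamp))

    def resolve(self, pointing_target: Optional[str] = None) -> Optional[str]:
        # A pointed-at object (from visual object recognition or a
        # screen agent) takes precedence over linguistic recency.
        if pointing_target is not None:
            return pointing_target
        if not self.referents:
            return None
        return max(self.referents, key=lambda r: r.timestamp).entity
```

A session-scoped list of referents like this is enough to resolve a pronoun such as "it" to either the pointing target or the most recently mentioned company.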
Command Execution
The command executor agent subscribes to the com-
mand channel and oversees command execution.
Some of its functions may be domain-specific. The
command executor agent communicates over HTTP
web services to the decision agents, analysis agents,
and visualization agents. Often, the command execu-
tor serves as an orchestrator, calling a first agent,
receiving the response, and reformulating that
response to another agent. Thus, the command
executor maintains state during the utterance. A sin-
gle request from a user may trigger a cascade of agent-
to-agent communication throughout the command
execution part of the flow, eventually culminating in
activity on the displays and/or synthesized speech
being played over the speakers.
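The orchestration pattern described here can be sketched as follows; the agent names, endpoints, and message fields are illustrative assumptions, not the prototype's actual interfaces:

```python
import json
import urllib.request

class CommandExecutor:
    """Subscribes to the command channel and oversees execution.

    Calls one agent over HTTP, reformulates its response for the
    next agent, and holds state only for the current utterance.
    """

    def __init__(self, agent_urls):
        self.agent_urls = agent_urls  # e.g., {"query": "http://...", ...}

    def call_agent(self, name, payload):
        # POST a JSON payload to the named agent's REST endpoint.
        req = urllib.request.Request(
            self.agent_urls[name],
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def handle_command(self, command):
        # A single user request cascades through several agents,
        # ending in display activity and synthesized speech.
        companies = self.call_agent("query", command["criteria"])
        self.call_agent("graph", {"nodes": companies})
        self.call_agent("tts", {"text": f"Here are {len(companies)} companies"})
        return companies
```

In a test or offline setting, `call_agent` can be overridden with a stub, which also illustrates why routing all inter-agent traffic through one method keeps the orchestration logic independent of transport details.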
Decision Making
In the mergers and acquisitions application, people
have high-level goals that they want the system to
help them achieve. A common approach from deci-
sion theory is to specify a utility function (Walsh et
al. 2004). Given a utility function defined in terms of
attributes that are of concern to the user, the system’s
objective is to take actions or adjust controls so as to
reach a feasible state that yields the highest possible
utility. A particularly attractive property of this
approach is that utility can be used to propagate
objectives through a system from one agent to anoth-
er. However, experience with our symbiotic cognitive
computing prototype suggests that this traditional
approach misses something vital: people start with
inexact notions of what they want and use computa-
tional tools to explore the options. It is only during
this sometimes-serendipitous exploration process
that they come to understand their goals better. Cer-
tain companies appeal to us, sometimes before we
even know why, and it can take a serious introspec-
tion effort (an effort that may be assisted by the cog-
nitive system) to discover which attributes matter
most to us or to realize we may be biased. This real-
ization prompted us to design a probabilistic decision
support agent (Bhattacharjya and Kephart 2014).
This agent starts with a set of candidate solutions to
the decision support problem and attributes, and a
highly uncertain model of user preferences. As the
user accepts or rejects its recommendations, or
answers questions about trade-offs, the agent pro-
gressively sharpens its understanding of the users’
objectives, which it models as a probability distribu-
tion of weights in the space of possible utility func-
tions. The agent is able to recommend filtering
actions, such as removing companies or removing
attributes, to help users converge on a small number
of targets for mergers and acquisitions.
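The weight-distribution idea can be illustrated with a small sketch (ours, not the actual agent of Bhattacharjya and Kephart 2014): represent uncertainty about the utility function as a sample of weight vectors over the attributes, and discard samples inconsistent with each observed preference.

```python
import random

def random_weights(n_attrs, rng):
    # Sample a weight vector and normalize it to sum to 1.
    raw = [rng.random() for _ in range(n_attrs)]
    total = sum(raw)
    return [w / total for w in raw]

def utility(weights, option):
    # Linear additive utility over normalized attribute values.
    return sum(w * x for w, x in zip(weights, option))

def update(samples, preferred, rejected):
    # Keep only weight vectors consistent with "preferred beats
    # rejected"; survivors approximate the sharpened posterior.
    return [w for w in samples if utility(w, preferred) > utility(w, rejected)]

rng = random.Random(42)
samples = [random_weights(2, rng) for _ in range(1000)]
# Suppose the user prefers option (0.9, 0.2) over (0.3, 0.8),
# e.g., higher revenue but fewer employees:
samples = update(samples, (0.9, 0.2), (0.3, 0.8))
# The surviving samples now put more weight on the first attribute.
avg_w0 = sum(w[0] for w in samples) / len(samples)
```

Each accepted or rejected recommendation prunes the sample in the same way, so the distribution over utility functions progressively concentrates on the user's actual trade-offs.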
Text and Data Analysis
The concept analyzer agent provides additional data
for decision making by extracting concepts and rela-
tionships from documents. While IBM Watson was
trained for general question answering using prima-
rily open web sources such as Wikipedia, we antici-
pate that most applications will also involve har-
nessing data from private databases and third-party
services. For the mergers and acquisitions applica-
tion, we developed a large database of company
annual reports. Concepts extracted from the reports
can be linked to the companies displayed in the
graph viewer and when a user asks for companies
similar to a selected company, the system is able to
retrieve companies through the concepts and rela-
tionships. The query processor agent provides a
query interface to a database of company financial
information, such as annual revenue, price-earnings
ratio, and income.
Persistent Session Information
We have added the ability for the cognitive environ-
ment to capture the state of the interaction between
users and agents either as needed or at the end of a
session. It does this by creating a snapshot of what
agents are active, what is being displayed, what has
been said, and what commands have been complet-
ed. The snapshots are saved and accessible from any
device, thus enabling users to take products of the
work session outside of the cognitive environment.
The session capture feature allows users to review
past decisions and can support case-based reasoning
(Leake 1996). It also provides continuity because
users can stop a decision-making process and contin-
ue later. Finally, it allows for some degree of portabil-
ity across multiple cognitive environments.
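A session snapshot of this kind might be serialized as follows (a minimal sketch under assumed field names; the prototype's actual format is not specified here):

```python
import json
import time

def capture_snapshot(active_agents, displayed, transcript, commands):
    """Serialize the interaction state so it can be reviewed or
    resumed later, from any device or cognitive environment."""
    return json.dumps({
        "captured_at": time.time(),
        "active_agents": active_agents,  # which agents are active
        "displayed": displayed,          # what is on the screens
        "transcript": transcript,        # what has been said
        "commands": commands,            # what has been completed
    })

def restore_snapshot(blob):
    # A later session (possibly in a different environment)
    # can rebuild its state from the saved snapshot.
    return json.loads(blob)
```

Because the snapshot is plain JSON, it can be saved centrally and fetched from any device, which is what gives the session continuity and portability described above.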
The text-to-speech agent converts text into spoken
voice. The current system has a choice of two voices:
a North American English female voice or a North
American English male voice (which was used by the
IBM Watson system). In order to keep interruptions
to a minimum, the speech output is used sparingly
and usually in conjunction with the visual display.
The visualization agents work in the cognitive envi-
ronment’s multiuser multiscreen networked environ-
ment. Celia places content on the 25 displays and
either the visualization agents or the users then
manipulate the content in three dimensions. We
designed the M&A application visualizations to work
together in the same visual space using common
styles and behaviors. The display manager coordi-
nates content placement and rendering using place-
ment defaults and constraints. The cog browser
enables people to find and learn about the available
cogs in the cognitive environment. It displays a cloud
of icons representing the society of cogs. The icons
can be expanded to reveal information about each
cog and how it is invoked.
Many agents have functions that can be addressed
through speech commands or through gesture. For
example, the information formatter can display com-
pany information (on the right in gure 2) and
allows a user to say “products” or select the products
tab to get more information about the company’s
products. In addition, many of the agents that pro-
vide building blocks for Celia’s decision-making
functions are cogs that are also independently
addressable through visual interfaces and speech
commands. For example, the graph viewer can be
used by Celia to recommend companies but is also
available to users for visualizing companies meeting
various criteria.
Agent Management
We have implemented several management modules
that operate in parallel with the other parts of the
system to allow agents to find instances of other
agents that are running and thereby enable discov-
ery and communication. When an agent is first
launched, it registers itself with the life-cycle manager
module to advertise its REST interfaces and types of
messages it publishes over specic publish-and-sub-
scribe channels. When an agent requires the services
of a second agent, it can locate an instance of the sec-
ond agent by querying the life-cycle manager, there-
by avoiding the need to know details of the second
agent’s running instance.
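The registration-and-discovery pattern can be sketched as a simple registry (our illustration; the actual life-cycle manager's interface is not specified in this article):

```python
class LifeCycleManager:
    """Minimal registry: agents advertise their REST endpoints and
    the publish-and-subscribe channels they publish on; other
    agents query it instead of hard-coding instance details."""

    def __init__(self):
        self.registry = {}

    def register(self, agent_type, rest_url, channels):
        # Called by an agent when it is first launched.
        self.registry.setdefault(agent_type, []).append(
            {"url": rest_url, "channels": channels})

    def locate(self, agent_type):
        # Return any running instance of the requested agent type,
        # or None if no instance has registered.
        instances = self.registry.get(agent_type)
        return instances[0] if instances else None
```

An agent needing, say, the speech transcription service asks the manager to `locate("speech")` and uses whatever instance comes back, which is what decouples agents from one another's deployment details.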
The agents used in the M&A application and oth-
ers are available as cogs in the CEL and work as one
intelligent agent to provide access to cognitive com-
puting services. Taken together, the agents provide a
completely new computing experience for business
users facing tough decisions.
A sample dialogue is shown in figure 4. To handle
exchange 1, the system processes the first sentence
using the speech recognition engine and sends a
message with the transcribed text to the message
broker on the transcript channel. The natural lan-
guage parser listens on the transcript channel and
publishes a semantic representation based on a
dependency parse of the input that identifies the
name Brian in the object role. The user identity
tracking agent resolves the name against the persist-
ent identifier for Brian and publishes a command on
the command channel with a set user identier. The
command executor then requests the speech recog-
nition agent to start using the speaker’s speech mod-
el. The next sentence is processed similarly, but the
gestural and linguistic reference agent resolves “I” to
“Brian.” The verb “help” is recognized as the main
action and the actor is the persistent identifier for
Brian. The command executor recognizes the repre-
sentation as the initialization of the mergers and
acquisitions application using a pattern rule. It calls
the display manager, which saves and hides the state
of the displays. The command executor then gener-
ates the response that is sent to the text to speech
agent. This requires querying the user identity tracking agent to map the persistent identifier for Brian to his name.

Figure 4. A Sample Dialogue Processed
by the Mergers and Acquisitions Prototype.
Exchange 1
Brian: Celia, this is Brian. I need help with acquisitions.
Celia: Hello Brian, how can I help you with mergers and acquisitions?
Exchange 2
Brian: Celia, show me companies with revenue between $25 million and $50
million and between 100 and 500 employees, pertaining to analytics.
Celia: Here is a graph showing 96 companies pertaining to biotechnology
(Celia displays the graph).
Exchange 3
Brian: Celia, place the companies named brain science, lintolin, and tata, in a
decision table.
Celia: Ok. (Celia shows a table with the 3 companies, one per row, and with
columns for the name of the company, the revenue, and number of employees).
Celia: I suggest removing Lyntolin. Brain Sciences, Incorporated has greater
revenue and greater number of employees (Celia highlights Brain Sciences and Lyntolin).
To handle exchange 2, a similar flow happens,
except the command executor calls the query proces-
sor agent to find companies matching the revenue
and company size criteria. Upon receiving the
response, the command executor calls the graph
viewer, which adds nodes representing each match-
ing biotechnology company to a force-directed graph
on the display (in the center in gure 2). The com-
mand executor also calls the text-to-speech agent to
play an acknowledgement over the speakers. The
user can then manipulate the graph using gestures or
issue further speech commands.
For exchange 3, the command executor first calls
the named entity resolver three times to resolve the
exact names of the companies referred to by the user;
for example, it might resolve "brain science" into
“Brain Sciences, Incorporated.” Upon receiving these
responses, the command executor calls the query
processor agent to obtain company information,
which it then sends to the probabilistic decision sup-
port agent. This agent must interact with the deci-
sion table agent, which in turn uses the display man-
ager to display the output to the user. While all of
this is happening, the executor also calls the text-to-
speech agent to acknowledge the user’s request.
Coordination between speech and display thus hap-
pens in the command executor. During this interac-
tion, the transcript displayer and command display-
er show the utterance and the interpreted command.
In the next section, we discuss our work to date on
the cognitive environment in terms of both prior
work and our original symbiotic cognitive comput-
ing principles.
Prior intelligent decision-making environments
focus primarily on sensor fusion, but fall short of
demonstrating an intelligent meeting participant
(Ramos et al. 2010). The New EasyLiving Project
attempted to create a coherent user experience, but
was focused on integrating I/O devices (Brumitt et al.
2000). The NIST Meeting Room has more than 280
microphones, seven HD cameras, a smart white-
board, and a locator system for the meeting attendees
(Stanford et al. 2003), but little in the way of intelli-
gent decision support. The CALO (Cognitive Assis-
tant that Learns and Organizes) DARPA project
includes a meeting assistant that captures speech,
pen, and other meeting data and produces an auto-
mated transcript, segmented by topic, and performs
shallow discourse understanding to produce a list of
probable action items (Voss and Ehlen 2007), but it
does not focus on multimodal interaction.
The experience of building and using the M&A
application has been valuable in several respects.
First, while we haven’t yet run a formal evaluation,
we’ve found that the concept of a cognitive environ-
ment for decision making resonates well with busi-
ness users. To date, more than 50 groups of industry
executives have seen a demonstration and provided
feedback. We are now working closely
with the mergers and acquisitions specialists at IBM
to bring aspects of the prototype into everyday use.
Second, the prototype has helped us to refine and at
least partially realize the symbiotic cognitive com-
puting principles defined in this article, and to gain a
better understanding of the nature of the research
challenges. Here we assess our work on the prototype
in terms of those principles.
First, how much of the symbiosis is grounded in
the physical and cognitive circumstances? We have
just started to explore the use of physical and lin-
guistic context to shape the symbiosis. Some aspects
of maintaining presence are implemented. For exam-
ple, motion tracking and visual recognition are used
to capture the presence of people in the room. How-
ever, endowing intelligent agents with the ability to
effectively exploit information about individual peo-
ple and their activities and capabilities remains a sig-
nificant research challenge. Multiple people in the
cognitive environment can interact with the system,
but the system’s ability to associate activities with
individual users is limited. The session capture agent
tracks both human commands and agent actions, but
additional work is required to reect the activity of
people and agents in the environment back to par-
ticipants. The linguistic and gestural reference agent
maintains some useful context for switching between
applications, but additional research is needed to
exploit this context to enable an extended dialogue.
Second, how much does the cognitive environ-
ment support the connection principle, enabling
people and intelligent agents to engage with one
another? We feel that the architecture and imple-
mentation support all of the requirements, at least to
some degree. First, barriers between human intention
and system execution are reduced by multimodal
interactions that allow users to converse with the sys-
tem almost as if it were a human partner rather than
having to deal with the cumbersome conventions of
typical user interfaces, but the lack of affordances in
speech-based interaction remains a challenge. The
cog browser provides users with some understanding
of the capabilities of various cogs but the system does
not offer assistance. The system supports interactions
across multiple environments; cogs can in effect fol-
low the user to different physical spaces, marshaling
the input and output resources that they nd there —
thereby reducing the time required to initiate system
computations and actions when moving across cog-
nitive environments. The cognitive environment
cooperates with users in multiple ways: decision
agents use elicitation techniques to develop an
understanding of user goals and trade-offs and then
to guide users toward decisions that best realize
them; Celia listens for commands, and the command
executor responds only when adequate information
is available for a response. Because multiple users can
see the same display and Celia has access to displayed
objects through the display manager, Celia can track
and respond to their coordinated actions.
Does the cognitive environment support the rep-
resentation principle? The CEL and its agents main-
tain representations that are the basis for communi-
cation between participants and with Celia. The
identity and location of users in the room, the devel-
oping conversation with Celia and recognized com-
mands, and the state of ongoing decisions are all cap-
tured, externalized, and shared outside the
environment, providing common ground between
people in the physical environment and those
connected to the environment only remotely or peri-
odically. In future work, we would like to recognize
individual differences in representation preferences,
and be able to conduct “private” interactions with
individual users through the media of their choice.
We have realized the modularity requirements of
self-description, composition, and intercomponent
communication by implementing the cognitive envi-
ronment as a multiagent system. Some agents are
strongly human centered, providing services such as
speech and gesture transcription, speech synthesis, or
visualization. Others mainly serve the needs of other
agents. For example, the life-cycle manager facilitates
agent communication and composition by enabling
agents to advertise their capabilities to one another
and use one another’s services. An important research
challenge is to create deeper semantic descriptions of
services to allow users to select and compose services
as needed through natural language dialogue.
How does the cognitive environment support
adaptation, improving with time? Currently most of
the system’s improvement is offline, not during the
dialogue. For example, we trained speech models and
extended the language model with custom diction-
aries. Users’ gestural vocabularies could also be mod-
eled and interpreted, or we could develop individu-
alized models of combinations of speech, gesture,
and larger movements. We currently capture useful
data during interactions with users that can be used
to improve interactions in the future. For example,
the system captures linguistic and gestural context,
which can in principle be mined by other agents
seeking to detect patterns that might be used to bet-
ter anticipate user needs. Ultimately we would like
cognitive systems to adapt to users’ goals and capa-
bilities, available I/O resources, and available cogs to
maximize the effectiveness of the entire session and
improve the symbiotic relationship between users
and the system.
Despite our successes with engineering a symbiot-
ic cognitive computing experience, practical applica-
tions continue to be a challenge: speech commands
in a noisy room are often misrecognized, and we can-
not reliably identify individual speakers; the tracking
of gestures is still error prone with both wands and
hand gestures; and natural language inputs require
domain-specific natural language engineering to
map commands to the proper software services and
invoke decision, analysis, and visualization agents.
Despite these challenges, the cognitive environment
provides a valuable test bed for integrating a variety
of IBM Research cognitive computing technologies
into new scenarios and business applications.
This article introduced our work on symbiotic cog-
nitive computing. We outlined five principles: con-
text, connection, representation, modularity, and
adaptation, and we showed how key requirements
that flow from these principles could be realized in a
cognitive environment. We have started to apply
this environment to real business problems, includ-
ing strategic decision making for corporate mergers
and acquisitions.
Reaching a true symbiosis between cognitive com-
puting and human cognition is a significant multi-
year challenge. The IBM Watson question-answering
system and other intelligent agents can be embed-
ded in physical spaces, but additional research is
needed to create cognitive computing systems that
can truly sense the world around them and fully
interact with people to solve difficult problems.
Our future directions include detection and
understanding of emotion, cognitive computing in
virtual and mixed reality, simultaneous speech and
gesture understanding, integration of uncertain data
from sensors into real-time interaction, and machine
learning to improve decisions over time. We are also
interested in exploring tasks such as information dis-
covery, situational assessment, and product design
where difficult decisions require bringing together
people who have complementary skills and experi-
ence and providing them with large amounts of
structured and unstructured data in one collabora-
tive multisensory multimodal environment.
The IBM Watson Jeopardy! system demonstrated
the ability of machines to achieve a high level of per-
formance at a task normally considered to require
human intelligence. We see symbiotic cognitive
computing as the next natural step in the evolution
of intelligent machines: creating machines that are
embedded in the world and integrate with every
aspect of life.
Acknowledgments
Thanks also to Wendy Kellogg, Werner Geyer, Casey
Dugan, Felicity Spowart, Bonnie John, Vinay
Venkataraman, Shang Gao, Mishal Dholakia, Tomas
Beren, Yedendra Shirnivasan, Mark Podlaseck, Lisa
Amini, Hui Su, and Guru Banavar.
References
Anderson, J. 1983. The Architecture of Cognition. Mahwah,
NJ: Lawrence Erlbaum Associates.
Anderson, J.; Farrell, R.; and Sauers, R. 1984. Learning to
Program in LISP. Cognitive Science 8(2): 87–129.
Bhattacharjya, D., and Kephart, J. O. 2014. Bayesian Inter-
active Decision Support for Multi-Attribute Problems with
Even Swaps. In Proceedings of the 30th Conference on Uncer-
tainty in Artificial Intelligence, 72–81. Seattle, WA: AUAI Press.
Bolt, R. A. 1980. Put-That-There. In Proceedings of the 7th
Annual Conference on Computer Graphics and Interactive Tech-
niques: SIGGRAPH ’80, 262–270. New York: Association for
Computing Machinery.
Brumitt, B. L.; Meyers, B.; Krumm, J.; Kern, A.; and Shafer,
S. 2000. EasyLiving: Technologies for Intelligent Environ-
ments. In Handheld and Ubiquitous Computing, 2nd Interna-
tional Symposium, September, Lecture Notes in Computer
Science Volume 1927, ed. P. Thomas and H.-W. Gellersen,
12–27. Berlin: Springer.
Carbini, S.; Delphin-Poulat, L.; Perron, L.; and Viallet, J.
2006. From a Wizard of Oz Experiment to a Real Time
Speech and Gesture Multimodal Interface. Signal Processing
86(12): 3559–3577.
Connell, J. 2014. Extensible Grounding of Speech for Robot
Instruction. In Robots That Talk and Listen, ed. J.
Markowitz, 175–201. Berlin: Walter de Gruyter GmbH and Co.
Ferrucci, D.; Brown, E.; Chu-Carroll, J.; Fan, J.; Gondek, D.;
Kalyanpur, A. A.; Lally, A.; Murdock, J. W.; Nyberg, E.;
Prager, J.; Schlaefer, N.; and Welty, C. 2010. Building Wat-
son: An Overview of the DeepQA Project. AI Magazine 31(3): 59–79.
Genesereth, M. R., and Ketchpel, S. P. 1994. Software
Agents. Communications of the ACM 37(7): 48–53.
Grice, P. 1975. Logic and Conversation. In Syntax and
Semantics. 3: Speech Acts, ed. P. Cole and J. Morgan, 41–58.
New York: Academic Press.
Hutchins, E. 1995. Cognition in the Wild. Cambridge, MA:
The MIT Press.
Kelly, J., and Hamm, S. 2013. Smart Machines: IBM’s Watson
and the Era of Cognitive Computing. New York: Columbia Uni-
versity Press.
Krum, D.; Omoteso, O.; Ribarsky, W.; Starner, T.; and
Hodges, L. 2002. Speech and Gesture Multimodal Control
of a Whole Earth 3D Visualization Environment. In Pro-
ceedings of the 2002 Joint Eurographics and IEEE TCVG Sym-
posium on Visualization, 195–200. Goslar, Germany: Euro-
graphics Association.
Laird, J.; Newell, A.; and Rosenbloom, P. 1987. SOAR: An
Architecture for General Intelligence. Articial Intelligence
33(1): 1–64.
Langley, P.; Laird, J. E.; and Rogers, S. 2009. Cognitive Archi-
tectures: Research Issues and Challenges. Cognitive Systems
Research 10(2): 141–160.
Leake, D. B. 1996. Case-Based Reasoning: Experiences, Lessons,
and Future Directions. Menlo Park, CA: AAAI Press.
Licklider, J. C. R. 1960. Man-Computer Symbiosis. IRE
Transactions on Human Factors in Electronics Volume HFE-
1(1): 4–11.
McCord, M. C.; Murdock, J. W.; and Boguraev, B. K. 2012.
Deep Parsing in Watson. IBM Journal of Research and Devel-
opment 56(3.4): 3:1–3:15.
Minsky, M. 1988. The Society of Mind. New York: Simon and Schuster.
Modha, D. S.; Ananthanarayanan, R.; Esser, S. K.; Ndirango,
A.; Sherbondy, A. J.; and Singh, R. 2011. Cognitive Com-
puting. Communications of the ACM 54(8): 62–71.
Nardi, B. A. 1996. Activity Theory and Human-Computer
Interaction. In Context and Consciousness: Activity Theory and
Human-Computer Interaction, ed. B. Nardi, 7–16. Cambridge,
MA: The MIT Press.
Oviatt, S., and Cohen, P. 2000. Perceptual User Interfaces:
Multimodal Interfaces that Process what Comes Naturally.
Communications of the ACM 43(3): 45–53.
Ramos, C.; Marreiros, G.; Santos, R.;
and Freitas, C. F. 2010. Smart Ofces and Intelligent Deci-
sion Rooms. In Handbook of Ambient Intelligence and Smart
Environments, ed. H. Nakashima, H. Aghajan, and J. C.
Augusto, 851–880. Berlin: Springer.
Romano, N. C., and Nunamaker, J. F. 2001. Meeting Analy-
sis: Findings from Research and Practice. In Proceedings of the
34th Annual Hawaii International Conference on System Sci-
ences. Los Alamitos, CA: IEEE Computer Society.
Sharma, R.; Yeasin, M.; Krahnstoever, N.; Rauschert, I.; Cai,
G.; Brewer, I.; MacEachren, A.; and Sengupta, K. 2003. Speech-
Gesture Driven Multimodal Interfaces for Crisis Manage-
ment. Proceedings of the IEEE 91(9): 1327–1354.
Shrobe, H.; Coen, M.; Wilson, K.; Weisman, L.; Thomas, K.;
Groh, M.; Phillips, B.; Peters, S.; Warshawsky, N.; and Finin,
P. 2001. The Intelligent Room. MIT AI Laboratory AFRL-IF-
RS-TR-2001-168 Final Technical Report. Rome, New York:
Air Force Research Laboratory.
Soltau, H.; Saon, G.; and Kingsbury, B. 2010. The IBM Atti-
la Speech Recognition Toolkit. In 2010 IEEE Workshop on
Spoken Language Technology, SLT 2010 — Proceedings, 97–
102. Piscataway, NJ: Institute for Electrical and Electronics Engineers.
Stanford, V.; Garofolo, J.; Galibert, O.; Michel, M.; and
Laprun, C. 2003. The (NIST) Smart Space and Meeting Room
Projects: Signals, Acquisition Annotation, and Metrics. In
Proceedings of the IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP ’03), 4, 6–10. Piscat-
away, NJ: Institute for Electrical and Electronics Engineers.
Voss, L. L., and Ehlen, P. 2007. The CALO Meeting Assistant.
In Proceedings of Human Language Technologies: The Annual
Conference of the North American Chapter of the Association for
Computational Linguistics: Demonstrations. Stroudsberg, PA:
Association for Computational Linguistics.
Walsh, W. E.; Tesauro, G.; Kephart, J. O.; and Das, R. 2004.
Utility Functions in Autonomic Systems. In Proceedings of the
1st International Conference on Autonomic Computing (ICAC
2004), 70–77. Piscataway, NJ: Institute for Electrical and
Electronics Engineers.
Weiser, M., and Brown, J. S. 1996. Designing Calm Technol-
ogy. PowerGrid Journal 1.01, 1–5.
Robert Farrell is a research staff member at the IBM T. J.
Watson Research Center in Yorktown Heights, NY, USA. He
has a long-term research interest in the cognitive processes
of human learning, knowledge representation, reasoning,
and language understanding. His past work includes cogni-
tive models, intelligent tutoring systems, and social com-
puting applications. He is currently working on software to
extract knowledge from unstructured information sources.
Jonathan Lenchner is chief scientist at IBM Research-
Africa. Previously he was one of the founders of the IBM
Cognitive Environments Lab in Yorktown Heights, NY. His
research interests include computational geometry, robot-
ics, AI, and game theory. His recent work includes research
on humanoid robots and development of an immersive
environment to help a professional sports team with trades
and draft picks.
Jeffrey Kephart is a distinguished research staff member at
IBM T. J. Watson Research Center, and a Fellow of the IEEE.
He is known for his work on computer virus epidemiology
and immune systems, self-managing computing systems,
electronic commerce, and data center energy management.
Presently, he serves as a principal investigator on a cognitive
computing research project with a large energy company
and leads work on applying intelligent agent technologies
to corporate mergers and acquisitions.
Alan Webb is a senior software engineer at the IBM T. J.
Watson Research Center. His present research interests are
focused upon applying the principles of distributed cogni-
tion as an inspiration for pervasive cognitive environments.
He is currently working on a generalized system architec-
ture for the cognitive environment and development of the
mergers and acquisitions application.
Michael Muller is a research staff member in the Cognitive
User Experience group at IBM Research in Cambridge, MA.
His research areas have included collaboration in health
care, metrics and analytics for enterprise social software,
participatory design, and organizational crowdfunding. His
current work focuses on employee experiences in the workplace.
Thomas Erickson is a social scientist and interaction
designer at the IBM T. J. Watson Research Center. His
research has to do with designing systems that enable
groups of people to interact coherently and productively in
both virtual and real environments.
David Melville is a research staff member at IBM T. J. Wat-
son Research Center. His research interests include immer-
sive data spaces, spatial computing, adaptive physical archi-
tecture, and symbiotic experience design.
Rachel Bellamy is a principal research staff member and
group manager at IBM T. J. Watson Research Center and
heads the Research Design Center. Her general area of
research is human-computer interaction and her current
work focuses on the user experience of symbiotic cognitive computing.
Daniel Gruen is a cognitive scientist in the Cognitive User
Experience group at IBM Research in Cambridge, MA. He is
interested in the design of systems that let strategic decision
makers seamlessly incorporate insights from cognitive com-
puting in their ongoing deliberations and creative thinking.
He is currently working with a variety of companies to under-
stand how such systems could enhance the work they do.
Jonathan Connell is a research staff member at IBM T. J.
Watson Research Center. His research interests include com-
puter vision, machine learning, natural language, robotics,
and biometrics. He is currently working on a speech-driven
reactive reasoning system for multimodal instructional dialogue.
Danny Soroker is a research staff member at IBM T. J. Wat-
son Research Center. His research interests include intelli-
gent computation, human-computer interaction, algo-
rithms, visualization, and software design. He is currently
working on agents to support problem solving and decision
making for corporate mergers and acquisitions.
Andy Aaron is a research staff member at IBM T. J. Watson
Research Center. He was on the speech team for the IBM
Watson Jeopardy! match. Along with his work in speech
synthesis and speech recognition, he has done sound design
for feature films and produced and directed TV and films.
Shari Trewin is a research staff member at IBM T. J. Watson
Research Center. Her current research interests include mul-
timodal human-computer interaction and accessibility of
computer systems. She is currently working on knowledge
extraction from scientic literature and interaction designs
for professionals working with this extracted knowledge.
Maryam Ashoori is a design researcher at IBM T. J. Watson
Research Center. She has a passion for exploring the inter-
section between art and computer science. Her work has
resulted in several novel applications for the Cognitive
Environments Laboratory, including a "Zen Garden" for
unwinding after a long day and a service for sparking cre-
ativity for inventors.
Jason Ellis is a research staff member at IBM T. J. Watson
Research Center. His research interests include social com-
puting and usability. He is currently working on collabora-
tive user interfaces for cognitive systems.
Brian Gaucher is senior manager of the Cognitive Envi-
ronments Laboratory at IBM T. J. Watson Research Center.
He leads teams specializing in user experience design and
physical infrastructure of cognitive computing environ-
ments. His work focuses on the creation of highly interac-
tive physical spaces designed to improve decision making
through always-on ambient intelligence.
Dario Gil is the vice president of science and technology
for IBM Research. As director of the Symbiotic Cognitive
Systems department, he brought together researchers in
articial intelligence, multiagent systems, robotics,
machine vision, natural language processing, speech tech-
nologies, human-computer interaction, social computing,
user experience, and interaction design to create symbiotic
cognitive computing technology, services, and applications
for business.
FALL 2016 93
A generally intelligent cognitive machine is deliberative, reflective, adaptive, empathetic and rational; continually aiming for the best responses to complex real-world events. This article describes our efforts towards the realization of such a machine – an embodied machine-mind – endowed with abilities of multisensory processing, commonsense reasoning, reflection, consciousness, and empathy. A procedure for contemplation and comprehension by the machine-mind has been formalized. It uses: a) Z*-numbers for active-thought abstraction, b) multisensory data-structures to encapsulate objective and subjective real-time inputs, and transport these across framework-modules, and c) dynamic action-descriptor formulations for bespoke behavior, perception, and real-time attention modulation. The defined procedure acknowledges the machine-mind's interest in the subject being comprehended, and covers different levels of thinking (instinctive reactions to self-consciousness reasoning). This investigation contributes to the synthesis of ‘thinking machines’ for man-machine symbiosis.
Explainability has been an important goal since the early days of Artificial Intelligence. Several approaches for producing explanations have been developed. However, many of these approaches were tightly coupled with the capabilities of the artificial intelligence systems at the time. With the proliferation of AI-enabled systems in sometimes critical settings, there is a need for them to be explainable to end-users and decision-makers. We present a historical overview of explainable artificial intelligence systems, with a focus on knowledge-enabled systems, spanning the expert systems, cognitive assistants, semantic applications, and machine learning domains. Additionally, borrowing from the strengths of past approaches and identifying gaps needed to make explanations user- and context-focused, we propose new definitions for explanations and explainable knowledge-enabled systems.
In human-robot collaborative assembly, robots are often required to dynamically change their pre-planned tasks to collaborate with human operators in a shared workspace. However, the robots used today are controlled by pre-generated rigid codes that cannot support effective human-robot collaboration. In response to this need, multi-modal yet symbiotic communication and control methods have been a focus in recent years. These methods include voice processing, gesture recognition, haptic interaction, and brainwave perception. Deep learning is used for classification, recognition and context awareness identification. Within this context, this keynote provides an overview of symbiotic human-robot collaborative assembly and highlights future research directions.
This research focuses on developing a robot digital twin (DT) and the communication methods to connect it with the corresponding physical robot in collaborative human–robot construction work. Robots are being increasingly deployed on construction sites to assist human workers with physically demanding work tasks. Robot simulations in a process-level DT can be used to extend design models, such as building information modeling, to the construction phase for real-time monitoring of robot motion planning and control. Robots can be enabled to plan work tasks and execute them in the DT simulations. Once simulated tasks and trajectories are approved by human workers, commands can be sent to the physical robots to perform the tasks. However, a system to bridge a virtual DT and a physical robot and allow for such communication to occur is a capability that has not been readily available thus far, primarily due to the complexity involved in physical robot operations. This paper discusses the development of a system to bridge robot simulations and physical robots in construction and digital fabrication. The Gazebo robot simulator is used for DT, and the robot operating system is leveraged as the primary framework for bi-directional communication with the physical robots. The virtual robots in Gazebo receive planned trajectories from motion planners and then send the commands to the physical robots for execution. Two different robot control modes, i.e., joint angle control mode and Cartesian path control mode, are developed to accommodate various construction strategies. The system is implemented in a digital fabrication case study with a full-scale KUKA KR120 six-degrees-of-freedom robotic arm mounted on a track system. We evaluated the system by comparing the data transmission time, joint angles, and end-effector pose between the virtual and physical robot using several planned trajectories and calculated the average and maximum mean square errors. 
The results showed that the proposed real-time process-level robot DT system can plan the robot trajectory inside the virtual environment and execute it in the physical environment with high accuracy and real-time performance, offering the opportunity for further development and deployment of the collaborative human–robot work paradigm on real construction sites.
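The evaluation above compares virtual and physical joint-angle trajectories by mean squared error. A minimal sketch of that comparison, with invented trajectory data and a hypothetical `trajectory_mse` helper (not the authors' system):

```python
def trajectory_mse(virtual, physical):
    """Mean squared error between two joint-angle trajectories.

    Each trajectory is a list of waypoints; each waypoint is a list of
    joint angles (radians), one per axis of the arm.
    """
    assert len(virtual) == len(physical)
    total, count = 0.0, 0
    for v_pt, p_pt in zip(virtual, physical):
        for v, p in zip(v_pt, p_pt):
            total += (v - p) ** 2
            count += 1
    return total / count

# Invented data: a 3-waypoint trajectory over 2 of the arm's joints.
virtual  = [[0.00, 0.50], [0.10, 0.60], [0.20, 0.70]]
physical = [[0.01, 0.49], [0.09, 0.61], [0.21, 0.70]]
print(trajectory_mse(virtual, physical))
```

In the paper this comparison would run over the full six-axis trajectories streamed between Gazebo and the physical KUKA arm.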
Room-sized immersive environments and interactive 3D spaces can support powerful visualizations and provide remarkable X-reality experiences to users. However, designing and developing applications for such spaces, in which user interaction takes place not only by gestures but also through body movements, is a demanding task. At the same time, contemporary software development methods and human-centered design mandate short iteration cycles and incremental development of prototypes. In this context, traditional design and software prototyping methods can no longer cope with the challenges imposed by such environments. In this paper, we introduce an integrated technological framework for rapid prototyping of X-reality applications for interactive 3D spaces, featuring real-time person and object tracking, touch input support, and spatial sound output. The framework comprises the interactive 3D space and an API for developers.
In this paper, we describe a cognitive assistant that helps astrophysicists visualize and analyze exoplanet data via natural interactions. When users speak about and point to objects of interest, the assistant finds relevant data or performs requested computations, shows the results on a large display wall, and explains them with synthesized speech.
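Resolving a spoken deictic reference ("this", "that") against a pointing target, as this assistant does, can be illustrated with a toy resolver; all names and data below are invented for the sketch, not the assistant's actual API:

```python
# Simulated output of a gesture tracker: the object currently pointed at.
POINTED_AT = {"object_id": "Kepler-22b"}

def resolve_command(utterance, pointing):
    """Fuse a spoken command with a pointing target.

    Deictic words are resolved to the pointed-at object; otherwise the
    last word of the utterance is taken as the object name.
    """
    tokens = utterance.lower().split()
    if "this" in tokens or "that" in tokens:
        target = pointing["object_id"]
    else:
        target = tokens[-1]
    return {"action": tokens[0], "target": target}

print(resolve_command("plot the radius of this", POINTED_AT))
```

A real system would fuse timestamps from the speech and gesture channels rather than a single snapshot, but the resolution step has this shape.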
Traditionally, two approaches have been used to build intelligent room applications. Mouse-based control schemes allow developers to leverage a wealth of existing user-interaction libraries that respond to clicks and other events. However, systems built in this manner cannot distinguish among multiple users. To realize the potential of intelligent rooms to support multi-user interactions, a second approach is often used, whereby applications are custom-built for this purpose, which is costly to create and maintain. We introduce a new framework that supports building multi-user intelligent room applications in a much more general and portable way, using a combination of existing web technologies that we have extended to better enable simultaneous interactions among multiple users, plus speech recognition and voice synthesis technologies that support multi-modal interactions.
In the past few years, augmented reality (AR) and virtual reality (VR) technologies have experienced terrific improvements in both accessibility and hardware capabilities, encouraging the application of these devices across various domains. While researchers have demonstrated the possible advantages of AR and VR for certain data science tasks, it is still unclear how these technologies would perform in the context of exploratory data analysis (EDA) at large. In particular, we believe it is important to better understand which level of immersion EDA would concretely benefit from, and to quantify the contribution of AR and VR with respect to standard analysis workflows. In this work, we leverage a Dataspace reconfigurable hybrid reality environment to study how data scientists might perform EDA in a co-located, collaborative context. Specifically, we propose the design and implementation of Immersive Insights, a hybrid analytics system combining high-resolution displays, table projections, and AR visualizations of the data. We conducted a two-part user study with twelve data scientists, in which we evaluated how different levels of data immersion affect the EDA process and compared the performance of Immersive Insights with a state-of-the-art, non-immersive data analysis system.
Most present-day voice-based assistants require that users utter a wake-up word to signify that they are addressing the assistant. While this may be acceptable for one-shot requests such as “Turn on the lights”, it becomes tiresome when one is engaged in an extended interaction with such an assistant. To support the goal of developing low-complexity, low-cost alternatives to a wake-up word, we present the results of two studies in which users engage with an assistant that infers whether it is being addressed from the user’s head orientation. In the first experiment, we collected informal user feedback regarding a relatively simple application of head orientation as a substitute for a wake-up word. We discuss that feedback and how it influenced the design of a second prototype assistant designed to correct many of the issues identified in the first experiment. The most promising insight was that users were willing to adapt to the interface, leading us to hypothesize that it would be beneficial to provide visual feedback about the assistant’s belief about the user’s attentional state. In a second experiment conducted using the improved assistant, we collected more formal user feedback on likability and usability and used it to establish that, with high confidence, head orientation combined with visual feedback is preferable to the traditional wake-up word approach. We describe the visual feedback mechanisms and quantify their usefulness in the second experiment.
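A head-orientation trigger of the kind studied here can be sketched as a yaw-cone test with a dwell requirement, so that a brief glance does not count as addressing the assistant. The thresholds below are illustrative, not the paper's:

```python
# Illustrative attention detector: the assistant is considered addressed
# when the user's head yaw stays within a cone around the display for a
# minimum number of consecutive frames. Both constants are made up.

YAW_CONE_DEG = 15.0   # head must point within +/-15 degrees of the display
DWELL_FRAMES = 3      # ...for at least 3 consecutive frames

def addressed_frames(yaw_stream):
    """Yield (frame_index, addressed) for each yaw sample in degrees."""
    streak = 0
    for i, yaw in enumerate(yaw_stream):
        streak = streak + 1 if abs(yaw) <= YAW_CONE_DEG else 0
        yield i, streak >= DWELL_FRAMES

yaws = [40.0, 10.0, 8.0, 5.0, 3.0, 50.0]
print([addr for _, addr in addressed_frames(yaws)])
```

The per-frame `addressed` flag is exactly the state the paper's visual feedback would surface to the user, letting them see, and adapt to, what the assistant believes about their attention.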
Even swaps is a method for solving deterministic multi-attribute decision problems where the decision maker iteratively simplifies the problem until the optimal alternative is revealed (Hammond et al. 1998, 1999). We present a new practical decision support system that takes a Bayesian approach to guiding the even swaps process, where the system makes queries based on its beliefs about the decision maker's preferences and updates them as the interactive process unfolds. Through experiments, we show that it is possible to learn enough about the decision maker's preferences to measurably reduce the cognitive burden, i.e. the number and complexity of queries posed by the system.
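A single even-swap step, in which the decision maker trades one attribute for another at a stated rate and then checks for dominance, might look like the following sketch; the alternatives, attribute names, and trade-off rate are all invented:

```python
def even_swap(alt, attr_up, delta, attr_pay, rate):
    """Raise attr_up by delta, paying for it by raising attr_pay by delta * rate."""
    swapped = dict(alt)
    swapped[attr_up] += delta
    swapped[attr_pay] += delta * rate  # decision maker's stated trade-off
    return swapped

def dominates(a, b, higher_is_better):
    """True if a is at least as good as b on every attribute."""
    return all(a[k] >= b[k] if better else a[k] <= b[k]
               for k, better in higher_is_better.items())

A = {"quality": 8.0, "cost": 100.0}
B = {"quality": 6.0, "cost": 95.0}
# The decision maker says one quality point is worth $10: raise B's
# quality by 2 to match A, adding $20 of cost in compensation.
B2 = even_swap(B, "quality", 2.0, "cost", 10.0)
print(dominates(A, B2, {"quality": True, "cost": False}))
```

After the swap the two alternatives tie on quality, so the comparison reduces to cost alone and B2 can be eliminated; the Bayesian system described above chooses which swap to ask about based on its current belief about the trade-off rates.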
In the last decade, the availability of massive amounts of new data, and the development of new machine learning technologies, have augmented reasoning systems to give rise to a new class of computing systems. These "Cognitive Systems" learn from data, reason from models, and interact naturally with us, to perform complex tasks better than either humans or machines can do by themselves. In essence, cognitive systems help us perform like the best by penetrating the complexity of big data and leverage the power of models. One of the first cognitive systems, called Watson, demonstrated through a Jeopardy! exhibition match, that it was capable of answering complex factoid questions as effectively as the world's champions. Follow-on cognitive systems perform other tasks, such as discovery, reasoning, and multi-modal understanding in a variety of domains, such as healthcare, insurance, and education. We believe such cognitive systems will transform every industry and our everyday life for the better. In this talk, I will give an overview of the applications, the underlying capabilities, and some of the key challenges, of cognitive systems.
Two deep parsing components, an English Slot Grammar (ESG) parser and a predicate-argument structure (PAS) builder, provide core linguistic analyses of both the questions and the text content used by IBM Watson™ to find and hypothesize answers. Specifically, these components are fundamental in question analysis, candidate generation, and analysis of passage evidence. As part of the Watson project, ESG was enhanced, and its performance on Jeopardy!™ questions and on established reference data was improved. PAS was built on top of ESG to support higher-level analytics. In this paper, we describe these components and illustrate how they are used in a pattern-based relation extraction component of Watson. We also provide quantitative results of evaluating the component-level performance of ESG parsing.
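ESG and the PAS builder are IBM components, but the flavor of predicate-argument extraction can be illustrated over a hand-written dependency parse; the tiny format and `build_pas` function below are inventions for this sketch:

```python
# Toy predicate-argument structure (PAS) builder over a hand-written
# dependency parse. Each entry is (token, head_index, relation), with
# head_index -1 marking the root.
parse = [
    ("Watson",   1, "subj"),
    ("answered", -1, "root"),
    ("the",      3, "det"),
    ("question", 1, "obj"),
]

def build_pas(parse):
    """Group subject and object dependents under their predicate."""
    pas = {}
    for tok, head, rel in parse:
        if rel == "root":
            pas.setdefault(tok, {})
        elif rel in ("subj", "obj"):
            predicate = parse[head][0]
            pas.setdefault(predicate, {})[rel] = tok
    return pas

print(build_pas(parse))
```

The real PAS layer sits on top of ESG's much richer slot-grammar analyses, and Watson's relation-extraction patterns match against structures of this kind rather than raw text.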