Satori: Towards Proactive AR Assistant with Belief-Desire-Intention User
Modeling
CHENYI LI∗,Tandon School of Engineering, New York University, United States
GUANDE WU∗,Tandon School of Engineering, New York University, United States
GROMIT YEUK-YIN CHAN, Adobe Research, United States
DISHITA G TURAKHIA, Tandon School of Engineering, New York University, United States
SONIA CASTELO QUISPE, Visualization and Data Analytics Lab, New York University, United States
DONG LI, Tandon School of Engineering, New York University, United States
LESLIE WELCH, Brown University, United States
CLAUDIO SILVA, New York University, United States
JING QIAN, Tandon School of Engineering, New York University, United States
Fig. 1. Satori is a mind-reading, monkey-shaped monster in Japanese folklore. We name our system Satori to highlight the
importance of understanding the user state (the human mind) in building proactive AR assistants. The Satori system combines the
user's self-knowledge, the goal of the task, and the immediate user action with an LLM to provide visual assistance that is
relevant to the user's immediate needs. We call this proactive AR assistance. This is achieved by implementing the Belief-Desire-Intention
model based on two formative studies with 12 experts. Here, belief reflects whether the user knows where the
task object is and how to perform certain tasks (e.g., knowledge level); desire is the actionable goal; and intention is the
immediate next step needed to complete the actionable goal. The code will be open-sourced upon acceptance.
Augmented reality (AR) assistants are increasingly popular for supporting users with tasks like assembly and cooking. However, current
systems typically provide reactive responses initiated by user requests, lacking consideration of rich contextual and user-specific
information. To address this limitation, we propose a novel AR assistance system, Satori, that models both user states and environmental
contexts to deliver proactive guidance. Our system combines the Belief-Desire-Intention (BDI) model with a state-of-the-art multi-modal
large language model (LLM) to infer contextually appropriate guidance. The design is informed by two formative studies
involving twelve experts. A sixteen-participant within-subjects study finds that Satori achieves performance comparable to a designer-created
Wizard-of-Oz (WoZ) system without relying on manual configuration or heuristics, thereby enhancing generalizability and reusability
and opening up new possibilities for AR assistance.
CCS Concepts: •Human-centered computing →Mixed / augmented reality.
Additional Key Words and Phrases: Augmented reality assistant, proactive virtual assistant, user modeling
ACM Reference Format:
Chenyi Li, Guande Wu, Gromit Yeuk-Yin Chan, Dishita G Turakhia, Sonia Castelo Quispe, Dong Li, Leslie Welch, Claudio Silva,
and Jing Qian. 2018. Satori: Towards Proactive AR Assistant with Belief-Desire-Intention User Modeling . 1, 1 (October 2018), 42 pages.
https://doi.org/XXXXXXX.XXXXXXX
1 Introduction
Satori (悟り), a Japanese ghost-like deity long known to read human minds, responds to one's thoughts before actions
occur. While such supernatural beings belong to folklore, modern AI technologies are beginning to emulate this
capability, striving to predict human actions and provide proactive assistance during task interactions [38]. Proactive
virtual/digital assistants, which determine the optimal content and timing of assistance without explicit user commands, are gaining
traction for their ability to enhance productivity and streamline workflows by anticipating user needs from context and
past interactions [62]. However, there is a scarcity of research on how best to design and implement such
systems.
∗Both authors contributed equally to this research.
Authors’ Contact Information: Chenyi Li, chenyili@nyu.edu, Tandon School of Engineering, New York University, New York, New York, United States;
Guande Wu, guandewu@nyu.edu, Tandon School of Engineering, New York University, New York City, New York, United States; Gromit Yeuk-Yin Chan,
ychan@adobe.com, Adobe Research, San Jose, California, United States; Dishita G Turakhia, d.turakhia@nyu.edu, Tandon School of Engineering, New
York University, New York City, New York, United States; Sonia Castelo Quispe, Visualization and Data Analytics Lab, New York University, New York,
New York, United States; Dong Li, dl5214@nyu.edu, Tandon School of Engineering, New York University, Brooklyn, New York, United States; Leslie
Welch, Brown University, Providence, Rhode Island, United States; Claudio Silva, New York University, New York City, New York, United States; Jing
Qian, jqian1590@gmail.com, Tandon School of Engineering, New York University, Brooklyn, New York, United States.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on
servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Most current assistants in augmented reality (AR) remain reactive, responding to user commands or environmental
triggers without the capability for more active engagement. Such reactive systems require users to initiate interactions,
which can be inefficient in AR environments where users often have limited attention for interface interactions or
are busy with physical tasks. Some AR assistants incorporate proactive elements; for instance, in maintenance tasks, they
provide guidance based on recognized objects or components [44, 61]. Yet these systems are built with fixed rules and
often lack adaptability and reusability, struggling to generalize across diverse tasks or support effective multitasking.
Designing proactive assistance for AR is particularly challenging due to the need to understand both the user's
state and the 3D physical environment. Users often perform real-world tasks such as maintenance and assembly while
wearing AR headsets. The assistance should therefore be relevant to the user's immediate tasks and long-term
goals. Because the user's attention is limited and the tasks are assistive in nature, timely assistance is crucial: providing assistance too early,
too late, or too frequently can increase cognitive load and negatively impact the user's experience [2].
To identify the benefits of and challenges in creating proactive AR assistance, and to explore how to design it,
we conducted two formative studies. The first study involved six professional AR designers and revealed several challenges: 1) limited generalizability
and reusability of current non-proactive AR assistance; 2) difficulties in accurately detecting user intentions; and
3) the need to balance general advice with task-specific solutions. The professionals recognized that proactive AR
assistance could potentially improve scalability and efficiency, but also highlighted technical challenges related to
accurately tracking and understanding users' actions.
Building on the findings from the first study, the second formative study engaged six experts (three HCI researchers
and three psychology researchers) in participatory design sessions to explore strategies for making AR assistance more
proactive. The design sessions highlighted several key factors: understanding human actions, recognizing surrounding
objects and tools, assessing the current task, and anticipating immediate next steps. These insights were integrated
with the well-established Belief-Desire-Intention (BDI) theory, resulting in an AR-specific adaptation that guided the
development of our system, Satori.
To adapt the BDI theory for AR assistance, Satori needed to account for the limitations of AR headset hardware,
which primarily relies on egocentric vision. We built the BDI model’s prediction framework using an ensemble of
egocentric vision models combined with a multi-modal Large Language Model (LLM) to leverage its robust task-learning
capabilities. The BDI model is crucial for regulating both the timing and content of the assistance provided. We propose
a multi-modal assistance framework where the intention (i.e., immediate next step) determines the content of the
assistance. Meanwhile, scene objects and the user's action history regulate the timing of the assistance. This approach
ensures that the AR assistance delivers relevant information at appropriate moments, enhancing the user’s experience
without overwhelming them.
We evaluated Satori on four everyday AR tasks designed by six professional AR designers and found that Satori's
proactive guidance was as effective, useful, and comprehensible as the AR guidance applications created by the designers.
Additionally, Satori's guidance allowed participants to switch between tasks without the need for pre-training or
scanning. Our findings indicate that applying the BDI model not only successfully captured users' intentions
but also the semantic context of a given task, reducing the need to craft AR guidance for every specific scenario
and improving the generalizability and reusability of AR guidance.
To summarize, our contributions include:
(1) The design requirements for creating a proactive AR assistant and adapting the BDI model for AR environments, based on two consecutive formative studies involving twelve experts;
(2) An AR proactive assistant system, Satori, that applies concepts from the BDI model combined with a deep model-LLM fusion architecture to infer users' current tasks and intentions, providing guidance for immediate next steps through multimodal feedback;
(3) A sixteen-user empirical study demonstrating that our proactive AR assistant delivers performance comparable to designer-created AR guidance in terms of timing, comprehensibility, usefulness, and efficacy.
2 Related Work
2.1 Virtual Assistant in AR/VR
Virtual assistants in AR/MR can support tasks in assembly [7, 35], surgery [20, 69], maintenance [6, 22, 36], and
cooking [16]. These assistant systems are often task-oriented, and their principles do not easily generalize to other
domains. One way to improve generalizability is a command-based AR assistant, which can enhance user confidence
in the agent's real-world influence and awareness [32]. Yet such assistants require explicit user commands, which limits usability.
Our work builds on this prior research on virtual assistants in AR/MR while addressing users' needs without
explicit commands or domain limitations.
Pre-made assistance in AR and VR applications typically involves prepared actions or reminders triggered by
specific user inputs or situations. This kind of assistance is simple and intuitive, providing users with readily available
support that can be accessed on-demand or in timed sequences [45], and is straightforward to implement and use [78].
However, such assistants require extensive manual interaction to describe and confirm the user's needs. For example, Sara et al.
demonstrated an AR maintenance assistant in which the technician still needs to manually confirm the completion of each
step and proceed to the next step using touchpad controls or voice commands [78].
Proactive assistance, on the other hand, is designed to actively recognize contextual information and infer user intentions
even when they are not explicitly provided [18, 70, 80]. Such assistance normally does not require human intervention [41, 79, 94]
and can easily scale to everyday AR tasks such as health care [71, 81], navigation [66], and laboratory
education [83]. It enhances usability [82], fosters trust [39], and improves task efficiency [93]. During AR interactions,
proactive assistance often takes into account the user's surrounding environment, predicts the user's goals, and offers
context-aware recommendations, often to improve attention [29, 59, 60, 85]. However, existing proactive
assistants rely on pre-determined contextual signals such as location, time, and events to trigger the assistant to
intervene [60]. For instance, Ren et al. [76] propose a proactive interaction design method for smart product-service
systems, using functional sensors to gather explicit contextual data (e.g., physical location, light intensity, environment
temperature) to predict implicit user states (e.g., user attention level). Although these methods advance the progress
of proactive assistance, such signals may not align with users' actual needs, leading to ineffective and obtrusive
assistance [42, 91]. To address this, we propose to model the user's intent directly to determine better timing and types
of guidance.
Furthermore, even though proactive assistants have been widely used, most AR assistants today remain passive, as
defining user intents is difficult in everyday settings. One main challenge is that understanding users' intent relies
not only on explicit cues (e.g., verbal signals) but also significantly on implicit non-verbal cues and visual
embodiment [32]. Successfully decomposing and reasoning over these implicit cues improves the chances of correctly
inferring intent. Recent advancements in vision-language models offer new opportunities to understand visual
embodiment when integrated into virtual assistants. Therefore, we propose a multimodal input mechanism that takes both
voice and visual cues to support a better understanding of users' interaction intentions.
2.2 Understanding User Intention
Understanding user intention in electronic devices, ranging from smart mobile devices to augmented reality (AR)
systems, is essential for improving user interaction and experience. Research in the field of information needs has
highlighted the importance of intention classification and systematic taxonomies to achieve this goal. Broder proposed
a taxonomy of web searches, classifying intentions into navigational, informational, and transactional [11]. This
foundational work laid the groundwork for more detailed classifications. For instance, Dearman et al. categorized
information needs and sharing entries into nine distinct categories, extending the concept of information needs to
a collaborative level [17]. This classification allows developers to design products that better facilitate collaborative
sharing of information. Church et al. found that contexts such as location, time, social interactions, and user goals
influence users' information needs. For example, users generated more needs with locational or
temporal dependencies when they were on the go, and required more geographical information when
commuting. This study enabled the researchers to design an information search platform, SocialSearchBrowser, to fit different
users' information needs in a context-sensitive way [14]. Additionally, Li et al. furthered this line of research by
developing a design space of digital follow-up actions for multimodal information [52]. They classified actions into 17
types and identified seven categories of follow-up actions through qualitative analysis of users' diaries. They also deployed
the system on mobile AR and conducted a user study to test the capacity of the follow-up action designs. The study showed
that the system could accurately predict users' general actions, provide proactive assistance, and reduce friction [52].
Generally, prior studies on information needs, particularly on mobile devices, demonstrate that intention taxonomies
can inspire the design of information search systems with more proactive and contextual assistance.
2.3 Belief-Desire-Intention Framework
The Belief-Desire-Intention (BDI) model [8, 15, 30, 49] is a framework for simulating human decision-making behaviors in
both individual [74] and multi-agent settings [33, 48, 65]. The model originates from folk psychology and is extensively
applied in cognitive modeling, agent-oriented programming, and software development. It comprises three
primary components: beliefs, desires, and intentions [8]. Beliefs represent the information that humans perceive about
a situation (e.g., it is raining), limited by their perceptions. Desires are the goals that individuals aim to achieve given
the current situation (e.g., a person prefers not to get wet on a rainy day). Intentions are "conduct-controlling
pro-attitudes, ones which we are disposed to retain without reconsideration, and which play a significant role as inputs
to [means-end] reasoning" [8]. In other words, intentions capture the user's behavior towards achieving the desire (i.e., goals) by selecting
and committing to specific plans of action (e.g., the person plans to get an umbrella).
Previous studies have demonstrated the effectiveness of the BDI framework in modeling human behavior [33, 65].
Therefore, the BDI model can help in building intelligent agents for various applications. For example, in agent-oriented
programming, the BDI model is pervasively used to model an agent executing programming functions.
Agent-oriented software engineering utilizes beliefs, actions, plans, and intentions to develop programs. The BDI
model enables more rational and autonomous execution in unpredictable environments, as in AgentSpeak(L) [73],
3APL [28], JACK [12], JADEX [9], and GOAL [27]. One benefit of using the BDI framework is that it makes agent
behavior intelligible to end users and stakeholders. By committing to specific courses of action or intentions, BDI agents
enhance user understanding and the predictability of actions [1, 5, 19, 21, 25, 31, 34, 67, 75, 86].
Though BDI-inspired agents have enabled automatic decisions, making decisions in AR requires a different type of
intelligent and realistic behavior. The environment for AR applications involves complex real-world dynamics, such as
egocentric video, audio, and gestural inputs [4]. The users' interaction goals, physical actions, and surrounding context
(e.g., objects, tools, interaction agents) further increase the difficulty of providing in-time assistance [53].
Although the BDI framework has not yet been applied to AR, our work draws inspiration from the philosophy and
design of prior BDI-based systems to enhance AR assistance. With recent advancements in large language models
(LLMs), BDI-driven agents present a promising direction [5], as LLMs can naturally serve as interpreters and reasoning
machines, bridging language and text within the BDI framework.
2.4 User Modelling in Human-AI Collaboration
Modelling the user state is a long-standing problem in HCI [3, 58]. Previous research has focused on modeling user goals and intents [92],
modeling expertise to support adaptive computing systems [87], and studying user memory in
AR/MR-specific research [26, 84]. The BDI model, a commonly accepted psychological framework [24, 49], becomes
crucial in emergent human-AI collaboration, which necessitates a better model of the user state [43]. Existing research,
however, focuses on the user's intention and goal and seldom addresses the user's knowledge or belief [23, 47, 89, 90].
Furthermore, there is a lack of distinction between high-level goals (desires) and immediate goals (intents) [37]. Hence,
we propose a general model of the user state, amalgamating belief, desire, and intent.
3 Formative Study 1: Design with Professional AR Designers
We first conducted a formative study to explore the problem space and potential benefits of proactive AR assistance.
The study began with a semi-structured interview on participants' background knowledge, followed by four different
interaction scenarios, representing common everyday AR tasks, which were shown to participants for design feedback. A final apparatus
combining participants' design feedback was created for the later study.
3.1 Participants
Using email and snowball sampling, we recruited six professional AR designers (three female and three male, mean age = 30).
Since we wanted to collect insights from experienced individuals, all participants are professional AR designers
currently working in industry, with at least three years of experience developing AR applications and
experience creating AR applications for real-world guiding tasks. Participants were paid $30 per hour.
3.2 Task Design
The study contains two sessions: a semi-structured interview and a task to design AR assistance for four everyday
scenarios. Each participant was asked to design for two of the four scenarios, ensuring a balanced distribution across
scenarios. As a result, each scenario was designed by three different participants.
In the first session, we asked participants about their prior working experience with AR assistants, the challenges they
faced in creating them, potential benefits, and applications in everyday settings. We further collected their responses on
insights, potential benefits, and use scenarios of proactive assistants.
In the second session, designers were asked to design AR assistants for two everyday scenarios. The scenarios were
assigned in a pre-determined order to balance the total number of designs. Since these scenarios are in everyday settings,
we use WikiHow¹ to obtain detailed, step-by-step instructions as the task's background information for participants.
The tasks have an average of 7 steps, but participants can add additional substeps if needed. We took screenshots and
videos with a HoloLens 2 while following the obtained instructions and used them as visual guidance to demonstrate the
task's interaction context. The tasks are presented to participants as digital forms containing the above information,
and participants are asked to design: 1) whether guidance is needed for the current step; 2) the timing and duration of the
guidance; 3) the modality of the guidance; and 4) the content of the guidance. These questions focus on the "if", "when",
"how", and "what" of AR guidance, which are common ways to guide users in the literature and in current practice [todo].
3.3 Procedure
Since participants reside globally, the experiment was conducted remotely via Zoom after obtaining their informed
consent. Participants were first asked to introduce their background, describe their daily work, and discuss projects
related to AR guidance. We further inquired about their insights into the advantages and disadvantages of AR guidance,
including challenges faced during development and by end users. Finally, we presented the concept of proactive AR
guidance and solicited their opinions on potential challenges, applications, and feasibility.
After the semi-structured interviews, participants received digital forms containing materials to design AR guidance
for their assigned tasks, including textual descriptions, corresponding images, and videos. During this phase, participants
were introduced to the process outlined in the previous paragraph. The experimenters addressed any questions
participants raised via Zoom.
On average, the study's first session lasted approximately 28 minutes (x̄ = 28), while the second session took around
60 minutes (x̄ = 60). The entire experiment lasted about 1.5 hours. All participants successfully completed the design
task, resulting in the creation of four AR task designs.
3.4 Results
3.4.1 Interview result.
Benefits of AR Assistance. The experts found AR guidance particularly beneficial in providing real-time,
contextual information that enhances both the user's awareness and decision-making in physical environments. A
key advantage of AR is its ability to reveal forgotten or overlooked information. For instance, E1 emphasized
that "I find that AR assistance most useful when it helps the user realize something they might not know... they might
forget about an object, or are not aware that this object could be used in this situation... then (with AR guidance) they have
this Eureka moment." The "Eureka moment" refers to moments where users suddenly realize the utility of objects
or actions they hadn't considered. This function is especially useful in spatial tasks, as mentioned by E2 and E3. E2
highlighted that by overlaying visual cues such as arrows or animations directly onto the environment, AR can help the
user better understand complex electrical circuits. E3 stated that "in tasks with spatially sensitive movements... AR is a
proper medium because users can intuitively know what they need to do." E3 further described a machine operation task in which users received
spatially positioned guidance on turning knobs or pressing buttons, which was more
intuitive than traditional 2D instructions. Additionally, E4 stated that AR reduced interaction costs, particularly for
tasks that require high-frequency operations, and enabled hands-free operation, making it highly valuable in scenarios
like cooking. The first-person perspective offered by AR also aids in better comprehension of instructions, as mentioned
by E7, especially for individuals with limited experience, such as students in laboratory settings.
¹https://www.wikihow.com/
Challenges of AR Assistance. The interviews revealed several key challenges in designing AR assistance. One major
challenge is the difficulty of generalizing AR content to fit diverse contexts, as AR designers often create designs
based on their assumptions about the user's environment. However, users may interact with objects that fall outside
these initial assumptions. As E1 noted, "It's hard to cover all the edge cases of what a person might have... I assume they're
in an indoor space, but that might not be the case," highlighting the complexity of accommodating varied environments.
Another challenge, as E5 noted, is the lack of a standardized approach in the expansive interaction design space,
especially compared to traditional 2D interaction. E3 pointed out the difficulty of creating 3D visual assets from scratch,
further complicating the process. Additionally, designing effective assistance that accurately aligns with user
actions and intentions remains problematic. Both E3 and E4 noted the difficulty of defining an accurate mapping
between user actions and AR responses. E4 emphasized that misinterpreting user behavior can result in irrelevant or
unhelpful guidance, for example, recommending a taxi when the user merely intends to walk. E3 also emphasized the
difficulty faced by task experts without engineering expertise, stating, "Suppose I am a designer and I know nothing
about coding, but I still want to make AR assistance for users, how should I do that?" Finally, E6 highlighted that current
designs often fail to account for users' prior experiences with AR, which could hinder novices from gaining an
immersive experience in spatial interactions.
Benefits of Proactive AR Assistance. The experts highlighted several benefits of proactive AR assistance from both the
AR developers' and users' points of view.
For AR developers: First of all, E1, E2, and E6 agreed that proactive assistance could tremendously reduce
development time and increase efficiency. For instance, E2 remarked, "We will definitely see a huge improvement in
the efficiency of the content creation through this auto-generation process." Similarly, E1 noted that automatically assisting
users can simplify tasks such as adding labels, recognizing objects, and generating guidance. She went on to offer an
example of a cooking app where such automation would be particularly useful in identifying ingredients or suggesting
cooking steps.
Both E1 and E3 highlighted how automatic AR design could generalize across different domains. According
to E1, "If we have a pipeline... using computer vision, it would save a lot of time... could have a universal pipeline to
create guidance." Moreover, E3 pointed out that proactive AR assistance may be adapted into authoring tools such as spatial
programming and programming-by-demonstration, increasing accessibility for non-developer experts. E4, E5, and E6
envisioned the potential of the automatic design process to provide proactive guidance. E4 pointed out that such
assistance anticipates the user's intentions and the environment to provide accurate guidance. E6 added that it could
be used to detect errors in gestures and automatically provide corrective guidance.
For users: designers pointed out that proactive assistance may help avoid information overload. E5 emphasized that
automatic detection of user intent could help avoid information overload by presenting only relevant information.
Proactive assistance may also gain users' trust, since it makes users feel that the system understands them.
Challenges of Proactive AR Assistance Design. Key challenges of automatic AR design center on scalability and user
understanding. E1 highlighted the need for a universal system that can operate across different devices and domains.
However, as E3 elaborated, scalability remains a significant hurdle because AR systems require domain-specific
knowledge to provide effective guidance. As E3 noted, "Scalability is the main issue... AR systems must lie in a specific
domain, and it's hard to do this for every domain." E6 added that an automatic system must be adept at managing
unforeseen situations, which requires a deep understanding of the task at hand. Even with the help of large language
models, further training and customization for the tasks would be necessary.
Modality | Detailed Assistance Type | Content
text | text | overview; instruction; information; reminder
visuals | animations | instruction
visuals | image | instruction
visuals | arrows | location; interaction point
visuals | progress bar | check progress
visuals | checkpoint cue | step completion; warning
audio | sound cue | step completion; warning
audio | voice | instruction
gadget | timer | count time
Table 1. Types of assistance provided across different modalities, as suggested by AR design experts.
Experts E2, E4, E5, and E6 also emphasized the difficulty of accurately detecting user intentions. E5 highlighted the limited field of view in AR headsets and the
low accuracy of detection algorithms in real-world environments, noting, "Sometimes, the system might trigger
guidance when the user doesn't need it, which could lead to confusion." Similarly, E4 discussed how AR software in
industry struggles to fully understand complex user environments and actions in real time. Furthermore, E2 mentioned
that such automation can also confuse users due to its lack of self-explanatory features, stating that "if (the system is)
fully automatic, you need the system to have some type of feedback. Automation without feedback may confuse the user."
Finally, E6 stressed the need to balance general advice with task-specific solutions: AR systems must remain
relevant to the user's current task, offering guidance that is both practical and actionable. E5 also noted the challenge
of balancing the initiative of the user and the system. As she remarked, "Finding the balance between system flexibility and
the control it gives users is a challenge."
3.4.2 Design result. Designers opted to display appropriate content at the right time. For each step, the AR professionals
designed the relevant assistance. The designers used user-centered and object-centered strategies to determine when
to assist. The user-centered strategy relies on the user's actions; for example, one designer created instructions
that show up when the user gets stuck in a step or when the user shows intention. The object-centered strategy relies on an
object's status; for example, one expert designed a reminder to change the mop pad when the old pad is dirty in the
room-cleaning task. Some designers created instructions to show up when the user finishes the previous step or when
something unexpected happens. They also designed a success cue for when the user completes a step.
The expert-designed assistance spans multiple modalities, including text, visuals, audio, and a gadget (timer). Notably,
the experts tended to choose a certain type of modality ("how") for different contents ("what"). Table 1 shows an
overview of the assistance modality and content. Text assistance is usually used to show an overview of the step,
detailed sub-steps within the step, information about the object, or a gentle reminder. The visuals designed by the experts
can be classified into visual overlays (e.g., arrows, progress bar, checkpoint cue), images, and animations. The arrows
and other visual highlights are used to indicate the locations or interaction points of objects. The progress bar is
designed for the user's progress check. Images and animations can illustrate detailed instructions. The
checkpoint cue is designed to show the user step completion or a warning. The audio assistance can provide a timely
warning, step-by-step guidance, or a success cue. The timer is designed to count time for time-sensitive steps, for
instance, making pour-over coffee.
3.4.3 Wizard of Oz system. We asked the AR professionals to design an apparatus that could be used for the later empirical
study. Each designer made two AR assistance designs for two tasks. The WoZ system contains image, voice, and text-based
assistance. The images were sourced from task instructions on WikiHow, and the text and voice guidance were developed
based on the expert designs and WikiHow instructions. In total, each task was designed three times by the AR designers. We combined
similar timing, modality, and content to form one final AR assistance design per task. We then implemented these AR assistance designs
in Unity and employed a wizard-of-oz operator to trigger the assistance in a timely and accurate manner via wireless keyboard control. To
visualize highlights, we overlay visuals indicating interaction points, task locations, and the quantities of materials directly
on static images. The animations were simplified by concatenating multiple images to offer step-by-step guidance.
The resulting system was video-recorded on a Microsoft HoloLens and sent back to the AR professionals for review. All
designers agreed with how each step was implemented after any discrepancies were resolved through either clarification or
modification of the apparatus.
4 Formative Study 2: Co-Design With the Psychological and HCI Experts
To build automated AR assistance that proactively provides users with assistive information (instructions or tips) and
understands the current task, we recruited six experts from computer science and psychology (E1-6) to spur discussion
of potential solutions over two sessions of dyadic interviews. The study focused on how to design the system by
asking the experts to discuss factors that constitute a proactive assistance system. We paired experts with complementary
backgrounds to form three groups (Groups A, B, and C), as Table 2 shows. Their ideas and design decisions were then
synthesized into design findings that motivate our system implementation.
4.1 Dyadic Interviews
We conducted two sessions of dyadic interviews in which two experts with complementary backgrounds were grouped to
discuss solutions to the presented challenges. Dyadic interviews allow two participants to work together on open-ended
questions [64]. This setup let us understand how to design proactive AR assistance from both interaction and user-modeling
perspectives.
4.2 Challenges
Based on the findings from the professional AR designers in the previous formative study, we performed a round of
literature survey by searching for AR assistance, embodied assistants, and immersive assistants on Google Scholar and the ACM
DL. We filtered out unrelated papers and derived the design challenges from the filtered paper collection. Two authors
separately reviewed these papers and coded the key challenges from them. In total, we found 25 common challenges
and grouped them by themes, following the concepts of thematic analysis [10].
4.2.1 C1: Triggering assistance at the right time is challenging. AR assistance needs to be triggered at the proper
time during AR interaction. An improper timing strategy may disappoint users and reduce their trust [40]. For
example, when the user is occupied or under stress, frequent display of AR assistance may further increase the user's
stress. Current proactive assistants regulate the timing using the user's intent and actions [79], or use fixed intervals
to display assistance periodically. However, these methods do not consider the user's goal and lead to sub-optimal
performance.
4.2.2 C2: Reusability and scalability in AR assistance is a problem. Most existing AR assistance systems are
designed as ad-hoc solutions, where the forms of assistance (image, text, and voice) are developed individually [60, 66, 71]
and must be re-adapted for later use.
Expert | Background | Gender | Group
E1 | HCI | M | A
E2 | Psychology | F | A
E3 | Computer Vision & Psychology | F | B
E4 | Psychology | M | B
E5 | HCI | M | C
E6 | Psychology | M | C
Table 2. Expert backgrounds in the co-design. We paired one computer science expert with one psychology expert in each group. In
total, three groups participated in the co-design.
(a) Initially presented diagram and the available modules. (b) Sample result from Group B.
Fig. 2. Participatory design in the first session. The experts needed to collaborate on creating a desired assistant framework based on
the provided baseline diagram. At the bottom of figure (a), the experts can find the system components for perception.
This is because each interaction scenario is different and thus requires contextual information.
Designers often need to modify their designs during development, resulting in unnecessary slowdowns.
4.2.3 C3: Task interruption and multi-task tracking is difficult. Users handling multiple tasks at once is common in
everyday life but challenging for AR assistance [?]. If the assistance cannot respond correctly to the user's
immediate task switching or pausing, the interaction efficacy will be affected, resulting in users not trusting the assistive
system [57, 95]. However, existing technologies that can provide proactive AR assistance are limited in reasoning about the
current task in a multi-task setting.
4.3 First session: participatory design
To formalize how to design a proactive system capable of determining what to show users for task completion,
we first presented background knowledge on AR assistance, modalities, applications, and the challenges described in
Sec. 4.2. During the presentation, we clarified any concerns the experts raised. At the end of the presentation, each dyadic
group was asked to discuss 1) what the system needs to know to act proactively, 2) what kind of features the system needs
to have, 3) whether user modeling would be helpful, and 4) how to mitigate the known challenges.
After a 50-minute open-ended discussion, we provided them with a list of commonly used tracking, computer
perception, contextual understanding, and display technologies and introduced their functions (Fig. 2). Based on the
discussion, each dyadic group could add more categories or functions to this list if they found them theoretically useful for
proactive assistance. Their modified lists are illustrated in Miro².
4.4 Second session: design for implementation
The second session involved reconvening the same groups of experts for dyadic interviews. Initially, we presented the
outcomes of the first session alongside our synthesized framework, seeking confirmation that it accurately reflected
their initial ideas. This was followed by an open discussion in which the experts delved into the framework's details and
made adjustments to refine it further. This session, which lasted approximately one hour for each of the three groups of
experts, was essential for finalizing the design framework for the AR assistant.
4.5 Data Collection
We screen-recorded both the discussions and the participatory design sessions. The audio from these recordings was
then transcribed into text using Zoom's auto-transcription feature. Two co-authors independently analyzed the video
recordings and transcribed text, coding the findings into similar insights. The insights were then combined into the
following findings, and discrepancies were resolved through discussion.
4.6 Key Findings
[KI1] BDI may be beneficial for building AR assistants. During the discussions, all three psychology experts (E2, E4,
and E6) brought up the importance of considering what the user sees and understands in the surroundings
when discussing C1. For instance, E4 emphasized, "... it is important to model the human's mental space, so we
can adjust the AR (assistance's) timing." E4 further introduced the belief-desire-intention model, describing it as
a well-established cognitive model for understanding human behavior that could be used for proactive
assistance. E2 emphasized the importance of providing this feedback loop to ensure the assistant correctly
understands the user's goals, thus enhancing the effectiveness of a multi-task AR assistant.
[KI2] User intention can inform the guidance type and content, as it reveals the user's immediate and concrete goal.
Group A and Group C recognized the importance of understanding user intention (i.e., the immediate step in a task).
In the participatory design, both groups included next-step prediction as an essential system
design component. Additionally, E1 stated, "We can understand the attention (and the intended action) of users; for
example, during the cooking, his attention is on doing something not important (and we can drag his attention back
to the correct step)." E6 mentioned a series of possible functions for inferring short-term memory based on the
egocentric view, which E5 opposed, highlighting that current computer vision methods cannot do this reliably.
As a result, new methods are required to infer users' intentions.
[KI3] High-level goals, or desires, improve transparency in task switching, or multi-tasking. E1 and E2 agreed
that to support multi-tasking effectively, it is necessary for users to view the assistant's inferred task or desire.
This idea aligns with the BDI model, as mentioned by E2. We further validated this concept with E1-3 and
confirmed their proposed ideas were consistent. We also shared the concept with E5 and E6, and they agreed it is
crucial for multi-task assistance design.
[KI4] Modern AI models enable new opportunities in understanding the context, environment, objects, and actions
in an image. E5 has extensive experience with traditional computer vision models and expressed concerns that
current computer vision models may not be sufficient due to the inaccuracy of action and intent prediction.
Even if users' intentions (i.e., immediate goals) can be detected, the predicted intent cannot be fully used because
these models often lack the ability to handle complex scenarios or make accurate decisions based on intent
predictions. E1, who has significant experience in large language model (LLM) development, suggested that
multi-modal LLMs like GPT-4V could offer a solution because of their advanced reasoning capabilities. Given the
strong performance of current multimodal LLMs, exploring how to prompt them efficiently may help better detect
context, environment, objects, and actions by leveraging their strong reasoning.
²https://miro.com
5 Design Requirements
Summarizing the key findings of the aforementioned formative studies, we propose the following design requirements for proactive
AR assistance.
[D1] The BDI model can be used [KI1] to decide the timing and content of the assistance. Ideas from the belief component help the assistant filter out duplicated and unnecessary assistance, while consideration of desire captures the importance of the current task.
[D2] Proactive AR assistance should provide multi-modal, multi-task assistance based on the user's current intention. The assistance should guide the intended task step with auxiliary support to enhance the user's capability for completing the task, and the user should be able to exit and enter other tasks during task performance [KI2].
[D3] AR assistants should communicate the system state to the user to ensure transparency and interpretability [KI3]. Due to the intricate complexity of the physical space, predicting assistance for physical tasks is non-trivial, and users may lose trust in the AR assistant if unexpected conditions occur. Providing the immediate reasoning paths leading to the final assistance helps users understand the system status and maintain trust.
[D4] LLMs are useful to actively perceive the environment, model the user's action status, track the task using the implemented BDI model, and select appropriate assistance [KI4].
6 Satori System
In this section, we present the design of our proposed Satori system, guided by the design requirements summarized
above. Satori is a proactive AR assistant that implements the BDI model to provide multimodal assistance at the
appropriate time, using a multi-modal LLM and an ensemble of vision models. We first adapt the BDI model to the AR
domain and describe our implementation for predicting the BDI model in Satori. Next, we detail the implementations for
timing prediction and assistance prediction. Among the modalities of assistance, image generation is the most important
and nuanced; therefore, we propose a novel image generation pipeline based on DALLE-3 and a customizable prompting
method. Finally, we describe our interface and interaction design, which enhances the affordance and interpretability of
the assistant.
6.1 Implementing BDI model for AR Assistance
To fulfill [D1], we need to align the BDI model with the constraints and affordances of AR-mediated tasks. To do so, we
must account for the unique characteristics of AR technology, including the limited field of view, real-time environmental
mapping, and the blend of physical and digital information.
6.1.1 Belief. Human belief is a complex psycho-neural function integrally connected with memory and cognition [56, 68].
Precise modeling of human belief within the constraints of AR technology is not feasible without access to human
neural signals. For AR assistance, the primary information sources are visual perception and user communication. We
propose a two-fold method, capturing scenes and objects from visual input and past action history from user
communication, to approximate the user's belief state within AR constraints.
Scenes provide information on the user's current task context and potential shifts in their desires and intentions.
For example, when the user looks at task-unrelated areas, the assistant can reduce assistance to avoid distraction. We
represent the scene by the label predicted by an image classification model.
Object information helps users locate and interact with relevant objects. We use two different models for object
detection: a DETR model [13] to detect objects in the scene in a zero-shot manner [13], and a LLaVA model to detect objects that
are being held/touched/moved by human hands, prompted to "detect the objects interacting with the human
hands" [54]. We did not use conventional object detection models for the latter case because these models are trained to predict a fixed
set of labels, limiting generalizability.
Action and assistant history enables non-repeating guidance; for repeated actions, the system progresses from
detailed visual guides to concise text instructions, reducing cognitive load as the user gains familiarity. This history
contains a log of user interactions and the AR assistant's responses, including both the type of assistance provided and
a description of each instance. The log serves as a reference to determine whether the user has previously encountered
specific assistance. Utilizing this historical data aids in accurately inferring the user's expectations regarding the type
of support they receive, allowing for more tailored and effective assistance in future interactions.
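To make the three belief sources concrete, the sketch below shows one way the scene label, detected objects, and action/assistance history could be combined into a single belief state. The wrapper functions (classify_scene, detect_scene_objects, detect_handled_objects) are hypothetical stand-ins for the image classifier, DETR, and LLaVA calls described above, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BeliefState:
    """Approximation of the user's belief from egocentric input (illustrative sketch)."""
    scene_label: str                                   # e.g., "kitchen", from an image classifier
    scene_objects: List[str]                           # zero-shot detections (DETR)
    handled_objects: List[str]                         # objects in the user's hands (LLaVA prompt)
    history: List[Dict] = field(default_factory=list)  # past user actions and assistance shown

    def has_seen(self, assistance_id: str) -> bool:
        """True if similar assistance was already presented to the user."""
        return any(entry.get("assistance_id") == assistance_id for entry in self.history)

# Placeholder wrappers standing in for the vision models described above.
def classify_scene(frame) -> str: return "kitchen"
def detect_scene_objects(frame) -> List[str]: return ["kettle", "mug", "spoon"]
def detect_handled_objects(frame) -> List[str]: return ["kettle"]

def update_belief(frame, history: List[Dict]) -> BeliefState:
    """Refresh the belief state from the latest egocentric frame."""
    return BeliefState(
        scene_label=classify_scene(frame),
        scene_objects=detect_scene_objects(frame),
        handled_objects=detect_handled_objects(frame),
        history=history,
    )
```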
6.1.2 Desire. Desire represents the user’s high-level goals, such as cleaning a room or organizing a shelf. We infer the
user’s desire using a standalone LLM agent, which analyzes the current camera frame and supports voice interaction
with the user. Voice interaction allows the user to provide feedback on the predicted desire, and the transcribed text can
then be used by the LLM agent to adjust the desire prediction.
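As a rough illustration of this step, the sketch below prompts a multimodal LLM with the current frame, the previous desire estimate, and any transcribed voice feedback. Here query_multimodal_llm is a hypothetical wrapper, and the prompt wording is ours rather than Satori's exact prompt.

```python
from typing import Optional

def query_multimodal_llm(prompt: str, image: bytes) -> str:
    """Hypothetical wrapper around a multimodal LLM call (placeholder response)."""
    return "make pour-over coffee"

def infer_desire(frame: bytes, transcript: Optional[str], prior_desire: Optional[str]) -> str:
    """Infer the user's high-level goal (desire) from the egocentric frame and voice feedback."""
    prompt = (
        "You assist a user wearing an AR headset. From the attached egocentric frame, "
        f"infer the user's high-level goal. Previous estimate: {prior_desire}. "
    )
    if transcript:
        # Voice interaction lets the user correct the predicted desire.
        prompt += f'The user just said: "{transcript}". Revise the goal accordingly. '
    prompt += "Answer with a short goal phrase."
    return query_multimodal_llm(prompt, frame)
```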
6.1.3 Intention. Intention refers to the user's immediate, low-level actions, such as grabbing a broom to sweep. It
focuses on specific steps contributing to broader goals. Recognizing intent is crucial for providing action-oriented
AR assistance. We infer intention primarily through perceptual information ([D1]), including visual cues and user
interactions with objects; voice commands serve as a secondary source. The system compares the intention with the
desire to detect potential user errors.
6.1.4 Intention forecasting. To predict user intentions, we use a multi-modal LLM to forecast upcoming user
intentions. Intention forecasting is inherently challenging due to the vast range of potential future actions and the
ambiguous nature of user goals. Predicting intentions is difficult even between humans, and our prior experiments show
that current action forecasting models struggle in our scenario due to misalignment with the label set. We constrain the
forecasting process by incorporating the user's desire, which helps narrow down the range of possible future intentions. By
setting a specific desire as in Sec. 6.1.2, the search space for the next intention is significantly reduced. We then prompt the
LLM to predict the intention within this constrained context [D4]. This prediction process follows a search-and-reflect
framework consisting of three stages (a sketch of the loop follows the list):
(1) Analysis Stage: The LLM first analyzes the current desired task and its corresponding task plan. This involves understanding the user's goal and breaking it down into actionable steps, thereby establishing a foundation for anticipating the next actions.
(2) Prediction Stage: Based on the current action and the established task plan, the LLM determines the next plausible steps. This involves leveraging contextual cues and understanding typical sequences of actions related to the current task.
(3) Reflection Stage: The LLM evaluates the feasibility of the predicted next steps by considering the available objects and tools in the scene. Actions that require missing or unavailable objects are eliminated, ensuring that only viable actions are suggested. This filtering refines the prediction further by aligning it with the actual scene context, reducing irrelevant or impossible options.
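One possible realization of this search-and-reflect loop is sketched below as three chained LLM calls; ask_llm is a hypothetical text-only stand-in for the multimodal LLM, and the prompts are illustrative rather than Satori's exact prompts.

```python
from typing import List

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM wrapper; returns a placeholder answer."""
    return "pour hot water over the coffee grounds"

def forecast_intention(desire: str, current_action: str, scene_objects: List[str]) -> str:
    # Analysis: decompose the desired task into an ordered plan of actionable steps.
    plan = ask_llm(f"Break the task '{desire}' into short, ordered steps.")

    # Prediction: pick plausible next steps given the current action and the plan.
    candidates = ask_llm(
        f"Task plan:\n{plan}\nThe user is currently: {current_action}. "
        "List the most plausible immediate next steps."
    )

    # Reflection: keep only steps that are feasible with the objects visible in the scene.
    return ask_llm(
        f"Candidate next steps:\n{candidates}\n"
        f"Objects available in the scene: {', '.join(scene_objects)}. "
        "Discard steps that require missing objects and return the single most likely next step."
    )
```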
6.2 BDI-Driven Timing Prediction
The assistance timing is primarily determined by the occurrence of the user's intention: when the intention is detected,
the corresponding assistance should be generated and presented. However, the user may be distracted in real
scenarios. Therefore, a delayed assistance mechanism is implemented based on the user's perceived scene captured by
the egocentric camera. To implement this, we use a combination of intention forecasting and an early forecasting mechanism.
In a standard pipeline, intention forecasting begins only after the previous action is completed. This approach can
negatively impact the user experience due to latency, requiring users to wait for the pipeline to finish processing. To
address this issue, we propose an early forecasting mechanism. While the intention forecasting pipeline runs and caches results
continuously, a parallel early forecasting process detects when an action is completed. Once the action is
detected as finished, the cached intention and the corresponding assistance are immediately displayed. This way, the
user no longer has to wait for the intention forecasting pipeline to complete, which normally has high latency.
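The sketch below illustrates this cache-and-release pattern under our own simplifying assumptions: a background thread keeps refreshing a cached forecast while a lightweight per-frame check releases the cached assistance as soon as the current action finishes. Both forecast_intention_slow and action_finished are placeholders for the high-latency forecasting pipeline and the completion detector of Sec. 6.2.1.

```python
import threading
from typing import Callable, Optional

cached_intention: Optional[str] = None
cache_lock = threading.Lock()

def forecast_intention_slow(frame) -> str:
    """Placeholder for the high-latency intention forecasting pipeline."""
    return "grab the mop"

def action_finished(frame) -> bool:
    """Placeholder for the fast action-completion check (Sec. 6.2.1)."""
    return True

def forecasting_loop(get_frame: Callable, stop: threading.Event) -> None:
    """Background thread: continuously refresh the cached intention forecast."""
    global cached_intention
    while not stop.is_set():
        prediction = forecast_intention_slow(get_frame())
        with cache_lock:
            cached_intention = prediction

def on_new_frame(frame, show_assistance: Callable[[str], None]) -> None:
    """Per-frame hook: release cached assistance the moment the action completes."""
    if action_finished(frame):
        with cache_lock:
            if cached_intention is not None:
                show_assistance(cached_intention)   # no need to wait for the slow pipeline
```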
6.2.1 Early forecasting mechanism with action finish detection. This pipeline aims to minimize waiting times, providing
a more seamless and responsive interaction. It allows the AR assistance to proactively anticipate user needs and deliver
timely guidance as soon as an action is completed, rather than waiting for the forecasting process to begin afterward.
Unlike conventional action recognition or temporal grounding tasks, action completion detection lacks pre-trained
models or large-scale datasets. To address this, we use the zero-shot learning capabilities of vision-language models and
propose an ensemble-based approach to balance latency and effectiveness in predicting the next action.
Implementation: We use in-step checkpoints for action finish detection. Checkpoints mark
where a user successfully finishes a step and makes progress toward task completion. They are derived from various
cues, such as action detection results, object states, and hand-object interaction states. A task planner (formulated as
boolean statements) is used to generate these checkpoints.
To determine whether a checkpoint is reached, we ensemble the local image captioning model BLIP-2 [51] with the online
GPT-4V model. Since the BLIP-2 model has lower accuracy, its predictions require double confirmation to be considered
reliable, while the GPT-4V model is inherently trusted. This ensemble strategy optimizes the performance of action
completion detection by leveraging the strengths of both models, ensuring more accurate and timely assistance in the
AR environment.
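A minimal sketch of this checkpoint ensemble is shown below, under the assumption that each model can be queried with a yes/no question about a checkpoint: a BLIP-2 answer counts only after two consecutive positive votes, while a GPT-4V answer is accepted immediately. The functions blip2_says_done and gpt4v_says_done are hypothetical wrappers around the respective model calls.

```python
from collections import deque

def blip2_says_done(frame, checkpoint: str) -> bool:
    """Placeholder for querying the local BLIP-2 captioning model about a checkpoint."""
    return True

def gpt4v_says_done(frame, checkpoint: str) -> bool:
    """Placeholder for querying the online GPT-4V model about a checkpoint."""
    return True

class CheckpointDetector:
    """Decides whether an in-step checkpoint (a boolean statement) has been reached."""

    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint          # e.g., "the mop pad has been attached"
        self.blip2_votes = deque(maxlen=2)    # buffer for double confirmation

    def update(self, frame, use_gpt4v: bool = False) -> bool:
        if use_gpt4v:
            # The online model is inherently trusted: a single positive answer suffices.
            return gpt4v_says_done(frame, self.checkpoint)
        # The local model is cheaper but noisier: require two consecutive positives.
        self.blip2_votes.append(blip2_says_done(frame, self.checkpoint))
        return len(self.blip2_votes) == 2 and all(self.blip2_votes)
```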
6.3 Multi-Modal Assistance Prediction
6.3.1 Assistance design. AR assistance can come in different modalities such as sound, text, and image [D2]. Each has
different functions and should be used depending on the scene context and the user's actions. To that end, we implement:
(1) Textual assistance: essential for guiding users toward their desired outcomes. We use white text on a black, transparent container to ensure readability.
(2) Tool and object utility visualization: for tasks involving tools, visual cues can aid proper handling and support operation tailored to the task context. We focus on automatically generating a single image using the DALLE-3 model to represent actions and objects in an illustration style. For more complex actions, we also employ multiple images as an alternative. We elaborate on the design and implementation in Sec. 6.4.
(3) Timer: for tasks requiring specific timing (e.g., boiling water, monitoring reactions), a timer offers a way to count time. We implement a timer in AR that proactively counts time for users during time-related tasks.
6.3.2 Inferring modality. Given the predicted intention, the AR assistance should infer the appropriate assistance modalities. For
example, a cook chopping carrots while waiting for water to boil involves multiple components: time management
(monitoring boiling time) and tool utilization (using a knife, water pitcher, etc.). The AR assistant can then provide
targeted assistance for each of these components. This component-based approach enhances the scalability and adaptability
of the AR assistance system. By reusing assistance strategies for tasks that share similar intention components,
the system improves efficiency and consistency in delivering guidance.
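The component-to-modality mapping can be read as a small lookup; the sketch below uses illustrative component labels that are our own, loosely derived from the designer-suggested pairings in Table 1 rather than from Satori's actual configuration.

```python
# Illustrative mapping from intention components to assistance modalities,
# loosely following the expert-suggested pairings in Table 1.
MODALITY_BY_COMPONENT = {
    "time_management": "timer",
    "tool_utilization": "image",
    "step_instruction": "text",
    "step_completion": "audio_cue",
}

def select_modalities(intention_components):
    """Return the set of modalities to render for one predicted intention."""
    return {MODALITY_BY_COMPONENT[c] for c in intention_components
            if c in MODALITY_BY_COMPONENT}

# Example: chopping carrots while waiting for water to boil involves both
# time management and tool utilization, so the assistant shows a timer and an image.
assert select_modalities(["time_management", "tool_utilization"]) == {"timer", "image"}
```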
6.3.3 Regulating assistance content. To avoid increasing task load and reading time, the assistant omits image-based
assistance if similar content has already been displayed in recent interactions. When an object is present in the user’s
belief state, an indicator showing the object’s location in the scene is provided. By aligning generated image assistance
with the user’s scene context, the system enhances user engagement. The assistance images should incorporate scene
information and be consistent with the current belief state.
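The two regulation rules can be sketched as follows. This is illustrative only: the dictionary-based assistance record, the title-similarity check, and the belief-object lookup are assumptions we introduce, not Satori's actual data structures.

```python
from difflib import SequenceMatcher

def regulate(assistance: dict, recent_titles: list, belief_objects: dict,
             similarity_threshold: float = 0.8) -> dict:
    """Drop duplicate image assistance and attach a location indicator for known objects."""
    # Rule 1: omit image assistance if similar content was shown recently.
    if assistance.get("type") == "image" and any(
        SequenceMatcher(None, assistance["title"], t).ratio() > similarity_threshold
        for t in recent_titles
    ):
        assistance["type"] = "text"  # fall back to lighter-weight guidance
    # Rule 2: if the target object is already in the belief state, show where it is.
    obj = belief_objects.get(assistance.get("object"))
    if obj is not None:
        assistance["indicator"] = {"object": assistance["object"], "position": obj["position"]}
    return assistance
```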
6.4 Image Assistance Generation
Image assistance aims to provide clear and straightforward instructions for tool usage. It is generated with the DALLE-3 model so that it can flexibly adapt to current user states and contexts [55].
First, clarity and conciseness are prioritized in the design of each image, avoiding unnecessary details that might confuse or overwhelm the user. Second, consistent formatting is maintained throughout all images; a uniform design style reduces cognitive load and helps users quickly grasp the instructions without needing to adapt to a new format.
Additionally, some images are action-oriented, with directions explicitly demonstrated through arrows to guide the user step by step. This visual cue enhances comprehension by focusing the user's attention on the required actions.
Lastly, an immersive experience is emphasized by ensuring that the objects depicted in the images are consistent
with those in the user’s real-world environment. This consistency aligns with the user’s belief state [D1], allowing for a
more seamless and intuitive interaction between the system and the physical task at hand.
6.4.1 Prompt template with modifier. We have developed a prompt template and an associated modifier taxonomy based on the above design considerations. The prompt template is structured to encapsulate all necessary elements of an instructional image in a standardized format:
[Object][Attributes][Action].[Indicator][Attributes][Direction].[Background].[Style Modifier].
[Quality Booster]
Each element in the prompt template, detailed in Table 3, is associated with a specific modifier to refine its description. The basic prompt includes Object and Action modifiers, which depict the action and the targeted object within the assistance image. The Attribute modifier incorporates real-world attributes such as color, shape, and materials of the objects, enhancing task immersion and operational accuracy. Additionally, Indicator modifiers, such as arrows, and
Direction modifiers explicitly highlight the action direction or focal points for user attention. The Background modifier removes unnecessary details, clarifying the image. The Style Modifier governs the aesthetic of the image; we choose to apply the "flat, instructional illustration" style based on feedback from experiments, ensuring that the visuals maintain a consistent appearance. Lastly, the Quality Booster enhances image quality with descriptors such as "accurate" and "concise," further ensuring clarity and effectiveness in the visual assistance provided.
Modifier | Description | Example
Object | Denotes the main object or the tool | button
Indicator | Graphic element that points out specific parts or directions | arrow
Attribute | Attributes of objects or indicators, such as shape and color | red
Action | Specifies the action being performed by the object | press
Direction | Clarifies the direction in which the action or attention is guided | pointing to right
Background | Specifies the absence or presence of a background | no background
Style Modifier | Dictates the artistic style of the illustration | instructional illustration
Quality Booster | Enhancements that improve the overall clarity and effectiveness | accuracy
Table 3. Taxonomy of image assistance generation prompt modifiers.
6.4.2 Examples. We present two examples of image assistance generation to demonstrate the effectiveness of our template, as shown in Fig. 3. Please refer to the supplementary materials for details.
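As a concrete illustration of how the template can be filled in to produce prompts like those shown in Fig. 3, the sketch below assembles a prompt from the Table 3 fields. The dataclass, field names, and default values are our own illustrative choices, not code from the Satori system.

```python
from dataclasses import dataclass

@dataclass
class ImagePrompt:
    action: str                # Action phrase, e.g. "One hand presses"
    obj: str                   # Object, e.g. "button"
    obj_attributes: str        # Attribute, e.g. "white"
    indicator: str = "arrow"   # Indicator
    ind_attributes: str = "large red"
    direction: str = ""        # Direction, e.g. "pointing to the button"
    background: str = "No background"
    style: str = "in the style of flat, instructional illustrations"
    quality: str = "Accurate, concise, comfortable color style"

    def render(self) -> str:
        # [Object][Attributes][Action]. [Indicator][Attributes][Direction]. [Background]. [Style]. [Quality]
        parts = [
            f"{self.action} a {self.obj_attributes} {self.obj}",
            f"A {self.ind_attributes} {self.indicator} {self.direction}".strip(),
            f"{self.background}, {self.style}",
            self.quality,
        ]
        return ". ".join(p for p in parts if p) + "."

# ImagePrompt(action="One hand presses", obj="button", obj_attributes="white",
#             direction="pointing to the button").render()
```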
Fig. 3. Comparison of naively generated images from the GPT model (i.e., Raw) with our proposed prompt (i.e., Ours). (a) Ours: "One hand presses a white button on a white espresso machine. A large red arrow points to the button. No background, in the style of flat, instructional illustrations. Accurate, concise, comfortable color style." (b) Raw: "One hand presses a white button on a white espresso machine." (c) Ours: "Cut stem of a red flower up from bottom, with white scissors at 45 degrees. One big red arrow pointing to bottom of the flower stem. In the style of flat, instructional illustrations. No background. Accurate, concise, comfortable color style." (d) Raw: "Cut stem of a red flower up from bottom with white scissors at 45 degrees."
6.4.3 Implementing image assistance generation. We propose using DALLE-3 to generate visual illustrations. The image generation pipeline receives an initial prompt input, which includes the user's next-step intention and a list of key objects, from the guidance generation pipeline. The pipeline finalizes the prompt by integrating this information with the modifiers, as outlined in the prompt template. For generating multiple images to support complex illustrations, we first decompose the user's next-step intention, if applicable, and then form the final prompts, which are sent to the DALLE-3 model in parallel.
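A minimal sketch of this parallel generation step is shown below, using the OpenAI Python SDK. The model name, image size, and thread-pool fan-out are assumptions consistent with the description above; retries, safety checks, and error handling are omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_image(prompt: str) -> str:
    """Return the URL of one illustration for a finalized prompt."""
    result = client.images.generate(model="dall-e-3", prompt=prompt, size="1024x1024", n=1)
    return result.data[0].url

def generate_assistance_images(final_prompts: list) -> list:
    """Send all prompts (one per decomposed sub-action) to the model in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(final_prompts))) as pool:
        return list(pool.map(generate_image, final_prompts))
```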
(a) Interface with belief and desire state display. (b) Interface with task assistance confirmation displayed on top.
Fig. 4. The interface displays the user's belief and desire states, action completion feedback, and assistance content. In this example, the user is connecting a game console to the monitor. The interface accurately represents the belief state as "Switch dock" and the desired state as "Connect Switch." The action checkpoint indicates that the current step is completed. (b) A task assistance confirmation appears when the system detects step completion. The confirmation prompts the user, asking if they are about to use a coffee filter and whether they need image assistance.
6.5 Interface and Interaction Design
6.5.1 Interface design. As shown in Figure 4, the Satori interface displays the current assistance and the related BDI states. At the top of the interface, Satori presents the desired task (e.g., Connect Switch in the example) and the in-belief object indicator (e.g., Switch dock), as shown in Figure 4(a). These BDI states help users understand and track the assistant's state, aligning with design requirement [D3]. The in-belief object indicator shows the relative position of the object that the user needs to interact with: the arrow indicates the position of the object relative to the user.
The object is detected in advance and stored in the modeled belief state, which is an in-memory program variable. The
object will change based on the user’s intention, and its location will be updated based on object detection results. In the
center of the interface, as shown in Figure 4(a), are the in-step checkpoints described in Sec. 6.2.1. These checkpoints
help users track the progress of the current step of the task, thereby improving the interpretability of the model’s
predictions. Users can control the checkpoints manually, and when all checkpoints are reached (represented by green
check symbols in the example), the AR assistant will display the next guidance step.
6.5.2 Interaction design. In the Satori system, we support multiple interactions that let the user give feedback on the model's predictions and adjust the system states. Voice interaction is used to convey the user's intention and desire directly; the AR assistant can adjust the BDI states based on the transcribed text. Satori presents the next assistance when the current action is detected to be finished. In an early version of the system, we found that users could be overwhelmed when the action state changed abruptly. Therefore, it is necessary to display a confirmation panel asking whether the system's intention prediction matches the user's intention, as shown in Fig. 4. Moreover, when a user's desired task involves multiple steps (especially when the number of steps exceeds five), errors in the earlier steps may propagate to later steps without human correction. For example, in making coffee, if the assistant fails to detect that the coffee beans have been ground, it may continue prompting the user to grind the beans. To address this, we allow users to manually control the assistant and provide feedback on the current prediction. When new assistance is displayed, the user must confirm whether it matches their needs, as shown in Fig. 4(b) ("Looks like you are going to open a coffee filter and place it in the brewer..."). For accessibility, we provide an option for audio output through button control.
Fig. 5. Overview of the BDI user model. The inputs are the camera view, the dialogue (the voice communication between the user and the GPT model), and the historical logger (prior assistance that has already been delivered). The flowchart displays how each input is transferred to the corresponding component of the BDI model. The last layer of the model shows how the parameters of the assistance are affected by the BDI model; for example, the Belief component affects the timing and frequency of the assistance.
6.6 Other Implementation Details
We implemented the timing prediction and next-assistance prediction modules on the GPT-4o model. We concatenate the last four frames, sampled at 1 FPS, and send them to OpenAI's API together with the prepared prompt text. We implemented the spatial object belief using the OWL-ViT model [63], a zero-shot object detection model. The scene belief is implemented with the zero-shot image recognition model CLIP [72]. The action history belief is preserved in an array that records the data. We implemented our interface on Unity with the MRTK framework. We stream the HoloLens video to a streaming server, where the different downstream ML modules are triggered by the incoming streams and run in parallel.
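The belief-related detectors can be sketched with off-the-shelf Hugging Face checkpoints as follows. The specific checkpoints, score threshold, and label sets are our assumptions for illustration; the paper does not specify which model variants were deployed.

```python
import torch
from PIL import Image
from transformers import (OwlViTProcessor, OwlViTForObjectDetection,
                          CLIPProcessor, CLIPModel)

owl_proc = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
owl_model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def spatial_object_belief(image: Image.Image, object_names, threshold=0.2):
    """Zero-shot detection of task objects (OWL-ViT) for the spatial object belief."""
    inputs = owl_proc(text=[object_names], images=image, return_tensors="pt")
    outputs = owl_model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    res = owl_proc.post_process_object_detection(
        outputs=outputs, threshold=threshold, target_sizes=target_sizes)[0]
    return [(object_names[int(l)], box.tolist()) for box, l in zip(res["boxes"], res["labels"])]

@torch.no_grad()
def scene_belief(image: Image.Image, scene_labels):
    """Zero-shot scene recognition (CLIP) for the scene belief."""
    inputs = clip_proc(text=scene_labels, images=image, return_tensors="pt", padding=True)
    probs = clip_model(**inputs).logits_per_image.softmax(dim=-1)
    return scene_labels[int(probs.argmax())]
```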
6.6.1 Prompt techniques. We enhance the reasoning capabilities of the LLM components by applying the Chain-of-Thought (CoT) technique along with in-context examples [88]. CoT reasoning allows the model to deliver assistance in a structured manner by sequentially following logical steps. By conceptualizing the BDI model as a series of thoughts, the model can systematically produce the appropriate assistance. To facilitate CoT, we encode the concepts in the BDI model and formalize the output using XML format. Each thought in the process is marked with a hashtag, enabling the model to decompose complex tasks into manageable steps, thereby improving the accuracy and relevance of the provided assistance.
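To make the structure concrete, the snippet below shows one way such a BDI-structured CoT prompt could be encoded. The tag names, hashtag-marked thoughts, and example content are illustrative assumptions, not the exact prompt used by Satori.

```python
# Hypothetical system prompt encoding the BDI concepts as hashtag-marked thoughts
# and requesting XML-formatted output.
SYSTEM_PROMPT = """You are a proactive AR assistant. Reason step by step, then answer in XML.
# Thought 1: summarize the scene and the user's BELIEF (known objects, action history).
# Thought 2: infer the user's DESIRE (the actionable goal).
# Thought 3: infer the INTENTION (the immediate next step).
# Thought 4: choose the assistance modality (text / image / timer) and its content.
Respond as:
<belief>...</belief><desire>...</desire><intention>...</intention>
<assistance type="text|image|timer" title="...">...</assistance>
"""

# One in-context example paired with the system prompt (content is illustrative).
ONE_SHOT_EXAMPLE = (
    "<belief>coffee filter on table; kettle already heated</belief>"
    "<desire>Make coffee</desire>"
    "<intention>place the filter in the brewer</intention>"
    '<assistance type="image" title="Place filter">Open the coffee filter and seat it in the brewer.</assistance>'
)
```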
7 Evaluation
We evaluate the Satori prototype through an open-ended exploratory study, focusing on the following research questions:
(1) Can Satori provide the correct assistance content at the right time?
(2) Can Satori provide comprehensible and effective guidance?
(3) How does our proposed system guide users compared to guidance made by professional AR experts?
(a) Satori: clean room (b) Satori: connect Nintendo Switch (c) WoZ: make coffee (d) WoZ: arrange flowers
Fig. 6. Study procedures. (a) The participant is assembling a mop during the room cleaning task, following the Satori system. (b) The participant is connecting an HDMI cable to a Nintendo Switch dock during the connecting Nintendo Switch task, following the Satori system. (c) The participant is preparing a filter during the making coffee task, following the WoZ system. (d) The participant is trimming flower stems during the arranging flowers task, following the WoZ system.
7.1 Conditions
Participants were presented with two conditions, Wizard-of-Oz (WoZ) and Satori. They completed both conditions in a counterbalanced order to control for sequencing effects, and the tasks (indexed as 1, 2, 3, and 4) were likewise counterbalanced across conditions to mitigate learning effects.
7.2 Study Procedure
7.2.1 Participants. A total of sixteen participants (P01-P16, 11 male, 5 female) were recruited via a university email group and flyer. The average age was 23.8, with a maximum of 27 and a minimum of 21. Ten of the sixteen participants had AR experience before the study. Each participant was compensated with a $30 gift card for their participation. Sickness information was collected from participants both before and after the study, and no sickness was observed following the study.
7.2.2 Apparatus. We used a Microsoft HoloLens 2 headset as the AR device for the study. Participants used the Satori system described earlier while performing the tasks. The headset connected to a server with an Nvidia 3090 graphics card to fetch real-time results.
7.2.3 Tasks. The study began with a brief tutorial introducing participants to the AR interfaces of the two systems. Afterward, participants were assigned four daily tasks designed to evaluate the guidance provided by both systems. The four tasks were initially sampled from WikiHow3 and subsequently rewritten to ensure a consistent task load across them. Following RQ2, we selected these particular tasks because they represent typical daily activities that are neither too familiar nor too complex for the average user.
(1) Arranging Flowers: Participants arranged a variety of flowers into a vase, testing the system's ability to provide accurate and aesthetic guidance.
(2) Connecting Nintendo Switch: This task involved setting up a Nintendo Switch with a monitor, evaluating the system's technical guidance and troubleshooting support.
(3) Room Cleaning: Participants assembled a mop and a duster and cleaned the desk and floor, where the AR system suggested assembly instructions and a cleaning strategy.
(4) Making Coffee: This task required making coffee using the pour-over method, with the AR assistant providing instructions on tool usage and pouring techniques.
3 https://www.wikihow.com/
The participants completed the tasks with the help of the WoZ system or the Satori system, as shown in Fig. 6. After each task completion, participants evaluated their experience using a usability scale and assessed their cognitive load using the NASA Task Load Index. We also conducted a brief, recorded interview, asking participants about the advantages, disadvantages, usefulness, and timeliness of the two systems. The experiments were conducted under the supervision of the Institutional Review Board (IRB), and all task sessions were video-recorded. These recordings were securely stored on an internal server inaccessible from outside the university. Participants provided consent, and their personal identities were strictly protected. We collected data on participants' well-being both before and after the experiment and observed no significant adverse effects. The duration of the entire study was 2 hours on average. All the participants completed the four tasks using both systems.
7.3 Evaluation Metrics
To assess the performance of Satori and the user experience, we utilized the following evaluation metrics.
7.3.1 Usability scale. To answer RQ1 and RQ2, we used a system usability scale consisting of a series of questions designed to measure the timeliness, ease of use, effectiveness, and satisfaction of the AR interface. We designed our questionnaire based on [50] by selecting the usability-related questions and adding some task-specific questions. Final questions included items such as I think the guidance appeared at the right moment, I can easily comprehend the content via the text/audio/image guidance, The guidance's content is helpful in completing the task, I found that the guidance accurately reflects my task intentions, and Overall, I think the system can be beneficial in my everyday life. In total, there are 11 questions. Responses were collected on a seven-point Likert scale from strongly disagree to strongly agree, allowing us to quantitatively gauge user satisfaction and system usability. We computed the mean and confidence intervals for each question using the bootstrapping method. Specifically, 1,000 bootstrap samples were generated from the original data set to compute the mean of each sample. The 95% confidence intervals were derived from the percentiles of the bootstrap distribution to provide a robust estimate of the uncertainty around the mean.
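The percentile-bootstrap procedure used for both questionnaires can be sketched as follows; the function and variable names are illustrative.

```python
import numpy as np

def bootstrap_ci(scores, n_boot=1000, ci=95, seed=0):
    """Mean and percentile-bootstrap confidence interval of one question's ratings."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    boot_means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(boot_means, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return scores.mean(), (lo, hi)

# e.g., 16 per-participant Likert ratings for one question under one condition:
# mean, (low, high) = bootstrap_ci([6, 7, 5, 6, 6, 7, 6, 5, 6, 7, 6, 6, 7, 6, 5, 6])
```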
7.3.2 NASA task load index. To further answer RQ2, we also used the NASA TLX form, which consists of a series of scales to evaluate cognitive load. The evaluation includes mental demand, physical demand, temporal demand, performance, effort, and frustration, each rated on a 100-point range. We calculated the mean and confidence intervals for each dimension using the same bootstrapping technique: 1,000 bootstrap samples were drawn from the original dataset to estimate the mean for each sample, and the 95% confidence intervals were obtained from the percentiles of the bootstrap distribution, offering a reliable measure of the uncertainty surrounding the mean.
7.4 Data Analysis
For the NASA-TLX and the usability scale, we first average the measured scores across the four tasks for each participant under each condition. We then conduct the Wilcoxon signed-rank test on the averaged data to test whether there is a significant difference between the WoZ condition and the Satori condition.
7.4.1 Wilcoxon signed-rank one-sided test. Simply verifying that there is no significant difference between the Satori and WoZ conditions does not ensure that the two conditions are similar. Instead, we aim to test whether Satori is no worse than WoZ by a predefined margin $\Delta$. To achieve this, we use a one-sided Wilcoxon signed-rank test. The test defines $D_i = X_{S,i} - X_{W,i}$ as the difference between the scores for each participant $i$ under conditions $S$ (Satori) and $W$ (WoZ), respectively. The adjusted difference accounting for the margin is given by $D'_i = D_i - \Delta = X_{S,i} - X_{W,i} - \Delta$.
Question | Satori Mean [95% CI] | WoZ Mean [95% CI] | Vanilla W | Vanilla p | Non-Inf. W | Non-Inf. p
[Q1] I can easily comprehend content via text/audio/image guidance. | 6.25 [6.00, 6.75] | 5.94 [5.25, 6.50] | 26.500 | 0.099 | 89.500 | 0.001
[Q2] The guidance's content is helpful in completing the task. | 6.22 [5.75, 6.75] | 5.80 [5.50, 6.50] | 26.000 | 0.094 | 131.000 | 0.000
[Q3] I think the guidance appears at the right moment. | 5.97 [6.00, 6.50] | 5.53 [5.00, 6.25] | 17.500 | 0.090 | 125.000 | 0.001
[Q4] I found that the guidance accurately reflects my task intentions. | 6.48 [6.00, 7.00] | 5.95 [5.62, 6.50] | 11.500 | 0.016 | 134.500 | 0.000
[Q5] The guidance appears at an adequate location. | 6.23 [5.88, 6.75] | 5.66 [5.25, 6.25] | 15.000 | 0.032 | 131.500 | 0.000
[Q6] I am able to complete my work quickly using this system. | 6.08 [5.50, 6.62] | 5.75 [5.25, 6.50] | 30.000 | 0.273 | 108.500 | 0.003
[Q7] It was easy to learn to use this system. | 6.48 [6.00, 7.00] | 6.06 [5.75, 7.00] | 22.000 | 0.179 | 103.500 | 0.001
[Q8] How engaged am I using the system? | 6.16 [5.88, 6.50] | 5.75 [5.38, 6.50] | 20.500 | 0.145 | 109.500 | 0.002
[Q9] The system's guidance matches the context. | 6.27 [6.00, 6.75] | 6.05 [5.62, 7.00] | 32.500 | 0.357 | 91.000 | 0.001
[Q10] Overall, the system's guidance frequency is suitable. | 6.30 [5.75, 6.75] | 5.92 [5.75, 6.50] | 30.000 | 0.156 | 97.500 | 0.002
[Q11] Overall, I think the system can be beneficial in my everyday life. | 5.94 [5.50, 6.50] | 5.58 [5.25, 6.25] | 30.000 | 0.277 | 105.500 | 0.005
Table 4. Results for the usability scale questions and non-inferiority tests. The table summarizes the mean scores and 95% confidence intervals (CI) for each system (our Satori system and the WoZ condition designed by AR designers) across user experience questions related to system usability. The "Vanilla" columns provide the Wilcoxon signed-rank test results (W statistic and p-values) for significant differences between systems. The "Non-Inferiority" columns show W statistics and p-values testing whether Satori's performance is non-inferior to WoZ within a set margin. Green-highlighted cells in the original table indicate established non-inferiority, suggesting that Satori performs as well as or better than WoZ in user guidance.
The hypotheses for this non-inferiority test are:
$H_0$: $\mathrm{median}(D') > 0$ (Satori is worse than WoZ by more than $\Delta$),
$H_1$: $\mathrm{median}(D') \leq 0$ (Satori is no worse than WoZ by at most $\Delta$).
Similar to the vanilla Wilcoxon signed-rank test, the procedure involves ranking the absolute adjusted differences $|D'_i|$, calculating the sum of ranks for positive ($W^+$) and negative ($W^-$) differences, and using the test statistic $W = \min(W^+, W^-)$ to compute a one-sided p-value. This p-value indicates whether we can reject $H_0$ in favor of $H_1$. We chose the margin values $\Delta_{TLX} = 2.5$ for the NASA TLX and $\Delta_{us} = 0.5$ for the usability scale, as each represents half of a rating interval.
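Under the formulation above, the non-inferiority test can be sketched with SciPy's Wilcoxon signed-rank implementation. The direction of the alternative follows the stated hypotheses, and the snippet is an illustration rather than the exact analysis script.

```python
import numpy as np
from scipy.stats import wilcoxon

def non_inferiority_wilcoxon(satori_scores, woz_scores, margin):
    """One-sided Wilcoxon signed-rank test on margin-adjusted paired differences."""
    d = np.asarray(satori_scores, float) - np.asarray(woz_scores, float)  # D_i
    d_adj = d - margin                                                    # D'_i = D_i - delta
    # H0: median(D') > 0; H1: median(D') <= 0, so we test the 'less' alternative.
    stat, p = wilcoxon(d_adj, alternative="less")
    return stat, p

# e.g., usability margin of half a rating interval:
# stat, p = non_inferiority_wilcoxon(satori_q1_means, woz_q1_means, margin=0.5)
```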
8 User Study Result
8.1 Usability Scale Result
Fig. 7. NASA TLX box plots for WoZ and Satori. The plot illustrates the distribution of cognitive load ratings across six dimensions: Mental Demand, Physical Demand, Temporal Demand, Performance, Effort, and Frustration. Each box represents the interquartile range (IQR) with the median marked by a horizontal line, showing the variability and central tendency of participants' workload ratings for both systems. The comparison highlights differences in perceived workload between the WoZ and Satori conditions, providing insights into the effectiveness and usability of each approach.
We present the participants' raw scale data across the different tasks in Figure 8 and the processed statistics in Table 4. The usability of Satori was assessed using a series of questions (Q1-Q11) related to the ease of comprehension, helpfulness,
timeliness, and overall effectiveness of the system's guidance. For most questions, there was no significant difference between Satori and the WoZ condition, as indicated by p-values above the conventional significance threshold (e.g., Q1: p = 0.099, Q2: p = 0.094, Q3: p = 0.090, Q6: p = 0.273). However, non-inferiority tests demonstrated that Satori was not worse than the WoZ condition for these questions, with non-inferiority p-values well below 0.05 (e.g., Q1: p = 0.001, Q2: p < 0.001, Q6: p = 0.001) and a margin of 0.5. Below we analyze the fine-grained results.
8.1.1 Satori can provide timely assistance. Satori demonstrates its capability to provide timely guidance to users by dynamically aligning its assistance with the user's current task context. This result is reflected by Q3 (p = 0.090, non-inferiority p < 0.05) and Q11 (p = 0.357, non-inferiority p < 0.05). Specifically, Satori achieved higher mean ratings than the WoZ condition. For instance, P6 noted that "there is no circumstance, where my intention does not align with the provided guidance," underscoring the perceived reliability of Satori in understanding and anticipating user needs. Furthermore, the BDI model helps regulate the assistance timing and eliminate duplicated guidance, contributing to a better user rating.
8.1.2 Satori can provide comprehensible and effective guidance through multi-modal content and intention-related task instructions. Reflected by Q2 (p = 0.094, non-inferiority p < 0.05), Satori provides helpful content for guiding the user to achieve their desired task. We find that the participants highly appreciated the effectiveness of the multi-modal content in this guidance. For example, P3 stated, "I liked that it combines the various modalities of text, audio, and image to generate guidance, I believe that was helpful on multiple occasions where I might have been uncertain with only a single modality." Furthermore, assistance matching the participant's belief state contributes to the effectiveness of the system. P14 commented, "The guidance helps me a lot, especially in coffee making. It provides me with very detailed instructions including time, and amount of coffee beans I need. I would have to google it if I don't have the guidance." Similarly, P8 appreciated the detailed guidance, noting, "For task like arranging the flower vase, the intricate details like trim the leaves, cutting the stem at 45 degree etc. are very necessary details that I might not have performed on my own." The comprehensibility of the assistance content is supported by Q1 (p = 0.099, non-inferiority p < 0.05). The image content is comprehensible with our carefully designed image prompts. For example, P1 noticed that "the picture of the second one is very nice and it looks good." We also notice that the WoZ condition offers impressive accessibility, as P8 remarked, "Guidance as a whole (text, images, and animations) was very helpful. Whereas, text alone as shown in the image lacks information."
Fig. 8. User ratings of system usability on a seven-point Likert scale. There are sixteen participants in the study and the ratings are encoded by color from red to blue. The figure compares the responses for the Satori and WoZ systems across four tasks: Arranging Flowers, Making Coffee, Cleaning the Room, and Connecting a Console. Each bar represents the distribution of responses for a specific usability question, highlighting differences in user satisfaction, comprehensibility, and task support provided by both systems.
8.1.3 Satori demonstrates competitiveness with the WoZ condition, achieving similar effectiveness in task guidance while fostering a more engaging and trustworthy user experience. We observe that for all questions, Satori demonstrates a non-inferior result compared to the WoZ condition (non-inferiority p < 0.05). Given that the WoZ was designed by experts who carefully considered the timing and content of the assistance for the specific scenario, this result indicates that Satori can achieve promising outcomes without requiring extensive manual effort, thus demonstrating better generalizability. Participant feedback underscores the increased engagement facilitated by Satori's design. For example, P3 remarked, "Not a singular component by itself, but all components together do make me more engaged," indicating that the combination of transparency, interpretability, and contextual guidance creates a cohesive and compelling user experience. Similarly, P10 expressed a sense of active involvement in the task, stating, "Yes. It may automatically detect my progress to make me more engaged in the task." This sentiment reflects how the system's automatic action-finish display aligns with user expectations and promotes a sense of partnership between the user and the Satori system. While Satori effectively matches the guidance performance of the WoZ and excels in fostering engagement, it does have limitations in the modalities of guidance provided. Unlike the WoZ, which utilizes more varied and dynamic modalities, such as animations and warnings, Satori's guidance is comparatively straightforward. P1 pointed out, "The animation is really clear," highlighting the benefit of animated guidance, which is only available in the WoZ system.
8.1.4 Perceived latency and system errors. Even though Satori manages to reduce latency through the early forecasting mechanism, the latency is still observable to participants. P8 mentioned, "most of the time the system knows what I have done in the past step eventually, but I wish it could be more responsive so I don't need to wait for the system to recognize what I have done." This latency is mainly caused by the action finish detection module, which shares GPU resources with the belief prediction module. We notice that the data transfer between the GPU and CPU can sometimes lead to latency in the action finish detection module. We could eliminate this latency by adding more computing resources to avoid memory sharing between different modules. Another technical challenge contributing to these delays is the limited field of view (FoV) of the HoloLens 2, which affects the system's ability to capture complete contextual information when the user's head position changes, such as looking up. In such cases, the camera may fail to detect relevant scene information, resulting in a temporary guidance mismatch. However, we also notice that Satori's interaction mechanisms, which allow users to manually adjust or correct the system's understanding, can help participants address the system latency manually. For example, P12 mentioned, "I like the check mark and the previous/next step."
8.1.5 Satori shows promising results for building a proactive AR assistant in everyday life. Q11 (p = 0.277, non-inferiority p < 0.05) indicates that participants believe Satori has the potential to be generalized to everyday scenarios. For example, P9 mentioned, "maybe when we need to assemble furniture, instead of going through the manual back and forth all the time, we can just have this system to guide us." Since Satori does not rely on a specific label set or manual configuration, it can be directly applied to new tasks. Furthermore, Q7 (p = 0.179, non-inferiority p < 0.05) suggests that Satori has a flat learning curve, which was supported by participant feedback, as most participants acknowledged they would not need additional training. Such ease of learning opens opportunities to turn Satori into a system for learning new tasks, as P10 mentioned, "(The system can be used for) learning to complete a difficult task."
8.2 NASA TLX Result
We present the test results in Table 5 and the mean values in Figure 7. While there is no significant difference between Satori and WoZ on any of the TLX measures, Satori is significantly no worse than WoZ on performance and frustration.
8.2.1 Mental demand, physical demand, and temporal demand. For the Mental Demand dimension, there is no significant difference between Satori and the WoZ condition (p = 0.744). The 95% confidence interval for Satori is [17.50, 43.81], while for the WoZ condition it is [16.25, 50.00]. These overlapping confidence intervals suggest similar levels of mental demand perceived by participants in both conditions. A similar pattern is reflected in the physical demand (p = 0.776, non-inferiority p = 0.281). We observe that the physical demand may be attributed to the increased interaction required by Satori. As shown in Sec. 6.5, Satori relies on interaction for intention confirmation and assistance control. Such interactions with the buttons can be challenging on HoloLens 2, requiring participants to perform the clicking action multiple times. For example, P9 mentioned, "the biggest difficulty for me was using the AR device. It was frustrating when I failed to click on the button multiple times." The temporal demand shows a similar pattern (p = 0.274, non-inferiority p = 0.388), indicating that Satori does not increase the time pressure for completing the task.
Dimension | Satori Mean [95% CI] | WoZ Mean [95% CI] | Vanilla W | Vanilla p | Non-Inf. W | Non-Inf. p
Mental Demand | 34.06 [17.50, 43.81] | 33.12 [16.25, 50.00] | 60.500 | 0.744 | 78.000 | 0.316
Physical Demand | 32.34 [11.25, 50.00] | 30.47 [10.00, 43.75] | 55.000 | 0.776 | 80.000 | 0.281
Temporal Demand | 28.52 [21.25, 37.50] | 26.41 [15.00, 32.50] | 46.000 | 0.274 | 65.000 | 0.388
Performance | 16.17 [7.50, 21.25] | 17.27 [7.50, 20.00] | 44.000 | 0.593 | 95.500 | 0.022
Effort | 28.20 [17.50, 37.50] | 26.02 [15.00, 36.25] | 52.500 | 0.464 | 58.500 | 0.353
Frustration | 19.84 [10.00, 28.75] | 26.95 [11.25, 34.38] | 41.500 | 0.175 | 126.000 | 0.001
Table 5. Results for the NASA TLX questions and non-inferiority tests. This table shows the mean scores and 95% confidence intervals (CI) for the Satori and WoZ systems across six dimensions: Mental Demand, Physical Demand, Temporal Demand, Performance, Effort, and Frustration. The vanilla Wilcoxon signed-rank test results and the non-inferiority tests (highlighted in green in the original table) indicate whether the Satori system performs comparably to or better than the WoZ system in terms of cognitive load.
8.2.2 Performance. The performance dimension shows no significant difference between Satori and the Wizard-of-Oz condition (p = 0.593). Furthermore, the non-inferiority test indicates that Satori is no worse than the Wizard-of-Oz condition (p = 0.593, non-inferiority p < 0.05), strictly verifying that Satori provides a performance experience that is equivalent to that of the WoZ condition.
8.2.3 Effort. Regarding effort, there is no significant difference between Satori and the WoZ condition (p = 0.464, non-inferiority p = 0.353). We believe this is reasonable, as most of the effort lies in task completion rather than in using the AR assistant interface.
8.2.4 Frustration. Satori achieved strictly no-worse-than performance compared to the WoZ condition (p = 0.175, non-inferiority p < 0.05). Furthermore, the mean values suggest that Satori has lower frustration (Satori: 19.84 vs. WoZ: 26.95). We believe this can be explained by the more engaging user experience and the system's transparency. As P5 mentioned, "It (Satori) gives me the impression that the machine understands what I'm doing, making its instructions feel trustworthy." This aligns with our findings in Sec. 8.1.3.
9 Discussion
9.1 Towards the Proactive AR Assistant
Our Satori system represents an early attempt to provide appropriate assistance at the right time. The study findings on timing, comprehensibility, and cognitive load all demonstrate that Satori performs similarly to AR assistance created by human designers. However, there is still substantial room for improvement to fully realize the vision of a truly proactive AR assistant. For example, the current system still has a latency of about 2-3 seconds, limiting its applicability to tasks that do not demand rapid responses. We describe our insights below:
9.1.1 A broader range of assistance modalities is needed to accommodate the diversity of human tasks. Human tasks are inherently diverse, ranging from highly cognitive activities to more physical tasks [?]. Therefore, expanding the modalities of assistance could significantly enhance the system's adaptability and usability. For example, incorporating additional modalities such as auditory warnings, dynamic animations, and spatial object indicators could provide more intuitive and immediate feedback, especially in tasks requiring quick reflexes or spatial awareness. These modalities can help bridge the gap between virtual guidance and real-world task execution by aligning the assistance more closely with the specific demands of the task.
9.1.2 Greater access to environmental information is crucial for predicting the timing and content of the assistance. Currently, the HoloLens 2 setup only provides an egocentric view of the user's environment, which restricts the information available to the model. This limitation constrains the performance of our system, particularly in understanding the broader context of the user's surroundings and interactions. Enhancing the system with additional environmental sensing capabilities, such as better cameras or third-person views, could provide a more holistic understanding of the environment. This would allow the AR assistant to offer more precise and contextually appropriate guidance. For instance, a third-person view could help in scenarios where the user's immediate perspective is limited or obstructed, enabling the system to infer critical information that would otherwise be missed [46].
9.1.3 Consideration of collaborative interactions and social dynamics is crucial for future development. Humans often perform tasks in group settings, where individual intentions are influenced by social interactions, such as collaboration, negotiation, and shared goals. In our formative studies, the AR designers mentioned that a truly proactive AR assistant should be capable of recognizing these social contexts and adapting its guidance to support not only individual users but also the group as a whole. Other non-AR studies also confirm the importance of collaborative interaction in designing an AI assistant [?].
9.2 Limitation and Future Directions
The primary limitation lies in the incomplete prediction of user beliefs (i.e., surrounding objects, history, and actions). In psychology, human beliefs are highly complex and nuanced, and our current implementation only partially captures this complexity. Research in cognitive psychology suggests that human beliefs are influenced by a variety of factors including personal experiences, social influences, and cognitive biases [30, 77]. Thus, our system may benefit from incorporating more sophisticated models of belief formation and updating, drawing from interdisciplinary research on cognitive science and decision-making.
Additionally, our implementation relies on the GPT model, which suffers from latency issues. Although we experimented with LLaVA in our preliminary studies, the results were not satisfactory. To address this, we plan to explore alternative models and techniques for improving the efficiency and responsiveness of our system. This may involve fine-tuning existing models or investigating novel architectures that better suit our application domain.
Furthermore, the limited field of view provided by devices such as the HoloLens poses a challenge for object detection and prediction accuracy. This limitation becomes particularly evident when users raise their heads to view visual content, as the camera may not be centered on the relevant task objects. Future iterations of our system could explore strategies to mitigate this issue, such as incorporating multi-view object detection algorithms or optimizing the placement of virtual elements within the user's field of view.
9.3 Design Lessons
In designing Satori, we went through multiple iterative stages. Our initial goal was to build an AR guidance system that provided step-by-step instructions based on predefined recipes. However, this approach proved to have limited usability due to the lack of generalizability of predefined recipes and the difficulty of creating them. To address this, we shifted our focus to supporting a more flexible task model using task-agnostic machine learning models. We experimented with various implementations based on object detection, action detection, and our current backbone GPT models. A major challenge we encountered was cascading errors, where an error in one step could lead to further errors in subsequent steps. For instance, if an earlier step is not detected as completed and the user proceeds to the next step, the system might fail to detect this transition and remain stuck, as the user's actions no longer correspond to the expected step. To address this issue, we introduced user interaction into the system. Although we observed that this interaction could increase task load and lead to some negative feedback, it allowed users to provide feedback on model predictions, thereby improving prediction accuracy in the remaining steps.
10 Conclusion
We presented Satori, a proactive AR assistant system that integrates the BDI model with a deep model-LLM fusion architecture to provide context-aware, multimodal guidance in AR environments. Our research addresses the current limitation by focusing on proactive assistance that can understand user intentions, anticipate needs, and provide timely and relevant guidance during complex tasks. Through two formative studies involving twelve experts, we identified key design requirements for adapting the BDI model to the AR domain, emphasizing the importance of understanding human actions, surrounding objects, and task context. Building on these findings, Satori was developed to leverage the BDI model for inferring user intentions and guiding them through immediate next steps. Our empirical study with sixteen users demonstrated that Satori performs comparably to designer-created AR guidance systems in terms of timing, comprehensibility, usefulness, and efficacy, while also offering greater generalizability and reusability. The results of our research indicate that combining the BDI model with multi-modal deep learning architectures provides a robust framework for developing proactive AR assistance. By capturing both user intentions and semantic context, Satori reduces the need for customized AR guidance for each specific scenario, addressing the scalability challenges faced by current AR systems. Our work opens new possibilities for enhancing AR experiences by providing adaptive, proactive assistance that can support users effectively across a wide range of tasks and environments.
References
[1]
Dejanira Araiza-Illan, Tony Pipe, and Kerstin Eder. 2016. Model-based testing, using belief-desire-intentions agents, of control code for robots in
collaborative human-robot interactions. arXiv preprint arXiv:1603.00656 (2016).
[2]
James Baumeister, Seung Youb Ssin, Neven AM ElSayed, Jillian Dorrian, David P Webb, James A Walsh, Timothy M Simon, Andrew Irlitti, Ross T
Smith, Mark Kohler, et al. 2017. Cognitive cost of using augmented reality displays. IEEE Transactions on Visualization and Computer Graphics 23, 11 (2017), 2378–2388.
[3]
David Benyon and Dianne Murray. 1993. Adaptive systems: from intelligent tutoring to autonomous agents. Knowl. Based Syst. 6, 4 (1993), 179–219.
https://doi.org/10.1016/0950-7051(93)90012-I
[4]
Dan Bohus, Sean Andrist, Nick Saw, Ann Paradiso, Ishani Chakraborty, and Mahdi Rad. 2024. SIGMA: An Open-Source Interactive System for
Mixed-Reality Task Assistance Research–Extended Abstract. In 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops
(VRW). IEEE, 889–890.
[5]
Rafael H Bordini, Amal El Fallah Seghrouchni, Koen Hindriks, Brian Logan, and Alessandro Ricci. 2020. Agent programming in the cognitive era.
Autonomous Agents and Multi-Agent Systems 34 (2020), 1–31.
[6]
Diego Borro, Ángel Suescun, Alfonso Brazález, José Manuel González, Eloy Ortega, and Eduardo González. 2021. WARM: Wearable AR and
tablet-based assistant systems for bus maintenance. Applied Sciences 11, 4 (2021), 1443.
[7]
Carola Botto, Alberto Cannavò, Daniele Cappuccio, Giada Morat, Amir Nematollahi Sarvestani, Paolo Ricci, Valentina Demarchi, and Alessandra
Saturnino. 2020. Augmented Reality for the Manufacturing Industry: The Case of an Assembly Assistant. In 2020 IEEE Conference on Virtual
Reality and 3D User Interfaces Abstracts and Workshops, VR Workshops, Atlanta, GA, USA, March 22-26, 2020. IEEE, 299–304. https://doi.org/10.1109/
VRW50115.2020.00068
[8] Michael Bratman. 1987. Intention, plans, and practical reason. (1987).
[9]
Lars Braubach, Alexander Pokahr, and Winfried Lamersdorf. 2005. Jadex: A BDI-agent system combining middleware and reasoning. In Software
agent-based applications, platforms and development kits. Springer, 143–168.
[10] Virginia Braun and Victoria Clarke. 2012. Thematic analysis. American Psychological Association.
[11] Andrei Broder. 2002. A taxonomy of web search. In ACM Sigir forum, Vol. 36. ACM New York, NY, USA, 3–10.
[12]
Paolo Busetta, Ralph Rönnquist, Andrew Hodgson, and Andrew Lucas. 1999. Jack intelligent agents-components for intelligent agents in java.
AgentLink News Letter 2, 1 (1999), 2–5.
[13]
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection
with transformers. In European conference on computer vision. Springer, 213–229.
[14]
Karen Church and Barry Smyth. 2009. Understanding the intent behind mobile information needs. In Proceedings of the 14th international conference
on Intelligent user interfaces. 247–256.
[15] Philip R Cohen and Hector J Levesque. 1990. Intention is choice with commitment. Artificial Intelligence 42, 2-3 (1990), 213–261.
[16]
J. D’Agostini, L. Bonetti, A. Salee, L. Passerini, G. Fiacco, P. Lavanda, E. Motti, Michele Stocco, K. T. Gashay, E. G. Abebe, S. M. Alemu, R. Haghani, A.
Voltolini, Christophe Strobbe, Nicola Covre, G. Santolini, M. Armellini, T. Sacchi, D. Ronchese, C. Furlan, F. Facchinato, Luca Maule, Paolo Tomasin,
Alberto Fornaser, and Mariolino De Cecco. 2018. An Augmented Reality Virtual Assistant to Help Mild Cognitive Impaired Users in Cooking a
System Able to Recognize the User Status and Personalize the Support. In 2018 Workshop on Metrology for Industry 4.0 and IoT, Brescia, Italy, April
16-18, 2018. IEEE, 12–17. https://doi.org/10.1109/METROI4.2018.8428314
[17]
David Dearman, Melanie Kellar, and Khai N Truong. 2008. An examination of daily information needs and sharing opportunities. In Proceedings of
the 2008 ACM conference on Computer supported cooperative work. 679–688.
[18]
Yang Deng, Wenqiang Lei, Minlie Huang, and Tat-Seng Chua. 2023. Rethinking Conversational Agents in the Era of LLMs: Proactivity, Non-
collaborativity, and Beyond. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
in the Asia Pacific Region (Beijing, China) (SIGIR-AP '23). Association for Computing Machinery, New York, NY, USA, 298–301. https://doi.org/10.1145/3624918.3629548
[19]
Brian R Duffy, Mauro Dragone, and Gregory MP O'Hare. 2005. Social robot architecture: A framework for explicit social interaction. In Android
Science: Towards Social Mechanisms, CogSci 2005 Workshop, Stresa, Italy. 3–4.
[20]
David Escobar-Castillejos, Julieta Noguez, Fernando Bello, Luis Neri, Alejandra J Magana, and Bedrich Benes. 2020. A review of training and
guidance systems in medical surgery. Applied Sciences 10, 17 (2020), 5752.
[21]
Loris Fichera, Daniele Marletta, Vincenzo Nicosia, and Corrado Santoro. 2011. Flexible robot strategy design using belief-desire-intention model. In
Research and Education in Robotics-EUROBOT 2010: International Conference, Rapperswil-Jona, Switzerland, May 27-30, 2010, Revised Selected Papers.
Springer, 57–71.
[22]
James Frandsen, Joe Tenny, Walter Frandsen Jr, and Yuri Hovanski. 2023. An augmented reality maintenance assistant with real-time quality
inspection on handheld mobile devices. The International Journal of Advanced Manufacturing Technology 125, 9 (2023), 4253–4270.
[23]
Qi Gao, Wei Xu, Mowei Shen, and Zaifeng Gao. 2023. Agent Teaming Situation Awareness (ATSA): A Situation Awareness Framework for Human-AI
Teaming. CoRR abs/2308.16785 (2023). https://doi.org/10.48550/ARXIV.2308.16785 arXiv:2308.16785
[24]
Michael Georgeff, Barney Pell, Martha Pollack, Milind Tambe, and Michael Wooldridge. 1999. The belief-desire-intention model of agency. In
Intelligent Agents V: Agents Theories, Architectures, and Languages: 5th International Workshop, ATAL’98 Paris, France, July 4–7, 1998 Proceedings 5.
Springer, 1–10.
[25]
Sebastian Gottifredi, Mariano Tucat, Daniel Corbatta, Alejandro Javier García, and Guillermo Ricardo Simari. 2008. A BDI architecture for high level
robot deliberation. In XIV Congreso Argentino de Ciencias de la Computación.
[26]
Morgan Harvey, Marc Langheinrich, and Geoff Ward. 2016. Remembering through lifelogging: A survey of human memory augmentation. Pervasive
Mob. Comput. 27 (2016), 14–26. https://doi.org/10.1016/J.PMCJ.2015.12.002
[27]
Koen V Hindriks. 2009. Programming rational agents in GOAL. In Multi-agent programming: Languages, tools and applications. Springer, 119–157.
[28]
Koen V Hindriks, Frank S de Boer, Wiebe van der Hoek, and John-Jules Ch Meyer. 1998. Formal semantics for an abstract agent programming
language. In Intelligent Agents IV Agent Theories, Architectures, and Languages: 4th International Workshop, ATAL’97 Providence, Rhode Island, USA,
July 24–26, 1997 Proceedings 4. Springer, 215–229.
[29]
Pranut Jain, Rosta Farzan, and Adam J Lee. 2023. Co-Designing with Users the Explanations for a Proactive Auto-Response Messaging Agent.
Proceedings of the ACM on Human-Computer Interaction 7, MHCI (2023), 1–23.
[30]
Daniel Kahneman and Amos Tversky. 2013. Prospect theory: An analysis of decision under risk. In Handbook of the Fundamentals of Financial Decision Making: Part I. World Scientific, 99–127.
[31]
Burak Karaduman, Baris Tekin Tezel, and Moharram Challenger. 2023. Rational software agents with the BDI reasoning model for Cyber–Physical
Systems. Engineering Applications of Artificial Intelligence 123 (2023), 106478.
[32]
Kangsoo Kim, Luke Boelling, Steffen Haesler, Jeremy Bailenson, Gerd Bruder, and Greg F Welch. 2018. Does a digital assistant need a body? The influence of visual embodiment and social behavior on the perception of intelligent virtual agents in AR. In 2018 IEEE International Symposium on
Mixed and Augmented Reality (ISMAR). IEEE, 105–114.
[33]
Sojung Kim, Hui Xi, Santosh Mungle, and Young-Jun Son. 2012. Modeling human interactions with learning under the extended belief-desire-intention
framework. In IIE Annual Conference. Proceedings. Institute of Industrial and Systems Engineers (IISE), 1.
[34]
David Kinny, Michael Georgeff, and Anand Rao. 1996. A methodology and modelling technique for systems of BDI agents. In European workshop on
modelling autonomous agents in a multi-agent world. Springer, 56–71.
[35]
Maximilian König, Martin Stadlmaier, Tobias Rusch, Robin Sochor, Lukas Merkel, Stefan Braunreuther, and Johannes Schilp. 2019. MA 2 RA-manual
assembly augmented reality assistant. In 2019 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM). IEEE,
501–505.
[36]
Fotios K Konstantinidis, Ioannis Kansizoglou, Nicholas Santavas, Spyridon G Mouroutsos, and Antonios Gasteratos. 2020. Marma: A mobile
augmented reality maintenance assistant for fast-track repair procedures in the context of industry 4.0. Machines 8, 4 (2020), 88.
[37]
Chulmo Koo, Youhee Joun, Heejeong Han, and Namho Chung. 2016. A structural model for destination travel intention as a media exposure:
Belief-desire-intention model perspective. International Journal of Contemporary Hospitality Management 28, 7 (2016), 1338–1360.
[38]
Matthias Kraus, Marvin R. G. Schiller, Gregor Behnke, Pascal Bercher, Michael Dorna, Michael Dambier, Birte Glimm, Susanne Biundo, and Wolfgang
Minker. 2020. "Was that successful?" On Integrating Proactive Meta-Dialogue in a DIY-Assistant using Multimodal Cues. In ICMI ’20: International
Conference on Multimodal Interaction, Virtual Event, The Netherlands, October 25-29, 2020, Khiet P. Truong, Dirk Heylen, Mary Czerwinski, Nadia
Berthouze, Mohamed Chetouani, and Mikio Nakano (Eds.). ACM, 585–594. https://doi.org/10.1145/3382507.3418818
[39]
Matthias Kraus, Nicolas Wagner, Zoraida Callejas, and Wolfgang Minker. 2021. The Role of Trust in Proactive Conversational Assistants. IEEE
Access 9 (2021), 112821–112836. https://doi.org/10.1109/ACCESS.2021.3103893
[40]
Matthias Kraus, Nicolas Wagner, and Wolfgang Minker. 2020. Effects of Proactive Dialogue Strategies on Human-Computer Trust. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, UMAP 2020, Genoa, Italy, July 12-18, 2020, Tsvi Kuflik, Ilaria Torre, Robin
Burke, and Cristina Gena (Eds.). ACM, 107–116. https://doi.org/10.1145/3340631.3394840
[41]
Matthias Kraus, Nicolas Wagner, and Wolfgang Minker. 2021. Modelling and Predicting Trust for Developing Proactive Dialogue Strategies in Mixed-
Initiative Interaction. In ICMI ’21: International Conference on Multimodal Interaction, Montréal, QC, Canada, October 18-22, 2021, Zakia Hammal, Carlos
Busso, Catherine Pelachaud, Sharon L. Oviatt, Albert Ali Salah, and Guoying Zhao (Eds.). ACM, 131–140. https://doi.org/10.1145/3462244.3479906
[42]
Matthias Kraus, Nicolas Wagner, and Wolfgang Minker. 2022. ProDial - An Annotated Proactive Dialogue Act Corpus for Conversational Assistants
using Crowdsourcing. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, LREC 2022, Marseille, France, 20-25 June
2022, Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara,
Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources Association, 3164–3173.
https://aclanthology.org/2022.lrec-1.339
[43]
Yi Lai, Atreyi Kankanhalli, and Desmond C. Ong. 2021. Human-AI Collaboration in Healthcare: A Review and Research Agenda. In 54th Hawaii
International Conference on System Sciences, HICSS 2021, Kauai, Hawaii, USA, January 5, 2021. ScholarSpace, 1–10. https://hdl.handle.net/10125/70657
[44]
Ze-Hao Lai, Wenjin Tao, Ming C Leu, and Zhaozheng Yin. 2020. Smart augmented reality instructional system for mechanical assembly towards
worker-centered intelligent manufacturing. Journal of Manufacturing Systems 55 (2020), 69–81.
[45]
Jean-François Lapointe, Mohand Saïd Allili, Luc Belliveau, Loucif Hebbache, Dariush Amirkhani, and Hicham Sekkati. 2022. AI-AR for Bridge
Inspection by Drone. In Virtual, Augmented and Mixed Reality: Applications in Education, Aviation and Industry: 14th International Conference,
VAMR 2022, Held as Part of the 24th HCI International Conference, HCII 2022, Virtual Event, June 26 – July 1, 2022, Proceedings, Part II. Springer-Verlag,
Berlin, Heidelberg, 302–313. https://doi.org/10.1007/978-3-031-06015-1_21
[46]
Gun A Lee, Hye Sun Park, and Mark Billinghurst. 2019. Optical-reflection type 3D augmented reality mirrors. In Proceedings of the 25th ACM
symposium on virtual reality software and technology. 1–2.
[47]
Mina Lee, Percy Liang, and Qian Yang. 2022. CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model
Capabilities. In CHI ’22: CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 29 April 2022 - 5 May 2022, Simone D. J.
Barbosa, Cliff Lampe, Caroline Appert, David A. Shamma, Steven Mark Drucker, Julie R. Williamson, and Koji Yatani (Eds.). ACM, 388:1–388:19.
https://doi.org/10.1145/3491102.3502030
[48]
Seung Ho Lee. 2009. Integrated human decision behavior modeling under an extended belief-desire-intention framework. The University of Arizona.
[49]
Alan M Leslie, Tim P German, and Pamela Polizzi. 2005. Belief-desire reasoning as a process of selection. Cognitive psychology 50, 1 (2005), 45–85.
[50]
James R Lewis. 1995. IBM computer usability satisfaction questionnaires: psychometric evaluation and instructions for use. International Journal of
Human-Computer Interaction 7, 1 (1995), 57–78.
[51]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders
and Large Language Models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Proceedings of
Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett
(Eds.). PMLR, 19730–19742. https://proceedings.mlr.press/v202/li23q.html
[52]
Jiahao Nick Li, Yan Xu, Tovi Grossman, Stephanie Santosa, and Michelle Li. 2024. OmniActions: Predicting Digital Actions in Response to Real-World
Multimodal Sensory Inputs with LLMs. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–22.
[53]
David Lindlbauer, Anna Maria Feit, and Otmar Hilliges. 2019. Context-aware online adaptation of mixed reality interfaces. In Proceedings of the 32nd
annual ACM symposium on user interface software and technology. 147–160.
[54]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in neural information processing systems 36
(2024).
[55]
Vivian Liu and Lydia B. Chilton. 2022. Design Guidelines for Prompt Engineering Text-to-Image Generative Models. In CHI ’22: CHI Conference
on Human Factors in Computing Systems, New Orleans, LA, USA, 29 April 2022 - 5 May 2022, Simone D. J. Barbosa, Cli Lampe, Caroline Appert,
David A. Shamma, Steven Mark Drucker, Julie R. Williamson, and Koji Yatani (Eds.). ACM, 384:1–384:23. https://doi.org/10.1145/3491102.3501825
[56] Frederick Hansen Lund. 1925. The psychology of belief. The Journal of Abnormal and Social Psychology 20, 1 (1925), 63.
[57]
Amama Mahmood, Jeanie W. Fung, Isabel Won, and Chien-Ming Huang. 2022. Owning Mistakes Sincerely: Strategies for Mitigating AI Errors.
In CHI ’22: CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 29 April 2022 - 5 May 2022, Simone D. J. Barbosa,
Cli Lampe, Caroline Appert, David A. Shamma, Steven Mark Drucker, Julie R. Williamson, and Koji Yatani (Eds.). ACM, 578:1–578:11. https:
//doi.org/10.1145/3491102.3517565
[58]
Michael F. McTear. 1993. User modelling for adaptive computer systems: a survey of recent developments. Artif. Intell. Rev. 7, 3-4 (1993), 157–184.
https://doi.org/10.1007/BF00849553
[59]
Anna-Maria Meck, Christoph Draxler, and Thurid Vogt. 2023. How may I interrupt? Linguistic-driven design guidelines for proactive in-car voice
assistants. International Journal of Human–Computer Interaction (2023), 1–15.
[60]
Christian Meurisch, Maria-Dorina Ionescu, Benedikt Schmidt, and Max Mühlhäuser. 2017. Reference model of next-generation digital personal
assistant: integrating proactive behavior. In Adjunct Proceedings of the 2017 ACM International Joint Conferenceon Per vasive and Ubiquitous Computing
and Proceedings of the 2017 ACM International Symposium on Wearable Computers, UbiComp/ISWC 2017, Maui, HI, USA, September 11-15, 2017,
Seungyon Claire Lee, Leila Takayama, and Khai N. Truong (Eds.). ACM, 149–152. https://doi.org/10.1145/3123024.3123145
[61]
Christian Meurisch, Cristina A. Mihale-Wilson, Adrian Hawlitschek, Florian Giger, Florian Müller, Oliver Hinz, and Max Mühlhäuser. 2020.
Exploring User Expectations of Proactive AI Systems. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4, 4 (2020), 146:1–146:22. https:
//doi.org/10.1145/3432193
[62] Ondrej Miksik, I. Munasinghe, J. Asensio-Cubero, S. Reddy Bethi, S.-T. Huang, S. Zylfo, X. Liu, T. Nica, A. Mitrocsak, S. Mezza, Rory Beard, Ruibo
Shi, Raymond W. M. Ng, Pedro A. M. Mediano, Zafeirios Fountas, S.-H. Lee, J. Medvesek, H. Zhuang, Yvonne Rogers, and Pawel Swietojanski. 2020.
Building Proactive Voice Assistants: When and How (not) to Interact. CoRR abs/2005.01322 (2020). arXiv:2005.01322 https://arxiv.org/abs/2005.01322
[63]
Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab,
Mostafa Dehghani, Zhuoran Shen, et al
.
2022. Simple open-vocabulary object detection. In European Conference on Computer Vision. Springer,
728–755.
[64]
David L Morgan, Jutta Ataie, Paula Carder, and Kim Homan. 2013. Introducing dyadic interviews as a method for collecting qualitative data.
Qualitative health research 23, 9 (2013), 1276–1284.
[65]
Alexandre Pauchet, Nathalie Chaignaud, and Amal El Fallah Seghrouchni. 2007. A computational model of human interaction and planning for
heterogeneous multi-agent systems. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems. 1–3.
[66]
Veljko Pejovic and Mirco Musolesi. 2015. Anticipatory mobile computing: A survey of the state of the art and research challenges. ACM Computing
Surveys (CSUR) 47, 3 (2015), 1–29.
[67]
David Pereira, Eugnio Oliveira, Nelma Moreira, and Lus Sarmento. 2005. Towards an architecture for emotional BDI agents. In 2005 portuguese
conference on articial intelligence. IEEE, 40–46.
[68]
Nicolas Porot and Eric Mandelbaum. 2021. The science of belief: A progress report. Wiley Interdisciplinary Reviews: Cognitive Science 12, 2 (2021),
e1539.
[69]
Long Qian, Anton Deguet, and Peter Kazanzides. 2018. ARssist: augmented reality on a head-mounted display for the rst assistant in robotic
surgery. Healthcare technology letters 5, 5 (2018), 194–200.
[70]
Rodrigo Chacón Quesada and Yiannis Demiris. 2022. Proactive Robot Assistance: Aordance-Aware Augmented Reality User Interfaces. IEEE
Robotics & Automation Magazine 29, 1 (2022), 22–34. https://doi.org/10.1109/MRA.2021.3136789
[71]
Mashqui Rabbi, Min Hane Aung, Mi Zhang, and Tanzeem Choudhury. 2015. MyBehavior: automatic personalized health feedback from user
behaviors and preferences using smartphones. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing,
UbiComp 2015, Osaka, Japan, September 7-11, 2015, Kenji Mase, Marc Langheinrich, Daniel Gatica-Perez, Hans Gellersen, Tanzeem Choudhury, and
Koji Yatani (Eds.). ACM, 707–718. https://doi.org/10.1145/2750858.2805840
[72]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack
Clark, et al
.
2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR,
8748–8763.
[73]
Anand S Rao. 1996. AgentSpeak (L): BDI agents speak out in a logical computable language. In European workshop on modelling autonomous agents
in a multi-agent world. Springer, 42–55.
[74] Anand S Rao and Michael P George. 1997. Modeling rational agents within a BDI-architecture. Readings in agents (1997), 317–328.
[75] Anand S Rao and Michael P George. 1998. Decision procedures for BDI logics. (1998).
[76]
Mengyang Ren, Liang Dong, Ziqing Xia, Jingchen Cong, and Pai Zheng. 2023. A Proactive Interaction Design Method for Personalized User Context
Prediction in Smart-Product Service System. Procedia CIRP 119 (2023), 963–968. https://doi.org/10.1016/j.procir.2023.01.021 The 33rd CIRP Design
Manuscript submitted to ACM
32 Li and Wu, et al.
Conference.
[77] Lee Ross and Richard E Nisbett. 2011. The person and the situation: Perspectives of social psychology. Pinter & Martin Publishers.
[78]
Gabriele Sara, Giuseppe Todde, and Maria Caria. 2022. Assessment of video see-through smart glasses for augmented reality to support technicians
during milking machine maintenance. Scientic Reports 12, 1 (2022), 15729.
[79]
Ruhi Sarikaya. 2017. The Technology Behind Personal Digital Assistants: An overview of the system architecture and key components. IEEE Signal
Process. Mag. 34, 1 (2017), 67–81. https://doi.org/10.1109/MSP.2016.2617341
[80]
Andreas J. Schmid, Oliver Weede, and Heinz Worn. 2007. Proactive Robot Task Selection Given a Human Intention Estimate. In RO-MAN 2007 - The
16th IEEE International Symposium on Robot and Human Interactive Communication. 726–731. https://doi.org/10.1109/ROMAN.2007.4415181
[81]
Benedikt Schmidt, Sebastian Benchea, Rüdiger Eichin, and Christian Meurisch. 2015. Fitness tracker or digital personal coach: how to personalize
training. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2015 ACM
International Symposium on Wearable Computers, UbiComp/ISWC Adjunct 2015, Osaka, Japan, September 7-11, 2015, Kenji Mase, Marc Langheinrich,
Daniel Gatica-Perez, Hans Gellersen, Tanzeem Choudhury, and Koji Yatani (Eds.). ACM, 1063–1067. https://doi.org/10.1145/2800835.2800961
[82]
Maria Schmidt, Wolfgang Minker, and Steen Werner. 2020. How Users React to Proactive Voice Assistant Behavior While Driving. In Proceedings of
The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, Nicoletta Calzolari, Frédéric Béchet, Philippe
Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción
Moreno, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources Association, 485–490. https://aclanthology.org/2020.lrec-1.61/
[83]
Philipp M. Scholl, Matthias Wille, and Kristof Van Laerhoven. 2015. Wearables in the wet lab: a laboratory system for capturing and guiding
experiments. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp 2015, Osaka, Japan,
September 7-11, 2015, Kenji Mase, Marc Langheinrich, Daniel Gatica-Perez, Hans Gellersen, Tanzeem Choudhury, and Koji Yatani (Eds.). ACM,
589–599. https://doi.org/10.1145/2750858.2807547
[84]
Junxiao Shen, John J. Dudley, and Per Ola Kristensson. 2023. Encode-Store-Retrieve: Enhancing Memory Augmentation through Language-Encoded
Egocentric Perception. CoRR abs/2308.05822 (2023). https://doi.org/10.48550/ARXIV.2308.05822 arXiv:2308.05822
[85]
Naai-Jung Shih, Hui-Xu Chen, Tzu-Yu Chen, and Yi-Ting Qiu. 2020. Digital preservation and reconstruction of old cultural elements in augmented
reality (AR). Sustainability 12, 21 (2020), 9262.
[86]
K Ujjwal and J Chodorowski. [n. d.]. A case study of adding proactivity in indoor social robots using belief-desire-intention (bdi) model, vol. 4
(4)(2019).
[87]
Kent P. Vaubel and Charles F. Gettys. 1990. Inferring User Expertise for Adaptive Interfaces. Hum. Comput. Interact. 5, 1 (1990), 95–117. https:
//doi.org/10.1207/S15327051HCI0501_3
[88]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al
.
2022. Chain-of-thought prompting
elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
[89]
Guande Wu, Jianzhe Lin, and Cláudio T. Silva. 2022. IntentVizor: Towards Generic Query Guided Interactive Video Summarization. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 10493–10502. https://doi.org/10.
1109/CVPR52688.2022.01025
[90]
Guande Wu, Chen Zhao, Claudio Silva, and He He. 2024. Your Co-Workers Matter: Evaluating Collaborative Capabilities of Language Models in
Blocks World. arXiv preprint arXiv:2404.00246 (2024).
[91]
Jun Xiao, Richard Catrambone, and John T. Stasko. 2003. Be Quiet? Evaluating Proactive and Reactive User Interface Assistants. In Human-Computer
Interaction INTERACT ’03: IFIP TC13 International Conference on Human-Computer Interaction, 1st-5th September 2003, Zurich, Switzerland, Matthias
Rauterberg, Marino Menozzi, and Janet Wesson (Eds.). IOS Press.
[92]
Surya B. Yadav. 2010. A conceptual model for user-centered quality information retrieval on the World Wide Web. J. Intell. Inf. Syst. 35, 1 (2010),
91–121. https://doi.org/10.1007/S10844-009- 0090-Y
[93]
Neil Yorke-Smith, Shahin Saadati, Karen L. Myers, and David N. Morley. 2012. The Design of a Proactive Personal Agent for Task Management. Int.
J. Artif. Intell. Tools 21, 1 (2012). https://doi.org/10.1142/S0218213012500042
[94]
Nima Zargham, Leon Reicherts, Michael Bonfert, Sarah Theres Voelkel, Johannes Schöning, Rainer Malaka, and Yvonne Rogers. 2022. Understanding
Circumstances for Desirable Proactive Behaviour of Voice Assistants: The Proactivity Dilemma. In CUI 2022: 4th Conference on Conversational User
Interfaces, Glasgow, United Kingdom, July 26 - 28, 2022, Martin Halvey, Mary Ellen Foster, Je Dalton, Cosmin Munteanu, and Johanne Trippas (Eds.).
ACM, 3:1–3:14. https://doi.org/10.1145/3543829.3543834
[95]
Naim Zierau, Christian Engel, Matthias Söllner, and Jan Marco Leimeister. 2020. Trust in Smart Personal Assistants: A Systematic Literature Review
and Development of a Research Agenda. In Entwicklungen, Chancen und Herausforderungen der Digitalisierung: Proceedings der 15. Internationalen
Tagung Wirtschaftsinformatik, WI 2020, Potsdam, Germany, March 9-11, 2020. Zentrale Tracks, Norbert Gronau, Moreen Heine, Hanna Krasnova, and
K. Poustcchi (Eds.). GITO Verlag, 99–114. https://doi.org/10.30844/WI_2020_A7-ZIERAU
Manuscript submitted to ACM
Satori: Towards Proactive AR Assistant with Belief-Desire-Intention User Modeling 33
A Implementation Details
A.1 Backend Streaming Server
The backend server receives streaming data from the HoloLens 2 headset, processes it, and sends the processed result back to the HoloLens 2. To implement this pipeline, we build a streaming server on Redis Streams and extend it with a Lambda-style processing module.
A.1.1 Hardware. Due to hardware limitations, we are unable to host our entire backend program on a single server. As a result, we distribute the backend server across two devices. One server is equipped with an Intel Core i7-8700K CPU @ 3.70GHz and an NVIDIA GeForce GTX 1080 GPU. The other server features an Intel Core i9-10980XE CPU @ 3.00GHz and two NVIDIA RTX 3090 GPUs.
Stream | Description | Data Format
main | The stream which contains the frames sent by the HoloLens 2 headset | Image Bytes
processed_main | The stream which contains the frames processed by an image recognition module to filter out the related frames | Image Bytes
guidance | The stream used to store the generated guidance result | JSON
assistant:images | The stream used to store the generated image assistance | Protobuf Bytes
intent:belief | The stream used to store the inferred belief state | JSON
intent:desire | The stream used to store the inferred desire state | JSON
intent:task_plan | The stream used to store the task plan consisting of the desired task plan and action checkpoints generated by the task-plan LLM | JSON
intent:task:checkpoints | The stream used to store the inferred states of the action checkpoints | JSON
intent:task:step:next | The stream used to store the inferred next intended action | JSON
intent:task:step:current | The stream used to store the inferred current action | JSON
feedback | The stream used to store the user's feedback | JSON
Table 6. The stream list used in the system. We list the stream names, corresponding descriptions, and data formats in the table. The names in the open-sourced code may be slightly different from the names in the development build.
A.1.2 Redis-Streams Module. To accommodate this setup and enable data communication across multiple devices, we introduced the Redis-Streams module. Redis Streams is a Redis feature designed to handle high-throughput streaming data. It allows for managing time-ordered events and is particularly useful for building message queues and real-time data processing systems. We use it to handle the streaming data generated by the HoloLens 2, enabling efficient processing and communication of the data across the HoloLens client, backend server, and distributed model servers. The basic unit in this module is the Stream class, which is defined by a stream key and a pre-defined format, supporting various data types such as Protobuf bytes, JSON, plain strings, and image bytes. Each stream consists of a series of entries, where each entry represents a piece of data indexed by its timestamp. The stream data can be added, read, and processed by the different clients and servers. We list the streams in Table 6. Our server supports both WebSocket and HTTP requests, enabling client-side applications to subscribe to streams over WebSocket connections and to send new data into the streams via WebSocket as well. This setup provides real-time, bidirectional communication between the HoloLens device and the backend server. For scenarios where WebSocket is not available, we also support data submission and retrieval through standard HTTP requests, offering flexibility in how data is transmitted.
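For illustration, the following is a minimal sketch of how a module might append a HoloLens frame to the main stream and read back guidance from the guidance stream using the redis-py package. The stream names follow Table 6; the payload field names ("frame", "timestamp", "data") and the connection settings are assumptions for illustration, not the exact fields used in our codebase.

# Minimal sketch of writing to and reading from the streams in Table 6 (redis-py).
# Field names ("frame", "timestamp", "data") are illustrative assumptions.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def publish_frame(frame_bytes: bytes, timestamp_ms: int) -> None:
    # Append one camera frame from the headset to the `main` stream.
    r.xadd("main", {"frame": frame_bytes, "timestamp": str(timestamp_ms)})

def read_latest_guidance(last_id="$", block_ms: int = 1000):
    # Block up to block_ms for a new entry on the `guidance` stream and decode its JSON body.
    result = r.xread({"guidance": last_id}, count=1, block=block_ms)
    if not result:
        return None
    _stream, entries = result[0]
    entry_id, fields = entries[0]
    return entry_id, json.loads(fields[b"data"])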
A.1.3 Lambda Extension to Redis-Streams Module. Lambda functions are stateless, event-driven functions that execute in response to specific events or data triggers. In the context of our streaming program, the stateless nature of Lambda functions makes them a suitable choice for handling real-time data processing tasks without the need for persistent state management. Our implementation is based on a class named Pipeline (the actual name used in the codebase), which serves as a stateless stream processor. Each Pipeline instance can subscribe to multiple streams, process the data as needed, and publish the processed results to output streams. For example, the object detection module subscribes to image streams, processes the images to detect objects, and then outputs the detected objects in JSON format to another stream. To simplify the development of such modules, we implemented a set of base classes for tasks such as running GPT models and performing image analysis. These base classes provide common functionalities, allowing other services to extend them and implement specific processing logic with minimal effort. Given the high frame rate (fps) of the image stream, and the fact that our machine-learning modules may not need, or be capable of, processing data at such high frequencies, we introduced a frequency adjustment mechanism. This mechanism allows the Lambda function to cache incoming frames and process them at a configurable interval, reducing the computational load and ensuring efficient processing.
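As a rough sketch of this pattern (the class, method, and field names below are illustrative rather than the exact codebase API), a Lambda-style processor can be expressed as a loop that reads from an input stream, throttles processing to a configurable interval, and publishes results to an output stream.

# Illustrative sketch of a stateless, Lambda-style stream processor with frequency adjustment.
# Class, method, and stream-field names are assumptions for illustration only.
import json
import time
import redis

class Pipeline:
    def __init__(self, in_stream: str, out_stream: str, interval_s: float = 0.5):
        self.r = redis.Redis()
        self.in_stream = in_stream
        self.out_stream = out_stream
        self.interval_s = interval_s      # minimum time between two processed entries
        self.last_processed = 0.0
        self.last_id = "$"                # only consume entries newer than startup

    def process(self, fields: dict) -> dict:
        # Override with task-specific logic, e.g. object detection on an image frame.
        raise NotImplementedError

    def run(self) -> None:
        while True:
            result = self.r.xread({self.in_stream: self.last_id}, count=1, block=1000)
            if not result:
                continue
            _name, entries = result[0]
            self.last_id, fields = entries[0]
            now = time.monotonic()
            if now - self.last_processed < self.interval_s:
                continue                  # skip entries arriving faster than the configured rate
            self.last_processed = now
            output = self.process(fields)
            self.r.xadd(self.out_stream, {"data": json.dumps(output)})

A concrete module, such as the object-detection Lambda, would subclass this processor and implement process to return the detected objects as JSON.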
A.2 Client Implementation
We implemented our HoloLens 2 interface using Unity and the Mixed Reality Toolkit (MRTK). For data communication,
we utilized Google's Protocol Buffers (Protobuf) module and used the NativeWebSocket library to establish WebSocket
connections. We acquired the main camera frames from the HoloLens 2 using the Research Mode and the corresponding
C++ API.
B BDI Inference and Assistance Generation Prompt
You are an AR assistant helping users with tasks. Given an image, a task guidance and the next step, generate guidance in the required format. Please make sure your guidance is not too simple and can actually help the user.

<TASK_DESCRIPTION>: [Task description]
<NEXT_STEP>: [Description of the next step]
<image>: [first-person perspective image of the user's environment]

<INSTRUCTIONS>:
Based on the <NEXT_STEP>, provide the following:
0. <DESIRE> [Based on the given <NEXT_STEP>, generate the user's high-level goal. Refer to the <TASK_DESCRIPTION> for possible tasks' goals. Output this high-level desire prefixed with <DESIRE>.]
1. <INTENT> [Describe a basic, concrete action in the step. Keep it concise and clear.]
2. <META_INTENT> [Generate the meta-intent from the given meta-intent list [make a tool, interact with time-dependent tools, interact with time-independent tools, interact with materials] based on the user intent <INTENT>. Output this single meta-intent prefixed with <META_INTENT>. The meta-intent refers to the user's most fundamental intent without the contextual information.
   make a tool example intents: assemble Swiffer mop, make coffee filter, arrange flowers creatively;
   interact with time-dependent tools example intents: use grinder to grind coffee, heat food using microwave;
   interact with time-independent tools example intents: connect to VR headset, use a mop, use a strainer;
   interact with materials example intents: add ingredients to a bowl, pour water into a cup, cut flower stems]
3. <GUIDANCE_TYPE> [Select between 'image' and 'timer'. Based on the identified <META_INTENT>, select the corresponding guidance type from the following mappings:
   {"make a tool": "image",
    "interact with time-dependent tools": "timer",
    "interact with time-independent tools": "image",
    "interact with materials": "image"}]
4. <TEXT_GUIDANCE_TITLE> [Short title]
5. <TEXT_GUIDANCE_CONTENT> [Generate the text guidance content that best fits the user in this step. This guidance should consider the contextual information, e.g., the properties of objects in the real environment and the tips that the user should pay attention to. Output based on the <STEP_DESCRIPTION>, the user's <INTENT>, the object interaction list <OBJECT_LIST> and the level of detail <LOD>, starting with <TEXT_GUIDANCE_TITLE> and <TEXT_GUIDANCE_CONTENT>. Do NOT generate text guidance for <DESIRE>; only generate guidance for the fundamental action <INTENT>. In <TEXT_GUIDANCE_CONTENT>, incorporate concrete numbers as required by the <TASK_DESCRIPTION> if possible.]
6. <DALLE_PROMPT> [Based on <INTENT>, <OBJECT_LIST> and <EXPERTISE>, generate a DALLE prompt with the following template in appropriate detail. The prompt should not consider <DESIRE>. The prompt should integrate the intent <INTENT>, the assistance <TEXT_GUIDANCE_CONTENT> and the object interactions <OBJECT_LIST> to depict the action clearly and include a red arrow (<INDICATOR>) showing the action direction.
   If <EXPERTISE> is novice, prompt DALLE to show actions and interacting objects using the template: "<INTENT> or <TEXT_GUIDANCE_CONTENT> <OBJECT_LIST>. <INDICATOR>".
   If <EXPERTISE> is expert, prompt DALLE to show the final result of the <INTENT> using the template: "<INTENT> or <TEXT_GUIDANCE_CONTENT> <OBJECT_LIST>. <INDICATOR>".]
7. <OBJECT_LIST> [Key objects with properties in the image: identify the key objects which the user is interacting with following the <STEP_DESCRIPTION> and the properties of the objects, e.g., color, shape, texture, size. Output an object interaction list with descriptions of properties.]
8. <HIGHLIGHT_OBJECT_FLAG> [True if there are key objects to highlight]
9. <HIGHLIGHT_OBJECT_LOC> [Location of the key object if applicable]
10. <HIGHLIGHT_OBJECT_LABEL> [Name of the key object if applicable]
11. <CONFIRMATION_CONTENT> [Select the confirmation content based on <META_INTENT> from the following options, and insert <INTENT> into the sentence, starting with <CONFIRMATION_CONTENT>: "Looks like you are going to <INTENT>, do you need <GUIDANCE_TYPE>?"]
---
Example:
Input:
<TASK_DESCRIPTION> Making pour-over coffee
<NEXT_STEP> Pour water into the coffee brewer
<image> [image of using a black coffee brewer and metal kettle]
Output:
<INTENT> Pour water into coffee brewer
<DESIRE> make coffee
<META_INTENT> interact with time-dependent tools
<GUIDANCE_TYPE> timer
<TEXT_GUIDANCE_TITLE> pour water into coffee brewer
<TEXT_GUIDANCE_CONTENT> Pour water into coffee brewer.
<DALLE_PROMPT> Hand pouring water from gooseneck kettle into pour-over coffee maker. Red arrow shows pour direction. Timer displays 30 seconds.
<OBJECT_LIST> Coffee brewer (black), Kettle (metal, gooseneck), Coffee grounds (dark brown)
<HIGHLIGHT_OBJECT_FLAG> True
<HIGHLIGHT_OBJECT_LOC> center
<HIGHLIGHT_OBJECT_LABEL> coffee brewer
<CONFIRMATION_CONTENT> Looks like you are going to pour water into a black coffee brewer, do you need a timer assistance for it?
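For reference, the sketch below shows one way the slots in this template could be filled and sent to a multimodal chat model together with the egocentric frame. The OpenAI-style client, the model name, and the slot-filling helper are assumptions for illustration; the reply is the tagged output (<DESIRE>, <INTENT>, <GUIDANCE_TYPE>, ...) to be parsed downstream.

# Illustrative sketch: fill the BDI prompt slots and query a multimodal chat model.
# The OpenAI-style client and model name are assumptions; bdi_prompt holds the template above.
import base64
from openai import OpenAI

client = OpenAI()

def infer_bdi_guidance(bdi_prompt: str, task_description: str, next_step: str, frame_jpeg: bytes) -> str:
    filled = (bdi_prompt
              .replace("[Task description]", task_description)
              .replace("[Description of the next step]", next_step))
    image_url = "data:image/jpeg;base64," + base64.b64encode(frame_jpeg).decode("ascii")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed multimodal model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": filled},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content  # tagged fields, parsed by the guidance module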
C Image Generation Prompt
We generated images for each step as a backup, although images may not always be the most suitable modality for every step. We provided the prompts and the corresponding generated images as references. To maintain consistent styles, we appended "in the style of flat, instructional illustrations. No background. Accurate, concise, comfortable color style" to the end of each prompt. We also prefixed the prompt with "I NEED to test how the tool works with extremely simple prompts. DO NOT add any detail, just use it AS-IS:" to prevent any modification to the prompts.
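As a small illustration of this wrapping (the helper name and constants below are ours, not from the codebase), the fixed prefix and style suffix can be applied to each step prompt before it is sent to the image-generation model:

# Illustrative helper that wraps a step prompt with the fixed prefix and style suffix described above.
STYLE_SUFFIX = (" in the style of flat, instructional illustrations. "
                "No background. Accurate, concise, comfortable color style")
AS_IS_PREFIX = ("I NEED to test how the tool works with extremely simple prompts. "
                "DO NOT add any detail, just use it AS-IS: ")

def wrap_image_prompt(step_prompt: str) -> str:
    # Return the full prompt sent to the image-generation model for one step.
    return AS_IS_PREFIX + step_prompt + STYLE_SUFFIX

# Example: the first flower-arranging step.
print(wrap_image_prompt("pouring half of flower food packet into glass vase. "
                        "Red arrow indicating the pouring action."))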
C.1 Task1: Arrange Flowers
Prompt1: pouring half of flower food packet into glass vase. Red arrow indicating the pouring action.
Prompt2: pour 16 oz of water from glass measuring cup into a glass vase. Red arrow shows pouring direction.
Highlight "16 oz" on the measuring cup.
Prompt3: Trim yellow and purple flower leaves below the waterline in a glass vase using white scissors. Red lines indicate the waterline.
Prompt4: trimming 2-3 inches off yellow and purple flower stems at a 45-degree angle with white scissors. Red line highlights the cutting angle.
Prompt5: Arrange yellow and purple flowers neatly in a glass vase filled with water. Red arrows indicate the positioning steps for a neat arrangement.
Fig. C.1. Steps for arranging flowers: (a) pouring flower food, (b) pouring water, (c) trimming leaves, (d) trimming stems, (e) arranging flowers.
C.2 Task2: Clean Room
Prompt1: connect green mop poles to white square mop pad of swiffer mop. Red arrows indicate the connection points.
Prompt2: wrapping white square mop pad around green mop poles of swiffer sweeper. Red arrows indicate wrapping direction around mop head.
Prompt3: inserting white square mop pad into four sockets on green mop head. Red arrows highlight insertion points.
Prompt4: connect yellow swiffer duster handles. Red arrow shows the alignment and connection direction.
Prompt5: connect yellow handles to blue feather dusters of swiffer duster. Red arrows show the connection points.
Prompt6: mop the floor with green swiffer sweeper mop using white square mop pad. Red arrow illustrates mopping motion across the floor.
Prompt7: dusting a white desk using a blue feather duster with a yellow handle. Red arrows highlight careful dusting around fragile items.
Fig. C.2. Steps for cleaning the room: (a) connect mop poles, (b) wrap mop pad around mop, (c) insert mop pad into sockets, (d) connect duster handles, (e) connect dusters, (f) mop the floors thoroughly, (g) dust the table carefully.
C.3 Task3: Make Pour-Over Coffee
Prompt1: Measure 11g coffee beans using a silver kitchen scale. Red arrow points to the digital display showing 11g. Coffee beans are dark brown.
Prompt2: Grinding coffee beans into powder using a black grinder. Red arrows highlight the grinding action.
Prompt3: placing a brown coffee filter on a white coffee brewer. Red arrows show movement toward the brewer.
Prompt4: Place the white coffee brewer on a cup and wet the coffee filter using a black gooseneck kettle. Red arrow points from brewer to the cup. Red arrow indicates the water pouring direction.
Prompt5: Adding dark brown coffee grounds to a brown coffee filter in a white coffee brewer. Red arrow emphasizes the pouring motion of the coffee grounds.
Prompt6: setting the silver kitchen scale to zero. Red arrow highlighting the zero mark on the display.
Prompt7: Pouring water from black gooseneck kettle into white coffee brewer with dark brown coffee grounds in 30 seconds, using circular motion. Red arrow shows circular pour direction. Highlight timer displays 30 seconds and 50g on silver kitchen scale.
Fig. C.3. Steps for making pour-over coffee: (a) measure coffee beans, (b) grind coffee beans, (c) place filter on brewer, (d) set brewer on cup, (e) wet coffee filter, (f) add coffee grounds, (g) set scale to zero.
C.4 Task4: Connect Switch to Monitor
Prompt1: Connect the black HDMI cable to the black HDMI port on the Nintendo Switch dock. Red arrow indicates
the connection direction.
Prompt2: connecting black type C power cable to the black dock of the Nintendo switch. Red arrow shows the
connection direction.
Prompt3: connecting black power cable to an AC outlet. Red arrow shows the direction of connection.
Prompt4: Inserting Nintendo Switch into black Nintendo Switch dock. Red arrow shows direction of insertion.
Prompt5: Press the power button on the Nintendo Switch console to turn it on. Red arrow indicates the power button
location.
Prompt6: black monitor with a visible power button in the left corner. pressing power button on monitor to turn it
on. Red arrow indicates the power button location.
Fig. C.4. Steps for connecting the Nintendo Switch to a monitor: (a) connect HDMI cable, (b) connect Type-C cable, (c) connect power cable, (d) insert Switch into dock, (e) press Switch power button, (f) turn on computer monitor.
C.5 Comparison between original prompt and modified prompt
As stated in Sec. 6.4.2, we present two examples of image assistance generation to demonstrate the effectiveness of our template, as shown in Fig. 3.
The first example involves the task of making coffee. The basic prompt, without any modifier, is “presses a button on an espresso machine.” Our enhanced prompt incorporates specific modifiers for clarity: “presses a white button on a white espresso machine. A red arrow points to the button. No background, styled in flat, instructional illustrations. Accurate, concise, comfortable color style.”
In the second scenario, the user needs to cut the stem of a flower at a specific angle. The raw prompt reads: “cuts stem of a flower up from the bottom with scissors at 45 degrees.” Our proposed prompt, enriched with detailed modifiers, is “One hand cuts the stem of a red flower up from the bottom with white scissors at 45 degrees. A large red arrow points to the cut, set against a white background in the style of flat, instructional illustrations. Accurate, concise, comfortable color style.”
As shown in Fig. 3, Fig. 3a and Fig. 3c, derived from our template, eliminate unnecessary details and emphasize the core action of pressing a button. Clear visual elements, such as bold outlines and directional arrows, highlight the instructed action, making it immediately apparent what action is being instructed. This type of imagery is especially effective in instructional materials where rapid comprehension is essential. In contrast, Fig. 3b and Fig. 3d may introduce ambiguity in an instructional context due to their realistic depiction that includes reflective surfaces and shadows. While aesthetically pleasing, this level of detail can distract from the core instructional message. Therefore, our template enhances image clarity and directly aligns with the user’s need for clear, actionable instructions in their specific context.
D User Study Interview Quotes
We clustered the opinions shared by participants and presented the corresponding quotes. The opinions were collected
through transcriptions of in-person interviews conducted after the experiments and from follow-up questionnaires.
D.1 Satori System
D.1.1 Satori system is better for novice users. P8: “if someone is new to, let’s say, doing a certain task, there were visual cues that were there, which we had to see on the screen and replicate that.”
P6: “if someone who doesn’t know what is a grinder, or doesn’t know what is the brewer, or stuff like that, it (animation) actually showed me.”
D.1.2 Satori system can provide clear and useful instructions. P8: “though the tasks were simple, the instructions were
very clear in both the things.”
P14: “The guidance helps me a lot, especially in coffee making. It provides me with very detailed instructions including time, and amount of coffee beans I need. I would have to google it if I don’t have the guidance.”
D.1.3 Satori system can detect my intentions. P9: “the provided guidance pretty much aligned with my intention all the
time. The confirmation message helps but also annoyed me a bit since it showed up after every step.”
P12: “In my experience, most of the time the system knows what I have done in the past step eventually, but I wish it
could be more responsive so I don’t need to wait for the system to recognize what I have done”
D.1.4 Design in Satori makes the users more engaged and feel trustworthy. P9: “I like to have a voice talking to me more for
emotional support such as compliments after completing one step successfully. ” P5: “The automatic step-by-step experience
is highly engaging and makes me excited about its future. It gives me the impression that the machine understands what
I’m doing, making its instructions feel trustworthy.” P10: “It automatically detects my progress.” P12: “I like how the system
automatic play the text that you finished the task which made me more engaged at the beginning, but when I realize that it often takes time for the system to know you have finished and I have to wait for it, that engagement and enthusiasm faded.”
D.1.5 Satori system cannot always detect the completion of a step. P8: “most of the time... if I have to associate a number with it, 60% of the time... it wasn’t able to pick if I did finish the task or not.”
D.1.6 Satori lags a bit behind WoZ in terms of timely assistance. P4: “It was a bit delayed compared to the second one.”
P6: “(Satori) I think maybe it took a little bit extra. I think five, 10 seconds extra I needed to show the image properly for it to understand that I’ve done the stuff. The second one, (WOZ) I saw that it could comprehend much more easily.”
D.1.7 Satori’s main frustration point is the detection of task completion. P3: “But also I think that it would be helpful if it actually did recognize all of the things that I was doing.”
P2: “the first system, what I feel is the guidance was good. The images were nice, the animation was on point, just the detection was not good.”
P9: “I just hope the detection, whatever algorithm, can be more accurate, so I don’t need to click it by myself.”
D.1.8 The substep checkpoint and steps are useful. P9: “I like the mission kind of point of view, and it shows each step,
like the sub-steps, with the progress checkings, like the circle thing. It’s easier for me to understand if I’m working on the
correct step, or if I already messed something up before I even noticed.”
P13: “what you are doing and the overall objectives and individual steps. To show, like, it’s easy to understand what step I
did and going back and forth.”
D.1.9 Combination of visual, text and audio modality makes task completion easier. P3: “So I like that the second system
(SATORI) has also like a text feedback. So in case I’m lost, I can just figure it out. I think this one (WOZ) did not have a text
feedback. So once I miss the voice, if I happen to miss the voice or if I forget, if it’s a long task, then I might forget exactly
where I am or what the subtasks involved in this task are.”
P3: “Animations are helpful in either case. But the second one seems kind of more detailed. But I just feel that might be
just because it combines different modalities. That I feel like better, a better sense of where I am in the task.”
P5: “like, the images and the text and audio..the whole thing is composed more neatly.”
D.1.10 Image assistance is useful. P1: “the picture of the second one is very nice and it looks good.”
P2: “The images were nice, the animation was on point.”
P6: “what I liked better was there was animation for everything. So yeah, I mean, if someone who doesn’t know what is a
grinder, or doesn’t know what is the brewer, or stuff like that, it actually showed me. And who doesn’t know how to cut below the waterline and stuff like that, it actually showed me the animation on whether to cut.”
D.1.11 Image assistance is confusing, or complex. P1: “but I think it can be more simple because maybe many content in
the picture is not necessary for me. icon can be just show the most important thing during the task and not many extra lines
or something.”
P5: “But in the second system, you just use, like, the AI-generated images. And that sometimes is just not, like, matching
the real setting. So, sometimes if you use the real live objects, it may make the image instruction more clear.”
P14: “sometimes the figure is confusing, I think because it’s in flash style and it’s sometimes it’s a little different from what the actual item is.”
D.1.12 Timer is useful. P13: “...for example the system gives video instructions on how to fold the coffee filter and place it in the cone; that’s very helpful. I also like the timer when I make the coffee”
D.1.13 Satori system gives users higher satisfaction with the assistance. P2: “I felt the first one was much better, but the second was, I felt I was able to do the task much faster. Reason: the first system, what I feel is the guidance was good. The images were nice, the animation was on point, just the detection was not good.”
D.1.14 Task transitions are natural for both. P2: “in the two systems is it (natural). I felt there was a lag in first, in system one, but in system two, it was everything was robust.”
D.1.15 Satori can be applied in daily life. P12: “I think this specific scenario is actually pretty helpful because sometimes we need to connect different devices that are new to us. I had a similar issue with my Wi-Fi router and modem at my apartment last year and I have to call a specialist to come and fix it, but an AR tutorial will be helpful.”
P8: “Things like IKEA assembly etc., would be a great use case.”
P13: “cooking, assembling equipment (ex. PC) or furnitures (ex. shelf), operating on machines (ex. coffee maker), exercise (ex. different yoga moves)”
P9: “...maybe when we need to assembly a furniture, instead of going through the manual back and forth all the time, we
can just have this system to guide us.”
D.2 WoZ
D.2.1 WoZ system detects the user’s intentions. P9: “There is no misalignment between my intentions and the provided guidance”
P13: “sometimes there are time lag, but mostly it works fine”
D.2.2 The modality design is not as helpful. P1: “The timing of the first one is really not useful for me because I cannot figure out whether which way it will present to me. Maybe the voice, maybe the image. There was a time that I think it will be a voice to lead me but actually there is an image but I didn’t know. I just wait for the voice and I don’t know how to do the next.”
P2: “...However, the lack of text guidance made some tasks more difficult.”
P4: “Its guidance was not uniform. It showed text, audio, images randomly. When I need a animation to help, it only showed me a text.”
D.2.3 WoZ helps users complete the task faster and more smoothly. P2: “I felt the first one was much better, but the second was, I felt I was able to do the task much faster. I would definitely prefer like the system two, over like system one, because it’s
much faster, even though there are less instructions (WOZ).”
P8: “This system had through-and-through audio guidance. This helped me feel the process was much smoother”
P2: “The guidance in System B was helpful due to the fast detection of task completion. For example, when connecting the
mop and duster, the system quickly recognized the task as complete. ”
D.2.4 Animation and image assistance are better. P9: “animation from the last one, from the first system, was easier to follow compared to the images that’s in today’s system.”
“The system gives me realistic images and voice guidance.”
Received September 2024; revised 1 May 2024