Audio Matters Too: How Audial Avatar Customization Enhances Visual Avatar Customization
Dominic Kao, Purdue University, USA (kaod@purdue.edu)
Rabindra Ratan, Michigan State University, USA (rar@msu.edu)
Christos Mousas, Purdue University, USA (cmousas@purdue.edu)
Amogh Joshi, Purdue University, USA (joshi134@purdue.edu)
Edward F. Melcer, UC Santa Cruz, USA (eddie.melcer@ucsc.edu)
ABSTRACT

Avatar customization is known to positively affect crucial outcomes in numerous domains. However, it is unknown whether audial customization can confer the same benefits as visual customization. We conducted a preregistered 2 x 2 (visual choice vs. visual assignment x audial choice vs. audial assignment) study in a Java programming game. Participants with visual choice experienced higher avatar identification and autonomy. Participants with audial choice experienced higher avatar identification and autonomy, but only within the group of participants who had visual choice available. Visual choice led to an increase in time spent, and indirectly led to increases in intrinsic motivation, immersion, time spent, future play motivation, and likelihood of game recommendation. Audial choice moderated the majority of these effects. Our results suggest that audial customization plays an important enhancing role vis-à-vis visual customization. However, audial customization appears to have a weaker effect compared to visual customization. We discuss the implications for avatar customization more generally across digital applications.

CCS CONCEPTS
• Human-centered computing → Empirical studies in HCI.

KEYWORDS
Games; Avatar; Audio; Voice; Customization; Identification; Player Experience

ACM Reference Format:
Dominic Kao, Rabindra Ratan, Christos Mousas, Amogh Joshi, and Edward F. Melcer. 2022. Audio Matters Too: How Audial Avatar Customization Enhances Visual Avatar Customization. In CHI Conference on Human Factors in Computing Systems (CHI ’22), April 29-May 5, 2022, New Orleans, LA, USA. ACM, New York, NY, USA, 27 pages. https://doi.org/10.1145/3491102.3501848

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
CHI ’22, April 29-May 5, 2022, New Orleans, LA, USA
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9157-3/22/04.
https://doi.org/10.1145/3491102.3501848

1 INTRODUCTION

Avatars are ubiquitous across digital applications. Using avatars as representations of ourselves, we socialize, play, and work. Increasingly, researchers have become interested in avatar customization [152]. Avatar customization, or the ability to modify one’s avatar, increases outcomes including intrinsic motivation [25], helping behavior [59], user retention [26], learning [175], flow [136], and, of especial importance, avatar identification [221]. Avatar identification, or the temporary alteration in self-perception of the player induced by the mental association with their game character [224], leads to increased motivation [25, 26, 220], creative thinking [34, 56, 87], enjoyment [134, 175, 217], performance [113], player experience [107], flow [207], and trust in others [117]. Despite the large corpora of literature on avatar customization, studies have focused almost exclusively on visual aspects of the avatar. Limited adoption of audial aspects in avatar customization is potentially because avatar audio is perceived as non-critical and has substantial overhead (e.g., multiple voice actors, region localization) [237]. Recent advances in artificial intelligence (e.g., neural networks) have vastly improved text-to-speech engines and voice cloning software, however, and these programs are able to produce artificial voices nearly indistinguishable from real ones. Voice cloning software was used in a study in which participants played a game with avatars that had either a similar voice or a dissimilar voice (as compared to the player) [114]. Results showed that participants in the similar voice condition had increased performance, time spent, similarity identification, competence, relatedness, and immersion. Prior research adds further support that the importance of avatar audio may be underappreciated. Audio in games is linked to increased physiological responses [90], emotional realism [20, 66], performance [103], and immersion [67, 115, 131, 168, 199]. A meta-analysis of 83 studies in virtual environments found that adding audio has a small- to medium-sized effect on presence [55]. Given prior work demonstrating the importance of avatar customization and audio separately, allowing players to audially customize their avatars may have beneficial effects.
Customizing one’s avatar is often viewed as inherently enjoyable [105]. This customization is now part of a lucrative “skin” market in online games [102]. Game skins can be used to customize an avatar’s appearance, and research estimates the skin market to be worth $40 billion (USD) per year [226]. While a few ventures have begun to explore customization of the player’s voice, these
efforts have been limited to external tools (e.g., voice-changing software [163, 229]). A small number of games do offer the option of customizing avatar audio. Final Fantasy XIV [208], Saints Row IV [230], and Monster Hunter: World [36] allow the user to choose between different sets of voices. Black Desert Online [178], Red Dead Redemption 2 [192], and The Sims 4 [68] allow the user to customize pitch. More generally, avatar customization interfaces are understood to vary greatly between games with regards to both quantity and quality of customization options [152, 156]. For the purposes of the present study, we created four character models and four character voices. We then created four character customization interfaces that varied (1) whether the character model was chosen or randomly assigned and (2) whether the character voice was chosen or randomly assigned. These customization interfaces were explicitly designed to test whether audial customization would have any effect on outcomes vis-à-vis visual customization.
We conducted an online study on Amazon’s Mechanical Turk (MTurk) in which participants were randomly assigned to one of the four character customization interfaces. Participants then played a Java programming game for 10 minutes. After 10 minutes had passed, an in-game survey collected measures of avatar identification, autonomy, intrinsic motivation, immersion, motivated behavior, motivation for future play, and likelihood of game recommendation.¹ After completing the survey, participants could quit or continue playing for as long as they liked, reflecting motivated behavior.
Our results show that visual customization leads to higher avatar identification and autonomy. Audial customization leads to higher avatar identification and autonomy, but only within the grouping of participants in which visual customization was available. In the grouping of participants without visual customization, audial customization had no effect on avatar identification or autonomy. Visual customization leads to higher time spent playing, and indirectly (through the mediators of avatar identification and autonomy), it leads to higher intrinsic motivation, immersion, time spent playing, motivation for future play, and likelihood of game recommendation. Audial customization moderated the direct effect of visual customization on time spent playing, as well as the indirect effects of visual customization on intrinsic motivation, immersion, motivation for future play, and likelihood of game recommendation. The moderation effect was such that the effect was non-significant when audial customization was unavailable but significant when audial customization was available. Our results show that audial customization, although having an overall weaker effect than visual customization, can strengthen existing effects of visual customization on outcomes. This suggests that avatar customization systems in games can be improved by adding audial customization options. Moreover, our study provides motivation to extend this research to other domains as potential beneficiaries of audial avatar customization (e.g., virtual reality, digital learning, health applications). In the highly understudied area of avatar audio, we contribute baseline results in a large-scale preregistered study that can spur further work in this domain.
¹Study hypotheses, analyses, experiment design, data collection, sample size, and measures were all preregistered. Preregistration: https://osf.io/dbvkp/. Raw Data: https://osf.io/mnpsd/.
2 RELATED WORK
2.1 Avatar Customization
Avatar customization is the process of changing aspects of a video game character. Players customize their avatars’ physical (e.g., body shape), demographic (e.g., age, race, gender), and transient (e.g., clothes, ornaments) aspects. The avatar customization process can also include choosing roles (e.g., playing as a warrior, archer, mage, or healer), attributes (e.g., luck, intelligence), and group membership (e.g., playing as horde or alliance) [53, 219]. Customizing one’s avatar can lead to direct and indirect effects on gameplay [100, 219]. For example, choosing the role of a warrior affords different game mechanics and play strategies (i.e., favoring close combat) compared to playing as an archer. Similarly, customizing skill attributes can also affect gameplay (e.g., favoring increased charisma gives lower prices on game items in Fallout 4 [24]). Customizing avatars’ physical appearance or the name of the avatar, on the other hand, usually does not affect gameplay (directly) but can have a psychological effect on the players [25, 138, 201].

To understand these psychological effects, many studies have used off-the-shelf games (e.g., Massively Multiplayer Online Games, or MMOs) that offer a comprehensive avatar customization process, such as changing physical, demographic, and transient aspects, as well as choosing roles, group membership, and attributes. Lim and Reeves used a popular MMORPG (World of Warcraft, or WoW [53]) where participants were randomly assigned to play the game with avatar customization or to play with a premade avatar [138]. The study found that players who customized their avatar experienced greater physiological arousal [138]. Similarly, players reported greater physiological arousal and subjective feelings of presence when playing advergames that offered avatar customization options, suggesting greater game enjoyment [14]. It has also been shown that players remember more game features, such as spatial features of landmarks and characteristics of NPCs, when playing with customized avatars [175]. Teng [215] examined how customizing avatars’ transient aspects in MMORPGs impacts identification and loyalty with the game. The study found that customizing these items (e.g., clothes, shoes, etc.) positively impacted identification with the avatar, which subsequently increased gamer loyalty. Other studies have also explored how customizing non-human objects (e.g., race cars) influences player experience [185, 201]. One study used the game Need for Speed: ProStreet [64] to understand if customizing a racecar affects players’ enjoyment of the game [201]. Players customized their cars’ visual appearance, such as changing the car’s shape, aftermarket components (spoilers, rims), color, and skins. Players who customized their cars experienced greater identification, leading to higher game enjoyment, than those who played with premade customized cars.

One key limitation of these studies is the time duration of their investigation. Many studies have only investigated the effect of avatar customization over short playing times (~1 hour) [220]. MMOs are long-term games, with players’ gameplay experience and expertise evolving with time. Previous studies have found that players of these games spend approximately 10 hours playing each week [63]. Turkay and Kinzer investigated how players’ identification and empathy towards their avatar evolved over ten hours of playing Lord of the Rings Online (LotRO [209]) [220]. The study found that players who customized their avatars had
a stronger identification and expressed greater empathy towards them than those who played the game with premade avatars.
Studies have also used bespoke games to understand the effects of avatar customization [25, 126, 140]. Birk, Atkins, Bowey, and Mandryk [25] investigated if players who customized their avatars experienced greater intrinsic motivation compared to those who used premade avatars. The researchers leveraged Unity Multipurpose Avatar [1] to develop a character creator which allowed players to customize their game characters’ appearance (e.g., skin tone, clothing), personality (e.g., extraversion), and attributes (e.g., intelligence, stamina, willpower). Players who customized their game character experienced greater identification with their avatars, which led to greater autonomy, immersion, invested effort, enjoyment, positive affect, and time spent playing in an infinite runner [25]. In a subsequent paper, Birk and Mandryk investigated the effect of avatar customization on attrition and sustained engagement while playing a mental health game over three weeks [26]. The study found a reduced attrition rate for the players who customized their avatar compared to those who played with a generic avatar [26]. In another study, playing an exergame with autonomy-supportive features (which included customizing an avatar) led to increased effort, autonomy, motivation to play the game again, and greater likelihood to recommend the game to peers compared to playing the game without autonomy-supportive features [180]. Similarly, in a virtual reality exergame, players customized their avatars using an off-the-shelf software tool (Autodesk Character Creator [11]) to create an avatar similar to themselves. Players could customize their avatars (e.g., skin tone, hair and eye color, clothes, shoes). The study found that players who competed against their customized self-similar avatars performed significantly better compared to the players who competed with generic avatars [126]. The effect of customization has also been observed in learning environments. Students engaged with a computational learning game (over seven sessions lasting an hour each) with a customized avatar of their choosing [140]. Customization options included skin tone, hairstyle, and eye color. The study found that players who customized their avatars remembered and understood more computational concepts than those who played the game with a premade avatar. Kao and Harrell [113] investigated how avatar identification influenced players in a computational learning environment (MazeStar [112]). Players customized their avatars using a freely available Mii creator. The study found that avatar identification promoted outcomes including player experience, intrinsic motivation, and time spent playing [113].
These studies suggest that avatar customization affects player experience in a wide variety of settings (e.g., games for entertainment or learning), virtual environments (e.g., desktop, VR), and timespans (both one-off play sessions and longitudinal) [25, 114, 126, 138, 201, 220]. More importantly, a subset of these studies highlights that avatar customization generates attachment and identification with the game character [26, 113, 201, 215, 220], which consequently affects a wide range of variables: intrinsic motivation [25, 180], autonomy [25, 26, 114], empathy [220], performance [26, 114, 126], game enjoyment [217], loyalty [215], and player experience [25, 114, 138, 201, 220].
2.1.1 Avatar Identification. Identification is a mechanism wherein media experiences, such as reading a story or watching a movie, are interpreted and experienced by audiences as if “the events were happening to them” [46]. The mechanism of identification differs in interactive and non-interactive media experiences. In a typical media experience (e.g., a movie or a late-night talk show), the relationship between the audience and the media character is often categorized as self versus other (often referred to as a dyadic relationship) [44, 61]. Within games, the distance between the self and the other is said to be diminished due to games affording direct control over the game character and their interactions in the virtual world [91, 122]. Players control, customize, and interact with their game character and the game world using an avatar. Consequently, the player-avatar relationship is often said to be “a merging of [the player] and the game protagonist” [44].

Avatar identification is thought to be a shift in self-perception [123]. Players can temporarily adopt salient characteristics of the avatar [224] or channel their expectations into the avatar creation, thereby facilitating avatar identification [220]. Many factors influence the nature of identification that can take place with the avatar. Flanagan [73] asserts that player identification with a game character is complicated by the various roles embodied by the player (such as being a subject, spectator, participant, etc.) during gameplay. Murphy [167] elaborates on how players’ abilities, player characters’ abilities, game events, and other players influence the player’s sense of agency in virtual environments. While many authors agree that identification takes place between a player and the game character, the nature of identification remains understudied [220].
One avenue for understanding identification is through understanding the avatar customization process. When players customize their avatar, they cycle through many “possible selves” [148] as they experiment with and adopt the game characters’ attributes for themselves. Two separate studies by different researchers reveal a few common trends regarding players’ avatar creation and customization experiences [62, 105]. In one of the studies, researchers investigated reasons for avatar customization and creation in three virtual worlds: World of Warcraft [53], Second Life [141], and Maple Story [238]. The researchers found that players in these virtual worlds created and customized their avatars for various reasons, including to project an ideal self, follow a trend, or stand out from others [62]. Another study examined the avatar creation and customization process for players in Whyville [105]. Players customized their avatars for aesthetic reasons, to follow a popular trend, and to express themselves (e.g., show some aspect of their authentic selves). Moreover, the researchers also found that players customize their avatars with a functional intention, such as to experiment with gender or to play different roles [105].
These findings have led researchers to consider avatar identification as a multi-faceted construct [61], which has been operationalized into three distinct dimensions: similarity identification, wishful identification, and embodied identification [224]. Similarity identification refers to players identifying with an avatar that looks like them [61]. Avatars that look similar to players can facilitate feelings of familiarity and a stronger empathetic experience [224]. Research shows that similarity identification can play an important role in the player’s motivations for playing [224], learning outcomes [114], player experience [107], and behaviors [126, 134]. Players can also identify with their game characters and see them as role models for future action or identity development [224]. Players desiring to align their personal attitudes, aesthetics, and attributes with those of their game character is referred to as wishful identification [61, 224]. For example, previous research has documented that older players often create avatars younger than themselves [62]. Lastly, players also identify with their avatars when manipulating the avatars’ bodies as their own. Perceiving oneself to be present in a virtual environment through one’s avatar, or so-called “body container” [61, 224], heightens embodied identification [220].
The process of avatar customization is often a precursor for generating greater avatar identification. For example, players wanting to create an avatar that has similar attributes (e.g., physical appearance, hair style, hair color) may experience greater similarity identification [220]. On the other hand, players customizing their avatars according to their ideal self may increase their wishful identification [220]. Players typically interact with a user interface that allows them to fluidly cycle through choices to constitute their desired digital body. As such, the design and options presented to players can play a crucial role in helping (or hindering) players to create their desired avatar [153, 155].
2.1.2 Avatar Customization Interface. The interface that players use to create and customize their avatars, sometimes referred to as a character customization interface (CCI) [156], represents a “space of liminality” [234] where players spend a significant amount of time intentionally creating their desired avatar [62, 156]. McArthur states that these interfaces generate action possibilities for avatar creation and customization [156]. Players cycle through many possible customization options to create their desired avatar. Avatar customization interfaces are not only important in terms of usability, but also in how they communicate cultural ideologies [153, 156]. For instance, the design of “default” options in avatar customization interfaces and the order (hierarchy) of body customization options oftentimes implicitly reinforces existing hegemonic structures in society [156, 170]. Avatar customization interfaces are known to constrain user choices, in part due to their oftentimes exclusionary design [155]. Previous research has found a limited number of options for players belonging to diverse ethnic groups and genders, suggesting that customization favors the creation of light-skinned male avatars [51, 156, 177]. While our focus in the present study is on understanding if audial avatar customization can confer benefits similar to those of visual avatar customization, the exclusionary potential of audial avatar customization options should be studied closely in future research.
Research has emphasized the role played by other aspects, including game world aesthetics, co-situated players, social context, and the avatars of other characters, in influencing the avatar customization process [105, 153]. Kafai found that new players felt out of place with their generic avatars when interacting with avatars with detailed customization [105]. Players also reported customizing their avatars to avoid being bullied in online settings by other players [105]. Players customize their avatars differently depending on the context of the virtual environment, such as changing clothes and accessories when the social context switches between “game” and “job” [218]. Players also adhere to group norms while creating and customizing their character [101]. User characteristics, such as age, gender, and self-esteem, play a role in the avatar creation and customization process. Individuals with higher self (and body) esteem represent their avatars with a greater number of body details and an emphasis on sexual characteristics that identify their gender [227]. Adolescent boys customized their avatars to create a more stereotypically masculine body, compared to girls, who focused on customizing transient aspects of the avatar, such as clothing and accessories [227].
Although the process of avatar customization has been extensively investigated, research has largely ignored the effect of voice options on avatar creation and customization. Contemporary games seldom offer voice customization options; however, some examples do exist. Some games offer a “voice template” that can be chosen during avatar customization, such as in Black Desert Online [178]. The Sims 4 [68] allows characters’ voices to be customized according to three voice qualities: “sweet,” “melodic,” and “lilted” for women, and “clear,” “warm,” and “brash” for men. Other games allow players to customize a given voice by directly changing specific aspects of the voice, such as pitch. The games Saints Row IV [230] and Cyberpunk 2077 [38] offer the ability to modify pitch. This project investigates the effect of providing audial avatar customization options on a variety of player outcomes.
2.2 Audio in Games
Game audio performs many functions, such as emphasizing visuals [174], contextualizing a place [67], highlighting the emotions and thoughts of the game character [174], and immersing the player in the game world [193]. To understand the design of audio in games, researchers have defined audio typologies. One typology classifies sound based on its source [19]. Sound is referred to as “diegetic” if it originates from the game world (e.g., game sound [86, 169]), and sounds that have origins different from the game world (e.g., interface sounds) are called “non-diegetic” [86, 169]. Liljedahl [137] classifies sounds into three categories: speech and dialogue, sound effects (e.g., ambient noise, avatar sounds, object and ornamental sounds), and music.
Research shows that players appreciate the inclusion of audio elements in games. Klimmt et al. [124] investigated the role of background music on the gameplay experiences of players. Players experienced greater enjoyment while playing a game (Assassin’s Creed: Black Flag [222]) with background music included. Background music can also affect performance in a game: participants who played a role-playing adventure game (The Legend of Zelda: Twilight Princess [176]) performed better with background music present [205]. Some games incorporate background music that changes according to in-game events. Such an adaptive soundtrack has been shown to improve player experience. Researchers designed a game with a soundtrack that increased in tension depending on the players’ chance of success or failure in the game [182]. Participants who played the game with the adaptive soundtrack experienced greater tension, suggesting a more engaging experience. Players of a first-person shooter game reported a better player experience (immersion, flow, positive affect) in the presence of sound effects (e.g., ornamental and character sounds) [169]. Audio may also influence motivated behaviors such as time played [110] and actions
performed [108]. A lack of thematic fit between audio and visuals (also known as game atmosphere) can affect player experience. In one study, players played a survival horror game (Bloodborne [76]) either with background and voiceover audio relevant to the game (built-in game audio) or with experimenter-induced music and voiceovers [83]. Players experienced a lower degree of perceived game atmosphere when the audial elements did not fit the game’s visual elements.
Avatar sounds are sounds related to avatar activity, such as breathing and footstep sounds [137]. These sounds help immerse the player in the game world [85], provide feedback for avatar movement [67], and play a crucial role in localizing the player in audio games [75, 78]. For example, Adkins et al. [3] developed an audio game wherein players selected an animal as a game character (a cow, dog, cat, or frog) to navigate through a maze. The four animals also had representative animal sounds that provided essential user feedback for nearby obstacles and intersections. Providing sound cues for the movement of an avatar helps the virtual world conform to the players’ expectations [75] and induces immersion in the game world [85].
2.2.1 Avatar Voice. Avatar voice includes linguistic vocalizations (e.g., dialogue and voiceovers) and non-linguistic vocalizations such as emotes (e.g., effort grunts, screams, sighs) [95]. Avatar voice can be used to control the actions of the game character [8], converse with NPCs [60], and converse with other players in the game world [231]. While conversation with NPCs is usually supported through prerecorded dialogue [95], games also facilitate avatar control and player-to-player communication through voice interaction [37].
Voice dialogue in games supports storytelling, the development of a rich and believable world, and the setting of emotional tone [95]. As players explore and interact with a novel game world, conversing with NPCs can reveal important information regarding historical events and new quests that can ultimately help in the narrative progression. A common feature in many open-world games is the presence of a social space (e.g., a local tavern) containing music and ambient sounds that are concurrent and continuous [206]. The social space also contains jumbled, indistinct conversations (“walla”) among social actors (NPCs) [95]. Therefore, a sonic environment comprising music, sounds, and voices helps in several ways: creating game-feel, setting the mood, and making the game world believable [49, 95]. Game characters also use emotionally laden dialogue to engender emotions in the player that can forge a deeper connection between the game character and the player [211]. For instance, an urgent request for help can arouse the player to take action.
Voice interaction focuses on using players’ voices as input to the game [8, 37]. Beyond using voice interaction to converse with other players [231], recent advances in software and hardware technology [7] have made it possible to use voice interaction to control avatar actions and in-game events [8]. Two popular approaches exist here: “voice-as-sound” [88, 99] and “voice-as-speech” [8, 37]. Voice-as-sound uses players’ voice characteristics, such as pitch and tone [88, 99]. Hämäläinen et al. [88] describe the design of two games that used the voice-as-sound approach. In the first game, players navigate a hedgehog through a maze by singing at the correct pitch. The authors also developed a turn-based ping-pong game where players had to move their paddle to the appropriate positions using the correct pitch. Voice-as-speech uses speech recognition technology to interpret players’ commands in games [7, 8, 88]. Players can use their voice to navigate menus [37], engage in unscripted conversations with a virtual pet (a fish in the game Seaman [228]) [8], and cast spells using voice commands in Skyrim [7, 23].
Carter suggests that voice interaction can facilitate a deeper connection with the player’s game character [37]. The voice of an avatar is part of the game character’s identity, and providing a way to use players’ voices for avatar actions can lead to a merging of identity (player-avatar convergence). Embodied identification, that is, identification through the degree of control over the game character’s movement and actions, can imbue players with a greater sense of agency and identification [220]. Players playing Tomb Raider [54] can use voice commands to initiate player actions, such as attack and defend [37], while simultaneously performing (other) actions with the game controller. In this sense, voice interaction may facilitate embodied identification by affording greater control over the game character’s actions. Voice interaction may also facilitate wishful identification by affording associations between the player’s voice and the game character’s voice. Splinter Cell: Blacklist [223] allows users to distract enemy NPCs by using a specific speech phrase (“Hey, you!”), which is repeated by the voice of the game character in the virtual world. Players in FIFA 14 [65] embody the role of a manager and perform actions such as selecting players for a tournament and giving advice on the field. Players can voice specific commands that change the behavior of their chosen team to adopt a defensive or attacking mindset [37], a typical action that coaches and managers perform. Lastly, voice interaction can facilitate similarity identification by allowing users to interact with the game characters using their own voices. For example, avatar representation in karaoke games is almost entirely through the voice of the player [37].
2.2.2 Avatar Voice and Learning Environments. Studies have investigated
how engagement and learning outcomes are influenced by the
voice characteristics of instructional agents [133, 150]. Learners
rate voices as more likable when the voice characteristics of instructional
agents are similar to their own in perceived gender [132] or
personality [171, 172]. Research also documents persistent stereotypes
in the design of instructional agents’ voices. Deutschmann
[58] evaluated how students perceive a male and a female avatar
delivering a lecture. Students perceived the male avatar as more
knowledgeable and the female character as more likable. In a
similar vein, researchers designed three avatars—the instructor’s face,
male-anime, and female-anime—to understand how students perceive
and perform in an online course. Students showed higher
likeability for the female-anime avatar but performed better when
the instructor’s own face delivered lectures. Although these studies
show that the voice of an avatar plays a role in students’ perception
and performance, a general limitation is the poor quality of voice
morphing in these studies [58, 97].
More recently, research has also sought to understand how an
avatar’s voice can affect self-presentation in digital environments.
Zhang et al. characterized users’ voice customization preferences on
social media websites [242]. The study highlighted gender, personality,
age, accent, pitch, and emotions as key factors that users wanted
to customize to represent their avatar in digital spaces [242]. The
study also highlighted the need to provide customization options to
modulate pitch and voice depending on the context—e.g., sounding
serious and formal for professional websites such as LinkedIn. A
common trend in studies leveraging personalized avatar voice in
virtual environments is the beneficial effect of using a self-similar
avatar voice [12, 114]. In a public speaking experiment, participants
stood in front of a virtual classroom to give a speech [12]. Participants
either used their own voice to give the speech or had another
participant’s speech played back. Participants who used their own
voice showed significantly higher social presence [12]. Kao, Ratan,
Mousas, and Magana leveraged recent advances in voice cloning
and found that learners using a more self-similar voice (as opposed
to a self-dissimilar voice) in a game-based learning environment
had higher performance, time spent, similarity identification, competence,
relatedness, and immersion. Additionally, they found that
similarity identification was a significant mediator between voice
similarity and all measured outcomes [114].

CHI ’22, April 29-May 5, 2022, New Orleans, LA, USA. Dominic Kao et al.
While research provides strong support for avatar voice influencing
avatar identification, no study (to the best of our knowledge)
has investigated the effects of providing avatar audial customization
options. We present a study that provides audial (voice) avatar customization
options alongside visual avatar customization options in
a Java programming game. Our goal is to understand how providing
audial avatar customization options affects measured outcomes.
2.3 Hypotheses
We had seven overarching hypotheses (each broken down into
three sub-hypotheses) in this study. All hypotheses and research
questions were part of the study preregistration.2 Because prior
work has shown that avatar customization leads to an increase in
avatar identification (similarity identification, embodied identification,
and wishful identification) [25, 26, 57], we hypothesized that
visual customization would lead to an increase in avatar identification.
Research has shown that game audio is important to player
experience (PX) [66, 67, 168] and that avatar audio can influence
avatar identification [114]. Therefore, we hypothesized that audial
customization would lead to an increase in avatar identification.
Additionally, we hypothesized a lack of an interaction effect between
visual and audial customization because existing work gives
us no reason to believe their effects would depend on one another.
H1.1: Visual customization will lead to higher avatar identification.
H1.2: Audial customization will lead to higher avatar identification.
H1.3: No interaction effect between visual and audial customization for avatar identification.
Prior studies have shown that character customization leads to
greater autonomy [25, 180]. Therefore, we hypothesized that visual
customization would lead to greater autonomy. Similar to H1.2, we
hypothesized that audial customization would play a similar role to
visual customization and would also increase autonomy. We again
hypothesized a lack of an interaction effect for the same reason as
H1.3.
H2.1: Visual customization will lead to higher autonomy.
H2.2: Audial customization will lead to higher autonomy.
2 https://osf.io/dbvkp/.
H2.3: No interaction effect between visual and audial customization for autonomy.
Prior work has shown that avatar customization is linked to
intrinsic motivation [25], immersion [25], time spent playing [25],
motivation for future play [180], and likelihood of game recommendation
[180]. Furthermore, avatar identification and autonomy
are increased through avatar customization (e.g., [25, 180]), and
also affect intrinsic motivation, immersion, time spent playing, motivation
for future play, and likelihood of game recommendation
[25, 113, 180, 183, 198]. Therefore, we hypothesized a model in
which visual customization directly, and indirectly through avatar
identification and autonomy, influences intrinsic motivation, immersion,
time spent playing, motivation for future play, and likelihood
of game recommendation. Lastly, given the lack of prior work
on audial customization, we posed as research questions (without
any formal hypotheses) whether audial customization moderated
any of these effects.
H3.1: Visual customization will lead to higher intrinsic motivation.
H3.2: Avatar identification will mediate H3.1.
H3.3: Autonomy will mediate H3.1.
H4.1: Visual customization will lead to higher immersion.
H4.2: Avatar identification will mediate H4.1.
H4.3: Autonomy will mediate H4.1.
H5.1: Visual customization will lead to higher time spent playing.
H5.2: Avatar identification will mediate H5.1.
H5.3: Autonomy will mediate H5.1.
H6.1: Visual customization will lead to higher motivation for future play.
H6.2: Avatar identification will mediate H6.1.
H6.3: Autonomy will mediate H6.1.
H7.1: Visual customization will lead to higher likelihood of game recommendation.
H7.2: Avatar identification will mediate H7.1.
H7.3: Autonomy will mediate H7.1.
Research Question: Does audial customization moderate H3–H7?
3 EXPERIMENTAL TESTBED
Our experimental testbed is CodeBreakers4 [109], which was created
for conducting avatar-based studies. CodeBreakers is a Java
programming game in which players solve increasingly difficult
problems by throwing snippets of code. See Figure 1. CodeBreakers
was iteratively created with feedback from professional game
developers, game designers, and Java developers, and it included
informal play testing with playtesters over an eighteen-month span.
There were 14 total puzzles, spanning 6 levels. CodeBreakers was
designed to incorporate best practices on effective learning curves
[142]. Programming topics include data types, conditionals and control
flow, classes and objects, inheritance and interfaces, loops and
recursion, and data structures. Each puzzle had up to 3 hints, which
are increasingly detailed. Players controlled their character using
the keyboard and mouse. CodeBreakers was originally developed
for Microsoft Windows and macOS. However, for the purposes of
this experiment, CodeBreakers was converted to WebGL and was
therefore playable on any PC inside of the browser (e.g., Chrome,
Firefox, Safari). See Section 4.4.1 for details. In total, there were
30 possible voice lines that could have been triggered. Other than
the first voice line (What am I doing here? Did my ship crash? How
long have I been lying here for? I guess I should get up and look
around.), voice lines typically come before and after each puzzle.
For example, prior to puzzle #7: The castle is under siege!. And after
completing puzzle #7: It worked! I neutralized all of the bugs by using
the staff. These voice lines were accompanied by speech bubbles
(see Figure 2).

3 Note that the avatar model color was changed to gray for this study. See Section 4 for details.
4 Gameplay video: https://youtu.be/x5U-Jd6tKXA.

Figure 1: Data type puzzle (L). Curing a wounded knight (R). Placeholders . . . indicate where code snippets can be thrown.3
4 METHODS
For this study, we explicitly aimed to create stereotypically-appearing
(and sounding) “male” and “female” avatars. We created four avatar
appearances (two male and two female) and four avatar voices (two
male and two female). We made these design decisions with an
understanding that a binary view of gender is problematic, but we
did so for ecological validity with the majority of existing games.
While it would have been possible to create a more inclusive set
of gender choices, this might have presented a possible confound,
as such choices are not currently available in most of today’s games.
Our goal is to develop a baseline understanding of the presence
of customization choices that mirror current games. Such baseline
understandings can inform future avatar customization research
and implementation, in which we hope that more inclusive design
choices become the norm. Finally, our rationale for creating two
visual choices and two audial choices for each gender was to add a
(minimal) degree of visual and audial choice.
4.1 Model Development
All four models used in this experiment were designed and created
from scratch by a professional 3D game artist. The models were
purposefully designed to avoid known color effects (e.g., the color
red is known to reduce mood, affect, and performance in cognitive-oriented
tasks [84, 98, 111, 127, 158, 159]). We chose gray because
it matched the aesthetic of the game and is not associated with
negative physiological effects on cognition and heart rate variability
(HF-HRV) [69]. All four models shared an identical skeleton
and joints, and therefore all animations (i.e., idle, walking, picking
up code, throwing code, using weapons, falling, dying, stopped in
front of a wall, etc.) were identical across the four models. Only
visual appearance differed. See Figure 3.
4.2 Voice Development
4.2.1 Voice Development Goal. Our goal was to create four avatar
voices (two stereotypically male and two stereotypically female). We
wanted each voice to be appropriate for the game and for either
of the two models of the same gender. Additionally,
we wanted each male voice to have a “matching” female voice
as rated on a scale of perceived vocal dimensions—e.g., strong vs.
weak, smooth vs. rough, resonant vs. shrill [82].5 In other words,
we wanted these matched voices to sound as similar as possible.
This matching was done to mitigate confounds from
large differences between voices. High variance between voices
would add an additional dimension to the manipulation, which
could influence the study results. Nevertheless, we wanted both
male voices to be distinct from one another and both female voices
to be distinct from one another. If this were not the case (e.g., both
male voices sounded the same), then our manipulation of giving
users a choice of voice would only be illusory.
4.2.2 Creating Voices. We hired two professional voice actors with
over ten years of experience in character voice acting. Both voice
actors were screened through their portfolios, which contained samples
of their work. Both voice actors provided sample voice clips for
CodeBreakers prior to being hired. We decided to hire two voice
actors instead of four because: (1) we could ensure greater overall
consistency across voices, helping to bound the variance across
voices, and (2) both voice actors had demonstrated evidence of being
able to perform a multitude of different voices and characters, providing
assurance that each voice actor could produce two unique-sounding
voices. Both voice actors self-identified as white and have lived
in the U.S. for their entire lives. One voice actor self-identified as
male and was 49 years old. The other voice actor self-identified as
female and was 38 years old. The two voice actors were instructed
to work together to create two “matching” voice pairs as described
in Section 4.2.1. Our goals for the four voices, including the scale of
vocal dimensions [82], were clearly articulated to the voice actors.
Additionally, both voice actors familiarized themselves with the
game by watching gameplay video of CodeBreakers. Both voice
actors were also shown the four models that they were voicing. All
voices were recorded in the same professional audio recording studio
with both voice actors physically copresent. Identical recording
equipment and software was used for recording each voice clip:
Sennheiser MK-416 (microphone), Universal Audio Arrow (audio
interface), and Ableton Live 10 (digital audio workstation). Completed
voice clips were reviewed by the project team, and several
iterations were made on the voice clips to ensure that our criteria
in Section 4.2.1 appeared to be satisfied. A total of 120 voice clips
(30 per voice) were recorded and finalized. Sample audio clips can
be found at https://osf.io/mnpsd/. M1 is male voice one, M2 is male
voice two, F1 is female voice one, and F2 is female voice two.

5 We discuss this scale in more detail in the validation section below.

Figure 2: Voice audio occurs in conjunction with speech bubbles that appear on top of the avatar.3

Figure 3: Front view (L) and back view (R) of the four models.
4.2.3 Voice Loudness Normalization. While the same recording
studio and recording equipment were used for every voice, relative
amplitude (i.e., loudness) could still differ between voices, especially
between the two different voice actors. To normalize loudness
across all voices and voice clips, we adopted the loudness
normalization recommendation of the EBU R 128 standard (issued
by the European Broadcasting Union) [71]. It recommends
normalizing audio to -23 ± 0.5 Loudness Units Full Scale (LUFS),
with a max peak of -1 decibel True Peak (dBTP). A professional
audio engineer with 15+ years of experience performed this
normalization using Nuendo 11 Pro and verified that the loudness
normalization recommendation was satisfied.
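The normalization described above amounts to measuring each clip’s integrated loudness and applying a fixed gain toward the target. The arithmetic can be sketched as follows (a simplified illustration under our own assumptions; measuring LUFS itself requires R 128’s K-weighting and gating, and the -1 dBTP true-peak limiting step is not shown):

```python
def normalization_gain(measured_lufs, target_lufs=-23.0):
    """dB of gain needed to move a clip from its measured integrated
    loudness to the EBU R 128 target of -23 LUFS."""
    return target_lufs - measured_lufs

def apply_gain(samples, gain_db):
    """Scale linear PCM samples by a dB gain (amplitude convention:
    gain_db = 20 * log10(factor))."""
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]

# A clip measured at -18.5 LUFS is 4.5 LU too loud, so it gets -4.5 dB:
gain = normalization_gain(-18.5)
```

In practice, a tool such as the one used in the study handles both the loudness measurement and the true-peak constraint in one pass; this sketch only shows the gain step.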
4.3 Voice Validation
4.3.1 Expert Voice Validation. To ensure that we had created two
distinct matching pairs of voices (similarity within each pair but
variance between pairs), we hired three expert speech pathologists
to evaluate each voice. Each speech pathologist was instructed
to listen to a set of voices and then asked to rate each voice on a
scale. Each speech pathologist was compensated $25. The speech
pathologists all had at least 10 years of professional speech pathology
experience (M=20.0, SD=8.19), with an average age of M=47.67
(SD=4.93). Before rating the voices, each speech pathologist was
instructed to familiarize themselves with the validated scale on
perceptual attributes of voice [82].6 This scale consists of 17 items,
all rated on a Likert scale from 1 to 9. Anchor points
for each item are listed in Table 1. Each speech pathologist was
provided the 30 voice clips associated with each voice, and each was
asked to listen to the entire set of clips belonging to a single voice
before rating that voice. Speech pathologists performed the ratings

6 The scale has been used with speech pathologists, revealing modest within-group
agreement despite the absence of any training in the interpretation of the scale descriptors
[82].
Lower Anchor—Upper Anchor | M1 (SD) | M2 (SD) | F1 (SD) | F2 (SD)
High Pitch—Low Pitch | 6.33 (0.58) | 8.00 (0.00) | 3.33 (0.71) | 4.33 (0.58)
Loud—Soft | 4.67 (1.53) | 4.67 (2.31) | 4.67 (1.41) | 4.00 (1.73)
Strong—Weak | 2.00 (0.00) | 2.33 (1.53) | 3.33 (0.71) | 3.00 (1.00)
Smooth—Rough | 2.33 (0.58) | 4.00 (1.73) | 2.00 (0.00) | 3.67 (1.15)
Pleasant—Unpleasant | 1.67 (0.58) | 2.33 (0.58) | 1.67 (0.71) | 3.00 (0.00)
Resonant—Shrill | 2.67 (0.58) | 1.67 (0.58) | 3.67 (2.83) | 3.33 (1.15)
Clear—Hoarse | 2.33 (0.58) | 3.67 (2.89) | 2.33 (0.71) | 3.67 (2.89)
Unforced—Strained | 3.00 (1.00) | 4.33 (2.52) | 3.00 (0.71) | 3.67 (1.53)
Soothing—Harsh | 3.33 (0.58) | 2.67 (0.58) | 2.67 (0.71) | 3.33 (1.53)
Melodious—Raspy | 3.33 (0.58) | 4.33 (2.08) | 2.33 (0.00) | 4.67 (0.58)
Breathy Voice—Full Voice | 7.00 (1.73) | 8.33 (0.58) | 5.00 (2.83) | 7.00 (1.00)
Excessively Nasal—Insufficiently Nasal | 5.00 (0.00) | 5.00 (0.00) | 5.00 (0.00) | 4.00 (1.00)
Animated—Monotonous | 1.67 (0.58) | 4.67 (1.53) | 1.67 (0.00) | 4.00 (1.73)
Steady—Shaky | 2.00 (0.00) | 2.33 (0.58) | 2.33 (0.00) | 2.33 (0.58)
Young—Old | 4.33 (0.58) | 5.67 (0.58) | 3.33 (0.71) | 4.33 (1.15)
Slow Rate—Rapid Rate | 4.67 (0.58) | 5.33 (0.58) | 5.33 (0.71) | 5.33 (0.58)
I Like This Voice—I Do Not Like This Voice | 1.67 (1.15) | 2.00 (1.00) | 1.67 (1.41) | 3.33 (1.53)

Table 1: Mean expert speech pathologist ratings for each voice. All items are rated on a 9-pt Likert scale from 1:Lower Anchor
to 9:Upper Anchor.
using their own computers, and they were asked to use the most
professional audio equipment available to them to perform the evaluation.
Across the three speech pathologists’ ratings, we calculated
the intraclass correlation to be ICC=0.83, 95% CI [0.75, 0.89] (two-way
mixed, average measures [203]), indicating high agreement.
Mean ratings for each voice can be seen in Table 1. As a measure
of similarity between voices, we then calculated an absolute mean
difference across the scale between every possible pair of voices.
As expected, this difference was lower in the two matched pairs
(M1/F1: M=0.67; M2/F2: M=0.88) than in mismatched
pairs (M1/F2: M=2.33; M2/F1: M=1.41) or same-gender pairs
(M1/M2: M=1.08; F1/F2: M=0.98). Although the same-gender pairs
have an absolute mean difference close to the two matched pairs,
we attribute some of this to voice attributes that are known to
often vary naturally between genders (e.g., pitch [30]).
Nevertheless, one potential concern arising from these results is
that the same-gender voices may not be perceived as distinct from
one another. Therefore, we performed an additional crowdsourced
validation.
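For readers who want to reproduce these two statistics, the following minimal pure-Python sketch (with illustrative data, not the study’s ratings) computes an ICC(3,k) estimate (two-way mixed, average measures, consistency definition) and the absolute mean difference used above to compare voice pairs:

```python
def icc3k(ratings):
    """ICC(3,k): two-way mixed, average-measures, consistency definition.

    `ratings` is a list of rows, one per rated target, each holding one
    rating per rater (every rater rates every target).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

def mean_abs_diff(voice_a, voice_b):
    """Absolute mean difference between two voices' mean item ratings."""
    return sum(abs(a - b) for a, b in zip(voice_a, voice_b)) / len(voice_a)
```

With raters whose ratings differ only by a constant offset, `icc3k` returns 1.0 (up to floating point), reflecting perfect consistency; in the study, the 17-item mean ratings per voice (Table 1) would be the inputs to `mean_abs_diff`.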
4.3.2 Crowdsourced Voice Validation. To ensure that we had created
two distinct matching pairs of voices, that all voices would be
perceived as high quality, that voices would be perceived
as the stereotypically intended gender, and that voices of the
same gender would be perceived as unique and distinct, we
ran a crowdsourced validation study. This was to reinforce and
extend the prior expert validation. We recruited 91 participants
(39% self-identified as female) on MTurk to rate voices based on
sets of audio clips. Each participant was compensated $1.00 (USD).
Participants had a mean age of 40.62 (SD=13.82). All participants
were from the U.S. After filling out a consent form, each participant
was first presented with, at random, either a stereotypically male or
female voice clip of an English word, which they needed to type correctly.
This was to ensure that the participant’s audio was turned
on and working. Each of the following questions was equipped
with analytics that tracked the amount of time each participant
spent listening to audio clips. These analytics were used to validate
that participants had actually listened to the audio clips before answering
the questions. ~10% of participants were removed for not
having listened to all audio clips in the study in their entirety.

Participants were then asked to “Please listen to ALL of the following
audio clips before answering the question below comparing the
first (left-side) and second (right-side) voices.” And to rate: “Besides
gender-related voice characteristics, I consider these two voices as
similar,” on a scale of 1:Strongly Disagree to 7:Strongly Agree. This
question was asked four times, comparing the following pairs of
voices in randomized order: M1/F1, M2/F2, M2/F1, and M1/F2.
For each comparison, 5 voice clips were selected at random (from
the total 30), and those same 5 voice clips were shown for both of
the two voices being compared (i.e., the same speech dialog).7 Results
indicated that matched pairs (M1/F1: M=5.51, SD=1.50; M2/F2:
M=4.92, SD=1.68) were rated as more similar to one
another than unmatched pairs (M1/F2: M=4.01, SD=1.64; M2/F1:
M=3.13, SD=1.71).
Participants were then asked to “Please listen to ALL of the following
audio clips. All clips belong to one voice. After listening to all
of the clips, you will be asked a question regarding the voice.” And to
rate: “Based on the voice you just listened to, please rate the following:
The voice is high-quality,” “The speaker sounds (stereotypically)
male,” and “The speaker sounds (stereotypically) female” on a scale
of 1:Strongly Disagree to 7:Strongly Agree. This question was asked
for each of the four voices in randomized order. For each voice,
5 voice clips were selected at random (from the total 30). Results
indicated that all voices were perceived to be relatively high quality
(M1: M=6.02, SD=0.80; F1: M=6.06, SD=0.98; M2: M=5.80, SD=1.12;
F2: M=5.60, SD=1.08) and that voices sounded stereotypically male
(M1: M=6.74, SD=0.51; F1: M=1.20, SD=0.56; M2: M=6.85, SD=0.39;
F2: M=1.34, SD=0.89) or female (M1: M=1.32, SD=0.77; F1: M=6.79,
SD=0.44; M2: M=1.15, SD=0.52; F2: M=6.70, SD=0.55) as intended.
Participants were then asked to “Please listen to ALL of the following
audio clips before answering the question below comparing
the first (left-side) and second (right-side) voices.” And to rate: “In
comparing the two voices above (left audio clips vs. right audio clips),
please rate the following: These two voices are distinct and different
from one another,” on a scale of 1:Strongly Disagree to 7:Strongly
Agree. This question was asked twice for voices in each gender
(M1/M2 and F1/F2) in a random order. For each comparison, 5 voice
clips were selected at random. Results indicated that same-gender
voice pairs were perceived to be relatively distinct (M1/M2: M=5.78,
SD=1.07; F1/F2: M=5.73, SD=1.30). Participants then entered demographic
information.

7 Note that randomization is done per participant and per question, so the 5 voice clips
selected vary both across questions and across participants.
4.4 Model and Voice Integration
4.4.1 WebGL Conversion and Technical Testing. Over 4 months, the
original CodeBreakers game, which is playable on machines running
either Microsoft Windows or macOS [114], was converted to
WebGL to allow for a more convenient play experience. The WebGL
version is playable on any PC inside of the browser (e.g., Chrome,
Firefox, Safari). This conversion was performed by a professional
game development team with expertise in game optimization. During
the conversion process, we iterated on the game internally every
few days and externally every few weeks. Our main goal during
these iterations was to ensure that performance (e.g., frames per
second) was adequate and that there were no technical issues (e.g.,
crashing). Internal iterations were performed by the development
and research team, with feedback fed into the next iteration.
Performance profiling tools were used extensively to diagnose areas
of the game (e.g., code loops, rendering of certain geometry) responsible
for increased CPU and memory usage. External iterations were
performed when we wanted the game to be tested more widely. We
performed iterations with batches of 10-20 participants at a time on
MTurk. Participants were asked to play the entire game and were
provided a walkthrough video in case they were unable to progress.
This ensured that each participant would cover the breadth of the
entire game. Data, including gameplay metrics, performance, crash
logs, and PC details, was automatically logged on the server for
further analysis. Participants could report any issues, problems,
or concerns they experienced during playtesting. A total of 121
participants, all from the U.S., took part in external playtesting.
Each participant was compensated $10 (USD). Our testing ended
when no new technical issues arose in the most recent internal and
external iterations, all known technical issues were fixed, and the
game performed adequately (e.g., frames per second, load times)
under a wide variety of PCs. Additionally, the development and
research team agreed that, for all intents and purposes, the WebGL
game played and felt identical to the original.
4.4.2 Character Customization UI. A professional game UI designer
created the four different character customization screens that
we requested. These correspond to our experimental conditions
(see Figure 4). We made the explicit design decision never to
allow mismatched model–voice gender pairings (i.e., male model
and female voice or vice versa), since this may be unnatural for
players, lacks general ecological validity with existing games, and
may be an experimental confound (e.g., in conditions where one
or both features are assigned at random). Therefore, avatar customization
is, in all cases, a two-step process that involves first
choosing or being assigned a model (one of four), then choosing
or being assigned a voice (one of two, since the model has already
been selected and there are only two voices corresponding to the
designed stereotypical gender).

In Choice-None, the player does not have any choice over the
model or voice. Both model and voice are randomly assigned. In
Choice-Audio, a model is randomly preselected, and the player is
able to choose the voice. In Choice-Visual, the player chooses a
model, after which the voice is randomly assigned. In Choice-All,
the player chooses both model and voice. Note that the two voices
corresponding to “Voice 1” and “Voice 2” will differ depending on
the model selected. In Choice-All, both voice options are grayed out
and unavailable until a model has been selected. If a different model
is selected after a voice has been selected, the voice is automatically
deselected. In all conditions, players must enter a name for their
character. For conditions that allow a model choice (Choice-Visual
and Choice-All), the UI initially shows an empty box where
the selected model would normally appear (i.e., no model is selected
by default). For conditions that allow a voice choice (Choice-Audio
and Choice-All), no voice is selected by default (i.e., one of the
two voices must be selected manually by the player). When a voice is
selected, a single audio clip is played from that voice so that players
can compare voices. In all conditions, players must complete all
customization options available (e.g., name, model, voice) before the
“Start Game” button becomes available. Character customization
conditions were designed in this manner to minimize differences
between conditions, while still varying the manipulations (visual
choice and audial choice).
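The selection rules above can be summarized as a small state machine. The sketch below is our own hypothetical reconstruction of that logic (class, model, and option names are invented, not taken from the CodeBreakers source):

```python
import random

# Hypothetical inventories mirroring the paper's setup: four models
# (two per stereotypical gender) and two voices per gender.
MODELS = {"male_1": "male", "male_2": "male",
          "female_1": "female", "female_2": "female"}
VOICES = {"male": ["M1", "M2"], "female": ["F1", "F2"]}

class CustomizationScreen:
    def __init__(self, visual_choice, audial_choice):
        self.visual_choice = visual_choice
        self.audial_choice = audial_choice
        self.name = None
        # Without visual choice, the model is randomly assigned up front.
        self.model = None if visual_choice else random.choice(list(MODELS))
        # Without audial choice, a voice is randomly assigned once a model exists.
        self.voice = (random.choice(VOICES[MODELS[self.model]])
                      if self.model and not audial_choice else None)

    def choose_model(self, model):
        assert self.visual_choice and model in MODELS
        if model != self.model:
            self.model = model
            # Changing the model invalidates any selected voice; without
            # audial choice, a new voice is randomly assigned instead.
            self.voice = (None if self.audial_choice
                          else random.choice(VOICES[MODELS[model]]))

    def choose_voice(self, voice):
        # Voice options stay "grayed out" until a model fixes the gender,
        # and cross-gender model-voice pairings are never allowed.
        assert self.audial_choice and self.model is not None
        assert voice in VOICES[MODELS[self.model]]
        self.voice = voice

    def start_enabled(self):
        # "Start Game" unlocks only once name, model, and voice are all set.
        return None not in (self.name, self.model, self.voice)
```

In the Choice-All condition, for example, picking a model, then a voice, then switching models deselects the voice and disables Start Game again until a new voice is chosen.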
4.4.3 Expert UI Validation. To assess the appropriateness of our
character customization UIs, we performed a validation study with
three professional game UI designers. The game UI designers were
recruited from the online freelancing platform Upwork and were each
paid $20 (USD). The job posting was Assess Character Customization
Interface in Educational Game, and the job description stated that
we were looking for expert game UI designers to evaluate a set
of character customization interfaces in an educational game. The
three UI designers had an average of 9.00 (SD=4.36) years of UI
design experience and an average of 7.67 (SD=5.69) years of game
development experience. The UI designers all had work experience and
portfolios that reflected recent UI design and game development
experience (all within one year). The UI designers were instructed to
give their honest opinions, were told their responses would be
anonymous, and then proceeded to our survey. Each UI designer was first
asked to watch 30 minutes of gameplay footage from CodeBreakers
to familiarize themselves with the game. Afterwards, each designer
loaded CodeBreakers WebGL on their own machine and interacted
with every version of the UI in a randomized order. After interacting
with a specific version of the UI, the UI designer was asked to rate
“The character customization interface is appropriate for the game,” on
a scale of 1:Strongly Disagree to 7:Strongly Agree. UI designers were
asked to rate each interface individually, not in comparison to the
other interfaces they had already seen. UI designers were also able
to report open-ended feedback. The survey took approximately 1.5
hours to complete. Responses showed that the UI designers generally
agreed that the character customization interface was appropriate
(Choice-None: M=6.67, SD=0.58; Choice-Audio: M=6.33, SD=1.16;
Choice-Visual: M=6.33, SD=1.16; Choice-All: M=7.00, SD=0.00). One
UI designer did note in open-ended feedback that they had not
expected to be able to choose a voice for their character, since this
is not a commonly available feature in games, but stated that
this did not play a role in their ratings.

Figure 4: Avatar customization screens. (a) Choice-None: Participant is randomly assigned both model and voice. (b) Choice-Audio:
Participant is randomly assigned model and chooses voice. (c) Choice-Visual: Participant chooses model and is randomly
assigned voice. (d) Choice-All: Participant chooses both model and voice.
4.4.4 Model and Voice Integration Validation. To assess whether
the models and voices that we had developed would be perceived
as appropriate for the game, we recruited 120 participants (43%
female) on MTurk. All 120 participants played CodeBreakers using
the Choice-None condition (i.e., randomly assigned model and
voice). Participants played the game for a minimum of 5 minutes,
but they were allowed to play as long as they liked beyond the 5-minute
mark. Random assignments were roughly even across models
(24.2%/24.2%/32.5%/19.2%) and voices (24.2%/27.5%/26.7%/21.7%).
For the remainder of this section, ratings described for models follow
the left-to-right order of models shown in Figure 3. See Figure 5
for graphs summarizing the validation results.8

To assess whether the models overall visually fit the game, we asked,
“How appropriate were your avatar’s visual characteristics for the
game?” on a scale from 1:Inappropriate to 5:Appropriate. Scores
tended between neutral and appropriate for each model (M=4.24,
SD=0.83; M=3.86, SD=0.79; M=4.18, SD=0.76; M=4.04, SD=0.77).
To assess whether the voices overall audially fit the game, we asked
“How appropriate was your avatar’s voice for the game?” on a scale
from 1:Inappropriate to 5:Appropriate. Scores again tended between
neutral and appropriate for each

8 All validation questions are found in the graphs except for “How appropriate was the
avatar design overall?” for which summary statistics are provided in the text.