NavigaTone: Seamlessly Embedding Navigation Cues in
Mobile Music Listening
Florian Heller
Hasselt University - tUL - imec
3590 Diepenbeek, Belgium
florian.heller@uhasselt.be
Johannes Schöning
University of Bremen
Bremen, Germany
schoening@uni-bremen.de
ABSTRACT
As humans, we have the natural capability of localizing the
origin of sounds. Spatial audio rendering leverages this skill
by applying special filters to recorded audio to create the
impression that a sound emanates from a certain position in the
physical space. A main application for spatial audio on mobile
devices is to provide non-visual navigation cues. Current
systems require users to either listen to artificial beacon sounds,
or the entire audio source (e.g., a song) is re-positioned in
space, which impacts the listening experience. We present
NavigaTone, a system that takes advantage of multi-track
recordings and provides directional cues by moving a single
track in the auditory space. While minimizing the impact of
the navigation component on the listening experience, a user
study showed that participants could localize sources as well
as with stereo panning, while the listening experience was rated
to be closer to common music listening.
ACM Classification Keywords
H.5.5. Information Interfaces and Presentation (e.g. HCI):
Sound and Music Computing; Systems
Author Keywords
Virtual Audio Spaces; Spatial Audio; Mobile Devices; Audio
Augmented Reality; Navigation.
INTRODUCTION
Portable music players have been popular since they first appeared
over 30 years ago. As part of every smartphone, they are ubiq-
uitous companions of our mobile lifestyles. While in the early
days, wearing headphones in public was rather unusual, it has
become a common sight and headphones are, more than ever,
a fashion accessory [19]. The rich built-in sensors of current
smartphones and the omnipresence of headphones allow us
to think of new audio-based applications. Using auditory bea-
cons as navigational aids in mobile audio augmented reality
systems has been shown to be a powerful and unobtrusive con-
cept for pedestrian navigation. A large body of related work
exists that focuses on how to use spatial audio for pedestrian
navigation.

Figure 1. NavigaTone leverages the possibilities of multi-track recordings for spatial audio navigation. Instead of moving the entire track around in the stereo spectrum (left), NavigaTone only moves the voice of the singer or another instrument around the user's head.

These either represent the navigation target using
a dedicated beacon sound [10, 18, 25] which limits the use of
headphones to that single purpose, or by placing the source
of a music track at the target location [2, 21, 27] which im-
pacts the listening experience. Alternatively, orientation cues
could be provided using a bone conductance headphone [14,
26] or an acoustically transparent augmented reality audio
headset [22], both of which do not block the perception of
the surrounding soundscape. However, these do not solve the
issue of blending navigation cues into the music the user is
listening to.
In this note we present NavigaTone, which overcomes this prob-
lem by leveraging the potential of multi-track recordings. Navi-
gaTone integrates the required navigational cues into the regular
stream of music in an unobtrusive way. Instead of moving the
entire track around in the stereo panorama, we only move a sin-
gle voice, instrument, or instrument group (cf. Figure 1). This
is possible with the recent appearance of commercially avail-
able multi-track recordings, e.g., in Native Instruments’ STEM
format [7]. This allows us to balance between the impact of
the navigation cue on the overall perception of the audio track
and the ability to localize the cue, with the goal to minimize
the effect on perception while still providing a good sense
of orientation. We compared the aesthetic appearance and
localizability of moving single sources of such a multitrack
recording in space against overlaying the song with additional
beacon sounds. Users clearly preferred the moving voice of
the singer over an overlaid beacon sound. To further analyze
the localization precision and perception of such an orienta-
tion cue, we performed a controlled user study to compare our
NavigaTone system against the baseline of moving the entire
track in the stereo spectrum. Our participants reported that
NavigaTone felt significantly more natural and intuitive com-
pared to the baseline, while being accurate in differentiating
the origin of a sound at 30° resolution, which is sufficient for a
pedestrian navigation scenario. In terms of other performance
variables, the more obtrusive baseline of moving the entire
track in the stereo spectrum performed slightly better, but was
perceived to negatively impact the listening experience.
RELATED WORK
AudioGPS [10] was the first system using auditory cues for
navigation by indicating the direction of the target. To dif-
ferentiate between sources in front or in the back of the user,
it used a harpsichord and a trombone timbre as navigational
beacons. Instead of using different beacon sounds, spatial
audio rendering applies special filters to an audio signal to
make it appear from a certain position in space. This can be
used to augment a certain target location with a sound which
the user localizes and tries to reach [2, 25]. However, as those
approaches require the user to wear headphones, listening to
a single beacon sound blocks this channel for other sources
of information (e.g., music, phone calls, audio books). To
overcome this limitation, several systems integrated the navi-
gation function into mobile music players. ONTRACK [12]
augments the waypoints of a navigation route with a song that
is panned from left to right and becomes louder the closer
the user gets to it. GpsTunes [21] inverts the distance cues and
dims the audio in close proximity to the target destination in
order to minimize the impact on the music listening experi-
ence. For sources behind the head, GpsTunes applied
a low-pass filter to give the music a more muffled character.
The approach to pan the entire track into a certain direction,
however, can result in the audio being played on one ear only
if the target is on the far left or right [1, 8, 12, 21, 31], which
affects how music is actually perceived [13].
The main advantage of this technology, compared to the
mainly visual (e.g., map-based or instruction-based navigation
systems such as [28]) or haptic [20] pedestrian navigation sys-
tems, is that it leverages our natural capability of localizing
sound sources in space and thereby reduces the load on the
visual (or haptic) senses [9] which are already used for the
primary task of, e.g., walking or riding a bike [32].
ORIENTATION CUES
To be able to move a certain instrument or the voice of a singer
around in the stereo spectrum, we need to separate it from the
rest. For separating the voice from the rest of a pop song, the
algorithm presented in [11] produces very good results. The
separation into single instrument stems, however, remains
more complicated and can result in artifacts and noise. In
contrast to that, multi-track recordings provide these separate
stems for the different instruments, which means that you can
move, e.g., the hi-hat independently of the lead guitar. We take
advantage of such recordings to reduce the impact of the navi-
gation component on the music listening experience. Instead
of moving all sources around, or, even worse, cutting off certain in-
struments by using the balance control, we only move a single
source, e.g., the voice of the singer (cf. Figure 1). To
achieve good localizability of the orientation cue, the choice
of the beacon sound is very important. Transient sounds, i.e.,
sounds with a short duration and a high-amplitude onset, are easiest for
human listeners to localize [4]. Furthermore, the larger the fre-
quency range of the sound, the more information it can carry
that can be used for localization. Most of the lab studies on
the localizability of sound, therefore, include bursts of white
noise as beacon sounds as they incorporate all of the aforemen-
tioned qualities. Such synthetic sounds, however, while being
technically optimal, are not necessarily a pleasing listening
experience. As a natural signal, human speech covers a large
spectrum and contains repeated transient elements, making
it a suitable cue without exhibiting the problem of repetition
as faced by synthetic beacon sounds [23]. Furthermore, our
auditory perception is optimized to localize and identify hu-
man speakers, even in complex auditory environments [3, 17].
Alternatively, the drums in a song are suitable as a localization
cue because they are strong transient signals in an inherently
repetitive pattern.
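As an illustration of this idea, the following sketch pans only the lead-vocal stem toward the bearing of the navigation target while all other stems stay centered. It is a minimal approximation using constant-power stereo panning in Python with NumPy, not the HRTF-based rendering used by NavigaTone; the stem names and signals are hypothetical, and keeping the remaining stems centered is a simplification.

# Minimal sketch (not the authors' renderer): pan only the cue stem toward the
# target bearing; all other stems remain centered. Stems are mono NumPy arrays.
import numpy as np

def constant_power_pan(mono: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Approximate a frontal direction with constant-power panning.
    azimuth_deg: -90 (far left) .. +90 (far right)."""
    theta = (azimuth_deg + 90.0) / 180.0 * (np.pi / 2.0)  # map to 0..pi/2
    left, right = np.cos(theta), np.sin(theta)
    return np.stack([mono * left, mono * right], axis=1)

def mix_navigatone(stems: dict, cue_stem: str, target_azimuth_deg: float) -> np.ndarray:
    """Pan only the cue stem toward the navigation target; keep the rest centered."""
    out = None
    for name, mono in stems.items():
        az = target_azimuth_deg if name == cue_stem else 0.0
        stereo = constant_power_pan(mono, az)
        out = stereo if out is None else out + stereo
    return out

# Usage with four hypothetical stems; the vocal cue points 45 degrees to the right.
sr = 44100
t = np.arange(sr) / sr
stems = {name: 0.1 * np.sin(2 * np.pi * f * t)
         for name, f in [("vocals", 440), ("drums", 110), ("bass", 55), ("other", 220)]}
mix = mix_navigatone(stems, cue_stem="vocals", target_azimuth_deg=45.0)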
ORIENTATION SURVEY
To find out whether there is an opportunity to integrate our
envisioned navigation cues into peoples’ everyday listening
habits, we conducted an online survey. The first section of
our online survey was targeted at understanding the listening
behavior of our potential users, and was completed by 21
participants. A majority listens to music at least sometimes
(3), very often (10), or always (3) while being on the move,
and only very few never (1) or rarely (4) do so. Nearly all use
headphones or earphones to listen to their music, and all but
one listen to the music with both ears. Music stored on the
device (13) and music from a streaming service like Spotify
(13) are the preferred sources of content. Audio books (7)
or speech podcasts (6) are popular as well, whereas only one
mentioned to listen to musical podcasts. In most cases (14),
the phone is in a pocket of their trousers. Only few keep the
phone in their hand (2), while the rest places it in some other
pocket or bag, depending on the situation.
Balancing Qualities
In the second part of the survey, we further evaluated the use of
different types of cue sounds, with the goal of finding the right
balance between an aesthetically pleasing presentation and a
good localizability. We used an excerpt of Carlos Gonzalez's
'A Place For Us', a vocal pop song available as a multitrack
recording1, as the basis for this experiment. We used four differ-
ent orientation cues: two overlays and two integrated ones.
Overlay cues certainly affect the listening experience; never-
theless, we used them because their potentially better localizability
could compensate for the degraded aesthetic perception. A
majority of the lab experiments on source localization have
been performed with noise bursts as they provide the largest
frequency range. Therefore, our first overlay consists of pink
noise bursts at a 1.7 Hz repetition rate (1/4 bar at 102 bpm) with a
duty cycle of 50%. Our second overlay consists of an 880 Hz
pure sine wave at a 0.85 Hz repetition rate (1/2 bar at 102 bpm).
The overlays were beat-synchronized with the rest of the mu-
sic. As integrated cues, we used the lead voice and the snare
drums. All sources were mixed in the KLANG:app for iOS.
1cambridge-mt.com/ms-mtk.htm
The audio files and the mixing configuration are available
online2.
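For reference, the two overlay beacons can be approximated as follows. This is a sketch under the parameters stated above (pink-noise bursts at 1.7 Hz with a 50% duty cycle and an 880 Hz sine at 0.85 Hz); the pink-noise filter is Paul Kellet's common "economy" approximation, and the beep's duty cycle is an assumption, as the text does not specify it.

# Sketch of the two overlay beacons; parameters taken from the text above,
# the pink-noise filter and the beep duty cycle are assumptions.
import numpy as np

SR = 44100

def pink_noise(n: int) -> np.ndarray:
    """Approximate pink noise by filtering white noise (Paul Kellet's economy method)."""
    white = np.random.randn(n)
    b0 = b1 = b2 = 0.0
    out = np.empty(n)
    for i, w in enumerate(white):
        b0 = 0.99765 * b0 + w * 0.0990460
        b1 = 0.96300 * b1 + w * 0.2965164
        b2 = 0.57000 * b2 + w * 1.0526913
        out[i] = b0 + b1 + b2 + w * 0.1848
    return out / np.max(np.abs(out))

def gated(signal: np.ndarray, rate_hz: float, duty: float) -> np.ndarray:
    """Gate a signal on/off at the given repetition rate and duty cycle."""
    t = np.arange(len(signal)) / SR
    gate = ((t * rate_hz) % 1.0) < duty
    return signal * gate

duration = 4.0
n = int(SR * duration)
t = np.arange(n) / SR
noise_overlay = gated(pink_noise(n), rate_hz=1.7, duty=0.5)                       # 1/4 bar at 102 bpm
beep_overlay = gated(0.3 * np.sin(2 * np.pi * 880 * t), rate_hz=0.85, duty=0.5)   # 1/2 bar at 102 bpm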
Study
We evaluated the perception of these directional cues both
in a static and a dynamic use, meaning the sources staying
at a fixed position or moving from one point to another, re-
spectively. We created samples with a fixed orientation cue at
the following angles in the frontal hemisphere: far left (-90°),
-45°, center (0°), 45°, and far right (90°). The dynamic cues moved
by 90°: from left to center, from center to left, from right to
center, from center to right, and from 45° right to 45° left.
Participants had to select the correct position or the correct
movement range. The static and dynamic conditions were
counterbalanced and the order of the samples randomized. In
total, n=13 participants (3 female, average age 30) completed
this part of the survey.
Results
As hypothesized, in terms of recognition rate, the dynamic
cues (M = .82, SD = .14) outperformed the static ones (M = .53,
SD = .1) by far (t(24) = -5.66, p < .0001). Both in the static
and the dynamic cases, the voice cue achieved the highest
recognition rate (dynamic: M = .92, SD = .19; static: M = .66,
SD = .22), followed by the overlaid noise (dynamic: M = .92,
SD = .19; static: M = .57, SD = .23) and beep (dynamic: M = .86,
SD = .24; static: M = .55, SD = .23), while the snare
drum achieved significantly lower rates in the dynamic case
(M = .55, SD = .36, p < 0.05; static: M = .35, SD = .23,
n.s.). We also asked the participants to rate three aspects of the
orientation cue on a five-point Likert scale. When asked how
well they could localize the orientation cue, the voice (Mdn =
4, IQR = 2.5) and noise (Mdn = 4, IQR = 2) cues achieve the
best ratings in both static and dynamic cases while the snare
drum (Mdn = 2, IQR = 1), again, got significantly (p < 0.002)
lower ratings. Participants also perceived the concentration
required to localize the source to be equally low for voice and
noise cues in the dynamic case (Mdn = 3, IQR = 2.5), whereas
beep (Mdn = 4, IQR = 1.5) and snare (Mdn = 5, IQR = 0)
demand high attention. The ratings for the static cues are a
little better (although not significantly), but the task also was
simpler (beep: Mdn = 3, IQR = 2, snare: Mdn = 4, IQR =
1, voice: Mdn = 2, IQR = 2, Noise: Mdn = 3, IQR = 1.5).
While in the static case, the listening experience for snare
(Mdn = 4, IQR = 1) and voice (Mdn = 4, IQR = 1) was rated
as good, they received lower ratings in the dynamic examples
(snare: Mdn = 3, IQR = 1.5, n.s.; voice: Mdn = 3, IQR = 1.5,
p = 0.0137). According to a repeated measures ANOVA, there
is a significant effect of the beacon sound on the perceived
listening experience. A post-hoc Tukey HSD test showed that
the two overlay sounds receive significantly lower ratings than
the two integrated ones (p < .0014). For the dynamic cues:
noise Mdn = 2, IQR = 2, beep Mdn = 2, IQR = 1.5 and for the
static ones: noise Mdn = 2, IQR = 1.5, beep Mdn = 3, IQR = 2.
At the end of the survey we asked which of the orientation cues
the participants preferred. A majority opted for the singer’s
voice (8), followed by the two overlay cues noise (4), and
beep (1).

2heller-web.net/navigatone

Figure 2. Median ratings of the different cue sounds. We asked how well participants could localize the cue, how much they have to concentrate to localize the sound, and their perception of the overall listening experience. Error bars indicate range.

The snare drum was the least preferred cue of all,
although, theoretically, it is a good choice for an orientation cue.
The ratings are probably low because it is much less present
in the mix than the other signals. We did not optimize the
mix for the orientation task, but chose one that is common for this
musical genre. The presence of the snare drum in the mix can,
however, be emphasized by increasing its volume relative to
the other stems.
LOCALIZATION PRECISION
While the goal of the survey described above was to de-
termine which kind of cue achieves an acceptable balance
between localization performance and aesthetic presentation,
the following experiment aims at determining the localization
performance more precisely.
Implementation
NavigaTone was implemented on an iPad Air 2 running iOS
with an attached motion-tracking headset. Spatial audio ren-
dering was performed in the KLANG:app, which uses a gen-
eralized Head-Related Transfer Function (HRTF) for spatial-
ization with a resolution of 1° in horizontal and 5° in vertical
direction. In a small experiment with 5 users, we determined a
minimum audible angle of around 6° in horizontal and 16° in
vertical direction, which is in line with our human capabilities
to locate sound sources [29]. We loaded the multitrack record-
ing in the software and placed the different sources around the
listener’s head. To track head movements, we used the Jabra
Intelligent Headset (intelligentheadset.com), which includes
a motion tracker that reports changes in head orientation at
a rate of around 40 Hz and has a specified latency of around
100 ms, which is noticeable [6] but well below the threshold
of 372 ms defined in [16]. The listener orientation and other
relevant playback parameters were sent to the KLANG:app
through OSC commands. While the audio data could also
have been transmitted to the headphones via Bluetooth along
with the sensor data, we used a wired connection between the
tablet and the headset to minimize latency. To allow simple
replication of this experiment, we used the demo track “Unsere
Stadt” that is part of the KLANG:app.
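The link between the head tracker and the renderer can be sketched as follows: orientation updates from the headset are forwarded to the spatial audio renderer as OSC messages. The python-osc client shown here and the OSC address patterns and port are illustrative placeholders, not the documented KLANG:app API.

# Sketch of forwarding listener orientation and the cue position via OSC.
# Addresses, host, and port are hypothetical placeholders.
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("192.168.0.10", 9000)  # hypothetical renderer address/port

def on_head_orientation(yaw_deg: float, pitch_deg: float, roll_deg: float) -> None:
    """Called at roughly 40 Hz by the headset's motion-tracker callback (assumed)."""
    client.send_message("/listener/orientation", [yaw_deg, pitch_deg, roll_deg])

def set_cue_position(azimuth_deg: float, distance_m: float = 2.0) -> None:
    """Place the lead-vocal source at the bearing of the active target (assumed address)."""
    client.send_message("/source/vocals/position", [azimuth_deg, distance_m])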
Figure 3. We placed 16 cardboard tubes with 15° spacing at a 2 m distance from the participant. As in a real-world scenario you would turn around to precisely localize a source in the back of your head, we only tested the ability to detect whether a source is in the rear hemisphere, using three sources behind the listener.
STUDY
To evaluate the feasibility of our approach in terms of the user
experience as well as the performance, we compared Navi-
gaTone to the standard stereo-panning approach as used in
GpsTunes [21] or ONTRACK [12, 27]. This baseline is very
simple to implement, yet, with the interaural level difference
(ILD), it covers the most important cue for lateralization. In
the NAVIGATONE condition, the sources are arranged around
the listener's head using HRTF-based rendering, and only
the lead vocal track is moved to communicate the direction
for navigation. To differentiate between sources behind and
in front of the user, existing stereo-panning approaches
applied a low-pass filter to muffle the sound of sources in the
occipital hemisphere of the listener. For reasons of comparability,
we used the same rendering as in the NAVIGATONE condition
for the STEREO condition, and simulated panning by placing
all sources at the same position and moving them in parallel.
Our hypothesis was that people can distinguish the source
orientation with the same accuracy and that the listening
experience is more pleasant in the NAVIGATONE condition.
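For clarity, the direction rendered to the listener in both conditions is the bearing of the active source relative to the current head yaw. The sketch below assumes the tube numbering increases clockwise from source no. 1 straight ahead, with 15° spacing as in Figure 3; the function names are illustrative.

# Sketch of the azimuth update used in both conditions. In NAVIGATONE only the
# vocal stem is rendered at this azimuth; in STEREO all stems are.
def source_bearing(source_no: int) -> float:
    """World-frame bearing of a numbered cardboard tube, source 1 straight ahead
    (clockwise numbering assumed)."""
    return ((source_no - 1) * 15.0 + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)

def relative_azimuth(source_no: int, head_yaw_deg: float) -> float:
    """Bearing of the source relative to where the listener is currently facing."""
    return (source_bearing(source_no) - head_yaw_deg + 180.0) % 360.0 - 180.0

print(relative_azimuth(source_no=3, head_yaw_deg=0.0))   # 30.0
print(relative_azimuth(source_no=1, head_yaw_deg=90.0))  # -90.0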
We placed 16 numbered cardboard tubes with 15° spacing
at a 2 m distance from the listener (Figure 3). Participants were
standing at the center of the source circle looking at source
number 1. Before every trial, we muted the lead vocal track
and, in the STEREO condition, placed all sources at position
no. 1. We then moved the position of the lead vocal track
(NAVIGATONE) or all tracks (STEREO) (Figure 4) to one of
the 16 cardboard tubes, un-muted the lead vocal track, and
participants had to name the tube from which they perceived
the voice to come as quickly and accurately as possible. To
evaluate the risk of front-back confusions, we allowed the
participants to move their head and upper body, as they would
do in a real-world application, but not to move their feet.

Figure 4. The setup of the user study. (left) In the NAVIGATONE condition just the vocal track was shifted to source no. 3, while the other tracks were positioned in space to create an immersive user experience. (right) In the STEREO condition all tracks were shifted to appear at source no. 3.

Once
a source is perceived in the back, users would turn around until
they have the active source in their frontal hemisphere, which
is also the reason why we only use three sources in the back of
the listener (Figure 3). As the IMU in the Intelligent Headset
tends to drift after fast and large head-turns, we checked the
calibration after every trial and realigned the virtual sources if
necessary. After both conditions, participants had to fill out
an adapted version of the presence questionnaire [30]. We
reduced the questionnaire to items applicable to our system
and additionally asked how similar the listening experience
was compared to regular music listening. All answers had to
be given on a five-point Likert scale. Finally, participants were
asked for feedback in semi-structured interviews.
Results
In total, 16 participants (3 female, average age 27) completed
the experiment. The conditions were counterbalanced and
the sequence of sources was randomized using Latin squares.
None of the participants reported having a hearing disorder or
problems with spatial hearing, and three reported having pre-
vious experience with audio augmented reality systems.
According to our users, both conditions offered a pleasant
music listening experience. Nevertheless, NavigaTone out-
performed the baseline in some dimensions. According to a
Wilcoxon signed-rank test, the NavigaTone condition was rated
as significantly more intuitive and as introducing less mental work-
load. This was also backed up by participants' comments after
the test, saying that they found the NAVIGATONE condition to
be more natural. One user said "In the first condition (stereo),
it was easier, because the sound is just there (pointing at one
of the cardboard tubes), but it is also quite narrow. I preferred
the second condition (NavigaTone) because the sound is all
around you.” Interestingly, the question “How natural did your
interactions with the environment seem?” received the same
rating for both conditions (Mdn=2, IQR=1.75). However, the
rating for “How similar was the music listening experience
compared to regular music listening?” was slightly better
in the NAVIGATONE condition (Mdn=1, IQR=3) than in the
STEREO condition (Mdn=2, IQR=3), but the difference was
not statistically significant.
After they completed the experiment, we asked participants for
feedback on the listening experience. Nearly all participants
mentioned that they found it easier to localize the sources in the
STEREO condition, but that they found the listening experience
to be more enjoyable in the NAVIGATONE condition. The
STEREO condition has the advantage of providing a very strong
cue, whereas NavigaTone aims to provide navigation cues that
do not impact the music listening experience and are very
unobtrusive.
Those findings were also reflected in the variables that com-
pared the performance of both approaches. The average task
completion time was 13.3 s (SD=7.6) in the NAVIGATONE
condition and 10.97 s (SD=4.9) in the STEREO condition. A
repeated measures ANOVA on the log-transformed task com-
pletion times with user as random factor showed that users
were significantly faster in the STEREO condition than in the
NAVIGATONE condition (F(1,507) = 11.8, p = 0.0006). No
significant effect of source position on the task completion
time could be found. Again, as the baseline condition provided
a very prominent cue, this result was not surprising.
More interestingly, the recognition rate was slightly better in
the STEREO condition (M = 0.49, SD = 0.5) than in the
NAVIGATONE condition (M = 0.41, SD = 0.49), but overall rather
low and with a large spread. We calculated the offset between
the number of the source actually playing and the given an-
swer and found an average error of 0.83 (SD=0.96) in the
NAVIGATONE condition and 0.73 (SD=1.13) for the STEREO
condition. This shows that the answers were mostly only off
by one source or 15°, respectively, which is in the range of
human lateralization error [15]. Furthermore, localization per-
formance decreases in the presence of other, competing sound
sources [5], which is not the case in the STEREO condition.
In a pedestrian navigation scenario, it is rarely necessary to
differentiate between two paths at such angular resolution,
making both implementations well suitable in practice. If we
count the off-by-one answers as correct, then we achieve a
recognition rate of 86% (SD=35) for the NAVIGATONE and
90% (SD=30) for the STEREO condition. Interestingly, we
observed two cases of front-back confusion in the STEREO
condition, but none in the NAVIGATONE condition.
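The error analysis described above can be reproduced in a few lines: the offset between the played and the named source (one source step = 15°), the strict recognition rate, and the rate when off-by-one answers count as correct. This is a sketch with made-up answer data; whether the original analysis wrapped offsets around the circle of 16 sources is an assumption.

# Sketch of the offset-based error analysis; the answer data is illustrative only.
import numpy as np

def analyze(played: np.ndarray, answered: np.ndarray) -> dict:
    d = np.abs(played - answered)
    offset = np.minimum(d, 16 - d)  # circular wrap across the 16 sources (assumed)
    return {
        "strict_rate": float(np.mean(offset == 0)),
        "mean_offset_sources": float(np.mean(offset)),  # 1 source = 15 degrees
        "off_by_one_rate": float(np.mean(offset <= 1)),
    }

played = np.array([3, 7, 12, 1, 16, 9])
answered = np.array([3, 8, 12, 2, 16, 9])
print(analyze(played, answered))
# {'strict_rate': 0.667, 'mean_offset_sources': 0.333, 'off_by_one_rate': 1.0}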
DISCUSSION
Overall, the results of our controlled experiment confirm that
NavigaTone can feasibly provide navigational cues while
listening to music by shifting just a single track. While
the listening experience was rated better, the performance was
comparable to the baseline.
Coming from a lab experiment, our results do not account
for situations with higher cognitive load as they would be
encountered in a real-world navigation scenario. Participants
could fully concentrate on determining the origin of the sound,
without having to ensure their personal safety by paying at-
tention to traffic lights, pedestrians, or other obstacles. This
more complex setting might influence the perception of our
cues [24], which is why we plan to run further studies with the
presented system as pedestrian navigation system under more
realistic conditions.
Again, as the baseline provides a very strong cue, it is not
surprising that NavigaTone did not outperform it. We believe
that, in the future, we can further tweak the NavigaTone ap-
proach to even outperform the baseline, e.g., by using two
tracks that span a navigation vector (one source moves in front
of the user, the other one moves in the back) or by finding
the sweet spot between both approaches. As participants are
familiar with listening to stereo music, the perception of stand-
ing in the center of a band and being able to move within
(cf. Figure 4) is actually quite different, which could be the
reason for the small differences in the ratings.
While multi-track recordings offer great potential, it is still
uncommon to release all separate tracks of a song to the public
because this would reveal the very core of a music produc-
tion. As a compromise, Native Instruments’ STEMS format
includes four distinct tracks for specific parts of a recording [7].
Initially designed to give DJs more creative freedom in mixing
two or more songs together, it can also serve as potential source
material for NavigaTone. As the file format specifications
indicate into which tracks specific instrument or sound groups
should be mixed [7], we can pick one with transient sounds.
In both our survey and experiment, we used a vocal pop song
as a starting point for our investigation. Other musical genres
might have different prerequisites. While the noise beacon
somehow fits into the pop-song because of its similarity to
a snare drum or hi-hat, it blends less nicely into a piece of
classical music. In the future, we intend to further refine the
choice of beacon sounds to other musical genres, based on
available stems in the recording and mixing qualities with the
underlying track.
Participants of the survey also mentioned that they would pre-
fer turn-by-turn style navigation. This could be implemented
using a dynamic volume for the navigation cue, dimming it
in between waypoints and emphasizing it near changes in di-
rection. For the overlay cues, this reduces their impact on the
listening experience, while it might also improve the results of
the snare drum cue by making it more present and thus easier
to detect when a change in direction is imminent.
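A simple way to realize this dynamic cue volume is a gain that stays low between waypoints and ramps up as the next change in direction approaches. The thresholds and the linear ramp below are illustrative assumptions, not values from our study.

# Sketch of a distance-dependent gain for the orientation cue; thresholds are assumed.
def cue_gain(distance_to_turn_m: float,
             near_m: float = 20.0, far_m: float = 100.0,
             min_gain: float = 0.2) -> float:
    """Return a gain in [min_gain, 1.0]: full volume near the turn, dimmed far away."""
    if distance_to_turn_m <= near_m:
        return 1.0
    if distance_to_turn_m >= far_m:
        return min_gain
    frac = (far_m - distance_to_turn_m) / (far_m - near_m)  # 0 at far_m, 1 at near_m
    return min_gain + frac * (1.0 - min_gain)

print(cue_gain(60.0))  # 0.6 -- roughly half volume well before the turn
print(cue_gain(10.0))  # 1.0 -- full volume close to the turn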
We observed that the Invensense MPU-9250 sensor in the
Intelligent Headset tends to drift after fast and large head turns.
While the fusion algorithms compensate after some time, this
can still lead to errors of 45° and more for a brief time. In our
experiment, we took great care in controlling the drift and the
offset, nevertheless, in a real-world environment the correct
alignment of virtual audio source and physical target cannot
be guaranteed. Other IMUs, through better sensor fusion
algorithms, compensate the drift of the gyroscope more
effectively and therefore show a lower tendency to drift in
the overall heading information. However, in practice this
behavior might not be of such importance as the paths to
choose from are usually well separated, e.g., in city-scale
navigation.
CONCLUSION & FUTURE WORK
In this note we presented NavigaTone, a new approach to in-
tegrate navigation cues into everyday mobile music listening.
Instead of blocking the auditory channel for the single purpose
of presenting an auditory beacon at the target location, we take
advantage of multitrack recordings to reduce the impact of the
navigation component on the listening experience. In combi-
nation with spatial audio rendering, we are able to indicate the
direction of the navigation target by moving, e.g., the voice
of the singer around the user’s head. In a lab study with 16
users, the results of this new approach were on par with the
much simpler stereo-panning approach, but participants found our
spatial display to be more natural and less cognitively demanding.
Although increased cognitive load under realistic circum-
stances might influence the perception, we believe that our ap-
proach can provide navigational cues in very different scenar-
ios from pedestrian navigation to navigation in virtual worlds.
As we were mostly interested in the ability to localize sources,
we performed the lab experiment using a short loop of a vocal
track that comes with the Klang:app we used. To be able to
work reliably with a multitude of songs, NavigaTone needs
to ensure that the orientation cue is audible at the waypoints,
i.e., the voice of the singer should be present at an intersection
where the user needs to perform a left turn. If we look at
the capabilities of modern DJ software that allows us to create
remixes on the fly, we can think of incorporating the naviga-
tion function even deeper into the playback mechanism. An
intelligent algorithm could generate a remix of the original
track adapted to the navigation task, with the samples used for
localization shifted slightly from their original timing to make
sure they are present when needed.
ACKNOWLEDGMENTS
We would like to thank the participants of our study for their
time and the people at KLANG:technologies for their feedback
and support. This work is part of the SeRGIo project and the
Volkswagen Foundation through a Lichtenberg Professorship.
SeRGIo is an icon project realized in collaboration with imec,
with project support from VLAIO (Flanders Innovation &
Entrepreneurship).
REFERENCES
1. Robert Albrecht, Riitta Väänänen, and Tapio Lokki. 2016.
Guided by Music: Pedestrian and Cyclist Navigation with
Route and Beacon Guidance. Pers. and Ubiqu. comp. 20,
1 (2016). DOI:
http://dx.doi.org/10.1007/s00779-016- 0906-z
2. Anupriya Ankolekar, Thomas Sandholm, and Louis Yu.
2013. Play it by ear: a case for serendipitous discovery of
places with musicons. In CHI ’13.DOI:
http://dx.doi.org/10.1145/2470654.2481411
3. Barry Arons. 1992. A Review of The Cocktail Party
Effect. Journal of the American Voice I/O Society 12
(1992), 35–50. DOI:http://dx.doi.org/10.1.1.30.7556
4. Jens Blauert. 1996. Spatial Hearing: Psychophysics of
Human Sound Localization (2 ed.). MIT Press.
5. Jonas Braasch and Klaus Hartung. 2002. Localization in
the Presence of a Distracter and Reverberation in the
Frontal Horizontal Plane. I. Psychoacoustical Data. Acta
Acustica united with Acustica 88, 6 (2002), 942–955.
http://www.ingentaconnect.com/content/dav/aaua/2002/
00000088/00000006/art00013
6. Douglas S Brungart, Brian D Simpson, and Alexander J
Kordik. 2005. The detectability of headtracker latency in
virtual audio displays. In ICAD ’05.
http://hdl.handle.net/1853/50185
7. Chad Carrier and Stewart Walker. 2015. STEM File
Specification. Technical Report.
8. Richard Etter and Marcus Specht. 2005. Melodious
walkabout: Implicit navigation with contextualized
personal audio contents. Pervasive ’05 Adjunct
Proceedings (2005).
9. Eve Hoggan, Andrew Crossan, Stephen A Brewster, and
Topi Kaaresoja. 2009. Audio or Tactile Feedback: Which
Modality when?. In CHI ’09.DOI:
http://dx.doi.org/10.1145/1518701.1519045
10. Simon Holland, David R Morse, and Henrik Gedenryd.
2002. AudioGPS: Spatial Audio Navigation with a
Minimal Attention Interface. Pers. and Ubiqu. comp. 6, 4
(2002). DOI:http://dx.doi.org/10.1007/s007790200025
11. Andreas Jansson, Eric J Humphrey, Nicola Montecchio,
Rachel Bittner, Aparna Kumar, and Tillman Weyde. 2017.
Singing voice separation with deep U-Net convolutional
networks. In ISMIR ’17. 323–332.
12. Matt Jones, Steve Jones, Gareth Bradley, Nigel Warren,
David Bainbridge, and Geoff Holmes. 2008. ONTRACK:
Dynamically Adapting Music Playback to Support
Navigation. Pers. and Ubiqu. comp. 12, 7 (2008). DOI:
http://dx.doi.org/10.1007/s00779-007- 0155-2
13. Doreen Kimura. 1964. Left-right differences in the
perception of melodies. Quarterly Journal of
Experimental Psychology 16, 4 (Dec. 1964), 355–358.
DOI:http://dx.doi.org/10.1080/17470216408416391
14. Robert W Lindeman, Haruo Noma, and Paulo Goncalves
de Barros. 2007. Hear-Through and Mic-Through
Augmented Reality: Using Bone Conduction to Display
Spatialized Audio. In ISMAR ’07.DOI:
http://dx.doi.org/10.1109/ISMAR.2007.4538843
15. James C Makous and John C Middlebrooks. 1990.
Two-dimensional sound localization by human listeners.
J. Acoust. Soc. Am. 87, 5 (1990). DOI:
http://dx.doi.org/10.1121/1.399186
16. Nicholas Mariette. 2009. Navigation Performance Effects
of Render Method and Head-Turn Latency in Mobile
Audio Augmented Reality. In ICAD ’09.DOI:
http://dx.doi.org/10.1007/978-3- 642-12439- 6_13
17. T May, S van de Par, and A Kohlrausch. 2013. Binaural
Localization and Detection of Speakers in Complex
Acoustic Scenes. In The Technology of Binaural
Listening.DOI:
http://dx.doi.org/10.1007/978-3- 642-37762- 4_15
18. David McGookin and Pablo Priego. 2009. Audio Bubbles:
Employing Non-speech Audio to Support Tourist
Wayfinding. In Haptic and Audio Interaction Design.
DOI:http://dx.doi.org/10.1007/978-3- 642-04076- 4_5
19. Felix Richter. 2014. Infographic: U.S. Teens Love Beats
Headphones. (May 2014). https://www.statista.com/
chart/2227/preferred-headphone- brands-among- us-teens/
20. Enrico Rukzio, Michael Müller, and Robert Hardy. 2009.
Design, Implementation and Evaluation of a Novel Public
Display for Pedestrian Navigation: The Rotating
Compass. In CHI ’09.DOI:
http://dx.doi.org/10.1145/1518701.1518722
21. Steven Strachan, Parisa Eslambolchilar, Roderick
Murray-Smith, Stephen Hughes, and Sile O’Modhrain.
2005. GpsTunes: Controlling Navigation via Audio
Feedback. In MobileHCI ’05.DOI:
http://dx.doi.org/10.1145/1085777.1085831
22. Miikka Tikander, Aki Harma, and Matti Karjalainen.
2003. Binaural positioning system for wearable
augmented reality audio. In IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics.
DOI:http://dx.doi.org/10.1109/ASPAA.2003.1285854
23. Tuyen V Tran, Tomasz Letowski, and Kim S Abouchacra.
2000. Evaluation of acoustic beacon characteristics for
navigation tasks. Ergonomics 43, 6 (2000). DOI:
http://dx.doi.org/10.1080/001401300404760
24. Yolanda Vazquez-Alvarez and Stephen A Brewster. 2011.
Eyes-free Multitasking: The Effect of Cognitive Load on
Mobile Spatial Audio Interfaces. In CHI ’11.DOI:
http://dx.doi.org/10.1145/1978942.1979258
25. Yolanda Vazquez-Alvarez, Ian Oakley, and Stephen A
Brewster. 2012. Auditory display design for exploration
in mobile audio-augmented reality. Pers. and Ubiqu.
comp. 16, 8 (2012). DOI:
http://dx.doi.org/10.1007/s00779-011- 0459-0
26. Bruce N Walker and Jeffrey Lindsay. 2005. Navigation
performance in a virtual environment with bonephones.
In ICAD ’05.http://hdl.handle.net/1853/50173
27. Nigel Warren, Matt Jones, Steve Jones, and David
Bainbridge. 2005. Navigation via Continuously Adapted
Music. In CHI EA ’05.DOI:
http://dx.doi.org/10.1145/1056808.1057038
28. Dirk Wenig, Johannes Schöning, Brent Hecht, and Rainer
Malaka. 2015. StripeMaps: Improving Map-based
Pedestrian Navigation for Smartwatches. In
MobileHCI ’15.DOI:
http://dx.doi.org/10.1145/2785830.2785862
29. Elizabeth M Wenzel, Marianne Arruda, Doris J Kistler,
and Frederic L Wightman. 1993. Localization using
nonindividualized head-related transfer functions. J.
Acoust. Soc. Am. 94, 1 (1993). DOI:
http://dx.doi.org/10.1121/1.407089
30. Bob G Witmer and Michael J Singer. 1998. Measuring
Presence in Virtual Environments: A Presence
Questionnaire. Presence: Teleoper. Virtual Environ. 7, 3
(1998).
DOI:http://dx.doi.org/10.1162/105474698565686
31. Shingo Yamano, Takamitsu Hamajo, Shunsuke Takahashi,
and Keita Higuchi. 2012. EyeSound: Single-modal
Mobile Navigation Using Directionally Annotated Music.
In Augmented Human ’12. ACM, New York, NY, USA.
DOI:http://dx.doi.org/10.1145/2160125.2160147
32. Matthijs Zwinderman, Tanya Zavialova, Daniel Tetteroo,
and Paul Lehouck. 2011. Oh music, where art thou?. In
MobileHCI ’11 EA.DOI:
http://dx.doi.org/10.1145/2037373.2037456
... Alongside spatial perception research, recent studies in audio perception have investigated methods to manipulate sound for specific perceptual experiences. One such study [38] leveraged humans' innate ability to localize sounds, embedding non-visual navigation cues into the music a user listens to. By moving a single voice, instrument, or instrument group within the stereo panorama, the system subtly alters the audio track, providing useful orientation cues with minimal impact on the listener's perception. ...
... For instance, Zhao et al. [123] carried out a study comparing the effectiveness of audio and visual feedback in helping those with low vision in wayfinding tasks. Other developments involve generating visualizations to assist cyclists in safely traversing uncontrolled intersections [65] and incorporating navigation cues into mobile music listening experiences [38]. ...
... Although the majority of research on wearable AR applications in the past five years has primarily focused on specific domains, such as the medical and industrial sectors, our analysis has observed an increase in consumer wearable AR applications. The expansion of wearable AR applications to broader industries and use cases (e.g., navigation [38,123], language learning [42], and office productivity [27,59,67]) demonstrates the potential of this technology to enhance various aspects of human life. It is important to note that we consider office work as a broad use case because it spans a wide range of industries, professions, and tasks. ...
Article
Wearable Augmented Reality (AR) has attracted considerable attention in recent years, as evidenced by the growing number of research publications and industry investments. With swift advancements and a multitude of interdisciplinary research areas within wearable AR, a comprehensive review is crucial for integrating the current state of the field. In this paper, we present a review of 389 research papers on wearable AR, published between 2018 and 2022 in three major venues: ISMAR, TVCG, and CHI. Drawing inspiration from previous works by Zhou et al. and Kim et al., which summarized AR research at ISMAR over the past two decades (1998–2017), we categorize the papers into different topics and identify prevailing trends. One notable finding is that wearable AR research is increasingly geared towards enabling broader consumer adoption. From our analysis, we highlight key observations related to potential future research areas essential for capitalizing on this trend and achieving widespread adoption. These include addressing challenges in Display, Tracking, Interaction, and Applications, and exploring emerging frontiers in Ethics, Accessibility, Avatar and Embodiment, and Intelligent Virtual Agents.
... With this background, researchers have Y. Yamazaki proposed navigation methods that indicate the direction to be travelled by modulating music, so as not to disturb music listening while navigating [2], [3], [4], [5]. These studies have shown that localization using music modulation can provide the same level of navigation capability as voice guidance. ...
... In particular, modulating the stereo balance of the music inevitably results in a larger volume difference between the left and right sides when the target is right beside you, affecting perception of the music [17]. To facilitate navigation using a more natural music-listening experience, Heller et al. [4], [5] proposed the method NavigaTone, which only modulated specific music tracks, such as vocals and drums. They initially evaluated the NavigaTone method by asking participants to identify sound sources randomly placed at resolutions of 15°in front of them and 45°behind them. ...
... 4) Result: All recorded data and the mean of each participant's mean error (deg) are shown in Fig 5. For each participant, the percentage of outcomes with an error margin below 30°was calculated (only *1, a manipulation error, was excluded) in order to compare the results with those presented in the work of Heller et al. [4]. The mean of the percentages for all participants was 89%; for the track A group it was 85% and for the track B group 92%. ...
Article
Full-text available
We propose a method that stimulates musical vibration (generated from and synchronized with musical signals), modulated by the direction and distance to the target, on both sides of a user's neck with Hapbeat, a necklace-type haptic device. We conducted three experiments to confirm that the proposed method can achieve both haptic navigation and enhance the music-listening experience. Experiment 1 consisted of conducting a questionnaire survey to examine the effect of stimulating musical vibrations. Experiment 2 evaluated the accuracy (deg) of users' ability to adjust their direction toward a target using the proposed method. Experiment 3 examined the ability of four different navigation methods by performing navigation tasks in a virtual environment. The results of the experiments showed that stimulating musical vibration enhanced the music-listening experience, and that the proposed method is able to provide sufficient information to guide the users: accuracy in identifying directions was about 20 ^{\circ } , participants reached the target in all navigation tasks, and in about 80% of all trials participants reached the target using the shortest route. Furthermore, the proposed method succeeded in conveying distance information, and Hapbeat can be combined with conventional navigation methods without interfering with music listening.
... However, such voice guidance can disturb music listening. With this background, researchers have proposed navigation methods that indicate the direction to be travelled by modulating music, so as not to disturb music listening while navigating [2], [3], [4], [5], [6]. These studies have shown that localization using music modulation can provide the same level of navigation capability as voice guidance. ...
... In particular, modulating the stereo balance of the music inevitably results in a larger volume difference between the left and right sides when the target is right beside you, affecting perception of the music [18]. To achieve navigation with a more natural music listening experience, Heller et al. [5], [6] proposed methods that modulated only specific music tracks, such as vocals and drums. Their method demonstrated a navigation performance comparable to modulating the stereo balance of all the music, and users rated the listening experience with their method similar to general music listening. ...
... • NavigaTone (NT): the condition that played music whose Vox track was localized by the direction to the target. The method is based on previous studies [5], [6] using the Resonance Audio framework [39] to perform spatial audio rendering. • NT&Hap: the condition that stimulated the unmodulated music vibration to the participant in addition to the NT. ...
Preprint
We propose a method that stimulates music vibration (generated from and synchronized with musical signals), modulated by the direction and distance to the target, on both sides of a user's neck with Hapbeat, a necklace-type haptic device. We conducted three experiments to confirm that the proposed method can achieve both haptic navigation and enhance the music listening experience. Experiment 1 consisted of conducting a questionnaire survey to examine the effect of stimulating music vibrations. Experiment 2 evaluated the accuracy (deg) of users' ability to adjust their direction toward a target using the proposed method. Experiment 3 examined the ability of four different navigation methods by performing navigation tasks in a virtual environment. The results of the experiments showed that stimulating music vibration enhanced the music listening experience, and that the proposed method is able to provide sufficient information to guide the users: accuracy in identifying directions was about 20\textdegree, participants reached the target in all navigation tasks, and in about 80\% of all trials participants reached the target using the shortest route. Furthermore, the proposed method succeeded in conveying distance information, and Hapbeat can be combined with conventional navigation methods without interfering with music listening.
... Liarokapis [82] Visual tracking Keyboard/mouse/touch screen HMD ··· OpenAL using generic HRTFs Stahl [62] GPS-inertial tracking Slider on GUI Mobile device Outdoor ··· Wakkary and Hatala [70] RFID-visual tracking 3D tangible interface Audio only ··· ··· Wilson et al. [146] GPS-inertial tracking 2D scrolling interface Audio only Outdoor ··· Zimmermann and Lorenz [58] RFID tracking Implicit Audio only Artifical reverberation ··· Heller et al. [72] UWB-inertial tracking Implicit Audio only ··· OpenAL using generic HRTFs Kern et al. [80] ··· Implicit PC display ··· ··· Blum et al. [91] GPS-inertial tracking 3D tangible interface Audio only Outdoor OpenAL using generic HRTFs Katz et al. [65] Visual-inertial tracking Implicit Audio only Outdoor ··· (continued on next page) [66] Visual-inertial tracking Implicit Audio only Pre-modeled room Generic HRTFs Vazquez-Alvarez et al. [92] GPS-inertial tracking 3D tangible interface Audio only Outdoor JAVA JSR-234 using generic HRTFs Blum et al. [51] Inertial tracking 3D tangible interface Audio only ··· PureData using generic HRTFs Langlotz et al. [71] GPS-visual tracking Touch screen Mobile device Outdoor Stereo sound panning de Borba Campos et al. [123] ··· ··· Audio only ··· Stereo sound panning Heller et al. [78] Retroreflective tracking Implicit Audio only Artificial reverberation OpenAL using generic HRTFs Blessenohl et al. [26] Visual tracking Implicit Audio only ··· Generic HRTFs Ruminski [31] Visual tracking ··· Mobile device ··· ··· Chatzidimitris et al. [59] GPS tracking Touch screen Mobile device Outdoor OpenAL using generic HRTFs Heller et al. [52] Inertial tracking Implicit Audio only ··· KLANG using generic HRTFs Russell et al. [73] UWB-inertial tracking Implicit Audio only Outdoor 3DCeption using generic HRTFs Heller and Schöning [63] GPS-inertial tracking Implicit Audio only ··· KLANG using generic HRTFs Kim et al. [86] ··· Touch screen HMD ··· ··· Lim et al. [85] ··· Touch screen Mobile device Outdoor ··· Schoop et al. [27] Visual tracking Implicit Audio only Outdoor Stereo sound panning Sikora et al. [64] GPS-inertial tracking [147] ··· Implicit Audio only ··· Generic HRTFs Sagayam et al. [87] Visual tracking Touch screen Mobile device Pre-modeled room Generic HRTFs Yang et al. [101] Visual tracking Implicit HMD ··· Generic HRTFs Chong and Alimardanov [169] ··· Implicit Audio only Outdoor Generic HRTFs Comunita et al. [67] Visual-inertial tracking Implicit Mobile device Pre-modeled room Generic HRTFs Guarese et al. [54] Inertial tracking Implicit HMD ··· Generic HRTFs Kaul et al. [28] Visual tracking Implicit Audio only ··· Generic HRTFs *AAR = audio augmented reality; HMD = head-mounted display; HRTF = head-related transfer function; IR = impulse response; RFID = radio frequency identification; UWB = ultra wideband. The fact that visual tracking is most popular could be largely due to the development of computer vision techniques. ...
... To achieve robust tracking in various environments, researchers have also explored hybrid techniques that fuse several kinds of sensors. Table 2 shows that 34% of the AAR systems employed hybrid pose tracking approaches, with the most popular combination being GPS and inertial sensors (53%; e.g., [61][62][63][64]). These implementations were typically designed for large spaces or outdoor environments, where GPS sensors can be used for localizing the user's position and the inertial sensors can be used to determine the user's head orientation. ...
... Of the reviewed AAR systems, 54% implemented binaural spatialization with generic HRTFs. Some of these systems used open-source or freely available spatial audio engines, such as OpenAL [29,30,82,72,91,78,59] and KLANG [52,63]. However, the auralization details of the spatial audio engines are not available, and some of these systems did not specify the auralization principles or the audio engines they used for binaural spatialization. ...
... A point on a 2-D plane can be represented as a pair of positional coordinates (x, y) along two orthogonal axes in the Cartesian system or in the form of an angle (bearing) and a distance from the origin (range) in the polar system ( , r). Existing applications have employed either the Cartesian [7,12,41] or the polar [1,5,17] system, but to the best of our knowledge not compared their guidance efficacy (in terms of completion time, trajectory length, cognitive load, etc.) side-by-side when controlling for other parameters. The goal of our present study was to experimentally assess AG guidance efficacy in a computer-based 2-D navigation task, specifically comparing (i) the concurrent and sequential sonification of dimensional coordinates, and (ii) the sonification of Cartesian and polar coordinates. ...
... GPSbased navigation) tend to be based on the polar coordinate system as it is readily suited to how humans locomote (by turning to face the desired direction and moving forward until the destination or next turn). Users usually have to 'follow the sound' representing the target location (such as beacon sounds [37,39], music [17,34], or noise [9]), whose azimuth position is encoded as the perceived location of the sound in the stereo field [9,37,39]. Complex routes are commonly broken down into simpler segments separated by waypoints where a direction change usually occurs [37,39,40]. ...
... Drivers can drive to their destination according to the directions indicated by the music. In 2018, Heller et al. [73] proposed NavigaTone, a system that utilizes multichannel recording and provides directional navigation by moving individual tracks in auditory space. The driver can position the sound source as if using stereo panning, while the listening experience is closer to that of regular music listening. ...
Preprint
Full-text available
Since 2021, the term "Metaverse" has been the most popular one, garnering a lot of interest. Because of its contained environment and built-in computing and networking capabilities, a modern car makes an intriguing location to host its own little metaverse. Additionally, the travellers don't have much to do to pass the time while traveling, making them ideal customers for immersive services. Vetaverse (Vehicular-Metaverse), which we define as the future continuum between vehicular industries and Metaverse, is envisioned as a blended immersive realm that scales up to cities and countries, as digital twins of the intelligent Transportation Systems, referred to as "TS-Metaverse", as well as customized XR services inside each Individual Vehicle, referred to as "IV-Metaverse". The two subcategories serve fundamentally different purposes, namely long-term interconnection, maintenance, monitoring, and management on scale for large transportation systems (TS), and personalized, private, and immersive infotainment services (IV). By outlining the framework of Vetaverse and examining important enabler technologies, we reveal this impending trend. Additionally, we examine unresolved issues and potential routes for future study while highlighting some intriguing Vetaverse services.
... Furthermore, spatial audio can be used for guidance. For example, the direction from which music is played [1,41], or individual instruments [44] can be used for guidance. To determine the orientation of the user in relation to the played spatial sounds, sensors in the headphones or the smartphone can be used as a virtual directional microphone [42,43]. ...
Article
Full-text available
Urban environments are often characterized by loud and annoying sounds. Noise-cancelling headphones can suppress negative influences and superimpose the acoustic environment with audio-augmented realities (AAR). So far, AAR exhibited limited interactivity, e. g., being influenced by the location of the listener. In this paper we explore the superimposition of synchronized, augmented footstep sounds in urban AAR environments with noise-cancelling headphones. In an online survey, participants rated different soundscapes and sound augmentations. This served as a basis for selecting and designing soundscapes and augmentations for a subsequent in-situ field study in an urban environment with 16 participants. We found that the synchronous footstep feedback of our application EnvironZen contributes to creating a relaxing and immersive soundscape. Furthermore, we found that slightly delaying footstep feedback can be used to slow down walking and that particular footstep sounds can serve as intuitive navigation cues.
Book
Full-text available
The field of spatial hearing has exploded in the decade or so since Jens Blauert's classic work on acoustics was first published in English. This revised edition adds a new chapter that describes developments in such areas as auditory virtual reality (an important field of application that is based mainly on the physics of spatial hearing), binaural technology (modeling speech enhancement by binaural hearing), and spatial sound-field mapping. The chapter also includes recent research on the precedence effect that provides clear experimental evidence that cognition plays a significant role in spatial hearing. The remaining four chapters in this comprehensive reference cover auditory research procedures and psychometric methods, spatial hearing with one sound source, spatial hearing with multiple sound sources and in enclosed spaces, and progress and trends from 1972 (the first German edition) to 1983 (the first English edition)—work that includes research on the physics of the external ear, and the application of signal processing theory to modeling the spatial hearing process. There is an extensive bibliography of more than 900 items.
Conference Paper
Full-text available
Map applications for smartwatches present new challenges in cartography, a domain in which large display sizes have significant advantages. In this paper, we introduce StripeMaps, a system that adapts the mobile web design technique of linearization for displaying maps on the small screens of smartwatches. Just as web designers simplify multiple column desktop websites into a single column for easier navigation on mobile devices, StripeMaps transforms any two-dimensional route map into a one-dimensional "stripe". Through a user study, we show that this simplification allows StripeMaps to outperform both traditional mobile map interfaces and turn-by-turn directions for pedestrian navigation using smartwatches. In addition to introducing StripeMaps, this paper also has a secondary contribution. It contains the first empirical comparison of different approaches for pedestrian smartwatch navigation and illuminates their pros and cons.
Article
Music listening and navigation are both common tasks for mobile device users. In this study, we integrated music listening with a navigation service, allowing users to follow the perceived direction of the music to reach their destination. This navigation interface provided users with two different guidance methods: route guidance and beacon guidance. The user experience of the navigation service was evaluated with pedestrians in a city center and with cyclists in a suburban area. The results show that spatialized music can be used to guide pedestrians and cyclists toward a destination without any prior training, offering a pleasant navigation experience. Both route and beacon guidance were deemed good alternatives, but the preference between them varied from person to person and depended on the situation. Beacon guidance was generally considered to be suitable for familiar surroundings, while route guidance was seen as a better alternative for areas that are unfamiliar or more difficult to navigate.
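The difference between the two guidance methods can be captured in one function: beacon guidance always points the spatialized music at the final destination, while route guidance points it at the next waypoint along the planned route. A toy sketch under the assumption of a local flat (east, north) coordinate grid; the 20 m waypoint radius and the naive waypoint selection are illustrative choices of my own:

```python
import math

def bearing(p_from, p_to):
    """Compass bearing in degrees from p_from to p_to on a local (east, north) grid."""
    dx, dy = p_to[0] - p_from[0], p_to[1] - p_from[1]
    return math.degrees(math.atan2(dx, dy)) % 360.0

def guidance_bearing(pos, route, mode="route"):
    """Beacon guidance targets the destination; route guidance targets the next waypoint."""
    if mode == "beacon":
        target = route[-1]
    else:
        # naive waypoint selection: first waypoint that is still more than 20 m away
        target = next((wp for wp in route if math.dist(pos, wp) > 20.0), route[-1])
    return bearing(pos, target)

route = [(0, 100), (100, 100), (100, 250)]             # waypoints, destination last
print(guidance_bearing((0, 0), route, mode="route"))   # 0.0 (next waypoint is due north)
print(guidance_bearing((0, 0), route, mode="beacon"))  # ~21.8 (destination, north-north-east)
```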
Article
A critical engineering parameter in the design of interactive virtual audio displays is the maximum amount of latency that can be tolerated between the movement of the listener's head and the corresponding change in the spatial audio signal presented to the listener's ears. In this study, subjects using a virtual audio display were asked to detect the difference between a control stimulus that had the lowest possible latency value for the display system (11.7 ms) and a test stimulus that had an artificially increased headtracker latency ranging from 36 to 203 ms. In a standard listening configuration with only a single virtual sound source, the results show that typical listeners are unable to reliably detect the presence of headtracker latencies smaller than 80 ms, and that even the best listeners are unable to detect changes smaller than 60 ms. However, the addition of a low-latency reference tone at the same location as the target signal decreases the minimum threshold for latency detection by about 25 ms. This result suggests that augmented reality systems may require headtracker latencies smaller than 30 ms to ensure the delays are undetectable to all users in all listening environments.
Chapter
The robust localization of speech sources is required for a wide range of applications, among them hearing aids and teleconferencing systems. This chapter focuses on binaural approaches to estimate the spatial position of multiple competing speakers in adverse acoustic scenarios by only exploiting the signals reaching both ears. A set of experiments is conducted to systematically evaluate the impact of reverberation and interfering noise on speaker-localization performance. In particular, the spatial distribution of the interfering noise has a considerable effect on speaker-localization performance, being most detrimental if the noise field contains strong directional components. In these conditions, interfering noise might be erroneously classified as a speaker position. This observation highlights the necessity to combine the localization stage with a decision about the underlying source type in order to enable a robust localization of speakers in noisy environments.
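Binaural localization methods ultimately reduce to comparing the two ear signals; the simplest cue is the interaural time difference (ITD). The sketch below is a deliberately toy, single-source, noise-free version of that idea (a brute-force lag search plus a spherical-head approximation), far simpler than the multi-speaker, noise-robust approaches the chapter discusses:

```python
import numpy as np

def estimate_itd(left, right, fs, max_lag=50):
    """Estimate the interaural time difference in seconds by a brute-force lag search.
    Positive ITD: the sound reaches the left ear first (source toward the left)."""
    n = min(len(left), len(right))
    def corr(lag):
        if lag >= 0:
            return float(np.dot(left[:n - lag], right[lag:n]))
        return float(np.dot(left[-lag:n], right[:n + lag]))
    best_lag = max(range(-max_lag, max_lag + 1), key=corr)
    return best_lag / fs

def itd_to_azimuth(itd, ear_distance=0.175, c=343.0):
    """Azimuth magnitude in degrees from the far-field approximation ITD = d/c * sin(azimuth)."""
    return float(np.degrees(np.arcsin(np.clip(itd * c / ear_distance, -1.0, 1.0))))

fs = 16000
sig = np.random.default_rng(1).standard_normal(fs)
left, right = sig[4:], sig[:-4]                        # left ear leads by 4 samples (~0.25 ms)
print(itd_to_azimuth(estimate_itd(left, right, fs)))   # roughly 29 degrees, toward the left
```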
Article
A critical parameter for the design of interactive virtual audio displays is the maximum acceptable amount of delay between the movement of the listener's head and the corresponding change in the spatialized signal presented to the listener's ears. Two studies that used a low-latency virtual audio display to evaluate the effects of headtracker latency on auditory localization are presented. The first study examined the effects of headtracker delay on the localization of broad-band sounds. The results show that latency values in excess of 73 ms result in increased localization errors for brief sounds and increased localization response times for continuous sound sources. The second study measured how well listeners could detect the presence of headtracker latency in a virtual sound. The results show that the best listeners can detect latency values of 60-70 ms for isolated sounds, and that their detection thresholds are 25 ms lower for sounds presented in conjunction with a low-latency reference tone. These results suggest that headtracker latency values lower than 60 ms are likely to be adequate for most virtual audio applications, and that delays of less than 30 ms are difficult to detect even in very demanding virtual auditory environments.
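These thresholds translate directly into a motion-to-sound latency budget for a head-tracked rendering pipeline. A back-of-the-envelope check, where every component value is purely illustrative and not taken from any of the cited systems:

```python
# End-to-end motion-to-sound latency budget (all component values are made up)
budget_ms = {
    "head_tracker_sampling": 10.0,   # e.g., a 100 Hz IMU adds up to one sample period
    "sensor_fusion_and_app": 5.0,
    "hrtf_rendering_block": 5.3,     # 256-sample block at 48 kHz
    "audio_output_buffer": 10.7,     # 512-sample buffer at 48 kHz
}
total_ms = sum(budget_ms.values())
print(f"motion-to-sound latency: {total_ms:.1f} ms")                       # 31.0 ms
print("under 30 ms (hard to detect even in demanding settings):", total_ms < 30.0)
print("under 60 ms (adequate for most virtual audio applications):", total_ms < 60.0)
```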
Conference Paper
Current location-based services (LBS) typically allow users to locate points of interest (POI) in their vicinity but can detract from the user's emotional experience of exploring a new location. In this paper, we examine how cues in the form of popular music (musicons) can emotionally engage users and enhance their experience of discovering nearby POIs serendipitously in unfamiliar places. The primary contribution of this paper is a field study, in which we evaluate the performance and emotional engagement of different types of audio-based cues for directing users' attention to specific POIs. Musicons and mixed-modality cues performed close to visual and speech cues, and significantly better than auditory icons, for POI identification while creating a much more pleasant and engaging user experience. We conclude that cues for POI discovery need not always be as explicit as the baseline visual cues. Indeed, the most challenging cues, auditory icons, led to a heightened sense of autonomy.
Article
Normal subjects were given two auditory tests, one consisting of spoken digits presented dichotically, the other of melodies presented dichotically. On the Digits test, the score for the right ear was higher than for the left (as previously established), and on the Melodies test the score for the left ear was higher than for the right. These findings were related to the different roles of the right and left hemispheres of the brain in verbal and nonverbal perception.
Article
In this paper, we propose a mobile navigation system that uses only auditory information, i.e., music, to guide the user. The sophistication of mobile devices has introduced the use of contextual information in mobile navigation, such as the location and the direction of motion of a pedestrian. Typically, such systems require a map on the screen of the mobile device to show the current position and the destination. However, this restricts the movements of the pedestrian, because users must hold the device to observe the screen. We have therefore implemented a mobile navigation system that guides the pedestrian in a non-restricting manner by adding direction information to music. Based on measurements of the directional resolution users can perceive, the phase of the musical sound is adjusted to guide the pedestrian. Using this system, we have verified the effectiveness of the proposed approach.
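The exact phase manipulation is not detailed in the abstract; the sketch below only illustrates the underlying principle, namely giving a mono music signal a bearing-dependent interaural time difference so that it appears to come from the guidance direction. Parameter values and names are my own assumptions.

```python
import numpy as np

def render_with_itd(mono, fs, azimuth_deg, ear_distance=0.175, c=343.0):
    """Render a mono signal to stereo by delaying the far ear so the sound
    appears to come from azimuth_deg (0 = straight ahead, positive = right)."""
    itd = ear_distance / c * np.sin(np.radians(azimuth_deg))   # far-field approximation
    shift = int(round(abs(itd) * fs))                          # whole-sample delay
    delayed = np.concatenate([np.zeros(shift), mono])[:len(mono)]
    if azimuth_deg >= 0:                 # source on the right: the left ear hears it later
        return np.column_stack([delayed, mono])
    return np.column_stack([mono, delayed])

fs = 44100
t = np.arange(fs) / fs
music = 0.2 * np.sin(2 * np.pi * 440.0 * t)              # stand-in for a music track
stereo = render_with_itd(music, fs, azimuth_deg=45.0)    # shape (fs, 2)
```

A real renderer would also apply interaural level differences and fractional-sample delays; this stripped-down version is only meant to show where the direction information enters the signal.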
Article
The ability to localize a sound in the free field emitted from one direction (noise burst, 200-ms duration, 20-ms cos²-ramps) in the presence of another sound emitted from a different direction (noise burst, 500-ms duration, 20-ms cos²-ramps) is measured in anechoic and reverberant virtual environments using individual head-related transfer functions (HRTFs). The target is presented from 13 directions in steps of 15° in the frontal-horizontal plane at different power ratios of target and distracter (T/D-ratio), measured before they are filtered with the HRTFs. The distracter is placed at 0°, 30° or 90° azimuth. When the T/D-ratio is set to 0 dB, the perceived directions of the target are significantly shifted away from its actual location, in the direction opposite to the distracter, for all distracter directions. This phenomenon is found for all stimuli presented in an anechoic and in a reverberant environment. With decreasing T/D-ratio, the listeners give similar responses for adjacent angles. At the lowest test condition (which was set individually to −12 dB or −15 dB), the listeners' answers can be grouped into the following general target positions: left, front, and right. In the reverberant condition, this effect is observed at T/D-ratios of −7 dB or −10 dB. Measurements of masked detection thresholds in a further experiment show that the listeners cannot detect the target when it is presented from the direction of the distracter (0° azimuth) at T/D-ratios below −5 dB. A 2A-4IFC discrimination experiment in the anechoic condition reveals that the listeners are unable to discriminate between target angles up to 45° apart at low T/D-ratios (≤ −12 dB).
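The T/D-ratio used throughout is simply the power ratio of the target and distracter signals, expressed in decibels and computed before HRTF filtering. A minimal sketch, with random noise standing in for the bursts of the experiment:

```python
import numpy as np

def td_ratio_db(target, distracter):
    """Target-to-distracter power ratio in dB (computed before HRTF filtering)."""
    p_t = np.mean(np.square(target))
    p_d = np.mean(np.square(distracter))
    return 10.0 * np.log10(p_t / p_d)

rng = np.random.default_rng(0)
target = 0.25 * rng.standard_normal(8000)    # stand-in for the 200-ms target burst
distracter = rng.standard_normal(20000)      # stand-in for the 500-ms distracter burst
print(f"T/D-ratio: {td_ratio_db(target, distracter):.1f} dB")   # about -12 dB
```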