JSAmbisonics: A Web Audio library for interactive spatial sound
processing on the web
ARCHONTIS POLITIS
Dept. of Signal Processing and Acoustics, Aalto University, Espoo, Finland
e-mail: archontis.politis@aalto.fi
DAVID POIRIER-QUINOT
IRCAM, Paris, France
e-mail: david.poirier-quinot@ircam.fr
September 23rd 2016.
Abstract
This paper introduces the JSAmbisonics library, a set of JavaScript modules based on the Web Audio
API for spatial sound processing. Deployed via Node.js, the library consists of a compact set of tools for
reproduction and manipulation of first- or higher-order recorded or simulated Ambisonic sound fields.
After a brief introduction to the fundamentals of Ambisonic processing, the main components (encoding,
rotation, beamforming, and binaural decoding) of the JSAmbisonics library are detailed. Each compo-
nent, or “node”, can be used on its own or combined with others to support various application scenarios,
discussed in Section 4. An additional library developed to support spherical harmonic transform oper-
ations is introduced in Section 3.2. Careful consideration has been given to the overall computational
efficiency of the JSAmbisonics library, particularly regarding spatial-encoding and decoding schemes,
optimized for real-time production and delivery of immersive web content.
1 Introduction
Emerging technological trends in the delivery of audiovisual content currently target increased immersion. After the increases in bandwidth and computational power that made delivery of high-quality audio and video content possible on devices such as smartphones, making this content immersive is considered a requirement for the next leap in user experience compared to the traditional modes of enjoying audiovisual content. Virtual and augmented reality technology has also resurfaced, targeting mobile platforms and seemingly closer to large-scale deployment. Spatial sound is a fundamental component of these immersive technologies.
Effective spatial sound tools for the creation of immersive content are well known from an audio engineering point of view: panning tools for loudspeakers, binaural filters for headphones, and reverberation and decorrelation for a sense of space. One approach to spatial scene description and generation is to define all individual sound sources and the environment along with their spatialization parameters, an approach termed object-based spatial audio. An alternative is to consider a scene-based description, in which the audio
signals describe a full sound scene. Such a represen-
tation has certain advantages over the object-based
approach, as long as the format is adequate to re-
produce the sound scene with high perceptual qual-
ity and there is no intention of re-mixing the scene
components at the client side. Such advantages are
lower transmission requirements, compared to the
high number of object channels, efficient implemen-
tation of scene effects, such as rotations, and direct
mixing with recorded sound scenes.
Ambisonics [1,2,3,4] is such a method with the
main advantage that it offers a canonical and hier-
archical representation of the spatial sound scene,
and it is computationally efficient. Ambisonics treats
synthetic and captured sound scenes in a common
framework, which makes them especially suitable for
spherical audio recording in conjunction with spheri-
cal video. Furthermore, it provides a suitable method
for rendering to headphones with a combination of
ambisonic theory and binaural filters, and suitable
tools for rotations and manipulations of the scene.
This paper presents an Ambisonics audio library
that utilizes the Web Audio API (WAA) [5] for inter-
active spatial sound processing on the web [6]. That
makes the library useful for spatial sound creation on
any modern browser that supports WAA. The library
is written in JavaScript (JS) and is easy to use and
incorporate into a web application. Special effort has been made to keep the library comprehensive and extensible. The library supports Higher-order Am-
bisonics (HOA) of arbitrary order and implements
most fundamental ambisonic processing blocks for
generating and reproducing a sound scene. These
operations, their implementation and potential ap-
plications are presented below.
2 Ambisonics background
2.1 Sound scene description in Am-
bisonics
Assuming that all sound sources are in the far field, a general sound scene can be described as a continuous distribution of plane waves with spatio-temporal amplitude $a(t, \gamma)$ for a plane wave incident from direction $\gamma = [\cos\phi\cos\theta, \sin\phi\cos\theta, \sin\theta]^T$, with $(\phi, \theta)$ being the azimuth and elevation angles, respectively.
By taking the spherical harmonic transform (SHT) of the amplitude density, we arrive at the ambisonic description of the sound scene, encoded into the SH coefficients of the amplitude density $a$, or equivalently, the ambisonic signals
$$\mathbf{a}(t) = \mathrm{SHT}\{a(t, \gamma)\} = \int_\gamma a(t, \gamma)\,\mathbf{y}(\gamma)\,\mathrm{d}\gamma, \quad (1)$$
where $\int_\gamma \mathrm{d}\gamma$ denotes integration over the surface of the unit sphere, and $\mathrm{d}\gamma = \cos\theta\,\mathrm{d}\theta\,\mathrm{d}\phi$ is the differential surface element. The basis vector $\mathbf{y}(\gamma)$ contains all SHs up to a specified maximum order $N$. For a SHT of order $N$, there are $M = (N+1)^2$ SHs and ambisonic signals. Following established HOA conventions, real SHs are used and defined as
$$Y_{nm}(\theta, \phi) = \sqrt{\frac{(2n+1)(n-|m|)!}{(n+|m|)!}}\,P_{n|m|}(\sin\theta)\,y_m(\phi), \quad (2)$$
with
$$y_m(\phi) = \begin{cases} \sqrt{2}\,\sin|m|\phi & m < 0, \\ 1 & m = 0, \\ \sqrt{2}\,\cos m\phi & m > 0, \end{cases} \quad (3)$$
and $P_{nm}$ the associated Legendre functions of degree $n$. The SHs are orthonormal with
$$\int_\gamma \mathbf{y}(\gamma)\,\mathbf{y}^T(\gamma)\,\mathrm{d}\gamma = 4\pi\mathbf{I}, \quad (4)$$
where $\mathbf{I}$ is the $M \times M$ identity matrix. Using this power normalization, the 0th-order ambisonic signal $a_{00}$ is equivalent to an omnidirectional signal at the origin.
The ordering of the SHs most commonly used across scientific fields and, consequently, of the ambisonic signals, is
$$[\mathbf{y}(\gamma)]_q = Y_{nm}(\gamma), \quad \text{with } q = 1, 2, \ldots, (N+1)^2 \text{ and } q = n^2 + n + m + 1. \quad (5)$$
From the index $q$ the mode numbers $(n, m)$ can be recovered as $n = \lfloor\sqrt{q-1}\rfloor$ and $m = q - n^2 - n - 1$. In HOA literature, the ordering of Eq. 5 is known as ACN ambisonic channel ordering, and the normalization of Eqs. 2 & 4 as N3D normalization.
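To make the index mapping of Eq. 5 concrete, the following is a minimal JavaScript sketch (plain JS, independent of the library) converting between the 1-based ACN index $q$ used here and the order/degree pair $(n, m)$:

```js
// ACN index mapping of Eq. 5 (1-based q, as in the text).
function acnIndex(n, m) {
  return n * n + n + m + 1;                 // q = n^2 + n + m + 1
}

function acnToOrderDegree(q) {
  const n = Math.floor(Math.sqrt(q - 1));   // n = floor(sqrt(q - 1))
  const m = q - n * n - n - 1;              // m = q - n^2 - n - 1
  return [n, m];
}
```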
2.2 Ambisonic encoding
Encoding of a plane-wave source carrying a signal $s(t)$, incident from $\gamma_0$, to ambisonic signals is given by
$$\mathbf{a}(t) = s(t)\,\mathbf{y}(\gamma_0), \quad (6)$$
so that multiple signals for $K$ sources can be encoded as
$$\mathbf{a}(t) = \sum_{k=1}^{K} s_k(t)\,\mathbf{y}(\gamma_k). \quad (7)$$
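As an illustration of Eqs. 6-7, the sketch below computes first-order ($N = 1$) ACN/N3D encoding gains under the conventions of Sec. 2.1 and applies them to a block of mono samples. This is a plain JavaScript illustration, not the library's monoEncoder implementation.

```js
// First-order (ACN/N3D) encoding gains y(γ0); azimuth/elevation in radians.
function foaEncodingGains(azi, ele) {
  const x = Math.cos(azi) * Math.cos(ele);
  const y = Math.sin(azi) * Math.cos(ele);
  const z = Math.sin(ele);
  // ACN order (W, Y, Z, X) with N3D normalization (sqrt(3) on the dipoles).
  return [1, Math.sqrt(3) * y, Math.sqrt(3) * z, Math.sqrt(3) * x];
}

// Encode one block of a mono signal into 4 ambisonic channels (Eq. 6);
// summing several encoded sources gives Eq. 7.
function encodeBlock(mono, azi, ele) {
  const g = foaEncodingGains(azi, ele);
  return g.map(gain => mono.map(sample => gain * sample));
}
```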
2.3 Ambisonic rotation
Rotation of the sound scene can be conveniently per-
formed in the SHD by applying a SH rotation matrix
to the ambisonic signals. More specifically, for a ro-
tation of the coordinate system given by the three
Euler angles α, β, γ , the signals of the rotated scene
are given by
arot
n(t) = Mrot
n(α, β, γ )an(t),with n= 1,2, ..., N
(8)
where an= [an(n), ..., ann]Tdenotes the ambisonic
signals of order n, and Mrot
nis an (n+ 1)2×(n+ 1)2
rotation matrix for the certain order. Semi-closed
form solutions for the rotation matrices exist only for
Interactive Audio Systems Symposium, September 23rd 2016, University of York, United Kingdom. 2
Ambisonic processing on the web Politis
complex SHs, and they are too computationally de-
manding to compute for real-time applications. How-
ever, fast recursive algorithms exist for rotation of
real SHs, that are efficient and suitable for ambisonic
processing. [7,8].
2.4 Ambisonic reflection
Reflection, or mirroring, of the sound scene along the
principal planes of yz (front-back), xz (left-right), or
xy (up-down) becomes a trivial operation in the SHD
due to symmetry properties of the SHs. As SHs are
either symmetric or antisymmetric with respect to
these planes, the ambisonic signals either remain the
same under reflection (for symmetric SHs) or are in-
verted (for antisymmetric SHs). Hence, reflection re-
duces to inverting the polarity of specific sets of am-
bisonic signals, depending on the reflection plane:
$$(m < 0 \wedge m \text{ even}) \cup (m \geq 0 \wedge m \text{ odd}): \quad yz \quad (9)$$
$$m < 0: \quad xz \quad (10)$$
$$(n+m) \text{ odd}: \quad xy. \quad (11)$$
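The polarity inversions of Eqs. 9-11 can be tabulated per ACN channel; the following plain JavaScript sketch (an illustration, not the library's sceneMirror block) returns the +1/-1 signs for a given order and reflection plane.

```js
// Reflection signs of Eqs. 9-11 for an ACN-ordered signal set of order N.
// plane is "yz" (front-back), "xz" (left-right) or "xy" (up-down).
function reflectionSigns(N, plane) {
  const signs = [];
  for (let q = 1; q <= (N + 1) * (N + 1); q++) {
    const n = Math.floor(Math.sqrt(q - 1));    // Eq. 5 inverse mapping
    const m = q - n * n - n - 1;
    let flip;
    if (plane === "yz") flip = (m < 0 && m % 2 === 0) || (m >= 0 && m % 2 !== 0);
    else if (plane === "xz") flip = m < 0;
    else flip = (n + m) % 2 !== 0;             // "xy"
    signs.push(flip ? -1 : 1);
  }
  return signs;
}
```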
2.5 Ambisonic beamforming
Beamforming in the SHD reduces to a weight-and-
sum operation of the SH signals. In ambisonic litera-
ture SH beamforming has been traditionally termed
a virtual microphone. In case the directional pattern of the virtual microphone is axisymmetric, which is usually the case of interest, the virtual microphone signal $x_{vm}(t)$ is given by
$$x_{vm}(t, \gamma_0) = \mathbf{w}^T(\gamma_0)\,\mathbf{a}(t), \quad (12)$$
where $\gamma_0$ is the orientation of the virtual microphone, and $\mathbf{w}(\gamma_0)$ the $(N+1)^2$ vector of beamforming weights. The weight vector follows the ordering of the SHs, and can be expressed as a pattern-dependent part and a rotation-dependent part as
$$[\mathbf{w}(\gamma_0)]_q = w_{nm} = c_n\,Y_{nm}(\gamma_0). \quad (13)$$
The $(N+1)$ coefficients $c_n$ are derived according to the desired properties of the virtual microphone; some patterns of interest are presented below.
2.6 Ambisonic decoding
2.6.1 Loudspeaker decoding
The ambisonic signals can be distributed to a play-
back setup through a decoding mixing matrix, a pro-
cess termed ambisonic decoding. Commonly, this
decoding matrix is frequency-independent, especially
in HOA. Its design can be performed according to
physical or psychoacoustical criteria. The signals
$\mathbf{x}_{ls} = [x_1, \ldots, x_L]$ for $L$ loudspeakers are then obtained by
$$\mathbf{x}_{ls}(t) = \mathbf{D}_{ls}\,\mathbf{a}(t), \quad (14)$$
where $\mathbf{D}_{ls}$ is the $L \times (N+1)^2$ decoding matrix.
Some straightforward designs for the decoding matrix are the following:
$$\text{Sampling}: \quad \mathbf{D}_{ls} = \frac{1}{L}\,\mathbf{Y}_L^T \quad (15)$$
$$\text{Mode matching}: \quad \mathbf{D}_{ls} = (\mathbf{Y}_L^T\mathbf{Y}_L + \beta^2\mathbf{I})^{-1}\,\mathbf{Y}_L^T \quad (16)$$
$$\text{ALLRAD}: \quad \mathbf{D}_{ls} = \frac{1}{N_{td}}\,\mathbf{G}_{td}\,\mathbf{Y}_{td}^T \quad (17)$$
where $\mathbf{Y}_L = [\mathbf{y}(\gamma_1), \ldots, \mathbf{y}(\gamma_L)]$ is the $(N+1)^2 \times L$ matrix of SHs at the loudspeaker directions. In the mode-matching approach, the least-squares solution is usually constrained with a regularization value $\beta$. In the ALLRAD method [4], $\mathbf{Y}_{td} = [\mathbf{y}(\gamma_1), \ldots, \mathbf{y}(\gamma_T)]$ is the matrix of SHs at the $N_{td}$ directions of a uniform spherical t-design [9], with $t \geq 2N+1$, while $\mathbf{G}_{td}$ is an $L \times N_{td}$ matrix of vector-base amplitude panning (VBAP) gains [10], with the t-design directions considered as virtual sources.
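As a worked illustration of Eqs. 14-15, the sketch below builds a first-order sampling decoder for a set of loudspeaker directions and applies it to one frame of ambisonic signals. It is plain JavaScript under the stated first-order assumption, not the library's decoder code.

```js
// First-order SH vector in ACN/N3D order, as in the encoding sketch above.
function foaShVector(azi, ele) {
  const x = Math.cos(azi) * Math.cos(ele);
  const y = Math.sin(azi) * Math.cos(ele);
  const z = Math.sin(ele);
  return [1, Math.sqrt(3) * y, Math.sqrt(3) * z, Math.sqrt(3) * x];
}

// D_ls = (1/L) * Y_L^T: one row of gains per loudspeaker (Eq. 15).
function samplingDecoder(speakerDirs) {
  const L = speakerDirs.length;
  return speakerDirs.map(([azi, ele]) => foaShVector(azi, ele).map(v => v / L));
}

// x_ls = D_ls * a for one sample frame of (N+1)^2 = 4 ambisonic signals (Eq. 14).
function decodeFrame(D, ambiFrame) {
  return D.map(row => row.reduce((sum, d, q) => sum + d * ambiFrame[q], 0));
}
```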
2.6.2 Binaural decoding
Ambisonics are suitable for headphone reproduc-
tion, by integrating head-related transfer functions
(HRTFs). As HRTFs are frequency-dependent, so
are the decoding matrices in this case. More specifi-
cally, the binaural signals $\mathbf{x}_{bin} = [x_L, x_R]^T$ are given by
$$\mathbf{x}_{bin}(f) = \mathbf{D}_{bin}(f)\,\mathbf{a}(f), \quad (18)$$
with $\mathbf{D}_{bin}$ being the $2 \times (N+1)^2$ decoding matrix. In the time domain, Eq. 18 translates to a sum of convolutions as
$$\mathbf{x}_{bin}(t) = \begin{bmatrix} \sum_{q=1}^{(N+1)^2} d_q^L(t) * a_q(t) \\[4pt] \sum_{q=1}^{(N+1)^2} d_q^R(t) * a_q(t) \end{bmatrix}, \quad (19)$$
where $(*)$ denotes convolution and $d_q^L(t) = \mathrm{IFT}\{[\mathbf{D}_{bin}]_{1,q}(f)\}$ is the filter derived from the inverse Fourier transform of the $q$-th entry of the decoding matrix for the left ear, and similarly for the right. Hence, in the general case $2 \times (N+1)^2$ convolutions are required for binaural decoding.
There are two ways to derive the decoding matrix
coefficients, or equivalently the filters. The direct ap-
proach takes advantage of Parseval's theorem for the SHT, which for a sound distribution $a(f, \gamma)$ and e.g. the left HRTF $h_L(f, \gamma)$ states that
$$x_L(f) = \int_\gamma a(f, \gamma)\,h_L(f, \gamma)\,\mathrm{d}\gamma = \mathrm{SHT}\{a(f, \gamma)\} \cdot \mathrm{SHT}\{h_L(f, \gamma)\} = \mathbf{h}_L^T(f)\,\mathbf{a}(f), \quad (20)$$
where $\mathbf{h}_L$ are the coefficients of the SHT applied to the HRTF. The above Eq. 20 states that the binaural signals are the result of the inner product between the ambisonic coefficients and the SH coefficients of the HRTFs. Hence the decoding matrix in this case is $\mathbf{D}_{bin}(f) = [\mathbf{h}_L(f), \mathbf{h}_R(f)]^T$. Expansion
of HRTFs into SH coefficients has been researched
extensively, mainly in the context of HRTF interpo-
lation [11,12,13].
The second way, and the one seen more often
in literature [14,15], is the virtual loudspeaker ap-
proach, in which plane wave signals are decoded with
a decoding matrix of preference Dvls, covering the
sphere adequately, and then consequently convolved
with the HRTFs for the decoding directions. The
number $K$ of decoding directions is selected to be high enough for the order of the available ambisonic signals, with $K > (N+1)^2$. Formulated in the frequency domain, the virtual loudspeaker approach becomes
$$\mathbf{x}_{bin}(f) = \mathbf{H}_{LR}(f)\,\mathbf{D}_{vls}\,\mathbf{a}(f) = \mathbf{D}_{bin}(f)\,\mathbf{a}(f), \quad (21)$$
where
$$\mathbf{H}_{LR}(f) = \begin{bmatrix} h_L(f, \gamma_1) & h_R(f, \gamma_1) \\ \vdots & \vdots \\ h_L(f, \gamma_K) & h_R(f, \gamma_K) \end{bmatrix}^T \quad (22)$$
is the matrix of HRTFs for the decoding directions. Note that the final ambisonic decoding matrix $\mathbf{D}_{bin} = \mathbf{H}_{LR}\mathbf{D}_{vls}$ is again of size $2 \times (N+1)^2$, no matter the number of decoding directions $K$.
If it is assumed that the left and right HRTFs
are antisymmetric with respect to the median plane
(termed here as xz-antisymmetry), e.g. when non-
personalized HRTFs are applied, then the signal captured at the right ear is the same as the left-ear signal would be if the sound scene distribution were mirrored with respect to the median plane. Such mirroring corresponds to
Eq. 10. In practice, that means that only the $(N+1)^2$ left-ear HRTF filters need to be applied to derive both ear signals. Any of the two methods presented above can be used for computing the filters. Assuming two intermediate signals $M(t)$ and $S(t)$ with
$$M(t) = \sum_{q\,|\,m \geq 0} d_q^L(t) * a_q(t), \qquad S(t) = \sum_{q\,|\,m < 0} d_q^L(t) * a_q(t), \quad (23)$$
the binaural signals can be derived simply by
$$\mathbf{x}_{bin}(t) = \begin{bmatrix} M(t) + S(t) \\ M(t) - S(t) \end{bmatrix}. \quad (24)$$
This formulation is of practical importance for real-
time applications since it reduces the required num-
ber of convolutions by half. This fact has been noted
in literature with the virtual loudspeaker approach,
assuming antisymmetric arrangements [15]. It can
also be seen, however, from a purely ambisonic per-
spective as shown above.
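The halving of convolutions implied by Eqs. 23-24 can be sketched as follows, assuming each ambisonic channel has already been convolved with its left-ear decoding filter. This is a plain JavaScript illustration over offline buffers, not the binDecoder implementation.

```js
// Symmetric-HRTF shortcut of Eqs. 23-24. filteredChans[q-1] holds the q-th
// ambisonic channel already convolved with the q-th left-ear filter d_q^L.
function symmetricBinaural(filteredChans, N, numSamples) {
  const xL = new Float32Array(numSamples);
  const xR = new Float32Array(numSamples);
  for (let q = 1; q <= (N + 1) * (N + 1); q++) {
    const n = Math.floor(Math.sqrt(q - 1));
    const m = q - n * n - n - 1;            // Eq. 5 inverse mapping
    const sign = m >= 0 ? 1 : -1;           // M-part adds, S-part subtracts for the right ear
    const chan = filteredChans[q - 1];
    for (let t = 0; t < numSamples; t++) {
      xL[t] += chan[t];                     // x_L = M + S
      xR[t] += sign * chan[t];              // x_R = M - S
    }
  }
  return [xL, xR];
}
```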
3 Implementation
3.1 Web Audio API
WAA contains all signal processing elements that per-
mit the realization of ambisonic processing. More
specifically, since they are all either frequency in-
dependent or frequency-dependent linear processes,
they can be realized with gain factors, convolutions
and summations on the ambisonic signals. In WAA
fundamental signal processing blocks are called Au-
dio Nodes. Three such audio nodes are used in the
implementation of all ambisonic processing blocks.
The first is the Gain Node, a simple signal multiplier
with user-controlled gain at runtime. The second is
a convolution block, the Convolver Node, which per-
forms linear convolution with user-specified FIR fil-
ters. This block is utilized for the convolutions in
the binaural decoding stage. Finally, the $(N+1)^2$ channels for a specified order are grouped into a single stream when sent from one ambisonic block to another, using the Channel Merger Node, and split again into the constituent channels using the Channel Splitter Node when received by an ambisonic block, to be processed.
Vector and matrix operations on the ambisonic
signals are realized with groups of gain nodes and by
summing appropriately the resulting channels. An
alternative to this can be the Audio Worker Node, in
which JS code is applied directly on the audio buffers.
However, the built-in gain nodes handle the fast up-
dating of values during runtime without artifacts, and
the benefit of an audio worker implementation is ex-
pected to be small if any. An implementation and
comparison of such an approach is planned as future
work.
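A matrix operation realized with groups of gain nodes, as described above, could look like the following sketch. It uses only standard Web Audio API nodes (ChannelSplitterNode, GainNode, ChannelMergerNode) and is an illustration rather than the library's internal code.

```js
// Apply a static K x K mixing matrix to a multichannel (ambisonic) stream
// using per-entry GainNodes, as outlined in the paragraph above.
function matrixNode(ctx, matrix) {
  const K = matrix.length;                     // e.g. (N+1)^2 channels
  const splitter = ctx.createChannelSplitter(K);
  const merger = ctx.createChannelMerger(K);
  const gains = [];
  for (let row = 0; row < K; row++) {
    for (let col = 0; col < K; col++) {
      const g = ctx.createGain();
      g.gain.value = matrix[row][col];         // matrix entry as channel gain
      splitter.connect(g, col);                // take input channel 'col'
      g.connect(merger, 0, row);               // sum into output channel 'row'
      gains.push(g);                           // keep refs to update at runtime
    }
  }
  return { input: splitter, output: merger, gains };
}

// Usage: someAmbisonicSource.connect(node.input); node.output.connect(ctx.destination);
```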
3.2 JS Spherical Harmonic Transform
library
Since there is no existing JS library for the spherical
harmonic operations involved in ambisonic process-
ing, a custom made one was created for this project
[16]. The library performs the following basic opera-
tions:
- Computation of all associated Legendre functions up to a maximum degree N, for sets of points, using fast recursive formulas [17] (see the sketch after this list).
- Computation of all real SHs up to a specified order N, for sets of directions.
- Computation of the forward SHT, using either a direct weighted-sum approach on the data points or a least-squares approach. The transform returns a vector of SH coefficients.
- Computation of the inverse SHT at an arbitrary direction, using the SH coefficients from the forward transform.
- Computation of rotation matrices in the SHD, using the fast recursive solution of [7] for real SHs.
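As an illustration of the recursive Legendre computation mentioned in the first item above, a minimal JavaScript sketch of the standard recurrences (without the Condon-Shortley phase, matching Eq. 2) is given below; it is independent of the actual API of the JS SHT library [16].

```js
// All associated Legendre values P[n][m] for 0 <= m <= n <= N at argument x
// (for Eq. 2, x = sin(theta)), via the standard recurrences.
function legendreAll(N, x) {
  const P = Array.from({ length: N + 1 }, () => new Array(N + 1).fill(0));
  P[0][0] = 1;
  const s = Math.sqrt(Math.max(0, 1 - x * x));
  // Diagonal: P_{m,m} = (2m-1)!! * (1-x^2)^(m/2)
  for (let m = 1; m <= N; m++) P[m][m] = P[m - 1][m - 1] * (2 * m - 1) * s;
  // First sub-diagonal: P_{m+1,m} = (2m+1) * x * P_{m,m}
  for (let m = 0; m < N; m++) P[m + 1][m] = (2 * m + 1) * x * P[m][m];
  // Upward recurrence: (n-m) P_{n,m} = (2n-1) x P_{n-1,m} - (n+m-1) P_{n-2,m}
  for (let m = 0; m <= N; m++)
    for (let n = m + 2; n <= N; n++)
      P[n][m] = ((2 * n - 1) * x * P[n - 1][m] - (n + m - 1) * P[n - 2][m]) / (n - m);
  return P;
}
```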
Applications of JSHT are not limited to web audio and ambisonics. Graphics and scientific applications that benefit from a spherical spectral representation can use it for demonstrations deployed on the web; spherical interpolation of directional data is one such example.
3.3 JS Ambisonics library
The WAA Ambisonics library implements a set of
audio processing blocks that realize most of the fun-
damental operations presented in Sec. 2. SH com-
putations are performed internally using the JS SHT
library described above. All ambisonic processing fol-
lows the ACN/N3D convention. However, a number
of blocks are provided for converting other channel
and normalization conventions to this specification.
All ambisonic blocks expose an in and an out node that can be used for WAA-style connection of audio blocks. Furthermore, they expose properties and methods that can be updated in real time, for interactive operation. For a detailed documentation of the
object properties the reader is referred to [6].
3.3.1 Encoding, Rotation & Mirroring
The monoEncoder object takes a monophonic sound stream and encodes it into an ambisonic stream of a user-specified order, at a user-specified direction, using Eq. 6. The source direction can be updated interactively at runtime.
The sceneRotator object takes an ambisonic stream of a certain order and returns the stream of the same order for a rotated sound scene. The scene rotation is given in the yaw-pitch-roll convention. To avoid redundant computations, the ambisonic signals of each order $n$ are multiplied only with the rotation matrix $\mathbf{M}_n^{\mathrm{rot}}$ of that order, as shown in Eq. 8. The sceneMirror object implements mirroring through the polarity inversions of Eqs. 9-11. Both rotation and mirroring can be updated interactively.
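A hypothetical usage sketch of such an encode, rotate, and binaural-decode chain follows. The block names (monoEncoder, sceneRotator, binDecoder) and the in/out nodes are those described in this paper; the package name, constructor signatures, and the property/method names used for interactive updates are assumptions that should be checked against the repository documentation [6].

```js
// Assumed package name and API; only the block names and in/out nodes are
// confirmed by the paper, the rest is a sketch to be checked against [6].
import * as ambisonics from "ambisonics";

const ctx = new AudioContext();
const order = 2;
const source = ctx.createOscillator();                    // any mono source node

const encoder = new ambisonics.monoEncoder(ctx, order);   // Eq. 6
const rotator = new ambisonics.sceneRotator(ctx, order);  // Eq. 8
const decoder = new ambisonics.binDecoder(ctx, order);    // Sec. 2.6.2

// WAA-style connections through the exposed in/out nodes.
source.connect(encoder.in);
encoder.out.connect(rotator.in);
rotator.out.connect(decoder.in);
decoder.out.connect(ctx.destination);
source.start();

// Interactive updates at runtime (assumed property and method names).
encoder.azim = 45; encoder.elev = 0; encoder.updateGains();
rotator.yaw = -30; rotator.updateRotMtx();
```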
3.3.2 Virtual Microphones
The virtualMic object implements an ambisonic
beam former of a user-specified type and orientation.
The block implements Eq. 12, with the following op-
tions controlling the type of a virtual microphone of order $N$ through the coefficients $c_n$ of Eq. 13:
$$\text{cardioid}: \quad c_n = \frac{N!\,N!}{(N+n+1)!\,(N-n)!} \quad (25)$$
$$\text{hypercardioid}: \quad c_n = \frac{1}{(N+1)^2} \quad (26)$$
$$\text{max-}r_E: \quad c_n = \frac{P_n(\cos\kappa_N)}{\sum_{n=0}^{N}(2n+1)\,P_n(\cos\kappa_N)} \quad (27)$$
with $\kappa_N = 2.407/(N+1.51)$ rad, as given in [4].
Higher-order cardioids are defined as a normal car-
dioid raised to the power of N. Higher-order hyper-
cardioids maximize the directivity factor for a given
order; in the spherical beamforming literature they are also known as regular or plane-wave decomposition beamformers.
The max-rE pattern originates from ambisonic lit-
erature and maximizes the acoustic intensity vector
in an isotropic diffuse field. Apart from the above,
higher-order supercardioids are also implemented up
to 4th order with the coefficients converted appropri-
ately from [18]. Supercardioids maximize the front-
to-back power ratio for a given order.
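The coefficients of Eqs. 25-27 can be computed directly; the following plain JavaScript sketch (an illustration, not the virtualMic block itself) returns the $(N+1)$ coefficients $c_n$ for the three pattern types.

```js
function factorial(k) { let f = 1; for (let i = 2; i <= k; i++) f *= i; return f; }

// Legendre polynomial P_n(x) via the standard three-term recurrence.
function legendreP(n, x) {
  let p0 = 1, p1 = x;
  if (n === 0) return p0;
  for (let k = 2; k <= n; k++) {
    const p2 = ((2 * k - 1) * x * p1 - (k - 1) * p0) / k;
    p0 = p1; p1 = p2;
  }
  return p1;
}

function patternCoeffs(N, type) {
  const c = [];
  if (type === "cardioid") {
    for (let n = 0; n <= N; n++)
      c.push((factorial(N) * factorial(N)) / (factorial(N + n + 1) * factorial(N - n))); // Eq. 25
  } else if (type === "hypercardioid") {
    for (let n = 0; n <= N; n++) c.push(1 / ((N + 1) * (N + 1)));                        // Eq. 26
  } else { // max-rE
    const x = Math.cos(2.407 / (N + 1.51));
    let norm = 0;
    for (let n = 0; n <= N; n++) norm += (2 * n + 1) * legendreP(n, x);
    for (let n = 0; n <= N; n++) c.push(legendreP(n, x) / norm);                         // Eq. 27
  }
  return c;
}
```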
3.3.3 Conversion between formats
All operations are internally performed using the
ACN/N3D specification. However, the vast major-
ity of recorded ambisonic material is first order, and
it follows the traditional B-format specification of
WXYZ channel ordering. Conversion from this spec-
ification to ACN/N3D can be expressed by the conversion matrix
$$\mathbf{x}_{ACN/N3D} = \begin{bmatrix} \sqrt{2} & 0 & 0 & 0 \\ 0 & 0 & \sqrt{3} & 0 \\ 0 & 0 & 0 & \sqrt{3} \\ 0 & \sqrt{3} & 0 & 0 \end{bmatrix} \mathbf{x}_{WXYZ}. \quad (28)$$
Regarding HOA, the first existing specification is
the Furse-Malham (FuMa) one [19], defined up to
third order. Conversion from WXYZ or FuMa
to ACN/N3D can be performed with the convert-
ers.bf2acn and converters.fuma2acn objects respec-
tively. Note that the first-order specification of FuMa
is the same as the traditional WXYZ one.
Recent HOA research and technology uses the
ACN ordering scheme as the standard. However,
in terms of SH normalization there are two popu-
lar schemes, the orthonormal N3D, which is used
throughout this library, and the Schmidt semi-
normalized one, known as SN3D in ambisonic litera-
ture. Conversion between the two is trivial and given
by
$$x_{nm}|_{SN3D} = x_{nm}|_{N3D} \,/\, \sqrt{2n+1} \quad (29)$$
$$x_{nm}|_{N3D} = \sqrt{2n+1}\; x_{nm}|_{SN3D}. \quad (30)$$
Conversion between the two specifications can be per-
formed with the blocks converters.n3d2sn3d and con-
verters.sn3d2n3d.
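Since the conversion of Eqs. 29-30 is just a per-channel gain that depends on the order $n$ of each ACN channel, it can be sketched in a few lines of plain JavaScript (an illustration, not the converters.* blocks themselves):

```js
// Per-channel N3D -> SN3D gains for ACN-ordered signals of order N (Eq. 29);
// take the reciprocal of each gain for the SN3D -> N3D direction (Eq. 30).
function n3dToSn3dGains(N) {
  const gains = [];
  for (let q = 1; q <= (N + 1) * (N + 1); q++) {
    const n = Math.floor(Math.sqrt(q - 1));    // order of ACN channel q (Eq. 5)
    gains.push(1 / Math.sqrt(2 * n + 1));
  }
  return gains;
}
```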
3.3.4 Acoustic Visualization
It is possible to extract information from the am-
bisonic signals about the directional distribution of
sound in the scene. One such approach is based on
the acoustic active intensity, expressing the net flow
of energy through the notional center of the sound
scene, and the diffuseness, expressing the portion of
energy that is not propagating due to either modal
or diffuse behavior. These parameters require only
the first-order ambisonic signals, which correspond to
acoustic pressure and velocity, see for example [20].
Examples of how diffuseness and intensity may be used for the visualization of sound sources in the scene can be found in the code examples [6]. Their broadband versions can be extracted using the intensityAnalyzer block, computed at each WAA processing block. More refined visualizations can be obtained if the intensity and diffuseness are computed in frequency bands, e.g. using the biquad filter structures of WAA.
3.4 Decoding filter generation and
SOFA integration
Binaural decoding is implemented in the binDecoder block, paired with the hoaLoader and hrirLoader blocks that handle the loading of user-defined binaural decoding filters.
Using the hoaLoader, users can choose both HRIR
set and decoding approach. An additional Matlab
script based on the Higher-Order-Ambisonics library
[21] is available for offline generation of HOA decod-
ing filters. Some decoding filters are already included
in the repository, based on LISTEN HRTF sets [22],
derived using the ALLRAD method of Eq. 17. Both
decoding approaches mentioned in Sec. 2.6.2 were
tested for derivation of decoding filters. The virtual
loudspeaker approach was found superior to the direct approach of Eq. 20 in terms of preserving timbre, as the latter suffered from severe high-frequency loss at
lower orders. Note that an approximate timbre cor-
rection can be applied to counteract this effect, as
proposed in [23].
The hrirLoader block, on the other hand, allows for on-the-fly loading of HRIR filters, internally converted to HOA decoding filters to be used by the binDecoder block. The hrirLoader implementation is based on the HrtfSet class of the binauralFIR library [24], featuring server-based HRIR loading and granting access to an extensive choice of HRTF sets without cluttering the library itself. At the time of writing, the hrirLoader relies on loading locally embedded JSON HRTF sets, awaiting the publication of the IRCAM OpenDAP SOFA server [25].
4 Applications
The library is relevant to any web application that
delivers or involves immersive content. Some exam-
ples of special interest are highlighted below:
- Reproduction of spherical audio and video for telepresence. In this scenario an ambisonic audio stream is delivered to the client along with a spherical video. The audio part is rendered binaurally at the target platform, including head-rotation information, giving a convincing sense of presence.
- Reproduction of audio-only or audiovisual compositions, with the sound part encoded into a few ambisonic channels using the provided tools and broadcast to multiple clients, with binaural rendering done independently on each one of them.
- Web VR/AR applications in which the audio components are updated in real time and encoded into ambisonic streams, avoiding costly binaural rendering of multiple sources and reverberation, while still performing rotation of the sound scene.
- Web video games with immersive spatial sound.
- Interactive visualization driven by spatial properties of the sound scenes, for extracting acoustic information or for artistic uses.
Some basic examples highlighting some of these
applications are included in the code repository [6].
References
[1] M. A. Gerzon, “Periphony: With-height sound
reproduction,” Journal of the Audio Engineering
Society, vol. 21, no. 1, pp. 2–10, 1973.
[2] S. Moreau, S. Bertet, and J. Daniel, “3D sound
field recording with higher order ambisonics –
objective measurements and validation of spher-
ical microphone,” in 120th Convention of the
AES, (Paris, France), 2006.
[3] M. A. Poletti, “Three-dimensional surround
sound systems based on spherical harmon-
ics,” Journal of the Audio Engineering Society,
vol. 53, no. 11, pp. 1004–1025, 2005.
[4] F. Zotter and M. Frank, “All-round ambisonic
panning and decoding,” Journal of the Audio
Engineering Society, vol. 60, no. 10, pp. 807–820,
2012.
[5] W3C, “Web Audio API,” December 2015. https://www.w3.org/TR/webaudio/.
[6] A. Politis and D. Poirier-Quinot, “JSAmbison-
ics: A Web Audio library for interactive spa-
tial sound processing on the web.” https://
github.com/polarch/JSAmbisonics.
[7] J. Ivanic and K. Ruedenberg, “Rotation matri-
ces for real spherical harmonics. direct determi-
nation by recursion,” The Journal of Physical
Chemistry, vol. 100, no. 15, pp. 6342–6347, 1996.
[8] M. A. Blanco, M. Flórez, and M. Bermejo,
“Evaluation of the rotation matrices in the basis
of real spherical harmonics,” Journal of Molec-
ular Structure: THEOCHEM, vol. 419, no. 1,
pp. 19–27, 1997.
[9] R. H. Hardin and N. J. Sloane, “McLaren’s im-
proved snub cube and other new spherical de-
signs in three dimensions,” Discrete & Compu-
tational Geometry, vol. 15, no. 4, pp. 429–441,
1996.
[10] V. Pulkki, “Virtual sound source positioning us-
ing vector base amplitude panning,” Journal of
the Audio Engineering Society, vol. 45, no. 6,
pp. 456–466, 1997.
[11] M. J. Evans, J. A. S. Angus, and A. I. Tew, “An-
alyzing head-related transfer function measure-
ments using surface spherical harmonics,” The
Journal of the Acoustical Society of America,
vol. 104, no. 4, pp. 2400–2411, 1998.
[12] D. N. Zotkin, R. Duraiswami, and N. A.
Gumerov, “Regularized HRTF fitting using
spherical harmonics,” in IEEE Workshop on
Applications of Signal Processing to Audio and
Acoustics (WASPAA), (New Paltz, NY, USA),
2009.
[13] G. D. Romigh, D. S. Brungart, R. M. Stern,
and B. D. Simpson, “Efficient real spherical
harmonic representation of head-related trans-
fer functions,” IEEE Journal of Selected Topics
in Signal Processing, vol. 9, no. 5, pp. 921–930,
2015.
[14] M. Noisternig, T. Musil, A. Sontacchi, and
R. Höldrich, “3D binaural sound reproduc-
tion using a virtual ambisonic approach,” in
IEEE Int. Symposium on Virtual Environments,
Human-Computer Interfaces and Measurement
Systems (VECIMS), (Lugano, Switzerland),
2003.
[15] B. Wiggins, I. Paterson-Stephens, and P. Schille-
beeckx, “The analysis of multi-channel sound re-
production algorithms using HRTF data,” in
19th Int. Conf. of the AES, 2001.
[16] A. Politis, “A JavaScript library
for the Spherical Harmonic Trans-
form.” https://github.com/polarch/
Spherical-Harmonic-Transform-JS.
[17] E. W. Weisstein, “Associated Legendre Polynomial.” http://mathworld.wolfram.com/
AssociatedLegendrePolynomial.html.
[18] G. W. Elko, “Differential microphone arrays,”
in Audio signal processing for next-generation
multimedia communication systems, pp. 11–65,
Springer, 2004.
[19] Blue Ripple Sound, “HOA Technical Notes – B-
format.” http://www.blueripplesound.com/
b-format.
[20] A. Politis, T. Pihlajamäki, and V. Pulkki, “Para-
metric spatial audio effects,” in Int. Conf. on
Digital Audio Effects (DAFx), (York, UK), 2012.
[21] A. Politis, “Higher Order Ambisonics li-
brary,” 2015. https://github.com/polarch/
Higher-Order-Ambisonics.
[22] O. Warusfel, “Listen HRTF database,” online,
IRCAM and AK, available: http://recherche.ircam.fr/equipes/salles/listen/index.html,
2003.
[23] J. Sheaffer, S. Villeval, and B. Rafaely, “Render-
ing binaural room impulse responses from spher-
ical microphone array recordings using timbre
correction,” in EAA Joint Symposium on Au-
ralization and Ambisonics, (Berlin, Germany),
2014.
[24] T. Carpentier, “Binaural synthesis with the
Web Audio API,” in 1st Web Audio Conference
(WAC), 2015.
[25] IRCAM, “IRCAM OpenDAP Server” (to be published soon).