Proceedings of the 23rd International Conference on Digital Audio Effects (DAFx-20), Vienna, Austria, September 8–12, 2020

PERCEPTUAL EVALUATION OF MITIGATION APPROACHES OF IMPAIRMENTS DUE TO SPATIAL UNDERSAMPLING IN BINAURAL RENDERING OF SPHERICAL MICROPHONE ARRAY DATA: DRY ACOUSTIC ENVIRONMENTS

Tim Lübeck, Johannes M. Arend, Christoph Pörschmann ∗

Institute of Communications Engineering

TH Köln - University of Applied Sciences,

50678 Cologne, Germany

tim.luebeck@th-koeln.de

Hannes Helmholz, Jens Ahrens †

Division of Applied Acoustics

Chalmers University of Technology

412 96 Gothenburg, Sweden

hannes.helmholz@chalmers.se

ABSTRACT

Employing a finite number of discrete microphones, instead of the continuous distribution that the theory assumes, reduces the physical accuracy of sound field representations captured by a spherical microphone array. A number of approaches have been proposed in the literature to mitigate the resulting perceptual impairment when the captured sound fields are reproduced binaurally. We recently presented a perceptual evaluation of a representative set of approaches in conjunction with reverberant acoustic environments. This paper presents a similar study but with acoustically dry environments with reverberation times of less than 0.25 s. We examined the Magnitude Least-Squares algorithm, the Bandwidth Extension Algorithm for Microphone Arrays, Spherical Head Filters, spherical harmonics Tapering, and Spatial Subsampling, all up to a spherical harmonics order of 7. Although dry environments violate some of the assumptions underlying some of the approaches, we can confirm the results of our previous study: most approaches achieve an improvement, and the magnitude of the improvement is comparable across approaches and acoustic environments.

1. INTRODUCTION

Spherical microphone arrays (SMAs) allow for capturing sound fields including spatial information. The captured sound fields can be rendered binaurally if the head-related transfer functions (HRTFs) are available on a sufficiently dense grid. Mathematically, this is performed by means of spherical harmonics (SH) expansion of the sound field and the HRTFs [1, 2]. Conceptually, it is equivalent to bringing the listener's head virtually into the sound field captured with the array. Rotation of the HRTFs relative to the sound field according to the instantaneous head orientation of the listener allows for dynamic presentation.

The physical accuracy that can be achieved with SMAs is limited, mainly due to the employment of a finite number of microphones as opposed to the continuous distribution that the theory assumes. This leads to spatial undersampling of the captured sound field, which 1) induces spatial aliasing and 2) limits the maximum obtainable SH order representation. The order of the SH representation directly corresponds to the spatial resolution of the captured sound field. Both phenomena can lead to audible artifacts. Another practical impairment is caused by self-noise of the microphones in the array. Studying this aspect is beyond the scope of the present paper. We refer the reader to [3, 4].

∗ This work was partly supported by ERDF (European Regional Development Fund).

† This work was partly supported by Facebook Reality Labs.

Copyright: © 2020 Tim Lübeck et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

In recent years, several approaches to mitigate such impairments in binaural rendering of undersampled SMA data have been proposed. We recently conducted a listening experiment to study the perceptual effects of the mitigation approaches [5]. The study employed the acoustic data of two rooms with a reverberation time of more than 1 s. In this contribution we present the results of a similar study, whereby the employed acoustic environments exhibit shorter reverberation times of less than 0.25 s.

2. SPATIAL UNDERSAMPLING

To outline the phenomenon of spatial undersampling, we briefly summarize the fundamental concept of binaural rendering of SMA data. For a more detailed explanation please refer to [2, 6]. The sound pressure $S(r, \phi, \theta, \omega)$ captured by the microphones on the array surface $\Omega$ is represented in the SH domain using the spherical Fourier transform (SFT)

$$S_{nm}(r, \omega) = \int_{\Omega} S(r, \phi, \theta, \omega)\, Y_n^m(\theta, \phi)^{*}\, \mathrm{d}A_{\Omega}, \qquad (1)$$

whereby $r$ denotes the array radius, $\phi$ and $\theta$ the azimuth and colatitude of a point on the array surface, and $\omega = 2\pi f$ the angular frequency. $Y_n^m(\theta, \phi)$ denotes the orthogonal SH basis functions for certain orders $n$ and modes $m$, and $(\cdot)^{*}$ the complex conjugate.

Based on knowledge of the sound field SH coefficients $S_{nm}$, the sound field on the array surface can be decomposed into a continuum of plane waves impinging from all possible directions

$$D(\phi, \theta, \omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} d_n\, S_{nm}(r, \omega)\, Y_n^m(\theta, \phi), \qquad (2)$$

with a set of radial filters $d_n$. Note that $S(r, \phi, \theta, \omega)$ and $D(\phi, \theta, \omega)$ do not necessarily represent the same sound fields. An SMA can incorporate a scattering body whose effect is contained in $S(r, \phi, \theta, \omega)$ but not in $D(\phi, \theta, \omega)$, where it is removed by the radial filters.

An HRTF $H(\phi, \theta, \omega)$ can be interpreted as the spatio-temporal transfer function of a plane wave to the listener's ears. The binaural signals $B(\omega)$ for the left or right ear due to the plane wave components $D(\phi, \theta, \omega)$ impinging on the listener's head can therefore be computed by weighting all HRTFs $H(\phi, \theta, \omega)$ with the plane wave coefficients $D(\phi, \theta, \omega)$ and integrating over all propagation directions

$$B(\omega) = \frac{1}{4\pi} \int_{\Omega} H(\phi, \theta, \omega)\, D(\phi, \theta, \omega)\, \mathrm{d}A_{\Omega}. \qquad (3)$$

Transforming the HRTFs into the SH domain as well and exploiting the orthogonality property of the SH basis functions makes it possible to resolve the integral and compute the binaural signals for either ear as [1]

$$B(\omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} d_n\, S_{nm}(r, \omega)\, H_{nm}(\omega). \qquad (4)$$

The exact formulation of Eq. (4) depends on the particular definition of the employed SH basis functions [7, p. 7].
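As a rough illustration of Eq. (4), the following sketch evaluates the order-limited double sum for a single frequency bin in plain Python. The function name, the dictionary layout keyed by (n, m), and the toy coefficient values are assumptions made for this example. They are not taken from sound_field_analysis-py, and the SH convention is assumed to match the simple product form of Eq. (4).

```python
def binaural_bin(radial_filters, sound_field_sh, hrtf_sh, max_order):
    """Evaluate Eq. (4) for one frequency bin, truncated at max_order.

    radial_filters: list of d_n for n = 0..max_order
    sound_field_sh, hrtf_sh: dicts mapping (n, m) to complex coefficients
    """
    total = 0j
    for n in range(max_order + 1):
        for m in range(-n, n + 1):
            total += radial_filters[n] * sound_field_sh[(n, m)] * hrtf_sh[(n, m)]
    return total


# Hypothetical toy coefficients for orders 0 and 1 (not measured data).
d = [1.0, 2.0]
s_nm = {(0, 0): 1.0, (1, -1): 1.0, (1, 0): 1.0, (1, 1): 1.0}
h_nm = {(0, 0): 1.0, (1, -1): 1j, (1, 0): 2.0, (1, 1): -1j}

b_order_1 = binaural_bin(d, s_nm, h_nm, max_order=1)
b_order_0 = binaural_bin(d, s_nm, h_nm, max_order=0)  # sum truncated at order 0
```

Lowering `max_order` simply drops the higher-order terms from the sum, which is the order truncation discussed in Sec. 2.2.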

So far, we have assumed a continuously and ideally sampled sound pressure distribution on the array surface. In this case, the computation of the ear signals is perfect, i.e., $B(\omega)$ in Eq. (4) are the signals that arise if the listener with HRTFs $H(\phi, \theta, \omega)$ is exposed to the sound field that the microphone array captures. Real-world SMAs employ only a finite number of discrete microphones. As a result, spatial aliasing and truncation of the SH order $n$ occur, which makes the ear signals that are computed by the processing pipeline differ from the true ones. This can significantly affect the perceptual quality of binaural reproduction, as shown by numerous studies [2, 8, 9, 10]. These impairments due to spatial undersampling are briefly discussed in the following.

2.1. Spatial Aliasing

Similar to time-frequency sampling, where frequency components above the Nyquist frequency are aliased to lower frequency regions, sampling the space with a limited number of sensors introduces spatial aliasing. Note that this applies to both the sampling of the sound field $S(\cdot)$ and the sampling of the HRTFs $H(\cdot)$. If aliasing occurs, higher spatial modes cannot be reliably resolved and leak into lower modes. Generally, higher modes are required for resolving high-frequency components with smaller wavelengths. Spatial aliasing therefore limits the upper bound of the time-frequency bandwidth that can be deduced reliably from the array signals. While theoretically present at all temporal frequencies $f$, spatial aliasing artifacts are considerable only above the temporal frequency [6]

$$f_A = \frac{N_{sg}\, c}{2\pi r}. \qquad (5)$$

Thereby, $c$ denotes the speed of sound and $N_{sg}$ the maximum resolvable SH order $n$ of the sampling scheme. The leakage of higher spatial modes into lower spatial modes results in an increase of the magnitudes at temporal frequencies above $f_A$. Although spatial aliasing primarily impairs spatial properties, it therefore also affects the time-frequency spectrum of the binaural signals.
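To give a feeling for the magnitudes in Eq. (5), the short sketch below evaluates $f_A$ for a few values of $N_{sg}$. The array radius of 8.75 cm is the one stated for the array in Fig. 2; the speed of sound of 343 m/s and the function names are assumptions of this example. The helper for the number of SH coefficients up to order $N$, $(N+1)^2$, is included only for orientation.

```python
import math


def aliasing_frequency(max_order, radius_m, speed_of_sound=343.0):
    """Spatial aliasing frequency f_A in Hz according to Eq. (5)."""
    return max_order * speed_of_sound / (2.0 * math.pi * radius_m)


def num_sh_coefficients(max_order):
    """Number of SH coefficients for all orders n = 0..max_order."""
    return (max_order + 1) ** 2


for order in (3, 5, 7):
    f_a = aliasing_frequency(order, radius_m=0.0875)
    print(f"N = {order}: f_A = {f_a:.0f} Hz, "
          f"{num_sh_coefficients(order)} SH coefficients")
```

For this radius, the aliasing frequency lies at a few kilohertz, well inside the audible range.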

2.2. Spherical Harmonic Truncation

Orthogonality of the SH basis functions $Y_n^m(\cdot)$ is given only up to the order $n = N_{sg}$ (Eq. (5)) due to the discrete sampling of the SMA surface. Spatial modes for $n > N_{sg}$ are spatially distorted and are ordinarily not computed. This order truncation results in a loss of spatial information. The sampling of the SMA is usually sparser than that of the HRTFs so that the SMA is the limiting factor.

The spatial order truncation also affects the time-frequency representation by discarding components with mostly high-frequency content. In addition, hard truncation of the SH coefficients at a certain order $n$ results in side-lobes in the plane wave spectrum in Eq. (2) [11], which can further impair the binaural signals.

3. MITIGATION APPROACHES

In recent years, a number of different approaches to improve binaural rendering of SMA captures have been presented in the literature. In the following, we summarize the approaches that we evaluated in the experiment presented in Sec. 6.

3.1. Pre-Processing of Head-Related Transfer Functions

Since, in practice, the SH order truncation of high-resolution HRTFs cannot be avoided, a promising approach to mitigate the truncation artifacts is to pre-process the HRTFs in such a way that the major energy is shifted to lower orders without notably decreasing the perceptual quality. Several approaches to achieve this have been introduced. A summary of a selection of pre-processing techniques is presented in [12]. In this paper, we investigate two concepts.

3.1.1. Spatial Subsampling

For the spatial subsampling method [2] (SubS), the HRTFs are transformed into the SH domain up to the highest SH order $N_{sg}$ that the sampling grid supports. Based on this representation, the HRTFs are spatially resampled with a reduced maximum SH order $N'_{sg}$ to the grid on which the sound field is sampled, which is usually coarser.

This process modifies the spatial aliasing in the signals in a favorable way [2]. Fig. 1 depicts the energy distribution of dummy head HRTFs [13] with respect to SH order (y-axis) and frequency (x-axis). The left-hand diagram illustrates the untreated HRTFs with a significant portion of energy at high SH orders. The middle diagram shows the same HRTF set being subsampled to a 5th-order Lebedev grid. Evidently, the information can be reliably obtained only up to the 5th order.
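The kind of energy distribution shown in Fig. 1 can be sketched as follows for a single frequency bin: the squared magnitudes of all modes $m$ of an order $n$ are summed and converted to dB. This is a minimal sketch; the summation without further normalization, the function name, and the toy coefficients are assumptions, and the exact computation behind Fig. 1 may differ.

```python
import math


def energy_per_order_db(sh_coefficients, max_order):
    """Return, for each order n, the energy summed over all modes m in dB.

    sh_coefficients: dict mapping (n, m) to a complex SH coefficient.
    """
    energies_db = []
    for n in range(max_order + 1):
        energy = sum(abs(sh_coefficients[(n, m)]) ** 2 for m in range(-n, n + 1))
        energies_db.append(10.0 * math.log10(energy) if energy > 0 else float("-inf"))
    return energies_db


# Hypothetical coefficients of one frequency bin (not measured HRTF data).
toy_hrtf_sh = {(0, 0): 1.0, (1, -1): 0.1, (1, 0): 0.1j, (1, 1): -0.1}
order_energy_db = energy_per_order_db(toy_hrtf_sh, max_order=1)
```

Repeating this for every frequency bin yields an order-versus-frequency map of the same type as the diagrams in Fig. 1.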

Figure 1: Energy distribution in dB with respect to order and frequency of the HRTFs of a Neumann KU100 dummy head. Untreated (left), subsampled (center), MagLS pre-processed (right).

3.1.2. Magnitude Least-Squares

Another HRTF pre-processing approach is the Magnitude Least-Squares (MagLS) algorithm [14], which is an improvement of the Time Alignment (TA) proposed by the same authors. Both approaches are based on the duplex theory [15]. At high frequencies, the interaural level differences (ILDs) become perceptually more relevant than the interaural time differences (ITDs). However, at high frequencies, the less relevant phase information constitutes a major part of the energy. Thus, removing the linear phase at high frequencies decreases the energy in high modes without losing relevant perceptual information. MagLS aims to find an optimum phase by solving a least-squares problem that minimizes the differences in magnitude to a reference HRTF set, resulting in minimal phase in favor of optimal ILDs. Fig. 1 (right) illustrates the energy distribution of MagLS pre-processed HRTFs for SH order 5. The major part of the energy is shifted to SH coefficients of orders below 5.

The major difference between the two HRTF pre-processing approaches is that subsampling results in an HRTF set defined for a reduced number of directions, which allows only for a limited SH representation. In contrast, MagLS does not change the HRTF sampling grid and thus, theoretically, allows expansion up to the original SH order.

3.2. Bandwidth Extension Algorithm for Microphone Arrays

Besides pre-processing of the HRTFs, there are algorithms that are applied to the sound field SH coefficients. The Bandwidth Extension Algorithm for Microphone Arrays (BEMA) [16, 2] synthesizes the SH coefficients at $f \geq f_A$ by extracting spatial and spectral information from components at $f < f_A$. The time-frequency spectral information is obtained by an additional omnidirectional microphone in the center of the microphone array (which is evidently not feasible in practice if a scattering object is employed). The BEMA coefficients can then be estimated as the combination of spatial and spectral information.

Fig. 2 depicts the magnitudes of plane wave components calculated for a broadband plane wave impinging from $\phi = 180^\circ$, $\theta = 90^\circ$ on a 50 sampling point Lebedev grid SMA with respect to azimuth angle (x-axis) and frequency (y-axis). The top diagram is based on untreated SH coefficients, the bottom diagram illustrates the effect of BEMA. For the example of a single plane wave, the sound field is perfectly reconstructed over the entire audible bandwidth.

3.3. Spherical Harmonic Tapering

SH order truncation induces side-lobes in the plane wave spectrum, which can be reduced by tapering high orders $n$ [11]. In other words, an order-dependent scaling factor is applied to all SH modes and coefficients of that order. Different windows have been discussed, and a cosine-shaped fade-out was found to be the optimal choice. Additionally, the authors recommend equalizing the binaural signals with the so-called Spherical Head Filter, as discussed in the subsequent section. The combination of SH tapering and Spherical Head Filters is referred to as Tap+SHF in the remainder.
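The effect of tapering can be illustrated with the order-limited plane wave spectrum of a single unit plane wave. By the SH addition theorem, it depends only on the angle $\Theta$ to the incidence direction: $\sum_{n=0}^{N} w_n \frac{2n+1}{4\pi} P_n(\cos\Theta)$, with Legendre polynomials $P_n$ and order weights $w_n$. The sketch below compares hard truncation ($w_n = 1$) with a half-cosine fade-out over the upper half of the orders. The particular window shape and length are assumptions chosen for illustration; they are not necessarily the window proposed in [11].

```python
import math


def cosine_taper(max_order):
    """Order weights: 1 for low orders, half-cosine fade-out for the upper half."""
    fade_len = (max_order + 1) // 2
    first_faded = max_order + 1 - fade_len
    weights = []
    for n in range(max_order + 1):
        if n < first_faded:
            weights.append(1.0)
        else:
            k = n - first_faded + 1
            weights.append(0.5 * (1.0 + math.cos(math.pi * k / (fade_len + 1))))
    return weights


def legendre(max_order, x):
    """Legendre polynomials P_0(x)..P_max_order(x) via the three-term recurrence."""
    values = [1.0, x]
    for n in range(1, max_order):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: max_order + 1]


def plane_wave_response(weights, angle_rad):
    """Weighted, order-limited plane wave spectrum at an angle from incidence."""
    p = legendre(len(weights) - 1, math.cos(angle_rad))
    return sum(w * (2 * n + 1) * p[n] for n, w in enumerate(weights)) / (4.0 * math.pi)


N = 7
rect = [1.0] * (N + 1)
tapered = cosine_taper(N)
for name, w in (("hard truncation", rect), ("tapered", tapered)):
    front = plane_wave_response(w, 0.0)
    rear = abs(plane_wave_response(w, math.pi))
    print(f"{name}: rear/front = {rear / front:.3f}")
```

With hard truncation at $N = 7$, the lobe opposite to the incidence direction has 8/64 = 0.125 of the main-lobe amplitude, whereas the tapered weights suppress it almost completely. As with any window, this comes at the cost of a wider main lobe.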

Figure 2: Plane wave magnitudes of a plane wave impact from $\phi = 180^\circ$, $\theta = 90^\circ$ on a 50 sampling point Lebedev grid SMA with a radius of 8.75 cm. The top diagram depicts the untreated magnitudes, the bottom diagram the plane wave calculated after BEMA processing.

Figure 3: Spherical Head Filter (SHF) for orders $N = (3, 5, 7)$.

3.4. Spectral Equalization

The modification of the time-frequency response due to spatial undersampling is a perceptually distinctive impairment, as shown e.g. in [10]. Therefore, a third category of mitigation approaches is global equalization of the binaural signals. Different approaches have been introduced in the literature to design such equalization filters. The Spherical Head Filter (SHF) [8] compensates for the low-pass behavior of SH order truncation. The authors disregarded spatial aliasing effects and proposed a filter based on the plane wave density function of a diffuse sound field. The resulting filters for different SH orders are depicted in Fig. 3. A similar approach to equalize this low-pass effect has been discussed in [17]. In the following, we investigate the SHFs.

4. EMPLOYED DATA

The stimuli in our study were created from measured array room impulse responses using the sound_field_analysis-py Python toolbox [18] and the impulse response data set from [19]. This data set contains both binaural room impulse responses (BRIRs) measured with a Neumann KU100 dummy head and array room impulse responses (ARIRs) captured on various Lebedev grids under identical conditions. This allows for a direct comparison of binaural auralization of SMA data to the ground truth dummy head data. The ARIR measurements were performed with the VariSphear device [20], which is a fully automated robotic measurement system that sequentially captures directional impulse responses on a spherical grid for emulating an SMA. To obtain impulse responses of a rigid sphere array, the Earthworks M30 microphone was flush-mounted in a wooden spherical scattering body (see [19, Fig. 12]). All measurements were performed in four different rooms at the WDR broadcast studios in Cologne, Germany. In this study we employ the measurement data of the rooms Control Room 1 (CR1) and Control Room 7 (CR7) (Fig. 4), which both have short reverberation times of less than 0.25 s. Recall that we conducted a similar study with the rooms Small Broadcast Studio (SBS) and Large Broadcast Studio (LBS) with approximate reverberation times of 1 s and 1.8 s in [5].

Figure 4: Control Room 1 (left) and Control Room 7 (right) (CR1, CR7) at the WDR broadcast studios, which were auralized in the listening experiment. Both rooms have reverberation times of less than 0.25 s (measured at 500 Hz and 1 kHz).

The Neumann KU100 HRIR set, measured on a 2702 sampling point Lebedev grid [13], is used to synthesize binaural signals $B(\omega)$ for a pure horizontal grid of head orientations with 1° resolution based on ARIRs according to Eq. (4). We denote this data "ARIR renderings" in the following. Likewise, the BRIRs of the dummy head are available for the same head orientations so that a direct comparison of both auralizations is possible.

In order to restrict the gain of the radial filters $d_n(\omega)$ in Eq. (4), we employ a soft-limiting approach [2, pp. 90-118]. Fig. 5 illustrates the influence of the soft-limiting for the left-ear binaural room transfer functions (BRTFs) resulting from a broadband plane wave impinging from $(\phi = 0^\circ, \theta = 90^\circ)$ on a simulated 2702 sampling point Lebedev SMA. The BRTFs were calculated up to the 35th order using the different radial filter limits 0, 10, 20, and 40 dB. It can be seen that a limit of 0 dB leads to a significant attenuation of the high-frequency components, but provides an advantageous signal-to-noise ratio in the resulting ear signals nevertheless [2, 4]. Although this is not required for the ideal rendering conditions in this study, we chose 0 dB soft-limiting for this contribution in order to produce comparable results to previous studies [2, 10].
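The idea of soft-limiting can be sketched with a generic arctangent limiter that leaves small gains nearly unchanged and lets large gains saturate smoothly at the chosen limit while preserving the phase. This is only an illustrative sketch under that assumption; the function name is hypothetical, and the exact limiter used in [2] and in sound_field_analysis-py may be defined differently.

```python
import math


def soft_limit(gain, limit_db):
    """Compress the magnitude of a complex filter gain towards a ceiling.

    Small magnitudes pass almost unchanged, large magnitudes approach
    10**(limit_db / 20). The phase of the gain is kept.
    """
    ceiling = 10.0 ** (limit_db / 20.0)
    magnitude = abs(gain)
    if magnitude == 0.0:
        return 0j
    limited = (2.0 * ceiling / math.pi) * math.atan(math.pi * magnitude / (2.0 * ceiling))
    return limited * gain / magnitude


# Hypothetical radial filter gains at one frequency (not computed from an array model).
small_gain = soft_limit(0.01 + 0j, limit_db=0.0)   # stays close to 0.01
large_gain = soft_limit(1000j, limit_db=0.0)       # saturates just below 1
```

A lower limit restricts the amplification of high orders more strongly, which is consistent with the high-frequency attenuation described above for the 0 dB setting.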

All mitigation algorithms were implemented with sound_field_analysis-py [18]. Only the MagLS HRIRs were pre-processed with MATLAB code provided by the authors of [14]. Every ARIR parameter set was processed with each of the mitigation algorithms MagLS, Tapering+SHF, SHF, and SubS (Spatial Subsampling), and an untreated (Raw) ARIR rendering was produced as well.

Previous studies showed that SH representations of an order of less than 8 exhibit audible undersampling artifacts, i.e., a clear perceptual difference to the reference dummy head data [10]. Since this work investigates the effectiveness of mitigation approaches for undersampled sound fields, we chose to focus on SH orders below 8 for the subsequent instrumental and perceptual evaluation. Significant beneficial effects of the mitigation approaches for higher orders are not expected.

Figure 5: Left ear magnitude responses of the frontal KU100 HRTF, and ARIR binaural renderings up to order 35 involving radial filters with different soft-limits. The ARIR renderings are based on a simulated broadband plane wave impinging on a virtual 2702 sampling point Lebedev SMA from $(\phi = 0^\circ, \theta = 90^\circ)$. The deviation from the magnitudes of the HRTF illustrates the influence of the soft limit. All magnitude responses are 1/3-octave-smoothed.

5. INSTRUMENTAL EVALUATION

In this section, we compare the mitigation approaches based on 3rd SH order array data of CR7, which has a reverberation time of about 0.25 s. We used ARIRs from a 50-point Lebedev grid. We calculated the BRIRs for 360 azimuth directions in the horizontal plane in steps of 1° and compared them to the measured ground truth dummy head BRIRs for the same head orientations.

Absolute spectral differences between dummy head and array BRIRs in dB are illustrated in Fig. 6. The top diagram depicts the deviations averaged over all 360 directions with respect to frequency (x-axis). The bottom diagram shows the differences averaged over 40 directions contralateral to the source position. It is evident that the spectral differences tend to be larger on this contralateral side.
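One plausible way to compute such a measure is sketched below: for each direction and frequency bin, the absolute level difference in dB between the reference and the array rendering is taken and then averaged over the directions. The order of operations (absolute value before averaging), the function name, and the toy magnitudes are assumptions of this sketch; any spectral smoothing that may have been applied for Fig. 6 is omitted.

```python
import math


def mean_abs_spectral_difference_db(reference_mags, rendered_mags):
    """Average |level difference| in dB over directions for each frequency bin.

    Both arguments are lists indexed as [direction][frequency_bin] and hold
    linear magnitudes greater than zero.
    """
    num_directions = len(reference_mags)
    num_bins = len(reference_mags[0])
    result = []
    for k in range(num_bins):
        total = 0.0
        for ref, ren in zip(reference_mags, rendered_mags):
            total += abs(20.0 * math.log10(ren[k] / ref[k]))
        result.append(total / num_directions)
    return result


# Hypothetical magnitudes for two directions and two frequency bins.
reference = [[1.0, 1.0], [1.0, 1.0]]
rendered = [[10.0, 1.0], [0.1, 1.0]]
difference_db = mean_abs_spectral_difference_db(reference, rendered)
```

Taking the absolute value before averaging prevents deviations of opposite sign in different directions from cancelling each other, as in the first bin of the toy data.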

The untreated (Raw) rendering indicated by the dashed line is clearly affected by undersampling artifacts above $f_A$. Around the contralateral side, these differences increase rapidly. Both HRTF pre-processing algorithms (SubS (gray) and MagLS (green)) significantly decrease the difference to the reference, whereby MagLS tends to produce the lowest deviations.

Figure 6: Absolute spectral differences of dummy head and SMA binaural signals in dB over frequency in Hz for Raw, BEMA, MagLS, SubS, SHF, and Taper, with $f_A$ marked. Top: averaged over 360 horizontal directions. Bottom: averaged over 40 directions around the contralateral side.

Although BEMA (blue) was shown to be effective for very simple sound fields like a single plane wave, it produces significantly larger deviations from the reference than Raw. As noted by the authors of BEMA [2], even for a simple sound field composed of three plane waves from different directions and arbitrary phase, BEMA introduces audible comb filtering artifacts. Additionally, the averaging of the SH coefficients from lower modes to extract the spatial information for higher modes leads to a perceivable low-pass effect, which produces the large differences towards higher frequencies.

The SHFs and Tapering perform comparably. Both methods apply global filtering to the binaural signals. The differences at the contralateral side are larger than for frontal directions.

6. PERCEPTUAL EVALUATION

Some of the approaches considered here have already been perceptually evaluated in listening experiments. Subsampling was shown to significantly improve the perceptual quality [2], although it provokes stronger spatial aliasing. Time Alignment, Subsampling, and SHFs were compared in [9]. The results showed that Time Alignment, which is a predecessor of MagLS, mostly yields better results than Subsampling. The SHFs were rated worst of the three tested methods, matching the instrumental results depicted in Fig. 6. This may be due to the fact that global equalization shifts the error in binaural time-frequency spectra to lateral directions. The perceptual evaluation of BEMA showed improvements when auralizing simulated sound fields with a limited number of sound sources [2]. However, for measured diffuse sound fields, BEMA introduces significant artifacts and is thus not a promising algorithm for real-world applications. To our knowledge, Tapering has not been evaluated perceptually in a formal manner.

6.1. Methods

6.1.1. Stimuli

The stimuli were calculated as described in Sec. 4 for the SH orders 3, 5, and 7 for 360 directions along the horizontal plane in steps of 1° for the rooms CR7 and CR1. The 3rd- and 5th-order renderings are based on impulse response measurements on the 50 sampling point Lebedev grid, while for order 7 the 86 sampling point Lebedev grid was used. Previous studies showed strong perceptual differences between ARIR and dummy head auralizations, in particular for lateral sound sources [9, 10]. Therefore, each ARIR rendering was generated for a virtual source in the front $(\phi = 0^\circ, \theta = 90^\circ)$ and at the side $(\phi = 90^\circ, \theta = 90^\circ)$. To support transparency, static stimuli for both tested sound source positions are publicly available¹. Anechoic drum recordings were used as the test signal, in particular because drums have a wide spectrum and strong transients, making them a critical test signal. Previous studies showed that certain aspects are only revealed with critical signals [2, 10].

6.1.2. Setup

The experiment was conducted in a quiet, acoustically damped audio laboratory at Chalmers University of Technology. The SoundScape Renderer (SSR) [21] in binaural room synthesis (BRS) mode was used for dynamic auralization. It convolves arbitrary input test signals with a pair of BRIRs corresponding to the instantaneous head orientation of the listener, which was tracked along the azimuth with a Polhemus Patriot tracker. The binaural renderings were presented to the participants using AKG K702 headphones with a Lake People G109 headphone amplifier at a playback level of about 66 dBA. The output signals of the SSR were routed to an Antelope Audio Orion 32 DA converter at 48 kHz sampling frequency and a buffer length of 512 samples. Equalization according to [19] was applied to the headphones and the dummy head. The entire rendering and performance of the listening experiment were done on an iMac Pro 1.1.

6.1.3. Paradigm and Procedure

The test design was based on the Multiple Stimulus with Hidden Reference and Anchor (MUSHRA) methodology proposed by the International Telecommunication Union (ITU) [22]. The participants were asked to compare the ARIR renderings to the dummy head reference in terms of overall perceived difference. The anchor consisted of diotic non-head-tracked BRIRs, low-pass filtered with a cutoff frequency of 3 kHz. Each trial, i.e., a MUSHRA page, comprised 8 stimuli to be rated by the subjects (BEMA, MagLS, SHF, Tapering+SHF, SubS, Raw, hidden reference (Ref), Anchor). The experiment was composed of 12 trials: 3 SH orders (3, 5, 7) × 2 nominal source positions (0°, 90°) × 2 rooms (CR1, CR7).
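The factorial structure of the experiment can be enumerated in a few lines. The variable names are chosen for this sketch only; the counts follow from the design described above.

```python
from itertools import product

orders = (3, 5, 7)
source_azimuths_deg = (0, 90)
rooms = ("CR1", "CR7")
algorithms = ("BEMA", "MagLS", "SHF", "Tapering+SHF", "SubS", "Raw")

# One MUSHRA page per combination of order, source position, and room.
trials = list(product(orders, source_azimuths_deg, rooms))

# Each page shows all algorithms plus the hidden reference and the anchor.
stimuli_per_trial = algorithms + ("Ref", "Anchor")

# Algorithm conditions that enter the statistical analysis.
rated_conditions = list(product(algorithms, trials))

print(len(trials), len(stimuli_per_trial), len(rated_conditions))
```

The 6 algorithms across 12 trials give the 72 conditions referred to in the statistical analysis in Sec. 6.2.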

The subjects were provided with a graphical user interface (GUI) with continuous sliders ranging from 'No difference', 'Small difference', 'Moderate difference', 'Significant difference' to 'Huge difference' as depicted in Fig. 7.

Fourteen participants aged between 21 and 50 years took part in the experiment. Most of them were MSc students or staff at the Division of Applied Acoustics of Chalmers University of Technology. The subjects were sitting in front of a computer screen with a keyboard and a mouse. The drum signal was playing continuously, and it was possible to listen to each stimulus as often and as long as desired. The participants were allowed and strongly encouraged to move their heads during the presentation of the stimuli. At the beginning of each experiment, the subjects rated four training stimuli that covered the entire range of perceptual differences of the stimuli presented in the main part of the experiment. These training stimuli consisted of a BEMA and a MagLS rendering of CR1 data at order 3 for the lateral sound source position as well as the corresponding anchor and reference. The experiment took on average about 30 minutes per participant.

¹ http://doi.org/10.5281/zenodo.3931629

Figure 7: Employed graphical user interface of the listening experiment.

6.2. Results

As recommended by the ITU [22], we post-screened all reference and anchor ratings. Two participants rated the anchor higher than 30 (44, 36). We found no further inconsistencies so that we chose not to exclude these participants.

In the listening experiment, we presented only one order and one direction per trial. We therefore want to highlight that the direct comparison of the ratings for different orders and different source positions as well as the subsequent interpretation have to be performed with reservation. All stimuli were presented in randomized order, and the corresponding references and anchors were always the same for each condition, so that some amount of consistency in the subjects' responses may be assumed. We therefore present a statistical analysis in the following that includes comparisons between orders and positions as it is commonly performed with MUSHRA data.

Fig. 8 presents the interindividual ratings in the form of boxplots. The plots are divided by room and sound source position and present the ratings with respect to the algorithm (x-axis) and order as indicated by the color. Two major observations can be made: 1) Considering the ratings of the Raw conditions shows that higher-order renderings were mostly perceived as closer to the reference than lower-order renderings. 2) The algorithms MagLS, Tapering+SHF, SubS, and SHF all improve ARIR renderings compared to untreated renderings. This improvement seems to become weaker with increasing order.

For statistical analysis of the results, a repeated measures ANOVA was performed. We applied a Lilliefors test for normality to test the assumptions for the ANOVA. It failed to reject the null hypothesis in 4 of 72 conditions at a significance level of $p = 0.05$. However, parametric tests such as the ANOVA are generally robust to violations of the normality assumption [25]. For further analysis, Greenhouse-Geisser corrected p-values are considered, with the associated $\epsilon$-values for correction of the degrees of freedom of the F-distribution being reported.

A four-way repeated measures ANOVA with the within-subject factors algorithm (BEMA, MagLS, Tapering+SHF, SHF, SubS, and Raw), order (3, 5, 7), room (CR1, CR7), and nominal source position (0°, 90°) was performed. The associated mean values with respect to algorithm (x-axis) and SH order (color) are depicted in Fig. 9. Each value was calculated as the mean value of the ratings of all participants for both directions and both rooms. The 95 % within-subject confidence intervals were determined as proposed by [23, 24] based on the main effect of algorithm. Similar to the boxplots, the mean values indicate that all algorithms except BEMA yield considerable improvements.

The ANOVA revealed the significant main effects algorithm ($F(5, 65) = 143.64$, $p < .001$, $\eta_p^2 = .917$, $\epsilon = .457$) and order ($F(2, 26) = 37.382$, $p < .001$, $\eta_p^2 = .742$, $\epsilon = .773$). These significant effects match the observations made so far. Mostly, higher-order renderings yielded smaller perceptual differences than lower-order ones. Further, the algorithm significantly influences the perceptual character of ARIR renderings. The ANOVA revealed the significant interaction of algorithm × order ($F(10, 130) = 4.756$, $p < .001$, $\eta_p^2 = .268$, $\epsilon = .556$). Thus, the algorithms seem to perform differently with respect to the rendering order. The significant effect of the interaction of algorithm × source position ($F(5, 65) = 7.176$, $p < .001$, $\eta_p^2 = .356$, $\epsilon = .774$) shows that the performance of the algorithm also depends on the sound source position.

The ANOVA also revealed two significant interactions involving the factor room: the interaction of algorithm × room ($F(5, 65) = 2.864$, $p < .040$, $\eta_p^2 = .181$, $\epsilon = .695$) as well as order × room ($F(2, 26) = 4.736$, $p < .024$, $\eta_p^2 = .267$, $\epsilon = .853$) were found to be significant. The results of the listening experiment and the ANOVA values are available as well¹.
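The reported degrees of freedom can be cross-checked with a small calculation. In a fully within-subject design, an effect has the product of (levels − 1) over its factors as effect degrees of freedom, and that value times (participants − 1) as error degrees of freedom; the Greenhouse-Geisser correction multiplies both by $\epsilon$. The function names below are chosen for this sketch only.

```python
def repeated_measures_df(levels_per_factor, num_subjects):
    """Uncorrected (effect, error) degrees of freedom of a within-subject effect."""
    df_effect = 1
    for levels in levels_per_factor:
        df_effect *= levels - 1
    return df_effect, df_effect * (num_subjects - 1)


def greenhouse_geisser(df_effect, df_error, epsilon):
    """Degrees of freedom after Greenhouse-Geisser correction."""
    return epsilon * df_effect, epsilon * df_error


participants = 14
df_algorithm = repeated_measures_df([6], participants)               # algorithm
df_order = repeated_measures_df([3], participants)                   # order
df_algorithm_order = repeated_measures_df([6, 3], participants)      # algorithm x order
df_algorithm_corrected = greenhouse_geisser(*df_algorithm, epsilon=0.457)
```

With 14 participants, this reproduces the uncorrected values F(5, 65), F(2, 26), and F(10, 130) stated above.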

7. DISCUSSION AND CONCLUSIONS

We presented a perceptual evaluation of approaches for mitigating the perceptual impairment due to spatial aliasing and order truncation in binaural rendering of spherical microphone array data. The present results employing dry acoustic environments together with previous results on reverberant environments [5] suggest the following:

• Bandwidth Extension Algorithm for Microphone Arrays (BEMA) is the only method that leads to larger perceptual differences from the ground-truth signal than rendering without mitigation.

• Depending on the condition, all other mitigation approaches produce either no improvement or an improvement that is comparable in magnitude.

• Mitigation is more effective at lower orders and is hardly detectable at order 7.

• We did not find a dependency on the room, although some mitigation approaches are based on a diffuse-field assumption, which is better fulfilled in more reverberant rooms.

• In both experiments, Tapering+SHF was sometimes rated closer to the reference when rendered at order 5 than at order 7. This might be caused by the cosine-shaped windowing of the Tapering algorithm, which modifies higher rendering orders more than lower ones.
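
The cosine-shaped windowing mentioned in the last item can be illustrated with a minimal sketch. It assumes a half-cosine fade applied to the per-order weights of the highest SH orders. The function name `half_cosine_taper`, the parameter `fade_len`, and the choice of fading the upper half of the orders are hypothetical and not necessarily the window used in [11] or in this study.

```python
import math


def half_cosine_taper(max_order, fade_len):
    """Return one weight per SH order 0..max_order.

    Orders below the fade region keep a weight of 1; the last `fade_len`
    orders are attenuated along a half-cosine curve (assumed window shape).
    """
    if not 0 <= fade_len <= max_order + 1:
        raise ValueError("fade_len must lie between 0 and max_order + 1")
    weights = [1.0] * (max_order + 1)
    for k in range(1, fade_len + 1):
        # Raised cosine sampled strictly between 1 and 0
        weights[max_order - fade_len + k] = 0.5 * (
            1.0 + math.cos(math.pi * k / (fade_len + 1))
        )
    return weights


# Assumption for illustration: fade over the upper half of the orders
for max_order in (5, 7):
    fade_len = (max_order + 1) // 2
    print(max_order, [round(w, 3) for w in half_cosine_taper(max_order, fade_len)])
```

With this particular choice, an order-7 rendering has four attenuated orders while an order-5 rendering has three, which is one way a taper can alter a higher-order rendering more than a lower-order one.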


[Figure 8: four boxplot panels (CR1 0°, CR1 90°, CR7 0°, CR7 90°). x-axis: Anchor, BEMA, Raw, SHF, SubS, Tap.+SHF, MagLS, Ref. y-axis: rated perceptual difference (No, Small, Moderate, Significant, Huge).]

Figure 8: Interindividual variation in the ratings of perceptual difference between the stimulus and the dummy head reference with respect to the algorithm (x-axis) and SH order (color) for each room and virtual source position separately. Each box indicates the 25th and 75th percentiles, the median value (black line), the outliers (grey circles), and the minimum/maximum ratings not identified as outliers (black whiskers).

[Figure 9: mean ratings per algorithm (BEMA, Raw, SHF, SubS, T+SHF, MagLS) for SH orders N = 3, N = 5, and N = 7. y-axis: rated perceptual difference (No, Small, Moderate, Significant, Huge).]

Figure 9: Mean values of the ratings pooled over both rooms with respect to the algorithm. The 95 % within-subject confidence intervals were calculated according to [23, 24]. The ratings for different SH orders are displayed separately as indicated by the color.
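
As a rough illustration of how within-subject confidence intervals in the sense of [23] can be obtained, the sketch below computes the interval half-width from the subject×condition interaction mean square, $t_\mathrm{crit} \sqrt{MS_{S \times C} / n}$. The function name `within_subject_ci_halfwidth`, the toy ratings, and passing the critical t value as an argument are assumptions for illustration; this is not the authors' analysis script.

```python
import math


def within_subject_ci_halfwidth(ratings, t_crit):
    """Half-width of a within-subject confidence interval.

    ratings: list of rows, one row per subject, one value per condition.
    t_crit:  critical t value for (n_subjects - 1) * (n_conditions - 1)
             degrees of freedom, looked up by the caller.
    """
    n_subj = len(ratings)
    n_cond = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n_subj * n_cond)
    subj_means = [sum(row) / n_cond for row in ratings]
    cond_means = [sum(row[j] for row in ratings) / n_subj for j in range(n_cond)]
    # Sum of squares of the subject x condition interaction residuals
    ss_inter = sum(
        (ratings[i][j] - subj_means[i] - cond_means[j] + grand) ** 2
        for i in range(n_subj)
        for j in range(n_cond)
    )
    df_inter = (n_subj - 1) * (n_cond - 1)
    ms_inter = ss_inter / df_inter
    return t_crit * math.sqrt(ms_inter / n_subj)


# Hypothetical toy ratings: 3 subjects x 2 conditions (df = 2, t_crit = 4.303)
toy_ratings = [[0.2, 0.6], [0.3, 0.5], [0.1, 0.7]]
print(within_subject_ci_halfwidth(toy_ratings, t_crit=4.303))
```

Because subject means are removed before the variability is estimated, consistent offsets between listeners do not widen the interval.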

8. ACKNOWLEDGMENTS

We thank Christian Schörkhuber, Markus Zaunschirm, and Franz Zotter of IEM at the University of Music and Performing Arts in Graz for providing us with their code of MagLS, and all participants of the listening experiment for their support.

9. REFERENCES

[1] Boaz Rafaely and Amir Avni, “Interaural cross correlation in a sound field represented by spherical harmonics,” The Journal of the Acoustical Society of America, vol. 127, no. 2, pp. 823–828, 2010.

[2] Benjamin Bernschütz, Microphone Arrays and Sound Field Decomposition for Dynamic Binaural Recording, Ph.D. thesis, Technische Universität Berlin, 2016.

[3] Hannes Helmholz, Jens Ahrens, David Lou Alon, Sebastià V. Amengual Garí, and Ravish Mehra, “Evaluation of Sensor Self-Noise In Binaural Rendering of Spherical Microphone Array Signals,” in Proc. of the IEEE Intern. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020, pp. 161–165, IEEE.

[4] Hannes Helmholz, David Lou Alon, Sebastià V. Amengual Garí, and Jens Ahrens, “Instrumental Evaluation of Sensor Self-Noise in Binaural Rendering of Spherical Microphone Array Signals,” in Forum Acusticum, Lyon, France, 2020, pp. 1–8, EAA.

[5] Tim Lübeck, Hannes Helmholz, Johannes M. Arend, Christoph Pörschmann, and Jens Ahrens, “Perceptual Evaluation of Mitigation Approaches of Impairments due to Spatial Undersampling in Binaural Rendering of Spherical Microphone Array Data,” Journal of the Audio Engineering Society, pp. 1–12, 2020.


[6] Boaz Rafaely, Fundamentals of Spherical Array Processing, Springer Topics in Signal Processing, Springer, 2015.

[7] Carl Andersson, “Headphone Auralization of Acoustic Spaces Recorded with Spherical Microphone Arrays,” M.S. thesis, Chalmers University of Technology, 2017.

[8] Zamir Ben-Hur, Fabian Brinkmann, Jonathan Sheaffer, Stefan Weinzierl, and Boaz Rafaely, “Spectral equalization in binaural signals represented by order-truncated spherical harmonics,” The Journal of the Acoustical Society of America, vol. 141, no. 6, pp. 4087–4096, 2017.

[9] Markus Zaunschirm, Christian Schörkhuber, and Robert Höldrich, “Binaural rendering of Ambisonic signals by head-related impulse response time alignment and a diffuseness constraint,” The Journal of the Acoustical Society of America, vol. 143, no. 6, pp. 3616–3627, 2018.

[10] Jens Ahrens and Carl Andersson, “Perceptual evaluation of headphone auralization of rooms captured with spherical microphone arrays with respect to spaciousness and timbre,” The Journal of the Acoustical Society of America, vol. 145, no. 4, pp. 2783–2794, 2019.

[11] Christoph Hold, Hannes Gamper, Ville Pulkki, Nikunj Raghuvanshi, and Ivan J. Tashev, “Improving Binaural Ambisonics Decoding by Spherical Harmonics Domain Tapering and Coloration Compensation,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 261–265.

[12] Fabian Brinkmann and Stefan Weinzierl, “Comparison of head-related transfer functions pre-processing techniques for spherical harmonics decomposition,” in Proceedings of the AES Conference on Audio for Virtual and Augmented Reality, Redmond, USA, 2018, pp. 1–10.

[13] Benjamin Bernschütz, “A Spherical Far Field HRIR/HRTF Compilation of the Neumann KU 100,” in Proceedings of the 39th DAGA, Meran, Italy, 2013, pp. 592–595.

[14] Christian Schörkhuber, Markus Zaunschirm, and Robert Höldrich, “Binaural rendering of Ambisonic signals via magnitude least squares,” in Proceedings of 44th DAGA, Munich, Germany, 2018, pp. 339–342.

[15] Lord Rayleigh, “XII. On our perception of sound direction,” Philosophical Magazine Series 6, vol. 13, no. 74, pp. 214–232, 1907.

[16] Benjamin Bernschütz, “Bandwidth Extension for Microphone Arrays,” in Proceedings of the 133rd AES Convention, San Francisco, USA, 2012, pp. 1–10.

[17] Thomas McKenzie, Damian T. Murphy, and Gavin Kearney, “Diffuse-Field Equalisation of binaural ambisonic rendering,” Applied Sciences, vol. 8, no. 10, 2018.

[18] Christoph Hohnerlein and Jens Ahrens, “Spherical Microphone Array Processing in Python with the sound_field_analysis-py Toolbox,” in Proceedings of the 43rd DAGA, Kiel, Germany, 2017, pp. 1033–1036.

[19] Philipp Stade, Benjamin Bernschütz, and Maximilian Rühl, “A Spatial Audio Impulse Response Compilation Captured at the WDR Broadcast Studios,” in Proceedings of the 27th Tonmeistertagung - VDT International Convention, Cologne, Germany, 2012, pp. 551–567.

[20] Benjamin Bernschütz, Christoph Pörschmann, Sascha Spors, and Stefan Weinzierl, “Entwurf und Aufbau eines variablen sphärischen Mikrofonarrays für Forschungsanwendungen in Raumakustik und Virtual Audio,” in Proceedings of 36th DAGA, Berlin, Germany, 2010, pp. 717–718.

[21] Matthias Geier, Jens Ahrens, and Sascha Spors, “The soundscape renderer: A unified spatial audio reproduction framework for arbitrary rendering methods,” in Proceedings of the 124th AES Convention, Amsterdam, Netherlands, 2008, pp. 179–184, Code publicly available at "http://spatialaudio.net/ssr/".

[22] ITU-R BS.1534-3, “Method for the subjective assessment of intermediate quality level of audio systems,” 2015.

[23] Geoffrey R. Loftus, “Using confidence intervals in within-subject designs,” Psychonomic Bulletin & Review, vol. 1, no. 4, pp. 1–15, 1994.

[24] Jerzy Jarmasz and Justin G. Hollands, “Confidence Intervals in Repeated-Measures Designs: The Number of Observations Principle,” Canadian Journal of Experimental Psychology, vol. 63, no. 2, pp. 124–138, 2009.

[25] Jürgen Bortz and Christof Schuster, Statistik für Human- und Sozialwissenschaftler, Springer-Verlag, Gießen, Germany, 7th edition, 2010.
