
Perceptually informed interpolation and rendering of spatial room impulse responses for room transitions

Thomas McKENZIE1; Nils MEYER-KAHLEN1; Rapolas DAUGINTIS2; Leo McCORMACK1;

Sebastian J. SCHLECHT1,3; Ville PULKKI1

1Acoustics Lab, Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland

2Dyson School of Design Engineering, Imperial College London, London, UK

3Media Lab, Department of Art and Media, Aalto University, Espoo, Finland

ABSTRACT

The acoustics of coupled rooms is often more complex than that of single rooms, due to the increase in features such as double-slope decays, direct sound occlusion and anisotropic reverberation. For directional capture, analysis and reproduction of room acoustics, spatial room impulse responses (SRIRs) can be utilised, but measuring SRIRs at multiple positions is time consuming, and thus it may be desirable to interpolate between a sparse set of measurements. This paper presents a perceptually informed interpolation method for higher-order Ambisonic SRIRs that is robust for coupled rooms. It uses minimum-phase magnitude interpolation of the direct sound, which is steered to the relative direction of arrival, sector-based early reflection interpolation in the frequency domain, and relative RMS matching for late reverberation. A method for rendering up to six degrees-of-freedom datasets of SRIRs is then presented using a time-varying partitioned convolution audio plugin, which is open-source and has been made available for download. Finally, a listening test is conducted to assess the perceptual quality of interpolating between coupled room SRIR measurements with varying inter-measurement distance. The results suggest that for the tested scenario, using the presented interpolation method, a 50 cm inter-measurement distance is perceptually sufficient.

Keywords: Spatial room impulse response, six degrees-of-freedom, SRIR interpolation

1. INTRODUCTION

The human auditory system retrieves important spatial cues from the acoustics of a room. Several

characteristics of reverberation are dependent on the source and receiver positions in a room, such as

direct-to-reverberant ratio, early reﬂections and modal coupling, while reverberation time remains largely

constant. Coupled room acoustics is more complex, with the emergence of double-slope decays in the

room response, edge diffraction and portalling effects (1, 2), all of which vary with inter-room listener

and source positions and coupling aperture size. Portalling is used in this paper to refer to scattering and diffraction around the coupling aperture, which can give the perception that the sound source is located at the coupling aperture (3). When a listener moves in a simple shoebox room, acoustical changes tend to be

smooth and gradual. In the transition between coupled rooms, however, rapid changes in acoustics can

occur with small positional changes (4).

Recent literature has investigated how room acoustics measurements, known as room impulse responses (RIRs), can be used to evaluate the acoustical changes with different receiver positions inside a

single room, both for virtual reality (5, 6) and dereverberation applications (7). For 6DoF rendering of

sound scenes with Ambisonic spatial room impulse responses (SRIRs) at multiple positions in space, a

convolution plugin that can switch between SRIRs in real time is needed, followed by an auralisation

method such as binaural rendering (8, 9, 10).

Measurement or simulation of SRIRs at multiple positions at a high measurement resolution can be

time consuming and computationally expensive, and therefore it can be desirable to interpolate between

1 thomas.mckenzie@aalto.fi

ABS-0439

a sparse set of measurements. The perceptual requirements for inter-measurement distance vary with

auditory stimuli (11, 12), whereby sounds with limited frequency bandwidth can forgive larger distances

between measurements (13), and the greater diffuseness of late reverberation allows for different measurement distances for different parts of the impulse response (5). For different receiver positions inside

coupled rooms, however, the requirements may vary due to the increased acoustical complexity.

Interpolation of mono or binaural RIRs has been approached in many ways in the past: 1) Dynamic

time warping (14), where the time axes of the nearest RIRs are stretched until they align; 2) Modal

interpolation using a general solution to the Helmholtz equation (15), which is effective for non-uniform

spatial distributions of RIRs at low frequencies; and 3) A combination of plane wave decomposition

and time-domain equivalent source methods (11). Moving into SRIRs, a ﬁrst-order interpolation method

is presented in (16), which separates input SRIRs into specular parts, which are the direct sound and

early reﬂections, and the diffuse parts. These are interpolated separately, where the specular parts are

interpolated individually using direction of arrival estimations. In (12), a similar method is presented for

early reﬂection interpolation between the nearest three receivers, with simpler interpolation of residual

signals.

As the transition between coupled rooms is highly complex, it requires great care in reproduction.

Interpolation between two coupled room RIRs is likely to be a more demanding task than for two RIRs

inside the same room. This paper presents a perceptually informed SRIR interpolation method for higher-order Ambisonic SRIRs, which utilises minimum-phase magnitude interpolation of the direct sound steered to the estimated direction of arrival, sector-based early reflection interpolation in the frequency domain, and relative RMS matching for late reverberation interpolation. A method for rendering up to six

degrees-of-freedom datasets of SRIRs is then presented using a time-varying partitioned convolution

audio plugin. Finally, a listening test is conducted in virtual reality to assess the perceptual quality of

interpolating between coupled room SRIRs with varying inter-measurement distance using a previously

measured dataset of the transition between coupled rooms.

The paper is laid out as follows: Section 2 details the interpolation method and Section 3 describes the

convolution plugin and dynamic binaural rendering. Section 4 then presents the methodology and results

of a listening test conducted in virtual reality to evaluate different inter-measurement resolutions of a

dataset of coupled room SRIRs. Finally, Section 5 presents concluding remarks and proposes further

work, and MATLAB code for the interpolation method and an open-source virtual studio technology

(VST) plugin for the 6DoF convolution are made available for download, with the links provided at the

end of the paper.

2. PERCEPTUALLY INFORMED SRIR INTERPOLATION

This section describes the method of perceptually informed interpolation between SRIRs. The method

is designed for 3D sets of SRIRs. The maximum spherical harmonic (SH) order is denoted in this paper as

N, with the order of an individual SH component denoted by n and the degree denoted by m. A MATLAB

implementation of the interpolation method is available for download (see Section 6 for download link).

In this paper, the Ambisonic Channel Numbering (ACN) and semi-normalised (SN3D) conventions are

employed.

The SRIR measurements were made at J points at coordinates \mathcal{P}_J \subset \mathbb{R}^3. At each of those points, a directional room impulse response was measured, which was encoded to the SH domain and is denoted as \mathbf{h}_j(t) \in \mathbb{R}^{(N+1)^2}. These responses are then interpolated to a dense set of I > J points at positions \hat{\mathcal{P}}_I \subset \mathbb{R}^3. The distance between a point from the set of measurement points \mathbf{p}_j \in \mathcal{P}_J and a point from the set of interpolation points \hat{\mathbf{p}}_i \in \hat{\mathcal{P}}_I is denoted as

v_{i,j} = \| \hat{\mathbf{p}}_i - \mathbf{p}_j \|_2, (1)

where \| \hat{\mathbf{p}}_i - \mathbf{p}_j \|_2 denotes the Euclidean distance. With this definition of the distance, it is possible to find the subset of J' = 2^D measurement points which contains the measurements closest to any interpolation point \hat{\mathbf{p}}_i, where D is the dimensionality in which the measurement points are arranged. Therefore, a 1D set of SRIRs in a line will have two nearest measurements; a 2D set of SRIRs in a grid will have four nearest measurements; and a 3D set will have eight nearest measurements. This gives a subset of nearest points \mathcal{P}^{(i)}_{J'} \subset \mathcal{P}_J for each interpolation point. As all steps described next are carried out for each interpolation point, the index i is omitted for readability.
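To make the nearest-neighbour selection concrete, the following NumPy sketch (illustrative only; the released implementation is in MATLAB) finds the 2^D measurement points closest to an interpolation point using the Euclidean distance of Eq. (1). The array names and the 5 cm line layout in the example are assumptions:

```python
import numpy as np

def nearest_measurements(p_interp, P_meas, D):
    """Indices of the 2**D measurement points closest to p_interp.

    p_interp : (3,) interpolation point
    P_meas   : (J, 3) measurement point coordinates (hypothetical layout)
    D        : dimensionality of the measurement arrangement (1, 2 or 3)
    """
    v = np.linalg.norm(P_meas - p_interp, axis=1)  # Euclidean distances, Eq. (1)
    return np.argsort(v)[:2 ** D]                  # 2, 4 or 8 nearest points

# Example: a 1D line of measurements every 5 cm along x
P = np.stack([np.arange(0, 1.01, 0.05), np.zeros(21), np.zeros(21)], axis=1)
idx = nearest_measurements(np.array([0.52, 0.0, 0.0]), P, D=1)  # picks 0.50 m and 0.55 m
```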

2.1 Direct Sound

In this study, the direct sound is taken as the first 4.17 ms of the input SRIRs (200 samples at 48 kHz),

though this value is adjustable. The method assumes RIR onsets are time-aligned. The direction of arrival

(DoA) of the direct sound in each input SRIR is ﬁrst estimated using the time-averaged pseudointensity

vector, i ∈ R^3, which is derived from the first-order SH components as

\mathbf{i} = \sum_{t=1}^{200} [h_1(t) h_4(t), \; h_1(t) h_2(t), \; h_1(t) h_3(t)]^T, (2)

where superscript T denotes transposition.
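Assuming the stated ACN/SN3D conventions (so the first four channels are W, Y, Z, X), Eq. (2) can be sketched in NumPy as follows; this is an illustration, not the released MATLAB code:

```python
import numpy as np

def pseudointensity_doa(h, n_samples=200):
    """Direct-sound DoA from the time-averaged pseudointensity vector, Eq. (2).

    h : (T, >=4) SH-domain SRIR with ACN ordering (W, Y, Z, X first).
    Returns a unit vector in R^3.
    """
    w = h[:n_samples, 0]
    y, z, x = h[:n_samples, 1], h[:n_samples, 2], h[:n_samples, 3]
    i = np.array([np.sum(w * x), np.sum(w * y), np.sum(w * z)])  # Eq. (2)
    return i / np.linalg.norm(i)

# Example: a source on the +x axis (W and X in phase, Y = Z = 0)
h_example = np.zeros((300, 4))
h_example[:, 0] = 1.0
h_example[:, 3] = 1.0
```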

For each interpolation point, the direct sound direction \hat{\theta} \in S^2 needs to be determined. For a non-occluded source, a geometrically correct method would be to estimate the sound source location based on the direct sound DoAs observed at the measurements. This can be done by finding the point that is closest to all lines along the DoAs, ideally their intersection point. Then, the direct sound direction could be computed at the interpolated point. In coupled rooms, where the sound source can potentially be occluded (see, for example, loudspeakers 2 and 3 in Fig. 2a), this procedure may cause problems. Between two measurements, the location of the first sound energy will change, and such geometrical solutions may give arbitrary results. Therefore, a simpler approximate algorithm was used in this study to estimate the sound source location. The direct sound direction at each interpolation point was set to

\hat{\theta} = \sum_{j'} \theta_{j'} g_{j'}, (3)

where g_{j'} are distance weights obtained from the inverse distances between the interpolated positions and the nearest measurement positions:

g_{j'} = \frac{v_{j'}^{-1}}{\sum_{j'=1}^{J'} v_{j'}^{-1}}. (4)

When the interpolated position coincides with a measurement position, the direction is correct. Also, when the sound source is sufficiently far away, or the spacing of measurement points is small, the error introduced by this simplification is small. In the case of occlusion at some position, a smooth interpolation curve emerges between the direction of a visible direct sound and the first energy arriving from an occluded sound source.
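Eqs. (3) and (4) can be sketched as below. The renormalisation of the blended direction back to the unit sphere is an assumption of this sketch, as the text does not state how the weighted sum is normalised:

```python
import numpy as np

def interp_direction(thetas, v):
    """Blend direct-sound DoAs at an interpolation point, Eqs. (3)-(4).

    thetas : (J', 3) unit DoA vectors at the nearest measurements
    v      : (J',) distances from the interpolation point, Eq. (1)
    """
    g = (1.0 / v) / np.sum(1.0 / v)              # inverse-distance weights, Eq. (4)
    theta = np.sum(g[:, None] * thetas, axis=0)  # weighted sum, Eq. (3)
    return theta / np.linalg.norm(theta)         # back to the unit sphere (assumed)

# Equidistant measurements: the blend lies halfway between the two DoAs
d_mid = interp_direction(np.array([[1.0, 0, 0], [0, 1.0, 0]]), np.array([1.0, 1.0]))
# Much closer to the first measurement: the blend hugs its DoA
d_near = interp_direction(np.array([[1.0, 0, 0], [0, 1.0, 0]]), np.array([0.01, 10.0]))
```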

For the minimum-phase direct sound interpolation, the omnidirectional channels of the nearest input SRIRs are first converted into the frequency domain. The spectra are then 1/3-octave smoothed, magnitude weighted based on the gains g_{j'}, and made minimum phase. The spectra are then summed and encoded into SH at the target angle \hat{\theta}. The interpolated direct sound is then amplitude normalised based on the gain-weighted RMS of the nearest measurements. This procedure ensures that the effect of the sound source directivity is accounted for at the interpolated position.
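A minimal sketch of the minimum-phase step, using the standard real-cepstrum (folded-cepstrum) construction; the 1/3-octave smoothing and SH encoding are omitted, and the function names are illustrative:

```python
import numpy as np

def min_phase_from_magnitude(mag):
    """Minimum-phase spectrum with the given magnitude (real-cepstrum method).

    mag : magnitude spectrum over the full DFT grid (even length).
    """
    n = len(mag)
    log_mag = np.log(np.maximum(mag, 1e-12))
    cep = np.fft.ifft(log_mag).real      # real cepstrum of the magnitude
    w = np.zeros(n)
    w[0] = w[n // 2] = 1.0               # fold the cepstrum to make it causal
    w[1:n // 2] = 2.0
    return np.exp(np.fft.fft(w * cep))   # magnitude preserved, phase now minimum

def interp_direct_spectrum(specs, g):
    """Gain-weighted sum of minimum-phase versions of the nearest spectra."""
    return sum(gi * min_phase_from_magnitude(np.abs(s)) for gi, s in zip(g, specs))

# Example: the construction preserves the magnitude exactly
mag = np.abs(np.fft.fft(np.array([1.0, 0.5, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0])))
spec = min_phase_from_magnitude(mag)
```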

2.2 Early Reﬂections

For the early reﬂection interpolation, ﬁrstly the transition time tEL, which is the cutoff between early

reﬂections and late reverberation, is calculated separately for each input SRIR based on the energy decay

curve passing a set threshold value (17). The omnidirectional channel of each SRIR is ﬁrst bandpass

ﬁltered at 1 kHz, then normalised to a maximum amplitude of 1, and Schroeder integration is used to

obtain the energy decay curve (EDC):

D(t) = \int_t^{\infty} h^2(\tau) \, d\tau. (5)

In this study, values of t_{EL} are calculated as the time at which D(t) passes one tenth of its initial value (a 10 dB decay), rounded to the nearest 1000 samples; these generally fall between 80 ms and 250 ms for the room transition dataset (4). This is on the higher

end of typical early reﬂection cutoff times reported in the literature (17, 18, 19).
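The Schroeder integration of Eq. (5) and the threshold-based transition time can be sketched as follows; the 1 kHz bandpass filtering is omitted, and the −10 dB reading of the threshold criterion is an assumption of this sketch:

```python
import numpy as np

def schroeder_edc(h):
    """Energy decay curve via backwards (Schroeder) integration, Eq. (5)."""
    e = np.cumsum(h[::-1] ** 2)[::-1]
    return e / e[0]                            # normalise so D(0) = 1

def transition_time(h, threshold_db=-10.0, round_to=1000):
    """First sample at which the EDC passes the threshold, rounded to the
    nearest round_to samples; -10 dB is one reading of the D(t)/10 criterion."""
    edc_db = 10.0 * np.log10(np.maximum(schroeder_edc(h), 1e-12))
    t = int(np.argmax(edc_db <= threshold_db))
    return int(round(t / round_to) * round_to)

# Example: a synthetic exponential decay at 48 kHz
h_example = np.exp(-np.arange(48000) / 4000.0)
```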

The early reflections are interpolated and equalised at different directions on the sphere by using beamforming and reconstruction (20). For this, the measured SH domain responses are analysed with a set of max-r_E beams directed to a dense set of L quasi-uniformly arranged directions in a so-called t-design \boldsymbol{\Theta}_t (21),

\mathbf{h}_j(t) = \frac{4\pi}{L} \mathbf{Y}_N \, \mathrm{diag}_N\{w_n\} \, \mathbf{h}_j(t), (6)

where \mathbf{Y}_N \in \mathbb{R}^{L \times (N+1)^2} is a matrix of real spherical harmonics evaluated at directions \boldsymbol{\Theta}_t, and \mathrm{diag}_N\{w_n\} is a diagonal matrix of beamforming weights, with one unique weight for all SH components belonging to each order. The t-design with the least number of points that fulfils t \geq 2N+1 is selected. For the fourth-order SRIRs used in this paper, for example, the t-design has 48 points. The beam signals are weighted with the distance weights and summed together:

\mathbf{h}(t) = \sum_{j'} g_{j'} \mathbf{h}_{j'}(t). (7)

Next, the summed signals are equalised to match the weighted sum of the magnitude spectra in each

direction. Equalisation is needed to rectify any comb ﬁltering artefacts that may arise from the summing

of correlated signals, and is most apparent when the SRIRs to be interpolated are a greater distance apart.

Every beamformed signal is equalised separately, such that colouration is removed in each direction. The

equalisation is performed in the frequency domain in equivalent rectangular bandwidth (ERB) frequency bands (22):

BW_{ERB} = 24.7 \, (4.37 \times 10^{-3} f_c + 1), (8)

where f_c is the centre frequency. In this study, 48 frequency bands are employed with the lowest frequency at 10 Hz, which approximates 1/3-octave bands. For each ERB band, the target RMS is the sum of the RMS of each amplitude-weighted nearest SRIR beam divided by the current RMS of the interpolated beam. An equalisation curve is then calculated by linear interpolation of each ERB band target RMS, between 20 Hz and 20 kHz. After the directionally equalised responses for every directional response in \mathbf{h}^{(EQ)}(t) are obtained, they are brought back into the SH domain using

\mathbf{h}(t) = \mathrm{diag}_N\{1/w_n\} \, \mathbf{Y}_N^T \, \mathbf{h}^{(EQ)}(t). (9)
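The per-band equalisation can be sketched as below, where Eq. (8) gives the ERB bandwidth and each band gain is the target RMS over the current RMS; the band edges and array names are illustrative placeholders, not the released implementation:

```python
import numpy as np

def erb_bandwidth(fc):
    """ERB bandwidth in Hz at centre frequency fc, Eq. (8)."""
    return 24.7 * (4.37e-3 * fc + 1.0)

def band_eq_gains(target_spec, current_spec, band_edges):
    """One gain per band: target RMS divided by current RMS.

    target_spec, current_spec : magnitude spectra on a common frequency grid
    band_edges : bin indices delimiting the bands (illustrative stand-in for
                 a proper ERB-spaced banding of the DFT grid)
    """
    gains = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        t = np.sqrt(np.mean(target_spec[lo:hi] ** 2))
        c = np.sqrt(np.mean(current_spec[lo:hi] ** 2))
        gains.append(t / max(c, 1e-12))
    return np.array(gains)
```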

2.3 Late Reverberation

The late reverberation interpolation follows much the same method as used for the early reflections, but without the beamforming. The final interpolated SRIRs are a sum of the interpolated direct sound, early reflections and late reverberation, with cosine-shaped amplitude windows used to fade between sections: 20 samples for direct sound to early reflections, and 10 ms for early reflections to late reverberation (both values are configurable). The interpolated set of SRIRs can then be saved as a spatially oriented format for acoustics (SOFA) file (23), in the same format as the input set, which makes it directly compatible with the convolution plugin described in the following section.
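The cosine-shaped cross-fades might look as follows; the complementary raised-cosine window shape is an assumption consistent with, but not specified by, the text:

```python
import numpy as np

def cosine_fade_windows(fade_len):
    """Complementary cosine fade-out / fade-in windows that sum to one."""
    n = np.arange(fade_len)
    fade_out = 0.5 * (1.0 + np.cos(np.pi * n / (fade_len - 1)))
    return fade_out, 1.0 - fade_out

def crossfade(a, b, fade_len):
    """Blend the overlapping fade_len samples of two adjacent sections."""
    w_out, w_in = cosine_fade_windows(fade_len)
    return a * w_out + b * w_in

# Example: the 20-sample direct-to-early fade stated in the text
fade_out20, fade_in20 = cosine_fade_windows(20)
```

Because the two windows sum to one, cross-fading two identical signals reproduces the signal exactly, so the fades introduce no level dip at the section boundaries.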

3. RENDERING

To auralise the SRIRs, a virtual studio technology (VST) plugin was developed, which allows for a

monophonic input signal to be convolved with a speciﬁed SRIR from a set of input SRIRs. The plugin

uses fast partitioned time-varying convolution in the frequency domain (24) with the overlap-add method

to allow for real-time switching between input SRIRs, with minimal perceptual switching artefacts. It is

based on a MATLAB prototype presented in a previous study (8), to which the reader is directed for a more detailed description. The plugin is freely available as part of the SPARTA plugin suite (25) (see

Section 6 for a download link), and the plugin graphical user interface (GUI) is presented in Figure 1.

To summarise the method, input SRIR filters are divided into blocks based on the digital audio workstation (DAW) block size and placed into a filter matrix. Each block is then zero padded with the same number of samples as the block size, and converted to the frequency domain using the discrete Fourier transform (DFT). Only the first half of the result is saved, which reduces computational load. The input monophonic signal to be convolved with the SRIRs is also converted into the frequency domain, and a signal matrix is constructed of the input signal blocks, whereby for each input block time period, the input signal matrix is shifted by one block, dropping the oldest input block and placing the current signal block at the front.

Figure 1 – The graphical user interface of the 6DoFconv VST plugin.

Each block of the signal matrix is then multiplied by each block of the ﬁlter matrix corresponding

to the chosen ﬁlter selection, and the results are then summed. The block is then duplicated, ﬂipped

and the complex conjugate is taken to rebuild the second half of the frequency domain signal, before

being converted into the time domain using the inverse DFT. Convolution artefacts caused by the signal

discontinuities when switching between the SRIRs are mitigated by cross-fading the convolved signals

across multiple convolution blocks: the block convolved with the currently selected SRIR is saved for

the next time period while the current output block is constructed from a linear cross-fade between the

second half of the convolution block before the last and the ﬁrst half of the last convolution block.

The GUI of the VST plugin allows for a user to select a path to a SOFA ﬁle with SRIRs and their

associated listener positions encoded. It then loads the SRIRs and displays their positions in the Coordinate View on the right (the dimensions of the view window are determined by the coordinate range of

the source and listener positions). The view can be chosen to be from the top or from the side using a

drop-down menu on the top right corner. The listener position can be changed by dragging the orange

dot in the coordinate view or moving the target position sliders on the left. When the position is changed,

the plugin ﬁnds the nearest neighbouring SRIR based on the smallest Euclidean distance. The listener

position can also be controlled via open sound control (OSC) messages from an external device, such as

a head tracker. Additionally, the plugin includes an Ambisonic sound-ﬁeld rotator for Ambisonic SRIRs.

4. EVALUATION

This section details the evaluation of the interpolation algorithm, which was carried out both numer-

ically and perceptually. The set of measurements used in the evaluation was the storage to stairwell

measurements from the Room Transition dataset of SRIRs at N = 4 (4), available under a Creative Commons license1. The room transition investigated in this paper is from a dry storage space to a more

reverberant stairwell, with measured background noise levels of 32.8 dBA and 35.2 dBA and RT60s of

0.29 s and 0.73 s, respectively (8).

Figure 2a presents the room geometry and loudspeaker positions of the measurements, with four

1http://doi.org/10.5281/zenodo.4095493

(a) Room geometry and loudspeaker locations (storage height 3.6 m; stairwell height 14.4 m)

(b) EDC (energy decay curve) for loudspeaker 2 at 150 cm inside storage space

Figure 2 – Room geometry and loudspeaker locations of the coupled room transition, and an energy decay curve illustrating the double-slope decay of the room transition. Measurements denoted by dashed arrow; loudspeaker numbers 1 and 4 retain a continuous line-of-sight between the loudspeakers and microphone for all measurement positions, 2 and 3 feature occlusion at some measurement positions (4).

loudspeakers: two in each room, of which one per room retains a continuous line-of-sight (CLOS) between the source and receiver for all receiver positions, and two are without CLOS. Figure 2b shows the EDC, calculated using Equation (5), for loudspeaker 2 at receiver position 100 cm, which is 150 cm inside the storage space. The EDC illustrates the double-slope nature of the energy decay, caused by the combination of the reverberation times of the coupled rooms, whereby the amplitude of each room's single-slope decay is the only feature that is considered to change with receiver position (26).

To assess the interpolation, test sets of SRIRs were calculated from the original dataset of measured

SRIRs, which has a 5 cm inter-measurement distance (IMD). This was done by interpolating (at 5 cm

intervals) sparse versions of the original dataset with new IMDs of 10 cm, 20 cm, 50 cm, 100 cm, 200 cm

and 500 cm (where the 500 cm case is just two SRIRs, one at either end). This was repeated for the

measurements at the four loudspeaker positions illustrated in Figure 2a.

4.1 Numerical Evaluation

To numerically evaluate the interpolation method, the DoA of the room transition SRIRs was estimated first for the original dataset (with an IMD of 5 cm), and then for the test sets of SRIRs calculated from interpolation of the original dataset with a reduced IMD. DoA was estimated above 3 kHz, due to the order-dependent filtering necessary for higher-order spherical microphone arrays (27), using a fourth-order SH steered plane-wave decomposition beamformer that calculates the power at each chosen location on the sphere (28). DoA was estimated at five-degree resolution for seven arrivals, referring to the direct sound and the loudest early reflections.

The error in DoA was then calculated as the difference in azimuth angle between the DoAs calculated

from the reference dataset and the interpolated datasets. A single azimuth error value E_θ for each interpolated dataset was then calculated as the mean of the absolute difference in estimated azimuth. Table 1

(a) Original 5 cm interval

(b) Interpolated from 50 cm interval

Figure 3 – Estimated direction of arrival of direct sound and early reflections for LS 2 (in storage, no continuous line-of-sight between the source and receiver, see Fig. 2a). Azimuth values are presented from −170° to 190° for aided visibility around ±180°, and colour intensity is normalised separately to each measurement's maximum power value.

presents the results. In general, E_θ increases with higher IMD, which is expected. Some interesting results emerge when considering the differences between LS 1 and 2, in the storage space, and LS 3 and 4, in the stairwell. LS 3 and LS 4 have considerably lower E_θ for the low-IMD sets, which may be explained by the higher reverberation time of the stairwell, leading to higher energy throughout the room transition. E_θ jumps significantly at IMD = 500 cm, suggesting the interpolation method is unable to accurately reconstruct the room transition acoustics at this distance.

Table 1 – Mean estimated DoA error E_θ in degrees between the reference dataset and the test SRIR datasets. IMD refers to the inter-measurement distance of the test SRIR datasets, and LS X refers to the loudspeaker positions as illustrated in Figure 2a.

IMD (cm)   10     20     50     100    200    500
LS 1       11.0   14.0   16.2   21.0   18.7   24.5
LS 2       13.6   16.9   18.3   19.4   21.1   35.7
LS 3       4.88   7.63   10.5   14.6   12.0   22.0
LS 4       3.40   5.48   6.88   7.35   19.6   18.6

To better illustrate the DoA of the interpolated SRIR sets, the horizontal DoA for all source locations

and measurement positions of LS 2 (in storage, no CLOS between the source and receiver) is presented

in Figure 3, for the original SRIRs and for the interpolated SRIRs from 50 cm IMD. In the plot, a positive

increase in azimuth denotes anticlockwise movement, and colour intensity is normalised separately for

each measurement to the maximum power detected in that measurement, in order to illustrate the relative

intensity of the dominant source direction to the other reﬂections. The overall trends are largely retained

with the interpolation, though some details around the coupling aperture are somewhat less accurately

captured.

4.2 Subjective Evaluation

To perceptually evaluate the quality of the SRIR interpolation, a listening test was conducted in virtual

reality. The test paradigm was MUSHRA-like, with a hidden reference but no anchor. Participants were

presented with seven conditions for which they could select one condition at a time, and were asked to

walk the transition and rate the sound quality in terms of overall perceived similarity to the reference,

with instructions to listen for all of localisation accuracy, colouration and reverberation. The reference

condition was the original dataset of SRIRs, and the test conditions were the interpolated SRIR sets at

different IMDs.

Two test stimuli were used: a dry recording of a drumkit, chosen for its transients, sharp attacks and wide range of frequency content, and an anechoic violin recording, chosen for its smooth and periodic waveform2. The 6DoFconv plugin was used to convolve the test stimuli with the set of SRIRs, whereby the SRIR was switched depending on the participant's position. The convolved signals were then rendered binaurally using the parametric higher-order DirAC binaural decoder (29). Mysphere 3.2 headphones were used for playback, which have been shown to offer high levels of passive transparency (30), making them suitable for experiments with both real and virtual sources (8, 9). Audio processing and programming of the listening test was conducted in Cycling '74 Max.

To display the room transition in virtual reality, three-dimensional models of the two rooms were captured using LiDAR technology from an Apple iPad Pro, with certain features, such as the doors and windows, enhanced in post-processing using high-resolution two-dimensional textures and sharper edges. Unity was used to render the visuals, which were displayed on an Oculus Rift S. The loudspeaker model was movable in the environment, such that whichever loudspeaker was currently playing was displayed (as determined in Max, and sent to Unity via OSC). User position and orientation data, for SRIR selection and sound field rotation in the 6DoFconv plugin, was sent from Unity to Max via OSC.

The listening test instructions and MUSHRA-like user interface were shown in the Unity virtual

environment: the position of these was controlled by the Oculus left hand controller, and interactions

made using the trigger on the Oculus right hand controller. To ensure participants stayed within the

bounds of the SRIR measurements, a guiding line was placed at 1.2 m above the ground in the Unity

scene, from 2.5 m inside the storage space to 2.5 m inside the stairwell, corresponding to the positions

of the measurements. In the case that the participant strayed more than 25 cm from the guiding line in

the X or Z axis, the screen ﬂashed red and the audio cut out.

The listening test consisted of a total of eight trials: the four loudspeaker positions presented once

with the drumkit and once with the violin. No repeats were conducted. Trial and condition ordering was

randomised and double anonymous. The tests were conducted on 13 participants aged between 24 and

31 (11 male, 2 female) with self reported normal hearing and prior critical listening experience (such as

education or employment in audio or music engineering).

4.2.1 Results and Discussion

The results of the listening test are presented as violin plots in Figure 4. Violin plots display both the

density trace and box plot, which illustrates the structure of the data better than traditional box plots (31).

The violin widths represent the density of data, median values are presented as a white point, interquartile

ranges are marked using a thick grey line, the ranges between the lower and upper adjacent values are

marked using a thin grey line, and individual results are displayed as coloured points.

The results generally show that, with the presented SRIR interpolation method, IMDs up to 50 cm

produced perceptually comparable results to the reference at 5 cm IMD. Even for the 100 cm IMD,

median values were above 80 for 7 out of 8 tested conditions. At 200 cm and 500 cm IMD, scores

were significantly lower, especially for LS 2 and LS 3, where there was no CLOS between the source and receiver and the largest angular errors in the direct sound direction occur due to the choice of direct sound location estimation method. This is in keeping with the results shown in (8), which showed that a linear

interpolation between the ﬁrst and last measurements was rated as higher in naturalness for the two

sound sources with CLOS (LS 1 and LS 4) than those without (LS 2 and LS 3).

To test the statistical significance of the results, the data was first tested for normality using the Shapiro-Wilk test, which showed that not all data was normally distributed, even when excluding the reference condition. Therefore, statistical analysis was conducted using non-parametric methods. Friedman's tests showed that the conditions were statistically significantly different (p < 0.001) for all stimuli and loudspeaker pairs except LS 1 with the violin stimulus: χ²(6) = 8.32, p = 0.21; in this configuration both 200 cm and 500 cm IMDs performed relatively well, with median values of 74 and 69, respectively.

To look in more detail at the statistical significance of the differences between results, post-hoc pairwise Wilcoxon signed-rank tests with the Bonferroni-Holm correction were conducted; the results are presented in Figure 5. These confirm that the main differences in results are caused by the 200 cm and 500 cm IMDs in most cases. They suggest an IMD of 100 cm is sufficient for most cases, apart from LS 2 with the drumkit stimulus.

The different stimuli, a drumkit and a violin, on the whole produced relatively similar results, though

at IMD ≥100 cm, the median rating of the drumkit was lower for 11 out of 12 cases. This suggests that

the drumkit stimulus showed the artefacts of interpolation better, and could suggest that the choice of

2Downloaded from https://www.openair.hosted.york.ac.uk/

(a) LS 1 (in storage, with CLOS)   (b) LS 2 (in storage, no CLOS)   (c) LS 3 (in stairwell, no CLOS)   (d) LS 4 (in stairwell, with CLOS)

Figure 4 – Violin plots of the MUSHRA-like listening test results, per inter-measurement distance (5 (Ref), 10, 20, 50, 100, 200 and 500 cm) for the drumkit and violin stimuli. CLOS refers to a continuous line-of-sight between the loudspeaker and listener for all listener positions (refer to Fig. 2a for loudspeaker positions and room geometries). Median values are a white point, interquartile range a thick grey line, the range between lower and upper adjacent values a thin grey line, and individual results are coloured points.

[Figure 5 plot area: eight p-value matrices, one per stimulus and loudspeaker combination (Drumkit, LS 1 to LS 4; Violin, LS 1 to LS 4). Both axes: inter-measurement distance (cm), with conditions 5 (Ref), 10, 20, 50, 100, 200 and 500; colour scale: p value from 0 to 1.]

Figure 5 – Matrices of Wilcoxon signed-rank test p values (with Bonferroni-Holm correction) between the different conditions of the listening test.

IMD when measuring may be influenced by the stimuli of the intended application.
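The statistical procedure named in Fig. 5, pairwise Wilcoxon signed-rank tests with Bonferroni-Holm correction, can be sketched in Python as follows. This is an illustration using SciPy rather than the authors' analysis code; the array layout (participants by conditions) and function names are assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

def holm_correct(pvals):
    """Bonferroni-Holm step-down correction of a family of p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)        # test p-values from smallest to largest
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # scale by the number of remaining hypotheses, keep monotone, cap at 1
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

def pairwise_wilcoxon(scores):
    """Pairwise Wilcoxon signed-rank tests between all condition pairs.

    scores: (n_participants, n_conditions) array of ratings.
    Returns a symmetric matrix of Holm-corrected p-values (diagonal = 1)."""
    n_cond = scores.shape[1]
    pairs, pvals = [], []
    for i in range(n_cond):
        for j in range(i + 1, n_cond):
            _, p = wilcoxon(scores[:, i], scores[:, j])
            pairs.append((i, j))
            pvals.append(p)
    adjusted = holm_correct(pvals)
    mat = np.ones((n_cond, n_cond))
    for (i, j), p in zip(pairs, adjusted):
        mat[i, j] = mat[j, i] = p
    return mat
```

Because the correction is applied jointly over all pairs, a single small raw p-value is penalised by the full family size, while larger ones receive progressively smaller penalties.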

5. CONCLUSIONS

This paper has presented an interpolation method for higher-order Ambisonic spatial room impulse responses (SRIRs), suitable for up to six degrees-of-freedom datasets and robust when interpolating measurements in the transition between coupled rooms. A time-varying partitioned convolution method then allows for real-time switching of SRIRs.
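The core idea behind such real-time switching of SRIRs is time-varying FIR filtering by crossfading between the outputs of two static convolutions. The following minimal sketch illustrates only this principle for a single channel, using plain `np.convolve` where an efficient implementation (such as the 6DoFconv plugin) would use partitioned FFT convolution; function and parameter names are illustrative.

```python
import numpy as np

def crossfade_switch(x, ir_a, ir_b, switch_sample, fade_len):
    """Switch from impulse response ir_a to ir_b during playback by
    crossfading the two convolved outputs over fade_len samples."""
    y_a = np.convolve(x, ir_a)
    y_b = np.convolve(x, ir_b)
    n = max(len(y_a), len(y_b))
    y_a = np.pad(y_a, (0, n - len(y_a)))
    y_b = np.pad(y_b, (0, n - len(y_b)))
    # gain ramp for the incoming IR: 0 before the switch, 1 after the fade
    gain_b = np.zeros(n)
    gain_b[switch_sample:switch_sample + fade_len] = np.linspace(0.0, 1.0, fade_len)
    gain_b[switch_sample + fade_len:] = 1.0
    return (1.0 - gain_b) * y_a + gain_b * y_b
```

Crossfading the filter outputs, rather than the filter coefficients, avoids the transient artefacts that an abrupt coefficient swap would cause mid-signal.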

The system has been evaluated numerically using direction-of-arrival (DoA) analysis, which shows that the interpolation remains relatively accurate even at an inter-measurement distance (IMD) of 10 times the original. A dynamic listening test was then conducted in virtual reality, using parametric binaural decoding and visuals of three-dimensional room models obtained from LIDAR scans, in which participants were able to walk through the transition in real time. The results showed that, using the presented interpolation method, IMDs up to 50 cm, or in some cases 100 cm, were rated as highly similar to the reference (IMD of 5 cm).
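One common approach to such DoA analysis on first-order Ambisonic data is the pseudo-intensity vector: the time-averaged product of the omnidirectional channel with the three dipole channels. The sketch below is a generic illustration of that technique, not necessarily the authors' exact analysis; the WXYZ channel ordering is an assumption, and normalisation conventions (SN3D/N3D) are ignored.

```python
import numpy as np

def pseudo_intensity_doa(w, x, y, z):
    """Estimate a single DoA unit vector from a first-order Ambisonic RIR
    segment as the direction of the time-averaged pseudo-intensity vector."""
    intensity = np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])
    norm = np.linalg.norm(intensity)
    return intensity / norm if norm > 0 else intensity
```

In practice this is usually applied per time window (and often per frequency band), so that the direct sound and individual early reflections yield separate DoA estimates.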

The evaluation showed that, even for a demanding acoustic scenario such as a room transition, the presented SRIR interpolation method is able to reduce the necessary inter-measurement distance, allowing for time and cost savings in measurements.

Further work will compare the presented SRIR interpolation method to other available methods, and quantify the improvements over a basic linear interpolation method. Additionally, the method should be used to interpolate between measurements within a single room, with the results compared to the evaluation in this study, to assess the feasibility of interpolating between measurements at a higher IMD when the acoustical changes are smaller.
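As a point of reference for that comparison, the basic linear interpolation baseline can be sketched as a sample-wise crossfade between two time-aligned impulse responses, here combined with a simple RMS match in the spirit of the relative RMS matching used for late reverberation. This is a single-channel simplification with hypothetical helper names, not the presented method itself.

```python
import numpy as np

def rms(x):
    """Root-mean-square level of a signal."""
    return np.sqrt(np.mean(x ** 2))

def linear_interp_rir(ir_a, ir_b, alpha):
    """Baseline: sample-wise linear interpolation between two time-aligned
    impulse responses, rescaled so the output RMS follows the linear
    interpolation of the two measured RMS values (alpha in [0, 1])."""
    y = (1.0 - alpha) * ir_a + alpha * ir_b
    target_rms = (1.0 - alpha) * rms(ir_a) + alpha * rms(ir_b)
    current_rms = rms(y)
    return y * (target_rms / current_rms) if current_rms > 0 else y
```

The rescaling step matters because a plain crossfade of two decorrelated responses cancels energy near alpha = 0.5, which would otherwise be audible as a level dip between measurement positions.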

6. DOWNLOAD

The presented interpolation method is available for download as MATLAB code3, along with demonstration and analysis scripts, and the 6DoFconv VST plugin is now freely available as part of the SPARTA suite4.

ACKNOWLEDGEMENTS

This research was supported by the Human Optimised XR (HumOR) Project, funded by Business Finland, and the EU's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 812719.


3 https://github.com/thomas-mckenzie/srir_interpolation
4 https://leomccormack.github.io/sparta-site/docs/plugins/sparta-suite/#6dofconv
