Proceedings of the 27th International Conference on Digital Audio Effects (DAFx24), Guildford, United Kingdom, 3 - 7 September 2024
BINAURAL DARK-VELVET-NOISE REVERBERATOR
Jon Fagerström1, Nils Meyer-Kahlen1, Sebastian J. Schlecht1,2 and Vesa Välimäki1
1Acoustics Lab, Department of Information and Communications Engineering
2Media Lab, Department of Art and Media
Aalto University, Espoo, Finland
jon.fagerstrom@aalto.fi
ABSTRACT
Binaural late-reverberation modeling necessitates the synthesis of frequency-dependent interaural coherence, a crucial aspect of spatial auditory perception. Prior studies have explored methodologies such as filtering and cross-mixing two incoherent late-reverberation impulse responses to emulate the coherence observed in measured binaural late reverberation. In this study, we introduce two variants of the binaural dark-velvet-noise reverberator. The first one uses cross-mixing of two incoherent dark-velvet-noise sequences that can be generated efficiently. The second variant is a novel time-domain jitter-based approach. The methods' accuracies are assessed through objective and subjective evaluations, revealing that both methods yield comparable performance and clear improvements over using incoherent sequences. Moreover, the advantages of the jitter-based approach over cross-mixing are highlighted by introducing a parametric width control, based on the jitter-distribution width, into the binaural dark-velvet-noise reverberator. The jitter-based approach can also introduce time-dependent coherence modifications without additional computational cost.
1. INTRODUCTION
Sound for virtual and augmented reality requires rendering binaural reverberation. Binaural reverberation is characterized by a specific, frequency-dependent interaural coherence (IC), which depends on the sound field and the physiology of the human head. Early on, artificial-reverberation (AR) algorithms were proposed that aim at matching the broadband IC by matching the maximum of the interaural cross-correlation (IACC) to a measured binaural room impulse response (BRIR) [1]. Later, it was shown that matching the broadband IC is not accurate enough; instead, the frequency-dependent IC should be matched to synthesize binaural reverberation [2].
Menzer and Faller proposed a modified feedback-delay-network (FDN) reverberator with frequency-dependent IC matching [3, 4]. Their method requires synthesizing two uncorrelated outputs of an FDN, which are then filtered and cross-mixed with specially designed coherence-matching filters. An alternative method is to generate several uncorrelated outputs of the reverberator and to convolve each of them with a head-related transfer function (HRTF) belonging to a different direction [5, 6]. Kirsch et al. have found that at least 12 directions are required if the reverberation is isotropic, i.e., it does not depend on direction [7].
Copyright: © 2024 Jon Fagerström et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution 4.0 International License, which
permits unrestricted use, distribution, adaptation, and reproduction in any medium,
provided the original author and source are credited.
However, in the isotropic case, the cross-mixing approach is more efficient.

Karjalainen and Järveläinen proposed velvet noise, due to its sparsity and perceptual smoothness, as an effective and efficient model of the late-reverberation tail [8]. Various feedback-based [8, 9, 10] and feedforward-based [11, 12] velvet-noise reverberators have been proposed for generating the frequency-dependent decay found in natural late reverberation. The latest of the feedforward-based methods is dark velvet noise (DVN) [13], which was later generalized as the extended DVN (EDVN). Compared to FDNs and other recursive algorithms, EDVN allows non-exponential decay and simpler matching of the frequency-dependent decay and overall coloration [14].
This paper proposes binaural dark velvet noise (BDVN) as a two-channel version of EDVN [14]. BDVN generates binaural reverberation with a given IC using two variants. The first one employs cross-mixing as in [3, 2]. It is demonstrated how the two required incoherent sequences can be generated efficiently through small modifications of the existing EDVN structure. The second approach jitters one sequence relative to the other, utilizing the fundamental relationship between IC and the jitter distribution. As far as we know, this relationship has not been previously employed for synthesizing binaural reverberation.
The rest of this paper is organized as follows. Sec. 2 summarizes the relevant background on IC and the EDVN algorithm. Sec. 3 describes the novel BDVN algorithm and its two variants. Sec. 4 presents objective and perceptual evaluations. Sec. 5 concludes the work.
2. BACKGROUND
2.1. Binaural Reverberation
Many artificial reverb algorithms, including the EDVN reverberator, produce single-channel output. However, in real-world listening, we experience binaural reverberation, with differences in the signals received by each ear. The IC is defined as
$$\Phi_{LR}(\omega) = \frac{|S_{LR}(\omega)|^2}{S_L(\omega)\, S_R(\omega)}, \qquad (1)$$
where ω is the frequency in radians, S_L(ω) and S_R(ω) are the power spectral densities (PSDs) of the two ear channels, and |S_LR(ω)| is the absolute value of the cross power spectral density (CPSD).
In practice, the IC is estimated using time averages of the short-time Fourier-transform (STFT) coefficients,
$$\hat{\Phi}_{LR}(\omega) = \frac{\left|\sum_{\nu=\nu_0}^{\infty} H_L(\nu,\omega)\, H_R^{*}(\nu,\omega)\right|^{2}}{\sum_{\nu=\nu_0}^{\infty} |H_L(\nu,\omega)|^{2}\, \sum_{\nu=\nu_0}^{\infty} |H_R(\nu,\omega)|^{2}}, \qquad (2)$$
where H_L(ν, ω) and H_R(ν, ω) are the STFT coefficients of the left and the right channel of the BRIR, respectively, and (·)* denotes complex conjugation. The sum starts at the time index ν0. In this study, we focus on modeling only the late part of the response, starting 50 ms after the direct sound arrival, a conservative estimate of the mixing time for most rooms [15], and the IC is estimated from this late part for the remainder of the work.
If the responses are identical, the IC equals 1 across all frequencies; if they are perfectly incoherent, the IC is 0. In an ideal diffuse field, sound from different directions is incoherent and has equal intensity. However, the IC measured at a binaural receiver is not zero, since the time and level differences caused by the finite-size head acting as a scattering object increase the coherence at low frequencies.
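As an illustration of Eq. (2), the following Python/NumPy sketch estimates the frequency-dependent IC from a two-channel BRIR. It is not the authors' MATLAB implementation; the function name, FFT length, and 50-ms onset are assumptions for this example.

```python
import numpy as np
from scipy.signal import stft

def interaural_coherence(brir, fs, onset_s=0.05, nfft=1024):
    """Estimate the IC of Eq. (2) from a BRIR of shape (num_samples, 2)."""
    start = int(onset_s * fs)                            # skip the direct sound and early part
    _, _, HL = stft(brir[start:, 0], fs, nperseg=nfft)   # left-ear STFT, shape (bins, frames)
    _, _, HR = stft(brir[start:, 1], fs, nperseg=nfft)   # right-ear STFT
    num = np.abs(np.sum(HL * np.conj(HR), axis=-1)) ** 2  # |sum over frames of H_L H_R*|^2
    den = np.sum(np.abs(HL) ** 2, axis=-1) * np.sum(np.abs(HR) ** 2, axis=-1)
    return num / np.maximum(den, 1e-12)                   # IC per frequency bin, between 0 and 1
```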
2.2. Cross Mixing
A common method to generate artificial binaural reverberation with a specified IC involves cross-mixing two completely incoherent channel responses, h_1 and h_2 [3, 2, 4]. The mixing filters are
$$U_1(\omega) = \sqrt{\frac{1 + \sqrt{\Phi_{LR}(\omega)}}{2}}, \qquad U_2(\omega) = \sqrt{\frac{1 - \sqrt{\Phi_{LR}(\omega)}}{2}}. \qquad (3)$$
They are applied to the incoherent sequences to create the binaural responses
$$H_L(\omega) = U_1(\omega) H_1(\omega) + U_2(\omega) H_2(\omega), \qquad (4)$$
$$H_R(\omega) = U_1(\omega) H_1(\omega) - U_2(\omega) H_2(\omega), \qquad (5)$$
where H_1(ω) and H_2(ω) are the discrete-time Fourier transforms (DTFTs) of the two incoherent sequences h_1(t) and h_2(t), and H_L(ω) and H_R(ω) are the DTFTs of the synthesized binaural responses of the left and right ear, respectively.
Menzer and Faller applied the cross-mixing approach to two incoherent Gaussian noise sequences [2] as well as to two FDN outputs [3, 4]. Below, we demonstrate the effectiveness of cross-mixing for creating BDVN from EDVN, leveraging the ease of generating two incoherent EDVN sequences.
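A minimal sketch of the cross-mixing of Eqs. (3)-(5) is given below, assuming two already incoherent impulse responses and a target IC sampled on an rFFT grid. Applying the mixing filters by frequency-domain multiplication is a simplification of the coherence-matching filters used in [2, 3]; all names are illustrative assumptions.

```python
import numpy as np

def cross_mix(h1, h2, phi_target, nfft):
    """h1, h2: incoherent IRs; phi_target: target IC on nfft//2 + 1 rFFT bins."""
    U1 = np.sqrt((1 + np.sqrt(phi_target)) / 2)   # Eq. (3)
    U2 = np.sqrt((1 - np.sqrt(phi_target)) / 2)   # Eq. (3)
    H1, H2 = np.fft.rfft(h1, nfft), np.fft.rfft(h2, nfft)
    hL = np.fft.irfft(U1 * H1 + U2 * H2, nfft)    # Eq. (4)
    hR = np.fft.irfft(U1 * H1 - U2 * H2, nfft)    # Eq. (5)
    return hL, hR
```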
2.3. Extended Dark Velvet Noise
Original Velvet Noise (OVN), the basis of EDVN, is a sparse pseudo-random sequence composed of a jittered unit-impulse train with uniformly distributed signs [8]. As such, OVN has a flat PSD [16]. The placement of each unit impulse is constrained via the grid size T_d, defined as
$$T_\mathrm{d} = \frac{f_\mathrm{s}}{\rho}, \qquad (6)$$
where f_s is the sample rate in Hz and ρ is the pulse density in pulses per second. The sample rate f_s = 48 kHz was used throughout this work. A single unit impulse is placed at a uniformly distributed random location within each grid segment. The impulse locations are computed with [8]
$$k(m) = \left\lfloor m T_\mathrm{d} + r_1(m)\,(T_\mathrm{d} - 1) \right\rceil, \qquad (7)$$
where m is the pulse index, ⌊·⌉ is the rounding operator, and r_1(m) is a uniform random number between 0 and 1. The sign of each pulse is given by
$$s(m) = 2\left\lfloor r_2(m) \right\rceil - 1, \qquad (8)$$
where r_2(m) is a uniform random number between 0 and 1. The stems in Fig. 1 correspond to the pulses of an OVN sequence.

Figure 1: Beginning of an EDVN IR (line) and the underlying OVN sequence (stems) with a pulse density ρ = 1500 pulses/s, using sample rate f_s = 48 kHz.
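The OVN generation of Eqs. (6)-(8) can be sketched as follows; the function name and default parameters are example assumptions, not part of the original algorithm description.

```python
import numpy as np

def original_velvet_noise(length, fs=48000, density=1500, rng=None):
    """Generate an OVN sequence and return it with its pulse locations and signs."""
    rng = np.random.default_rng() if rng is None else rng
    Td = fs / density                                                  # grid size, Eq. (6)
    m = np.arange(int(length / Td))                                    # pulse indices
    k = np.round(m * Td + rng.random(m.size) * (Td - 1)).astype(int)   # pulse locations, Eq. (7)
    s = 2 * np.round(rng.random(m.size)) - 1                           # random +/-1 signs, Eq. (8)
    ovn = np.zeros(length)
    ovn[k] = s
    return ovn, k, s
```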
EDVN is an extension of the OVN to achieve an arbitrary PSD [14]. In the EDVN, each unit impulse of the OVN is replaced with an arbitrary filter IR from a set of Q ≪ M predefined dictionary filters F_q (cf. Fig. 2a). Additionally, the pulse signs s(m) and pulse gains g(m), which are used to control the broadband decay, are combined into a single variable
$$g_\mathrm{s}(m) = s(m)\, g(m). \qquad (9)$$
Fig. 1 shows the beginning of an EDVN IR (line). Although the resulting IR is no longer sparse, the desired sparse-convolution property is retained within the delay line (purple) in Fig. 2a.
In Fig. 2a, the qth EDVN sub-sequence, containing the pulses routed to the qth dictionary filter, is given by
$$v_q(n) = \begin{cases} g_\mathrm{s}(m) & \text{for } n = k(m) \,\wedge\, \phi(m) = q,\\ 0 & \text{otherwise,} \end{cases} \qquad (10)$$
where q ∈ {1, 2, ..., Q} is the pulse-filter index, n is the sample index, g(m) is a gain term for each pulse, and φ(m) ∈ {1, 2, ..., Q} is a list of filter indices. The transfer function of the EDVN is given as
$$H(z) = \sum_{q=1}^{Q} V_q(z)\, F_q(z), \qquad (11)$$
where V_q(z) is the transfer function of the sub-sequence routed to the qth pulse filter with the transfer function F_q(z). An example of a generated sequence h(n) is shown in Fig. 1.
The pulse filters are selected based on the filter probabilities, denoted by the vector of probabilities for a single pulse,
$$\mathbf{p} = [p_1, p_2, \ldots, p_Q]^\mathrm{T} \geq 0, \quad \text{with} \quad \sum_{q=1}^{Q} p_q = 1, \qquad (12)$$
where [·]^T is the transpose operation. The list of filter indices for each pulse is then determined based on the pulse-filter probabilities p(m) with the following greedy selection:
$$\phi(m) = \arg\max_{q}\, \left\{ \left(\tau_q(m) + \epsilon\, r_q(m)\right) p_q(m) \right\}, \qquad (13)$$
where ε is a free parameter for the level of randomization, r_q(m) is a uniform random number between 0 and 1, and τ_q is the time (in pulse indices) since the qth dictionary filter was last selected.
Figure 2: Structures of (a) the single-channel EDVN convolution and (b) the proposed BDVN convolution with Q dictionary filters and M pulse gains. The translucent blocks (dotted line) are only needed for the cross-mixing version and are omitted when using the jitter version. The ±1 gains represent the random sign flips.
The value τ_q is updated sequentially based on the previous pulse-filter selection with
$$\tau_q(m+1) = \begin{cases} 0 & \text{for } \phi(m) = q,\\ \tau_q(m) + 1 & \text{otherwise.} \end{cases} \qquad (14)$$
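The routing and synthesis of Eqs. (10)-(14) can be summarized in the following sketch, which selects a dictionary filter for each pulse with the greedy rule of Eq. (13), builds the sparse sub-sequences of Eq. (10), and sums their filtered versions as in Eq. (11). The constant probability vector, the variable names, and the default randomization level are assumptions for illustration.

```python
import numpy as np
from scipy.signal import fftconvolve

def edvn(length, k, gs, filter_irs, p, eps=0.1, rng=None):
    """k: pulse locations, gs: signed pulse gains, filter_irs: Q dictionary-filter IRs,
    p: pulse-filter probabilities summing to one."""
    rng = np.random.default_rng() if rng is None else rng
    Q = len(filter_irs)
    tau = np.zeros(Q)                                      # pulses since each filter was last used
    sub = np.zeros((Q, length))                            # sparse sub-sequences v_q(n), Eq. (10)
    for m in range(len(k)):
        phi = np.argmax((tau + eps * rng.random(Q)) * p)   # greedy selection, Eq. (13)
        sub[phi, k[m]] = gs[m]                             # route the signed gain to filter phi
        tau += 1                                           # Eq. (14): increment all counters ...
        tau[phi] = 0                                       # ... and reset the selected one
    out = np.zeros(length)
    for q in range(Q):                                     # Eq. (11): sum of filtered sub-sequences
        out += fftconvolve(sub[q], filter_irs[q])[:length]
    return out
```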
Figure 3: (a) First four milliseconds of an incoherent two-channel EDVN IR and (b) of a BDVN IR using the jitter method. The underlying pulses are shown with stems for the right (red circle) and left (black asterisk) channels.
3. PROPOSED BINAURAL LATE-REVERBERATOR
Building upon the previous EDVN algorithm [14], this section introduces two variants of the BDVN reverberator. The first variant directly integrates the cross-mixing approach [3], relying on synthesizing two incoherent channel sequences. The second variant achieves the desired IC by jittering the pulse locations of one channel according to a jitter distribution derived from the CPSD. A MATLAB implementation of the proposed method is available online at https://github.com/Ion3rik/dark-velvet-noise-reverb.
3.1. Cross-Mixing-Based Coherence Matching
The cross-mixing requires synthesizing two incoherent sequences to provide an accurate coherence match [3]. Previous work on generating incoherent velvet-noise sequences relies on the permutation of the branches of the interleaved velvet-noise algorithm [10, 17]. In the current work, we propose to amend the EDVN algorithm [14] (cf. Fig. 2a) to create two incoherent outputs h_1 and h_2, as shown in Fig. 2b. By transposing the original structure of Fig. 2a, the dictionary filters and pulse gains can be shared between the two channels, assuming there are no large color differences between them. The only additions are a second delay line for the second channel and the channel gains G_1 and G_2.

The two outputs h_1 and h_2 of the respective delay lines become incoherent by randomizing the signs s(m) between the channels. The sign randomization is shown in Fig. 2b with the ±1 gains. The beginning of an example incoherent two-channel EDVN IR is shown in Fig. 3a, where some of the pulse signs are flipped and some are not. Over a long sequence, the random sign flips result in incoherent signals h_1(n) and h_2(n). Subsequently, the two incoherent outputs are cross-mixed, following the methodology outlined in Section 2.2, utilizing the filters U_1(z) and U_2(z) to attain the desired IC.
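At the pulse-sequence level, the sign randomization of Fig. 2b can be sketched as below: pulse locations and gains are shared between the channels, and only the signs are drawn independently. This is an illustrative simplification of the structure, not the authors' implementation, and the names are assumptions.

```python
import numpy as np

def incoherent_pulse_pair(length, k, g, rng=None):
    """k: shared pulse locations, g: shared (unsigned) pulse gains."""
    rng = np.random.default_rng() if rng is None else rng
    s1 = 2 * np.round(rng.random(len(k))) - 1     # random +/-1 signs for channel 1
    s2 = 2 * np.round(rng.random(len(k))) - 1     # independent random signs for channel 2
    v1, v2 = np.zeros(length), np.zeros(length)
    v1[k], v2[k] = s1 * g, s2 * g                 # same locations and gains, different signs
    return v1, v2
```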
3.2. Jitter-Based Coherence Matching
This section introduces the second variant of the BDVN, where the target IC is achieved by jittering the delays of the second channel based on those of the first channel. In contrast to the cross-mixing approach, the signs s(m) are shared between the channels in the jitter variant. Thus, the sign-flip blocks in Fig. 2b can be bypassed. By selecting appropriate jitter values, h_1 and h_2 achieve the target IC directly and thus correspond to h_L and h_R. Consequently, the cross-mixing block depicted in Fig. 2b can also be bypassed.

To find the probability mass function p(l) from which the jitter values Δ(m) are drawn, we assume without loss of generality that h_R is a jittered version of h_L. If the pulses, which are equivalent to the delays in Fig. 2, used to generate h_L are placed at k_L(m) = k(m) according to the EDVN algorithm, see Eq. (7), the pulses for h_R are placed at k_R(m) = k_L(m) + Δ(m), while making sure that no negative pulse locations are generated. Fig. 3b shows the beginning of an example jittered BDVN IR, where the jitter Δ(m) is visible as random time delays between the left (black) and right (red) channel pulses. The absence of the random sign flips is also evident in the figure.
The pulse sequences before applying the coloration filters are denoted here as v_L and v_R. These sequences would be obtained if all filters were set to unit impulses in Eq. (11). We utilize the fundamental relationship between the CPSD and the cross-correlation function r_LR, in that the CPSD can be found as the DTFT of the cross-correlation, as in
$$S_{LR}(\omega) = \sum_{l=-\infty}^{\infty} r_{LR}(l)\, e^{-\imath \omega l}, \qquad (15)$$
where the cross-correlation is
$$r_{LR}(l) = \mathrm{E}\{v_L(n)\, v_R(n+l)\}, \qquad (16)$$
in which E{·} denotes the expectation operator.
For the white, zero-mean pulse sequences, the expected value is only non-zero if the shift applied to v_L(n) is exactly l, i.e., Δ(n) = l:
$$r_{LR}(l) = \mathrm{E}\{v_L(n)\, v_L(n + \Delta(n))\} \qquad (17)$$
$$= \mathrm{E}\{v_L^2(n)\}\, p(\Delta(n) = l) \qquad (18)$$
$$= \sigma_L^2\, p(\Delta(n) = l), \qquad (19)$$
assuming that the value of the sequence is independent of the jitter process. Here, σ_L² is the variance of the sequence, which can be set to 1.
Now p(Δ(n) = l) = p(l) is the distribution from which the jitter values are drawn. Thus, using Eq. (15), the generated CPSD is found directly as the DTFT of the jitter distribution:
$$S_{LR}(\omega) = \sigma_L^2 \sum_{l=-\infty}^{\infty} p(l)\, e^{-\imath \omega l}. \qquad (20)$$
For synthesizing a sequence with a given coherence, the inverse DTFT can be used:
$$p(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \hat{S}_{LR}(\omega)\, e^{\imath \omega l}\, \mathrm{d}\omega. \qquad (21)$$
Figure 4: (a) The coherence estimated from the diffuse-field binaural noise (dotted) and from the resynthesized jittered velvet-noise sequence (blue). (b) The jitter distribution p obtained via solving the inverse problem of Eq. (21). The CPSD was analyzed from a 60-s binaural diffuse-field noise sequence, synthesized from a set of HRTF measurements.
The target CPSD Ŝ_LR used here should not simply be the numerator of Eq. (1). Coloration should not be modeled using the jitter distribution, as it is modeled separately by the dictionary filters. Therefore, whitening should be applied to the estimated CPSD. We used
$$\hat{S}_{LR}(\omega) = \frac{\sum_{\nu=\nu_0}^{\infty} H_L(\nu,\omega)\, H_R^{*}(\nu,\omega)}{\sqrt{\sum_{\nu=\nu_0}^{\infty} |H_L(\nu,\omega)|^{2}\, \sum_{\nu=\nu_0}^{\infty} |H_R(\nu,\omega)|^{2}}} \qquad (22)$$
to design the jitter distribution, the square of which is the coherence.
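A sketch of the whitened CPSD estimate of Eq. (22) is shown below, reusing STFT coefficients of the late BRIR such as those computed in the IC sketch of Sec. 2.1; the function name is an assumption.

```python
import numpy as np

def whitened_cpsd(HL, HR):
    """HL, HR: STFT coefficients of the late BRIR, shape (num_bins, num_frames)."""
    num = np.sum(HL * np.conj(HR), axis=-1)                       # numerator of Eq. (22)
    den = np.sqrt(np.sum(np.abs(HL) ** 2, axis=-1) * np.sum(np.abs(HR) ** 2, axis=-1))
    return num / np.maximum(den, 1e-12)       # complex-valued; its squared magnitude is the IC
```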
Once the distribution is obtained using Eq. (21), the jitter samples Δ(m) for each pulse need to be generated. Random samples from a unidimensional distribution can be drawn with the inversion method, i.e., drawing samples from a uniform distribution and mapping those values via the inverse cumulative density function (CDF) of the random variable:
$$\Delta(m) = F^{-1}\!\left(r_3(m)\right), \qquad (23)$$
where r_3(m) is a uniform random variable between 0 and 1, and F⁻¹ is the inverse CDF of p, obtained numerically.
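The two steps of Eqs. (21) and (23) can be sketched as follows: the jitter distribution is taken as the inverse DTFT of the whitened target CPSD evaluated on a dense FFT grid, and jitter values are then drawn with the inversion method. Clipping the inverse transform to non-negative values and renormalizing is a simplification of the inverse problem solved in the paper; the maximum lag and function names are assumptions.

```python
import numpy as np

def jitter_distribution(S_lr, max_lag):
    """S_lr: whitened target CPSD on nfft//2 + 1 rFFT bins."""
    nfft = 2 * (len(S_lr) - 1)
    r = np.fft.irfft(S_lr, nfft)                              # inverse DTFT, Eq. (21)
    p = np.concatenate((r[-max_lag:], r[:max_lag + 1]))       # lags -max_lag ... +max_lag
    p = np.clip(p, 0, None)                                   # crude projection onto a valid pmf
    return p / p.sum()

def draw_jitter(p, num_pulses, max_lag, rng=None):
    """Inversion method of Eq. (23): map uniform samples through the inverse CDF of p."""
    rng = np.random.default_rng() if rng is None else rng
    lags = np.arange(-max_lag, max_lag + 1)
    cdf = np.cumsum(p)
    idx = np.searchsorted(cdf, rng.random(num_pulses))
    return lags[np.minimum(idx, len(lags) - 1)]
```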
As an example, Fig. 4a shows the binaural diffuse-field coherence of the KU100 dummy head [18], estimated from binaural noise generated via filtering incoherent, white Gaussian noise with a set of measured HRTFs in a quasi-uniform grid. It also shows the synthesized coherence, which exhibits a close match to the target diffuse-field coherence. Fig. 4b shows the jitter distribution that was determined from the diffuse-field coherence with Eq. (21) and used for synthesis.
Table 1: Target BRIR parameters as averages between 500 Hz and 1 kHz, following the standard ISO 3382. The direct-to-reverberant ratio (DRR) is shown for the left and right ear responses. "Arni" refers to the variable acoustics room at the Aalto Acoustics Lab.

Target BRIR | Description         | T20 (s) | DRR (dB)
Room 1      | Arni Medium         | 0.49    | -7.6 / -2.8
Room 2      | Arni Reverberant    | 0.98    | -10.0 / -6.5
Room 3      | Pori Promenadi Hall | 2.24    | -19.0 / -18.0
4. EVALUATION
This section presents both objective and subjective assessments of
the proposed BDVN algorithm. Specifically, the proposed method
was used to model the binaural coherence of three distinct mea-
sured BRIRs, alongside room-independent binaural diffuse field
coherence derived from a collection of HRTF measurements. Fur-
thermore, the capabilities of the BDVN algorithm in synthesizing
altered coherence patterns beyond those inherent in natural binau-
ral listening scenarios, are explored. This includes discussing the
algorithm’s efficacy in implementing parametric width control for
coherence and its ability to generate diverse time-dependent coher-
ence modifications. Sound examples are provided online for the
curious reader to experience the quality of the BDVN algorithm
themselves 2.
4.1. Target Coherence Profiles
Three different BRIRs with varying reverberation times were selected to be modeled with the two proposed method variants. Table 1 shows the reverberation time and direct-to-reverberant ratio (DRR) estimated from the target BRIRs. The selected rooms are ordered from the driest (Room 1) to the most reverberant (Room 3). Rooms 1 and 2 correspond to measurements that were both made in the variable acoustics room "Arni" at the Aalto Acoustics Lab, with the room configured to two different settings. In both cases, the source was 3.3 m in front and 1.3 m to the right of the binaural receiver (a KEMAR head and torso simulator). The fact that the source was not central can be seen in the increased DRR at the right ear.
All the coherence estimates and the BDVN modeling were implemented for the late part (after 50 ms) of each BRIR. In addition to the target coherences estimated from the varied set of BRIRs, the binaural diffuse field of the KU100 shown above is included as one of the target coherences. The motivation is to investigate whether a room-independent binaural coherence provides a perceptually accurate approximation for each tested room. It is expected that different head sizes lead to slightly different IC, but studying these differences is beyond the scope of this paper.
The coherence of the selected BRIRs and the diffuse-field binaural coherence are shown in Fig. 5. It can be seen that the coherence of the measured BRIRs is always larger than or equal to that of the binaural diffuse-field coherence, across all frequencies. This is expected, as rooms are not expected to generate perfect diffuse fields (for example, due to anisotropy [19, 20]). In general, the coherence of each BRIR still follows the binaural diffuse-field coherence fairly closely.
Figure 5: Binaural diffuse-field coherence and the coherence of the three BRIRs beyond 50 ms.
4.2. Objective Evaluation
The spectrograms of the three target BRIRs (top) and the jitter-based BDVN model instances (bottom) tuned to the target BRIRs are shown in Fig. 6. Each model instance includes the original direct and early part from the target BRIR and the modeled late-reverberation part. The dictionary filters and the pre- and post-filters for the three cases were estimated from the left channel of each target BRIR, as no large spectral differences were identified between the left and right channels in Fig. 6. More details on fitting the model to a target response are provided in [14]. Each BDVN model instance utilized a pulse density of ρ = 1500 pulses/s, Q = 10 fifth-order all-pole dictionary filters, a single 10th-order post-filter, and a second-order DC-blocker. The computational costs of the method are analyzed in more detail in [13].
The reverberation-time (RT) estimates of the target BRIRs (dashed) and the model BRIRs (solid) are overlaid on the spectrograms. The median RT error over all frequencies is 3.2%, 1.7%, and 7.1% for Rooms 1, 2, and 3, respectively. A large error between the target and model RT can be observed for Room 1 at low frequencies (cf. Fig. 6a). However, the error arises mostly from the DC-blocker of the model, which by design removes low-frequency noise present in the target BRIR [14].
Three different BDVN models with different coherence-matching approaches were generated for each of the three target BRIRs. The coherence estimates are shown in Fig. 7. The incoherent version refers to using h_1 and h_2 of Fig. 2b directly, with the signs of the channels randomized. The cross-mixed version (BDVN mix) is then derived from h_1 and h_2 by applying the cross-mixing stage and taking the output from h_L and h_R. Finally, two versions utilizing the jitter approach were generated: one fitted to each measured coherence separately (BDVN jit.) and one fitted to the generic binaural diffuse-field coherence (BDVN jit. diff.).
The driest room, Room 1, shows some increase in coherence for the incoherent version (Fig. 7a, green) due to the limited length of the underlying sequences. Room 2 in Fig. 7b shows the largest difference between the binaural diffuse-field match (BDVN jit. diff.) and the measurement matches (BDVN jit. and BDVN mix). For Room 3 (Fig. 7c), all the versions show a slightly lower coherence at lower frequencies compared to the measured coherence (Ref).
Figure 6: Spectrograms of the three measured BRIRs (top) and the corresponding BDVN jit. model instances (bottom) of (a) Room 1, (b) Room 2, and (c) Room 3. The reverberation-time estimates of the target BRIRs (dotted) and the BDVN model instances (solid) are overlaid on the spectrograms.

Figure 7: Coherence of BDVN model instances tuned to (a) Room 1, (b) Room 2, and (c) Room 3.

4.3. Perceptual Test

A Multiple Stimuli with Hidden Reference and Anchor (MUSHRA)-like test was conducted to evaluate the similarity between modeled and measured BRIRs. In total, 16 participants took the test (mean age 27.6 years, standard deviation 3.9 years). All of the participants reported having normal hearing, and 15 of them had previous experience with formal listening tests. The listening test was implemented with the webMUSHRA platform [22] and conducted in the sound-proofed listening booths at the Aalto Acoustics Lab using Sennheiser HD-650 headphones.
The task in the test was to assess the similarity of renderings using artificial reverberation to a reference measured BRIR. The five conditions discussed above, whose coherences are shown in Fig. 7, were used. In addition, monaural reverberation, which was the right channel of BDVN jit., served as an anchor, yielding six conditions in total. The test signals included male singing, a drum loop, and pink noise. With these samples, we aimed to answer the following three questions: 1) Does applying binaural coherence yield an improvement over using incoherent channels? 2) Is there a difference between the cross-mixing- and the jitter-based BDVN? 3) Is fitting the coherence to each room necessary, or can the diffuse-field coherence be used for all rooms?

Four participants were excluded from the analysis, since they gave a rating lower than 95 for the reference in more than 15% of the trials. The results are shown in Fig. 8, where different signals are marked by different symbols. In general, the highest scores were obtained using singing, especially in Room 1.
Figure 8: Perceptual test results for the three rooms and three signals, plotted with the violinplot function [21]. The boxplot shape is included as a black line in the center of each violin. The large black dot, gray square, and white triangle indicate the median results for singing, drums, and pink noise, respectively. The horizontal lines indicate the overall medians, and the bottom and top edges of the boxes indicate the 25th and 75th percentiles, respectively. The violin outline shows the kernel density estimation.

For statistical analysis, non-parametric tests were used, since not all samples could be considered to stem from a normal distribution. Disregarding the anchor and the reference, Friedman tests indicated significant differences between conditions in all three rooms, with Chi-squared values of 17.77, 26.55, and 44.55 for Rooms 1, 2, and 3, respectively. All of these lead to p < 10^-3. Pairwise comparisons using Wilcoxon signed-rank tests with the Bonferroni-Holm correction revealed significant differences between incoherent noise and the three BDVN conditions. For Room 1, the differences were the smallest: the difference in median between the incoherent generation and the BDVN condition with the lowest performance (BDVN Jit. Diff.) was 11 points, Z = -3.5, p < 0.001. In the other two rooms, the differences between incoherent EDVN and all BDVN variants were larger; for Room 2, the difference between the incoherent generation and BDVN Mix (the lowest-performing BDVN alternative in this case) was 17.5 points, Z = -3.09, p = 0.002; for Room 3, the difference between incoherent signals and BDVN Mix was 26 points, Z = -3.83, p < 0.001. Thus, using BDVN consistently improves the match compared to using incoherent signals for both ears.
Comparing the BDVN variants to each other, some trends, but no significant differences, were found. Interestingly, for Room 1, all methods performed equally well, with median scores of 80.5 for BDVN Mix, 79.0 for BDVN Jit., and 76.5 for BDVN Jit. Diff. For Rooms 2 and 3, the BDVN with cross-mixing performed slightly worse than the jitter approach. For Rooms 2 and 3, the BDVN with the jitter approach fitted to the room and the one using the diffuse-field coherence yielded almost identical results (medians both 76.0 for Room 2, and 70.0 vs. 69.5 for Room 3). Thus, matching the diffuse-field coherence yields favorable results for the given rooms, particularly in the more reverberant Rooms 2 and 3.
4.4. Beyond Natural Coherence
Besides replicating natural binaural coherence in physical rooms,
the jitter-based BDVN variant facilitates the generation of artis-
tic two-channel reverberation effects. These include a parametric
width control and various time-dependent effects. Fig. 9a shows a
parametric Hann-window distribution, with various maximum jit-
ters. Changing the maximum jitter of the distribution alters the IC
as shown in Fig. 9b, which in turn alters the perceived width of the
reverberation. Increasing the maximum jitter reduces coherence,
notably shifting the IC cutoff towards lower frequencies. Further-
more, by adjusting the selected parametric distribution, tuning the
maximum jitter can effectively match the binaural diffuse field co-
herence (dashed line).
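A sketch of this parametric width control, under the assumption of unit-variance white pulse sequences so that the IC is simply the squared magnitude of the DTFT of the jitter distribution (Eqs. (1) and (20)), is given below. The Hann-window parameterization follows Fig. 9a, while the function name and FFT length are assumptions.

```python
import numpy as np

def hann_jitter_coherence(max_jitter, nfft=4096):
    """Return the frequency grid (rad/sample) and the IC implied by a Hann-window
    jitter distribution with the given maximum jitter (in samples)."""
    lags = np.arange(-max_jitter, max_jitter + 1)
    p = 0.5 * (1 + np.cos(np.pi * lags / max_jitter))   # Hann window over the lags
    p /= p.sum()                                          # normalize to a pmf
    omega = 2 * np.pi * np.arange(nfft // 2 + 1) / nfft
    S = sum(pl * np.exp(-1j * omega * l) for l, pl in zip(lags, p))   # DTFT, Eq. (20)
    return omega, np.abs(S) ** 2                          # squared magnitude gives the IC
```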
The maximum jitter can be made time-dependent without extra computational load, enabling reverberation with time-varying coherence and thus time-varying perceived source width. Fig. 10a shows a coherence profile that becomes more and more coherent towards the end of the BDVN response. The effect is implemented by decreasing the maximum jitter (i.e., the width of the jitter distribution of Fig. 9a) over time. This creates an unnatural collapsing sensation, where the perceived width of the late reverberation starts wide and then collapses towards the middle. Another possible time-dependent coherence effect is presented in Fig. 10b, where the maximum jitter is modulated with a low-frequency (2 Hz) sine wave. The resulting perceptual effect is a modulated panning-like sensation in the reverberant tail, as the two channels move between incoherent and coherent.
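A sketch of such a time-dependent coherence effect is shown below: the instantaneous maximum jitter is modulated with a 2-Hz sine over the pulse locations, and each right-channel pulse is offset by a jitter drawn from a distribution of that instantaneous width. Drawing from a uniform distribution scaled by the instantaneous maximum is a simplification of modulating the Hann-window width; the function name and defaults are assumptions.

```python
import numpy as np

def sine_modulated_jitter(k, fs=48000, max_jitter=50, mod_hz=2.0, rng=None):
    """k: left-channel pulse locations (samples); returns right-channel pulse locations."""
    rng = np.random.default_rng() if rng is None else rng
    t = k / fs                                                           # pulse times in seconds
    inst_max = 0.5 * max_jitter * (1 + np.sin(2 * np.pi * mod_hz * t))   # time-varying jitter width
    jit = np.round((2 * rng.random(len(k)) - 1) * inst_max).astype(int)  # uniform in +/- inst_max
    return np.maximum(k + jit, 0)            # avoid negative pulse locations, as in Sec. 3.2
```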
5. CONCLUSIONS
BDVN is proposed in this paper as a two-channel extension of the previous EDVN algorithm, capable of synthesizing binaural coherence between two output channels. The desired coherence can be generated either by cross-mixing and filtering two incoherent channels or with the jitter-based approach. For the latter, the derivation of the jitter distribution from a target coherence profile is presented, and a numerical simulation is provided to confirm the method's effectiveness. The jitter- and cross-mixing-based BDVN instances were parametrized to model three different BRIRs and compared objectively and subjectively. The objective evaluation shows that the coherence of each target BRIR can be matched accurately with both alternatives. Furthermore, it is shown that each of the measured BRIRs' coherence is similar to a generic, diffuse-field binaural coherence.

The perceptual test revealed a significant difference between incoherent rendering and the BDVN methods. The cross-mixing- and jitter-based methods showed no significant differences between each other. Moreover, similar scores were obtained when matching the room-specific coherence or the generic binaural diffuse-field coherence, suggesting that a room-independent static binaural rendering might be a perceptually accurate model for synthesizing the binaural late reverberation of any room.

Finally, the benefits of the jitter-based method over cross-mixing were highlighted by introducing a parametric width control for the jitter-based approach. Additionally, it was shown that a time-dependent jitter distribution can be designed with no added computational load, generating various exciting artistic reverberation effects beyond natural reverberation. The effects presented here include widening or narrowing the width over time and applying arbitrary IC modulation.
Figure 9: (a) Hann-window jitter distribution with various maximum jitters and (b) the corresponding coherence. The dotted line shows the diffuse-field binaural coherence.

Figure 10: (a) Increasing coherence perceived as the reverb image collapsing towards the middle and (b) sine-modulated coherence generating a subtle panning-like movement in the reverberation.

6. REFERENCES

[1] J.-M. Jot, V. Larcher, and O. Warusfel, "Digital signal processing issues in the context of binaural and transaural stereophony," in Proc. AES 98th Conv., Feb. 1995.
[2] F. Menzer and C. Faller, "Investigations on modeling BRIR tails with filtered and coherence-matched noise," J. Audio Eng. Soc., 2009.
[3] F. Menzer and C. Faller, "Binaural reverberation using a modified Jot reverberator with frequency-dependent interaural coherence matching," in Proc. 111th AES Conv., Munich, Germany, Jan. 2009, paper 7765.
[4] F. Menzer, "Binaural reverberation using two parallel feedback delay networks," in Proc. AES 40th Int. Conf. Spatial Audio: Sense the Sound of Space, Oct. 2010.
[5] C. Kirsch, J. Poppitz, T. Wendt, S. van de Par, and S. D. Ewert, "Computationally efficient spatial rendering of late reverberation in virtual acoustic environments," in Proc. Immersive and 3D Audio (I3DA), Sep. 2021, pp. 1–8.
[6] N. Agus, H. Anderson, J.-M. Chen, S. Lui, and D. Herremans, "Minimally simple binaural room modeling using a single feedback delay network," J. Audio Eng. Soc., 2018.
[7] C. Kirsch, J. Poppitz, and T. Wendt, "Spatial resolution of late reverberation in virtual acoustic environments," Trends in Hearing, vol. 25, 2021.
[8] M. Karjalainen and H. Järveläinen, "Reverberation modeling using velvet noise," in Proc. 30th AES Int. Conf. Intelligent Audio, Saariselkä, Finland, Mar. 2007.
[9] K. Lee, J. S. Abel, V. Välimäki, T. Stilson, and D. P. Berners, "The switched convolution reverberator," J. Audio Eng. Soc., vol. 60, no. 4, pp. 227–236, Apr. 2012.
[10] V. Välimäki and K. Prawda, "Late-reverberation synthesis using interleaved velvet-noise sequences," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 29, pp. 1149–1160, Feb. 2021.
[11] B. Holm-Rasmussen, H.-M. Lehtonen, and V. Välimäki, "A new reverberator based on variable sparsity convolution," in Proc. Int. Conf. Digital Audio Effects (DAFx), Maynooth, Ireland, Sep. 2013, pp. 344–350.
[12] V. Välimäki, B. Holm-Rasmussen, B. Alary, and H.-M. Lehtonen, "Late reverberation synthesis using filtered velvet noise," Appl. Sci., vol. 7, no. 5, May 2017.
[13] J. Fagerström, N. Meyer-Kahlen, S. J. Schlecht, and V. Välimäki, "Dark velvet noise," in Proc. Int. Conf. Digital Audio Effects (DAFx), Vienna, Austria, Sep. 2022, pp. 192–199.
[14] J. Fagerström, S. J. Schlecht, and V. Välimäki, "Non-exponential reverberation modeling using dark velvet noise," J. Audio Eng. Soc., vol. 72, no. 6, pp. 370–382, Jun. 2024.
[15] P. Goetz, K. Kowalczyk, A. Silzle, and E. Habets, "Mixing time prediction using spherical microphone arrays," J. Acoust. Soc. Am., vol. 137, pp. 206–212, Feb. 2015.
[16] N. Meyer-Kahlen, S. J. Schlecht, and V. Välimäki, "Colours of velvet noise," Electron. Lett., vol. 58, no. 12, pp. 495–497, Jun. 2022, https://doi.org/10.1049/ell2.12501.
[17] K. Prawda, S. J. Schlecht, and V. Välimäki, "Multichannel interleaved velvet noise," in Proc. Int. Conf. Digital Audio Effects (DAFx), Vienna, Austria, Sep. 2022, pp. 208–215.
[18] B. Bernschütz, "A spherical far field HRIR/HRTF compilation of the Neumann KU100," 2013.
[19] B. Alary, P. Massé, S. J. Schlecht, M. Noisternig, and V. Välimäki, "Perceptual analysis of directional late reverberation," J. Acoust. Soc. Am., vol. 149, no. 5, pp. 3189–3199, May 2021.
[20] M. Berzborn and M. Vorländer, "Directional sound field decay analysis in performance spaces," Building Acoustics, vol. 28, no. 3, pp. 249–263, Sep. 2021.
[21] B. Bechtold, "Violin Plots for Matlab," GitHub project, available at https://github.com/bastibe/Violinplot-Matlab, accessed Mar. 27, 2024.
[22] M. Schoeffler, S. Bartoschek, F.-R. Stöter, M. Roess, S. Westphal, B. Edler, and J. Herre, "WebMUSHRA—A comprehensive framework for web-based listening tests," J. Open Res. Softw., vol. 6, no. 1, pp. 1–8, Feb. 2018.