Autonomous robot twin system for room acoustic measurements

Georg Götz*, Abraham Martinez Ornelas, Sebastian J. Schlecht, and Ville Pulkki

Aalto Acoustics Lab, Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland
Whilst room acoustic measurements can accurately capture the sound field of real rooms,
they are usually time consuming and tedious if many positions need to be measured. Therefore,
this contribution presents the Autonomous Robot Twin System for Room Acoustic Measure-
ments (ARTSRAM) to autonomously capture large sets of room impulse responses with vari-
able sound source and receiver positions. The proposed implementation of the system consists
of two robots, one of which is equipped with a loudspeaker, while the other one is equipped
with a microphone array. Each robot contains collision sensors, thus enabling it to move autonomously within the room. The robots move according to a random walk procedure to ensure a large variability between measured positions. A tracking system provides position data
matching the respective measurements. After outlining the robot system, this paper presents a
validation, in which anechoic responses of the robots are presented and the movement paths
resulting from the random walk procedure are investigated. Additionally, the quality of the ob-
tained room impulse responses is demonstrated with a sound field visualization. In summary,
the evaluation of the robot system indicates that large sets of diverse and high-quality room
impulse responses can be captured with the system in an automated way. Such large sets of
measurements will benefit research in the fields of room acoustics and acoustic virtual reality.
0. Introduction

Various physical phenomena interact when sound propa-
gates inside a room. Consequently, the resulting sound field
and its spatial variations can be fairly complex [1, 2]. Be-
tween any two positions, the combined effects on sound
waves can be summarized in terms of a transfer-function.
Its time-domain representation is called the room impulse response (RIR).
Knowledge about the sound field in rooms is important
for various applications, including acoustic planning in ar-
chitectural acoustics [3–7], source localization, enhance-
ment, and separation algorithms [8–10], active room com-
pensation systems [11], or six-degrees-of-freedom (6DoF)
audio rendering for computer games and virtual / aug-
mented reality (VR/AR) [3, 12]. In general, the sound field
inside a room can be reconstructed over large areas if the
spatial Nyquist theorem is fulfilled [13], i.e., if the sound
field is sampled with a sufficient quantity of measurement
points. The theoretically required number of measurements
can become large, especially if the sound field consists
of high frequencies [13]. Various approaches exist to re-
duce the required number of measurements with technical
means [14–16]. For 6DoF audio rendering, inaccuracies of
human hearing can be exploited to reduce the number of
required measurements while ensuring a plausible listen-
ing experience [17].
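As a back-of-the-envelope illustration of the spatial Nyquist criterion (our own sketch, not a calculation from the paper), the half-wavelength spacing and the resulting number of grid points for a rectangular area can be computed as follows; the 6 m × 4 m area and the 1 kHz upper frequency are example values:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, at roughly 20 degrees Celsius

def max_spacing(f_max_hz: float) -> float:
    """Spatial Nyquist criterion: sample at least twice per wavelength,
    i.e. spacing <= lambda/2 = c / (2 * f_max)."""
    return SPEED_OF_SOUND / (2.0 * f_max_hz)

def num_points(area_x_m: float, area_y_m: float, f_max_hz: float) -> int:
    """Number of grid points needed to cover a rectangular area."""
    spacing = max_spacing(f_max_hz)
    nx = math.ceil(area_x_m / spacing) + 1
    ny = math.ceil(area_y_m / spacing) + 1
    return nx * ny

# Sampling a 6 m x 4 m area up to 1 kHz already requires a dense grid:
d = max_spacing(1000.0)            # ~0.17 m spacing
n = num_points(6.0, 4.0, 1000.0)   # several hundred measurement points
```

This simple count grows quadratically with the upper frequency, which illustrates why automating the measurements quickly pays off.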
RIRs are usually obtained by computing simulations
or by conducting room acoustic measurements. Although
room simulation algorithms have been extensively stud-
ied, current approaches still lack the accuracy and per-
ceptual quality that can be achieved with room acous-
tic measurements [18]. However, room acoustic measure-
ments require considerable measurement effort, thus mak-
ing them impractical for collecting large amounts of RIRs.
In principle, room acoustic measurements do not neces-
sarily need to be conducted manually by an experimenter.
Instead, they could be automated by using robots to save
valuable hours of work. Previous studies demonstrate that
room acoustic measurements can be automated with robot
systems [5–7, 9, 10, 19]. However, some questions still re-
main open regarding the flexibility and versatility of the
robots, the collection of position-dependent RIRs at room
scale, and the system’s autonomy. In this paper, we want to
address these questions by proposing a new robot system
for RIR measurements. We believe that this is a good op-
portunity to share design decisions and insights we gained
while developing the system. This might help other researchers to speed up the process of building similar systems.

J. Audio Eng. Soc., Vol. 1, No. 1, 2021 January
This is the author’s accepted manuscript of the paper. It is published under the AES “green” open access policy. The published version of the paper (JAES Vol. 69, No. 4, April 2021) can be accessed from the AES E-Library.
The main objective of this project was to develop
a flexible, versatile, and cost-efficient RIR measure-
ment system, utilizing off-the-shelf components. The
resulting system should be capable of measuring large
quantities of position-dependent RIRs at room-scale in
various types of rooms. Therefore, we introduce the
Autonomous Robot Twin System for Room Acoustic
Measurements (ARTSRAM). The ARTSRAM consists
of two robots, one of which is equipped with a loud-
speaker, thus acting as a sound source. The other robot is
equipped with a first-order microphone array to capture
RIRs. Both robots are equipped with collision sensors,
thus allowing them to move autonomously within a room
without requiring any prior mapping of the room or ex-
ternal control by an experimenter. The robots are moving
according to a random walk algorithm [20, 21]. A tracking
system provides position data of both robots matching the
respective measurements. Investigations of the random
walk implementation in a test scenario and simulations
indicate a sufficiently uniform coverage of the room. Furthermore, measurements acquired with the ARTSRAM provide reasonable room acoustic information, which is used in this paper to calculate a sound field visualization of a shoebox-shaped room.
The remainder of this paper is organized as follows. Sec-
tion 1 reviews prior work on robot systems for room acous-
tic measurements. Section 2 clarifies the design principles
behind the ARTSRAM and highlights how they differ from
previous studies. Section 3 introduces the robot system and
describes its setup as well as all of the used components. In
Section 4, the system is validated from four perspectives.
Firstly, anechoic measurements of the utilized mobile loud-
speaker are presented. Secondly, they are compared with
anechoic measurements of the entire robot system to deter-
mine the sound field alterations caused by the robot fixture.
Thirdly, we evaluate the proposed random walk procedure
in a test measurement and compare the results with simu-
lations. Finally, we use a set of ARTSRAM measurements
and demonstrate their validity by calculating a sound field
visualization for the measured room. Section 5 discusses
the results and Section 6 concludes the paper.
1. Prior Work

Robots have already been used in previous studies for
conducting various room acoustic measurement tasks. For
example, Witew et al. [5] built a truss-based apparatus to
move 32 microphones with an arbitrarily fine resolution
along a 5.3 m × 8 m measurement area. Using this device,
they measured large sets of RIRs in an automated way to
visualize the sound fields of a concert hall and demonstrate
the scattering effect caused by the auditorium seats. Xi-
ang et al. [6] used a step-motor-driven mechanism to move
microphones along a grid inside a scale model. The moti-
vation behind their work was to investigate the relationship
between receiver positions, aperture size, and the sound decay of coupled rooms. Čmejla et al. [9] used a robot arm at a fixed position inside a 6 m × 6 m room to move a loudspeaker within a 46 cm × 36 cm × 32 cm cube. They measured RIRs with six static microphone arrays, while moving the loudspeaker inside the cube with a uniform horizontal and vertical resolution of 2 cm and 4 cm, respectively.
The measurements were conducted in a laboratory room
with variable reverberation times to build up a database
of RIRs that can be used to evaluate source enhancement,
localization, and separation algorithms. Uehara et al. [7]
proposed to use measurements conducted by a robot for
visualizing the distribution of room acoustic parameters.
Their developed robot is operated with a remote control.
It carries a cardioid microphone to measure RIRs at var-
ious positions in a room, while the sound source is kept
stationary. Other recent work by Feng et al. [19] used mo-
bile robots for conducting in-situ surface impedance mea-
surements of several material samples inside a room. After
a preliminary decomposition of the scene and a selection
of all desired measurement positions, the robots work au-
tonomously. Lastly, Le Roux et al. [10] conceptualized a
robot system for collecting datasets of speech utterances
inside rooms with various acoustic environments. Their
robot was mainly conceptualized for collecting datasets of
speech, but they hypothesized that their robot system could
also be used for acquiring RIRs in the future.
Although robot systems have previously been used for
room acoustic measurements, none of them has met our
requirements. In this project, our goal was to develop a
system that is flexible regarding different rooms, measures
RIRs at room scale, does not require prior path and grid planning, mainly uses off-the-shelf components, and works
without external supervision. In this section, we want to
highlight why these design principles were important fac-
tors during the development of the ARTSRAM.
A desirable property of an RIR measurement robot would be that it can be used in many different rooms and in an ad hoc way, without setting up large amounts of additional hardware or cables. The installation time of the robot
should be low to leverage the full potential of automating
the measurements. This is especially important if the sys-
tem is supposed to be used for different rooms. A measure-
ment system that takes a long time to set up might easily
end up being more inefficient than manual measurements
by an experimenter when many rooms are measured.
Furthermore, it is important to think about whether the
system is supposed to be used on a microscopic or a macro-
scopic scale. When it is required to cover only a small mea-
surement area, a simple robot arm or turntable might be
sufficient. In contrast, RIR measurement over large areas
of the room would require robots that are capable of cover-
ing bigger distances.
Another important factor that needs to be considered
in large-scale measurements is the autonomy of the robot
system. In grid-based approaches, robots execute measure-
ments along pre-defined grid points in a room. Such sys-
tems can work nicely in regular geometries, but setting
up grids and corresponding movement paths for irregu-
lar room shapes or rooms with obstacles may be time-consuming. Moreover, if the robots are supposed to move
automatically, they require some mechanism that enables
them to accurately move to a specific grid point and ver-
ify their position. A precise positioning of the robots can
be realized using position-tracking solutions or computer-
vision-based approaches with visual markers. Both alter-
natives require considerable implementation efforts, espe-
cially if the robots are supposed to be used in different
rooms. Additionally, accurate positioning may be disturbed
by non-ideal lighting conditions or electromagnetic inter-
ference with the tracking system.
In contrast, measurement procedures based on random
walks do not require prior path and grid planning. A ran-
dom walk generates a path by making random steps. Each
random step involves moving into a random direction with
a random step size [20]. Random walks can also generate
paths on pre-defined graphs. In this case, each step con-
sists of randomly choosing a neighbouring node [21]. The
theory of random walks is well researched. For scenarios
with limited complexity, quantitative measures, such as the
time until a node is first visited (access time), the time un-
til the walk returns to its origin (return time), or the time
until all nodes were visited (cover time), can be derived
analytically [20–22]. For example, finite graphs consisting of N nodes have a cover time with a lower bound of N log(N) [22]. In our use case, such a lower bound translates into the intuitive assumption that RIR measurement
robots moving according to a random walk will require
more steps to reach a certain coverage than grid-based
robots. This drawback is balanced by the reduced path
planning time and the increased flexibility of the random
walk procedure.
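To make the cover-time bound concrete, the following sketch (our own illustration using a small grid graph as an example, not part of the paper) simulates random walks until every node has been visited and compares the average number of steps against the N log(N) lower bound:

```python
import math
import random

def cover_time_grid(n_side: int, rng: random.Random) -> int:
    """Number of steps until a simple random walk has visited every
    node of an n_side x n_side grid graph at least once."""
    n = n_side * n_side
    pos = (0, 0)
    visited = {pos}
    steps = 0
    while len(visited) < n:
        x, y = pos
        # Neighbouring nodes inside the grid (4-connectivity).
        neighbours = [(px, py)
                      for px, py in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                      if 0 <= px < n_side and 0 <= py < n_side]
        pos = rng.choice(neighbours)
        visited.add(pos)
        steps += 1
    return steps

rng = random.Random(0)
n_side = 5
n = n_side * n_side               # N = 25 nodes
runs = [cover_time_grid(n_side, rng) for _ in range(50)]
avg = sum(runs) / len(runs)
lower_bound = n * math.log(n)     # ~80.5 steps for N = 25
```

In line with the bound, the simulated cover times stay well above the minimum of N − 1 steps, which mirrors the trade-off discussed above: more steps than a grid-based procedure, but no path planning.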
The aforementioned design principles also impose cer-
tain limitations on the choice of components. For exam-
ple, the desired mobility of the robots reduces the alter-
natives for loudspeakers and microphones, because weight
and cabling need to be considered. Furthermore, the au-
tonomy of robots and their versatility with respect to room
types requires a robot system with collision sensors or a
certain awareness of its surroundings. Our design choices
will therefore be highlighted in the following section.
3. Robot-based measurement system

In this section, we describe the novel Autonomous
Robot Twin System for Room Acoustic Measurements
(ARTSRAM). The ARTSRAM is capable of measuring
room impulse responses (RIRs) with variable sound source
and receiver positions. It consists of two independent
robots that are able to move freely in a room. Both robots
are equipped with collision sensors, thus allowing them
to explore the room autonomously. The measurements of
RIRs are complemented with corresponding position infor-
mation of the robots.
3.1. Measurement system overview

The aim of this section is to give an overview of the
ARTSRAM and the communication between its compo-
nents. More detailed elaborations on the single components
will follow in the subsequent sections.
Figure 1 illustrates the overall structure of the robot sys-
tem. The base for each robot is an iRobot Create 2 Roomba
robot, which is controlled by a Raspberry Pi single-board
computer. HTC Vive trackers of the second generation are
used for tracking the position of the robots. Although it
would be possible to mount several microphone arrays
and loudspeakers on each of the robots to enable multi-
way RIR measurements, we chose to clearly distinguish
between a source and a receiver robot in this paper. The
source robot uses a Minirig MRBT-2 portable loudspeaker
to play back excitation signals that are recorded by the re-
ceiver robot with a Zoom H3-VR first-order microphone
array. The resulting RIRs are stored on a microSD card in-
serted into the receiver robot’s Raspberry Pi.
The entire measurement procedure is implemented in
Python. It is controlled by a main measurement script run-
ning on a separate measurement laptop. The main measure-
ment script sends commands over a TCP network socket to
the Raspberry Pis of the source and receiver robot. Sub-
sequently, the corresponding server scripts running on the
Raspberry Pis handle the commands. For example, the
server scripts can trigger robot movements or an RIR mea-
surement. The HTC Vive trackers directly communicate
with the measurement laptop over another wireless con-
nection, thus allowing the main measurement script to im-
mediately access and store the positions of both robots.
The Python source code of the ARTSRAM is publicly available, cf. Section 8.
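The command flow described above can be sketched as follows. This is a minimal illustration of the pattern (a main script sending newline-terminated commands over a TCP socket to a server script), not the ARTSRAM source code; the command name MOVE_RANDOM and the message format are hypothetical:

```python
import socket
import threading

def robot_server(ready: threading.Event, port_holder: list) -> None:
    """Minimal stand-in for the server script on a robot's Raspberry Pi:
    accepts one connection, reads one newline-terminated command,
    and acknowledges it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("127.0.0.1", 0))   # let the OS pick a free port
        srv.listen(1)
        port_holder.append(srv.getsockname()[1])
        ready.set()
        conn, _ = srv.accept()
        with conn:
            command = conn.recv(1024).decode().strip()
            # A real server script would trigger a robot movement or an
            # RIR measurement here, depending on the received command.
            conn.sendall(f"OK {command}\n".encode())

def send_command(port: int, command: str) -> str:
    """Stand-in for the main measurement script on the laptop."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.sendall((command + "\n").encode())
        return sock.recv(1024).decode().strip()

ready, port_holder = threading.Event(), []
server = threading.Thread(target=robot_server, args=(ready, port_holder))
server.start()
ready.wait()
reply = send_command(port_holder[0], "MOVE_RANDOM")  # hypothetical command
server.join()
```

A request–acknowledge protocol of this kind keeps the robots simple: all sequencing logic (move, wait, measure, store position) stays in the main script on the laptop.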
3.2. Robot components in detail

This section provides more detailed descriptions of the single components of the ARTSRAM. Additionally, we aim to explain our motivation behind choosing the components.

3.2.1. Robot base

The iRobot Create 2 Roomba robot is used as a base
for the measurement robots. The robot has a circular shape
with a diameter of approximately 35 cm and a height of
approximately 10 cm. It is equipped with collision and step sensors.
The Roomba robots are connected over serial port to
USB connectors with the Raspberry Pis, thus enabling the
server scripts running on the Raspberry Pis to trigger move-
ments of the robots or access the robot’s sensor informa-
tion. Several high-level APIs are available for this purpose,
such as the Python package irobot [23], which was also
used in this project. Due to the Roomba's construction, the robots can either spin around their center axis or move along straight or curved paths at floor level. This means the robots can effectively move with three degrees of freedom. Collision sensor information is used to prevent
the robots from hitting obstacles while exploring the room.
The vacuum functionality of the robots is disabled through-
out the whole measurement session to ensure a silent mea-
surement environment.
Fig. 1: Block diagram of the proposed implementation of the Autonomous Robot Twin System for Room Acoustic Mea-
surements (ARTSRAM). The ARTSRAM can be used to autonomously measure large sets of room impulse responses
(RIRs) for various source-to-receiver combinations. The system can easily be extended with additional loudspeakers and
microphone arrays.
Fig. 2: The proposed robot system inside an anechoic chamber.
3.2.2. Additional Fixture

An additional fixture is mounted onto both of the robots
to attach the microphone array and loudspeaker at a height
of approximately 1.3 m. A picture of the whole apparatus
is shown in Figure 2. A circular wooden plate was attached
onto the Roomba robot. Additionally, a chicken fence was
formed into a cylinder and fixed on the wooden plate. The
construction allows a partial disassembly of the robot sys-
tem for easier transportation. Similar to the mesh floor in
an anechoic chamber, a chicken fence has minimal influ-
ence on the sound field around the robots. A detailed eval-
uation of the sound field alterations by the robot fixtures is
presented in Section 4.2.
In principle, the currently used fixture could be ex-
tended with additional platforms at various heights inside
the chicken fence to mount further loudspeakers or micro-
phone arrays. This would allow capturing different RIRs
or transfer-functions, for example from floor to ear height.
Such a transfer-function could be useful for the auraliza-
tion of footsteps in acoustic virtual reality scenarios.
3.2.3. Position Tracking

HTC Vive trackers were used to track the position of
the robots. They are an affordable and off-the-shelf po-
sition tracking solution based on measuring synchronized
light sweeps emitted by the corresponding HTC Vive light-
houses. The trackers have been used by various scientific
projects and have a reportedly high accuracy and preci-
sion that can reach the millimetre range [24, 25]. The mea-
surement script accesses the position data over Valve’s
OpenVR SDK and its respective Python bindings [26].
Two HTC Vive lighthouses are required for a maximum
measurement area of 5 m by 5 m. However, the measure-
ment area could potentially be extended up to 10 m by 10 m
with two additional lighthouses.
3.2.4. Microphone array

In order to obtain spatial information about the sound
field, the measurement system requires a microphone array
consisting of multiple microphone capsules. Consequently,
an RIR is measured for every microphone capsule of the array. This allows transforming the measured RIRs to Ambisonic streams, which encode the sound field around the
microphone array with a given resolution that depends on
the number of capsules in the array [27, 28]. The resolution
can be described by the order of the Ambisonic stream.
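Note that the transformation from raw capsule signals (A-format) to first-order Ambisonics (B-format) follows a simple textbook sum-and-difference scheme for an idealized tetrahedral array. The sketch below is an illustration of that scheme only; the capsule labels and the ideal cardioid pattern are assumptions of this sketch, not a description of any particular device's internal processing:

```python
import math

# Capsule orientations of an idealized tetrahedral array (unit vectors):
# front-left-up, front-right-down, back-left-down, back-right-up.
s = 1.0 / math.sqrt(3.0)
CAPSULES = {
    "FLU": (s, s, s),
    "FRD": (s, -s, -s),
    "BLD": (-s, s, -s),
    "BRU": (-s, -s, s),
}

def cardioid_gain(capsule_dir, source_dir):
    """Ideal cardioid pickup pattern: 0.5 * (1 + cos(angle))."""
    dot = sum(c * d for c, d in zip(capsule_dir, source_dir))
    return 0.5 * (1.0 + dot)

def a_to_b(flu, frd, bld, bru):
    """Classic A-format to first-order B-format conversion."""
    w = flu + frd + bld + bru   # omnidirectional component
    x = flu + frd - bld - bru   # front-back figure-of-eight
    y = flu - frd + bld - bru   # left-right figure-of-eight
    z = flu - frd - bld + bru   # up-down figure-of-eight
    return w, x, y, z

# Plane wave arriving from the +X direction (straight ahead):
gains = [cardioid_gain(CAPSULES[k], (1.0, 0.0, 0.0))
         for k in ("FLU", "FRD", "BLD", "BRU")]
w, x, y, z = a_to_b(*gains)
# The X channel dominates, while Y and Z cancel for a frontal source.
```

The four resulting channels (W, X, Y, Z) constitute the first-order Ambisonic stream mentioned above.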
For the proposed measurement system, we use the Zoom
H3-VR, which is a microphone array that records first-
order Ambisonic streams. It can be connected to the Rasp-
berry Pis via USB. In principle, higher-order microphone
arrays that capture the sound field with a higher resolution than the H3-VR are available on the market. However, such microphone arrays are usually either unwieldy
or hardly portable due to their cabling, thus making them
unsuitable for the ARTSRAM.
3.3. Room impulse response measurement procedure

Exponential sine sweeps [29] are used for measuring
RIRs with the ARTSRAM. After every measurement, the
two robots move one after another to new positions in the
room. The robots move according to a random walk pro-
cedure, in which every robot first spins for a random time
span and subsequently drives straight forward for another
randomly selected time period. Both time spans should be
bounded, such that too short or too long movements are
avoided. Time spans corresponding to spins between 15° and 180° and straight movements between 15 cm and half of the length of the longest room dimension proved successful during our tests. If a robot detects an obstacle during
the random walk, it will stop its movement and drive back-
wards for 1 s to 3 s. As we will demonstrate in Section 4.3,
good coverage of a measurement area and high variabil-
ity between the measured positions can be ensured by fol-
lowing this random walk procedure. Furthermore, it does
not require any additional path or grid planning before the
measurement. A video demonstration of the measurement
procedure can be found on the paper’s companion page, cf. Section 8.
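A minimal sketch of the exponential-sine-sweep technique [29] is shown below: convolving the sweep with an amplitude-compensated, time-reversed copy of itself yields an impulse-like response, which is the basis of RIR deconvolution. The parameters (4 kHz sampling rate, 100 Hz to 1.8 kHz sweep) are toy values chosen for a fast demonstration, not the settings used in the paper:

```python
import math

def ess(f1, f2, duration, fs):
    """Exponential sine sweep from f1 to f2 Hz (Farina method)."""
    n = int(duration * fs)
    r = math.log(f2 / f1)
    L = duration / r  # sweep rate constant in seconds
    return [math.sin(2.0 * math.pi * f1 * L * (math.exp(t / (fs * L)) - 1.0))
            for t in range(n)]

def inverse_filter(sweep, f1, f2):
    """Time-reversed sweep with an exponential amplitude envelope that
    compensates the sweep's energy slope towards low frequencies."""
    r = math.log(f2 / f1)
    n = len(sweep)
    env = [math.exp(-t * r / n) for t in range(n)]
    return [s * e for s, e in zip(reversed(sweep), env)]

def convolve(a, b):
    """Plain full-length linear convolution (educational, O(n^2))."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

fs = 4000
sweep = ess(100.0, 1800.0, 0.2, fs)
inv = inverse_filter(sweep, 100.0, 1800.0)
rir = convolve(sweep, inv)  # impulse-like peak near index len(sweep) - 1
peak_idx = max(range(len(rir)), key=lambda k: abs(rir[k]))
```

In a real measurement, `rir` would be obtained by convolving the recorded microphone signal (instead of the raw sweep) with the inverse filter, so that everything before and after the peak carries the room's impulse response.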
4. Validation

In this section, the proposed ARTSRAM is validated
from four perspectives. Firstly, anechoic measurements of
the utilized mobile loudspeaker are presented. Secondly,
they are compared with anechoic measurements of the en-
tire robot system to determine the sound field alterations
caused by the robot fixture. Thirdly, the proposed random
walk procedure is evaluated with a test measurement and
simulations. Lastly, we use ARTSRAM measurements and
demonstrate their validity by calculating a sound field vi-
sualization for a shoebox-shaped room.
4.1. Loudspeaker measurement

Ideally, a loudspeaker for RIR measurements should be
omnidirectional in order to excite the room as evenly as
possible [30]. Additionally, its magnitude response should
be flat to ensure that all frequencies are excited equally.
The Minirig MRBT-2 loudspeaker is a consumer device
Fig. 3: Magnitude response of Minirig MRBT-2 loud-
speaker, measured from five different azimuth angles. The
measurements were conducted in an anechoic chamber.
The responses are smoothed in 1/3 octave bands.
and therefore not primarily designed to conduct RIR mea-
surements. For this reason, the loudspeaker was measured
in an anechoic chamber to evaluate its suitability for the
measurement system.
Figure 3 shows 1/3-octave-band-smoothed magnitude
responses, measured with a 1/2” GRAS 46AF free-field
microphone for five different azimuth angles. Measurements along the median plane of the loudspeaker would lead to similar responses, because the loudspeaker has a circular shape (cf. Figure 2). The plot illustrates that the magnitude response is reasonably flat for azimuth angles of 0° and 30°. In contrast, the measured responses at 60°, 120°, and 180° exhibit strong attenuation of frequencies above 1 kHz due to shadowing effects of the loudspeaker body.
4.2. Anechoic robot twin measurement

Every RIR measurement setup requires at least one loud-
speaker and one microphone. Adding additional compo-
nents to the RIR measurement setup always comes at the
cost of potentially disturbing the sound field that is sup-
posed to be measured. In order to determine whether the
Roomba robot and the additional fixture disturb the sound
field, we conducted an RIR measurement with the entire
system inside an anechoic chamber. In other words, we
measured the anechoic response from the source robot to
the receiver robot. Every disturbance caused by the robot
fixtures or the utilized hardware would appear in this mea-
surement as additional reflections in the time domain or as
deviations from the loudspeaker response in the frequency
domain. In the following, we will refer to this measurement
as the twin measurement.
Figure 4a depicts the RIR that was obtained during the
twin measurement. A clear direct sound peak is visible.
Additionally, the response exhibits smaller ripples during
the millisecond following the direct sound peak. These rip-
ples are most likely caused by the loudspeaker itself, be-
cause they also appear in the loudspeaker’s impulse re-
sponse. The robots were placed 1 m apart and the dis-
tance from the microphone or loudspeaker to the wooden
plate is approximately 1.1 m.

(a) Impulse response (b) Magnitude response
Fig. 4: Responses from the twin measurement. During the twin measurement, the response from the source robot to the receiver robot was measured inside an anechoic chamber. The magnitude response is smoothed in 1/3 octave bands.

Therefore, potential reflections caused by the wooden plate of the fixture would be
visible after approximately 4 ms. However, the plot shows
only minor ripples at these time instances. The impulse re-
sponse after 10 ms is not depicted in the plot, because it
remains close to zero.
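The 4 ms figure follows from simple image-source geometry. The following check (our own arithmetic, assuming both transducers sit 1.1 m above the plate and 1 m apart, as stated above) reproduces it:

```python
import math

c = 343.0   # speed of sound in m/s
h = 1.1     # height of the transducers above the wooden plate, m
d = 1.0     # horizontal source-receiver distance, m

direct = d / c  # direct sound arrival time
# Image-source model: the plate reflection travels as if it came from
# a mirror source located 2*h below the real source.
reflected = math.sqrt(d**2 + (2.0 * h)**2) / c
delay_ms = (reflected - direct) * 1000.0  # delay after the direct sound
```

The computed delay of roughly 4.1 ms matches the approximately 4 ms stated in the text.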
Figure 4b depicts the magnitude response of the twin
measurement. Only minor deviations from the anechoic
loudspeaker response can be observed. The influence of
the additional fixture is largest between 2 kHz and 6 kHz,
with deviations of up to 4.3 dB between the loudspeaker
response and the twin measurement. Overall, these results
imply that the Roomba robot, the utilized hardware, and
the additional fixture do not evoke any major disturbances
of the sound field that is supposed to be measured.
4.3. Random walk test

As outlined in Section 3.3, the robots move according to a random walk procedure. The motivation behind this was to ensure a large variability between measured positions without requiring a manually defined measurement grid. The literature on random walks is extensive
and scenarios with limited complexity can be described
analytically [20–22]. However, our proposed random walk
exhibits a higher complexity, because the measurement
area is limited by walls and collisions with obstacles trigger
interruptions to the standard walk. Therefore, we avoid an overly mathematical derivation of the measurement point distribution here and present a more empirical evaluation instead. More precisely, in this section, we
present results from a test measurement and from simula-
tions of the random walk procedure. This evaluation should
help to balance the pros and cons of our chosen approach
compared to a grid-based measurement procedure.
A test measurement session was conducted in an exem-
plary room to evaluate the movements of the robots. In-
side this room, a rectangular measurement area of approximately 6 m × 4 m was delimited with wooden planks on the
floor. The session comprised 340 individual RIR measure-
ments and took 4 hours to complete. During the random
walk, the time spans for spinning and moving straight were
randomly sampled from the intervals 0.3 s ≤ t_spin ≤ 4 s
Number of steps | Source: percent visited | Receiver: percent visited
150 | 56.0 ± 3.1 | 56.2 ± 3.4
300 | 80.1 ± 2.7 | 80.5 ± 3.0
600 | 95.6 ± 1.7 | 95.7 ± 1.7
1000 | 99.2 ± 0.7 | 99.2 ± 0.7
5000 | 99.9 ± 0.1 | 99.9 ± 0.1

Table 1: Amount of cells visited by the robots during the proposed random walk procedure. The values are calculated from simulations of 100 runs over a measurement area of 6 m × 4 m. A uniform cell size of 40 cm × 40 cm is assumed. All results are given as mean ± standard deviation calculated over the 100 runs.
and 1 s ≤ t_move,straight ≤ 20 s, respectively. These time intervals approximately correspond to spins between 15° and 180° and to straight movements between 0.15 m and 3 m.
Therefore, the robots could potentially move very accu-
rately, but they were also able to traverse half of the longest
measurement area dimension within one random walk step.
Figure 5 depicts the resulting movement paths of both
robots. Initially, both robots were placed as close to each
other as possible. From this position, they began their ran-
dom walk and moved into different directions. The plots
suggest that the robots cover the room fairly equally. No
part of the room was entirely skipped. Only some portions
close to the measurement area’s borders were not reached
by the random walk.
We ran additional simulations to investigate the ran-
dom walk’s coverage of the measurement area for differ-
ent numbers of steps. During the simulations, the number
of measurement positions or random walk steps was varied
between 150 and 5000. A random walk of 150 steps means
that each of the robots moves 150 times. For every num-
ber of steps, 100 runs of the proposed random walk pro-
cedure were simulated with the same parameters and the
same measurement area dimensions as in the exemplary measurement.

Table 1 summarizes the calculated statistics from the
simulations. To quantify the coverage of the random walk,
Fig. 5: Robot paths measured during a measurement session with 340 measurements. Both robots followed the proposed
random walk procedure.
the measurement area was divided into 150 uniform cells of 40 cm × 40 cm. The “percent visited” value quantifies how
many of the 150 cells were visited by each of the robots
after the random walk with a certain number of steps is
completed. A cell counts as visited if a measurement was
conducted in it. Differences between runs are accounted
for by calculating the mean and standard deviation over
all 100 simulated runs. The randomization of the measure-
ment procedure means that more measurements need to be
conducted than would be required if the approach was grid-
based. In fact, the random walk only visits about half of
the cells during 150 steps, whereas a grid-based procedure
would visit all of them. However, the coverage already improves significantly when the number of steps is doubled. After
600 steps, more than 95 % of the cells were visited at least
once by both the source and the receiver.
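The coverage statistic can be reproduced with a small simulation. The sketch below implements a simplified random walk (uniformly random headings, positions clipped at the area borders) rather than the exact ARTSRAM procedure; the step length and the 6 m × 4 m measurement area (chosen so that it divides into 150 cells of 40 cm × 40 cm) are our own assumptions.

```python
import math
import random

def simulate_coverage(n_steps, area=(6.0, 4.0), cell=0.4,
                      step_len=0.5, seed=0):
    """Fraction of cells visited by a simplified random walk.

    Each step picks a uniformly random heading; positions leaving the
    measurement area are clipped to its borders. A cell counts as
    visited if a (simulated) measurement position falls inside it.
    """
    rng = random.Random(seed)
    nx, ny = int(area[0] / cell), int(area[1] / cell)  # 15 x 10 = 150 cells
    x, y = area[0] / 2.0, area[1] / 2.0                # start mid-area
    visited = set()
    for _ in range(n_steps):
        phi = rng.uniform(0.0, 2.0 * math.pi)
        x = min(max(x + step_len * math.cos(phi), 0.0), area[0])
        y = min(max(y + step_len * math.sin(phi), 0.0), area[1])
        visited.add((min(int(x / cell), nx - 1), min(int(y / cell), ny - 1)))
    return len(visited) / (nx * ny)

def mean_coverage(n_steps, runs=20):
    """Mean coverage over several runs, mirroring the 'percent visited' statistic."""
    return sum(simulate_coverage(n_steps, seed=s) for s in range(runs)) / runs
```

With a fixed seed, a longer walk always visits at least the cells of a shorter one, so coverage grows monotonically with the number of steps, as in Table 1.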
Additionally, we used the data of the test measurement
and the simulations to evaluate the random walk proce-
dure with respect to the achievable variability of measure-
ments. For this reason, we investigated the distribution of
measured source-to-receiver configurations. Figures 6a–6c depict probability density functions (PDFs) of observed
source-to-receiver distances for the measurement, simula-
tions with 340 steps and simulations with 5000 steps, re-
spectively. Although they show entries for a wide range of
different distances, the distributions are considerably non-
uniform. An exceptionally high number of measurements
was conducted with intermediate source-to-receiver dis-
tances, whereas fewer measurements were conducted with
very small or very large distances. One reason for this is
that the robots cannot get closer to each other than ap-
proximately 0.35 m because of their physical extent. Fur-
thermore, large distances between the robots occur less frequently due to the spatial limitations of the measurement
area. Interestingly, the resulting distributions are similar
to the distribution that would be observable for a uniform
measurement grid. To demonstrate this, PDFs of a uniform
grid are included as a reference in Figure 6.
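The shape of these PDFs is easy to reproduce numerically. The sketch below histograms the source-to-receiver distances over all pairs of points on a uniform grid, which yields a grid reference curve like the one in Figure 6; the area dimensions, grid spacing, and bin width are illustrative assumptions, not the paper’s values.

```python
import math

def distance_pdf(src_pts, rcv_pts, bin_w=0.25, d_max=8.0):
    """Normalized histogram (PDF) of source-to-receiver distances."""
    n_bins = int(d_max / bin_w)
    counts = [0] * n_bins
    for (xs, ys), (xr, yr) in zip(src_pts, rcv_pts):
        d = math.hypot(xs - xr, ys - yr)
        counts[min(int(d / bin_w), n_bins - 1)] += 1
    total = sum(counts)
    return [c / (total * bin_w) for c in counts]  # integrates to one

def grid_points(area=(6.0, 4.0), spacing=0.4):
    """Uniform grid of candidate positions over the measurement area."""
    return [(i * spacing, j * spacing)
            for i in range(int(area[0] / spacing) + 1)
            for j in range(int(area[1] / spacing) + 1)]

# uniform-grid reference: distances over all source/receiver pairs
grid = grid_points()
pairs = [(s, r) for s in grid for r in grid if s != r]
pdf_grid = distance_pdf([p[0] for p in pairs], [p[1] for p in pairs])
```

Feeding measured random-walk positions into `distance_pdf` instead of the grid pairs gives the empirical curves; intermediate distances dominate because few point pairs are very close or near the diagonal of the area.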
Another important observation is that random walks
with fewer steps exhibited higher inter-run variability of
the distributions than random walks with more steps. To il-
lustrate this, Figure 6 shows PDFs of multiple runs. While
the PDFs vary considerably among different runs of the
340 step walk, the PDFs of individual runs coincide better
for a random walk with 5000 steps. This has practical im-
plications, because it means that the measurement position
distributions for runs with few steps are less predictable
and reproducible. In other words, the distribution of our
measurement, which is just one exemplary run of a 340
step random walk, could look somewhat different if we re-
peated it. For a higher number of steps, we could in contrast
be more confident that the distribution converges to some
average distribution.
4.4 Sound field visualization

This section evaluates the validity of ARTSRAM-based RIR measurements by using them to visualize a sound field inside a room. A set of 330
RIRs was measured with the ARTSRAM inside the same
shoebox-shaped room that was used during the evaluation
of the random walk in Section 4.3. The room has a size of 7.85 m × 5.35 m × 3.15 m and the measurement area was
again restricted with wooden planks on the floor. During
the measurement session, the source robot remained static,
while the receiver robot was exploring the room according
to the proposed random walk procedure. The measurement
session took 3 hours. It is 1 hour shorter than the session in the previous section, because only one of the robots was moving.
Figure 7 shows multiple snapshots of the sound
field recorded in the room. The visualization is based
on a directional analysis of the sound energy prop-
agating through the room. At every receiver position,
direction-of-arrivals (DOAs) were calculated for multiple
time windows along the entire RIR. The DOA estimation
was based on the sound intensity vector calculated from
Ambisonic streams [31] and was implemented using the SDM toolbox [32].

(a) Measurement with 340 steps (b) Simulations with 340 steps per walk (c) Simulations with 5000 steps per walk
Fig. 6: Probability density functions (PDFs) of source-to-receiver distance. The plots compare PDFs resulting from the proposed random walk with the PDF of a simulated uniform grid.
For every time window, the DOA information can be
used together with the time index of the window to extrap-
olate a position in space where the corresponding sound
energy originated from. In a simplified way, this can be
done by following a ray that points from the receiver po-
sition towards the DOA with a length of l = c(nT + T/2), where n is the window number, T is the window length, and c is the speed of sound. For the early part of the
RIR, this will result in rays that point from the receiver
towards the sound source and its image-sources. In prin-
ciple, this is reciprocal to calculating energy responses
based on the image-source method. At the global obser-
vation time t = 0, i.e., at the emission time of the pulse
from the source, the approach returns energy contributions
clustered at the primary source and its image-sources. With
increasing t, sound energy propagates along the calculated
rays. Consequently, it is possible to visualize the temporal
progression of the sound field as shown in Figure 7.
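The ray construction described above can be sketched in a few lines. The helper below is our own illustration (function names and the 343 m/s speed of sound are assumptions): each analysis window’s energy is placed along its DOA ray at the distance corresponding to the window centre.

```python
import math

C = 343.0  # assumed speed of sound in m/s

def energy_positions(receiver, doas, T):
    """Extrapolate one spatial position per RIR analysis window.

    receiver -- (x, y, z) receiver position in metres
    doas     -- one direction-of-arrival vector per time window
    T        -- window length in seconds
    Window n's energy is placed at ray length l = c * (n*T + T/2),
    i.e. at the distance corresponding to the window centre.
    """
    positions = []
    for n, (dx, dy, dz) in enumerate(doas):
        norm = math.sqrt(dx * dx + dy * dy + dz * dz)  # normalize the DOA
        l = C * (n * T + T / 2.0)
        positions.append((receiver[0] + l * dx / norm,
                          receiver[1] + l * dy / norm,
                          receiver[2] + l * dz / norm))
    return positions
```

For the early part of the RIR, these positions cluster at the source and its image-sources; advancing the global observation time moves the energy contributions along the rays.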
The plots in Figure 7 are based on 330 first-order Am-
bisonic RIRs measured at the receiver positions depicted in
Figure 7f. The grey dots in the figure correspond to sound
energy contributions at the positions in the room, which
were calculated with the previously outlined procedure.
They propagate along rays between the (image-) sources
and the receiver, thus rendering the different snapshots over
time. The superposition of these energy contributions at
one time instance yields the depicted wave fronts, which
are indicated by the dashed auxiliary lines. Dots outside
the room are contributions in the mirror rooms. They are
depicted to illustrate the reflection formation process via
image-sources and the notional continuation of the wave
propagation into the mirror rooms. An animation of these
sound field plots is available on the companion page of this
paper, cf. Section 8.
Although the measurement grid is non-uniform, the
measurements reproduce many of the acoustic phenom-
ena that can be expected from a theoretical point of view.
Firstly, Figure 7a shows the direct sound as a spherical
wave front propagating from the sound source into the
room. Secondly, distinct wave fronts originating from first
order reflections at the walls of the room can be observed
in Figures 7b – 7d. Lastly, the visualization in Figure 7e
shows a second-order reflection that is caused by the first-
order reflection depicted in Figure 7d. For clarity reasons,
the second-order reflection was plotted while it is still in
the mirror room, i.e., before the reflection happened.
5 Discussion
The aim of this paper was to develop a flexible, versatile,
scalable, and autonomous robot system for capturing large
sets of RIRs with variable sound source and receiver posi-
tions. During the development process, the design princi-
ples highlighted in Section 2 played a major role in choos-
ing the components of the system. Keeping this in mind,
the results of the previous validation section should be in-
terpreted accordingly.
The anechoic loudspeaker measurement outlined in Sec-
tion 4.1 illustrates that the Minirig loudspeaker deviates
considerably from an ideal omni-directional loudspeaker.
Nevertheless, its frequency response is sufficiently flat for
sound radiating towards directions between 0° and 30° off the loudspeaker’s main axis. This applies for directions
along the horizontal and median plane in the same way,
because the loudspeaker is circular. Although the Minirig
loudspeaker is certainly not ideal for RIR measurements,
we assume that it is still a good choice, because compa-
rable mobile loudspeakers with similar dimensions will
most likely not perform significantly better. Omnidirec-
tional loudspeakers [33] or variable directivity higher-order
Ambisonics speakers [34, 35] can be heavy and usually re-
quire a large amount of cabling and additional hardware.
Obviously, this means that such loudspeakers cannot be
used in a system like ours, for which mobility has been a
crucial design principle. Additionally, the loudspeaker di-
rectivity of the Minirig is probably less problematic for the
use case of acoustic virtual reality rendering, because usual
sources in such scenes (human voice, many instruments,
gunshots or other FX) exhibit similarly directive sound ra-
diation patterns [36, 37]. In this case, it would be necessary
to measure multiple source orientations for every measure-
ment position. Such an extension of the random walk pro-
cedure would be very time-efficient, because spins of the
robots can be performed quickly.
Section 4.2 explored how the sound field is influenced
by the fixture that is mounted on the robots. The results
suggest that the fixture does not disturb the sound field con-
siderably. To the greatest extent, reflections can be avoided
by placing the loudspeaker and microphone on a mesh construction, such as the proposed chicken fence. Only minor disturbances are caused by the robot base itself.

(a) t = 4.17 ms (direct sound) (b) t = 8.33 ms (1st order reflection at long wall with y = 5.3) (c) t = 13.54 ms (1st order reflection at short wall with x = 0) (d) t = 18.75 ms (1st order reflection at long wall with y = 0) (e) t = 25 ms (2nd order reflection at long wall with y = 5.3) (f) Source and receiver positions
Fig. 7: Visualization of the sound energy (grey dots) inside a room, based on 330 ARTSRAM measurements. The measurement procedure followed the proposed random walk. The static sound source is depicted as the black square. The bold dashed auxiliary lines indicate observable wave fronts, which are classified in the respective subcaptions of the plots.
The results of Section 4.3 indicate that a high cover-
age of the room and a big variability in measured po-
sitions can be achieved with the proposed random walk
procedure. As a consequence, large and diverse sets of
RIRs can be captured, without requiring the ARTSRAM
to measure positions along a predefined grid. In compari-
son to a grid-based approach, the random walk procedure
requires a higher number of measurement steps to reach
a comparable coverage. Approximately four times more
measurements were required for the simulated scenario of
Section 4.3. However, grid-based measurement procedures
are less flexible, because they require additional grid and
path planning as well as positioning efforts. We believe that the increased number of measurements in the random walk is more tolerable than the additional effort of grid-based approaches in the long run. Once the random walk
measurement is running, additional measurements can be
easily accommodated. Furthermore, additional measure-
ments are less problematic because a session continues au-
tonomously without any need for supervision and could potentially run overnight or during weekends.
Section 4.4 demonstrated the validity of ARTSRAM-based RIR measurements and their corresponding position data. It was exemplarily shown that such measure-
ments can be used to visualize the sound field inside a
room and reproduce well known acoustic phenomena, such
as specular reflections. Another interesting perspective in
this context is the question whether the sound field can
theoretically be reconstructed from an irregular sampling
with a distribution like in Section 4.3. The theoretical
limit of sound field reconstruction is given by the Nyquist
theorem [13]. If high frequency contents are supposed to
be considered, the amount of required measurements may
become very large. Taking into account the increased num-
ber of measurements due to the random walk procedure,
the measurement time to reach a sufficient resolution may
exceed what is feasible to measure with our proposed sys-
tem. However, promising approaches exist for reconstruct-
ing sound fields with fewer measurements [14–16]. An ir-
regular, as opposed to a uniform sampling, may be benefi-
cial in terms of reconstructing the sound field because of a
lower coherence between measurements [14]. Datasets ob-
tained with the ARTSRAM might be valuable to accelerate
further developments of sound field reconstruction algo-
rithms, especially for data-heavy approaches such as [15].
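To put the Nyquist argument in numbers: alias-free spatial sampling requires a spacing of at most half the shortest wavelength, so the measurement count over a plane grows quadratically with the upper frequency limit. The figures below are a back-of-the-envelope illustration with assumed dimensions, not values from the paper.

```python
C = 343.0  # speed of sound in m/s

def nyquist_spacing(f_max):
    """Largest spacing (m) that avoids spatial aliasing up to f_max (Hz)."""
    return C / (2.0 * f_max)

def measurements_for_plane(area, f_max):
    """Number of grid points needed to sample a plane of size (lx, ly) in metres."""
    d = nyquist_spacing(f_max)
    nx = int(area[0] / d) + 1
    ny = int(area[1] / d) + 1
    return nx * ny

# e.g. a 6 m x 4 m plane sampled up to 2 kHz already needs thousands of points
n_2k = measurements_for_plane((6.0, 4.0), 2000.0)
```

Doubling f_max roughly quadruples the required count, which is why a sufficient resolution at high frequencies quickly exceeds what is practical even with an automated system.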
6 Conclusions
In this paper, we proposed the Autonomous Robot Twin
System for Room Acoustic Measurements (ARTSRAM).
It consists of two autonomous robots equipped with electroacoustic transducers and position-tracking devices.
robots are programmed to perform random walks in rooms
for large-scale room acoustic measurements. For example,
the ARTSRAM can be used to capture large sets of RIRs
in an automated way, including the corresponding position
data of the sound source and receiver. Collision sensors
make the robots autonomous and ensure a high flexibility
and versatility, because the system can be used in various
rooms without big installation efforts.
Although the components of the system were chosen ac-
cording to our best knowledge, some non-ideal properties
had to be accepted in order to ensure a high mobility of the
robots. For example, currently the robots are equipped with
a microphone array of only first order and mobile loud-
speakers with a considerable directivity. However, we as-
sume that the system’s flexibility and the achievable acqui-
sition speed of RIRs compensate for the minor shortcom-
ings of the incorporated components. The modular struc-
ture of the robot system easily allows replacing its compo-
nents as soon as better alternatives become available.
Our evaluation indicates that the proposed system is ca-
pable of measuring large, diverse, and high-quality RIR
data sets in an automated way. The acoustic disturbance
caused by the custom-made robot fixture was shown to
be negligible. Furthermore, the implemented random-walk
procedure exhibited a good coverage of an exemplary mea-
surement area. Finally, it was shown that ARTSRAM-
based RIR measurements and their corresponding posi-
tion data are capable of accurately capturing acoustic
phenomena, such as specular reflections. Therefore, the
ARTSRAM can be seen as a valid tool to obtain position-
dynamic RIR measurements for future research in the
fields of room acoustics and acoustic virtual reality.
7 Acknowledgement

The authors would like to express their gratitude to Aleksi Öyry for the helpful advice and support while building the additional fixture and mounting it on the robots.

The project has received funding from the Academy of Finland, project no. 317341, and from the Nordic Sound and Music Computing Network (NordicSMC), project no. 86892.
8 Companion page and source code

A companion page for this paper can be found at:
It includes a video demonstration of a measurement session with the ARTSRAM and an animation of the sound field visualization depicted in Figure 7.

The source code of the ARTSRAM implementation can be found at:
[1] Akama, T., Suzuki, H., and Omoto, A., “Distribution
of selected monaural acoustical parameters in con-
cert halls,” Appl. Acoust., 71(6), pp. 564–577, 2010,
[2] Mourjopoulos, J., “On the variation and invertibil-
ity of room impulse response functions,” J. Sound
Vib., 102(2), pp. 217–228, 1985, ISSN 0022-460X,
[3] Llopis, H. S., Pind, F., and Jeong, C.-H., “De-
velopment of an auditory virtual reality sys-
tem based on pre-computed B-format impulse re-
sponses for building design evaluation,” Build. Environ., 169, pp. 106553:1 – 106553:10, 2019,
[4] Kuttruff, H., Room Acoustics, Spon Press, London,
UK, 4th edition, 2000.
[5] Witew, I. B., Vorländer, M., and Xiang, N., “Sampling
the sound field in auditoria using large natural-scale
array measurements,” J. Acoust. Soc. Am., 141(3), pp.
EL300–EL306, 2017, doi:10.1121/1.4978022.
[6] Xiang, N., Escolano, J., Navarro, J. M., and
Jing, Y., “Investigation on the effect of aperture
sizes and receiver positions in coupled rooms,” J.
Acoust. Soc. Am., 133(6), pp. 3975–3985, 2013,
[7] Uehara, M., Ishikawa, N., and Okawa, S., “Visual-
ization of Distribution of Room Acoustic Parameters
by Using Mobile Robot,” in Proc. 23rd Int. Congr.
Acoust. (ICA), pp. 4611–4616, Deutsche Gesellschaft für Akustik (DEGA), Aachen, Germany, 2019.
[8] Adavanne, S., Politis, A., Nikunen, J., and Vir-
tanen, T., “Sound Event Localization and De-
tection of Overlapping Sources Using Convolu-
tional Recurrent Neural Networks,” IEEE J. Sel.
Topics Signal Process., 13(1), pp. 34–48, 2018,
[9] Čmejla, J., Kounovský, T., Gannot, S., Koldovský, Z., and Tandeitnik, P., “MIRaGe: Multichannel Database Of Room Impulse Responses Measured On High-Resolution Cube-Shaped Grid In Multiple Acoustic Conditions,” arXiv:1907.12421, 2019, (Last accessed: 29 October 2020).
[10] Le Roux, J., Vincent, E., Hershey, J. R., and El-
lis, D. P. W., “Micbots: Collecting large realistic
datasets for speech and audio research using mo-
bile robots,” in Int. Conf. Acoust., Speech, Sig. Proc.
(ICASSP), pp. 5635–5639, IEEE, Brisbane, Australia,
2015, doi:10.1109/icassp.2015.7179050.
[11] Cecchi, S., Carini, A., and Spors, S., “Room Re-
sponse Equalization—A Review,” Appl. Sci., 8(1),
p. 16, 2017, doi:10.3390/app8010016.
[12] Raghuvanshi, N. and Snyder, J., “Parametric direc-
tional coding for precomputed sound propagation,”
ACM Trans. Graphics, 37(4), p. 108, 2018 July,
[13] Ajdler, T., Sbaiz, L., and Vetterli, M., “The Ple-
nacoustic Function and Its Sampling,” IEEE Trans.
Signal Process., 54(10), pp. 3790–3804, 2006,
[14] Verburg, S. A. and Fernandez-Grande, E., “Recon-
struction of the sound field in a room using compres-
sive sensing,J. Acoust. Soc. Am., 143(6), pp. 3770–
3779, 2018, doi:10.1121/1.5042247.
[15] Lluís, F., Martínez-Nuevo, P., Møller, M. B., and Shepstone, S. E., “Sound field reconstruction in rooms: Inpainting meets super-resolution,” J. Acoust. Soc. Am., 148(2), pp. 649–659, 2020,
[16] Antonello, N., Sena, E. D., Moonen, M., Naylor,
P. A., and van Waterschoot, T., “Room Impulse
Response Interpolation Using a Sparse Spatio-
Temporal Representation of the Sound Field,”
IEEE/ACM Trans. Audio, Speech, Language
Process., 25(10), pp. 1929–1941, 2017 July,
[17] Werner, S., Klein, F., and Götz, G., “Investigation on
spatial auditory perception using non-uniform spatial
distribution of binaural room impulse responses,” in
Proc. 5th Int. Conf. Spatial Audio (ICSA), pp. 137–
144, Verband Deutscher Tonmeister (VDT), Ilmenau,
Germany, 2019, doi:10.22032/dbt.39936.
[18] Brinkmann, F., Aspöck, L., Ackermann, D., Lepa, S., Vorländer, M., and Weinzierl, S., “A round robin
on room acoustical simulation and auralization,” J.
Acoust. Soc. Am., 145(4), pp. 2746–2760, 2019 April,
[19] Feng, Y., Khodayi-mehr, R., Kantaros, Y., Calkins, L.,
and Zavlanos, M. M., “Active Acoustic Impedance
Mapping Using Mobile Robots,” in IEEE Conf. De-
cision Control (CDC), pp. 3910–3915, IEEE, Miami
Beach, USA, 2018, doi:10.1109/cdc.2018.8618924.
[20] Hughes, B. D., Random Walks and Random Environ-
ments - Volume 1: Random Walks, Oxford University
Press, New York, NY, USA, 1995.
[21] Lovász, L., “Random Walks on Graphs: A Survey,” in D. Miklós, V. T. Sós, and T. Szőnyi, editors, Combinatorics, Paul Erdős is Eighty, number 2 in Bolyai Society Mathematical Studies, pp. 1–46, János Bolyai Mathematical Society, Budapest, Hungary, 1993.
[22] Aldous, D. J., “Lower bounds for covering times
for reversible Markov chains and random walks on
graphs,” J. Theor. Probab., 2(1), pp. 91–100, 1989,
[23] Witherwax, M., “iRobot - A Python implementation of the iRobot Open Interface,” https://github.com/julianpistorius/irobot, 2016, (Last accessed: 15 September 2020).
[24] Borges, M., Symington, A., Coltin, B., Smith, T., and
Ventura, R., “HTC Vive: Analysis and Accuracy Im-
provement,” in Int. Conf. Intelligent Robots Systems
(IROS), pp. 2610–2615, IEEE/RSJ, Madrid, Spain,
2018, doi:10.1109/iros.2018.8593707.
[25] Niehorster, D. C., Li, L., and Lappe, M., “The Accu-
racy and Precision of Position and Orientation Track-
ing in the HTC Vive Virtual Reality System for Sci-
entific Research,” i-Perception, 8(3), pp. 1–23, 2017
May-June, doi:10.1177/2041669517708205.
[26] Bruns, C., “pyOpenVR - Unofficial python bindings for Valve’s OpenVR virtual reality SDK,” 2016, (Last accessed: 15 September 2020).
[27] Rafaely, B., Fundamentals of Spherical Array Pro-
cessing, number 8 in Springer Topics in Signal
Processing, Springer, Berlin, Heidelberg; Germany,
2015, doi:10.1007/978-3-662-45664-4.
[28] Zotter, F. and Frank, M., Ambisonics - A Practical
3D Audio Theory for Recording, Studio Production,
Sound Reinforcement, and Virtual Reality, number 19
in Springer Topics in Signal Processing, Springer Na-
ture, Cham, Switzerland, 2019, doi:10.1007/978-3-
[29] Farina, A., “Simultaneous Measurement of Impulse
Response and Distortion with a Swept-Sine Tech-
nique,” in 108th Conv. Audio Eng. Soc., pp. 1–23,
AES, Paris, France, 2000.
[30] ISO 3382-2, “Acoustics - Measurement of room
acoustic parameters - Part 2: Reverberation time in
ordinary rooms,” Standard, International Organiza-
tion for Standardization (ISO), Geneva, Switzerland,
[31] Merimaa, J. and Pulkki, V., “Spatial Impulse Re-
sponse Rendering I: Analysis and Synthesis,” J. Au-
dio Eng. Soc., 53(12), pp. 1115–1127, 2005 December.
[32] Tervo, S., “SDM Toolbox,” https://fileexchange/56663-sdm-toolbox, 2018, (Last accessed: 15 September 2020).
[33] Leishman, T. W., Rollins, S., and Smith, H. M.,
“An experimental evaluation of regular polyhedron
loudspeakers as omnidirectional sources of sound,”
J. Acoust. Soc. Am., 120(3), pp. 1411–1422, 2006,
[34] Farina, A. and Chiesi, L., “A novel 32-speakers spher-
ical source,” in 140th Conv. Audio Eng. Soc., pp. 1–6,
AES, Paris, France, 2016.
[35] Avizienis, R., Freed, A., Kassakian, P., and Wessel,
D., “A Compact 120 Independent Element Spheri-
cal Loudspeaker Array with Programmable Radiation
Patterns,” in 120th Conv. Audio Eng. Soc., pp. 1–7,
AES, Paris, France, 2006.
[36] Monson, B. B., Hunter, E. J., and Story, B. H., “Hor-
izontal directivity of low- and high-frequency energy
in speech and singing,” J. Acoust. Soc. Am., 132(1),
pp. 433–441, 2012 July, doi:10.1121/1.4725963.
[37] Pätynen, J. and Lokki, T., “Directivities of Symphony Orchestra Instruments,” Acta Acust. united Ac., 96(1), pp. 138–167, 2010 January/February,
Georg Götz is a doctoral candidate at the Acoustics Lab,
Department of Signal Processing and Acoustics, of Aalto
University, Finland. He received his B. Sc. (2016) and
M. Sc. (2019) degrees in Media Technology from Ilmenau
University of Technology, specializing in spatial audio and
psychoacoustics. He was the deputy chair of the AES Stu-
dent Section Ilmenau from 2013 to 2016.
The research of his doctoral thesis is about machine-
learning-based virtual acoustics rendering. His other re-
search interests include room acoustics, psychoacoustics,
and VR/AR technology.
Abraham Martinez Ornelas received his M. Sc. from
Aalto University.
Sebastian J. Schlecht is a Professor of Practice for Sound
in Virtual Reality at the Acoustics Lab, Department of Sig-
nal Processing and Acoustics and Media Labs, Department
of Media, of Aalto University, Finland. He received the
Diploma in Applied Mathematics from the University of
Trier, Germany in 2010, and an M.Sc. degree in Digital
Music Processing from School of Electronic Engineering
and Computer Science at Queen Mary University of Lon-
don, U.K. in 2011. In 2017, he received a Doctoral de-
gree at the International Audio Laboratories Erlangen, Ger-
many, on the topic of artificial spatial reverberation and
reverberation enhancement systems. From 2012 on, Dr.
Schlecht was also an external research and development consultant and lead developer of the 3D Reverb algorithm at
the Fraunhofer IIS, Erlangen, Germany.
His research interests are acoustic modeling and au-
ditory perception of acoustics, analysis, and synthesis of
feedback systems, music information retrieval, and virtual
and augmented reality. He is the recipient of multiple best
paper awards including Best Paper in Journal of the Audio
Engineering Society (JAES) in Jun 2020, Best Paper Award
at IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics (WASPAA) in Oct 2019, 2nd Best
Paper Award at International Conference on Digital Audio
Effects (DAFx) in Sep 2018, and Best Peer-Reviewed Pa-
per at 2018 AES International Conference on Audio for
Virtual and Augmented Reality (AES AVAR) in Aug 2018.
The biography of Ville Pulkki was published in the 2020
May issue of the Journal.
... Ro2 was validated on a database of swept-sine measurements collected in the Arni room at the Acoustics Lab of Aalto University, Espoo, Finland. 5,48 Arni is a rectangular room, with dimensions 8.9 m  6.3 m  3.6 m (length, width, and height, respectively). The room's walls and ceiling are equipped with acoustic panels that can switch their state between open and closed, changing the amount of absorption and thus varying the acoustics within the space. ...
Full-text available
The exponential sine sweep is a commonly used excitation signal in acoustic measurements, which, however, is susceptible to non-stationary noise. This paper shows how to detect contaminated sweep signals and select clean ones based on a procedure called the rule of two, which analyzes repeated sweep measurements. A high correlation between a pair of signals indicates that they are devoid of non-stationary noise. The detection threshold for the correlation is determined based on the energy of background noise and time variance. Not being disturbed by non-stationary events, a median-based method is suggested for reliable background noise energy estimation. The proposed method is shown to detect reliably 95% of impulsive noises and 75% of dropouts in the synthesized sweeps. Tested on a large set of measurements and compared with a previous method, the proposed method is shown to be more robust in detecting various non-stationary disturbances, improving the detection rate by 30 percentage points. The rule-of-two procedure increases the robustness of practical acoustic and audio measurements.
... The spatio-temporal acquisition of a sound field over a large volume of space is experimentally challenging, as a large number of transducers is required to sample the threedimensional field. Some approaches have been presented to effectively acquire the pressure field over a large aperture, e.g. by means of automated measurements [1], [2], IoT-enabled devices [3], by deploying large sensor networks throughout space [4], [5] or via remote sensing principles [6]- [9]. In most cases, the direct acquisition of the sound field with sufficient spatial resolution is not viable, due to excessive spatial sampling requirements. ...
Conference Paper
Full-text available
The acquisition of the spatio-temporal characteristics of a sound field over a large volume of space is experimentally challenging, as a large number of transducers is required to sample the sound field. Sound field reconstruction methods are a resourceful approach, as they enable the interpolation and extrapolation of the sound field from a limited number of observed data. In this study we examine the spatio-temporal and spatio-spectral reconstruction of the sound field in a room from distributed measurements of the sound pressure. Specifically, a variational Gaussian process regression model is formulated, using time-domain anisotropic kernels to reconstruct the direct sound and early reflections, and frequency-domain isotropic kernels for reconstructing the late reverberant field. The proposed methodology is compared experimentally to classical regression models based on plane wave decompositions, which are widely used in sound field reconstruction in enclosures due to their simplicity and accuracy.
Conference Paper
Full-text available
For spatial audio reproduction in the context of virtual and augmented reality, a position-dynamic binaural synthesis can be used to reproduce the ear signals for a moving listener. A set of binaural room impulse responses (BRIRs) is required for each possible position of the listener in the room. The required spatial resolution of the BRIR positions can be estimated by spatial auditory perception thresholds. If the resolution is too low, jumps in perception of direction and distance and coloration effects occur. This contribution presents an evaluation of spatial audio quality using different spatial resolutions of the position of the used BRIRs. The evaluation is performed with a moving listener. The test persons evaluate any abnormalities in the spatial audio quality. The result is a comparison of the quality and the spatial resolution of the various conditions used.
Full-text available
A round robin was conducted to evaluate the state of the art of room acoustic modeling software both in the physical and perceptual realms. The test was based on six acoustic scenes highlighting specific acoustic phenomena and for three complex, “real-world” spatial environments. The results demonstrate that most present simulation algorithms generate obvious model errors once the assumptions of geometrical acoustics are no longer met. As a consequence, they are neither able to provide a reliable pattern of early reflections nor do they provide a reliable prediction of room acoustic parameters outside a medium frequency range. In the perceptual domain, the algorithms under test could generate mostly plausible but not authentic auralizations, i.e., the difference between simulated and measured impulse responses of the same scene was always clearly audible. Most relevant for this perceptual difference are deviations in tone color and source position between measurement and simulation, which to a large extent can be traced back to the simplified use of random incidence absorption and scattering coefficients and shortcomings in the simulation of early reflections due to the missing or insufficient modeling of diffraction.
Full-text available
Capturing the impulse or frequency response functions within extended regions of a room requires an unfeasible number of measurements. In this study, a method to reconstruct the response at arbitrary points based on compressive sensing (CS) is examined. The sound field is expanded into plane waves and their amplitudes are estimated via CS, obtaining a spatially sparse representation of the sound field. The validity of the CS assumptions are discussed, namely, the assumption of the wave field spatial sparsity (which depends strongly on the properties of the specific room), and the coherence of the sensing matrix due to different spatial sampling schemes. An experimental study is presented in order to analyze the accuracy of the reconstruction. Measurements with a scanning robotic arm make it possible to circumvent uncertainty due to positioning and transducer mismatch, and examine the accuracy of the reconstruction over extended regions of space. The results indicate that near perfect reconstructions are possible at low frequencies, even from a limited set of measurements. In addition, the study shows that it is possible to reconstruct damped room responses with reasonable accuracy well into the mid-frequency range.
This open access book provides a concise explanation of the fundamentals and background of the surround sound recording and playback technology Ambisonics. It equips readers with the psychoacoustical, signal processing, acoustical, and mathematical knowledge needed to understand the inner workings of modern processing utilities, special equipment for recording, manipulation, and reproduction in the higher-order Ambisonic format. The book comes with various practical examples based on free software tools and open scientific data for reproducible research. The book’s introductory section offers a perspective on Ambisonics spanning from the origins of coincident recordings in the 1930s to the Ambisonic concepts of the 1970s, as well as classical ways of applying Ambisonics in first-order coincident sound scene recording and reproduction that have been practiced since the 1980s. As, from time to time, the underlying mathematics become quite involved, but should be comprehensive without sacrificing readability, the book includes an extensive mathematical appendix. The book offers readers a deeper understanding of Ambisonic technologies, and will especially benefit scientists, audio-system and audio-recording engineers. In the advanced sections of the book, fundamentals and modern techniques as higher-order Ambisonic decoding, 3D audio effects, and higher-order recording are explained. Those techniques are shown to be suitable to supply audience areas ranging from studio-sized to hundreds of listeners, or headphone-based playback, regardless whether it is live, interactive, or studio-produced 3D audio material.
In this paper, a deep-learning-based method for sound field reconstruction is proposed. The possibility to reconstruct the magnitude of the sound pressure in the frequency band 30–300 Hz for an entire room by using a very low number of irregularly distributed microphones arbitrarily arranged is shown. Moreover, the approach is agnostic to the location of the measurements in the Euclidean space. In particular, the presented approach uses a limited number of arbitrary discrete measurements of the magnitude of the sound field pressure in order to extrapolate this field to a higher-resolution grid of discrete points in space with a low computational complexity. The method is based on a U-net-like neural network with partial convolutions trained solely on simulated data, which itself is constructed from numerical simulations of Green's function across thousands of common rectangular rooms. Although extensible to three dimensions and different room shapes, the method focuses on reconstructing the two-dimensional plane of a rectangular room from measurements of the three-dimensional sound field. Experiments using simulated data together with an experimental validation in a real listening room are shown. The results suggest performance that may exceed that of conventional reconstruction techniques while requiring only a low number of microphones and modest computational resources.
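The key building block, the partial convolution, normalizes each convolution window by the fraction of observed (unmasked) cells, so information spreads outward from the sparse microphone positions. The paper's network learns its filters; the hand-written sketch below uses a plain mask-normalized mean filter, applied repeatedly, purely to illustrate how unobserved grid points get filled in from a handful of measurements. The 8×8 grid, the constant field, and the observation points are all illustrative assumptions.

```python
import numpy as np

def partial_conv(field, mask, ksize=3):
    """One partial-convolution step: average only over observed cells."""
    pad = ksize // 2
    f = np.pad(field * mask, pad)
    m = np.pad(mask.astype(float), pad)
    out = np.zeros_like(field, dtype=float)
    new_mask = np.zeros_like(mask)
    for i in range(field.shape[0]):
        for j in range(field.shape[1]):
            cov = m[i:i + ksize, j:j + ksize].sum()   # observed cells in window
            if cov > 0:
                out[i, j] = f[i:i + ksize, j:j + ksize].sum() / cov
                new_mask[i, j] = True
    return out, new_mask

# Constant pressure magnitude, observed at only two 'microphone' cells
field = np.full((8, 8), 2.0)
mask = np.zeros((8, 8), dtype=bool)
mask[1, 1] = mask[6, 5] = True
est = np.where(mask, field, 0.0)

for _ in range(10):            # repeated steps spread coverage outward
    est, mask = partial_conv(est, mask)
```

After enough steps the valid mask covers the whole grid and, for this constant field, the inpainted estimate matches it exactly; the learned U-net replaces the mean filter with filters trained on thousands of simulated rooms.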
This study presents an auditory virtual reality system which relies on pre-computed B-format impulse responses in a grid across the domain. Spatial information is encoded in the impulse responses by means of higher order Ambisonics and decoded to a virtual loudspeaker array, which follows the listener at run time. The impulse responses are convolved with source signals, played back through the virtual loudspeaker array and synthesized for binaural headphone reproduction through head related transfer functions. This approach allows for a completely free movement and orientation of the listener in the virtual scene. Furthermore, it allows for the usage of highly accurate off-line simulations of room impulse responses. The system is validated with two listening tests. First, the sound source localization performance in virtual reverberant rooms is tested while varying the order of Ambisonics, visuals and head movement. Second, the effects of the grid resolution on the perceived realism, sound source size and sound continuity are investigated. The results reveal that when using second order Ambisonics, together with visuals and allowing head movement, the localization error is very low, being less than one just-noticeable difference. Furthermore, when using a coarse grid, the perceived sound source size increases and the source appears more spread out. Perceived realism and sound continuity are not significantly affected by the grid resolution. Two video files that show the proposed system in use are provided.
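The run-time core of such a system is a lookup-and-convolve loop: find the pre-computed grid impulse response nearest to the listener, then convolve the dry source signal with each Ambisonic channel. The sketch below shows only that core on first-order (4-channel) synthetic IRs; the actual system additionally decodes to virtual loudspeakers and binauralizes via HRTFs, and the grid positions, IR length, and `auralize` helper here are hypothetical.

```python
import numpy as np

fs = 48_000
rng = np.random.default_rng(3)

# Hypothetical pre-computed grid: 4 positions, each with a 4-channel IR
grid_pos = np.array([[x, y] for x in (0.0, 1.0) for y in (0.0, 1.0)])
grid_irs = rng.standard_normal((4, 4, 256)) * np.exp(-np.arange(256) / 64)

def nearest_ir(listener_xy):
    """Pick the grid impulse response closest to the listener."""
    i = int(np.argmin(np.linalg.norm(grid_pos - listener_xy, axis=1)))
    return grid_irs[i]

def auralize(dry, listener_xy):
    """Convolve the dry source signal with each B-format IR channel."""
    ir = nearest_ir(listener_xy)
    return np.stack([np.convolve(dry, ch) for ch in ir])

dry = rng.standard_normal(fs // 10)            # 100 ms of source signal
wet = auralize(dry, np.array([0.9, 0.1]))      # nearest grid point: (1, 0)
```

Because only the lookup changes as the listener moves, the expensive room simulation stays entirely off-line, which is what permits highly accurate pre-computed impulse responses.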
In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses the phase and magnitude components of the spectrogram calculated on each audio channel separately as features, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.
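The two parallel outputs can be made concrete with a toy forward pass. The actual model is a convolutional recurrent network; the NumPy sketch below replaces it with random stand-in features and weights, and only demonstrates the output structure: a sigmoid head producing per-frame, per-class activity for multi-label SED, and a tanh-bounded regression head producing per-class Cartesian DOA coordinates. The dimensions and weight matrices are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

T, F, C = 100, 64, 11                # frames, feature dim, event classes
rng = np.random.default_rng(0)
feat = rng.standard_normal((T, F))   # stand-in for the recurrent features

W_sed = rng.standard_normal((F, C)) * 0.1        # hypothetical SED head
W_doa = rng.standard_normal((F, 3 * C)) * 0.1    # hypothetical DOA head

# Head 1: multi-label detection, one activity value per class per frame
sed = sigmoid(feat @ W_sed)                      # values in (0, 1)

# Head 2: per-class Cartesian DOA regression, bounded to the unit cube
doa = np.tanh(feat @ W_doa).reshape(T, C, 3)     # (x, y, z) per class

active = sed > 0.5                               # binarised detections
```

Tying each DOA triple to a class index is what lets the network associate multiple simultaneous DOAs with their respective event labels over time.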
Convincing audio for games and virtual reality requires modeling directional propagation effects. The initial sound's arrival direction is particularly salient and derives from multiply-diffracted paths in complex scenes. When source and listener straddle occluders, the initial sound and multiply-scattered reverberation stream through gaps and portals, helping the listener navigate. Geometry near the source and/or listener reveals its presence through anisotropic reflections. We propose the first precomputed wave technique to capture such directional effects in general scenes comprising millions of polygons. These effects are formally represented with the 9D directional response function of 3D source and listener location, time, and direction at the listener, making memory use the major concern. We propose a novel parametric encoder that compresses this function within a budget of ~100MB for large scenes, while capturing many salient acoustic effects indoors and outdoors. The encoder is complemented with a lightweight signal processing algorithm whose filtering cost is largely insensitive to the number of sound sources, resulting in an immediately practical system.
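Parametric encoding of this kind works by reducing each directional response to a few perceptually salient parameters instead of storing the full 9D function. The paper's encoder and its parameter set are far richer (including arrival direction and directional reverberation); the sketch below merely illustrates the idea on a toy impulse response, extracting an onset delay and a loudness value. The `encode_params` helper, threshold, and signal are all hypothetical.

```python
import numpy as np

fs = 48_000
rng = np.random.default_rng(7)

# Toy response: direct sound at 5 ms followed by a decaying noise tail
t0 = int(0.005 * fs)
ir = np.zeros(2048)
ir[t0] = 1.0
ir[t0:] += 0.05 * rng.standard_normal(2048 - t0) * np.exp(-np.arange(2048 - t0) / 400)

def encode_params(ir, fs, thresh_db=-20.0):
    """Keep a few salient parameters instead of the full response."""
    peak = np.abs(ir).max()
    onset = int(np.argmax(np.abs(ir) > peak * 10 ** (thresh_db / 20)))
    loudness = 10 * np.log10(np.sum(ir ** 2))
    return {"delay_s": onset / fs, "loudness_db": loudness}

params = encode_params(ir, fs)
```

Storing a handful of such parameters per source/listener cell, rather than full responses, is what makes a ~100 MB budget feasible for scenes with millions of polygons, and rendering from parameters keeps the per-source filtering cost nearly constant.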