
An Overview on Networked Music Performance Technologies


Abstract

Networked Music Performance (NMP) is a potential game changer among Internet applications, as it aims at revolutionizing the traditional concept of musical interaction by enabling remote musicians to interact and perform together through a telecommunication network. Ensuring realistic performance conditions, however, constitutes a significant engineering challenge due to the extremely strict requirements in terms of network delay and audio quality, which are needed to maintain a stable tempo, a satisfying synchronicity between performers and, more generally, a high-quality interaction experience. In this paper we offer a review of the psycho-perceptual studies conducted in the past decade, aimed at identifying latency tolerance thresholds for synchronous real-time musical performance. We also provide an overview of hardware/software enabling technologies for NMP, with a particular emphasis on system architecture paradigms, networking configurations, and applications to real use cases.
2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2016.2628440, IEEE Access
An Overview on Networked Music Performance
Cristina Rottondi, Chris Chafe, Claudio Allocchio, Augusto Sarti
Dalle Molle Institute for Artificial Intelligence (IDSIA)
University of Lugano (USI) - University of Applied Science and Arts of Southern Switzerland (SUPSI)
Center for Computer Research in Music and Acoustics, Stanford University, California, USA
Consortium GARR, Rome, Italy
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Italy
Index Terms—Music; Audio Systems; Audio-visual Systems;
Networked Music Performance; Network Latency.
Networked Music Performance (NMP) represents a me-
diated interaction modality characterized by extremely strict
requirements on network latency. Enabling musicians to per-
form together from different geographic locations requires
capturing and transmitting audio streams through the Internet,
which introduces packet delays and processing delays that can
easily have an adverse impact on the synchronicity of the performance.
Computer-aided musical collaboration has been investigated
starting from the early ‘70s, when musicians and composers,
inspired by the electroacoustic music tradition, began ex-
ploiting computer technologies as enablers for innovative
manipulations of acoustic phenomena (see [1] for a historical
overview of related works of sonic art). In the past two decades
the massive growth of the Internet has greatly widened the
opportunities for new forms of online musical interactions. A
categorization of computer systems for musical interactions is
offered in [1], which includes:
• local interconnected musical networks ensuring interplay between multiple musicians who simultaneously interact with virtual instruments [2];
• musical team-composing systems allowing for asynchronous exchange and editing of MIDI (Musical Instrument Digital Interface) data [3]–[6] or recreating virtual online rooms for remote recording sessions based on distributed systems connected through centralized servers;
• shared sonic environments, which take advantage of distributed networks by involving multiple players in improvisation experiments, as well as audience participation events [7]–[10];
• remote music performance systems supporting real-time synchronous musical interactions among geographically displaced musicians.
NMP focuses on the last of the above categories and aims
at reproducing realistic environmental conditions for a wide
range of applications from tele-auditions, remote music teach-
ing and rehearsals, to distributed jam sessions and concerts.
However, several aspects of musical interactions must be taken
into account. Musicians practicing in the same room rely on
several modalities in addition to the sounds generated by
their instruments, including sound reverberation within the
physical environment and visual feedback from movements
and gestures of other players [11]. Though communication
technologies are still not sufficiently advanced to reliably
and conveniently reproduce all the details of presence in
musical performances, some technical necessities to enable
remote interaction can be identified [12]. In particular, from
the networking point of view, very strict requirements in terms
of latency and jitter must be satisfied to keep the one-way end-
to-end transmission delay below a few tens of milliseconds.
According to several studies [13], [14], the delay tolerance
threshold is estimated to be 20–30 ms, corresponding to a
distance of 8-9 m (considering the speed of sound propagation
in air), which is traditionally considered as the maximum
physical separation ensuring the maintenance of a common
tempo without a conductor. However, this threshold varies a
great deal depending on the abilities of the musician, his/her
own stylistic expressions, and the strategies he/she applies to
cope with delayed self-sound and other feedback affecting
tempo. Several studies reported in [15], [16] show that an
asynchronism of up to 30–50 ms (due either to the spatial dislocation
of the performers, the delay in the auditory feedback
introduced by the instrument itself, or the reaction delay elapsing
between motor commands, the musician’s corresponding
haptic sensations, and audio feedback) is largely tolerated
and even consciously emphasized to achieve specific stylistic
effects. For example, organ players naturally compensate for the
delay between pressing the keyboard keys and hearing the
emitted sound, due to the great physical displacement between
keyboard and pipes. The same applies to piano performance
in which the time elapsed between the pressing of a key
and the corresponding note onset varies between 30 and 100
ms according to the musical dynamics (sound loudness) and
articulation (e.g. legato, staccato) [17].
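The correspondence between delay and physical separation quoted above can be checked with a short calculation. The sketch below is only illustrative: it assumes a speed of sound of 343 m/s (dry air at roughly 20 °C), a value not stated in the text, and the function names are hypothetical.

```python
# Assumed constant (not given in the text): speed of sound in dry air at ~20 °C.
SPEED_OF_SOUND_M_PER_S = 343.0


def delay_to_distance_m(delay_ms: float) -> float:
    """Distance (in metres) that sound travels in air during delay_ms milliseconds."""
    return SPEED_OF_SOUND_M_PER_S * delay_ms / 1000.0


def distance_to_delay_ms(distance_m: float) -> float:
    """Acoustic propagation delay (in milliseconds) over distance_m metres."""
    return 1000.0 * distance_m / SPEED_OF_SOUND_M_PER_S


if __name__ == "__main__":
    # Lower bound, midpoint and upper bound of the 20-30 ms tolerance range.
    for delay_ms in (20.0, 25.0, 30.0):
        print(f"{delay_ms:.0f} ms -> {delay_to_distance_m(delay_ms):.2f} m")
```

Under this assumption the midpoint of the 20–30 ms range (25 ms) maps to about 8.6 m, which falls within the 8-9 m separation mentioned above.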
In addition to subjecting players to remote topologies which
violate their rhythmic comfort zones, a particularly hard digital
challenge in NMP frameworks is synchronization of audio
streams emitted by devices which do not share the same clock.
Clock drift issues arise over minutes of performance which
can create under-run (i.e. the condition occurring when the
application buffer at the receiver side is fed with data at a
lower bit-rate than that used by the application to read from
the buffer, which obliges the application to pause the reading
from time to time to let the buffer refill) or over-run conditions
(i.e., when the buffer is fed at a higher bit-rate than the
application reading rate, which leads to losses when incoming
data find the buffer completely full). Operating systems of
general purpose computers introduce processing delays of up to
a few milliseconds, which, in turn, affect the data acquisition
and timestamping procedures at both capture and playback
sides. In order to ensure an accurate stream alignment, signal
processing techniques must be adopted to truncate or pad audio
data blocks without introducing artifacts that impair the
perceptual audio quality. The same approaches must be adopted
in case of network packet loss or to compensate for the effects
of network jitter: if a packet reaches its destination after its
scheduled playback time, its audio data is no longer valid.
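To give a feel for why clock drift only shows up after minutes of performance, the following sketch estimates how long a receive buffer survives before it under-runs or over-runs. It is a simplified model, not taken from any specific NMP framework: the drift figure in parts per million, the buffer sizes and the function name are all hypothetical.

```python
def buffer_fault_forecast(fill_samples, capacity_samples, nominal_rate_hz, drift_ppm):
    """Return (seconds_until_fault, kind) for a receive buffer.

    drift_ppm > 0: the sender's clock runs fast relative to the receiver's,
    so the buffer gains samples and eventually over-runs.
    drift_ppm < 0: the buffer loses samples and eventually under-runs.
    """
    # Net number of samples gained (or lost) by the buffer every second.
    net_gain_per_s = nominal_rate_hz * drift_ppm * 1e-6
    if net_gain_per_s > 0:
        return (capacity_samples - fill_samples) / net_gain_per_s, "over-run"
    if net_gain_per_s < 0:
        return fill_samples / -net_gain_per_s, "under-run"
    return float("inf"), "none"


if __name__ == "__main__":
    # Hypothetical setup: 48 kHz audio, 512-sample buffer that starts half full,
    # and a 50 ppm mismatch between the two sound-card clocks.
    seconds, kind = buffer_fault_forecast(256, 512, 48_000, 50.0)
    print(f"{kind} after about {seconds:.0f} s")
```

With these assumed numbers the buffer drifts by 2.4 samples per second, so a fault occurs after a little under two minutes unless samples are padded or truncated in the meantime.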
Given the wide variety of approaches to networked musical
interactions which have been developed so far, the contribution
of this survey is threefold. We first provide an in-depth analysis
of the factors influencing musicians’ latency acceptability
thresholds. We then discuss the contributions to the overall de-
lay experienced by NMP users along the audio streaming chain
and identify system parameters affecting latency and perceived
audio quality. We also provide a comprehensive overview
of existing NMP frameworks and discuss hardware/software
technologies supporting remote musical performances.
In particular, in Section II we discuss the methodologies
for real-time audio streaming and strategies which minimize
delay introduced by audio acquisition, processing and trans-
mission. In Section III we summarize the results of several
perceptual studies aimed at identifying the ranges of latency
tolerance in different NMP scenarios, focusing on the impact
of environmental and instrumental characteristics (e.g. acoustic
reverberation and timbre), musical features (e.g. rhythmic
complexity) and interpretative choices (e.g. respective rela-
tionship of leading parts, presence of a conductor). Then,
in Section IV we discuss the instrument-to-ear delay that is
perceived by an NMP performer due to the various processing
and transmission stages required for audio streaming, and
identify the system parameters affecting such contributions.
Several state-of-the-art NMP frameworks are comparatively
reviewed in Section V, detailing their architectural, hardware
and software characteristics. A discussion on future research
directions in the field of Networked Music Performance is
provided in Section VI. Finally, conclusions are drawn in
Section VII.
A. Musical Data Streaming and Prediction
Though the majority of NMP systems are designed to
convert sound waves generated by a generic audio source
into transmitted digital signals (which ensures full compatibility
with non-electronic instruments and voice), alternative
paradigms have also been considered to avoid transmission of
audio streams through the network, thus reducing bandwidth
requirements and improving scalability. Some frameworks
use MIDI to transport synthetic audio contents, thus being
suitable only for electronic instruments. Computer-controlled
instruments such as the Yamaha Disklavier offer practical NMP
capabilities: these pianos are equipped with a measurement
unit storing the information derived from key shutters (usually
located below the key and at the hammer shank), and a
reproduction unit controlling solenoids below the back of
each key. The gestural data measured during the pianist’s
performance can be communicated to a remote instrument via
an Internet connection using a proprietary protocol, so that
the key strokes actuated by the pianist are reproduced at the
remote side.
A gestural data codification approach (using audio signal
recognition) has been implemented for percussion [18] and is
combined with a prediction mechanism based on the analysis
of previous musical phrases. Methodologies drawn from stud-
ies on computer accompaniment systems have been applied
to NMP by modeling each performer as a local agent and
recreating the performed audio at the remote side by means
of a combination of note onset detection and score following
techniques, based on pre-recorded audio tracks [19]. Bayesian
networks for the estimation and prediction of musical timings
in NMP have also been investigated [20].
Motion-tracking technologies have been employed in NMP
to create graphics for remote orchestra conducting which
outperform the latencies of traditional video acquisition [21],
or for prediction of percussion hits based on a drummer’s
gesture, so that the sound can be synthesized at a receiver’s
location at approximately the same moment the hit occurs at
the sender’s location [22]. Frameworks for networked score
communication/editing [23] and score-following [24] in sup-
port of NMP have also been developed.
B. Strategies for Delay Compensation
In [25], playing strategies between pairs of musicians are categorized according to the network latency conditions into the three approaches listed below. The first two result from acoustical situations with a varying amount of delay between the players (temporal separation) and depend on the type of music (style, tempo, etc.). The last is a delay-compensating technique in which the acoustical situation is manipulated electronically to add a certain amount of self-delay to offset the perceived temporal separation of the other musician.
1) realistic interaction: this is the conventional musical
interaction approach, enabling full, natural interplay be-
tween the musicians and implying no awareness of delay.
Interactions of this type are the only way to achieve truly
acceptable performances by professional players.
2) master-slave: One of a pair of players assumes a leading
role and establishes the musical timing exclusively while
ignoring the audio from the other player, who adapts to it
and follows without significant inconvenience. Network
delays can be tolerated up to a self-delay tolerance
threshold (usually around 100–200 ms [26]).
3) delayed feedback: an alternative approach consists in
artificially adding a self-delay to the musicians’ own
audio feedback, up to an amount equal to the audio
round trip time of the full NMP circuit. At that level,
it is perceived as synchronized with audio generated by
the remote counterpart. A variant to this solution adds
a self-delay equal to the one-way network delay and
requires the usage of a metronome (or any other cue)
which must sound in perfect sync at both sides. However,
such conditions may be difficult to achieve, due to drift
issues in the synchronization of the local clocks [27].
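The amount of artificial self-delay used by the delayed feedback strategy can be sketched as follows. This is a minimal illustration that assumes a symmetric network path and neglects acquisition and processing delays, so that the round trip time is simply twice the one-way delay; the function name and the `shared_metronome` flag are hypothetical.

```python
def delayed_feedback_self_delay_ms(one_way_delay_ms: float,
                                   shared_metronome: bool = False) -> float:
    """Self-delay to add to a musician's own audio feedback (in milliseconds).

    Assumptions (not from the source): symmetric path, no processing delay.
    - Default: self-delay equals the audio round trip time (2 x one-way delay).
    - With a cue sounding in sync at both sides: self-delay equals the
      one-way network delay.
    """
    if one_way_delay_ms < 0:
        raise ValueError("delay must be non-negative")
    if shared_metronome:
        return one_way_delay_ms
    return 2.0 * one_way_delay_ms


if __name__ == "__main__":
    for one_way in (15.0, 30.0, 60.0):
        full = delayed_feedback_self_delay_ms(one_way)
        variant = delayed_feedback_self_delay_ms(one_way, shared_metronome=True)
        print(f"one-way {one_way:.0f} ms: self-delay {full:.0f} ms "
              f"(or {variant:.0f} ms with a shared metronome)")
```

The resulting value can then be compared with the self-delay tolerance threshold quoted above for the master-slave approach to judge whether the added delay is likely to be acceptable to the player.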
Various studies of the effects of delay on live musical
interactions have appeared in the last decade. Table I sum-
marizes the characteristics of the tests reported in each of
them, including the types of instruments and the musical
pieces performed, the latency and tempo ranges tested, and
the quality metrics applied for the numerical assessments.
A first group of experiments focuses on rhythmic patterns performed by hand-clapping by both musicians and untrained subjects [28]–[33], whereas other studies [26], [34], [35], [38], [40]–[42] consider a wide range of acoustic and electronic instruments, performing either predefined rhythmic sequences or classical/modern musical pieces. One work [36] focuses on theatrical opera pieces involving singers, a piano player and a conductor, and investigates the effects of network latency on both audio and video data. Combined transmission of audio and video data is considered in [38], [42].
Fig. 1: Typical testbed configuration for tester pairs. Latency can be introduced either by digital delay on the central experiment computer or by a network emulator (shown).
Fig. 2: Testbed configuration for experiments in [36]
A. Test Setup and Description
With the exception of [36], all the studies listed in Table
I use similar test setups consisting of two sound-isolated
anechoic rooms, each one hosting one musician, as depicted in
Fig. 1. The subjects hear each other by means of headphones.
Visual interactions between the musicians are prevented. Au-
dio signals are captured by means of microphones, and either
connected to a central experiment computer which inserts
digital delay, or are directly converted in the room, packetized
and transmitted via a wired Local Area Network (LAN) to
the counterpart. Each subject hears his/her own instrument
without additional delay, whereas the audio feedback from
the counterpart is delayed by an amount that is electronically
added by the central experiment computer or at the network
interface via dedicated network impairment software (e.g.
Netem [43]). Apart from the experiments in [40], where the
behavior of a telecommunication network in terms of variable
transmission delay and jitter is emulated by generating random
packet delays according to a normal statistical distribution, the
testbeds introduce constant packet delays.
TABLE I: Summary of latency tolerance tests
Gurevich et al. [28]–[31]: hand clapping; predefined rhythmic patterns; tempo 60–120 BPM; latency 1–77 ms; quality metrics: BPM trend, BPM slope.
Farner et al. [32]: hand clapping; predefined rhythmic patterns; tempo 86–94 BPM; latency 6–67 ms; musical features: reverberation; quality metrics: BPM trend, BPM slope, initial BPM value, imprecision, asymmetry, subjective rating.
… et al.: hand clapping; predefined rhythmic patterns; tempo 90 BPM; latency 30–90 ms; quality metrics: steady-state BPM, time difference, subjective rating.
… et al.: tempo NA; latency 6–206 ms; musical pieces: “Twelve Duets k.487”, No. 2 and No. 5 (W.A. Mozart); quality metrics: Mean Pacing, Mean Regularity, Mean Asymmetry, Subjective rating.
Carot et al. [35]: instruments: bass, sax…; predefined rhythmic patterns; tempo 60–160 BPM; latency 1–75 ms; musical pieces: “The days of wine and roses” (H. Mancini, J. …); quality metrics: Subjective rating.
… et al.: predefined rhythmic patterns; tempo 80 BPM; latency 25–120 ms; musical features: attack time; quality metrics: BPM trend, BPM dynamic time warping.
… et al.: tempo NA; latency 15–135 ms, 60–180 ms; musical pieces: “Il core vi dono..” (W.A. Mozart, from “Cosi fan tutte”), “Ah! Voi signor” (G. Verdi, from “La Traviata”), “Bess you are my woman” (G. Gershwin, from “Porgy and Bess”); quality metrics: Subjective rating, galvanic skin response, skin … response, BPM dynamic time warping analysis, BPM curvature points.
Chew et al. [37]–…: instruments: piano; tempo 46–160 BPM; latency 0–150 ms; musical pieces: Sonata for Piano Four-Hands (F. Poulenc); quality metrics: Subjective rating, BPM segmental ….
… et al.: tempo 80–132 BPM; latency 15–75 ms; musical pieces: “Bolero” (M. Ravel), “Master blaster” (S. Wonder), “Yellow submarine” (The Beatles); musical features: … musical part; quality metrics: BPM trend, BPM slope, subjective ….
… et al.: instruments: MIDI pi…; tempo N.A.; latency 0–100 ms; musical pieces: demonstrative monodic …; quality metrics: onset global and local phase difference.
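The jitter emulation described for [40], in which per-packet delays are drawn from a normal distribution, can be sketched in a few lines. This is not the testbed's actual implementation: the mean, standard deviation and playout deadline are hypothetical values, and clipping negative samples to zero is an assumption added here so that no packet arrives before it is sent.

```python
import random


def emulate_packet_delays(n_packets, mean_ms, std_ms, seed=None):
    """Draw one delay per packet from a normal distribution (in milliseconds).

    Negative samples are clipped to 0 ms (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(mean_ms, std_ms)) for _ in range(n_packets)]


def late_packet_fraction(delays_ms, playout_deadline_ms):
    """Fraction of packets arriving after their scheduled playback time."""
    late = sum(1 for d in delays_ms if d > playout_deadline_ms)
    return late / len(delays_ms)


if __name__ == "__main__":
    # Hypothetical scenario: 30 ms mean delay, 5 ms jitter, 40 ms playout deadline.
    delays = emulate_packet_delays(10_000, mean_ms=30.0, std_ms=5.0, seed=42)
    print(f"late packets: {100.0 * late_packet_fraction(delays, 40.0):.1f} %")
```

Setting `std_ms` to zero reproduces the constant-delay condition used by the other testbeds.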
In [35], [38], the addition of self-delay is introduced. In
[32], [35], configurations with artificially added reverbera-
tion are tested. The tests in [38], [42] assume a combined
streaming of unsynchronized audio and video data, but no
specific investigation on the impact of video usage is provided.
Conversely, the experiments in [36] require a testbed with
three acoustically-insulated rooms, each one equipped with
microphones/headsets, cameras and screens (see Fig. 2). Both
synchronized and unsynchronized audio/video transmissions
are tested. Each session involves 2-3 singers, a conductor and a
pianist: the singers are located in two rooms and the conductor
in the third room. According to the specific test session, the
pianist performs either in the conductor’s room or in one of
the singers’ rooms.
Fig. 3: Rhythmic pattern considered in [28]–[33], [35]
Almost all the reviewed studies report that the rhythmic patterns/music scores were provided to the testers in advance and that they were free to practice together in the same room until they felt comfortable with the performance. In [34], [38], scenarios in which the testers could practice at each network latency level and develop strategies to compensate for the audio latency are also considered. All the tests requiring the execution of rhythmic patterns consider the rhythmic structure reported in Fig. 3, whereas the repertoire used for the tests on musical pieces is reported in Table I. Tests focusing on hand clapping consider a population of subjects with and without musical education, whereas the remaining ones assume prior musical education at amateur or professional level.
Fig. 4: Coupled oscillators with delay τ controlled by phase detectors (PDs) [33].
In all the experiments, the order of the tested network delay scenarios was randomly chosen and kept undisclosed to the testers, in order to avoid biases or conditioning. In [26], [28]–[33], [35], [40], some bars of beats at the reference tempo δ, expressed in Beats Per Minute (BPM), were provided to the players before the beginning of each single performance, either by an instructor or by means of a metronome. In [32]–[36], [38], [40], once each trial was concluded, the subjects were asked to provide a subjective rating of the performance within a predefined scale of ranges or to answer a qualitative questionnaire.
B. Analytical Models
Some insight into the impact of delay can be gathered in
an analytical fashion by developing interaction models and
studying their behavior. Two types of approaches have been
used, one which uses signal-based, non-parametric method-
ologies based on measured time-series information [44]–[46]
and the other starting from parametric representations of
the nonlinear interaction between dynamical systems [33].
The latter approach derives a qualitative description of the
impact of delay from a rough analytic prediction of the tempo
evolution as a function of the network delay. Measured results
are loosely in agreement.
In both [33], [46], interacting performers are modeled by
coupled oscillators as in Fig. 4. As we can see, each oscillator
monitors its own pace as well as that of the other oscillator
through a delayed observation. The frequency correction (con-
trol input) of each oscillator depends on the phase difference
measured at the corresponding Phase Detector (PD) [33],
[47], [48]. Each oscillator, however, has its own free running
frequency, which is what is attained in the absence of input
control. A closed-form analysis of the behavior of these
coupled dynamical systems in [33] reveals that the oscillation frequency of the resulting system settles to the value
\[ \Omega = \frac{\omega}{1 + K\tau}, \qquad (1) \]
where ω is the mean value between the two natural oscillation frequencies, τ is the network delay and K is a constant that describes the state-update equations of the oscillators [33]. Accompanying experimental results obtained by directly measuring the steady-state tempo confirm the model’s validity, though in other, more complex tempo evaluation experiments [29], [40], [46], other phenomena seem to have an influence on the evolving tempo, possibly related to the role of the musician, the rhythmic complexity, etc. This is further described in Section III-D.
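A numerical reading of the coupled-oscillator result may help. The sketch below assumes that Equation (1) has the form Ω = ω / (1 + Kτ), with ω the mean of the two natural frequencies; the value of K used in the example is purely hypothetical (chosen only so that Kτ is dimensionless when τ is in seconds) and is not taken from [33].

```python
def steady_state_frequency(omega_a, omega_b, coupling_k, delay_s):
    """Steady-state frequency of two delay-coupled oscillators.

    Assumed form of Equation (1): Omega = omega_mean / (1 + K * tau).
    omega_a, omega_b: natural frequencies of the two performers (any unit,
    e.g. BPM); the result is expressed in the same unit.
    coupling_k: constant K in 1/s (hypothetical value in the example below).
    delay_s: network delay tau in seconds.
    """
    omega_mean = 0.5 * (omega_a + omega_b)
    return omega_mean / (1.0 + coupling_k * delay_s)


if __name__ == "__main__":
    # Two performers with natural tempi of 88 and 92 BPM, and K = 2.0 1/s.
    for delay_ms in (0, 20, 50, 100):
        tempo = steady_state_frequency(88.0, 92.0, 2.0, delay_ms / 1000.0)
        print(f"tau = {delay_ms:3d} ms -> steady-state tempo {tempo:.1f} BPM")
```

Under this assumed form the settled tempo equals the mean natural tempo when τ = 0 and decreases as the delay grows, which is the qualitative behavior the model is meant to capture.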
An alternate approach to studying the impact of delay
on the resulting tempo is described in [28]. Gurevich et al.
modeled performers as memoryless systems that have no prior
knowledge of the tempo and react instantaneously to the last
detected beat without introducing arbitrary tempo deviations.
Under such strong assumptions, the instantaneous tempo δ(n) of the musical performance can be computed as:
\[ \delta(n) = \frac{60}{\frac{60}{\delta} + n\tau}, \qquad (2) \]
where 60/δ is the quarter note interval (in seconds) with reference tempo δ (in BPM), n is the number of elapsed quarter pulses and τ is the end-to-end delay, which is assumed to be two-way symmetrical. If τ = 0, then δ(n) = δ, otherwise δ(n) decreases (less than linearly) with n. As performers tend to perceive tempo over intervals that are longer than a single beat, we expect them to largely outperform the model described by Equation 2. As a matter of fact, as confirmed by the numerical results shown in [28], the value of δ(n) can be taken as a lower bound of the real performance tempo. In [46], the model was contrasted with a coupled oscillator model which includes anticipation. Predicted tempo values were closer to (but not the same as) the measured tempi.
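The memoryless model lends itself to a direct numerical check. The sketch below assumes that Equation 2 reads δ(n) = 60 / (60/δ + nτ), which is the form consistent with the definitions given above (quarter note interval 60/δ, n elapsed quarter pulses, delay τ in seconds); the function name and the example values are hypothetical.

```python
def memoryless_tempo_bpm(reference_bpm, delay_s, n):
    """Instantaneous tempo after n quarter pulses under the memoryless model.

    Assumed form of Equation 2: delta(n) = 60 / (60 / delta + n * tau),
    i.e. every elapsed pulse lengthens the beat interval by the delay tau.
    """
    return 60.0 / (60.0 / reference_bpm + n * delay_s)


if __name__ == "__main__":
    # Hypothetical example: 90 BPM reference tempo and a 30 ms end-to-end delay.
    for n in (0, 4, 8, 16, 32):
        print(f"n = {n:2d}: {memoryless_tempo_bpm(90.0, 0.030, n):.1f} BPM")
```

The printed sequence decreases with n, but by progressively smaller steps, matching the "less than linearly" behavior stated in the text.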
C. Quantitative Performance Assessment Metrics and Indicators
The metrics proposed in the scientific literature to evaluate
the quality of a networked musical interaction can be organized
in two macro-categories, i.e. subjective and objective metrics.
The former category includes opinion scores provided by the
musicians to evaluate various aspects of their performance e.g.,
their emotional connection with the remote musician [36], the
perceived delay [40] and level of asynchrony with respect to
their counterpart [33], their willingness to use an NMP system
based on their personal experience during the tests [33], or a
global rating on the overall quality of experience [32], [34]–
[36], [38], [40].
The latter category comprises numerical attributes extracted from the recorded audio tracks with the following procedure: first, the time instants $t_n$ at which the n-th quarter onset/hand-clap occurs are identified (either manually or by means of a peak search algorithm). Then, Inter-Onset Intervals (IOIs, measured in seconds) between quarter notes are computed as $IOI_n = t_{n+1} - t_n$. Finally, the conversion of IOI to actual tempo (in BPM) is obtained as $\delta(n) = 60/IOI_n$.
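The onset-to-tempo procedure just described can be sketched as follows. Onset detection itself (manual annotation or peak search) is outside the scope of this example, which starts from a list of onset times in seconds; the function names and the onset values are hypothetical.

```python
def inter_onset_intervals(onsets_s):
    """IOI_n = t_(n+1) - t_n for consecutive quarter-note onsets (in seconds)."""
    return [t_next - t_cur for t_cur, t_next in zip(onsets_s, onsets_s[1:])]


def instantaneous_tempo_bpm(onsets_s):
    """delta(n) = 60 / IOI_n, one tempo value (in BPM) per inter-onset interval."""
    return [60.0 / ioi for ioi in inter_onset_intervals(onsets_s)]


if __name__ == "__main__":
    # Made-up onset times: two beats at 120 BPM followed by a slower one.
    onsets = [0.0, 0.5, 1.0, 1.75]
    print(inter_onset_intervals(onsets))
    print(instantaneous_tempo_bpm(onsets))
```

A sequence of N onsets yields N − 1 intervals and therefore N − 1 tempo values.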
Based on δ(n) and on the sequence of IOIs, the following
metrics can be computed:
• Pacing, π [34]: mean IOI computed over the whole trace as $\pi = \frac{1}{N}\sum_{n=1}^{N} IOI_n$, where N is the number of IOIs in the trace;
• Regularity, ρ [34]: coefficient of variability of the sequence of IOIs, calculated as the IOI standard deviation to mean ratio;
• Asymmetry, α [29], [32], [34]: mean time that performer B lags behind performer A, measured as $\alpha = \frac{1}{N}\sum_{n=1}^{N} (t_n^B - t_n^A)$, where $t_n^A$ and $t_n^B$ are the n-th onset instants of performers A and B;
• Imprecision, µ [32]: standard deviation of the inter-subject time differences, measured as $\mu = \sqrt{\frac{1}{N-1}\sum_{n=1}^{N} (t_n^B - t_n^A - \alpha)^2}$;
• Tempo Slope, κ [28], [30], [40]: the sequence δ(n) can be linearly interpolated to estimate tempo slope. Positive slopes indicate a tendency to acceleration, whereas negative slopes indicate a tempo decrease. Though the definition of tempo slope may at first glance appear contradictory w.r.t. the assumption of existence of an asymptotic steady-state tempo postulated in [33], it is worth noticing that in most cases the audio trace is divided into time windows of a few seconds' duration and the trend of the actual tempo δ(m) maintained during the m-th time window is obtained as a function of the average IOI over the onsets that occurred within the window duration [26], [36]. It follows that the tempo slope may exhibit fluctuations over time. In particular, indications from experiments (e.g. the ones reported in [40]) show that negative slopes often occur during the first few seconds of networked musical interaction, especially in the presence of high latency. With the passing of time, the performance often tends to stabilize around a lower BPM, meaning that the slope κ assumes near-to-zero values, in accordance with the trend obtainable by means of Equation (2). However, this is not always the case: in some conditions the players are unable to reach a stable tempo and the tempo decreases monotonically, which eventually leads to the interruption of the performance. Therefore, the steady-state tempo (when achieved) can be defined as the value of δ(m) corresponding to near-to-zero values of the tempo slope κ.
Additional metrics (not discussed here) can also be extracted
from the time warping analysis of the tempo trend δ(n).
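The objective metrics listed above can be prototyped with the Python standard library. The sketch below makes several assumptions that are not fixed by the text: the standard deviation is the sample (N − 1) estimator, asymmetry is computed as onset of B minus onset of A over paired onsets, and the tempo slope is obtained with an ordinary least-squares line fit over (time, tempo) points. All function names and data values are hypothetical.

```python
from statistics import mean, stdev


def pacing(iois):
    """Pacing: mean inter-onset interval over the whole trace (seconds)."""
    return mean(iois)


def regularity(iois):
    """Regularity: coefficient of variability, IOI standard deviation / mean."""
    return stdev(iois) / mean(iois)


def asymmetry(onsets_a, onsets_b):
    """Asymmetry: mean time by which performer B lags behind performer A."""
    return mean([b - a for a, b in zip(onsets_a, onsets_b)])


def imprecision(onsets_a, onsets_b):
    """Imprecision: standard deviation of the inter-subject time differences."""
    return stdev([b - a for a, b in zip(onsets_a, onsets_b)])


def tempo_slope(times_s, tempi_bpm):
    """Tempo slope (BPM per second) from a least-squares line fit."""
    t_mean, d_mean = mean(times_s), mean(tempi_bpm)
    num = sum((t - t_mean) * (d - d_mean) for t, d in zip(times_s, tempi_bpm))
    den = sum((t - t_mean) ** 2 for t in times_s)
    return num / den


if __name__ == "__main__":
    # Made-up paired onsets (seconds) for performers A and B.
    onsets_a = [0.0, 0.5, 1.0, 1.5]
    onsets_b = [0.02, 0.53, 1.01, 1.52]
    print("asymmetry (s):", asymmetry(onsets_a, onsets_b))
    print("imprecision (s):", imprecision(onsets_a, onsets_b))
    # Made-up windowed tempo values showing a steady deceleration.
    print("slope (BPM/s):", tempo_slope([0, 1, 2, 3], [100, 98, 96, 94]))
```

A negative slope, as in the last example, corresponds to the deceleration tendency discussed for high-latency conditions; a slope close to zero indicates that a steady-state tempo has been reached.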
D. Quantitative and Qualitative Results
The main goal of all the experiments is the identification
of latency ranges allowing for satisfactory real-time musical
interactions and the investigation of the impact on such ranges
of musical features including the rhythmic complexity of
the performed pieces, the effect of the instruments’ timbral
characteristics and attack time, and the musical role of the
performed part. Other factors affecting sensitivity to delay are
the acoustical conditions (i.e. presence/absence of reverbera-
tion), the level of musical training of the performers and the
role of delay-compensating strategies. In the following, we
summarize the main outcomes of the surveyed experiments.
1) Effects of Latency on the Performed Tempo: Studies
[28]–[32], [40] report the trend of the tempo slope κaveraged
over multiple trials with reference tempo δin the range 8694
BPM [28]–[32] and 60 132 BPM [40].
In all the reported results, positive tempo slopes occur for
latency values below 1015 ms. The authors of [29] postulate
such behavior as the consequence of an intrinsic tendency
to anticipate which has been already identified in studies
on negative mean asynchrony in metronome-based tapping
experiments (see [44] for an overview). In the range from
1015 to 2025 ms, performers are generally able to maintain
a stable tempo, very close to the reference δ. Opinion ratings
agree, providing positive/very positive evaluation ratings of
the overall performance quality [32], [34], [40] and latency is
either not perceived at all or is slightly noticed [40]. Delay
becomes clearly perceivable in the range between 20 25
and 50 60 ms, when the quality of the performance starts
deteriorating: the performers exhibit a pronounced tendency to
decelerate (i.e., κassumes negative values, whereas the pacing
πincreases) and their quality ratings consistently diminish.
Moreover, the values of imprecision and asymmetry, which
remained almost constant for delays below 25 30 ms [29],
[32], [34], [41], start rising. With delays above 60 ms, the
performance is heavily impaired and the latency conditions are
generally judged as barely tolerable [35], [38], [40]. Timings
lose regularity even within a single part (i.e., ρexhibit a
remarkable increase above 60 80 ms). Interestingly, the
segmental tempo analysis conducted in [38] shows that the
highest tempo variations occur in the case of delays in the
range 50 100 ms, whereas delays above 100 ms exhibit
lower tempo variability, though the absolute tempo reduction
is more consistent. As suggested by the authors, a possible
explanation for this phenomenon is that such latency values
are so unacceptable that the players were performing on “auto-
pilot”, disregarding the auditory feedback from the counterpart
and only focusing on maintaining a stable tempo. This kind of
behaviour emerges also from the analysis in [41], which shows
that the average onset phase difference exhibits a peak in the
range 50–80 ms which then decreases again between 80–100
ms, whereas its standard deviation remains almost constant in
case of delays below 80 ms and rises sharply when latency
exceeds 80 ms. The authors motivate this surprising result
as follows: above 80 ms the synchronism of the performance
is so compromised that phase differences between onsets start
fluctuating, i.e. they may assume either positive or negative
values (meaning that one performer may either be lagging or
anticipating the other, without a clear trend, thus explaining
the high standard deviation), which brings the average onset
phase difference closer to 0. These results seem to indicate that
in the experiments conducted in [41] the switch to “auto-pilot”
performing modality did not take place.
Some studies also evaluate the effect of asymmetric delays
[30], [36], concluding that the effects are dominated by the
impact of the highest network delay contribution among the
two directions (forward/backward transmission).
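As a compact restatement of the thresholds discussed above, the following Python sketch maps a one-way delay to the qualitative regimes reported in the surveyed studies. The single cut-off values (15, 25 and 60 ms) and the function names are assumptions chosen for illustration, since the studies report ranges rather than sharp boundaries.

```python
# Illustrative cut-offs in milliseconds. The surveyed studies report ranges
# (10-15, 20-25 and 50-60 ms); these single values are assumptions.
ACCELERATION_BELOW_MS = 15.0
STABLE_BELOW_MS = 25.0
DETERIORATING_BELOW_MS = 60.0


def latency_regime(one_way_delay_ms: float) -> str:
    """Map a one-way delay to the qualitative regime described in the text."""
    if one_way_delay_ms < 0:
        raise ValueError("delay must be non-negative")
    if one_way_delay_ms < ACCELERATION_BELOW_MS:
        return "tendency to accelerate"
    if one_way_delay_ms < STABLE_BELOW_MS:
        return "stable tempo"
    if one_way_delay_ms < DETERIORATING_BELOW_MS:
        return "deceleration, quality deteriorates"
    return "heavily impaired"


def asymmetric_regime(forward_ms: float, backward_ms: float) -> str:
    """With asymmetric delays the larger of the two directions dominates."""
    return latency_regime(max(forward_ms, backward_ms))


if __name__ == "__main__":
    for delay in (5, 20, 40, 90):
        print(f"{delay:3d} ms -> {latency_regime(delay)}")
```

The second helper reflects the finding of [30], [36] that the larger of the two directional delays determines the outcome; it does not model any finer interaction between the two directions.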
2) Impact of Instrument Timbre and Attack Time: Paper
[34] makes a first attempt to investigate the impact of the
instrument choice on the performance quality by comparing
two executions of the same piece, performed by a string
duo and a clarinet duo, respectively: strings exhibit higher
deceleration and asynchrony compared to clarinets for all the
tested latency conditions, though the subjective rating of the
players is unexpectedly slightly higher for the strings duo.
A more thorough investigation on the dependencies between
performance quality and timbral attributes of the played in-
struments is proposed in [40], where the timbre of a set of
seven instruments is characterized by means of the first four
statistical moments of their spectrum magnitude. The study
shows that instruments with high spectrum entropy and flatness
(which are widely used as indicators of sound noisiness) are
more prone to tempo deceleration for latencies above 35 ms.
The corresponding quality ratings provided by the performers
also decrease when spectrum entropy and flatness
increase. The same considerations hold for instruments with
high values of the spectral centroid, which is related to the
sound brightness.
Paper [26] specifically investigates the correlation between
instruments’ perceptual attack times and sensitivity to network
delay: experiments are conducted by asking two players to
modulate the attack strokes of a violin and a cello when
performing with a bow. Results show that, for a given latency
value, slow attack times lead to a more pronounced deceler-
ation w.r.t. sharp attack times. However, a better synchrony
is achieved in case of slow attack times. Though the above
mentioned experiments have highlighted that attack times have
an impact on the performance quality, a more in-depth analysis
is required to verify the applicability of the results to a
wider range of musical instruments. For example, the usage of
electronic instruments with settable/tunable attack times would
make them independent of the execution style of the single
performers, thus enabling a more objective evaluation of the
effects of the attack time variation.
3) Impact of Reverberation Effects: Though all the tests
described take place in semi-anechoic or anechoic rooms, two
studies compare the results with those obtained by adding
artificial reverberation: paper [35] reports that no noticeable
improvements were observed by the players when performing
with reverb, while [32] finds an increase in asymmetry and a
decrease in the regularity within the single parts in anechoic
conditions, whereas artificial reverberations caused a slight
decrease in the initial tempo (evaluated over the first 5 onsets
of each performance).
4) Impact of Rhythmic Complexity: In [30], [32], [35],
[40], experiments are conducted where, for each latency value,
multiple executions of the same rhythmic structure or musical
piece with different reference tempi are performed. Paper
[30] shows that, when performing a fixed rhythmic pattern,
increasing the reference tempo while maintaining a constant
network latency leads to a decrease of the slope κ. Similar
results are provided in [40], where the performed musical
pieces are rhythmically characterized by means of the mean
event density (i.e., the number of distinct onsets per second)
and rhythmic complexity (which is a function of the reference
BPM δ and of the rhythmic figures appearing in the score).
Therefore, the latency tolerance thresholds decrease when the
reference BPM increases, as shown by the performance quality
ratings provided by the musicians in [35], [40]: as an example,
drummers/bassists performing a succession of quarter notes at
δ = 60 BPM could on average tolerate latencies up to 40 ms,
which reduced to only 15 ms (on average) when the reference
tempo was doubled to δ = 120 BPM.
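The dependence on tempo can be made concrete by expressing the tolerated latency as a fraction of the inter-beat interval. The short Python sketch below does this for the two figures quoted above; the helper names are illustrative and are not taken from [35], [40].

```python
def beat_period_ms(bpm: float) -> float:
    """Duration of one quarter-note beat in milliseconds."""
    return 60_000.0 / bpm


def latency_fraction_of_beat(latency_ms: float, bpm: float) -> float:
    """Latency expressed as a fraction of the inter-beat interval."""
    return latency_ms / beat_period_ms(bpm)


if __name__ == "__main__":
    # (reference tempo in BPM, average tolerated latency in ms) from the text
    for bpm, tolerated_ms in ((60, 40.0), (120, 15.0)):
        share = latency_fraction_of_beat(tolerated_ms, bpm)
        print(f"{bpm} BPM: beat = {beat_period_ms(bpm):.0f} ms, "
              f"tolerated latency = {share:.1%} of a beat")
```

In both cases the tolerated latency amounts to a few percent of the beat duration (4% at 60 BPM and 3% at 120 BPM), which is one way to read the reported decrease of the absolute thresholds at faster tempi.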
Moreover, as reported in [32], [40], in the presence of
network delay the higher is the reference BPM value, the lower
is the initial tempo at which the musicians performed the first
few measures (typically corresponding to the first 5–10 s of
musical interaction). These results indicate that the influence
of latency is immediately perceived by the performers, who
start adjusting their tempo from the very first measure. Close
examination of the first cycle of beats in [46] reveals a
switch in phase adaptation (anticipation) with different delay
conditions that is almost instantaneous.
5) Impact of Musical Training, Prior Practice, and Latency
Compensation Strategies: The hand-clapping experiments dis-
cussed in [32] included two sets of subjects, grouped based on
their musical education level. Results show that musicians are
more sensitive to latency, since on average their performance
exhibits a more pronounced deceleration with respect to non-
musicians for a given delay value, and for non-musicians the
average asynchrony is higher.
The effect of prior training with various network delay
configurations is investigated in [34], [38]: the first study
shows that allowing the players to practice at each latency
level, possibly developing common strategies to cope with
the delay, did not lead to noticeable difference in the delay
tolerance thresholds, whereas the second reports that prior
practice reduces the tempo deceleration, improves the syn-
chrony among the two players and the regularity within a
single part (with more pronounced improvements for strings
as opposed to clarinets), whereas no significant variation of
the quality rating is registered. It is worth noting that all
the considered works address the three delay compensation
strategies enumerated in Section I, whereas [35], [38] consider
the introduction of an additive self-delay at one of the two
sides. Such strategy leads to a considerable increase of the
latency acceptance thresholds and quality rating (up to 65 ms
in [38], where the players claim that the performance could
have become “perfect” with further practice, and up to 190
ms in [35], though the testers defined the scenarios where
self-delays exceeded 30 ms as “unnatural”). Conversely, the
adoption of self-delay at both sides led to much lower latency
tolerance levels (at most 80 ms, as reported in [35]), and such
configuration was considered unacceptable by a remarkable
portion of the tested subjects. Such results confirm the ones
obtained in [26] by imposing a delayed auditory self-feedback
during solo performances (i.e., the sound produced by the
performer’s own instrument is delayed by a fixed time lag):
the tolerance ranges vary from 60 to 200 ms, depending on
the reference tempo and type of instrument.
Though some experiments dedicated to the evaluation of
the dependency of steady-state tempo and self-delay in a solo
performance have appeared in [26], none of the above men-
tioned studies attempted to identify a correlation between the
maximum individual latency tolerance applied to the auditory
feedback from the musicians’ own instrument and the quality
of their performance when interacting with a counterpart in
the presence of latency. Since the self-delay tolerance is highly
subjective and depends on a variety of factors including the
instrument type and the proficiency of the musician, it could
indeed serve as benchmark to remove the biases introduced
by such factors during a networked musical interaction.
Finally, the presence of a conductor avoids the recursive
drag on tempo by providing a common cue to the perform-
ers: results reported in [36] counterintuitively show a tempo
increase for high network delays compared to a benchmark
condition in which no network delay is imposed. This scenario
leads to a variation of the master-slave strategy where the
master role is taken by the conductor him/herself. However,
in order to maintain a stable tempo, the conductor typically
ignores the audio feedback from the performers, therefore
she/he cannot adopt any correction strategy as a reaction to the
performers’ execution but relies exclusively on his/her inner
6) Impact of Combining Audio-Video Data: The only in-
vestigation found to date regarding the impact of the de-
synchronization of audio and video data is [36], which focuses
on opera performances. The singers, conductor and pianist
receive both audio and video feedback from two remote
locations: audio and video can be either synchronized or
manipulated for different latency values (within ranges of
15–135 ms for audio and of 60–180 ms for video). Though
the performers claimed that they attributed more importance
to visual contact than to the audio feedback and that they
generally referred to one single cue source (chosen among
the conductor’s gesture, the piano accompaniment or the
audio/gesture of the singing partner) while ignoring the others,
the results of the set of experiments did not lead to a clear
identification of a preferred combination of modality types
and manipulated delays (e.g., synchronized audio but delayed
video, synchronized audio-video but with higher delay, etc.).
Measurements of the electrodermal activity (i.e. the continuous
variations in the electrical characteristics of the skin) of the
performers through galvanic and conductance skin responses,
which provide an indication of the degree of excitation of a
subject, reported a higher level of stress when the pianist was
not located in the same room as the conductor, possibly leading
to desynchronization between the musical accompaniment and the
conductor’s cueing gestures. A comprehensive assessment of
the importance of video signals and the effects of audio-video
misalignment remains to be investigated. It is still unclear
under what conditions combining audio and video data will
improve or negatively affect networked music interactions.
In NMP, the overall delay experienced by the players
includes multiple contributions due to the different stages of
the audio signal transmission: the first is the delay introduced
by the audio acquisition/playout, processing, and
packetization/depacketization at sender/receiver sides; the second is
the pure propagation delay over the physical transmission
medium; the third is the data processing delay introduced by
the intermediate network nodes traversed by the audio data
packets along their path from source to destination; the fourth
is the playout buffering which might be required to compensate
the effects of jitter in order to provide sufficiently low packet
losses to ensure a target audio quality level.
More specifically, as depicted in Figure 5, the Overall One-way
Source-to-Ear (OOSE) delay perceived by the users of an
NMP framework includes:
1) in-air sound propagation from source to microphone;
2) transduction from the acoustic wave to electric signal in
the microphone (negligible);
3) signal transmission through the microphone’s connector;
4) analog to digital conversion (possibly with encoding) and
internal data buffering of the sender’s sound card driver;
5) processing time of the sender’s machine to packetize the
audio data prior to transmission;
6) network propagation, transmission and routing delay;
7) processing time of the receiver’s PC to depacketize (and
possibly decode) the received audio data;
8) application buffering delay;
9) driver buffering of the receiver’s sound card and digital
to analog conversion;
10) transmission of the signal through the headphone or audio
monitor (loudspeaker) connector (negligible);
11) transduction from electric signal to acoustic wave in the
headphones or loudspeaker (negligible);
12) for loudspeakers, in-air sound propagation from loudspeaker
to ear.
Most of the above listed contributions depend on system
parameters such as the audio card sampling rate and resolution,
the audio block and buffer sizes (see Table II), whereas others
(e.g. the network routing delay) are independent of the system
design and cannot be directly controlled. By tuning such
parameters, the experienced end-to-end delay may vary signif-
icantly. However, latency savings can usually be achieved only
at the expense of the audio quality level. Therefore, a trade-
off emerges between system latency, bandwidth requirements
and audio quality. In the following, we discuss in detail the
contributions of each stage to the total delay budget and their
dependencies on the NMP system parameters.
A. Audio Acquisition and Digitization
The in-air sound propagation delay depends on the distance
between the audio source and the microphone and, on the
receiving side, between the loudspeaker and the user’s ears: it
can be made arbitrarily low with an intelligent placement of the
transducers.
Fig. 5: Contributions to OOSE delay [13]. Darker background
colors indicate higher delay contributions.
TABLE II: List of symbols
Symbol  Description                                Range of Values
R       soundcard sampling rate                    8–96 kHz
L       soundcard sampling resolution              16–24 bits/sample
F       soundcard I/O filter length                50–100 audio samples
P       soundcard block size                       64–512 samples
η       codec compression ratio                    0.25–1
Ch      number of audio channels
B       application buffer size                    2–16 blocks
H       total packet data overhead                 <1000 bits
C       bandwidth of the network interface card    0.054–20 Mbit/s
Therefore, in the following we will assume that the in-air
propagation delay prior to the audio signal acquisition is
negligible.
Typically, the soundcard first applies an analog anti-aliasing
low pass filter [49], then samples the signal at rate R, and
quantizes each sample using a given number of bits, L
(i.e., using $2^L$ quantization levels). The delay $D_a$ introduced
by these stages is given by the sampling time 1/R (e.g.,
20.83 µs for R = 48 kHz). An optional coding stage can
be introduced here: audio codecs implement algorithms to
(de)compress audio data according to a given coding format
with the objective of representing the signal with the minimum
number of bits while preserving high-fidelity quality, in order
to reduce the bandwidth required for transmission. Audio
codecs usually rely on sub-band coding, which enables the
inclusion of psycho-perceptual models. The more the sub-bands,
the higher the achievable compression ratio, η (see footnote 3).
However, increasing the number of sub-bands also leads to higher
encoding/decoding algorithmic delays, $D_c$: a comparison of
the performance of the most widely used high-fidelity formats
(e.g. MPEG/MP3 [50]) reported in [51] shows intrinsic latencies
of at least 20 ms, which is hardly acceptable for the
OOSE delay budget. Therefore, low-delay codecs specifically
addressing latency-intolerant applications have recently been
developed: the OPUS [52], [53] (evolved from the prior CELT
[54]), ULD [55] and Wavpack [56] codecs achieve algorithmic
delays as low as 4–8 ms. A thorough performance assessment
of the packet loss concealment techniques implemented in the
OPUS codec is reported in [57]. Modifications of the ULD
codec specifically tailored for NMP have been proposed [58]
with the aim of increasing resiliency to lost and late data
packets. Despite these efforts, several of the currently available
NMP frameworks opt for the streaming of uncoded audio data
(i.e., η = 1, $D_c$ = 0), thus sacrificing the codec bandwidth
savings to avoid additional contributions to the delay budget
(e.g. [59]–[61]).
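To give a feeling for the bandwidth at stake when streaming uncoded audio, the sketch below computes the sampling period 1/R and the raw payload bit rate as the product R·L·Ch·η, using the symbols of Table II. The product formula is an assumption consistent with the symbol definitions (it ignores packet headers), and the function names are illustrative.

```python
def sampling_period_s(R: float) -> float:
    """Sampling time 1/R in seconds for a sampling rate R in Hz."""
    return 1.0 / R


def payload_bit_rate(R: float, L: int, Ch: int, eta: float = 1.0) -> float:
    """Audio payload bit rate in bit/s.

    R: sampling rate (Hz), L: bits per sample, Ch: number of channels,
    eta: compression ratio (1.0 means uncoded audio).
    """
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must lie in (0, 1]")
    return R * L * Ch * eta


if __name__ == "__main__":
    print(f"1/R at 48 kHz: {sampling_period_s(48_000) * 1e6:.2f} us")
    print(f"uncoded stereo, 16 bit: {payload_bit_rate(48_000, 16, 2) / 1e6:.3f} Mbit/s")
    print(f"same stream, eta = 0.25: {payload_bit_rate(48_000, 16, 2, 0.25) / 1e6:.3f} Mbit/s")
```

For a 48 kHz, 16-bit stereo stream this gives 1.536 Mbit/s uncoded and 0.384 Mbit/s at η = 0.25, which illustrates the bandwidth that frameworks streaming uncoded audio give up in exchange for $D_c$ = 0.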
Soundcards often include some form of digital filtering
(graphic equalizer, reverberation, etc.). Depending on the
implementation, this filtering sometimes introduces additional
delay, particularly when implemented in the frequency domain
through Overlap-and-Add processing (Short-Time Fourier
Transform). This, of course, needs to be factored in.
3 Note that η could be either constant or vary according to the specific
audio content and codec implementation. For the sake of simplification, in
the following we will assume constant compression ratios.
B. Soundcard Blocking Delay and Application Buffering
General-purpose computer architectures do not access the
soundcard output on a sample-by-sample basis but handle audio
data in batches of a given number of samples P (the so-called
block size), which is typically a multiple of 64. Therefore,
before retrieving an audio block, computer processors wait for
the generation of P samples. In the meanwhile, the available
samples are stored in a physical input buffer, as depicted in
Figure 6. One block corresponds to a data volume of P·L
bits and introduces a blocking delay $D_s = P/R$ seconds. When
the processor accesses the physical buffer by means of a
callback function, it copies the P block samples into a second
buffer (namely the application buffer) where they will wait to
be processed. The choice of P implies a trade-off between
system stability and latency: handling larger blocks leads to
a more stable behavior w.r.t. small blocks (which impose
higher interrupt frequencies and operating system overhead),
but more time has to elapse for the generation of the whole
batch of samples, which increases the end-to-end latency.
In addition, in general-purpose operating systems there may
be multiple processes competing for CPU resources, which
could introduce delays in the interrupt triggering process.
To overcome this issue, [59] and [62] propose dedicated
kernel and network driver solutions. Note that, since the above
described functional mechanism holds for both the input and
output soundcards, the soundcard blocking delay appears twice
in the computation of the OOSE delay.
Fig. 6: Soundcard input/output buffer, assuming that the block
size is P = 8 samples and that the application buffer
introduces no additional queuing time.
Fig. 7: Example of packet dropout due to application buffer
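The trade-off governed by the block size can be quantified directly from the blocking delay P/R. The following Python sketch tabulates it for a few block sizes within the range of Table II; the chosen sampling rate of 48 kHz and the function names are assumptions for illustration.

```python
def blocking_delay_s(P: int, R: float) -> float:
    """Time needed to fill one block of P samples at sampling rate R (Hz)."""
    return P / R


def two_sided_blocking_delay_s(P: int, R: float) -> float:
    """Blocking delay counted at both the sending and the receiving soundcard."""
    return 2.0 * blocking_delay_s(P, R)


if __name__ == "__main__":
    R = 48_000.0  # assumed sampling rate
    for P in (64, 128, 256, 512):
        one_side_ms = blocking_delay_s(P, R) * 1e3
        both_sides_ms = two_sided_blocking_delay_s(P, R) * 1e3
        print(f"P = {P:3d}: {one_side_ms:6.3f} ms per side, {both_sides_ms:6.3f} ms in total")
```

At 48 kHz a block of 64 samples costs about 1.33 ms per side, whereas a block of 512 samples costs about 10.67 ms per side, i.e., more than 21 ms of the delay budget once both soundcards are accounted for.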
Moreover, the application buffer may introduce an addi-
tional delay if the application imposes a threshold on the min-
imum number of queuing blocks before starting the playout.
This lag is usually introduced at the receiver side to compen-
sate for the effects of network jitter and clock drift, creating
enough elasticity so that slight variations of the packet arrival
rate are less likely to cause buffer under-run conditions
(i.e. the application buffer is found empty when the callback
function accesses it to copy one block into the soundcard phys-
ical output buffer). However, the reverse problem may also
occur in case of bursts of packet arrivals or if the clock of the
remote side runs faster than the local clock: the receiver buffer
may not be able to accommodate all the incoming audio data
and some of them have to be discarded (buffer over-run). In
turn, buffer under/over-runs lead to artifacts and glitches in the
reconstructed audio signal (i.e. micro-silences or significant
amplitude discontinuities within very short time intervals due
to missing audio samples), as exemplified in Figure 7. A more
in-depth discussion on the management of buffer over/under-
runs can be found in [63]. Therefore, the application buffer size
B needs to be properly sized according to the delay tolerances,
possibly with automatic dynamic adjustments to adapt to the
varying network conditions (implemented e.g. in [64]). For a
comparative evaluation on the latency-audio quality trade-off
in different audio streaming engines and NMP frameworks,
the reader is referred to [63]. As a rule of thumb, assuming
that the application at the receiver side waits until a half of
the buffer size is filled with data before starting the playout,
the application buffer delay can be estimated as $D_b = \frac{BP}{2R}$.
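The buffering behaviour described above can be sketched with a small receiver-side queue. The Python class below is a minimal illustration under stated assumptions, not the mechanism of any specific NMP framework: playout starts once half of the B block slots are filled, an empty queue at callback time counts as an under-run (a block of silence is returned), and a block arriving at a full queue counts as an over-run (the block is discarded). The helper `buffer_delay_s` evaluates the half-filled-buffer rule of thumb as B·P/(2R).

```python
from collections import deque


class ApplicationBuffer:
    """Toy receiver-side application buffer holding up to B blocks."""

    def __init__(self, capacity_blocks: int, block_size: int):
        self.capacity = capacity_blocks
        self.block_size = block_size
        self.queue = deque()
        self.started = False
        self.underruns = 0
        self.overruns = 0

    def push(self, block) -> bool:
        """Called when a block arrives from the network."""
        if len(self.queue) >= self.capacity:
            self.overruns += 1  # over-run: incoming block is discarded
            return False
        self.queue.append(block)
        if len(self.queue) >= self.capacity / 2:
            self.started = True  # half full: playout may begin
        return True

    def pop(self):
        """Called by the soundcard callback once per block period."""
        silence = [0.0] * self.block_size
        if not self.started:
            return silence  # still pre-filling, not counted as an under-run
        if not self.queue:
            self.underruns += 1  # under-run: nothing to play, emit silence
            return silence
        return self.queue.popleft()


def buffer_delay_s(B: int, P: int, R: float) -> float:
    """Rule-of-thumb buffering delay when playout starts at half occupancy."""
    return B * P / (2.0 * R)
```

With B = 8 blocks of P = 64 samples at R = 48 kHz, `buffer_delay_s(8, 64, 48_000)` evaluates to about 5.3 ms. An adaptive variant would change the capacity at run time according to the observed jitter, as done for instance in [64].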
C. Packetization Delay
Once the media content is available at application level
at the client side, it can be packetized and transmitted over
the telecommunication network. The processing delay taken
by the layers of the ISO/OSI stack [65] mainly depends on
the machine hardware and is in the order of hundreds of
microseconds, thus introducing negligible contributions to the
delay budget. During this process, packet headers are added
at each layer (from application to physical layer), whose size
depends on the specific protocol choices and implementations
and results in an overall overhead of H bits for each audio
block (see footnote 4). Since H is a constant term and does not depend on
the packet size, the smaller the soundcard block size P, the
higher will be the number of packets generated in a given
time interval and consequently the higher the overhead due to
packet header addition (see [66], [67] for the detailed computations
of overhead and overall data rates of ULD-encoded audio
data assuming ADSL network access technology). Analogous
considerations hold for the reverse de-packetization process at
the receiver side, which removes the packet headers layer by
layer from the physical to the application level.
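The relation between block size and header overhead can be illustrated numerically. The sketch below assumes one packet per audio block (i.e., no block splitting) and an overhead of H = 432 bits per packet, a hypothetical value corresponding to 54 bytes of Ethernet, IPv4, UDP and RTP headers; the actual value depends on the protocol stack in use.

```python
def packets_per_second(R: float, P: int) -> float:
    """Packet rate when each block of P samples travels in its own packet."""
    return R / P


def overhead_bit_rate(R: float, P: int, H: int) -> float:
    """Header overhead in bit/s for H header bits per packet."""
    return H * packets_per_second(R, P)


def overhead_ratio(P: int, L: int, Ch: int, H: int, eta: float = 1.0) -> float:
    """Header bits divided by payload bits of a single block."""
    return H / (P * L * Ch * eta)


if __name__ == "__main__":
    R, L, Ch = 48_000.0, 16, 2
    H = 432  # assumed header size in bits, not a value from the text
    for P in (64, 512):
        print(f"P = {P:3d}: {packets_per_second(R, P):7.2f} packets/s, "
              f"overhead = {overhead_bit_rate(R, P, H) / 1e3:6.1f} kbit/s "
              f"({overhead_ratio(P, L, Ch, H):.1%} of the payload)")
```

Under these assumptions a block size of 64 samples produces 750 packets/s and 324 kbit/s of headers, whereas 512 samples produce 93.75 packets/s and 40.5 kbit/s, which shows how low-latency settings inflate the required bandwidth.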
4 Note that a single block may be split over multiple packets at lower layers
due to the restrictions imposed on the maximum packet size. When computing
H, block splitting must therefore be taken into account.
D. Network Delay
The network delay includes three contributions: the transmission
delay imposed by the bandwidth C of the network
interface card, the propagation time required by the signal to
propagate over the physical medium from source to destination,
and the processing delay introduced by the network nodes
(i.e. switches and routers). The transmission delay depends
on the volume of the data stream and can be computed as
$$D_t = \frac{R\,\eta\,\mathit{Ch}\,L + \lceil R\,\eta\,\mathit{Ch}\,L / P \rceil H}{C}$$
Conversely, the propagation delay depends exclusively on
the choice of the transmission medium: c being the speed of
light, the propagation speed $c_m$ is roughly 0.8c for copper
cables and 0.7c for optical fibers (which are the typical
medium used in long-haul backbone networks). Therefore, the
propagation delay is in the order of 5 µs/km and
can be calculated as $D_p = L/c_m$, where L is the physical
distance between source and destination.
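The order of magnitude quoted above follows directly from the propagation speeds. The Python sketch below evaluates the propagation delay as distance divided by propagation speed for the two media mentioned in the text; the constant and function names are illustrative.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0

# Propagation speed as a fraction of c, as stated in the text.
MEDIUM_FACTOR = {"copper": 0.8, "fiber": 0.7}


def propagation_delay_s(distance_m: float, medium: str = "fiber") -> float:
    """One-way propagation delay over a link of the given physical length."""
    speed = MEDIUM_FACTOR[medium] * SPEED_OF_LIGHT_M_S
    return distance_m / speed


if __name__ == "__main__":
    for medium in ("copper", "fiber"):
        per_km_us = propagation_delay_s(1_000.0, medium) * 1e6
        print(f"{medium}: {per_km_us:.2f} us/km")
    print(f"1000 km of fiber: {propagation_delay_s(1_000_000.0) * 1e3:.2f} ms")
```

For optical fiber this yields roughly 4.8 µs/km, so a 1000 km link alone consumes close to 5 ms of the delay budget, regardless of any other system parameter. Note that the physical cable length is usually larger than the geographical distance between the two endpoints.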
Finally, the processing delay $D_q$ introduced by each intermediate
network node is a function of multiple factors including
the network topology and traffic congestion conditions,
the implemented routing algorithms, the policies applied by
network operators to ensure quality of service guarantees,
the queuing mechanisms adopted by routers and switches
[68]. Analytic models taking into account all the above listed
features cannot be obtained except for extremely simple net-
work configurations. Therefore, the approach adopted by a
substantial body of literature is to model the overall network
delay (thus including propagation, transmission and processing
delays) as a random variable with a given statistical distribution,
whose parameters are adjusted according to the network
characteristics (among the numerous studies, see e.g. [69], [70]).
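To illustrate how such a statistical model can be used, the sketch below draws network delays from a shifted gamma distribution and derives the playout deadline that keeps the fraction of late packets below a target. The choice of distribution and all parameter values are assumptions made purely for illustration and are not taken from [69], [70].

```python
import math
import random


def sample_network_delays_ms(n, base_ms=5.0, shape=2.0, scale_ms=1.0, seed=0):
    """Draw n delays: a fixed minimum plus a gamma-distributed variable part."""
    rng = random.Random(seed)
    return [base_ms + rng.gammavariate(shape, scale_ms) for _ in range(n)]


def late_fraction(delays_ms, deadline_ms):
    """Fraction of packets arriving after the playout deadline."""
    return sum(1 for d in delays_ms if d > deadline_ms) / len(delays_ms)


def deadline_for_target(delays_ms, max_late_fraction):
    """Smallest observed delay usable as deadline without exceeding the target."""
    ordered = sorted(delays_ms)
    index = math.ceil((1.0 - max_late_fraction) * len(ordered)) - 1
    index = min(max(index, 0), len(ordered) - 1)
    return ordered[index]


if __name__ == "__main__":
    delays = sample_network_delays_ms(10_000)
    deadline = deadline_for_target(delays, 0.01)
    print(f"deadline for <= 1% late packets: {deadline:.2f} ms "
          f"(late fraction {late_fraction(delays, deadline):.2%})")
```

The gap between the minimum delay and the chosen deadline is the jitter that the receiver buffer has to absorb, which ties this model back to the sizing of the application buffer.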
E. Estimation of OOSE Delay
Summing all the above discussed contributions, the OOSE
delay experienced by the user of an NMP system can be
estimated as:
$$D_{tot} = 2(D_a + D_c + D_s) + D_b + D_p + D_t + D_q \tag{3}$$
where the term $D_a + D_c + D_s$ is considered twice to account
for the soundcard delay contributions at both transmitter and
receiver side. Assuming that no information about $D_q$ is
available and that the bandwidth C is sufficiently high to make
the impact of the transmission delay negligible, $D_{tot}$ can be
lower-bounded by the amount:
$$D_{tot} \geq 2\left(\frac{1 + F + P}{R} + D_c\right) + \frac{BP}{2R} + \frac{L}{c_m}$$
Apart from the propagation delay $L/c_m$, which can be estimated
with a rough computation of the geographical distance
between the NMP players, and from the coding delay $D_c$,
which can be varied or even eliminated depending on the
choices about the audio codec usage, the remaining contributions
exhibit an inverse dependency on the audio card sampling
rate R and are directly proportional to the audio block size P,
as depicted in Figure 8.
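The lower bound can be evaluated numerically with a few lines of Python. The sketch below reads the bound as 2((1 + F + P)/R + D_c) + BP/(2R) + L/c_m, which is an interpretation of the formula based on the delay contributions defined in the previous subsections, and uses the parameter values of Figure 8 (F = 100 samples, B = 8 blocks, distance of 1000 km) with optical fiber as the assumed medium. Function and parameter names are illustrative.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0


def oose_lower_bound_s(R, P, F, B, distance_m, Dc=0.0, medium_factor=0.7):
    """Lower bound on the OOSE delay in seconds.

    R: sampling rate (Hz), P: block size (samples), F: filter length (samples),
    B: application buffer size (blocks), distance_m: physical distance (m),
    Dc: codec algorithmic delay (s), medium_factor: propagation speed over c.
    """
    soundcard = (1 + F + P) / R + Dc           # counted at both ends
    app_buffer = B * P / (2.0 * R)             # playout starts at half occupancy
    propagation = distance_m / (medium_factor * SPEED_OF_LIGHT_M_S)
    return 2.0 * soundcard + app_buffer + propagation


if __name__ == "__main__":
    for R in (44_100.0, 48_000.0, 96_000.0):
        for P in (64, 128, 256, 512):
            bound_ms = oose_lower_bound_s(R, P, F=100, B=8, distance_m=1_000_000.0) * 1e3
            print(f"R = {R / 1e3:5.1f} kHz, P = {P:3d}: {bound_ms:6.2f} ms")
```

With uncoded audio at 48 kHz, the bound is about 17 ms for P = 64 and about 73 ms for P = 512, so under these assumptions only the smallest block sizes remain compatible with the 20–25 ms range in which performers were found to keep a stable tempo.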
Fig. 8: Dependency of the delay lower bound on the audio
block size and soundcard sampling rate, assuming a soundcard
filter length of F = 100 samples, an application buffer size of
B = 8 blocks and a geographical distance of L = 1000 km.
F. Packet Loss Concealment Techniques
When a packet arrives corrupted at the receiver, arrives
too late to contribute to the audio stream, or simply
does not arrive at all, actions must be taken in order to
minimize the impact of the missing information. The literature
is rich with techniques for containing the damage, involving
the receiving end as well as the sender, which in some cases
are specifically designed around the coding format [71], and
in other cases focus on signal reconstruction. In this Section
we offer a very brief summary of such solutions [72]–[74].
The need to successfully repair packet loss tends to
be in contrast with the requirements of real-time low-latency
operation, which are typical of NMPs, so in some cases [59]
the approach consists in not doing any correction, assuming
the network service is so reliable that packet losses occur with
negligible probability (e.g. in the order of $10^{-6}$–$10^{-8}$ with
network jitter below 1 ms, as in academic networks; see footnote 5).
Alternatively, the simplest solution on the sender side
consists of transmitting duplicate packets in order to reduce
the probability of losing them, but this tends to increase
the data rate in the stream and, consequently, worsen the
delay. Another sender-based countermeasure against packet
loss implies packet interleaving [75], which is done with the
purpose of dispersing data vacancies, thus limiting their size
to one or two packets at a time. This is an operation that
most audio compression schemes can do without additional
complexity, but the price to pay is an added delay that
depends on the interleaving factor in packets, which is there
even if no repair is needed. Again, on the sender side, it
is possible to add redundancy data to the stream, which
can be used on the receiver side to repair data losses and
recover lost packets. Such solutions, known as Forward
Error Correction (FEC) methods [72], can be classified as
media-independent or media-dependent. The former do not
consider what information is being sent in the packets, while
the latter take into account whether it is an audio signal or
a video signal. Media-independent techniques are suitable for
both audio and video content and do not depend on which
compression scheme is being considered. Furthermore, their
computational cost is limited and they are easy to implement.
On the other hand, they tend to worsen the delay (repair
cannot begin until a sufficient number of packets is collected).
Furthermore, they tend to increase the bandwidth at the risk of
causing additional packet loss. Media-specific FEC methods,
on the other hand, are characterized by low added latency,
as they usually only need to wait for a single packet to
repair (if dealing with burst-like losses, the latency tends
to worsen). Furthermore, they do not significantly increase
the bandwidth, in comparison with media-independent FEC
solutions. On the other hand, computational cost and
implementation complexity become relevant factors, and the final
quality tends to be affected.
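A minimal example of a media-independent FEC scheme is a single XOR parity packet computed over a group of equally sized packets, which allows the receiver to rebuild any one lost packet of the group. The Python sketch below illustrates this idea only; it is one common instance of the media-independent family and is not claimed to be the scheme adopted in [72] or in any specific NMP framework. Function names are illustrative.

```python
def xor_parity(packets):
    """Bytewise XOR of a group of equally sized packets."""
    size = len(packets[0])
    if any(len(p) != size for p in packets):
        raise ValueError("all packets must have the same size")
    parity = bytearray(size)
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received, parity):
    """Rebuild the single lost packet (marked as None) of a protected group."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one loss per group")
    present = [p for p in received if p is not None]
    # XOR of all surviving packets and the parity equals the lost packet.
    return xor_parity(present + [parity])


if __name__ == "__main__":
    group = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
    parity = xor_parity(group)
    rebuilt = recover_missing([group[0], None, group[2]], parity)
    print(rebuilt == group[1])
```

The example also makes the drawbacks listed above tangible: one extra packet per group increases the data rate by a factor of 1/k for groups of k packets, and the repair can only start once the remaining packets of the group and the parity packet have arrived.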
If we focus on the receiver side, we talk about Packet Loss
Concealment (PLC) strategies. In PLC the receiver does the
best it can to recover data losses, and it works best for limited
packet loss rate (<15%) and for small packets (4-40 ms).
Such solutions generally perform less well than sender-based
methods and therefore they are not usually employed. Because
of their limited effectiveness, it is advisable to use them in
conjunction with sender-based methods.
We can identify three wide categories of PLC: those based
on data insertion, those that perform data interpolation, and
those that rely on data regeneration [72].
Data insertion simply consists of replacing a lost packet with
data filler. The simplest solution of the sort is known as zero
insertion (or silence substitution) with obvious interpretation
of its meaning. This solution has the advantages of preserving
the timing of the audio stream and of being of immediate
implementation. Notice that a silence of a few ms will be
perceived more as a “click” than as a silence in the usual
sense, therefore this method will work acceptably well only
for very short packets (typically 4-16 ms), while for packets
of more standard size (40 ms) it will be ineffective. Furthermore,
the quality of PLC will start dropping significantly for packet
loss rates that are higher than 2%. An alternate solution to
zero insertion is noise substitution, which relies on the fact
that a certain amount of repair work can be performed by
the listener’s brain (phonemic restoration) if the data loss is
replaced by something other than silence. Typical choices for
data fill-ins are white noise or “comfort noise”. This solution
shares the advantages of zero insertion (preservation
of timing, low complexity), but it requires a careful noise
magnitude adjustment. Instead of filling the gap with a random
signal, we could opt for patching it with a signal excerpt
picked from some other location in the audio stream. One
straightforward way to do so consists of repeating the previous
frame (a.k.a. “wavetable” mode, since a continuously repeated
packet will create a tone with a fundamental period equal
to the buffer size). Better solutions minimize overlapping
artifacts through some form of dynamic realignment based on
synchronous OverLap and Add (OLA) or Pitch-Synchronous
OLA (PSOLA) [76]. Such methods rely on a quasi-periodic behavior
of the audio stream or some pitch-related peculiarities of the
waveshape. Whatever the choice, this splicing operation needs
to be done in such a way as to seamlessly blend the patch on
either side through either direct or synchronized cross-fading.
This class of solutions is quite effective as long as the gaps
are quite narrow, but the quality drops significantly when
the length of the lost packets extends beyond 20 ms and/or
the packet loss rate grows above 3%. Other limitations of
OLA/PSOLA methods are that they tend to interfere with
delay buffering and do not preserve timing.
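The three simplest insertion strategies described above (zero insertion, noise substitution and repetition of the previous packet) can be sketched as follows. This is an illustrative sketch only: the function name, the `mode` labels and the default noise level are assumptions, and no cross-fading or OLA realignment is attempted.

```python
import random
from typing import List, Optional

Packet = List[float]

def conceal_by_insertion(packets: List[Optional[Packet]], size: int,
                         mode: str = "zero", noise_level: float = 0.01,
                         seed: int = 0) -> List[Packet]:
    """Replace each lost packet (None) with a filler of the same length,
    so that the timing of the stream is preserved."""
    rng = random.Random(seed)
    out: List[Packet] = []
    for pkt in packets:
        if pkt is not None:
            out.append(list(pkt))
        elif mode == "zero":       # zero insertion (silence substitution)
            out.append([0.0] * size)
        elif mode == "noise":      # noise substitution with bounded amplitude
            out.append([rng.uniform(-noise_level, noise_level)
                        for _ in range(size)])
        elif mode == "repeat":     # repeat the last packet that was played out
            out.append(list(out[-1]) if out else [0.0] * size)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out
```

With `mode="repeat"`, consecutive losses keep replaying the same packet, which is the “wavetable” behavior mentioned in the text.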
Data interpolation methods operate by bridging the gap based
on the content on either side of it. The simplest method of
the sort is waveform substitution, which consists of waveform
repetition or mirroring [77] from both sides of the gap. This
represents an improvement over simple repetition, which uses
information from just one side of the gap. PLC methods based
on waveform substitution are particularly popular thanks to
their implementational simplicity. One such method is also
proposed in ITU recommendation G.711 Appendix I [78].
There are also hybrid solutions based on pitch waveform
replication, which rely on repetition during the unvoiced
portion of the signal (typically speech) and extend the duration
of the voiced portion of the signal in a model-based fashion.
Such solutions tend to perform slightly better than simple
waveform substitution. Model-based PLC methods rely on a
specific model for patching signal gaps. A very popular choice
is the Auto-Regressive model combined with Least Squares
minimization (LSAR) [79]. Such methods, however, are only
effective for filling the gap left by very short individual
packets. An alternative solution consists of relying on time
scale modification, which stretches the audio signal across the
gap. This approach generates a new plausible waveform that
smoothly blends across the gap. It is computationally heavier,
but tends to perform better than other solutions.
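To illustrate how information from both sides of the gap can be combined, the following sketch cross-fades two candidate patches: the samples immediately preceding the gap and those immediately following it. It is a deliberately simplified stand-in for waveform substitution, not the method of [77] or of G.711 Appendix I; the function name and the linear fade are assumptions.

```python
from typing import List

def bridge_gap(before: List[float], after: List[float], gap: int) -> List[float]:
    """Fill `gap` missing samples by cross-fading the tail of the audio
    preceding the gap with the head of the audio following it."""
    if gap <= 0:
        return []
    if len(before) < gap or len(after) < gap:
        raise ValueError("need at least `gap` samples of context on each side")
    forward = before[-gap:]   # repetition of the last received samples
    backward = after[:gap]    # repetition of the first samples after the gap
    patch = []
    for i in range(gap):
        w = (i + 1) / (gap + 1)   # weight shifts from the left to the right side
        patch.append((1.0 - w) * forward[i] + w * backward[i])
    return patch
```

Note that this approach needs the packet that follows the gap, so it adds at least one packet of delay at the receiver.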
Data regeneration methods use knowledge of the adopted
audio compression technique to derive codec parameters for
repair. There are quite a few solutions of the sort [72], which
rely on the interpolation of transmitted state, and interpret
what state the codec is in. Such solutions are quite effective
and tend to reduce boundary effects. On the other hand they
are quite expensive from the computational standpoint, and
the improvement tends to flatten beyond a certain level of
Several software and hardware solutions have been developed
to support NMP in the last two decades. In this
Section we provide an overview of the state-of-the-art software
frameworks listed in Table III, comparing technological
characteristics as well as the specific framework purpose (e.g.
e-learning, etc.). For a thorough historical perspective on the
milestones achieved in the field of NMP from 1965 on, the
reader is referred to [80].
A. Framework Purpose
Though all the reviewed frameworks are aimed at supporting
real-time musical interactions with at least audio transport,
their implementations vary according to the designers’ artistic
concept or to cope with technology-dependent issues.
Combined audio and video frameworks [42], [59], [81]–[83]
aim at providing an NMP environment supporting e-learning
(in which the visual component plays a fundamental role for
an effective acquisition of technical and expressive skills) and
content delivery to a passive audience (e.g. real-time concert
streaming with immersive sound systems with multiple audio
channels, advanced spatial audio rendering techniques and
high definition video). High-quality video transmission software
solutions developed for telemedicine and cinematography
(e.g. [84]) have also been employed for NMP applications,
but their typical latencies (above 60 ms) usually exceed
the tolerance thresholds for real-time synchronous musical
performances. Motion capture techniques have also been integrated
to improve the performers’ control over the audio
mix of instrument sources, e.g. by automatically adjusting
stereo panning and volume of remote audio sources perceived
by each player according to his/her orientation and spatial
coordinates with respect to the virtual locations of the other
performers [85]. Video data can also be integrated in NMP
frameworks to provide alternative forms of feedback e.g.,
visuals and text elements which are dynamically-generated
according to the received audio content [86].
In [87], in addition to real-time NMP, on-demand rehearsals
are also supported: in this scenario, audio data is recorded in
advance, stored in a dedicated repository and delivered upon
request. Also [59] provides support for real-time rehearsals in
which the whole activity is completely performed live.
Some other frameworks do not include video data [60], [64],
[88]–[91] or support only the transmission of MIDI data, thus
restricting the choice of the instruments to electronic ones and
excluding voice [86], [92]–[94]. Quite often, an independent
video channel is used to accompany the low-latency audio
framework, for example, using commodity video-conference
technologies with the audio turned off [95] or software video-only
transport [96]. A 2004 example of the former [97]
combined Tandberg 6000 video codecs with jacktrip for several
channels of real-time performer-performer and audience-audience
interaction.
A framework for large-scale collaborative audio environ-
ments is described by [98] in which mobile users share virtual
audio-scenes and interact with each other or with locative
audio interfaces/installations and overlaid virtual elements.
This is done by combining mobile technologies for wire-
less audio streaming with sensor-based motion and location-
tracking techniques.
B. Architecture
Both decentralized peer-to-peer (P2P) and client-server have
been used as architectures for NMP systems. As depicted in
Figure 9(a), P2P solutions (implemented in [59], [60], [82],
[83], [85], [86], [88], [90], [91], [99]) require each participant
in the networked performance to send audio/video data to each
of the other players. Therefore, in case of N participants,
every user sends/receives N − 1 streams (see Fig. 9a), which
could heavily hinder scalability: since typical audio rates are
in the order of hundreds of kilobits per second
and uplink capacities of retail user subscriptions reach
at most a few Mb/s, a trade-off emerges between the number
of participants in the NMP session and audio quality, which
degrades when lowering the soundcard audio resolution and/or
increasing the codec compression rate.
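The trade-off can be made concrete with a short calculation. The sketch below assumes that every outgoing stream has the same bitrate and that the uplink is the only bottleneck; the numeric values in the usage note are illustrative assumptions, not measurements from the cited works.

```python
def max_p2p_participants(uplink_kbps: float, stream_kbps: float) -> int:
    """Largest N such that the N - 1 outgoing streams of a peer fit in
    its uplink capacity (all streams assumed to have the same bitrate)."""
    if uplink_kbps <= 0 or stream_kbps <= 0:
        raise ValueError("rates must be positive")
    return int(uplink_kbps // stream_kbps) + 1
```

For instance, with an assumed 2 Mb/s uplink, uncompressed mono audio at 48 kHz and 16 bits per sample (768 kb/s) allows at most `max_p2p_participants(2000, 768)`, i.e. 3 participants, whereas a 128 kb/s compressed stream allows 16.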
Conversely, in client-server architectures such as the ones proposed
in [42], [81], [83], [87], [89], [92]–[94], [98], [99] each
player transmits his/her data streams to a central server, which
is in charge of mixing the contributions from the N sources
into a single stream and sending it back to each participant,
as depicted in Fig. 9b. This way, the bandwidth request of
a client is limited to one single stream in both uplink and
downlink directions, whereas the server receives/transmits N
data streams. This solution removes scalability issues at the
client side, but may require significant bandwidth availabil-
ity and computational resources at the server side, which
often requires specific hardware configurations to avoid the
introduction of additional delay contributions. The added path
to the server results in added delay between clients when
compared to P2P. Looking in more detail, the packets of
each media stream received by the network interface cards
of the server are passed from the kernel to the application
layer, which replicates them according to the user’s requests
(communicated via a separated control data stream) and passes
them back to the kernel for transmission. This involves context
switching between the kernel and the application, as well as
data replication which grows with the number of participants.
Both data copying and context switching between kernel and
application are well-known sources of delay. To reduce server
latency, different architectural options are available (see Figure
10): the authors of [62], [100] compare three alternative
approaches. The first is an FPGA-based solution [101] in
which the application layer processes only control data (which
are not time-critical) whereas the routing procedure of media
packets (i.e. reception, copy and transmission) is entirely
hardware-handled without kernel intervention. To do so, a
table indicating how to treat each packet is maintained in
the NetFPGA memory and modified based on the signaling
packets received by the application. Moreover, the NetFPGA
can rewrite packet headers and transmit replicated packets
multiple times, thus reducing memory bandwidth require-
ments. The second solution is a Click [102] modular router
which allows for in-kernel execution, thus avoiding most
packet copying and context switching. The router can be split
in two parts: a control part residing at the application level
and a routing part residing at the kernel level. However, the
Click router still consumes processor time and may not be
able to perform packet replication without copying. The third
solution is a Netmap server framework [103], which enables
the application layer to handle the packets directly in the kernel
memory without need to copy them in the upper layer, thus
avoiding context switching. In this unicast paradigm, each user
selectively receives a subset of data streams and can adjust
the settings of each of them individually (e.g. volume level,
audio/video codec).
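The separation between a control path that updates a table and a media path that only looks entries up and replicates packets can be sketched in software as follows. This is a conceptual illustration only: the class and method names are hypothetical, and the code does not reproduce the NetFPGA, Click or Netmap implementations discussed above.

```python
from typing import Dict, List, Tuple

class ForwardingTable:
    """Maps each source stream to the receivers that subscribed to it."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, Dict[str, float]] = {}

    def handle_signaling(self, receiver: str, source: str, gain: float) -> None:
        """Control path (not time-critical): a receiver requests a source
        at a given volume level; a gain of 0 removes the subscription."""
        entry = self._subscribers.setdefault(source, {})
        if gain > 0.0:
            entry[receiver] = gain
        else:
            entry.pop(receiver, None)

    def route(self, source: str, payload: bytes) -> List[Tuple[str, float, bytes]]:
        """Media path: replicate one incoming packet for every subscriber."""
        table = self._subscribers.get(source, {})
        return [(rcv, gain, payload) for rcv, gain in sorted(table.items())]
```

Keeping `route` free of any decision logic is what makes it a candidate for in-kernel or hardware execution, while `handle_signaling` can remain in the application layer.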
Fig. 9: Architectures for NMP systems
Bandwidth consumption issues can be mitigated in case
the network natively supports multicast, as in the case of
Information Centric Networks [99]: when multicast is enabled,
the sender transmits a single copy of the content to be
communicated to a pool of receivers, and the network routers
duplicate the data when it is needed, according to the network
topology and to the addressees’ location. Therefore, as de-
picted in Fig. 9c,d, in both P2P and client-server infrastructures
the uplink bandwidth request in case of multicast is limited
to a single data stream (also for the server node [87]). The
drawback of the introduction of multicast is that it does not
allow for stream customization, since all the players receive
the same content.
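The uplink requirements of the four configurations of Fig. 9 can be summarized in a few lines. The sketch counts streams only (not bits per second) and uses hypothetical role labels; it simply restates the counts given in the text.

```python
def uplink_streams(role: str, n: int, multicast: bool) -> int:
    """Number of streams a node must send in a session with n participants.
    role is "peer" (P2P), "client" or "server" (client-server)."""
    if n < 2:
        raise ValueError("at least two participants are needed")
    if role not in ("peer", "client", "server"):
        raise ValueError(f"unknown role: {role}")
    if multicast or role == "client":
        return 1        # a single copy leaves the node
    if role == "peer":
        return n - 1    # unicast P2P: one copy per remote player
    return n            # unicast server: one stream back to each participant
```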
C. Network Span
In principle, an NMP environment should not introduce
additional spatial limitations to the geographical displacement
of the musicians other than the ones imposed by the signal
propagation delays over the physical media. As discussed
in Section IV, no more than a few thousand km can
be covered to maintain the end-to-end delay experienced by
the players below the acceptability thresholds reported in
Section III. Most of the frameworks listed in Table III are
designed to support communications on both Local and Wide
Area Networks (WANs). However, the authors of [85], [87],
[92], [93] test their proposed frameworks only on LANs. In
other cases (e.g. [59]), the WAN is considered as part of
the NMP scenario and a fundamental component; as such,
it is supposed to be adequate in terms of bandwidth, latency
and error rate to guarantee a sufficient quality of experience.
Transmission over Wireless Local Area Networks (WLANs)
is supported by [90], [92], [98], but the latency introduced by
wireless communications is typically more pronounced, less
controllable and more prone to fluctuations compared to wired
media. Therefore, in these frameworks the network span is
limited to hundreds of meters/few kilometers using Wi-Fi or
Bluetooth technologies.
Fig. 10: NMP server architectural models
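The order of magnitude of the geographical bound on wired links can be checked with a back-of-the-envelope calculation. The sketch assumes a propagation speed of about 200 km/ms (roughly two thirds of the speed of light, as in optical fiber); the delay budget and the fixed processing delay used in the usage note are illustrative assumptions, not the thresholds of Section III.

```python
def max_distance_km(delay_budget_ms: float, processing_ms: float,
                    speed_km_per_ms: float = 200.0) -> float:
    """Distance reachable within a one-way delay budget once the fixed
    (non-propagation) delay contributions have been subtracted."""
    remaining = delay_budget_ms - processing_ms
    return max(0.0, remaining) * speed_km_per_ms
```

With an assumed one-way budget of 25 ms, of which 10 ms are spent in acquisition, buffering and playout, `max_distance_km(25, 10)` gives 3000 km, consistent with the “few thousand km” figure quoted above.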
D. Network Protocol Stack
Regarding the protocol stack, all the frameworks listed in
Table III prefer the usage of UDP [114] as transport layer
protocol for audio/video media streaming. UDP introduces less
transmission overhead due to the smaller packet-header size
compared to TCP and is inherently more suitable for real-
time applications due to its lightweight nature, since it does
not support any mechanism for packet retransmission, in-order
delivery or congestion control. Therefore, packet loss recovery
algorithms must be implemented at application level to cope
with audio artifacts introduced by missing packets during the
media playout. Such algorithms usually combine forward error
correction (i.e. the transmission of redundant data in each
packet, as in [60]) and error concealment techniques based on
data interpolation/substitution [115]. Analogous techniques are
applied also to MIDI signals [94]. In one case [59] packet loss
recovery techniques are explicitly avoided, to further decrease
network latency; in such a case, the network is supposed
to be “error free”. Some frameworks [83], [87], [94], [105],
[109] build data loss management on top of the RTP/RTCP
transport protocol [116], taking advantage of the timestamps
and sequence numbers included in each RTP packet header.
Note that RTP also defines a specific payload format dedicated
to MIDI data [117]. UDP with custom sequence numbering
schemes has also been used [60]. Frameworks requiring
presence discovery of participants, session initialization and
the management of textual data or any other type of content
in support of the pure media streams rely on SIP and HTTP
over TCP, respectively.
TABLE III: Summary of NMP frameworks
Authors | Name | Architecture | Network | Network protocols | Supported data type | Nr. of Audio Channels | Synchronization | Codec
Saputra et al. | BeatME | Client-Server | LAN, … | UDP or OSC [104] | MIDI | 16 (input), 1 (output) | none | uncompressed
Kurtisi, Gu et al. [87], [105] | - | Client-Server | LAN | RTP, UDP (stream), TCP (session data) | audio | n.a. | NTP | ADPCM, FLAC (real-time) or …
Renwick et al. | Sourcenode | Client-Server | LAN | UDP | MIDI | n.a. | none | uncompressed
Stais et al. [99] | - | Client-Server or P2P | WAN | n.a. | audio | 2 | NTP | uncompressed
Kapur et al. [86] | Gigapopr | Client-Server | WAN | UDP | audio, video, … | n.a. | n.a. | uncompressed
Wozniewski et al. | Audioscape | Client-Server | WLAN | n.a. | audio | 1 (input), 2 (output) | GPS | uncompressed
Chew et al. [37]–[39], [42], … | - | Client-Server | WAN | RTP/RTSP, UDP | audio, video, … | | |
Akoumianakis et al. [83], [106] | Musinet | Client-Server or P2P | WAN | SIP (signaling), RTP (stream), HTTP (text) | audio, video | any | none | OPUS (audio), H.264 (video)
Carot et al. [82] | Soundjack | P2P | WAN | UDP | audio and … | 8 | external master … | uncompressed or … JPEG video
Drioli et al. [59], … | LOLA | P2P | WAN | TCP (control), UDP … | audio, video | 8 | n.a. | uncompressed audio and video
Lazzaro et al. | - | Client-Server (control) P2P (stream), SIP … | | | | | |
El-Shimy et al. | - | P2P | LAN | | audio, video | n.a. | n.a. |
Fischer et al. [64] | Jamulus | Client-Server | WAN | UDP | audio | 2 | none | OPUS
Caceres et al. [60], [89], [108] | Jacktrip | Client-Server or P2P | WAN | UDP | audio | any | software-based audio … |
Akoumianakis et al. [61], [109], … | Diamouses | Client-Server or P2P | WAN | RTP, TCP/UDP | audio, video, … | any | internal … | … audio, MJPEG
Gabrielli et al. [90], [111]–[113] | WeMust | P2P | LAN, … | TCP or UDP | audio, MIDI | 12 | software-based audio … | uncompressed or …
Meier et al. [91] | Jamberry | P2P | WAN | UDP | audio | 2 | external master … |
Chafe et al. [88] | StreamBD | P2P | WLAN | UDP, TCP | audio | any | none | uncompressed
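Since UDP offers neither in-order delivery nor loss detection, the sequence numbering mentioned in Section D has to be carried in the application payload. The following sketch shows one possible layout; the 12-byte header format (32-bit sequence number and 64-bit timestamp in microseconds) and the function names are assumptions made for illustration and do not correspond to the header of RTP or of any cited framework.

```python
import struct
from typing import List, Tuple

# 32-bit sequence number + 64-bit timestamp, network byte order (12 bytes)
HEADER = struct.Struct("!IQ")

def pack_packet(seq: int, timestamp_us: int, audio: bytes) -> bytes:
    """Prepend the header to an audio payload before sending it over UDP."""
    return HEADER.pack(seq & 0xFFFFFFFF, timestamp_us) + audio

def unpack_packet(datagram: bytes) -> Tuple[int, int, bytes]:
    """Split a received datagram into sequence number, timestamp and audio."""
    seq, timestamp_us = HEADER.unpack_from(datagram)
    return seq, timestamp_us, datagram[HEADER.size:]

def missing_between(last_seq: int, new_seq: int) -> List[int]:
    """Sequence numbers skipped between two consecutively received packets
    (computed modulo 2**32, so that counter wrap-around is handled)."""
    gap = (new_seq - last_seq) % 2**32
    if gap == 0 or gap > 2**31:   # duplicate or late (reordered) packet
        return []
    return [(last_seq + k) % 2**32 for k in range(1, gap)]
```

The list returned by `missing_between` tells the receiver which packets must be repaired by FEC or concealed before playout.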
E. Supported Data Types
In NMP, ensuring high audio quality is of great importance
in creating acoustical environments providing conditions as
close as possible to those of in-presence interactions. There-
fore, several frameworks [59], [82], [86], [90], [94], [98], [99],
[109] support the transmission of uncompressed audio data.
The authors of [87] opt for the usage of the FLAC lossless
codec for real-time performances and of MP3/MPEG4 for on-demand
rehearsing: the latter two codecs introduce
startup compression delays of 20 ms or more and are thus
considered unsuitable for real-time interactions. Nevertheless,
MPEG formats have been used in the framework proposed in
[42], [81]. Alternative solutions rely on low-latency codecs
such as the proprietary Fraunhofer ULD [82], CELT [82], [83],
[90] and OPUS [82], [91]. The number of supported audio
channels varies considerably: most of the frameworks work
with mono/stereo configurations, but some of them support
channel counts from several to dozens per source [42], [59],
[81]–[83], [89], [90], [94].
Conversely, the quality of the video data is less critical for
a successful musical interaction and becomes relevant only
in case the presence of a passive audience is assumed. In
the former case, video codecs such as MJPEG [109], MPEG
[42], [81], SVC and H.264 [83] are used, whereas in the latter
uncompressed video streaming is supported [59].
F. Data Stream Synchronization
When multiple remote locations are involved in an NMP session,
the problem of synchronizing audio/video data streams
generated by different sources arises, also due to the difference
between the nominal and actual value of the audio card
hardware clock frequency, which may cause drifts. To this end,
timestamps have to be associated to audio packets and local
clocks need a tight synchronization to a common reference to
maintain precise timings. The master clock can be transmitted
over a side channel or incorporated in the audio streaming
data and is reconstructed at the receiver side by means of
a phase-locked loop. This is accomplished by the RTP/RTCP
protocol in [94] or by the Network Time Protocol (NTP) in
[87], [99], with initial synchronization during the NMP session
setup phase and periodic refreshes during the session.
However, though the accuracy of NTP-based synchronization
is about 0.2 ms in LANs, larger skews (up to a
few ms) may occur in long-distance communications over
WANs. Therefore, alternative solutions based on the Global
Positioning System (GPS) and on the time signals
broadcast by Code Division Multiple Access (CDMA)
cell phone transmitters have been investigated in [42], which
ensure an accuracy in the order of tens of microseconds.
Global time synchronization can also be obtained using affordable, dedicated GPS
timeservers [118]. A dedicated solution based on the dynamic
adjustment of the world reference clock frequency by means of
a measurement of the offset between the triggering instants of
the remote and the local audio callback functions is discussed
in [119] and used in [82], [91]. The proposed method also
implements a jitter compensation mechanism for communications
over WANs.
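To see why even small deviations between nominal and actual soundcard clock frequencies matter, the accumulated drift can be estimated as follows. The clock tolerances (in parts per million) and the buffer slack in the usage note are illustrative assumptions, not values reported in the cited works.

```python
def drift_samples_per_second(nominal_hz: float, ppm_a: float, ppm_b: float) -> float:
    """Difference in samples produced per second by two audio cards whose
    clocks deviate from the nominal rate by ppm_a and ppm_b parts per million."""
    return nominal_hz * (ppm_a - ppm_b) * 1e-6

def seconds_until_slip(nominal_hz: float, ppm_a: float, ppm_b: float,
                       slack_samples: int) -> float:
    """Time before the accumulated drift uses up a buffer slack of
    slack_samples, i.e. before a buffer underrun or overrun occurs."""
    drift = abs(drift_samples_per_second(nominal_hz, ppm_a, ppm_b))
    return float("inf") if drift == 0 else slack_samples / drift
```

For example, two cards nominally running at 48 kHz with assumed deviations of +50 ppm and -50 ppm drift apart by about 4.8 samples per second, so a slack of 64 samples is exhausted in roughly 13 seconds unless the clocks are synchronized or the audio is resampled.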
In case of client-server architectures supporting MIDI data
[93], synchronization is achieved by means of a MIDI-Clock
generated by the server node and transmitted to all the clients,
in a master-slave fashion. Clock events are sent at a rate of
24 pulses per quarter note.
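The resulting spacing between clock events follows directly from the tempo. The function below is a small illustrative helper (its name is hypothetical); the 24 pulses per quarter note are those stated in the text.

```python
def midi_clock_interval_ms(bpm: float, ppqn: int = 24) -> float:
    """Time between two consecutive MIDI clock events at the given tempo,
    expressed in milliseconds."""
    if bpm <= 0:
        raise ValueError("tempo must be positive")
    return 60_000.0 / (bpm * ppqn)
```

At 120 beats per minute the server therefore emits a clock event about every 20.8 ms, which is of the same order as the network delays at play in NMP.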
All the above mentioned synchronization mechanisms assume
that the slave nodes adjust their hardware clock pace
according to the clock frequency of the master node. However,
software-based solutions relying exclusively on audio
resampling have also been investigated: in [113], a digital
Infinite Impulse Response (IIR) filter is proposed to estimate
the master clock frequency $\hat{R}_M$. Based on such estimation and
on the estimated slave local clock frequency $\hat{R}_S$, audio data
are resampled at a rate $R_r$ that also depends on the block sizes
$P_M$ and $P_S$ at the master and slave side, respectively.
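The general idea of smoothing noisy clock-rate measurements with an IIR filter can be sketched as follows. This is not the estimator of [113]: a generic first-order low-pass filter is assumed, the smoothing factor `alpha` is arbitrary, and the resampling ratio below is a plain ratio of the two estimates that ignores the block sizes.

```python
from typing import Iterable, Optional

def iir_estimate(measurements: Iterable[float], alpha: float = 0.1) -> float:
    """First-order IIR low-pass filter:
    y[n] = (1 - alpha) * y[n-1] + alpha * x[n], initialized with x[0]."""
    estimate: Optional[float] = None
    for x in measurements:
        estimate = x if estimate is None else (1.0 - alpha) * estimate + alpha * x
    if estimate is None:
        raise ValueError("no measurements")
    return estimate

def resampling_ratio(master_rate_est: float, slave_rate_est: float) -> float:
    """Assumed ratio between the estimated master and slave clock rates
    (block sizes are deliberately ignored in this sketch)."""
    return master_rate_est / slave_rate_est
```

A small `alpha` rejects measurement jitter at the price of a slower reaction to genuine clock changes.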
An alternative approach is the propagation of a metronome
signal via a dedicated data stream [109]. The signal is gener-
ated by the central server (in case of client-server architecture)
or by one of the peers (in case of P2P communications).
As described in this Overview, networked music perfor-
mance is an extremely challenging application scenario due
to the requirements of this type of interactive communication.
What makes it challenging is the fact that musicians are highly
sensitive to interaction delays, and in NMP this delay is not
just unavoidable but also has a physical lower bound. Many
strategies have been adopted for pushing the limits of this
type of interactive communication, some aiming at minimizing
the time lag in the network, others involving a-posteriori
correction strategies. It is only in the past several years,
however, that research has begun addressing the problem from
a perceptual perspective. We believe this is an extremely
promising direction as it aims at answering open and
complex questions that are critical to advancing research in
the field.
Research on perceptual aspects of NMP is still in its infancy,
and a first significant step ahead could come from assessing
under what conditions musicians are able to adapt to NMP
limitations. Little is known about adaptability to NMP though
a great deal of data is being collected [59] which could
help shed light on this aspect. Quite unsurprisingly, direct
experience tells us that adaptability is adversely affected by
age, but this is not information we can put into productive use
unless we sort out the causes. We know that digital “natives”
(those born just before the turn of the millennium) are often
able to immediately adapt to an NMP environment, while older
musicians tend to take longer, though it is not clear whether
this is to be attributed to a lack of familiarity with the devices
in use (in-ear headphones, microphones, etc.) or to being less
accustomed to interaction delay in mediated communications.
A better understanding of these aspects is expected to come
from targeted perceptual experiments and data/questionnaire
One aspect that needs to be assessed when discussing
perception-aware NMP solutions is how content influences
the quality of the NMP experience. This issue was initially
raised in [40], where the tempo slowdowns were found to
be dependent on the rhythmic and timbral features of the
musical piece that was being played. This aspect indeed
deserves further exploration and modeling in order to come
to better designed NMP solutions. Content-aware analysis
is not easy to achieve because it needs to be approached
at various levels of abstraction. The analysis of tempo dependency on
musical features described in [40], for example, is conducted
at a low level of abstraction. Accounting for expressive tempo
changes, however, requires content analysis at a higher level
of abstraction. Similarly, expressive descriptors are bound to
provide important information on the levels of engagement
and entrainment of the musicians involved in the NMP, which
are likely to play a crucial role in the assessment of the NMP
As mentioned above, understanding and modeling percep-
tual aspects of NMP requires a large number of perceptual
experiments, which are to be conducted in an organized and
systematic fashion. One initial outcome of such experiments
is a complete revision of all the metrics that are currently
adopted for evaluating the quality of the NMP experience. The
slowing of the tempo, in fact, is one of the most commonly
adopted metrics for assessing the impact of interactional delay
on NMP. In order to reliably assess the quality of the inter-
action experience, however, we need to account for a much
wider range of factors. For example an undesirably limited
mutual entrainment between musicians might come into play
or, conversely, desirable expressive ritardandi might become
part of the performance. Understanding what contributes to
the quality of an interactive musical experience, in fact, is still
a relatively young and unexplored research problem, which
needs to be teased apart from conditions naturally inherent
in musical practice. Addressing such issues could greatly
help construct models for overcoming, or at least easing, the
inherent limitations of NMP.
From the perceptual standpoint, it is also of great impor-
tance to explore how different modalities (typically audio and
video) jointly contribute to the quality of the NMP experience.
For example, there seems to be a clear correlation between
video definition/resolution, high frame rate, image size and the
overall NMP experience. However, it is still unclear how the
NMP experience is affected by such parameters. For example,
although we cannot clearly “see” the difference between 30
and 90 frames per second in the video refresh rate, the level
of comfort is heavily affected by it [120]. Research is only
taking its first steps towards understanding the interplay of
the various modalities and their parameters towards a better
NMP experience.
Beyond perceptual aspects of the NMP experience is the
understanding of cognitive mechanisms that musicians develop
to cope with NMP limitations, particularly with delays. This
understanding would, in fact, lead to novel technological
solutions and compensatory measures in NMP. Musicians, over
time, learn to adapt to adverse NMP conditions far beyond
what we tend to give them credit for. Organ players, for
example, may be forced to compensate for delays that can exceed
one second between action and perceived sound, when the
physical distance between the instrument keyboard and
pipes is significant. Opera singers and choruses are often
expected to anticipate their singing by several hundred
milliseconds with respect to the perceived orchestra sound in
order to compensate for the delay caused by their physical
distance from the orchestra pit. This ability,
however, is likely to be associated with a strong musical
background and a certain level of proficiency as a performer.
Is it possible to predict the evolution of the performance
far enough ahead to compensate for communication delays? Can
this prediction ability go as far as anticipating expressive
changes in order to preserve the impression of an interactive
performance? These questions have recently raised the interest
of the scientific community, giving birth to a research field
called “robotic musicianship” [121].
As we can see, the corpus of knowledge that is yet to be
gathered and processed for a better understanding of NMP
experiences is, in fact, quite extensive and, as we progress in
this research effort, we need to organize it systematically. As
done in other fields of research, one effective way to map this
knowledge is to develop an ontology that captures the relevant
aspects of such knowledge and the relations between them.
This requires a careful collection of semantic descriptors and
conducting questionnaires for organizing them into semantic
spaces equipped with proper metrics. It requires mapping what
we know about NMP (and other mediated interactional expe-
riences) into knowledge maps. It also requires using machine
intelligence to explicitly map relationships between collected
data and semantic qualifiers/descriptors. This is an approach
that is commonly adopted in the area of semantic web, or in
music information retrieval, but it applies particularly well to
all areas of research where knowledge is mapped as a network
of relations at various levels of abstraction and strength. We
believe that a systematic approach to understanding, modeling
and developing solutions for NMP could be extended to other
areas of applications where mediated interaction plays a cru-
cial role, particularly gaming and tele-presence applications.
Another topic that we believe could be of great interest
to musicians, and worth exploring, is the
possibility of fully exploiting the peculiarities of an NMP and
turning what are normally considered liabilities of this form
of communication into assets. We know, for example, that
NMPs are characterized by
• freedom from space and ensemble size constraints, which
means that an NMP could involve (in principle) an unlimited
number of performers;
• unbounded augmentation: the network can transport not
just sound but also control signals, therefore a musician
could simultaneously play/control a large number of
remote musical instruments;
• virtualization of space: sometimes in real interactive
performances the mutual positioning of musicians is not
optimal and not everyone is happy with their mutual
spatial location or the environmental conditions; NMP
potentially offers the possibility to personalize the positioning
and the environment to one’s preferences;
• internet acoustics: new possibilities open up also on the
acoustic front, which consist, for example, of sharing
a particularly favorable environment with the other
musicians and making the whole performance acoustically
interact with it; or making several environments
become part of a wider shared acoustic space (a global
and distributed reverberator), or creating physical model
instruments whose delay elements are network delays, or
creating a virtual global environment in which all performers
interact also from the same acoustical standpoint.
With this article we offered an aerial view of the current and
recent literature on Networked Music Performance research.
We discussed how this interaction modality is approached and
studied, with special attention to the unavoidable problem
of network latency. For this particular aspect we discussed
experiments, perceptual aspects and related evaluation met-
rics, and offered a comparative overview of what constitutes
the state of the art. We then discussed hardware/software
enabling technologies and experimental frameworks, with a
specific telecommunications and networking-oriented focus.
Finally we offered a critical perspective on possible future
research directions in the field of network-mediated musical
[1] ´
A. Barbosa, “Displaced soundscapes: A survey of network systems
for music and sonic art creation,” Leonardo Music Journal, vol. 13,
pp. 53–59, 2003. [Online]. Available:
barbosa LMJ13.pdf
[2] T. Blaine and C. Forlines, “Jam-o-world: evolution of the jam-o-drum
multi-player musical controller into the jam-o-whirl gaming interface,”
in Proceedings of the 2002 conference on New interfaces for musical
expression. National University of Singapore, 2002, Available:http:
// 012.pdf, pp. 1–6.
[3] C. Latta, “Notes from the netjam project,” Leonardo Music Journal,
pp. 103–105, 1991. [Online]. Available:
[4] “Midi composers’ exchange.” [Online]. Available: http://mp3.about.
[5] D. Akoumianakis, G. Ktistakis, G. Vlachakis, P. Zervas, and C. Alexan-
draki, “Collaborative music making as “remediated” practice,” in
Digital Signal Processing (DSP), 2013 18th International Confer-
ence on, July 2013, Available:
jsp?arnumber=6622733, pp. 1–8.
[6] G. Vlachakis, N. Karadimitriou, and D. Akoumianakis, “Using a dedi-
cated toolkit and the cloud to coordinate shared music representations,”
in Information, Intelligence, Systems and Applications, IISA 2014,
The 5th International Conference on. IEEE, 2014, Available:http:
//, pp. 20–26.
[7] W. Duckworth, “Making music on the web,” Leonardo Music
Journal, vol. 9, pp. 13–17, 1999. [Online]. Available: http:
[8] P. Rebelo and R. King, “Anticipation in networked musical per-
formance,” in Proceedings of the 2010 international conference on
Electronic Visualisation and the Arts. British Computer Society, 2010,
Available:, pp. 31–
[9] C. McKinney and N. Collins, “Yig, the father of serpents: A real-
time network music performance environment,” in Proceedings
of the 9th Sound and Music Computing Conference, 2012,
papers/Yigsmc2012.pdf, pp. 101–106.
[10] ´
A. Barbosa, J. Cardoso, and G. Geiger, “Network latency adaptive
tempo in the public sound objects system,” in Proceedings of the 2005
conference on New interfaces for musical expression. National Univer-
sity of Singapore, 2005, Available:
2005/nime2005 184.pdf, pp. 184–187.
[11] W. Woszczyk, J. Cooperstock, J. Roston, and W. Martens, “Shake,
rattle, and roll: Gettiing immersed in multisensory, interactiive
music via broadband networks,” Journal of the Audio Engineering
Society, vol. 53, no. 4, pp. 336–344, 2005. [Online]. Available:
[12] C. Alexandraki and I. Kalantzis, “Requirements and application
scenarios in the context of network based music collaboration,
in Conference i-maestro workshop, 2007 Available:http:
// and Application
Scenarios in the Context of Network Based Music Collaboration,
pp. 39–46.
[13] A. Carˆ
ot and C. Werner, “Fundamentals and principles of musical
telepresence,” Journal of Science and Technology of the Arts, vol. 1,
no. 1, pp. 26–37, 2009. [Online]. Available:
[14] F. Winckel, Music, sound and sensation: A modern exposition.
Courier Corporation, 2014. [Online]. Available:
[15] N. P. Lago and F. Kon, “The quest for low latency,” in Proceedings
of the International Computer Music Conference, 2004, Available:http:
//, pp. 33–36.
[16] T. M¨
aki-Patola, “Musical effects of latency,” Suomen
Musiikintutkijoiden, vol. 9, pp. 82–85, 2005. [Online]. Available:
[17] A. Askenfelt and E. V. Jansson, “From touch to string vibrations.
i: Timing in the grand piano action,The Journal of the Acoustical
Society of America, vol. 88, no. 1, pp. 52–63, 1990. [Online]. Available:
[18] M. Sarkar and B. Vercoe, “Recognition and prediction in a network
music performance system for indian percussion,” in Proceedings of the
7th international conference on New interfaces for musical expression.
ACM, 2007, Available:,
pp. 317–320.
[19] C. Alexandrak and R. Bader, “Using computer accompaniment to
assist networked music performance,” in Audio Engineering Society
Conference: 53rd International Conference: Semantic Audio. Au-
dio Engineering Society, 2014, Available:
AES53 AlexandrakiBader.pdf.
[20] B. Vera and E. Chew, “Towards seamless network music perfor-
mance: Predicting an ensemble?s eexpressive decisions for distributed
performance,” in Proceedings of the 15th International Society for
Music Information Retrieval Conference, 2014, Available:http://www. 194 Paper.pdf, pp.
[21] A. Car ˆ
ot and G. Schuller, “Towards a telematic visual-conducting
system,” in Audio Engineering Society Conference: 44th International
Conference: Audio Networking. Audio Engineering Society, 2011,
[22] R. Oda, A. Finkelstein, and R. Fiebrink, “Towards note-level
prediction for networked music performance,” in Proceedings of
the 13th International Conference on New Interfaces for Musical
Expression, 2013, Available:
Oda-Finkelstein- Fiebrink NIME2013.pdf.
[23] R. Canning, “Real-time web technologies in the networked perfor-
mance environment,” in Proceedings of the 2012 International Com-
puter Music Conference. Ann Arbor, MI: MPublishing, University
of Michigan Library, Sept. 2012, Available:
[24] M. Ritter, K. Hamel, and B. Pritchard, “Integrated multimodal score-
following environment,” in Proceedings of the International Com-
puter Music Conference, 2013, Available:
[25] A. Carˆ
ot and C. Werner, “Network music performance-problems,
approaches and perspectives,” in Proceedings of the ?Music in the
Global Village?-Conference, Budapest, Hungary, 2007, Available: http:
// AC CW.pdf.
[26] ´
A. Barbosa and J. Cordeiro, “The influence of perceptual attack
times in networked music performance,” in Proceedings of the Au-
dio Engineering Society Conference: 44th International Conference:
Audio Networking. San Diego: Audio Engineering Society, 2011,
[27] P. Teehan, M. Greenstreet, and G. Lemieux, “A survey and
taxonomy of gals design styles,” Design & Test of Computers,
IEEE, vol. 24, no. 5, pp. 418–428, 2007. [Online]. Available:
[28] M. Gurevich, C. Chafe, G. Leslie, and S. Tyan, “Simulation of
networked ensemble performance with varying time delays: Charac-
terization of ensemble accuracy,” in Proceedings of the 2004 Inter-
national Computer Music Conference, Miami, USA, 2004, Available:
[29] C. Chafe, J.-P. C´
aceres, and M. Gurevich, “Effect of temporal
separation on synchronization in rhythmic performance,” Perception,
vol. 39, no. 7, pp. 982–992, 2010. [Online]. Available: https:
[30] C. Chafe and M. Gurevich, “Network time delay and ensemble ac-
curacy: Effects of latency, asymmetry,” in Audio Engineering Society
Convention 117. Audio Engineering Society, 2004, Available:http:
[31] C. Chafe, M. Gurevich, G. Leslie, and S. Tyan, “Effect of time delay
on ensemble accuracy,” in Proceedings of the International Symposium
on Musical Acoustics, vol. 31, 2004, Available:https://ccrma.stanford.
2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2016.2628440, IEEE Access
[32] S. Farner, A. Solvang, A. Sæbo, and U. P. Svensson, “Ensemble
hand-clapping experiments under the influence of delay and
various acoustic environments, Journal of the Audio Engineering
Society, vol. 57, no. 12, pp. 1028–1041, 2009. [Online]. Available:
[33] P. F. Driessen, T. E. Darcie, and B. Pillay, “The effects of
network delay on tempo in musical performance,” Computer Music
Journal, vol. 35, no. 1, pp. 76–89, 2011. [Online]. Available: a 00041
[34] C. Bartlette and M. Bocko, “Effect of network latency on
interactive musical performance,JSTOR, 2006. [Online]. Available:
[35] A. Carˆ
ot, C. Werner, and T. Fischinger, Towards a comprehensive
cognitive analysis of delay-influenced rhythmical interaction. Ann
Arbor, MI: MPublishing, University of Michigan Library, 2009, Avail-
[36] A. Olmos, M. Brul ´
e, N. Bouillot, M. Benovoy, J. Blum, H. Sun,
N. W. Lund, and J. R. Cooperstock, “Exploring the role of la-
tency and orchestra placement on the networked performance of
a distributed opera,” in Twelfth Annual International Workshop
on Presence, 2009, Available:
Proceedings/2009/Olmos et al.pdf, p. 9.
[37] E. Chew, R. Zimmermann, A. A. Sawchuk, C. Kyriakakis, C. Pa-
padopoulos, A. Franc¸ois, G. Kim, A. Rizzo, and A. Volk, “Musical
interaction at a distance: Distributed immersive performance,” in Pro-
ceedings of the MusicNetwork Fourth Open Workshop on Integration
of Music in Multimedia Applications, 2004, Available:http://www.staff., pp. 15–16.
[38] E. Chew, A. Sawchuk, C. Tanoue, and R. Zimmermann, “Seg-
mental tempo analysis of performances in user-centered experi-
ments in the distributed immersive performance project,” in Pro-
ceedings of the Sound and Music Computing Conference, Salerno,
Italy, 2005, Available:
ElanieChew/cstz-smc05 final.pdf.
[39] E. Chew, R. Zimmermann, A. Sawchuk, C. Papadopoulos, C. Kyri-
akakis, C. Tanoue, D. Desai, M. Pawar, R. Sinha, and W. Meyer, “A
second report on the user experiments in the distributed immersive
performance project,” in Proceedings of the 5th Open Workshop
of MUSICNETWORK: Integration of Music in Multimedia Applica-
tions. Citeseer, 2005, Available:
pubs/userexperimentsindip-musicnetwork05.pdf, pp. 1–7.
[40] C. Rottondi, M. Buccoli, M. Zanoni, D. Garao, G. Verticale,
and A. Sarti, “Feature-based analysis of the effects of packet
delay on networked musical interactions,” J. Audio Eng. Soc,
vol. 63, no. 11, pp. 864–875, 2015. [Online]. Available: http:
[41] Y. Kobayashi, Y. Nagata, and Y. Miyake, “Analysis of network en-
semble with time lag,” in Computational Intelligence in Robotics and
Automation, 2003. Proceedings. 2003 IEEE International Symposium
on, vol. 1. IEEE, 2003, Available:
all.jsp?arnumber=1222112, pp. 336–341.
[42] A. A. Sawchuk, E. Chew, R. Zimmermann, C. Papadopoulos, and
C. Kyriakakis, “From remote media immersion to distributed immersive
performance,” in Proceedings of the 2003 ACM SIGMM Workshop
on Experiential Telepresence, ser. ETP ’03. New York, NY, USA:
ACM, 2003, Available:, pp.
[43] S. Hemminger, “Network Emulation with NetEm,” 2005.
[Online]. Available:
LCA2005 paper.pdf
[44] B. H. Repp, “Sensorimotor synchronization: a review of the tapping
literature,” Psychonomic bulletin & review, vol. 12, no. 6, pp. 969–992,
2005. [Online]. Available:
[45] J. Pressing, “The referential dynamics of cognition and action.” Psy-
chological Review, vol. 106, no. 4, p. 714, 1999.
[46] J. P. Caceres, “Synchronization in rhythmic performance with delay,
PhD Thesis, 2013.
[47] W. C. Lindsey, F. Ghazvinian, W. C. Hagmann, and K. Dessouky,
“Network synchronization,” Proceedings of the IEEE, vol. 73, no. 10,
pp. 1445–1467, 1985.
[48] H. G. Schuster and P. Wagner, “Mutual entrainment of two limit
cycle oscillators with time delayed coupling,” Progress of Theoretical
Physics, vol. 81, no. 5, pp. 939–945, 1989. [Online]. Available:
[49] K. C. Pohlmann, Principles of digital audio, 5th edition. McGraw-Hill
New York, 2005.
[50] M. Nilsson, “Rfc3003: The audio/mpeg media type,” 2000. [Online].
[51] M. Lutzky, G. Schuller, M. Gayer, U. Kr¨
amer, and S. Wabnik,
“A guideline to audio codec delay,” in AES 116th convention,
Berlin, Germany, 2004, Available:
content/dam/iis/de/doc/ame/conference/AES-116- Convention
guideline-to- audio-codec- delay AES116.pdf, pp. 8–11.
[52] J.-M. Valin, K. Vos, and T. B. Terriberry, “Rfc6716: Definition
of the opus audio codec,” Sept. 2012. [Online]. Available:
[53] J.-M. Valin, G. Maxwell, T. B. Terriberry, and K. Vos, “High-quality,
low-delay music coding in the opus codec,” in Audio Engineering
Society Convention 135. Audio Engineering Society, 2013, Avail-
[54] J. M. Valin, T. B. Terriberry, and G. Maxwell, “A full-bandwidth
audio codec with low complexity and very low delay,” in Sig-
nal Processing Conference, 2009 17th European, Aug 2009, Avail-
able: all.jsp?arnumber=7077384, pp.
[55] M. Lutzky, “Fraunhofer iis uld audio software - ultra low delay (uld)
audio encoders and decoders,” 2004.
[56] Z. Kurtisi and L. Wolf, “Using wavpack for real-time audio coding
in interactive applications,” in Multimedia and Expo, 2008 IEEE
International Conference on. IEEE, 2008, Available:http://ieeexplore. all.jsp?arnumber=4607701, pp. 1381–1384.
[57] A. Pcjic, P. Stanic, and S. Pletl, “Analysis of packet loss prediction
effects on the objective quality measures of opus codec,” in Intelligent
Systems and Informatics (SISY), 2014 IEEE 12th International Sym-
posium on. IEEE, 2014, Available:
stamp.jsp?arnumber=6923611, pp. 33–37.
[58] U. Kr¨
amer, H. Jens, G. Schuller, S. Wabnik, A. Car ˆ
ot, and C. Werner,
“Network music performance with ultra-low-delay audio coding under
unreliable network conditions,” in Audio Engineering Society Conven-
tion 123. Audio Engineering Society, 2007, Available:http://www.aes.
[59] C. Drioli, C. Allocchio, and N. Buso, Information Technologies
for Performing Arts, Media Access, and Entertainment: Second
International Conference, ECLAP 2013, Porto, Portugal, April 8-
10, 2013, Revised Selected Papers. Berlin, Heidelberg: Springer
Berlin Heidelberg, 2013, ch. Networked Performances and Natural
Interaction via LOLA: Low Latency High Quality A/V Streaming
System, pp. 240–250. [Online]. Available:
978-3- 642-40050- 6 21
[60] J.-P. C´
aceres and C. Chafe, “Jacktrip: Under the hood
of an engine for network audio,” Journal of New
Music Research, vol. 39, no. 3, pp. 183–187, 2010.
[Online]. Available:
PDFs/conferences/2009-caceres chafe- ICMC-jacktrip.pdf
[61] C. Alexandraki and D. Akoumianakis, “Exploring new perspectives in
network music performance: The diamouses framework,Computer
Music Journal, vol. 34, no. 2, pp. 66–83, 2010. [Online]. Available:
[62] G. Baltas and G. Xylomenos, “Ultra low delay switching for net-
worked music performance,” in Information, Intelligence, Systems
and Applications, IISA 2014, The 5th International Conference
on. IEEE, 2014, Available: all.jsp?
arnumber=6878798, pp. 70–74.
[63] N. Bouillot and J. R. Cooperstock, “Challenges and performance of
high-fidelity audio streaming for interactive performances,” in Proceed-
ings of the 9th international conference on New Interfaces for Musical
Expression (NIME’09), Pittsburgh, 2009, Available:
[64] [Online]. Available:
[65] “Iso/iec 7498-1: Information technology - open system
interconnections - basic reference model - the basic model,”
1996. [Online]. Available:
[66] A. Carˆ
ot, U. Kr¨
amer, and G. Schuller, “Network music performance
(nmp) in narrow band networks,” in Audio Engineering Society Con-
vention 120. Audio Engineering Society, 2006, Available:http://www.
2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2016.2628440, IEEE Access
[67] D. Van Aken and S. Peckeelbeen, “Encapsulation overhead (s) in adsl
access networks,” Thomson SpeedTouch, v1. 0 edition, 2003. [Online].
[68] L. Angrisani, G. Ventre, L. Peluso, and A. Tedesco, “Measurement of
processing and queuing delays introduced by an open-source router in
a single-hop network,” Instrumentation and Measurement, IEEE Trans-
actions on, vol. 55, no. 4, pp. 1065–1076, 2006. [Online]. Available:
[69] W. Zhang and J. He, “Statistical modeling and correlation analysis
of end-to-end delay in wide area networks,” in Software Engineering,
Artificial Intelligence, Networking, and Parallel/Distributed Comput-
ing, 2007. SNPD 2007. Eighth ACIS International Conference on,
vol. 3, July 2007, Available:
jsp?tp=&arnumber=4287989, pp. 968–973.
[70] M. Lucas, D. Wrege, B. Dempsey, and A. Weaver, “Statistical
characterization of wide-area ip traffic,” in Computer Communica-
tions and Networks, 1997. Proceedings., Sixth International Confer-
ence on, Sep 1997, Available: all.jsp?
arnumber=623349, pp. 442–447.
[71] E. Thirunavukkarasu and E. Karthikeyan, “A survey on voip
packet loss techniques,” Intl. J. Communication Networks
and Distributed Systems, vol. 14, pp. 106–116, 2015.
[Online]. Available:
[72] C. Perkins, O. Hodson, and V. Hardman, “A survey of packet loss
recovery techniques for streaming audio,IEEE Network, vol. 12,
pp. 40–48, 1998. [Online]. Available:
[73] B. W. Wah, X. Su, and D. Lin, “A survey of error-concealment schemes
for real-time audio and video transmissions over the internet,” in Proc.
Intl. Symp. on Multimedia Software Engineering., 2000, pp. 17–24.
[74] J.-Y. Lee, H.-G. Kim, and J. Y. Kim, “Packet loss concealment for
improving audio streaming service,” in Mobile and Wireless Technology
2015, ser. Lecture Notes in Electrical Engineering, K. J. Kim and
N. Wattanapongsakorn, Eds. Springer, 2015, vol. 310, pp. 123–126.
[Online]. Available:
[75] D. Figueiredo, E. De Souza, and E. Silva, “A survey of error-
concealment schemes for real-time audio and video transmissions over
the internet,” in IEEE Globecom., vol. 3, 1999, pp. 1830–1837.
[76] J. Yeh, P. Lin, M. Kuo, and Z. Hsu, “Bilateral waveform similarity
overlap-and-add based packet loss concealment for voice over ip,
Journal of Applied Research and Technology, vol. 11, pp. 559–
567, 2013. [Online]. Available:
[77] K. Maheswari and M. Punithavalli, “Performance evaluation of packet
loss replacement using repetititon technique in voip streams,” Intl. J.
Computer Information Systems and Industrial Management Applica-
tions (IJCISIM), vol. 2, pp. 289–296, 2010.
[78] ITU, “ITU recommendation G.711 appendix I,” 1999. [Online].
Available: REC-G.711- 199909-I!AppI/en
[79] J.-H. Chen, “Packet loss concealment based on extrapolation of speech
waveform,” in Intl. Conf. on Acoustics, Speech, and Signal Processing.
ICASSP-09., 2009, pp. 4129–4132.
[80] [Online]. Available:
[81] R. Zimmermann, E. Chew, S. A. Ay, and M. Pawar, “Distributed
musical performances: Architecture and stream management,” ACM
Transactions on Multimedia Computing, Communications, and
Applications (TOMCCAP), vol. 4, no. 2, p. 14, 2008. [Online].
[82] A. Carˆ
ot and C. Werner, “Distributed network music workshop with
soundjack,” Proceedings of the 25th Tonmeistertagung, Leipzig, Ger-
many, 2008. [Online]. Available:
[83] D. Akoumianakis, C. Alexandraki, V. Alexiou, C. Anagnostopoulou,
A. Eleftheriadis, V. Lalioti, A. Mouchtaris, D. Pavlidi, G. Polyzos,
P. Tsakalides et al., “The musinet project: Towards unraveling the full
potential of networked music performance systems,” in Information,
Intelligence, Systems and Applications, IISA 2014, The 5th Interna-
tional Conference on. IEEE, 2014, Available:
org/xpls/abs all.jsp?arnumber=6878779&tag=1, pp. 1–6.
[84] P. Holub, J. Matela, M. Pulec, and M. ˇ
Srom, “Ultragrid: Low-
latency high-quality video transmissions on commodity hardware,” in
Proceedings of the 20th ACM International Conference on Multimedia,
ser. MM ’12. New York, NY, USA: ACM, 2012, Available:http:
//, pp. 1457–1460. [Online].
[85] D. El-Shimy and J. R. Cooperstock, “Reactive environment for network
music performance,” in Proceedings of New Interfaces for Musi-
cal Expression, May 2013, Available:
nime2013 66.pdf.
[86] A. Kapur, G. Wang, P. Davidson, and P. R. Cook, “Interactive
network performance: a dream worth dreaming?” Organised Sound,
vol. 10, no. 03, pp. 209–219, 2005. [Online]. Available: http:
[87] X. Gu, M. Dick, Z. Kurtisi, U. Noyer, and L. Wolf, “Network-centric
music performance: Practice and experiments,” Communications
Magazine, IEEE, vol. 43, no. 6, pp. 86–93, 2005. [Online]. Available: all.jsp?arnumber=1452835&tag=1
[88] C. Chafe, S. Wilson, A. Leistikow, D. Chisholm, and G. Scavone,
“A simplified approach to high quality music and sound over ip,” in
COST-G6 Conference on Digital Audio Effects (DAFx-00, 2000, Avail-
[89] J.-P. C´
aceres and C. Chafe, “Jacktrip/soundwire meets server farm,”
Computer Music Journal, vol. 34, no. 3, pp. 29–34, 2010. [Online].
Available: a 00001
[90] L. Gabrielli and S. Squartini, “Wireless networked music performance,
in Wireless Networked Music Performance. Springer, 2016, Available:, pp. 53–92.
[91] F. Meier, M. Fink, and U. Z¨
olzer, “The jamberry-a stand-alone device
for networked music performance based on the raspberry pi,” in Linux
audio conference, Karlsruhe, 2014, Available:
[92] R. E. Saputra and A. S. Prihatmanto, “Design and implementation
of beatme as a networked music performance (nmp) system,” in
Proceedings of the International Conference on System Engineering
and Technology (ICSET). IEEE, 2012, Available:
org/stamp/stamp.jsp?arnumber=6339349, pp. 1–6.
[93] R. Renwick, SOURCENODE: A Network Sourched
Approach to Network Music Performacne (NMP). Ann
Arbor, MI: MPublishing, University of Michigan Library,
sourcenode-a- network-sourced- approach-to- network-music.pdf?
[94] J. Lazzaro and J. Wawrzynek, “A case for network musical perfor-
mance,” in Proceedings of the 11th international workshop on Network
and operating systems support for digital audio and video. ACM,
2001, Available:, pp. 157–
[95] S. Guha, N. Daswani, and R. Jain, “An Experimental Study of the
Skype Peer-to-Peer VoIP System,” in Proceedings of The 5th Interna-
tional Workshop on Peer-to-Peer Systems (IPTPS), Santa Barbara, CA,
February 2006, pp. 1 – 6.
[96] W. Taymans, S. Baker, A. Wingo, R. S. Bultje, and S. Kost, “Gstreamer
application development manual (1.2. 3),Publicado en la Web, 2013.
[97] L. Handberg, A. Jonsson, and K. Claus, “Community building through
cultural exchange in mediated performance events: a conference 2005,
in The Virtuala room without borders? School of Communication,
Technology and Design S¨
orn College University, 2005.
[98] M. Wozniewski, N. Bouillot, Z. Settel, and J. R. Cooperstock, “Large-
scale mobile audio environments for collaborative musical interaction,
in 8 th International Conference on New Interfaces for Musical Ex-
pression NIME08, 2008, Available:
mobileAudioscape NIME2008.pdf, p. 13.
[99] C. Stais, Y. Thomas, G. Xylomenos, and C. Tsilopoulos, “Networked
music performance over information-centric networks,” in Proceedings
of the IEEE International Conference on Communications Workshops
(ICC). IEEE, 2013, Available: all.
jsp?arnumber=6649313, pp. 647–651.
[100] G. Xylomenos, C. Tsilopoulos, Y. Thomas, and G. C. Polyzos, “Re-
duced switching delay for networked music performance,” in Packet
Video Workshop (Poster Session), 2013, Available:
publications/2013-MCU- PV.pdf.
[101] J. W. Lockwood, N. McKeown, G. Watson, G. Gibb, P. Hartke,
J. Naous, R. Raghuraman, and J. Luo, “Netfpga–an open platform
for gigabit-rate network switching and routing,” in Microelectronic
Systems Education, 2007. MSE’07. IEEE International Conference
on. IEEE, 2007, Available: all.jsp?
arnumber=4231497, pp. 160–161.
2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2016.2628440, IEEE Access
[102] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek,
“The click modular router,ACM Transactions on Computer Systems
(TOCS), vol. 18, no. 3, pp. 263–297, 2000. [Online]. Available:
[103] L. Rizzo, “netmap: A novel framework for fast packet i/o.” in
USENIX Annual Technical Conference, 2012,Available:http://dl.acm.
org/citation.cfm?id=2342821.2342830, pp. 101–112.
[104] [Online]. Available:
[105] Z. Kurtisi, X. Gu, and L. Wolf, “Enabling network-centric
music performance in wide-area networks,” Communications of
the ACM, vol. 49, no. 11, pp. 52–54, 2006. [Online]. Available:
[106] D. Akoumianakis, C. Alexandraki, V. Alexiou, C. Anagnostopoulou,
A. Eleftheriadis, V. Lalioti, A. Mouchtaris, D. Pavlidi, G. Polyzos,
P. Tsakalides, G. Xylomenos, and P. Zervas, “The musinet project:
Addressing the challenges in networked music performance systems,”
in Proc. of International Conference on Information, Intelligence,
Systems and Applications, July 2015,Available:
[107] G. Davies, “The effectiveness of lola (low latency) audiovisual
streaming technology for distributed music practice.” [Online].
Available: effectiveness
of LOLA LOw LAtency audiovisual streaming technology for
distributed music practice
[108] C. Chafe, “Living with net lag,” in Audio Engineering Society Confer-
ence: 43rd International Conference: Audio for Wirelessly Networked
Personal Devices. Audio Engineering Society, 2011, Available:https:
[109] C. Alexandraki, P. Koutlemanis, P. Gasteratos, N. Valsamakis,
D. Akoumianakis, G. Milolidakis, G. Vellis, and D. Kotsalis, “To-
wards the implementation of a generic platform for networked mu-
sic performance: the diamouses approach,” in PROCEEDINGS OF
Belfast, 2008, Available:
c=icmc;idno=bbp2372.2008.101, pp. 251–258.
[110] D. Akoumianakis, G. Vellis, I. Milolidakis, D. Kotsalis, and C. Alexan-
draki, “Distributed collective practices in collaborative music perfor-
mance,” in Proceedings of the 3rd International Conference on Digital
Interactive Media in Entertainment and Arts, ser. DIMEA ’08. New
York, NY, USA: ACM, 2008, Available:
1413634.1413700, pp. 368–375.
[111] L. Gabrielli, S. Squartini, E. Principi, and F. Piazza, “Networked bea-
gleboards for wireless music applications,” in Education and Research
Conference (EDERC), 2012 5th European DSP, Sept 2012, Avail-
able: all.jsp?arnumber=6532274, pp.
[112] L. Gabrielli, S. Squartini, and F. Piazza, “Advancements and perfor-
mance analysis on the wireless music studio (wemust) framework,” in
Audio Engineering Society Convention 134. Audio Engineering Soci-
ety, 2013, Available:
[113] L. Gabrielli, M. aBussolotto, and S. Squartini, “Reducing the latency in
live music transmission with the beagleboard xm through resampling,
in Education and Research Conference (EDERC), 2014 6th European
Embedded Design in, Sept 2014, Available:
stamp/stamp.jsp?arnumber=6924409, pp. 302–306.
[114] J. Postel, “Rfc768: User datagram protocol,” 1980. [Online]. Available:
[115] R. Sinha, C. Papadopoulos, and C. Kyriakakis, “Loss concealment for
multi-channel streaming audio,” in Proceedings of the 13th interna-
tional workshop on Network and operating systems support for digital
audio and video. ACM, 2003, Available:
christos/papers/nossdav03.pdf, pp. 100–109.
[116] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, “Rfc3550:
Rtp: A transport protocol for real-time applications,” 2003. [Online].
[117] J. Lazzaro and J. Wawrzynek, “Rfc6295: Rtp payload format for
midi,” 2011. [Online]. Available:
[118] R. Oda and R. Fiebrink, “The global metronome: Absolute tempo sync
for networked musical performance,” in 16th International Conference
on New Interfaces for Musical Expression, NIME 2016, Brisbane,
Australia, July 2016.
[119] A. Carˆ
ot and C. Werner, “External latency-optimized soundcard syn-
chronization for applications in wide-area networks,” in AES 14th
regional convention, Tokio, Japan, vol. 7, 2009, Available:http://www. Tokyo.pdf, p. 10.
[120] A. Wilkins, J. Veitch, and B. Lehman, “Led lighting flicker and
potential health concerns: Ieee standard par1789 update,” in 2010 IEEE
Energy Conversion Congress and Exposition. IEEE, 2010, pp. 171–
[121] M. Bretan and G. Weinberg, “A survey of robotic musicianship,
Communications of the ACM, vol. 59, no. 5, pp. 100–109, 2016.
Cristina Rottondi received both her Master's and Ph.D. degrees cum laude in
Telecommunications Engineering from Politecnico di Milano in 2010 and
2014, respectively. She is currently a postdoctoral researcher at the Department
of Electronics, Information, and Bioengineering (DEIB) of Politecnico di
Milano. Her research interests include cryptography, communication security,
design and planning of optical networks, and networked music performance.
Chris Chafe is a composer, improvisor, and cellist, developing much of his
music alongside computer-based research. He is Director of Stanford Uni-
versity’s Center for Computer Research in Music and Acoustics (CCRMA).
At IRCAM (Paris) and The Banff Centre (Alberta), he pursued methods
for digital synthesis, music performance and real-time internet collaboration.
CCRMA’s SoundWIRE project involves live concertizing with musicians the
world over.
Claudio Allocchio is a physicist and computer scientist. He was a researcher
at the Astronomical Observatory in Trieste and at the European Organization
for Nuclear Research (CERN) in Geneva. He contributed to the foundation
of the GARR Consortium (the ultrawide-band Italian research and education
network), where he is currently appointed as Senior Officer. Since 1990
he has collaborated with the Internet Engineering Task Force (IETF), and he is
currently a senior member of the Application Area Directorate. Since 2005
he has collaborated on the development of LOLA (LOw LAtency Audio Visual
Streaming System) and organizes the Network Performing Arts Production
Workshop every year as a member of the programme committee.
Augusto Sarti received the M.S. and the Ph.D. degrees in electronic engineer-
ing, both from the University of Padua, Italy, in 1988 and 1993, respectively.
His graduate studies included a joint graduate program with the University
of California, Berkeley. In 1993, he joined the Politecnico di Milano, Milan,
Italy, where he is currently an Associate Professor. In 2013, he also joined
the University of California, Davis, as an Adjunct Professor. His research
interests are in the area of multimedia signal processing, with particular focus
on sound analysis, synthesis and processing; space-time audio processing;
geometrical acoustics; and music information extraction. He has also worked
on problems of multidimensional signal processing, image analysis and 3D
vision. He has coauthored well over 200 scientific publications in international
journals and congresses, as well as numerous patents in the multimedia signal
processing area. He is an active member of the IEEE Technical Committee
on Audio and Acoustics Signal Processing, and is on the Editorial Boards
of IEEE Signal Processing Letters and of IEEE Transactions on Audio, Speech and
Language Processing.
... In the past two decades and in particular in the last few years, we have witnessed the birth and diffusion of an increasing number of technologies, products, and applications at the intersection of music and networking [5,10,13,21,25,35,53]. As a result of the growing attention devoted by academy and industry to this area, three main research fields have emerged and progressively consolidated: the Networked Music Performances (NMP) [60], Ubiquitous Music (Ubimus) [42,48], and lately the Internet of Musical Things (IoMusT) [76]. ...
... In the following, we report a brief overview of the most relevant ones. The interested reader may refer to [60] for a thorough survey. ...
... Music-related information refers to data sensed and/or processed by a Musical Thing, and/or communicated to a human or another Musical Thing for musical purposes. A Musical Thing is a device capable of sensing, acquiring, actuating, exchanging, or processing data for musical purposes. The IoMusT research field originates from the integration of many lines of existing research including ubimus [42], networked music performance systems [33,60], Internet of Things [11], new interfaces for musical expression [39], music information retrieval [15], human-computer interaction [61], Musical XR [83], and participatory art [34]. ...
In the past two decades, we have witnessed the diffusion of an increasing number of technologies, products, and applications at the intersection of music and networking. As a result of the growing attention devoted by academia and industry to this area, three main research fields have emerged and progressively consolidated: the Networked Music Performances, Ubiquitous Music, and the Internet of Musical Things. Based on the review of the most relevant works in these fields, this paper attempts to delineate their differences and commonalities. The aim of this inquiry is to help avoid confusion between such fields and to achieve a correct use of the terminology. A trend towards the convergence between such fields has already been identified, and it is plausible to expect that in the future their evolution will lead to a progressive blurring of the boundaries identified today.
... In this context, various technologies such as Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR) address a wide range of musical applications (composition, education, entertainment, perception study, performance, and sound engineering) [1]. At the same time, the technical possibilities for networked music making between geographically distant locations increased, resulting in a demand for such systems [2]. ...
... Since there is a relatively small body of research that is concerned with the importance of the visual component in multi-user experience of INMP [2,10], especially exploring the Realistic Interaction Approaches (RIA) [12] using musical Extended Reality Environments (XRE) [7], it seems to be worthwhile to create future musical XREs and to analyze their influence on the subjective perception of co-presence in comparison to the musical outcome. ...
... For our internal use of the IRENE-system, an existing Dante infrastructure is used for audio streaming. In case of a Wide Area Network (WAN) connection between several computers, various software solutions [2] or low latency hardware solutions [4] can be easily integrated into the system infrastructure. ...
This paper presents an Extended Reality Environment (XRE) for Immersive Networked Music Performances (INMP) called Immersive Room ExtensioN Environment (IRENE). The design of the IRENE-system includes a position-dependent visual and auditory representation of the users within a Virtual Room Extension (VRE). Early reflections and diffuse reverberation of a shared virtual space are rendered to provide participants a sense of co-presence. A continuous transition between the real and the virtual space is achieved by the visual representation using projection screens, enabling a multiuser experience. Several audiovisual environments have been implemented and are presented. The IRENE-system enables a wide range of future investigations in the field of INMP regarding the objective and subjective benefits of musical Extended Reality (XR) with focus on immersive audio.
... Such instruments are the result of the integration of embedded computational intelligence (running on dedicated platforms, e.g., [62]), wireless connectivity, embedded sound delivery system, and an onboard system for feedback to the player. They offer direct point-to-point communication between each other and other portable sensor-enabled devices connected to local networks and to the Internet, fostering networked music performance systems [63,64]. To date, only a handful of instruments with such characteristics exist. ...
... The first kind concerns ecosystems in which low-latency communication is a crucial aspect (e.g., < 100 milliseconds, depending on the application at hand). To this category belong ecosystems based on applications for real-time networked music performances [63,64] and in general real-time interactive musical systems. The second kind regards those ecosystems in which the communication may be asynchronous or real-time but tolerating relatively large latencies. ...
The Internet of Musical Things (IoMusT) refers to the extension of the Internet of Things paradigm to the musical domain. Interoperability represents a central issue within this domain, where heterogeneous Musical Things serving radically different purposes are envisioned to communicate with each other. Automatic discovery of resources is also a desirable feature in IoMusT ecosystems. However, the existing musical protocols are not adequate to support discoverability and interoperability across the wide heterogeneity of Musical Things, as they are typically not flexible, lack high resolution, and are not equipped with inference mechanisms that could exploit on board the information on the whole application environment. Besides, they hardly ever support easy integration with the Web. In addition, IoMusT applications are often characterized by strict requirements in terms of latency of the exchanged messages. Semantic Web of Things technologies have the potential to overcome the limitations of existing musical protocols by enabling discoverability and interoperability across heterogeneous Musical Things. In this paper we propose the Musical Semantic Event Processing Architecture (MUSEPA), a semantically-based architecture designed to meet the IoMusT requirements of low-latency communication, discoverability, interoperability, and automatic inference. The architecture is based on the CoAP protocol, a semantic publish/subscribe broker, and the adoption of shared ontologies for describing Musical Things and their interactions. The code implementing MUSEPA can be accessed at:
... The COVID-19 pandemic and advances in network technologies such as 5G have increased the demand for real-time interactive applications that use video and audio media such as remote Personal Computer (PC) operations, remote robot and machine operations [1], cloud gaming [2], and remote ensembles [3]. The experience of such interactive applications is determined by the latency of the media transfer. ...
To improve the experience of real-time interactive applications based on video and audio, there is a growing demand to realize deterministic networks that can transmit these media with low latency. In recent years, various deterministic network technologies have been studied such as Time Sensitive Networking (TSN). Time slot allocation type schemes such as Time-Aware Shaper (TAS) offer guaranteed delayed determinism and zero-jitter, but suffer from low network accommodation efficiency. Shaper-based techniques such as Asynchronous Traffic Shaping (ATS) provide high network accommodation efficiency and delay determinism inside the network. However, if the input traffic from the sender is bursty, End-to-End delay increases due to the shaping delay caused by the shapers at the edges of the network. In this paper, we propose Delay-Based Shaper (DBS), which dynamically controls the bandwidth while shaping the bursts so that the upper bound of the shaping delay is protected. In addition, we also propose a Dynamic Token Bucket Algorithm (DTBA) that extends the conventional token bucket algorithm to implement DBS. We show that DBS can both shape bursts and comply with the upper bound of shaping delay by comparing the behavior when bursts are input via a conventional shaper. We also demonstrate good End-to-End delay determinism and high network accommodation efficiency by applying DBS to the edge of the network.
... To ensure a realistic musical interplay and good quality of experience in NMP, very strict requirements must be satisfied to keep the one-way end-to-end transmission latency below a few tens of milliseconds [7]. Using uncompressed audio streams and leveraging UDP (User Datagram Protocol) as transport layer protocol is a common solution to reduce time overhead in a real time application. ...
... The Mouth to Ear (M2E) delay is used in many studies, either for the evaluation of telecommunication systems in the case of oral communication, or to investigate the delay impact in tempo and musical performance in NMP [3]. It is often referred to as the one way delay, since it covers one direction of communication. ...
Music collaboration over the Internet, known as Network Music Performance (NMP), remains a challenge for researchers and engineers, since transmission, switching and audio processing delays hinder the synchronization of the participating musicians. Although widely available Web-based voice and video communication tools are not designed for real-time remote musical performances, during the pandemic many musicians worldwide had to use them, due to the lack of widespread NMP-oriented tools. In this paper we provide measurements for the end-to-end audio delay of a number of Web conference tools, using real network conditions, either in a LAN or in a WAN setting, and compare them to the corresponding delays exhibited by an NMP-specific tool.
Restrictions arising from the COVID-19 pandemic have limited opportunities for older people to participate in face-to-face organised social activities. Many organisations moved these activities online, but little is known about older adults' experiences of participating in those activities. This paper reports an investigation of older adults' experiences of participating in social activities that they used to attend in-person, but which were moved online because of strict lockdown restrictions. We conducted in-depth interviews with 40 older adults living independently (alone or with others). Findings from a reflexive thematic analysis show that online social activities were important during the pandemic for not only staying connected to other people but also helping older adults stay engaged in meaningful activities, including arts, sports, cultural, and civic events. Online activities provided older adults with opportunities to connect with like-minded people; share care, encouragement, and support; participate in civic agendas; learn knowledge and develop new skills; and experience entertainment, distraction, and mental stimulation. Our participants had diverse perceptions of the transition from in-person to online social activities. Based on the findings, we present a taxonomy of multi-layered meaningful activities for older adults' digital social participation and highlight implications for future technology design.
A Networked Music Performance (NMP) is defined as what happens when geographically displaced musicians interact together while connected via a network. The first NMP experiments began in the 1970s; however, only recently has the development of network communication technologies created the infrastructure needed to successfully create an NMP. Moreover, the widespread adoption of network-based interactions during the COVID-19 pandemic has generated a renewed interest in distant music-based interaction. In this chapter we present the Intelligent networked Music PERforMANce experiENCEs (IMPERMANENCE) as a comprehensive NMP framework that aims at creating a compelling performance experience for the musicians. To do this, we first develop the neTworkEd Music PErfoRmANCe rEsearch (TEMPERANCE) framework in order to understand the main needs of the participants in an NMP. Informed by these results, we then develop IMPERMANENCE accordingly.
In this chapter, a quantum music generation application called QuiKo is introduced. QuiKo combines existing quantum algorithms with data encoding methods from Quantum Machine Learning to build drum and audio sample patterns from a database of audio tracks. QuiKo leverages the physical properties and characteristics of quantum computing to generate what can be referred to as soft rules. These rules take advantage of noise produced by quantum devices to develop rules for music generation. These properties include qubit decoherence and phase kickback to controlled quantum gates within the quantum circuit. QuiKo attempts to mimic and react to external musical inputs, like the way that human musicians play and compose with one another. Audio signals (ideally rhythmic in nature) are used as inputs into the system. Feature extraction is performed on the signal to identify its harmonic and percussive elements. This information is then encoded onto a quantum circuit. Then, measurements of the quantum circuit are taken providing results in the form of probability distributions. These distributions are used to build a new rhythmic pattern.