Proceedings of the 17th Linux Audio Conference (LAC-19), CCRMA, Stanford University, USA, March 23–26, 2019
A JACK-BASED APPLICATION FOR SPECTRO-SPATIAL ADDITIVE SYNTHESIS
Henrik von Coler
Audio Communication Group
TU Berlin
voncoler@tu-berlin.de
ABSTRACT
This paper presents a real-time additive sound synthesis appli-
cation with individual outputs for each partial and noise component.
The synthesizer is programmed in C++, relying on the JACK API for audio connectivity and an OSC interface for control input. These
features allow the individual spatialization of the partials and noise,
referred to as spectro-spatial synthesis, in connection with an OSC
capable spatial rendering software. Additive synthesis is performed
in the time domain, using previously extracted partial trajectories
from instrument recordings. Noise is synthesized using bark band
energy trajectories. The sinusoidal data set for the synthesis is gen-
erated from a custom violin sample library in advance. Spatialization
is realized using established rendering software implementations on
a dedicated server. Pure Data is used for processing control streams from an expressive musical interface and distributing them to the synthesizer and renderer.
1. INTRODUCTION
1.1. Sinusoidal Modeling
Additive synthesis is among the oldest digital sound creation meth-
ods and has been the foundation of early experiments by Max Math-
ews at Bell Labs. It allows the generation of sounds rich in timbre,
by superimposing single sinusoidal components, referred to as par-
tials, either in the time- or frequency domain. Based on the Fourier
Principle, any quasi-periodic signal y(t) can be expressed as a sum of N_part sinusoids with varying amplitudes a_n(t) and frequencies ω_n(t) and an individual phase offset φ_n:

$$y(t) = \sum_{n=1}^{N_\mathrm{part}} a_n(t)\,\sin\big(\omega_n(t)\,t + \varphi_n\big) \qquad (1)$$
In harmonic cases, which apply to the majority of musical instrument sounds, the partial frequencies can be approximated as integer multiples of f_0:

$$y(t) = \sum_{n=1}^{N_\mathrm{part}} a_n(t)\,\sin\big(2\pi\, n\, f_0(t)\, t + \varphi_n\big) \qquad (2)$$
Although relative phase fluctuations are relevant for perception [1], the original phase can be ignored in many cases, which is of benefit for manipulations of the modeled sound:

$$y(t) = \sum_{n=1}^{N_\mathrm{part}} a_n(t)\,\sin\big(2\pi\, n\, f_0(t)\, t\big) \qquad (3)$$
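Equation (3) translates directly into a per-sample synthesis loop. The following minimal C++ sketch illustrates this form of time-domain additive synthesis for a single block with per-block constant partial amplitudes; all names and the data layout are illustrative and do not reflect the actual synthesizer code.

    // Minimal sketch of Eq. (3): time-domain additive synthesis of one block,
    // assuming per-block constant partial amplitudes a[n] and a fixed f0.
    #include <cmath>
    #include <vector>

    void synthesize_block(const std::vector<double>& a,   // partial amplitudes a_n
                          double f0,                      // fundamental frequency in Hz
                          double fs,                      // sampling rate in Hz
                          std::vector<double>& phase,     // running phase per partial
                          std::vector<float>& out)        // output buffer
    {
        const double twoPi = 2.0 * M_PI;
        for (float& sample : out) {
            double y = 0.0;
            for (std::size_t n = 0; n < a.size(); ++n) {
                y += a[n] * std::sin(phase[n]);
                phase[n] += twoPi * (n + 1) * f0 / fs;   // integer multiples of f0
                if (phase[n] > twoPi) phase[n] -= twoPi; // keep phase bounded
            }
            sample = static_cast<float>(y);
        }
    }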
Based on this theory, an algorithm for speech synthesis was proposed by McAulay and Quatieri [2]. For musical sound synthesis, a noise component was added to the algorithm [3], resulting in the sinusoids+noise model. The signal is then modeled as the sum of the deterministic part x_det and the stochastic part x_stoch, also referred to as the residual:

$$x = x_\mathrm{det} + x_\mathrm{stoch} \qquad (4)$$
Modeling of residuals can, for example, be performed by approximating the spectral envelope using linear predictive coding [3] or a filter bank based on Bark frequencies [4]. The phase of the stochastic signal is, in theory, random and thus need not be modeled. However, residuals are usually not completely random, since they still contain information from the removed harmonic content.
In order to fully model the sounds of arbitrary musical instruments, a transient component x_trans is included [4] in the full signal model. This component captures plucking sounds and other percussive elements:

$$x = x_\mathrm{det} + x_\mathrm{stoch} + x_\mathrm{trans} \qquad (5)$$
Since the work presented in this paper focuses on the violin in
legato techniques, the transient component can be neglected without
impairing the perceived quality of a re-synthesis.
1.2. Spectral Spatialization
In electronic and electroacoustic music, the term spectral spatialization refers to the individual treatment of a sound's frequency components for distribution on sound reproduction systems [5]. Timbral sound qualities can thus be linked to the spatial image of the sound, even for pre-existing or fixed sound material. In the case of spectro-spatial synthesis, this process is integrated at the synthesis level, for example in additive approaches. This is not yet a common feature
in available synthesizers, but several research projects have been in-
vestigating the possibilities of such approaches with applications in
musical sound processing, sound design, virtual acoustics and psy-
choacoustics.
Topper et al. [6] apply additive synthesis of basic waveforms
(square wave, sawtooth), physical modeling and sub-band decompo-
sition in a multichannel panning system with real-time, prerecorded and graphical control. Their system is implemented in Max/MSP and RTcmix, running on both Mac and PC/Linux hardware with a total of 8 audio channels.
Verron et al. [7] use the sinusoids + noise model for spectral
spatialization of environmental sounds. Each component can be syn-
thesized with individual position in space on Ambisonics and Binau-
ral systems. Deterministic and stochastic components are composed
and added together in the frequency domain and subsequently spa-
tially encoded with a filterbank. Control over the synthesis process depends on the nature of the environmental sounds [8].
In the context of electroacoustic music, James [9] expands Dennis Smalley's concept of spectromorphology to the idea of spatiomorphology. Timbre spatialization is achieved using terrain surfaces
Figure 1: Partial amplitude trajectories of a violin sound (amplitudes a_i on a logarithmic scale over frame index, for partial indices up to 30).
Figure 2: Partial frequency trajectories of a violin sound (frequencies in Hz over frame index, for partial indices up to 30).
and by mapping these to spatio-spectral distributions. Max/MSP is used for computing the contribution of spectral content to individual speakers with distance-based amplitude panning (DBAP) and Ambisonic equivalent panning (AEP) methods.
Spectral spatialization can also be used to synthesize dynamic
directivity patterns of musical instruments in virtual acoustic envi-
ronments. Since the directivity in combination with movement has
a significant influence on an instrument’s sound, this can increase
the plausibility. Warusfel et al. [10] use a tower with three cubes,
each containing multiple speakers, to spatialize frequency bands of
an input signal for the simulation of radiation patterns.
1.3. The Presented Application
The presented application incorporates different synthesis modes, of which only the so-called deterministic mode is the subject of this paper. In this basic mode, precalculated parameter trajectories, as presented in Sec. 2, are used for a manipulable resynthesis of the original instrument sounds.
The software architecture is designed to allow the use of additive synthesis, or sinusoidal modeling in general, on sound field synthesis systems or other reproduction setups. This is achieved by providing individual outputs for all partials and noise bands in an application implemented as a JACK client, described in Sec. 3. Using JACK allows the connection of all individual synthesizer output
Figure 3: Unwrapped partial phases of a violin sound (phase φ over frame index, for partial indices up to 30).
Figure 4: Bark band energy trajectories of a violin sound (RMS on a logarithmic scale over frame index, for the Bark bands).
channels to a JACK-capable renderer, such as the SoundScape Renderer (SSR) [11], Panoramix [12] or the HOA Library [13]. By making each partial a single virtual sound source in combination with such rendering software, the spatial distribution of the synthesis can be modulated in real time. Pure Data [14] is used to receive control data from gestural interfaces or to play back predefined trajectories, generating control streams for both the synthesizer and the spatialization renderer. A direct linkage between timbre and spatialization is thus created, which is considered essential for meaningful spectro-spatial synthesis.
2. ANALYSIS
The TU-Note Violin Sample Library [15], [16] is used as audio content for generating the sinusoidal model. Designed in the style of classic sample libraries, this data set contains single sounds of a violin in different pitches and intensities, recorded at an audio sampling rate of 96 kHz with 24 bit resolution.
Analysis and modeling are performed beforehand in Matlab, using monophonic pitch tracking and a subsequent extraction of the partial trajectories by peak picking in the spectrogram. YIN [17] and SWIPE [18] are used as monophonic pitch tracking algorithms. Based on the f0-trajectories, partial tracking is performed with an STFT, applying a hop size of 256 samples (2.7 ms) and a window size of
Figure 5: Sequence diagram for the JACK callback function, showing the interaction of the classes JackClient, VoiceManager, SingleVoice, Sinusoid and ResidualSynth through the calls update_voices(), set_parameters(), cycle_start_deterministic(), set_interpolator(), getNextFrame_TD(*OUTBUFF), getNextBlock_TD(pos), getNextSample(), get_deterministic_value(), new_random() and get_next_value(), with loops over voices (nVoices), partials (nPart), noise bands (nBands) and buffer samples (nBuff).
4096 samples, zero-padded to 8192 samples. Quadratic interpolation (QIFFT), as presented by Smith and Serra [19], is applied for peak parameter estimation of up to 80 partials. Due to the sampling frequency, the full number of partials is only analyzable up to the note D5 (576.65 Hz).
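The parabolic interpolation step can be summarized as follows. The sketch assumes a log-magnitude spectrum and follows the standard QIFFT formulation [19]; the names are illustrative only, and the actual analysis is implemented in Matlab.

    // Sketch of quadratic (QIFFT) peak interpolation: fit a parabola through
    // the log-magnitude of a spectral peak bin and its two neighbours.
    #include <cmath>

    struct Peak { double freqHz; double magDb; };

    Peak interpolate_peak(const double* magDb, // log-magnitude spectrum in dB
                          int k,               // index of the local maximum bin
                          double fs,           // sampling rate in Hz
                          int nFFT)            // FFT size after zero-padding (e.g. 8192)
    {
        const double alpha = magDb[k - 1];
        const double beta  = magDb[k];
        const double gamma = magDb[k + 1];
        // fractional bin offset of the parabola's vertex, in (-0.5, 0.5)
        const double p = 0.5 * (alpha - gamma) / (alpha - 2.0 * beta + gamma);
        Peak peak;
        peak.freqHz = (k + p) * fs / nFFT;               // interpolated frequency
        peak.magDb  = beta - 0.25 * (alpha - gamma) * p; // interpolated magnitude
        return peak;
    }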
By subtracting the deterministic part from the complete sound in the time domain, the residual signal is obtained. The residual is then filtered using a Bark scale filter bank with second-order Chebyshev bandpasses, and the temporal energy trajectories are calculated for the resulting 24 band-limited signals. At this point, a large amount of information is removed from the residual signal. Due to the shortcomings of the time-domain subtraction method, the residual still contains information from the deterministic component. By averaging the energy over the Bark bands, this relation is eliminated.
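As an illustration of this step, the following C++ sketch computes frame-wise RMS energies from already band-filtered residual signals; frame length, hop size and data layout are assumptions, and the actual analysis is implemented in Matlab.

    // Frame-wise RMS energy per Bark band (band filtering itself omitted).
    #include <cmath>
    #include <vector>

    // bands[b] holds the residual filtered through Bark band b.
    std::vector<std::vector<double>>
    band_energies(const std::vector<std::vector<double>>& bands,
                  std::size_t frameLen, std::size_t hop)
    {
        std::vector<std::vector<double>> rms(bands.size());
        for (std::size_t b = 0; b < bands.size(); ++b) {
            const auto& x = bands[b];
            for (std::size_t start = 0; start + frameLen <= x.size(); start += hop) {
                double e = 0.0;
                for (std::size_t i = 0; i < frameLen; ++i)
                    e += x[start + i] * x[start + i];
                rms[b].push_back(std::sqrt(e / frameLen)); // RMS of this frame
            }
        }
        return rms;
    }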
The results of the analysis stage are the trajectories of the partial amplitudes, as shown in Figure 1, the trajectories of the partial frequencies and phases, as shown in Figures 2 and 3, and the trajectories of the Bark band energies, illustrated in Figure 4. The resulting data is exported to individual YAML files for each sound, which can be read by the synthesis system.
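A minimal sketch of how such a YAML file could be read with yaml-cpp is given below; the key names and file layout are hypothetical and only illustrate the principle.

    // Hypothetical sketch of loading one modeled sound with yaml-cpp; the
    // actual file layout and key names of the exported YAML files may differ.
    #include <yaml-cpp/yaml.h>
    #include <string>
    #include <vector>

    struct PartialTrack {
        std::vector<double> amplitude;  // a_n(t), one value per analysis frame
        std::vector<double> frequency;  // f_n(t) in Hz
    };

    std::vector<PartialTrack> load_partials(const std::string& path)
    {
        YAML::Node root = YAML::LoadFile(path);
        std::vector<PartialTrack> tracks;
        for (const auto& p : root["partials"]) {   // "partials" is an assumed key
            PartialTrack t;
            t.amplitude = p["amplitudes"].as<std::vector<double>>();
            t.frequency = p["frequencies"].as<std::vector<double>>();
            tracks.push_back(std::move(t));
        }
        return tracks;
    }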
3. SYNTHESIS SYSTEM
3.1. Libraries
The synthesis application is designed as standalone Linux command line software. The main functionality of the synthesis system relies on the JACK API (http://jackaudio.org/) for audio connectivity and on liblo (https://github.com/radarsat1/liblo), respectively its C++ wrapper, for receiving control signals. libyaml-cpp (https://github.com/jbeder/yaml-cpp/) is used for reading the data of the modeled sounds and the relevant configuration files. libsndfile (http://www.mega-nerd.com/libsndfile/), for reading the original sound files, as well as libfftw (http://www.fftw.org/) are included but not relevant for the aspects presented in this paper. Frequency domain synthesis and sample playback are partially implemented but not used at this point.
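A minimal sketch of a JACK client exposing one output port per partial and per noise band, as used in this architecture, could look as follows; apart from the port counts taken from the paper, all names and details are illustrative.

    // Sketch: JACK client with 80 partial outputs and 24 noise band outputs.
    #include <jack/jack.h>
    #include <unistd.h>
    #include <string>
    #include <vector>

    static std::vector<jack_port_t*> g_ports;

    // Process callback: here the buffers are only cleared; the real client
    // renders one partial or noise band into each port.
    static int process(jack_nframes_t nframes, void*)
    {
        for (jack_port_t* p : g_ports) {
            auto* buf = static_cast<jack_default_audio_sample_t*>(
                jack_port_get_buffer(p, nframes));
            for (jack_nframes_t i = 0; i < nframes; ++i)
                buf[i] = 0.0f;
        }
        return 0;
    }

    int main()
    {
        jack_client_t* client = jack_client_open("spectro_spatial_synth",
                                                 JackNullOption, nullptr);
        if (!client) return 1;

        for (int i = 0; i < 80; ++i)   // one output per partial
            g_ports.push_back(jack_port_register(client,
                ("partial_" + std::to_string(i)).c_str(),
                JACK_DEFAULT_AUDIO_TYPE, JackPortIsOutput, 0));
        for (int i = 0; i < 24; ++i)   // one output per Bark band
            g_ports.push_back(jack_port_register(client,
                ("noise_" + std::to_string(i)).c_str(),
                JACK_DEFAULT_AUDIO_TYPE, JackPortIsOutput, 0));

        jack_set_process_callback(client, process, nullptr);
        jack_activate(client);
        for (;;) sleep(1);   // keep the client running
    }

The individual ports can then be connected to any JACK-capable renderer, as described in Sec. 1.3.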
3.2. Algorithm
Both the sinusoidal and the noise component are synthesized in the time domain, using a non-overlapping method. For the sinusoidal component, either the built-in sin() function of the cmath library or a custom lookup table can be selected. The choice does not affect the overall performance significantly. The filter bank for the noise synthesis consists of 24 second-order Chebyshev bandpass filters with fixed coefficients, calculated before runtime. The amplitude of each frequency band is driven by the previously analyzed energy trajectories.
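A possible realization of such a lookup table, using linear interpolation between table entries, is sketched below; table size and interpolation scheme are assumptions and do not necessarily match the implementation.

    // Sine lookup table with linear interpolation as an alternative to std::sin().
    #include <array>
    #include <cmath>
    #include <cstddef>

    class SineTable {
    public:
        static constexpr std::size_t N = 4096;   // table resolution (assumed)

        SineTable() {
            for (std::size_t i = 0; i < N; ++i)
                table_[i] = std::sin(2.0 * M_PI * i / N);
            table_[N] = table_[0];               // guard point for interpolation
        }

        // Look up sin(phase), with phase given in radians in [0, 2*pi).
        double operator()(double phase) const {
            const double pos    = phase * (N / (2.0 * M_PI));
            const std::size_t i = static_cast<std::size_t>(pos);
            const double frac   = pos - i;
            return table_[i] + frac * (table_[i + 1] - table_[i]);
        }

    private:
        std::array<double, N + 1> table_{};
    };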
During synthesis, the algorithm reads a new set of support points
from the model data for each audio buffer and increments the posi-
tion within the played note. Figure 5 shows a sequence diagram
for the deterministic synthesis algorithm, starting at the JACK call-
back function, which is executed for each buffer of the JACK audio
server. Since the synth is designed to enable polyphonic play, the
voice manager object handles incoming OSC messages in the func-
tion update_voices() to activate or deactivate single voices.
Figure 6: Combination of synthesizer and renderer on separate machines, using Pure Data for synth configuration and parameter parsing: control input (MIDI, OSC, ...) from the performance system is processed in Pure Data and distributed via OSC to the synth and the spatial renderer; multichannel audio (one channel per partial) is passed from the synth to the spatial renderer, whose output (one channel per speaker) feeds the audio output.
For the synthesis of mostly monophonic, excitation-continuous instruments like the violin, the polyphony merely handles the overlapping of released notes. Subsequently, the voice manager loops over
all active voices in the function getNextFrame_TD(), first set-
ting the new control parameters for each voice.
In cycle_start_deterministic(), support points for all partials' parameters are picked at the relevant voice's playback
position. These support points are then linearly interpolated over the
buffer length in set_interpolator().
Finally, in getNextBlock_TD(), each single voice gener-
ates the output for all sinusoids and all noise bands in two separate
vectorizable loops, adding both to the output buffer.
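The linear interpolation of support points over the buffer can be sketched as follows; the data layout and names are assumptions and differ from the actual member functions.

    // Prepare linear ramps from the previous to the new support points; one
    // ramp per partial parameter, advanced once per sample inside the loop.
    #include <cstddef>
    #include <vector>

    struct Ramp { double value; double increment; };

    std::vector<Ramp> set_ramps(const std::vector<double>& previous,
                                const std::vector<double>& next,
                                std::size_t bufferSize)
    {
        std::vector<Ramp> ramps(previous.size());
        for (std::size_t n = 0; n < previous.size(); ++n) {
            ramps[n].value     = previous[n];
            ramps[n].increment = (next[n] - previous[n]) / bufferSize;
        }
        return ramps;
    }

    // Inside the sample loop, each partial advances its ramp once per sample:
    //     a = ramps[n].value;
    //     ramps[n].value += ramps[n].increment;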
3.3. Runtime Environment and Periphery
The runtime system for the synthesis starts a JACK server with a 48 kHz sampling rate, a buffer size of 128 samples and 2 periods per buffer. This results in 5.3 ms latency for the audio playback, which is within the limits for this synthesis approach. On an Intel(R) Core(TM) i7-5500U CPU @ 2.40 GHz with disabled speed-stepping and a Fireface UFX, the JACK server shows an average load of approximately 20 %.
The interaction of the involved software components is visual-
ized in Figure 6. For reasons of performance and increased flexibility
in the studio, two separate machines are used for synthesis and spa-
tialization. Connectivity between the systems is realized with MADI
or DANTE, using individual channels for the 80 partials and 24 noise
bands.
3.4. Control
Figure 7: Spatialization scene in a 2D setup with 30 partials and their positions in the x–y plane, distributed around the direction φ with spread S.
The control data for the partial positions in the rendering software is not generated in the synthesis system at this point and is managed externally. This offers more flexibility for testing different mappings at this stage of development. A Pure Data patch is used to receive incoming control messages, either from OSC or MIDI, and distribute them to the synthesizer and the spatialization software. For live performance, the patch receives continuous control streams for pitch and intensity from an improved version of the interface presented by von Coler et al. [20] and visualizes the sensor data. Pitch and intensity are forwarded directly to the synth. Additionally, data from several force sensitive resistors (FSRs) and a nine-degrees-of-freedom IMU, which can be used for controlling the spatialization, is sent to the patch.
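For testing without the Pure Data patch, comparable control messages can be sent with liblo; the ports and OSC address patterns in the following sketch are hypothetical and only illustrate the kind of streams that are distributed to synth and renderer.

    // Hypothetical OSC test client using liblo; ports and paths are assumptions.
    #include <lo/lo.h>

    int main()
    {
        lo_address synth    = lo_address_new("localhost", "5510");  // assumed port
        lo_address renderer = lo_address_new("localhost", "4711");  // assumed port

        float pitch = 440.0f, intensity = 0.8f, azimuth = 0.25f;

        lo_send(synth,    "/synth/pitch",      "f", pitch);       // assumed path
        lo_send(synth,    "/synth/intensity",  "f", intensity);   // assumed path
        lo_send(renderer, "/source/1/azimuth", "f", azimuth);     // assumed path

        lo_address_free(synth);
        lo_address_free(renderer);
        return 0;
    }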
Figure 7 shows an example of a simple spatialization mapping on a 2D system. The absolute orientation of the IMU is used to control the general direction φ of the partial flock. A second parameter S, derived from the intensity and additional sensor data, controls the spread of the partials around this angle, depending on the partial index.
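One possible realization of this mapping is sketched below; the exact mapping function is not specified here, so the index-dependent offset is an assumption.

    // Place partial n at an azimuth around the flock direction phi, with an
    // offset growing with the partial index and scaled by the spread S.
    #include <cstddef>
    #include <vector>

    std::vector<double> partial_azimuths(double phi, double S, std::size_t nPartials)
    {
        std::vector<double> azimuth(nPartials);
        for (std::size_t n = 0; n < nPartials; ++n) {
            // normalized index in [-0.5, 0.5]; assumes at least two partials
            const double offset = static_cast<double>(n) / (nPartials - 1) - 0.5;
            azimuth[n] = phi + S * offset;
        }
        return azimuth;
    }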
4. CONCLUSION
After significantly improving the performance of the synthesis sys-
tem, the application can now be used with the full 80 partials and
24 Bark bands as individual outputs. Recent tests in combination with different spatial rendering software and different loudspeaker setups show promising results. However, the dynamic spatialization of such a number of virtual sound sources and the resulting traffic of OSC messages is demanding for the runtime system. Using separate machines for synthesis and rendering reduces the individual load.
The number of rendering inputs can also be reduced without limit-
ing the perceived quality of the spatialization. Multiple partials may
share one virtual sound source.
Proceedings of the 17th Linux Audio Conference (LAC-19), CCRMA, Stanford University, USA, March 23–26, 2019
The next steps include the empirical investigation of mappings from controller sensors to both the spectral and the spatial sound properties. This involves user experiments to evaluate different mapping and control paradigms, as well as perceptual measurements of the synthesis results.
5. ACKNOWLEDGMENTS
Thanks to Benjamin Wiemann for contributions to the project in its early stage and to Robin Gareus for his help in restructuring the code and thereby improving the performance.
6. REFERENCES
[1] T. H. Andersen and K. Jensen, “Importance and Representa-
tion of Phase in the Sinusoidal Model”, J. Audio Eng. Soc,
vol. 52, no. 11, pp. 1157–1169, 2004.
[2] R. McAulay and T. Quatieri, “Speech Analysis/Synthesis Based on a Sinusoidal Representation”, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.
[3] X. Serra and J. Smith, “Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic Plus Stochastic Decomposition”, Computer Music Journal, vol. 14, no. 4, pp. 12–14, 1990.
[4] S. N. Levine and J. O. Smith, “A Sines+Transients+Noise Audio Representation for Data Compression and Time/Pitch Scale Modifications”, in Proceedings of the 105th Audio Engineering Society Convention, San Francisco, CA, 1998.
[5] D. Kim-Boyle, “Spectral spatialization - an Overview”, in
Proceedings of the International Computer Music Conference,
Belfast, UK, 2008.
[6] D. Topper, M. Burtner, and S. Serafin, “Spatio-Operational Spectral (SOS) Synthesis”, in Proceedings of the International Computer Music Conference (ICMC), Singapore, 2003.
[7] C. Verron, M. Aramaki, R. Kronland-Martinet, and G. Pal-
lone, “Spatialized additive synthesis of environmental sounds”,
in Audio Engineering Society Convention 125, Audio Engi-
neering Society, 2008.
[8] C. Verron, G. Pallone, M. Aramaki, and R. Kronland-Martinet,
“Controlling a spatialized environmental sound synthesizer”,
in 2009 IEEE Workshop on Applications of Signal Processing
to Audio and Acoustics, IEEE, 2009, pp. 321–324.
[9] S. James, “Spectromorphology and Spatiomorphology of Sound Shapes: Audio-Rate AEP and DBAP Panning of Spectra”, in Proceedings of the International Computer Music Conference (ICMC), 2015.
[10] O. Warusfel and N. Misdariis, “Directivity synthesis with a 3d
array of loudspeakers: Application for stage performance”, in
Proceedings of the COST G-6 Conference on Digital Audio
Effects (DAFX-01), Limerick, Ireland, 2001, pp. 1–5.
[11] J. Ahrens, M. Geier, and S. Spors, “The SoundScape Ren-
derer: A unified spatial audio reproduction framework for ar-
bitrary rendering methods”, in Audio Engineering Society Con-
vention 124, Audio Engineering Society, 2008.
[12] T. Carpentier, “Panoramix: 3d mixing and post-production
workstation”, in Proceedings of the International Computer
Music Conference (ICMC), 2016.
[13] A. Sèdes, P. Guillot, and E. Paris, “The HOA Library, Review and Prospects”, in Proceedings of the Joint International Computer Music Conference / Sound and Music Computing Conference, 2014, pp. 855–860.
[14] M. S. Puckette, “Pure Data”, in Proceedings of the Interna-
tional Computer Music Conference (ICMC), Thessaloniki,
Greece, 1997.
[15] H. von Coler, J. Margraf, and P. Schuladen, TU-Note Violin Sample Library, TU Berlin, 2018. DOI: 10.14279/depositonce-6747.
[16] H. von Coler, “TU-Note Violin Sample Library – A Database
of Violin Sounds with Segmentation Ground Truth”, in Pro-
ceedings of the 21st Int. Conference on Digital Audio Effects
(DAFx-18), Aveiro, Portugal, 2018.
[17] A. de Cheveigné and H. Kawahara, “YIN, a Fundamental
Frequency Estimator for Speech and Music”, The Journal of
the Acoustical Society of America, vol. 111, no. 4, pp. 1917–
1930, 2002.
[18] A. Camacho, “SWIPE: A Sawtooth Waveform Inspired Pitch Estimator for Speech and Music”, PhD thesis, University of Florida, Gainesville, FL, USA, 2007.
[19] J. O. Smith and X. Serra, “PARSHL: An Analysis/Synthesis
Program for Non-Harmonic Sounds Based on a Sinusoidal
Representation”, Center for Computer Research in Music and
Acoustics (CCRMA), Stanford University, Tech. Rep., 2005.
[20] H. von Coler, G. Treindl, H. Egermann, and S. Weinzierl,
“Development and Evaluation of an Interface with Four-Finger
Pitch Selection”, in Audio Engineering Society Convention
142, Audio Engineering Society, 2017.