Conference PaperPDF Available

Gesture-Timbre Space: Multidimensional Feature Mapping Using Machine Learning & Concatenative Synthesis


Abstract and Figures

This paper presents a method for mapping embodied gesture , acquired with electromyography and motion sensing, to a corpus of small sound units, organised by derived timbral features using concatenative synthesis. Gestures and sounds can be associated directly using individual units and static poses, or by using a sound tracing method that leverages our intuitive associations between sound and embodied movement. We propose a method for augmenting corporal density to enable expressive variation on the original gesture-timbre space.
Content may be subject to copyright.
Gesture-Timbre Space: Multidimensional
Feature Mapping Using Machine Learning &
Concatenative Synthesis
Michael Zbyszy´nski, Balandino Di Donato, and Atau Tanaka ?
Embodied Audiovisual Interaction Group
Goldsmiths, University of London
New Cross, London, SE14 6NW, UK,,
Abstract. This paper presents a method for mapping embodied ges-
ture, acquired with electromyography and motion sensing, to a corpus of
small sound units, organised by derived timbral features using concate-
native synthesis. Gestures and sounds can be associated directly using
individual units and static poses, or by using a sound tracing method
that leverages our intuitive associations between sound and embodied
movement. We propose a method for augmenting corporal density to
enable expressive variation on the original gesture-timbre space.
1 Introduction
Corpus-based concatenative synthesis (CBCS) is a compelling means to create
new sonic timbres based on navigating a timbral feature space. In its use of
atomic source units that are analysed, CBCS is an extension of granular syn-
thesis that harnesses the power of music information retrieval and the timbral
descriptors it generates. The actual sound to be played is specified by a target
and features associated with that target. In speech synthesis, the target is text.
In audio resynthesis and “mosaicing” applications, the target can be another
sound. In digital musical instrument (DMI) performance, the target may be sen-
sor data or some representation of performer action, or gesture. The target may
be of the same or dierent modality than the corpus, and it may have the same
or dierent feature dimensionality.
CBCS performance systems until now have, on the whole, been implemented
using dimensionality reduction. A subset of corporal features are projected into a
low dimension space, typically Cartesian, and performance input is constrained
to these dimensions. The dimensionality reduction acts as funnel that does not
provide access to the complete feature space of the corpus and may forsake the
richness of performance input.
?The research leading to these results has received funding from the European Re-
search Council (ERC) under the European Unions Horizon 2020 research and inno-
vation programme (Grant agreement No. 789825)
2 Michael Zbyszy´nski, Balandino Di Donato, and Atau Tanaka
Fig. 1. A performer, wearing a sensor armband and engaging in a sound tracing study.
Our work builds upon a previous sound tracing study, where participants de-
signed gestures to articulate time-varying sound using simple granular synthesis
(Fig. 1). While sound tracing usually studies evoked gestural response [3], we
extended traditional sound tracing to enable gesture-sound reproduction, and
trained machine learning models to enable exploratory gestural performance by
articulating expressive variations on the original sound. Participants were inter-
ested in exploring new timbres, but were limited by the provided corpus. The re-
gression algorithm allowed them to scrub to dierent granular parameters in the
stimulus sound, but not to a broader heterogeneous corpus. The study pointed
out the potential for a more robust synthesis outlet for the gesture input regres-
sion model. Could we provide a corpus with sounds not in the original sound
tracing stimulus? Could a regression model be harnessed to carry out feature
mapping from the input domain (gesture) to the output domain (sound)?
We propose a system for exploring and performing with a multidimensional
audio space using multimodal gesture sensing as the input. The input takes
features extracted from electromyographic (EMG) and inertial sensors, and uses
machine learning through regression modelling to create a contiguous gesture
and motion space. EMG sensors on the forearm have demonstrated potential for
expressive, multidimensional musical control, capturing small voltage variations
associated with motions of the hand and fingers. The output target is generated
via CBCS[11, 13], a technique which creates longer sounds by combining shorter
sounds, called “units.” A corpus of sounds is segmented into units which are
catalogued by auditory features. Units can be recalled by query with a vector
of those features. Our system allows musicians to quickly create an association
between points and trajectories in a gesture feature space and units in a timbral
feature space. The spaces can be explored and augmented together, interactively.
The paper is structured as follows. We first review related work on concatena-
tive synthesis in performance. We then describe the proposed system, its archi-
Gesture-Timbre Space 3
tecture and technical implementation. Section 3 presents sound design strategies
that address questions of corporal density to enable expressive performance, and
the user workflow to associate gesture and CBCS sound via regression. In the
discussion we provide a critical assessment of this approach and point out per-
spectives for future work before concluding.
2 Related Work
Aucouturier and Pachet [1] used concatenative sound synthesis to generate new
musical pieces by recomposing segments of pre-existent pieces. They developed a
constraint-satisfaction algorithm for controlling high-level properties like energy
or continuity of the new track. They presented an example where a musician con-
trols the system via MIDI, demonstrating an audio engine suitable for building
real-time, interactive audio systems.
Stowell and Plumbley’s [15] work focused on building associations between
two dierently distributed, unlabelled sets of timbre data. They succeeded in the
implementation of a regression technique which learns the relations between a
corpus of audio grains and input control data. In evaluating their system, they
observed that such an approach provides a robust way of building trajectories
between grains, and mapping these trajectories to input control parameters.
Schwarz et al. [14] used CataRT controlled through a 2D GUI in live perfor-
mance taking five dierent approaches: (i) re-arranging the corpus in a dierent
order than the original one, (ii) interaction with self-recorded live sound, (iii)
composing by navigation of the corpus, (iv) cross-selection and interpolation
between sound corpora, and (v) corpus-based orchestration by descriptor or-
ganisation. After performing in these five modes they concluded that CataRT
empowers musicians to produce rich and complex sounds while maintaining pre-
cision in the gestural control of synthesis parameters. It presents itself as a blank
canvas, without imposing upon the composer/performer any precise sonority.
In a later work [12], Schwarz et al. extended interaction modes and controllers
(2D or 3D positional control, audio input) and concluded by stating the need
for machine learning approaches in order to allow the user to explore a corpus
by the use of XYZ-type input devices. They present gestural control of CataRT
as an expressive and playful tool for improvised performance.
Savary et al. [9,10] created Dirty Tangible Interfaces, a typology of user inter-
faces that favour the production of very rich and complex sounds using CataRT.
Interfaces can be constantly evolved, irreversibly, by dierent performers at the
same time. The interface is composed of a black box containing a camera and
LED to illuminate a glass positioned above the camera, where users can posi-
tion solid and liquid materials. Material topologies are detected by the camera,
where a grey scale gradient is then converted into a depth map. This map is the
projected onto a 3D reduction of the corpus space to trigger dierent grains.
Another example of gestural control of concatenative synthesis is the artistic
project Luna Park by G. Beller [2]. He uses one accelerometer on top of each
hand to estimate momentum variation, hit energy, and absolute position of the
4 Michael Zbyszy´nski, Balandino Di Donato, and Atau Tanaka
hands. Two piezoelectric microphones responded to percussive patterns played
in dierent zones of his body (one near the left hip and the other one near the
right shoulder). Sensor data were then mapped to audio engine parameters to
synthesise and interact with another performers recorded speech.
3 Methodology
Neural Network Regression Model
Time varying
sound design
(anchor points)
Sensor Data
Feature Extraction
Audio Feature
Sound Tracing
Fig. 2. System architecture diagram.
3.1 Implementation
We use a commercial sensor device1, an armband worn on the forearm which
packages eight electromyography (EMG) muscle sensors and an inertial mea-
surement unit (IMU) for gross movement and orientation sensing, and transmits
them over Bluetooth to a computer. We have also verified our approach with
other biosensor packages, such as Plux’s BITalino2.
The software system is implemented in Cycling ’74’s Max3.Weusethemyo4
object to capture raw EMG output of the sensors along with orientation quater-
nions from the on-board IMU to generate a multimodal feature vector repre-
senting the orientation, motion, and muscular state of the performer’s forearm.
Quaternions (x,y,z, and w: calculated by the device from accelerometer, gy-
roscope, and magnetometer data) give orientation, and we take the first-order
dierence between the current quaternion frame and the previous frame (xd,yd,
Gesture-Timbre Space 5
zd,wd) to represent the current motion of the forearm. This is an important
feature because hand gestures can be performed ballistically or in a more static
fashion, causing dierent patterns of muscular activation even though the results
are visually similar.
Raw EMG signals are intrinsically noisy, and we do not include them in our
feature vector. Instead, we use a Bayesian[8] filter to probabilistically predict
the amplitude envelope for each electrode in the armband. The sum of all eight
amplitude envelopes is also included in the input feature vector, along with a
new feature we have developed called “vector sum.” Vector sum (Fig. 3) is a rep-
resentation of the fact that the forearm muscles are situated around the arm in
such a way that they can oppose or reinforce the action of other muscles. To cal-
culate the vector sum, we model each electrode as representing a vector pointing
away from the centre of a circle, evenly spaced every 45 degrees. The direction
for each electrode vector does not change and the magnitude is proportional to
the amplitude calculated by the Bayesian filter. The eight vectors are summed,
and the resulting vector is related to the overall direction of force represented by
all of the electrodes. When compared to the sum of all electrodes, the vector sum
can distinguish gestures where muscles are opposing one another isometrically.
This is an important feature, since joint movement might be minimal in such
gestures but the subjective perception of eort is quite high. The vector sum is
reported as a pair of Cartesian coordinates, which are better suited to regres-
sion than polar coordinates because they do not wrap around at zero degrees.
See table 3.1 for a lists of the full gestural and timbral feature vectors. Where
relevant, we took the average (µ) and standard deviation () of each timbral
feature over the whole audio unit.
Fig. 3. An example vector sum, in black, drawn with the individual EMG vectors in
We implement machine learning in Max using an external object called
rapidmax[7]. This object implements basic machine learning algorithms, such
as multilayer perceptrons, k-nearest neighbour, and dynamic time warping, to
allow Max users to quickly employ machine learning for regression or classifica-
tion tasks. It is a Max wrapper around RapidLib [18], a C++ and JavaScript
6 Michael Zbyszy´nski, Balandino Di Donato, and Atau Tanaka
library for creative, interactive machine learning applications in the style of
Wekinator[4]. Specifically, we use a multilayer perceptron (MLP) neural net-
work with one hidden layer to create models that perform regression based on
user-provided training examples. This particular implementation uses a linear ac-
tivation function on the output layer, allowing for model outputs that go beyond
the numerical range of the provided examples, creating a larger and potentially
more interesting generative space for aesthetic exploration. Training examples
are created by associating inputs—gesture feature vectors—with outputs: vec-
tors of timbral features. Performers can record example interactions, associating
positions and gestures with sounds, to build an exploratory and performative
gesture-timbre space.
Table 1 . Input and output feature vectors for regression models
Input features(gesture) Output features (timbre)
yFrequency µ
wEnergy µ
ydPeriodicity µ
wdAC1 µ
EMG2Loudness µ
EMG4Centroid µ
EMG6Spread µ
EMG8Skewness µ
EMGsum Skewness
vectorSumxKurtosis µ
The audio engine is implemented using MuBu5CataRT Max objects. We
use the mubu.process object for segmentation and auditory feature analysis,
mubu.knn for retrieval of the closest matching unit to a given set of auditory
features, and the mubu.concat object for synthesising the unit once recalled in
our workflow (see Section 3.3).
When a sound file is imported into MuBu, it is automatically segmented into
units, either of a fixed length or determined by an onset detection algorithm
(Fig. 4). A vector of auditory features (enumerated in table 3.1) is derived for
each unit. These vectors of auditory features are associated with sensor feature
Gesture-Timbre Space 7
Fig. 4. A sound file imported into a MuBu buer, using the onseg algorithm. Unit
boundaries are shown as vertical, red lines. These lines would be equally spaced in
chop mode. This display can be opened from the main gui window.
vectors to train a neural network, and roughly represent a high-dimensional
timbral similarity space.
During playback, the amplitude and panning of the output is controlled by
the “Amplitude Panner” (Fig. 5, upper right panel). The EMG sensors are di-
vided into two groups and their amplitude envelopes are summed. The sum of
each group is used to control the overall amplitude of the audio output in the
left and right channels, respectively. When there is no muscular activation, both
channels have near zero gain, giving the performer a natural method to make
the instrument silent when they are not putting any energy into the system.
3.2 Sound & Gesture Design
In our previous study [16], we proposed four dierent approaches to designing
gesture-timbre interaction based on a sound tracing exercise. In this system we
revisit two of those approaches using our concatenative audio engine. Sound trac-
ing is an exercise where a sound is given as a stimulus to study evoked gestural
response [3]. Sound tracing has been used as a starting point for techniques of
“mapping-by-demonstration” [5].
For that exercise, we used a general purpose software synthesizer, SCP by
Manuel Poletti, controlled a breakpoint envelope-based playback system. We
chose to design sounds that transition between four fixed anchor points with
fixed synthesis parameters, primarily using SCP’s granular synthesis engine. En-
velopes interpolate between these fixed points. The temporal evolution of sound
is captured as dierent states in the breakpoint editor whose envelopes run dur-
ing playback. Any of the parameters can be assigned to breakpoint envelopes to
be controlled during playback.
Users were asked to design a gesture that matched a pre-designed sound,
and to train a regression model by associating data from that gesture with
8 Michael Zbyszy´nski, Balandino Di Donato, and Atau Tanaka
the parameters of the sound. This created an exploratory space for performing
variations on the source sound.
Our current work extends that activity with the use of CBCS. We go beyond
the regression-based control of parametric synthesis from the previous study to
create a mapping from gesture features to timbral features. This enables the
user to perform the sound’s corpus in real time, using variations on the original
sound tracing gesture to articulate new sounds.
With this technique, we encounter potential problems of sparsity of the cor-
pus feature space. There is no guarantee that there will be a unit that is closely
related to the timbral features generated by the neural network in response to
a given set of target gesture features. In order to address this, we added a step
in our sound design method to fill the corpus with sound related to the original
sound stimulus, but that had a wider range of timbral features. To do so, we went
back to the original SCP sound authoring patch and replaced the source sample
with a series of other sound samples. We then played the synthesis envelopes
to generate time-varying sounds that followed the pitch/amplitude/parametric
contour of the original stimulus. These timbral variants were recorded as sepa-
rate audio files, imported into CataRT, and analysed. In this way, the corpus was
enriched in a way that was directly related to the sound design of the original
stimulus but had a greater diversity of timbral features, creating potential for
more expressive variation in performance.
3.3 Workflow
Fig. 5. Graphical-User Interface.
Fig. 5 shows the controls to interact with our system. At the beginning of a
session, a performer activates the sensor armband using the toggle in the first
column. The feature vector derived from the EMG/IMU sensors is displayed in
Gesture-Timbre Space 9
the same panel, allowing the performer to verify that the sensors are working as
expected. They next select, in the CataRT column, the type of segmentation and
analysis that will be performed on sound files imported into a corpus, choosing
between onset-based segmentation and “chop,” which divides the sound into
equal-sized units. Parameters for these are available in a subpanel. The performer
then imports individual sound files or folders of sounds into the corpus. Importing
is cumulative; the whole corpus can be cleared with the clear button. It is also
possible to re-analyse the current corpus using new parameters.
Once the sensors are activated and a corpus has been imported, segmented,
and analysed, the gesture-timbre mapping process can begin. Performers may
listen to individual units by selecting the buer index (which file the unit is from)
and data index (which unit in a file). The analysed timbral features associated
with the selected unit are displayed by the multislider in the same column. In
order to associate gestural features with the selected timbral features, they press
the record button in the RapidMax column, which automatically captures 500ms
of sensor data associated with the selected unit data.
Another way of interacting with audio units is to play an entire file from
the first to the last unit. This enables the sound tracing workflow. Performers
can listen to the whole sound and design the appropriate gesture to accompany
that sound. Once that gesture has been designed, they can click on the playback
while recording data button and then perform their gesture synchronously with
sound playback. As each unit passes in order, the associated timbral data will be
recorded in conjunction with the gestural data at that moment. This corresponds
to the “whole gesture regression” mode from our previous work.
Once a set of potentially interesting example data has been recorded, users
train the neural network by clicking the train & run button. When training has
finished, the system enters run mode. At this point, incoming gestural data is sent
to the trained regression model. This model outputs a vector of target timbral
features that is sent to a k-nearest neighbours algorithm implemented in MuBu.
That model outputs the unit buer and data indices that most closely match
the targeted timbral features. mubu.concat receives the indices and plays the
requested unit. In this way, we have coupled the MLP model of gestural and
timbral features to a k-NN search of timbrally defined units the corpus. This
process allows musicians to explore the gesture-timbre space, and perform with
it in real time.
4 Discussion
In lieu of a formal evaluation, in this section we reflect on the strengths and
weaknesses of the system with a critical assessment of its aordances based on
testing by the authors.
Concatenative synthesis is described in terms of a target that one tries to
synthesise by navigating a corpus. In an audio-audio mosaicing task, the “tar-
get” is an example sound that one is trying to resynthesize with the corpus. In
cases using interfaces for live controllers [9,12], the mapping between gesture
10 Michael Zbyszy´nski, Balandino Di Donato, and Atau Tanaka
and sound has, until now, taken place in a reduced dimension space. Two or
three features are selected as pertinent and projected onto a graphical Cartesian
representation. Our system does not require dimensionality reduction, and the
number of input dimensions does not need to match the number output dimen-
sions. This creates the disadvantage, however, of not being able to visualise the
feature space. Schwarz in [12] finds seeing a reduced projection of the feature
space convenient, but prefers to perform without it.
Schwarz describes exploratory performance as a DMI application of CBCS
that distinguishes it from the more deterministic applications of speech synthesis
or audio mosaicing [12]. He provides the example of improvised music where the
performer uses an input device to explore a corpus, sometimes one that is being
filled during the performance by live sampling another instrumentalist. This
creates an element of surprise for the performer. Here we sought to create a
system that enables timbral exploration, but that would be reproducible, and
useful in compositional contexts where both sound and associated gesture can
be designed apriori. In using our system, we were able to perform the sound
tracing gesture to reconstruct the original sound. This shows that the generation
of time-varying sound sources from our parametric synthesis programme were
faithfully reproduced by CataRT in this playing mode.
Diculties arose when we created gesture variations where the regression
model started to “look for” units in the feature space that simply were not
there, raising the problem of corpus sparseness. The sound design strategy to
generate variants eectively filled the feature space. It was important that the
feature space was filled not just with any sound, but with sound relevant to the
original for which the gesture had been authored. By generating variants using
dierent sound samples but that followed the same broad sonic morphology, we
populate the corpus with units that were musically connected to the original but
that were timbrally (and in terms of features) distinct. This creates a kind of
hybrid between a homogeneous and heterogeneous corpus. It is heterogeneous
in the diversity of sound at the performer’s disposal, but remains musically
coherent and homogeneous with the original sound/gesture design. This allowed
expressive variation on a composed sound tracing gesture.
In order to support expressive performance we need to create gesture-timbre
spaces that maximise sonic diversity. When a performer navigates through ges-
ture space, the outcome is more rich and expressive if a diverse range of individual
units are activated. The nuances of gesture become sonically meaningful if that
gestural trajectory has a fine-grained sonic result. This is not always the result of
the proposed workflow, especially in the case where the user chooses individual
units and associates them with specific gestural input. It is, again, a problem of
sparseness; the distribution of units in a high-dimensional space is not usually
even. In a typical corpus, there will be areas with large clusters of units and
other units that are relatively isolated. When musicians choose individual units
to use for gesture mapping, there is a tendency to choose the units that have the
most character. These units are often outliers. In a good outcome, outlier units
represent the edges of the timbral space of the corpus. In this case, a regression
Gesture-Timbre Space 11
between units on dierent edges of the space will activate a wide range of in-
termediate units. However, it is also possible that a gestural path between two
interesting units does not pass near any other units in the corpus. In that case,
the resulting space performs more like a classifier, allowing the performer to play
one unit or another without any transitional material between them. It would
be helpful to give users an idea about where the potentially interesting parts of
a corpus lie. It might also be possible to automatically present performers with
units that represent extreme points of the timbral space, or areas where gesture
mapping might yield interesting results.
Future work to address varying sparseness and density of the corpus feature
space maybe be in dynamic focus on areas more likely to have sound. Schwarz
in [12] uses Delaunay triangulation to evenly redistribute the three dimensional
projection of the corpus in his tablet based performance interface. This operation
would be more dicult in a higher dimensional space. One recent development
in the CataRT community has been the exploration of using self-organising maps
to create a more even distribution of features in the data space [6].
Another potentially interesting way to use multidimensional gesture-timbre
mapping is to generate feature mapping using one corpus of sounds, and then
either augment that corpus or change to an entirely dierent corpus – moving
the timbral trajectory of a gesture space into a new set of sounds. This can be
fruitful when units in the new corpus intersect with the existing gesture space,
but it is dicult to give users an idea about whether or not that will be the
case. One idea we are exploring is “transposing” a trajectory in timbral feature
space so that it intersects with the highest number of units in a new corpus. This
could be accomplished using machine learning techniques, such as dynamic time
warping, to calculate the “cost” of dierent ways to match a specific trajectory to
a given set of units, and find the optimal transposition. We know from Wessel’s
seminal work on timbre spaces [17] that transposition in a low dimensional timbre
space is perceptually relevant. Automatically generating these transpositions, or
suggesting multiple possible transpositions, has the potential to generate novel
musical phrases that are perceptually connected to the original training inputs.
5 Conclusions
We have presented a system that combines regression-based machine learning
with corpus-based concatenative synthesis. We extend a previous study where a
sound tracing workflow was used to design gestures to articulate time varying
sounds. Gesture input from EMG and IMU sensors generated multidimensional
targets and are associated with specific points in a high-dimensional timbral
feature space in order to train a neural network. Using this workflow, we were able
to reproduce original sound tracings. By populating the corpus with related, but
timbrally diverse grains, we increased the corporal density to enable expressive
variation on the original gesture. This workflow demonstrates the use of real-
time, interactive machine learning with CataRT and creates a multidimensional
feature mapping linking gesture to sound synthesis.
12 Michael Zbyszy´nski, Balandino Di Donato, and Atau Tanaka
1. J.-J. Aucouturier and F. Pachet,Jamming with Plunderphonics: Interac-
tive concatenative synthesis of music, Journal of New Music Research, 35 (2006),
pp. 35–50.
2. G. Beller,Gestural control of real time speech synthesis in lunapark, in Proceed-
ings of Sound Music Computing Conference, SMC, Padova, Italy, 2011.
3. B. Caramiaux, F. Bevilacqua, and N. Schnell,Towards a gesture-sound
cross-modal analysis, in Gesture in Embodied Communication and Human-
Computer Interaction, Berlin, Heidelberg, 2010, pp. 158–170.
4. R. Fiebrink and P. R. Cook,The wekinator: a system for real-time, interactive
machine learning in music, in Proceedings of The International Society for Music
Information Retrieval Conference, ISMIR 2010, Utrecht, Netherlands, 2010.
5. J. Franc¸ oise,Motion-sound mapping by demonstration, PhD thesis, UPMC, 2015.
6. J. Margraf,Masters thesis, TU Berlin, (2019).
7. S. T. Parke-Wolfe, H. Scurto, and R. Fiebrink,Sound control: Supporting
custom musical interface design for children with disabilities, in Proceedings of
the International Conference on New Interfaces for Musical Expression, NIME’19,
Porto Alegre, Brazil, 2019.
8. T. D. Sanger,Bayesian filtering of myoelectric signals, Journal of neurophysiol-
ogy, 97 (2007), pp. 1839–1845.
9. M. Savary, D. Schwarz, and D. Pellerin,Dirti —dirty tangible interfaces,in
Proceedings of the International Conference on New Interfaces for Musical Expres-
sion, NIME’12, Ann Arbor, Michigan, 2012.
10. M. Savary, D. Schwarz, D. Pellerin, F. Massin, C. Jacquemin, and R. Ca-
hen,Dirty tangible interfaces: Expressive control of computers with true grit,in
CHI ’13 Extended Abstracts on Human Factors in Computing Systems, CHI EA
’13, Paris, France, 2013, ACM, pp. 2991–2994.
11. D. Schwarz,Concatenative sound synthesis: The early years, Journal of New
Music Research, 35 (2006), pp. 3–22.
12. D. Schwarz,The sound space as musical instrument: Playing corpus-based con-
catenative synthesis, in Proceedings of the International Conference on New Inter-
faces for Musical Expression, NIME’12, Ann Arbor, Michigan, 2012.
13. D. Schwarz, G. Beller, B. Verbrugghe, and S. Britton,Real-Time Corpus-
Based Concatenative Synthesis with CataRT, in 9th International Conference on
Digital Audio Eects, DAFx 19, Montreal, Canada, 2006, pp. 279–282.
14. D. Schwarz, R. Cahen, and S. Britton,Principles and applications of inter-
active corpus-based concatenative synthesis, in Journes d’Informatique Musicale,
JIM, Albi, France, 2008.
15. D. Stowell and M. D. Pumbley,Timbre remapping through a regression-tree
trchnique, in Proceedings of the Sound Music Computing Confetence, SMC, 2010.
16. A. Tanaka, B. Di Donato, and M. Zbyszy´
nski,Designing gestures for con-
tinuous sonic interaction, in Proceedings of the International Conference on New
Interfaces for Musical Expression, NIME’19, Porto Alegre, Brazil, 2019.
17. D. L. Wessel,Timbre space as a musical control structure, Computer music jour-
nal, (1979), pp. 45–52.
18. M. Zbyszy´
nski, M. Grierson, and M. Yee-King,Rap id p rot ot yp in g of n ew
instruments with codecircle, in Proceedings of the International Conference on New
Interfaces for Musical Expression, Copenhagen, Denmark, 2017, pp. 227–230.
Conference Paper
Full-text available
We present a system that allows users to try different ways to train neural networks and temporal modelling to associate gestures with time-varying sound. We created a software framework for this and evaluated it in a workshop-based study. We build upon research in sound tracing and mapping-by-demonstration to ask participants to design gestures for performing time-varying sounds using a multimodal, inertial measurement (IMU) and muscle sensing (EMG) device. We presented the user with two classical techniques from the literature, Static Position regression and Hidden Markov based temporal modelling, and propose a new technique for capturing gesture anchor points on the fly as training data for neural network based regression , called Windowed Regression. Our results show trade-offs between accurate, predictable reproduction of source sounds and exploration of the gesture-sound space. Several users were attracted to our windowed regression technique. This paper will be of interest to musicians engaged in going from sound design to gesture design and offers a workflow for interactive machine learning.
Full-text available
Designing the relationship between motion and sound is essential to the creation of interactive systems. This thesis proposes an approach to the design of the mapping between motion and sound called Mapping-by-Demonstration. Mapping-by-Demonstration is a framework for crafting sonic interactions from demonstrations of embodied associations between motion and sound. It draws upon existing literature emphasizing the importance of bodily experience in sound perception and cognition. It uses an interactive machine learning approach to build the mapping iteratively from user demonstrations. Drawing upon related work in the fields of animation, speech processing and robotics, we propose to fully exploit the generative nature of probabilistic models, from continuous gesture recognition to continuous sound parameter generation. We studied several probabilistic models under the light of continuous interaction. We examined both instantaneous (Gaussian Mixture Model) and temporal models (Hidden Markov Model) for recognition, regression and parameter generation. We adopted an Interactive Machine Learning perspective with a focus on learning sequence models from few examples, and continuously performing recognition and mapping. The models either focus on movement, or integrate a joint representation of motion and sound. In movement models, the system learns the association between the input movement and an output modality that might be gesture labels or movement characteristics. In motion-sound models, we model motion and sound jointly, and the learned mapping directly generates sound parameters from input movements. We explored a set of applications and experiments relating to real-world problems in movement practice, sonic interaction design, and music. We proposed two approaches to movement analysis based on Hidden Markov Model and Hidden Markov Regression, respectively. We showed, through a use-case in Tai Chi performance, how the models help characterizing movement sequences across trials and performers. We presented two generic systems for movement sonification. The first system allows users to craft hand gesture control strategies for the exploration of sound textures, based on Gaussian Mixture Regression. The second system exploits the temporal modeling of Hidden Markov Regression for associating vocalizations to continuous gestures. Both systems gave birth to interactive installations that we presented to a wide public, and we started investigating their interest to support gesture learning.
Conference Paper
Full-text available
Corpus-based concatenative synthesis is based on descriptor analysis of any number of existing or live-recorded sounds, and synthesis by selection of sound segments from the database matching given sound characteristics. It is well described in the literature, but has been rarely examined for its capacity as a new interface for musical expression. The outcome of such an examination is that the actual instru-ment is the space of sound characteristics, through which the performer navigates with gestures captured by various input devices. We will take a look at different types of interaction modes and controllers (positional, inertial, au-dio) and the gestures they afford, and provide a critical as-sessment of their musical and expressive capabilities, based on several years of musical experience, performing with the CataRT system for real-time CBCS.
Conference Paper
Full-text available
Dirty Tangible Interfaces (DIRTI) are a new concept in interface design that forgoes the dogma of repeatability in favor of a richer and more complex experience, constantly evolving, never reversible, and infinitely modifiable. We built a prototype interface realizing the DIRTI principles based on low-cost commodity hardware and kitchenware: A video camera tracks a granular or liquid interaction material placed in a glass dish. The 3D relief estimated from the images, and the dynamic changes applied to it by the user(s), are used to control two applications: For 3D scene authoring, the relief is directly translated into a terrain, allowing fast and intuitive map editing. For expressive audio-graphic music performance, both the relief and real-time changes are interpreted as activation profiles to drive corpus-based concatenative sound synthesis, allowing one or more musicians to mold sonic landscapes and to plow through them in an inherently collaborative, expressive, and dynamic experience.
Full-text available
Concatenative sound synthesis is a promising method of musical sound synthesis with a steady stream of work and publications for over five years now. This article offers a comparative survey and taxonomy of the many different approaches to concatenative synthesis throughout the history of electronic music, starting in the 1950s, even if they weren't known as such at their time, up to the recent surge of contemporary methods. Concatenative sound synthesis methods use a large database of source sounds, segmented into units, and a unit selection algorithm that finds the units that match best the sound or musical phrase to be synthesized, called the target. The selection is performed according to the descriptors of the units. These are characteristics extracted from the source sounds, e.g. pitch, or attributed to them, e.g. instrument class. The selected units are then transformed to fully match the target specification, and concatenated. However, if the database is sufficiently large, the probability is high that a matching unit will be found, so the need to apply transformations is reduced. The most urgent and interesting problems for further work on concatenative synthesis are listed concerning segmentation, descriptors, efficiency, legality, data mining and real time interaction. Finally, the conclusion tries to provide some insight into the current and future state of concatenative synthesis research.
Full-text available
This paper proposes to use the techniques of Concate-native Sound Synthesis in the context of real-time Music Interaction. We describe a system that generates an au-dio track by concatenating audio segments extracted from pre-existing musical files. The track can be controlled in real-time by specifying high-level properties (or con-straints) holding on metadata about the audio segments. A constraint-satisfaction mechanism, based on local search, selects audio segments that best match those constraints at any time. We describe the real-time aspects of the system, notably the asynchronous adding/removing of constraints, and report on several constraints and controllers designed for the system. We illustrate the system with several appli-cation examples, notably a virtual drummer able to interact with a human musician in real-time.
Conference Paper
Full-text available
This article reports on the exploration of a method based on canonical correlation analysis (CCA) for the analysis of the relationship between gesture and sound in the context of music performance and listening. This method is a first step in the design of an analysis tool for gesture-sound relationships. In this exploration we used motion capture data recorded from subjects performing free hand movements while listening to short sound examples. We assume that even though the relationship between gesture and sound might be more complex, at least part of it can be revealed and quantified by linear multivariate regression applied to the motion capture data and audio descriptors extracted from the sound examples. After outlining the theoretical background, the article shows how the method allows for pertinent reasoning about the relationship between gesture and sound by analysing the data sets recorded from multiple and individual subjects.
Surface electromyography is used in research, to estimate the activity of muscle, in prosthetic design, to provide a control signal, and in biofeedback, to provide subjects with a visual or auditory indication of muscle contraction. Unfortunately, successful applications are limited by the variability in the signal and the consequent poor quality of estimates. I propose to use a nonlinear recursive filter based on Bayesian estimation. The desired filtered signal is modeled as a combined diffusion and jump process and the measured electromyographic (EMG) signal is modeled as a random process with a density in the exponential family and rate given by the desired signal. The rate is estimated on-line by calculating the full conditional density given all past measurements from a single electrode. The Bayesian estimate gives the filtered signal that best describes the observed EMG signal. This estimate yields results with very low short-time variability but also with the capability of very rapid response to change. The estimate approximates isometric joint torque with lower error and higher signal-to-noise ratio than current linear methods. Use of the nonlinear filter significantly reduces noise compared with current algorithms, and it may therefore permit more effective use of the EMG signal for prosthetic control, biofeedback, and neurophysiology research.
Gestural control of real time speech synthesis in lunapark
  • G Beller
G. Beller, Gestural control of real time speech synthesis in lunapark, in Proceedings of Sound Music Computing Conference, SMC, Padova, Italy, 2011.