Designing Gestures for Continuous Sonic Interaction
Atau Tanaka, Balandino Di Donato, and Michael Zbyszyński
Embodied Audiovisual Interaction Group
Goldsmiths, University of London
SE14 6NW, London, UK
[a.tanaka, b.didonato,
m.zbyszynski]@gold.ac.uk
Geert Roks
Music and Technology Department
HKU University of the Arts
3582 VB, Utrecht, Netherlands
geertrocks@gmail.com
ABSTRACT
We present a system that allows users to try different ways
to train neural networks and temporal modelling to asso-
ciate gestures with time-varying sound. We created a soft-
ware framework for this and evaluated it in a workshop-
based study. We build upon research in sound tracing
and mapping-by-demonstration to ask participants to de-
sign gestures for performing time-varying sounds using a
multimodal, inertial measurement (IMU) and muscle sens-
ing (EMG) device. We presented the user with two classical
techniques from the literature, Static Position regression
and Hidden Markov based temporal modelling, and pro-
pose a new technique for capturing gesture anchor points
on the fly as training data for neural network based regres-
sion, called Windowed Regression. Our results show trade-
offs between accurate, predictable reproduction of source
sounds and exploration of the gesture-sound space. Several
users were attracted to our windowed regression technique.
This paper will be of interest to musicians engaged in going
from sound design to gesture design and offers a workflow
for interactive machine learning.
Author Keywords
Sonic Interaction Design, Interactive Machine Learning,
Gestural Interaction
CCS Concepts
• Human-centered computing → Empirical studies
in interaction design; • Applied computing → Sound
and music computing;
1. INTRODUCTION
Designing gestures for the articulation of dynamic sound
synthesis is a key part of the preparation of a performance
with a DMI. Traditionally this takes place through a care-
ful and manual process of mapping. Strategies for mapping,
including “one-to-many” and “many-to-one” [15] are funda-
mental techniques in NIME. The field of embodied music
cognition looks at the relationship between corporeal action
and music [17]. The notion of sonic affordances draws upon
the notion of affordance from environmental psychology [13]
to look at how a sound may invite action [1].
Licensed under a Creative Commons Attribution
4.0 International License (CC BY 4.0). Copyright
remains with the author(s).
NIME’19, June 3-6, 2019, Federal University of Rio Grande do Sul,
Porto Alegre, Brazil.
Sound tracing is an exercise where a sound is given as
a stimulus to study evoked gestural response [3]. Sound
tracing has been used as a starting point for techniques of
“mapping-by-demonstration” [12]. While these studies look
at the articulation of gesture in response to sounds, they
focus on evoked gesture. In the field of sonic interaction de-
sign, embodied interaction has been used to design sounds.
This includes techniques that apply interactive technologies
to traditions of Foley or to vocalisation [7], invoking the
body in the design of sounds.
The synthesis of time-varying sounds and the exploration
of timbral spaces is a practice at the heart of computer mu-
sic research. Wessel’s seminal work in the field defines tim-
bre space in a Cartesian plane [22]. Momeni has proposed
interactive techniques for exploring timbre spaces [18].
Neural networks can be trained for regression tasks by
providing examples of inputs associated with desired out-
puts. In systems for interactive machine learning, like Wek-
inator [9], this is implemented by associating positions in 3D
space to synthesised sound output. Once a model is trained,
the user performs by moving between (and beyond) the ex-
ample positions to create dynamic sound by gestures. While
performance is dynamic, the training is based on poses as-
sociated with sound synthesis parameters that are fixed for
each input example. Here we call this approach “static re-
gression.”
Time-varying gestures can be modelled by probabilistic
approaches, such as Hidden Markov Models. In perfor-
mance, live input is compared to transition states of the
model, allowing the algorithm to track where in the exam-
ple gesture the input is. This approach is commonly referred
to as temporal modelling.
We present a system for designing gestures to perform
time-varying synthesised sound. It extends the notion of
mapping-by-demonstration in a practical setting by en-
abling users to capture gesture while listening to sound, and
then to train different machine learning models. It asso-
ciates the authoring of gesture to interactive sound synthe-
sis and in so doing, explores the connection between sound
design and gesture design. The technique uses commonly
available tools for musical performance and machine learn-
ing and assumes no specialist knowledge of machine learn-
ing. It will be useful for artists wishing to create gestures
for interactive music performances in which gestural input
articulates dynamic synthesised sound where the associa-
tion of gesture and sound is not made by direct mapping,
but mediated by machine learning.
We propose an automated technique for training a neu-
ral network with a windowed set of anchor points captured
on the fly from a dynamic gesture made in response to a
sound tracing stimulus. We call this technique Windowed
Regression and evaluate it alongside static regression and
temporal modelling to gain insight into its usefulness in a
gesture design task.
This paper is organised as follows. In the next section,
we survey related work in the area of machine learning of
musical gesture. In Section 3, we present the architecture of
our system, its techniques of sound design, machine learning
and the proposed workflow. Section 4 presents a workshop-
based evaluation. This is followed by a discussion to gather
insight from user experiences.
2. RELATED WORK
Fiebrink established an interactive machine learning (IML)
workflow for musicians carrying out classification and re-
gression tasks with gestural input driving sound synthesis
output where users are able to edit, delete, and add to train-
ing datasets interactively [9]. In a typical workflow with
Wekinator, a regression task would be trained by static pos-
tures. Scurto [10] proposes a method of extracting examples
from dynamic performances in response to sonic stimuli.
Caramiaux [3] uses Canonical Correlation Analysis to
study evoked gestures in response to sound stimuli and ex-
plores the different movement-sound relationships evoked
by “causal” and “non-causal” sounds [5]. In the latter, users
trace the sound’s frequency/amplitude morphology.
Nymoen [19] conducted a large scale sound tracing study
relating gesture features (position, velocity, acceleration) to
sound features such as loudness, brightness and pitch, and
found a direct relationship between spectral centroid and
vertical motion. When the movement of pitch was opposite
to the motion of the spectral centroid, participants were
more likely to move their hands following the pitch. When
listening to noisy sounds, participants performed gestures
that were characterised by a higher acceleration.
Françoise [11] studied different probabilistic models in
mapping-by-demonstration. He uses two kinds of mod-
elling, Gaussian Mixture Models (GMM), and Hierarchi-
cal Hidden Markov Models (HHMM) and uses each in two
different ways: 1.) to model gesture itself (single mode),
and 2.) to model gesture along with the associated sound
(multimodal). GMMs provide a probabilistic classification
of gesture or regression based on a gesture-sound relation-
ship, while HMM-based approaches create temporal mod-
els either of the gesture by itself or of the gesture-sound
association. We adopt his HHMM approach as one of the
algorithms used in our proposed system.
There are an increasing number of machine learning soft-
ware packages for interactive music applications [2, 8, 14,
20, 23]. While these tools expose machine learning tech-
nologies to artists, they still require configuration and in-
tegration into a music composition or performance system.
One part of our proposed system is a scriptable interface
where the user can assign gesture features to feed Wek-
inator, and select synthesis parameters to be controlled
by Wekinator’s output. We provide a generic Wekinator
project that runs in the background that is controlled by
our system.
3. THE SYSTEM
We developed our system using Cycling’74 Max, Fiebrink’s
Wekinator for neural network regression, and the HHMM
object from IRCAM’s MuBu library for temporal modelling.
3.1 Architecture
Our system is modular, comprised of three (3) blocks:
1. A scriptable sensor input and gesture feature extrac-
tion module
2. A scriptable synthesiser controller with breakpoint en-
velopes to dynamically send selected parameters to
the machine learning module
3. A machine learning training module to capture gesture
training sets and remotely control Wekinator
3.1.1 Sensor input & feature extraction
For this study, we capture gesture using a Thalmic Labs
Myo, using its electromyogram (EMG) muscle sensing and
inertial measurement unit (IMU) gross movement and ori-
entation sensing. To extract orientation from the IMU, we
capture Euler Angles (x, y, z) of the forearm. We calculate
the first order differences (xd, yd, zd) of these angles, which
are correlated with direction and speed of displacement, and
augment our regression feature vector with historical data.
We detect gesture power [4] by tracking muscle exertion,
following the amplitude envelope of four (of the Myo’s 8)
EMG channels with a Bayesian filter [21].
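As an illustrative sketch of this feature pipeline (not the actual Max implementation), the Python fragment below computes the Euler angles, their first-order differences, and a smoothed amplitude envelope for four EMG channels, with a simple one-pole follower standing in for the Bayesian filter of [21], and appends the previous frame as historical data. All names and the smoothing constant are assumptions made for illustration.

import numpy as np

class GestureFeatures:
    # Per-frame feature vector: Euler angles, first-order differences,
    # smoothed EMG envelopes, and the previous frame as history.
    def __init__(self, n_emg=4, smooth=0.9):
        self.prev_angles = np.zeros(3)   # x, y, z
        self.emg_env = np.zeros(n_emg)   # amplitude envelopes
        self.prev_vector = None          # historical data
        self.smooth = smooth             # one-pole smoothing coefficient

    def process(self, euler_xyz, emg_frame):
        euler_xyz = np.asarray(euler_xyz, dtype=float)
        emg_frame = np.abs(np.asarray(emg_frame, dtype=float))
        diff = euler_xyz - self.prev_angles          # xd, yd, zd
        self.prev_angles = euler_xyz
        # Envelope follower as a stand-in for Bayesian filtering [21].
        self.emg_env = self.smooth * self.emg_env + (1 - self.smooth) * emg_frame
        current = np.concatenate([euler_xyz, diff, self.emg_env])
        if self.prev_vector is None:
            self.prev_vector = np.zeros_like(current)
        features = np.concatenate([current, self.prev_vector])
        self.prev_vector = current
        return features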
The sendrcv scripting system we propose allows the user
to select any number of features to be sent to Wekinator
as inputs. In this way, the proposed system is not specific
to the Myo and can be used with other sensors and input
feature extraction algorithms.
3.1.2 Synthesizer playback
We used a general purpose software synthesizer, SCP by
Manuel Poletti. This synthesizer is controlled by our break-
point envelope-based playback system. We chose to design
sounds that transition between four fixed anchor points
(start, two intermediate points, and end) that represent
fixed synthesis parameters. The envelope interpolates be-
tween these fixed points. The temporal evolution of sound
is captured as different states in the breakpoint editor whose
envelopes run during playback, feeding both synthesizer and
Wekinator. Any of the parameters can be assigned to break-
point envelopes to be controlled during playback.
The sounds are customisable. For the workshop, we cre-
ated two sounds with granular synthesis and one sound us-
ing a looping sample synthesizer. These sound trajectories
are reproduced during the gesture design and model train-
ing phases of our workflow (section 3.2). In performance
a model maps sensor data to synthesis parameters, allow-
ing users to reproduce the designed sounds or explore sonic
space around the existing sounds.
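The sketch below illustrates the breakpoint playback idea: each scripted synthesis parameter is linearly interpolated between its four fixed anchor values (start, two intermediate points, end), and the interpolated values can be sent both to the synthesizer and to the machine learning module during playback. The parameter names and values are placeholders, not the sounds used in the workshop.

import numpy as np

def envelope_value(t, duration, anchor_values):
    # Interpolate one synthesis parameter between its four anchor values.
    times = np.linspace(0.0, duration, len(anchor_values))
    return float(np.interp(t, times, anchor_values))

duration = 8.0                                    # a short (<10 s) sound
params = {
    "grain_size":    [20.0, 80.0, 40.0, 120.0],   # illustrative values
    "filter_cutoff": [200.0, 2000.0, 800.0, 5000.0],
}

for t in np.arange(0.0, duration, 2.0):
    frame = {name: envelope_value(t, duration, vals)
             for name, vals in params.items()}
    # frame would feed both the synthesizer and Wekinator here
    print(t, frame)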
3.1.3 Wekinator communications
We developed a scripting system, sendrcv, in Max that al-
lows modularity and high-level use of the system. Sendrcv
is a configurable scaling and mapping abstraction that sets
up assignable sends and receives between Wekinator, the
gesture features that feed it, and the synthesis parameters
it controls. On input, it allows the user to select gesture
features to be recorded by Wekinator. On output, each
instantiation makes a bridge between a parameter in the
synthesizer and the model output.
Sendrcv is invoked with network ports as arguments, al-
lowing multiple sensor inputs and synthesizers to be used in
parallel with a corresponding number of Wekinator projects.
It is instantiated with a unique name so messages can be
addressed specifying the gesture feature or synthesizer pa-
rameter that it feeds or controls. It is bidirectional, allow-
ing the use of a synthesizer’s user interface or the Wekinator
sliders to author sounds. The relevant range of a synthesizer
parameter can be defined in the script and is normalised to
a floating point value in the range 0.0–1.0. This allows
a Wekinator project to be agnostic to synthesizer specifics.
Other scripting features include throttling the data rate us-
ing speedlim, and a ramp destination time for Max’s line
object. A typical setup script is:
; 6448weki01 sendrcv mysend;
6448weki01 arg myarg;
6448weki01 min 0;
6448weki01 max 127;
6448weki01 speedlim 10;
6448weki01 time 10;
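Outside of Max, the same communication pattern can be reproduced with plain OSC. The sketch below uses python-osc and Wekinator's usual defaults (feature inputs on port 6448 at /wek/inputs, model outputs on port 12000 at /wek/outputs); the feature values and the 0–127 scaling mirror the example script above and are otherwise placeholders.

from pythonosc.udp_client import SimpleUDPClient
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

# Send a gesture feature vector to a Wekinator project on port 6448.
client = SimpleUDPClient("127.0.0.1", 6448)
client.send_message("/wek/inputs", [0.2, 0.5, 0.1, 0.9])  # placeholder features

# Receive normalised model outputs and scale them to the synthesizer
# parameter range, as sendrcv does in Max.
def on_outputs(address, *values):
    lo, hi = 0, 127                     # min/max from the setup script
    scaled = [lo + v * (hi - lo) for v in values]
    print(address, scaled)              # forward to the synthesizer here

dispatcher = Dispatcher()
dispatcher.map("/wek/outputs", on_outputs)
BlockingOSCUDPServer(("127.0.0.1", 12000), dispatcher).serve_forever()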
The sound and gesture design workflow is described be-
low in Section 3.3.
3.2 Machine Learning Approaches
Four different approaches to machine learning (ML) are
used in the system. We provide three different ways to
train neural networks for regression, each using the same al-
gorithm and topology, but varying in the way training data
are captured. A fourth method uses HHMMs for tempo-
ral modelling, which we chose because it can track progress
inside of a gesture.
3.2.1 Static Regression
In the first approach, after designing the sound-gesture in-
teraction through the sound tracing exercise, users segment
their gestural performance into four discrete poses, or an-
chor points. These points coincide with breakpoints in
the synthesis parameters (section 3.1.2). Training data are
recorded by pairing sensor data from static poses with fixed
synthesis parameters. These data are used to train a re-
gression model, so in performance participants can explore
a continuous mapping between the defined training points.
We refer to this technique as Static Regression.
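A minimal sketch of this data-capture strategy, with scikit-learn's MLPRegressor standing in for Wekinator's neural network regression, is given below: each of the four poses contributes a batch of near-identical sensor frames, all paired with that anchor point's fixed synthesis parameters, and the trained model interpolates continuously between the poses at performance time. The data and network size are illustrative only.

import numpy as np
from sklearn.neural_network import MLPRegressor

def build_static_dataset(pose_frames, anchor_params):
    # pose_frames: one array of frames per anchor pose (frames x features)
    # anchor_params: one fixed synthesis parameter vector per pose
    X, y = [], []
    for frames, params in zip(pose_frames, anchor_params):
        for frame in frames:
            X.append(frame)
            y.append(params)            # same target for every frame of a pose
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
pose_frames = [rng.normal(loc=i, scale=0.05, size=(200, 18)) for i in range(4)]
anchor_params = [[0.0, 0.1], [0.3, 0.9], [0.7, 0.4], [1.0, 1.0]]

X, y = build_static_dataset(pose_frames, anchor_params)
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X, y)

# In performance, live frames between the poses yield continuously
# interpolated synthesis parameters.
print(model.predict(rng.normal(loc=1.5, scale=0.05, size=(1, 18))))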
3.2.2 Temporal Modelling
In the second approach, we train temporal models, specifi-
cally Hierarchical Hidden Markov Models implemented with
MuBu [11]. HHMMs are used to automatically segment a
gesture into 10 equal-sized states, each represented by a
Gaussian Mixture Model. In performance, the output of an
HHMM is used to step along the synthesis parameter time-
line. Here, we refer to this technique as Temporal Modelling.
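As a rough, simplified picture of what such a temporal model does (MuBu's HHMM is considerably richer), the sketch below splits a recorded example gesture into ten equal states, fits a diagonal Gaussian to each, and follows a left-to-right transition constraint at performance time. This is an assumption-laden stand-in, not the MuBu implementation.

import numpy as np

class LeftToRightTracker:
    # Toy temporal model: 10 equal-length states, one Gaussian each,
    # with transitions restricted to staying put or advancing one state.
    def __init__(self, example, n_states=10, floor=1e-4):
        segments = np.array_split(example, n_states)   # example: frames x features
        self.means = np.array([s.mean(axis=0) for s in segments])
        self.vars = np.array([s.var(axis=0) + floor for s in segments])
        self.state = 0

    def _loglik(self, frame, k):
        d = frame - self.means[k]
        return -0.5 * np.sum(d * d / self.vars[k] + np.log(2 * np.pi * self.vars[k]))

    def step(self, frame):
        candidates = [self.state]
        if self.state + 1 < len(self.means):
            candidates.append(self.state + 1)
        self.state = max(candidates, key=lambda k: self._loglik(frame, k))
        return self.state / (len(self.means) - 1)   # progress 0.0-1.0 along the sound

In this simplified picture, the returned progress value plays the role of the HHMM output that steps along the synthesis parameter timeline.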
3.2.3 Whole Regression
In a third approach, we train a neural network using input
and output data generated during the whole duration of the
sound. We call this algorithm Whole Regression.
3.2.4 Windowed Regression
Finally, we propose our method: training a neural network
with gestural data and synthesis parameters from four tem-
poral windows centred around the four fixed anchor points
in the sound. Anchor points are defined as points in time
where there is a breakpoint in the functions that gener-
ate synthesis parameters over time (red circles in Figure 1).
This includes the beginning and end of the sound, as well
as two equally spaced intermediate points. Training data
are recorded during windows that are centred around
the anchor points and have a size of 1/6 of the whole du-
ration of the given sound (grey areas in Figure 1). We call this
Windowed Regression.
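The sketch below captures the essence of this data selection, under the same illustrative assumptions as the earlier sketches: from a gesture recorded while sound tracing, only frames falling inside windows of width one sixth of the sound's duration, centred on the four anchor times, are kept, paired with the envelope values at those frames, and used to train the regression model.

import numpy as np
from sklearn.neural_network import MLPRegressor

def windowed_training_data(times, gesture, envelope, duration, n_anchors=4):
    # times: timestamp per frame; gesture: frames x features;
    # envelope: frames x synthesis parameters (from breakpoint playback).
    anchors = np.linspace(0.0, duration, n_anchors)
    half = (duration / 6.0) / 2.0                  # window size = duration / 6
    mask = np.zeros(len(times), dtype=bool)
    for a in anchors:
        mask |= np.abs(times - a) <= half
    return gesture[mask], envelope[mask]

rng = np.random.default_rng(1)
duration = 8.0
times = np.linspace(0.0, duration, 800)
gesture = rng.normal(size=(800, 18))                 # placeholder sensor features
envelope = np.column_stack([times / duration,
                            np.sin(times) * 0.5 + 0.5])  # placeholder parameters

X, y = windowed_training_data(times, gesture, envelope, duration)
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X, y)

Compared with Whole Regression, the frames between anchors are discarded, so the model is trained only on the most salient gesture-sound pairings while performance still interpolates across the whole space.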
3.3 Workflow
The workflow is divided into four top level activities: Sound
design, Gesture design, Machine training and Performance.
While we present them here in order, they and the steps
within them can be carried out iteratively and interactively.
Figure 1: Windowed Regression. The red circles
represent the four anchor points, and the grey zones
show the window of recorded data around each an-
chor point.
3.3.1 Sound design
In the sound design phase of our workflow, users use their
preferred synthesizer to author sounds. They select salient
synthesis parameters that will be modulated in the temporal
evolution of the sound. These parameters are scripted in
sendrcv. A sound trajectory is then composed of four anchor
points. The user then records these anchor points using the
Envelope window of our system. They create a variant on
their sound, select the breakpoint to which they would like
to assign it (0–3), and click Set (Fig. 2).
Figure 2: The Envelope window showing progress
bar, sound duration, anchor point selection, set but-
ton above, and several envelopes below.
In this way, short (<10 second) sounds can be created
with dynamic parameter envelopes that are suitable for
sound tracing.
3.3.2 Gesture design
The gesture design part of the system (Fig. 3) enables the
user to choose between the different ML approaches men-
tioned above (Section 3.2). The user selects a sound to
preview in the left part of the panel. In the current ver-
sion, there are three (3) authored sounds that can be pre-
viewed, each with four (4) anchor points. The Play button
below the progress bar changes name contextually to play
the time-varying sound trajectory or one of the selected
anchor points. In this way the user can conceive their ges-
ture by sound tracing, practice executing it while listening
to the sound, and find salient anchor points in the gesture
that correspond to anchor points in the sound.
3.3.3 Model training
Once the gestures are designed, the user can train their
choice of ML algorithms. Figure 4 shows the logical se-
quence. First, the user decides whether to work with an-
chor points in a static regression approach or using dynamic
gesture in one of three time-based approaches. In the lat-
ter case, they choose from whole or windowed regression or
temporal modelling. This part is seen in the middle pane
of the interface in Fig. 3.

Figure 3: The machine learning training panel, with
selection of sounds (with possible selection of anchor
point for static regression) (Left), Selection of ML
algorithm (Centre), and Record, Play, Train, and
Clear Dataset buttons (Right).

Once the algorithm is chosen,
the user proceeds with training using the right panel. The
Record button records examples. If a dynamic algorithm
is chosen, this will play the selected sound, and the user
records a training set by sound tracing, in the same way
they designed the gesture. If the user has chosen Static Re-
gression, they select an anchor point on the left, hold the
pose associated with the anchor point, and then click
the Record button. This is repeated for each of the anchor
points. At any point, the user has the possibility to Clear
their recording (the C button) to re-take their gesture or
their posture. The Data Set Size field shows the number of
samples recorded. If they are happy with their recording,
the user then trains a model by clicking the T button.
Figure 4: The machine training decision tree, where
the user selects static regression, one of two types
of dynamic regression, or temporal modelling.
4. EVALUATION
We organised a half-day workshop where we presented the
software and asked participants to explore each approach
to machine learning. We collected qualitative data in the
form of video capturing participants’ experience using our
proposed system. Data were analysed by adopting Open
and Axial Coding Methods [6].
4.1 Participants
The workshop was not meant to be a tutorial on ML tech-
niques nor a primer on sonic interaction design. We, there-
fore, recruited participants who were creative practitioners
in music, dance, or computational art, who had some prior
exposure to topics of embodied interaction and ML. We
recruited five (5) participants (3 female, 2 male). Three
were Computational Arts Masters students with interest in
dance technology, one was a recent undergraduate Creative
Computing graduate, and one was a PhD student in live
audiovisual performance.
Figure 5: A workshop participant demonstrating
her gesture.
4.2 Procedure
We provided the hardware and software system on lab com-
puters. We also provided three (3) sounds that had been
prepared for the study:
A A Theremin-like whistling sound with a frequency tra-
jectory ending in rapid vibrato
B A rhythmic sound of repeating bells where speed and
pitch were modulated
C Scrubbing of a pop song where granular synthesis al-
lowed time stretching
By providing the sounds, the workshop focused on the Ges-
ture Design segment of the workflow described above.
We focused on Sound A, the frequency trajectory of the
whistling tone. Participants listened to the sound, design-
ing their gesture by sound tracing. They then tried Whole
Regression. In the second task, participants were asked to
think about breaking their gesture down into anchor points
to train the system by Static Regression. Task three con-
sisted of trying Windowed Regression and Temporal Mod-
elling. We finished with a free exploration segment where
the participants tried the other two sounds with algorithms
of their choosing.
4.3 Results
Four of five participants designed a gesture for Sound A
that was consistent with theory from sound tracing; they
followed the amplitude/frequency morphology of the sound
with sweeping arm gestures and muscle tension. One par-
ticipant designed her gesture with a drawing on paper (Fig.
6). Participants tried to represent the wobbly vibrato at the
end of the sound in different ways: by wiggling their fingers,
flapping their hands, or making a fist. P1 commented on
Whole Regression where interaction with the sound “became
embodied, it was giving me, and I was giving it.”
The participants responded differently to decomposing
their gesture into anchor points for Static Regression. For
P1 this meant that she “could be more precise.” P2 identified
what she called, “natural” points along her paper sketch as
anchors. These included key points like the turn of a line,
but also the middle of a smooth curve (Fig. 6). P3 felt
that this technique had a less fluid response, like triggering
different “samples”. P4 found it difficult to decompose her
smooth gesture into constituent anchors: “It was difficult to
have the four anchor points... Sure the sound was divided
up in different pitches but..”. P5 felt that “the connection
between the sound and the movement was not as close [as
using Whole Regression].” P1 took this as a creative op-
portunity, “I had the possibility to reinvent the transitions.”
Figure 6: Gesture design by drawing. P2 in Task 1
(Left), then Task 2 with anchor points (Right).
With Temporal Modelling, P1 seemed to track the orien-
tation of her arm more than her hand gestures. P3 found
it to be “too discrete” and P4 found it “super choppy.” P5
remarked, “you could hear the transitions, it was less fluid
than number one [Whole regression]. It was steppy.”
Three participants (P1, P3, P4) had positive reactions
to our Windowed Regression technique. P1 used it with
Sound B (a rhythmic bell) in a gesture consisting of waving
her hand out and twisting her wrist while moving her arm
from frontwards to upwards. After trying and clearing the
recording four times, she perfected her gesture by paying
attention to shoulder position and finger tension. P3 and
P4 chose Windowed Regression with Sound C (a scrubbed
and filtered sample of a pop song). P3 “performed” with
it in a playful manner: “What I was trying to do was...
to separate out the bits.” P4 played with the “acceleration
of the gesture... because of the sound [song], that’s more
a continuous sound and movement, so I worked more with
the acceleration.” P1 and P3 felt that this technique enabled
them to reproduce the sound accurately but at the same
time also to explore new sonic possibilities.
In the free exploration segment of the workshop, four out
of five participants (P2, P3, P4 and P5) presented their
explorations with Sound B (rhythmic bells). P5 trained a
Static Regression model with different spatial positions of
the arm. P3 did similarly and attempted to add striking
gestures to follow the rhythmic accelerando. P2 associated
speed of movement to bell triggering using Temporal Mod-
elling. She tried with the arm in a fixed position and again
by changing orientation, and felt that the latter worked bet-
ter for her. P2 showed how she used the bell sound with
Whole Regression. She performed a zig-zag like movement,
and explored the quiet moments she could attain through
stillness, similar to the work of Jensenius et al. [16].
Participants were interested in going beyond reproducing
the sound trajectory they had traced, exploring the expres-
sivity of a given technique and responding to variations of
gesture within and outside the designed gesture. Sound B
(rhythmic bell) was the most difficult sample to reproduce
faithfully but gave more expressivity; P5 said “it gave the
best interaction... the most surprising results.”
5. DISCUSSION AND CONCLUSIONS
We have presented a system for designing gesture that imple-
ments four related machine learning techniques. We pre-
sented those techniques in a workshop without giving their
name or technical details on how each algorithm worked.
The only indication about how static modelling differed
from the dynamic techniques was that participants were
asked to train the system using gesture anchor points. In
this sense, this study was not a comparison of different mod-
elling techniques. In the release version of our system1, we
expose the names of the algorithms in the UI, making a
direct comparison possible.
The workflow afforded by our system enables the user,
without specialist knowledge of ML and without directly
configuring and operating ML algorithms, to enter into a
musically productive gesture design activity following the
IML paradigm. Our system is aimed at musicians and
artists who might imagine incorporating embodied interac-
tion and machine learning into a performance. The work-
shop participants represented such a user group: they were
comfortable with digital technologies, but did not have spe-
cific technical knowledge of feature extraction or machine
learning. However, they were articulate in describing what
they experienced and insightful in discerning the different
kinds of gesture-sound interaction each algorithm afforded.
The intuitive way in which our users explored the differ-
ent algorithms meant that they sometimes trained models
that did not perform as expected. Without visibility into the
data and how an algorithm was processing it, it is difficult
to know how to alter one’s approach when training a new
model. While sometimes unpredictable performance was a
positive effect, it was more commonly viewed as an error.
Three users (P3, P4, P5) felt that Static Regression did
not result in smooth interaction. This may be due to large
amounts of training data and a possible overfitting effect.
We took this into consideration in a design iteration of the
system. Based on this, we added an auto-stop feature in the
static gesture recording so that it stops after 200 samples.
Participants on the whole confirmed findings of sound
tracing studies. They followed the amplitude/frequency
morphology of a sound when it was non-causal [5]. When
they sought to trace a more causal type of sound such as the
bell hits, they tried to make striking gestures. Such gestures
would be missed by a regression algorithm. Meanwhile, a
temporal model would have difficulty tracking the repeti-
tive looping nature of such a gesture. While modulation
of the sample loop-end point at the output of the neural
network caused an accelerando in the bell rhythm, a striking
rhythm on the input was not modelled.
Meanwhile having multiple input modalities (EMG and
IMU) gave the users multiple dimensions on which to trace
sound morphology. With a single modality, like motion cap-
ture in Cartesian space, it can be unclear whether a gesture
like raising the arms is tracing rising frequency or amplitude
or both. By using muscle tension and orientation indepen-
dently, we saw that our users used the IMU to follow pitch
contour, and muscle tension to follow the intensity of sound – be
it in amplitude or in effects like the nervous vibrato at the
end of the whistling Theremin-like tone. This is consistent
with Nymoen’s observation on the change in sound tracing
strategies as users encounter noisier sounds [19]. While Ny-
moen sees increased acceleration, here the EMG modality
allows an effort dimension in sound tracing that does not
have to follow pitch or spectral centroid.
While the workshop focused on the gesture design work-
flow, we imagine users will be interested in designing sounds
along with performance gestures, and training models ac-
cordingly. We hope our method of designing sounds with
trajectories is effective. However, authoring sounds using
only four anchor points may be frustrating for some. If the
number of anchor points is too few, our system could be
expanded to accommodate more. However, in the current
version, anchor points are synchronous. It is possible that
sound designers would not want parameters to have break-
points at the same points in time. Future development will
involve integrating our system into full musical performance
environments, incorporating multiple sounds and gestures,
providing an interface for saving and loading models, and
accounting for performance issues such as fatigue.

1 https://gitlab.doc.gold.ac.uk/biomusic/continuous-gesture-sound-interaction
In demonstrations of machine learning for artists, tuto-
rials often focus on the rapid prototyping advantages of
the IML paradigm. In a desire to get artists up and run-
ning with regression and modelling techniques, examples
are recorded quickly and trained on random variations of
synthesizer sounds. The focus is on speed and ease of use.
Scurto found that the serendipity this causes can bring a
certain creative satisfaction [10]. However, we can imagine
that, once comfortable with the record-train-perform-iterate
IML loop, composers and performers will want to work
with specific sounds or choreographies of movement. It is
here that sound design and gesture design meet. Our sys-
tem provides a sound and gesture design front end to IML
that connects the two via sound tracing.
Participants in our workshop were concerned about the
fluidity of response of the ML algorithms. They discussed
the choice of algorithms as a trade-off between faithfully
reproducing the traced sound and giving them a space of
exploration to produce new, unexpected ways to articulate
the sounds. In this way, they began to examine the ges-
ture/sound affordances of the different approaches to re-
gression and temporal modelling our system offered. We
might say that this enabled them to exploit IML for a ges-
tural exploration of Wessel’s timbre space.
This paper presented a system that enabled sound and
gesture design to use techniques of sound tracing and IML in
authoring continuous embodied sonic interaction. It intro-
duced established techniques of static regression and tem-
poral modelling and proposed a hybrid approach, called
Windowed Regression, to track time-varying sound and as-
sociated gesture to automatically train a neural network
with salient examples. Workshop participants responded
favourably to Windowed Regression, finding it fluid and ex-
pressive. They were successful in using our system in an it-
erative workflow to design gestures in response to dynamic,
time-varying sound synthesis. We hope that this system and
associated techniques will be of interest to artists preparing
performances with time-based media and machine learning.
6. ACKNOWLEDGEMENT
We acknowledge our funding body H2020-EU.1.1. - EX-
CELLENT SCIENCE - European Research Council (ERC)
- ERC-2017-Proof of Concept (PoC) - Project name:
BioMusic - Project ID: 789825.
7. REFERENCES
[1] A. Altavilla, B. Caramiaux, and A. Tanaka. Towards
gestural sonic affordances. In Proc. NIME, Daejeon,
Korea, 2013.
[2] J. Bullock and A. Momeni. ml.lib: Robust,
cross-platform, open-source machine learning for Max
and Pure Data. In Proc. NIME, pages 265–270, Baton
Rouge, Louisiana, USA, 2015.
[3] B. Caramiaux, F. Bevilacqua, and N. Schnell.
Towards a gesture-sound cross-modal analysis. In
Gesture in Embodied Communication and
Human-Computer Interaction, pages 158–170, Berlin,
Heidelberg, 2010.
[4] B. Caramiaux, M. Donnarumma, and A. Tanaka.
Understanding gesture expressivity through muscle
sensing. ACM Transactions on Computer-Human
Interaction (TOCHI), 21(6):31, 2015.
[5] B. Caramiaux, P. Susini, T. Bianco, et al. Gestural
embodiment of environmental sounds : an
experimental study. In Proc. NIME, pages 144–148,
Oslo, Norway, 2011.
[6] J. M. Corbin and A. L. Strauss. Basics of Qualitative
Research: Techniques and Procedures for Developing
Grounded Theory. SAGE, Fourth edition, 2015.
[7] S. Delle Monache and D. Rocchesso. To embody or
not to embody: A sound design dilemma. In Machine
Sounds, Sound Machines. XXII Colloquium of Music
Informatics, Venice, Italy, 2018.
[8] R. Fiebrink and P. R. Cook. The Wekinator: a system
for real-time, interactive machine learning in music.
In Proc. ISMIR, Utrecht, Netherlands, 2010.
[9] R. Fiebrink, P. R. Cook, and D. Trueman. Human
model evaluation in interactive supervised learning. In
Proc. CHI, pages 147–156, Vancouver, BC, Canada,
2011.
[10] R. Fiebrink and H. Scurto. Grab-and-play mapping:
Creative machine learning approaches for musical
inclusion and exploration. In Proc. ICMC, pages
12–16, 2016.
[11] J. Françoise, N. Schnell, R. Borghesi, and
F. Bevilacqua. Probabilistic models for designing
motion and sound relationships. In Proc. NIME,
pages 287–292, London, UK, 2014.
[12] J. Françoise. Motion-sound mapping by
demonstration. PhD thesis, UPMC, 2015.
[13] J. Gibson. Theory of affordances. In The ecological
approach to visual perception. Lawrence Erlbaum
Associates, 1986.
[14] N. Gillian and J. A. Paradiso. The Gesture
Recognition Toolkit. The Journal of Machine
Learning Research, 15(1):3483–3487, 2014.
[15] A. Hunt and M. M. Wanderley. Mapping performer
parameters to synthesis engines. Organised Sound,
7(2):97–108, 2002.
[16] A. R. Jensenius, V. E. Gonzalez Sanchez,
A. Zelechowska, and K. A. V. Bjerkestrand. Exploring
the Myo controller for sonic microinteraction. In Proc.
NIME, pages 442–445, Copenhagen, Denmark, 2017.
[17] M. Leman. Embodied music cognition and mediation
technology. MIT Press, 2008.
[18] A. Momeni and D. Wessel. Characterizing and
controlling musical material intuitively with geometric
models. In Proc. NIME, pages 54–62, Montreal,
Canada, 2003.
[19] K. Nymoen, J. Torresen, R. I. Godøy, and A. R.
Jensenius. A statistical approach to analyzing sound
tracings. In Speech, Sound and Music Processing:
Embracing Research in India, pages 120–145, Berlin,
Heidelberg, 2012.
[20] A. Parkinson, M. Zbyszyński, and F. Bernardo.
Demonstrating interactive machine learning tools for
rapid prototyping of gestural instruments in the
browser. In Proc. Web Audio Conference, London,
UK, 2017.
[21] T. D. Sanger. Bayesian filtering of myoelectric signals.
Journal of neurophysiology, 97(2):1839–1845, 2007.
[22] D. L. Wessel. Timbre space as a musical control
structure. Computer Music Journal, pages 45–52,
1979.
[23] M. Zbyszyński, M. Grierson, M. Yee-King, et al.
Rapid prototyping of new instruments with codecircle.
In Proc. NIME, Copenhagen, Denmark, 2017.