Designing Gestures for Continuous Sonic Interaction
Atau Tanaka, Balandino Di Donato, and Michael Zbyszyński
Embodied Audiovisual Interaction Group
Goldsmiths, University of London
SE14 6NW, London, UK
[a.tanaka, b.didonato,
m.zbyszynski]@gold.ac.uk
Geert Roks
Music and Technology Department
HKU University of the Arts
3582 VB, Utrecht, Netherlands
geertrocks@gmail.com
ABSTRACT
We present a system that allows users to try different ways
to train neural networks and temporal modelling to asso-
ciate gestures with time-varying sound. We created a soft-
ware framework for this and evaluated it in a workshop-
based study. We build upon research in sound tracing
and mapping-by-demonstration to ask participants to de-
sign gestures for performing time-varying sounds using a
multimodal, inertial measurement (IMU) and muscle sens-
ing (EMG) device. We presented the user with two classical
techniques from the literature, Static Position regression
and Hidden Markov based temporal modelling, and pro-
pose a new technique for capturing gesture anchor points
on the fly as training data for neural network based regres-
sion, called Windowed Regression. Our results show trade-
offs between accurate, predictable reproduction of source
sounds and exploration of the gesture-sound space. Several
users were attracted to our windowed regression technique.
This paper will be of interest to musicians engaged in going
from sound design to gesture design and offers a workflow
for interactive machine learning.
Author Keywords
Sonic Interaction Design, Interactive Machine Learning,
Gestural Interaction
CCS Concepts
• Human-centered computing → Empirical studies in interaction design; • Applied computing → Sound and music computing;
1. INTRODUCTION
Designing gestures for the articulation of dynamic sound
synthesis is a key part of the preparation of a performance
with a DMI. Traditionally this takes place through a care-
ful and manual process of mapping. Strategies for mapping,
including “one-to-many” and “many-to-one” [15] are funda-
mental techniques in NIME. The field of embodied music
cognition looks at the relationship between corporeal action
and music [17]. The notion of sonic affordances draws upon
the notion of affordance from environmental psychology [13]
to look at how a sound may invite action [1].
Licensed under a Creative Commons Attribution
4.0 International License (CC BY 4.0). Copyright
remains with the author(s).
NIME’19, June 3-6, 2019, Federal University of Rio Grande do Sul,
Porto Alegre, Brazil.
Sound tracing is an exercise where a sound is given as
a stimulus to study evoked gestural response [3]. Sound
tracing has been used as a starting point for techniques of
“mapping-by-demonstration” [12]. While these studies look
at the articulation of gesture in response to sounds, they
focus on evoked gesture. In the field of sonic interaction de-
sign, embodied interaction has been used to design sounds.
This includes techniques that apply interactive technologies
to traditions of Foley or use vocalisation [7], invoking the
body in the design of sounds.
The synthesis of time-varying sounds and the exploration
of timbral spaces is a practice at the heart of computer mu-
sic research. Wessel’s seminal work in the field defines tim-
bre space in a Cartesian plane [22]. Momeni has proposed
interactive techniques for exploring timbre spaces [18].
Neural networks can be trained for regression tasks by
providing examples of inputs associated with desired out-
puts. In systems for interactive machine learning, like Wek-
inator [9], this is implemented by associating positions in 3D
space to synthesised sound output. Once a model is trained,
the user performs by moving between (and beyond) the ex-
ample positions to create dynamic sound by gestures. While
performance is dynamic, the training is based on poses as-
sociated with sound synthesis parameters that are fixed for
each input example. Here we call this approach “static re-
gression.”
Time-varying gestures can be modelled by probabilistic
approaches, such as Hidden Markov Models. In perfor-
mance, live input is compared to transition states of the
model, allowing the algorithm to track where in the exam-
ple gesture the input is. This approach is commonly referred
to as temporal modelling.
We present a system for designing gestures to perform
time-varying synthesised sound. It extends the notion of
mapping-by-demonstration in a practical setting by en-
abling users to capture gesture while listening to sound, and
then to train different machine learning models. It asso-
ciates the authoring of gesture to interactive sound synthe-
sis and in so doing, explores the connection between sound
design and gesture design. The technique uses commonly
available tools for musical performance and machine learn-
ing and assumes no specialist knowledge of machine learn-
ing. It will be useful for artists wishing to create gestures
for interactive music performances in which gestural input
articulates dynamic synthesised sound where the associa-
tion of gesture and sound is not made by direct mapping,
but mediated by machine learning.
We propose an automated technique for training a neu-
ral network with a windowed set of anchor points captured
on the fly from a dynamic gesture made in response to a
sound tracing stimulus. We call this technique Windowed
Regression and evaluate it alongside static regression and
temporal modelling to gain insight into its usefulness in a
gesture design task.
This paper is organised as follows. In the next section,
we survey related work in the area of machine learning of
musical gesture. In Section 3, we present the architecture of
our system, its techniques of sound design, machine learning
and the proposed workflow. Section 4 presents a workshop-
based evaluation. This is followed by a discussion to gather
insight from user experiences.
2. RELATED WORK
Fiebrink established an interactive machine learning (IML)
workflow for musicians carrying out classification and re-
gression tasks with gestural input driving sound synthesis
output where users are able to edit, delete, and add to train-
ing datasets interactively [9]. In a typical workflow with
Wekinator, a regression task would be trained by static pos-
tures. Scurto [10] proposes a method of extracting examples
from dynamic performances in response to sonic stimuli.
Caramiaux [3] uses Canonical Correlation Analysis to
study evoked gestures in response to sound stimuli and ex-
plores the different movement-sound relationships evoked
by “causal” and “non-causal” sounds [5]. In the latter, users
trace the sound’s frequency/amplitude morphology.
Nymoen [19] conducted a large scale sound tracing study
relating gesture features (position, velocity, acceleration) to
sound features such as loudness, brightness and pitch, and
found a direct relationship between spectral centroid and
vertical motion. When the movement of pitch was opposite
to the motion of the spectral centroid, participants were
more likely to move their hands following the pitch. When
listening to noisy sounds, participants performed gestures
that were characterised by a higher acceleration.
Françoise [11] studied different probabilistic models in
mapping-by-demonstration. He uses two kinds of mod-
elling, Gaussian Mixture Models (GMM), and Hierarchi-
cal Hidden Markov Models (HHMM) and uses each in two
different ways: 1.) to model gesture itself (single mode),
and 2.) to model gesture along with the associated sound
(multimodal). GMMs provide a probabilistic classification
of gesture or regression based on a gesture-sound relation-
ship, while HMM-based approaches create temporal mod-
els either of the gesture by itself or of the gesture-sound
association. We adopt his HHMM approach as one of the
algorithms used in our proposed system.
There are an increasing number of machine learning soft-
ware packages for interactive music applications [2, 8, 14,
20, 23]. While these tools expose machine learning tech-
nologies to artists, they still require configuration and in-
tegration into a music composition or performance system.
One part of our proposed system is a scriptable interface
where the user can assign gesture features to feed Wek-
inator, and select synthesis parameters to be controlled
by Wekinator’s output. We provide a generic Wekinator
project that runs in the background that is controlled by
our system.
3. THE SYSTEM
We developed our system using Cycling’74 Max, Fiebrink’s
Wekinator for neural network regression, and the HHMM
object from IRCAM’s MuBu library for temporal modelling.
3.1 Architecture
Our system is modular, comprised of three (3) blocks:
1. A scriptable sensor input and gesture feature extrac-
tion module
2. A scriptable synthesiser controller with breakpoint en-
velopes to dynamically send selected parameters to
the machine learning module
3. A machine learning training module to capture gesture
training sets and remotely control Wekinator
3.1.1 Sensor input & feature extraction
For this study, we capture gesture using a Thalmic Labs
Myo, using its electromyogram (EMG) muscle sensing and
inertial measurement unit (IMU) gross movement and ori-
entation sensing. To extract orientation from the IMU, we
capture Euler Angles (x, y, z) of the forearm. We calculate
the first order differences (xd, yd, zd) of these angles, which
are correlated with direction and speed of displacement, and
augment our regression feature vector with historical data.
We detect gesture power [4] by tracking muscle exertion,
following the amplitude envelope of four (of the Myo’s 8)
EMG channels with a Bayesian filter [21].
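As an illustration of this feature pipeline, the following Python sketch assembles the per-frame feature vector from the Euler angles, their first-order differences, and smoothed envelopes of four EMG channels. It is a simplified stand-in for the Max implementation: the class and parameter names are hypothetical, and a one-pole envelope follower replaces the Bayesian filter [21].

    import numpy as np

    class FeatureExtractor:
        # Sketch of the gesture feature vector described in Section 3.1.1.
        # A one-pole rectify-and-smooth follower stands in for the Bayesian
        # filter [21] used in the actual system.
        def __init__(self, n_emg=4, smoothing=0.9):
            self.prev_euler = np.zeros(3)
            self.emg_env = np.zeros(n_emg)
            self.smoothing = smoothing

        def process(self, euler, emg):
            euler = np.asarray(euler, dtype=float)          # x, y, z
            emg = np.abs(np.asarray(emg, dtype=float))[:4]  # 4 of the Myo's 8 channels
            diff = euler - self.prev_euler                  # xd, yd, zd
            self.prev_euler = euler
            self.emg_env = (self.smoothing * self.emg_env
                            + (1.0 - self.smoothing) * emg)
            return np.concatenate([euler, diff, self.emg_env])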
The sendrcv scripting system we propose allows the user
to select any number of features to be sent to Wekinator
as inputs. In this way, the proposed system is not specific
to the Myo and can be used with other sensors and input
feature extraction algorithms.
3.1.2 Synthesizer playback
We used a general purpose software synthesizer, SCP by
Manuel Poletti. This synthesizer is controlled by our break-
point envelope-based playback system. We chose to design
sounds that transition between four fixed anchor points
(start, two intermediate points, and end) that represent
fixed synthesis parameters. The envelope interpolates be-
tween these fixed points. The temporal evolution of sound
is captured as different states in the breakpoint editor whose
envelopes run during playback, feeding both synthesizer and
Wekinator. Any of the parameters can be assigned to break-
point envelopes to be controlled during playback.
The sounds are customisable. For the workshop, we cre-
ated two sounds with granular synthesis and one sound us-
ing a looping sample synthesizer. These sound trajectories
are reproduced during the gesture design and model train-
ing phases of our workflow (section 3.2). In performance
a model maps sensor data to synthesis parameters, allow-
ing users to reproduce the designed sounds or explore sonic
space around the existing sounds.
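To make the breakpoint playback concrete, the sketch below evaluates a four-anchor-point envelope at an arbitrary time. Linear interpolation, equal spacing of the anchor points, and the example parameter values are assumptions made for illustration; the system itself uses SCP and Max envelopes.

    import numpy as np

    def envelope_value(anchors, t, duration):
        # anchors: (4, n_params) synthesis parameter values at the four
        # anchor points (start, two intermediate points, end), assumed
        # equally spaced over the sound's duration.
        anchors = np.asarray(anchors, dtype=float)
        times = np.linspace(0.0, duration, len(anchors))
        return np.array([np.interp(t, times, anchors[:, p])
                         for p in range(anchors.shape[1])])

    # Hypothetical example: two parameters over an 8-second sound.
    anchors = [[0.2, 0.1], [0.6, 0.4], [0.5, 0.9], [1.0, 0.3]]
    print(envelope_value(anchors, 3.0, 8.0))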
3.1.3 Wekinator communications
We developed a scripting system, sendrcv, in Max that al-
lows modularity and high-level use of the system. Sendrcv
is a configurable scaling and mapping abstraction that sets
up assignable sends and receives between Wekinator, the
gesture features that feed it, and the synthesis parameters
it controls. On input, it allows the user to select gesture
features to be recorded by Wekinator. On output, each
instantiation makes a bridge between a parameter in the
synthesizer and the model output.
Sendrcv is invoked with network ports as arguments, al-
lowing multiple sensor inputs and synthesizers to be used in
parallel with a corresponding number of Wekinator projects.
It is instantiated with a unique name so messages can be
addressed specifying the gesture feature or synthesizer pa-
rameter that it feeds or controls. It is bidirectional, allow-
ing the use of a synthesizer’s user interface or the Wekinator
sliders to author sounds. The relevant range of a synthesizer
parameter can be defined in the script and is normalised to
a floating point value in the range 0.0–1.0. This allows
a Wekinator project to be agnostic to synthesizer specifics.
Other scripting features include throttling the data rate using speedlim, and a ramp destination time for Max’s line
object. A typical setup script is:
; 6448weki01 sendrcv mysend;
6448weki01 arg myarg;
6448weki01 min 0;
6448weki01 max 127;
6448weki01 speedlim 10;
6448weki01 time 10;
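The OSC traffic that sendrcv brokers can be sketched outside Max, for example in Python with the python-osc package, using what we understand to be Wekinator's default addresses (/wek/inputs on port 6448, /wek/outputs on port 12000). The parameter ranges below are hypothetical; the sketch only illustrates the feature-forwarding and output-scaling roles that sendrcv plays.

    from pythonosc.udp_client import SimpleUDPClient
    from pythonosc.dispatcher import Dispatcher
    from pythonosc.osc_server import BlockingOSCUDPServer

    weki_in = SimpleUDPClient("127.0.0.1", 6448)      # Wekinator input port

    def send_features(features):
        # forward the selected gesture features to Wekinator
        weki_in.send_message("/wek/inputs", [float(f) for f in features])

    PARAM_RANGES = [(0, 127), (20, 2000)]             # hypothetical min/max per parameter

    def on_outputs(address, *values):
        # rescale each normalised 0.0-1.0 model output to its synthesizer
        # range, mirroring the min/max arguments of the sendrcv script
        for (lo, hi), v in zip(PARAM_RANGES, values):
            print(address, lo + v * (hi - lo))        # forward to the synthesizer here

    dispatcher = Dispatcher()
    dispatcher.map("/wek/outputs", on_outputs)
    server = BlockingOSCUDPServer(("127.0.0.1", 12000), dispatcher)
    # server.serve_forever()                          # uncomment to listen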
The sound and gesture design workflows are described below in Section 3.3.
3.2 Machine Learning Approaches
Four different approaches to machine learning (ML) are
used in the system. We provide three different ways to
train neural networks for regression, each using the same al-
gorithm and topology, but varying in the way training data
are captured. A fourth method uses HHMMs for tempo-
ral modelling, which we chose because it can track progress
inside of a gesture.
3.2.1 Static Regression
In the first approach, after designing the sound-gesture in-
teraction through the sound tracing exercise, users segment
their gestural performance into four discrete poses, or an-
chor points. These points coincide with breakpoints in
the synthesis parameters (section 3.1.2). Training data are
recorded by pairing sensor data from static poses with fixed
synthesis parameters. These data are used to train a re-
gression model, so in performance participants can explore
a continuous mapping between the defined training points.
We refer to this technique as Static Regression.
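A sketch of how such a static training set could be assembled is shown below: every sensor frame recorded while a pose is held is paired with the fixed synthesis parameters of its anchor point. An off-the-shelf multilayer perceptron from scikit-learn stands in for Wekinator's neural network regression, so the function names and model settings are illustrative assumptions rather than the system's implementation.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def build_static_training_set(pose_recordings, anchor_params):
        # pose_recordings: list of four arrays, each (n_frames, n_features),
        # recorded while the user holds the pose for one anchor point.
        # anchor_params: (4, n_params) fixed synthesis parameters per anchor.
        X, y = [], []
        for frames, params in zip(pose_recordings, anchor_params):
            X.append(frames)
            y.append(np.tile(params, (len(frames), 1)))  # same target for every frame
        return np.vstack(X), np.vstack(y)

    def train_static_regression(pose_recordings, anchor_params):
        X, y = build_static_training_set(pose_recordings, anchor_params)
        model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000)
        model.fit(X, y)
        return model          # in performance: model.predict([current_features])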
3.2.2 Temporal Modelling
In the second approach, we train temporal models, specifi-
cally Hierarchical Hidden Markov Models implemented with
MuBu [11]. HHMMs are used to automatically segment a
gesture into 10 equal-sized states, each represented by a
Gaussian Mixture Model. In performance, the output of an
HHMM is used to step along the synthesis parameter time-
line. Here, we refer to this technique as Temporal Modelling.
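The performance-time use of the temporal model can be sketched as follows: the model's estimate of progress through the example gesture (here a state index from the 10-state segmentation) is mapped to a position along the sound, and the anchor-point parameters are interpolated at that position. The state-to-time mapping and the linear interpolation are assumptions; the HHMM itself is provided by MuBu and is not reimplemented here.

    import numpy as np

    def hhmm_progress_to_params(state_index, anchors, n_states=10):
        # Map the temporal model's state estimate (0..n_states-1) to a
        # normalised position along the sound, then interpolate the four
        # anchor-point parameter sets at that position.
        anchors = np.asarray(anchors, dtype=float)
        pos = (state_index + 0.5) / n_states              # 0.0 .. 1.0
        anchor_pos = np.linspace(0.0, 1.0, len(anchors))
        return np.array([np.interp(pos, anchor_pos, anchors[:, p])
                         for p in range(anchors.shape[1])])

    # e.g. state 7 of 10 yields parameters approaching the final anchor point:
    # hhmm_progress_to_params(7, [[0.2, 0.1], [0.6, 0.4], [0.5, 0.9], [1.0, 0.3]])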
3.2.3 Whole Regression
In a third approach, we train a neural network using input
and output data generated during the whole duration of the
sound. We call this algorithm Whole Regression.
3.2.4 Windowed Regression
Finally, we propose our method: training a neural network
with gestural data and synthesis parameters from four tem-
poral windows centred around the four fixed anchor points
in the sound. Anchor points are defined as points in time
where there is a breakpoint in the functions that gener-
ate synthesis parameters over time (red circles in Figure 1).
This includes the beginning and end of the sound, as well
as two equally spaced intermediate points. Training data
are recorded during windows centred around the anchor
points, each spanning 1/6 of the whole duration of the given
sound (grey areas in Figure 1). We call this
Windowed Regression.
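The windowed selection of training data can be sketched as follows, assuming that sensor frames and the corresponding synthesis parameters are sampled evenly over the sound's duration; the function and argument names are hypothetical.

    import numpy as np

    def windowed_training_set(features, params, duration, n_anchors=4, frac=1.0 / 6.0):
        # features, params: arrays of shape (n_frames, ...) recorded while the
        # user traces the whole sound, assumed evenly spaced over `duration`.
        # Frames inside a window of width frac*duration centred on each of the
        # equally spaced anchor points are kept as training examples.
        n_frames = len(features)
        t = np.linspace(0.0, duration, n_frames)
        anchor_times = np.linspace(0.0, duration, n_anchors)
        half = frac * duration / 2.0
        mask = np.zeros(n_frames, dtype=bool)
        for a in anchor_times:
            mask |= (t >= a - half) & (t <= a + half)
        return features[mask], params[mask]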
3.3 Workflow
The workflow is divided into four top level activities: Sound
design, Gesture design, Machine training and Performance.
While we present them here in order, they and the steps
within them can be carried out iteratively and interactively.
Figure 1: Windowed Regression. The red circles
represent the four anchor points, and the grey zones
show the window of recorded data around each an-
chor point.
3.3.1 Sound design
In the sound design phase of our workflow, users use their
preferred synthesizer to author sounds. They select salient
synthesis parameters that will be modulated in the temporal
evolution of the sound. These parameters are scripted in
sendrcv. A sound trajectory is then composed of four anchor
points. The user then records these anchor points using the
Envelope window of our system. They create a variant on
their sound, select the breakpoint to which they would like
to assign it (0–3), and click Set (Fig. 2).
Figure 2: The Envelope window showing progress
bar, sound duration, anchor point selection, set but-
ton above, and several envelopes below.
In this way, short (<10 second) sounds can be created
with dynamic parameter envelopes that are suitable for
sound tracing.
3.3.2 Gesture design
The gesture design part of the system (Fig. 3) enables the
user to choose between the different ML approaches men-
tioned above (Section 3.2). The user selects a sound to
preview in the left part of the panel. In the current ver-
sion, there are three (3) authored sounds that can be pre-
viewed, each with four (4) anchor points. The Play button
below the progress bar changes name contextually to play
the time-varying sound trajectory or one of the selected
anchor points. In this way the user can conceive their ges-
ture by sound tracing, practice executing it while listening
to the sound, and find salient anchor points in the gesture
that correspond to anchor points in the sound.
3.3.3 Model training
Once the gestures are designed, the user can train their
choice of ML algorithms. Figure 4 shows the logical se-
quence. First, the user decides whether to work with an-
chor points in a static regression approach or with dynamic
gesture in one of three time-based approaches. In the lat-
ter case, they choose from whole or windowed regression or
temporal modelling. This part is seen in the middle pane
of the interface in Fig. 3. Once the algorithm is chosen,
the user proceeds with training using the right panel. The
Record button records examples. If a dynamic algorithm
is chosen, this will play the selected sound, and the user
records a training set by sound tracing, in the same way
they designed the gesture.

Figure 3: The machine learning training panel, with selection of sounds (with possible selection of anchor point for static regression) (Left), selection of ML algorithm (Centre), and Record, Play, Train, and Clear Dataset buttons (Right).

If the user has chosen Static Re-
gression, they select an anchor point on the left, hold the
pose associated with the anchor point, and then click
the Record button. This is repeated for each of the anchor
points. At any point, the user has the possibility to Clear
their recording (the C button) to re-take their gesture or
their posture. The Data Set Size field shows the number of
samples recorded. If they are happy with their recording,
the user then trains a model by clicking the T button.
Figure 4: The machine training decision tree, where
the user selects static regression, one of two types
of dynamic regression, or temporal modelling.
4. EVALUATION
We organised a half-day workshop where we presented the
software and asked participants to explore each approach
to machine learning. We collected qualitative data in the
form of video capturing participants’ experience using our
proposed system. Data were analysed by adopting Open
and Axial Coding Methods [6].
4.1 Participants
The workshop was not meant to be a tutorial on ML tech-
niques nor a primer on sonic interaction design. We, there-
fore, recruited participants who were creative practitioners
in music, dance, or computational art, who had some prior
exposure to topics of embodied interaction and ML. We
recruited five (5) participants (3 female, 2 male). Three
were Computational Arts Masters students with interest in
dance technology, one was a recent undergraduate Creative
Computing graduate, and one was a PhD student in live
audiovisual performance.
Figure 5: A workshop participant demonstrating
her gesture.
4.2 Procedure
We provided the hardware and software system on lab com-
puters. We also provided three (3) sounds that had been
prepared for the study:
A A Theremin-like whistling sound with a frequency tra-
jectory ending in rapid vibrato
B A rhythmic sound of repeating bells where speed and
pitch were modulated
C Scrubbing of a pop song where granular synthesis al-
lowed time stretching
By providing the sounds, the workshop focused on the Ges-
ture Design segment of the workflow described above.
We focused on Sound A, the frequency trajectory of the
whistling tone. Participants listened to the sound, design-
ing their gesture by sound tracing. They then tried Whole
Regression. In the second task, participants were asked to
think about breaking their gesture down into anchor points
to train the system by Static Regression. Task three con-
sisted of trying Windowed Regression and Temporal Mod-
elling. We finished with a free exploration segment where
the participants tried the other two sounds with algorithms
of their choosing.
4.3 Results
Four of five participants designed a gesture for Sound A
that was consistent with theory from sound tracing; they
followed the amplitude/frequency morphology of the sound
with sweeping arm gestures and muscle tension. One par-
ticipant designed her gesture with a drawing on paper (Fig.
6). Participants tried to represent the wobbly vibrato at the
end of the sound in different ways: by wiggling their fingers,
flapping their hands, or making a fist. P1 commented on
Whole Regression, where interaction with the sound “became
embodied, it was giving me, and I was giving it.”
The participants responded differently to decomposing
their gesture into anchor points for Static Regression. For
P1 this meant that she “could be more precise.” P2 identified
what she called “natural” points along her paper sketch as
anchors. These included key points like the turn of a line,
but also the middle of a smooth curve (Fig. 6). P3 felt
that this technique had a less fluid response, like triggering
different “samples”. P4 found it difficult to decompose her
smooth gesture into constituent anchors: “It was difficult to
have the four anchor points... Sure the sound was divided
up in different pitches but..”. P5 felt that “the connection
between the sound and the movement was not as close [as
using Whole Regression].” P1 took this as a creative op-
portunity: “I had the possibility to reinvent the transitions.”
Figure 6: Gesture design by drawing. P2 in Task 1
(Left), then Task 2 with anchor points (Right).
With Temporal Modelling, P1 seemed to track the orien-
tation of her arm more than her hand gestures. P3 found
it to be “too discrete” and P4 found it “super choppy.” P5
remarked, “you could hear the transitions, it was less fluid
than number one [Whole regression]. It was steppy.”
Three participants (P1, P3, P4) had positive reactions
to our Windowed Regression technique. P1 used it with
Sound B (a rhythmic bell) in a gesture consisting of waving
her hand out and twisting her wrist while moving her arm
from frontwards to upwards. After trying and clearing the
recording four times, she perfected her gesture by paying
attention to shoulder position and finger tension. P3 and
P4 chose Windowed Regression with Sound C (a scrubbed
and filtered sample of a pop song). P3 “performed” with
it in a playful manner: “What I was trying to do was...
to separate out the bits.” P4 played with the “acceleration
of the gesture... because of the sound [song], that’s more
a continuous sound and movement, so I worked more with
the acceleration.” P1 and P3 felt that this technique enabled
them to reproduce the sound accurately but at the same
time also to explore new sonic possibilities.
In the free exploration segment of the workshop, four out
of five participants (P2, P3, P4 and P5) presented their
explorations with Sound B (rhythmic bells). P5 trained a
Static Regression model with different spatial positions of
the arm. P3 did similarly and attempted to add striking
gestures to follow the rhythmic accelerando. P2 associated
speed of movement to bell triggering using Temporal Mod-
elling. She tried with the arm in a fixed position and again
by changing orientation, and felt that the latter worked bet-
ter for her. P2 showed how she used the bell sound with
Whole Regression. She performed a zig-zag like movement,
and explored the quiet moments she could attain through
stillness, similar to the work of Jensenius et al. [16].
Participants were interested in going beyond reproducing
the sound trajectory they had traced, exploring the expres-
sivity of a given technique and responding to variations of
gesture within and outside the designed gesture. Sound B
(rhythmic bell) was the most difficult sample to reproduce
faithfully but gave more expressivity; P5 said “it gave the
best interaction... the most surprising results.”
5. DISCUSSION AND CONCLUSIONS
We have presented a system for designing gesture and imple-
menting four related machine learning techniques. We pre-
sented those techniques in a workshop without giving their
names or technical details on how each algorithm worked.
The only indication about how static modelling differed
from the dynamic techniques was that participants were
asked to train the system using gesture anchor points. In
this sense, this study was not a comparison of different mod-
elling techniques. In the release version of our system (available at https://gitlab.doc.gold.ac.uk/biomusic/continuous-gesture-sound-interaction), we
expose the names of the algorithms in the UI, making a
direct comparison possible.
The workflow afforded by our system enables the user,
without specialist knowledge of ML and without directly
configuring and operating ML algorithms, to enter into a
musically productive gesture design activity following the
IML paradigm. Our system is aimed at musicians and
artists who might imagine incorporating embodied interac-
tion and machine learning into a performance. The work-
shop participants represented such a user group: they were
comfortable with digital technologies, but did not have spe-
cific technical knowledge of feature extraction or machine
learning. However, they were articulate in describing what
they experienced and insightful in discerning the different
kinds of gesture-sound interaction each algorithm afforded.
The intuitive way in which our users explored the differ-
ent algorithms meant that they sometimes trained models
that did not perform as expected. Without visibility into the
data and how an algorithm was processing it, it is difficult
to know how to alter one’s approach when training a new
model. While unpredictable performance was sometimes a
positive effect, it was more commonly viewed as an error.
Three users (P3, P4, P5) felt that Static Regression did
not result in smooth interaction. This may be due to large
amounts of training data and a possible overfitting effect.
We took this into consideration in a design iteration of the
system. Based on this, we added an auto-stop feature in the
static gesture recording so that it stops after 200 samples.
Participants on the whole confirmed findings of sound
tracing studies. They followed the amplitude/frequency
morphology of a sound when it was non-causal [5]. When
they sought to trace a more causal type of sound such as the
bell hits, they tried to make striking gestures. Such gestures
would be missed by a regression algorithm. Meanwhile, a
temporal model would have difficulty tracking the repeti-
tive looping nature of such a gesture. While in the output
of the neural network, modulation of the sample loop-end
point caused an accelerando in the bell rhythm, a striking
rhythm on input was not modelled.
Meanwhile having multiple input modalities (EMG and
IMU) gave the users multiple dimensions on which to trace
sound morphology. With a single modality, like motion cap-
ture in Cartesian space, it can be unclear whether a gesture
like raising the arms is tracing rising frequency or amplitude
or both. By using muscle tension and orientation indepen-
dently, we saw that our users used the IMU to follow pitch
contour, and muscle tension to follow the intensity of sound,
whether in amplitude or in effects like the nervous vibrato at the
end of the whistling Theremin-like tone. This is consistent
with Nymoen’s observation on the change in sound tracing
strategies as users encounter noisier sounds [19]. While Ny-
moen sees increased acceleration, here the EMG modality
allows an effort dimension in sound tracing that does not
have to follow pitch or spectral centroid.
While the workshop focused on the gesture design work-
flow, we imagine users will be interested in designing sounds
along with performance gestures, and training models ac-
cordingly. We hope our method of designing sounds with
trajectories is effective. However, authoring sounds using
only four anchor points may be frustrating for some. If the
number of anchor points is too few, our system could be
expanded to accommodate more. However, in the current
version, anchor points are synchronous. It is possible that
sound designers would not want parameters to have break-
points at the same points in time. Future development will
involve integrating our system into full musical performance
environments, incorporating multiple sounds and gestures,
providing an interface for saving and loading models, and
accounting for performance issues such as fatigue.
In demonstrations of machine learning for artists, tuto-
rials often focus on the rapid prototyping advantages of
the IML paradigm. In a desire to get artists up and run-
ning with regression and modelling techniques, examples
are recorded quickly and trained on random variations of
synthesizer sounds. The focus is on speed and ease of use.
Scurto found that the serendipity this causes can bring a
certain creative satisfaction [10]. However, we can imagine
that, once comfortable with the record-train-perform-iterate
IML loop, composers and performers will want to work
with specific sounds or choreographies of movement. It is
here that sound design and gesture design meet. Our sys-
tem provides a sound and gesture design front end to IML
that connects the two via sound tracing.
Participants in our workshop were concerned about the
fluidity of response of the ML algorithms. They discussed
the choice of algorithms as a trade-off between faithfully
reproducing the traced sound and giving them a space of
exploration to produce new, unexpected ways to articulate
the sounds. In this way, they began to examine the ges-
ture/sound affordances of the different approaches to re-
gression and temporal modelling our system offered. We
might say that this enabled them to exploit IML for a ges-
tural exploration of Wessel’s timbre space.
This paper presented a system that brings together sound
and gesture design, using techniques of sound tracing and IML in
authoring continuous embodied sonic interaction. It intro-
duced established techniques of static regression and tem-
poral modelling and proposed a hybrid approach, called
Windowed Regression, to track time-varying sound and as-
sociated gesture to automatically train a neural network
with salient examples. Workshop participants responded
favourably to Windowed Regression, finding it fluid and ex-
pressive. They were successful in using our system in an it-
erative workflow to design gestures in response to dynamic,
time-varying sound synthesis. We hope that this system and
associated techniques will be of interest to artists preparing
performances with time-based media and machine learning.
6. ACKNOWLEDGEMENT
We acknowledge our funding body H2020-EU.1.1. - EX-
CELLENT SCIENCE - European Research Council (ERC)
- ERC-2017-Proof of Concept (PoC) - Project name:
BioMusic - Project ID: 789825.
7. REFERENCES
[1] A. Altavilla, B. Caramiaux, and A. Tanaka. Towards
gestural sonic affordances. In Proc. NIME, Daejeon,
Korea, 2013.
[2] J. Bullock and A. Momeni. ml.lib: Robust,
cross-platform, open-source machine learning for max
and pure data. In Proc. NIME, pages 265–270, Baton
Rouge, Louisiana, USA, 2015.
[3] B. Caramiaux, F. Bevilacqua, and N. Schnell.
Towards a gesture-sound cross-modal analysis. In
Gesture in Embodied Communication and
Human-Computer Interaction, pages 158–170, Berlin,
Heidelberg, 2010.
[4] B. Caramiaux, M. Donnarumma, and A. Tanaka.
Understanding gesture expressivity through muscle
sensing. ACM Transactions on Computer-Human
Interaction (TOCHI), 21(6):31, 2015.
[5] B. Caramiaux, P. Susini, T. Bianco, et al. Gestural
embodiment of environmental sounds: an
experimental study. In Proc. NIME, pages 144–148,
Oslo, Norway, 2011.
[6] J. M. Corbin and A. L. Strauss. Basics of Qualitative
Research: Techniques and Procedures for Developing
Grounded Theory. SAGE, Fourth edition, 2015.
[7] S. Delle Monache and D. Rocchesso. To embody or
not to embody: A sound design dilemma. In Machine
Sounds, Sound Machines. XXII Colloquium of Music
Informatics, Venice, Italy, 2018.
[8] R. Fiebrink and P. R. Cook. The Wekinator: a system
for real-time, interactive machine learning in music.
In Proc. ISMIR, Utrecht, Netherlands, 2010.
[9] R. Fiebrink, P. R. Cook, and D. Trueman. Human
model evaluation in interactive supervised learning. In
Proc. CHI, pages 147–156, Vancouver, BC, Canada,
2011.
[10] R. Fiebrink and H. Scurto. Grab-and-play mapping:
Creative machine learning approaches for musical
inclusion and exploration. In Proc. ICMC, pages
12–16, 2016.
[11] J. Françoise, N. Schnell, R. Borghesi, and
F. Bevilacqua. Probabilistic models for designing
motion and sound relationships. In Proc. NIME,
pages 287–292, London, UK, 2014.
[12] J. Françoise. Motion-sound mapping by
demonstration. PhD thesis, UPMC, 2015.
[13] J. Gibson. Theory of affordances. In The ecological
approach to visual perception. Lawrence Erlbaum
Associates, 1986.
[14] N. Gillian and J. A. Paradiso. The Gesture
Recognition Toolkit. The Journal of Machine
Learning Research, 15(1):3483–3487, 2014.
[15] A. Hunt and M. M. Wanderley. Mapping performer
parameters to synthesis engines. Organised Sound,
7(2):97–108, 2002.
[16] A. R. Jensenius, V. E. Gonzalez Sanchez,
A. Zelechowska, and K. A. V. Bjerkestrand. Exploring
the Myo controller for sonic microinteraction. In Proc.
NIME, pages 442–445, Copenhagen, Denmark, 2017.
[17] M. Leman. Embodied music cognition and mediation
technology. MIT Press, 2008.
[18] A. Momeni and D. Wessel. Characterizing and
controlling musical material intuitively with geometric
models. In Proc. NIME, pages 54–62, Montreal,
Canada, 2003.
[19] K. Nymoen, J. Torresen, R. I. Godøy, and A. R.
Jensenius. A statistical approach to analyzing sound
tracings. In Speech, Sound and Music Processing:
Embracing Research in India, pages 120–145, Berlin,
Heidelberg, 2012.
[20] A. Parkinson, M. Zbyszyński, and F. Bernardo.
Demonstrating interactive machine learning tools for
rapid prototyping of gestural instruments in the
browser. In Proc. Web Audio Conference, London,
UK, 2017.
[21] T. D. Sanger. Bayesian filtering of myoelectric signals.
Journal of neurophysiology, 97(2):1839–1845, 2007.
[22] D. L. Wessel. Timbre space as a musical control
structure. Computer Music Journal, pages 45–52,
1979.
[23] M. Zbyszyński, M. Grierson, M. Yee-King, et al.
Rapid prototyping of new instruments with codecircle.
In Proc. NIME, Copenhagen, Denmark, 2017.