A Self-Organizing Gesture Map for a Voice-Controlled Instrument Interface

Stefano Fasciani1,2 Lonce Wyse2,3
1Graduate School for Integrative Sciences & Engineering
2Arts and Creativity Laboratory, Interactive and Digital Media Institute
3Department of Communications and New Media
National University of Singapore
{stefano17, lonce.wyse}@nus.edu.sg
ABSTRACT
Mapping gestures to digital musical instrument parameters is
not trivial when the dimensionality of the sensor-captured data
is high and the model relating the gesture to sensor data is
unknown. In these cases, a front-end processing system for
extracting gestural information embedded in the sensor data is
essential. In this paper we propose an unsupervised offline
method that learns how to reduce and map the gestural data to a
generic instrument parameter control space. We make an
unconventional use of the Self-Organizing Maps to obtain only
a geometrical transformation of the gestural data, while
dimensionality reduction is handled separately. We introduce a
novel training procedure to overcome two main Self-
Organizing Maps limitations which otherwise corrupt the
interface usability. As evaluation, we apply this method to our
existing Voice-Controlled Interface for musical instruments,
obtaining substantial usability improvements.
Keywords
Self-Organizing Maps, Gestural Controller, Multi Dimensional
Control, Unsupervised Gesture Mapping, Voice Control.
1. INTRODUCTION
In Digital Musical Instruments (DMI) design, the Gestural
Controller (GC) [1] plays an essential role: it converts the input
gestural data into intermediate signals that are mapped to sound
synthesis or processing parameters. The GC, together with the
mapping, defines the relationship between the performer’s
gesture and DMI sonic response. Its design is considered a
specialized branch of HCI where the simultaneous and
continuous control of multiple parameters, the instantaneous
response, and the necessity of user practice are key aspects [2]
[3]. The algorithm to interpret the gestural data, performed by
the GC, depends on the nature of the sensors employed in the
interface and also on designer choices. Hunt, Wanderley, and
Kirk [4] classify systems for gestural acquisition into three
categories: direct, indirect and physiological. For the first
category, each sensor captures a single feature. Correlations,
dependencies, redundancy and constraints across the different
sensor data can be derived and handled directly by the physical
characteristics of the performer and the sensors. Hence knowing
how the performer’s gesture is represented in the gestural data
domain allows the implementation of explicit strategies within
the GC. For the other two categories, finding the relationship
between a gesture and the captured gestural data may be
challenging, therefore generative mechanisms, such as learning
algorithms, are often used for the model estimation [5].
In this paper we propose a method to obtain a GC through
unsupervised learning on a set of gesture examples. We assume
that the gestural data is high dimensional, continuous, contains
potential correlations and cross-dependencies, and is not
uniformly distributed. The GC we propose here has output
dimensionality generally lower than the input, its output signals
are continuous and have no cross-constraints across the
individual dimensions. Considering the gestural examples
provided as a sequence of instantaneous posture snapshots, the
GC relates unique output combinations to unique postures.
A performer's gesture would therefore produce a continuous
modification of the DMI parameters, changing the properties of
the sound synthesis or processing algorithm. The non-linear
transformation performed by the GC produces output even when
the input gesture differs from the provided examples, while
respecting the lower-dimensional spatial bounds, topology, and
distribution found in the example data set.
This method is particularly suited for “alternate controllers”
[4] with indirect and physiological gestural data acquisition,
such as those with a large set of features extracted from an
audio or video signal, or from a network of sensors, but it can
be applied to high dimensional direct acquisition as well where
the implementation of explicit strategies can be challenging.
From a broad range of potential application scenarios, in this
paper we present and discuss the integration of this GC
technique with our Voice-Controlled Interface (VCI) [6]. The
VCI is an alternate controller for DMIs whose gestural
acquisition lies between the indirect and physiological
categories, estimating pseudo-physiological characteristics of
the vocal tract through analysis of the vocal audio signal. The
VCI is meant to provide performers
with vocal control over a multidimensional, real-valued
instrument parameter space. Therefore the GC we present in this
work does not perform temporal gesture recognition, but it
samples instantaneous postures from a gesture and maps these
in an intermediate space with a bijective correspondence with
DMI parameters. The key aspect of this work is the utilization
of a multidimensional extension of the Kohonen Self-
Organizing Maps (SOM) [7] lattice to achieve a geometrical
transformation of the gestural data space into the mapping
space. We perform the gestural data dimensionality reduction
using non-linear techniques before the SOM training to
overcome some limitations and shortcomings of the SOM
lattice, which become evident when using the transformation
mentioned above for GCs. Moreover, we introduce and
motivate some variations in the SOM training algorithm as well.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, to republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
NIME’13, May 27-30, 2013, KAIST, Daejeon, Korea.
Copyright remains with the author(s).
2. RELATED WORK
Artificial Neural Networks (ANN) have been used in the design
of new interfaces for musical purposes for more than two
decades. From the pioneering work of Lee and Wessel [8] to
Fiebrink’s more recent Wekinator [9], history shows how these
generative algorithms have been successfully used to learn
gesture maps for DMIs, particularly for time-continuous real-
valued parameters. Typically an ANN is trained with
supervised techniques. On the other hand, the Kohonen SOM
ANN is trained with an unsupervised technique. SOMs are
commonly used to produce a two-dimensional discrete
representation, called a map or lattice, of the input space.
SOMs, contrary to traditional ANNs, tend to preserve the
topological properties of the input space. Therefore, SOMs
have been mainly used for classification, visualization,
organization or retrieval of audio, exploiting the dimensionality
reduction and topological organization capabilities. Ness and
Tzanetakis [10] present a comprehensive survey of SOM
applications in music and instrument interface related projects.
Some instrument interfaces using SOMs are limited to a one-to-
one linear mapping of a two- or three-dimensional sensor to a
trained map of equal dimensionality. Then the music or audio chunks
previously used to train the map are retrieved and reproduced in
real time, as in Odowichuk [10], which represents one of the
few musical examples using an SOM output lattice with
dimensionality larger than two.
Stowell [11] describes an attempt to use the SOM for
remapping vocal timbre, representing the gestural data, to sound
synthesis. He explains the relatively poor performance of this
solution with the intrinsic shortcomings and limitations of the
SOM. The selection of appropriate training settings, output
lattice resolution, and dimensionality may vary drastically case
by case. There may also be differences in map orientation for
different training over the same data set, and errors in topology
preservation such as output lattice twisting or folding. The
latter two are harmless side effects in classification tasks but are
lethal when the output lattice must represent a continuous and
non-linear discretization of the input data set, as we wish to
obtain here. The SOM performs a dimensionality reduction of
the input data and at the same time it tries to maintain its
topology. As described in [12], this dualistic role of the SOM is
one of the sources of topology corruption. If the embedded
dimensionality of the input data is higher than the output lattice
dimensionality, topology corruption is highly probable.
Moreover there is a tradeoff between continuity and resolution
of the map. The SOM output, represented by the lattice node
position, is intrinsically discrete. Therefore when we mention
map continuity within this context we mean that vectors very
close in the input manifold are mapped either to the same or
adjacent nodes. These issues, their effect on a SOM-based GC,
and proposed solutions are covered in the next section.
3. SELF-ORGANIZING GESTURES
In this section we describe the procedure to train the SOM-
based GC and different operative modalities which may fit
different DMI interface requirements. As mentioned above, our
goal is to drive time-continuous and real-valued DMI
parameters. The set of gestures used for the training represents
a temporal sequence of instantaneous postures, defining an
arbitrary shape in an N-dimensional space with arbitrary and
generally non-uniform density. On the output side, the GC
generates data in an M-dimensional space, therefore it applies a
bijective transformation f : R^N → R^M. It is desirable that the
transformation f maintains the local and global topological
structure of the training gestures and spreads them evenly over
the output dimensions. The training data subject to the
transformation f must uniformly fill in a hypercube with
dimensionality M. An empty or low-density sub-volume in the
output hypercube results in a GC constraint that makes it
impossible to obtain certain combinations of the M output
parameters with any gesture similar to the training ones.
Independently of the strategy used to map the GC output to the
DMI (one-to-one, divergent or convergent), the transformation f
described above implicitly requires that M is equal to or smaller
than N. In high-dimensional gestural acquisition we often find
an embedded dimensionality N’ smaller than N, therefore if we
want non-redundant GC outputs, M has to be equal to or smaller
than N’.
In principle, the SOM seems to be a suitable solution in
learning the transformation f from the gestural examples. The
GC can be implemented by interpolating the discrete output map
of the trained ANN, which has N input neurons and an M
dimensional output lattice. However, due to the shortcomings
in topology preservations described in the previous section, the
GC can be affected by the following problems detrimental to
musical instrument design:
- inverted GC response to gestural input belonging to
different subsections of the input space, due to SOM
lattice twisting;
- discontinuous GC response due to excessive SOM
folding (resulting in the proximity of two or more
edges in the output lattice), or due to SOM curled
edges;
- minor inconsistency of the GC behavior due to
local topology distortion of individual output nodes
relative to their neighbors.
Our approach aims to avoid or minimize topology corruptions
by introducing a prior dimensionality reduction step, proper
data preprocessing, and some expedients in the training
algorithm. We free the SOM from the dimensionality reduction
task, while using it to find a non-linear geometrical
transformation between the two iso-dimensional spaces, and to
compress/expand the dynamic range of the input. A similar
approach for a different application domain is taken in the
Isometric SOM [13], where 3D hand posture estimation is
achieved using the SOM with a prior dimensionality reduction
stage performed using the ISOMAP method.
3.1 Learning Process
The preparation of a proper training data set G is essential for
the success of the learning algorithm. The larger the data set,
the better the final system will likely respond. However,
consistency and compliance of the data set are fundamental.
Most sensor output data are sampled at regular intervals.
Maintaining a posture over time during the recording of G
generates inappropriate high-density clusters in the N-
dimensional space, due to the sampling of identical or very
similar postures. This can bias and corrupt our training process
because the data density is not representative of the gesture
only. Strategies to avoid the presence of identical postural data
may vary with the characteristics of the gestural data
acquisition system, or alternatively, a more general
post-processing step over the acquired data G can be used. For our VCI
we adopt a gate operated by the spectral flux value to filter out
undesired postures while collecting the gestural data vectors in
G. The spectral flux threshold value is automatically measured
from a set of postures, which is also used for noisy feature rejection.
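The gating described above can be sketched as follows. This is a minimal Python illustration (the actual system is implemented in MAX/Msp and MATLAB); the frame representation, the threshold value, and the function names are assumptions:

```python
import numpy as np

def spectral_flux(prev_mag, mag):
    # Sum of positive spectral magnitude increases between consecutive frames.
    return float(np.sum(np.maximum(mag - prev_mag, 0.0)))

def gate_postures(frames, threshold):
    """Keep a frame only while the spectrum is changing (flux above the
    threshold), discarding near-identical held postures that would
    otherwise bias the density of the training set G."""
    kept = []
    for prev, cur in zip(frames[:-1], frames[1:]):
        if spectral_flux(prev, cur) > threshold:
            kept.append(cur)
    return kept
```

A held posture produces near-zero flux between consecutive frames, so only frames that carry new postural information enter G.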
3.1.1 Gestural Data Pre Processing
A single dimension here represents a sensor signal for direct
gestural acquisition, or a single feature for indirect and
physiological gestural acquisition. The N-dimensional gestural
training data G is centered at the origin by removing the mean
g_mean. Then we estimate N', the embedded dimensionality of the
N-dimensional gestural training data. This step is essential
because after the dimensionality reduction we keep at most N’
dimensions, possibly fewer depending on user choice. For this
estimation we round the results of the correlation dimension
method [15] to the closest integer. For the dimensionality
reduction there are several possibilities, and experiments (e.g.
[16]) demonstrate that a specific technique outperforms others
only if certain characteristics are present in the data, while for
real-world data the most advanced techniques often give just a
small improvement over basic ones such as Principal Component
Analysis. Since we cannot make any assumptions about the
gestural data, we choose a Nonlinear Dimensionality Reduction
(NLDR) technique to handle complex cases and non-linear
manifolds. We
chose a convex NLDR technique because it produced
consistent results over different experimental iterations. We
tested several NLDR techniques on real gestural data sets, and
those that produced the best results were ISOMAP [17]
followed by LLE [18]. The results were measured in terms of
uniformity of data distribution across single dimensions, as well
as the overall GC user experience. However, the tests were
performed by integrating the Self-Organizing Gestures method
with our VCI, covered in the next section. It is certainly possible
that for different gestural data acquisition systems the optimal
NLDR might differ from those used here. After the NLDR, the
data on each dimension are normalized to the range [-1;1].
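The preprocessing pipeline of this subsection can be sketched in Python as follows. For self-containment the sketch substitutes a linear SVD/PCA projection for the paper's nonlinear ISOMAP/LLE step; in practice an NLDR implementation would replace that projection:

```python
import numpy as np

def preprocess(G, M):
    """Center G at the origin, reduce it to M dimensions (PCA stand-in
    for the paper's NLDR step), and normalize each dimension to [-1, 1]."""
    g_mean = G.mean(axis=0)
    Gc = G - g_mean
    # Linear projection onto the top-M components; the paper uses a
    # nonlinear technique (ISOMAP or LLE) here instead.
    _, _, Vt = np.linalg.svd(Gc, full_matrices=False)
    Y = Gc @ Vt[:M].T
    # Per-dimension normalization to the range [-1, 1].
    Y = 2 * (Y - Y.min(axis=0)) / (Y.max(axis=0) - Y.min(axis=0)) - 1
    return Y, g_mean
```

The returned g_mean is kept, since the same centering must be applied to new gestural vectors at run time.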
3.1.2 SOM Training
As mentioned above, the SOM output lattice dimensionality M
will be at most equal to N'. However, the GC output
dimensionality M is user-configurable. The NLDR technique ranks
the output dimensions, hence we simply discard the (N' − M)
lowest-ranked components to achieve the further reduction.
An excessive number of output nodes, often referred to as
output lattice resolution, is one of the causes for topology
distortion, especially of the local type. We derive the resolution
r from M and from the number of entries g in the gestural
training data G using a nonlinear relation as in (1). A high-
resolution implies a more complex model to be estimated, and
the requirement of more training data is a direct consequence.
Similarly, the number of SOM training iterations tMAX is related
to the model complexity as in (2).
r = round(1.5 · log_M(g))    (1)

t_MAX = M · g · r    (2)
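Equations (1) and (2) can be expressed directly, for instance in Python:

```python
import math

def lattice_resolution(M, g):
    # Eq. (1): r = round(1.5 * log_M(g)), resolution per lattice dimension.
    return round(1.5 * math.log(g, M))

def training_iterations(M, g, r):
    # Eq. (2): t_MAX = M * g * r, training iterations scale with model complexity.
    return M * g * r
```

For example, a 2-dimensional map trained on 1000 gestural entries gets a resolution of 15 nodes per dimension and 30000 training iterations.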
The SOM output lattice represents an M-dimensional
hypercube with r^M output nodes O_j distributed uniformly
(forming a grid), each associated with an M-dimensional weight
vector w_j. Before initializing the weight vectors w_j and training
the SOM, we search the gestural training data for 2^M points a_k
representing the extrema of the performer’s gestures that best
encompass the gestural space. The data is centered on the axis
origin, hence we perform the search with the following steps:
1. for each quadrant, set an equal-value boundary on
each axis, and increase it progressively reducing the
quadrant extension until only one point remains, and
compute the sum of the distances of the 2^M a_k points
from the origin and between themselves;
2. repeat the previous step performing a sequence of
fine angular full data rotation steps around the origin;
3. pick the rotation angle αopt and the related ak that
maximizes the sum of the distances.
The weights w_j are initialized by distributing them uniformly
through the hypercube inscribed in the dataset, while the
weights of the O_j located at the 2^M vertices are
initialized at the positions of the a_k. The SOM is trained for t_MAX
iterations picking a random point xrand from the dataset and
updating all the wj, as in equations (3) and (4)
w_j(t+1) = w_j(t) + µ(t) · Θ(t, j, z) · (x_rand − w_j(t))    (3)

Θ(t, j, z) = exp( −‖O_z − O_j‖² / σ(t)² )    (4)
where µ(t) is the linearly decreasing learning rate, Θ(t,j,z) is the
neighborhood function described in (4), z is the index of the
output node Oz with the related weight closest to xrand. In (4)
the numerator of the exponential is the squared Euclidean
distance of the output nodes in the output M-dimensional grid,
σ(t) is the linearly decreasing neighborhood parameter
representing the attraction between the output nodes. We apply
the following modification to the standard training algorithm:
the O_j at the 2^M vertices are updated more slowly
than the rest of the nodes, using half of the
learning rate µ(t);
the random point selection is biased in favor of the a_k,
forcing a weight update using the 2^M a_k every time
10% of t_MAX has elapsed, and using half of the
attraction σ(t) for the O_j at the 2^M vertices.
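The modified training loop can be sketched as follows (Python with NumPy; the paper's implementation is in MATLAB). The sketch applies the halved learning rate at the vertex nodes and the periodic forced updates with the a_k, but omits the halved attraction σ(t) for the vertices during those forced updates:

```python
import numpy as np

def train_som(X, W, grid, vertex_ids, a_k, t_max,
              mu=(0.5, 0.01), sigma=(1.5, 0.5), seed=0):
    """Modified SOM training sketch. W: node weights, grid: node positions
    in the M-dimensional output lattice, vertex_ids: the 2^M lattice
    vertices, a_k: the gestural extrema found beforehand."""
    rng = np.random.default_rng(seed)
    for t in range(t_max):
        mu_t = mu[0] + (mu[1] - mu[0]) * t / t_max        # linearly decaying rate
        sig_t = sigma[0] + (sigma[1] - sigma[0]) * t / t_max
        if t % max(1, t_max // 10) == 0:
            xs = a_k                                       # forced extrema updates
        else:
            xs = [X[rng.integers(len(X))]]                 # random training point
        for x in xs:
            z = np.argmin(np.linalg.norm(W - x, axis=1))   # winner node index
            d2 = np.sum((grid - grid[z]) ** 2, axis=1)     # squared lattice distance
            theta = np.exp(-d2 / sig_t ** 2)               # eq. (4)
            rate = mu_t * theta
            rate[vertex_ids] *= 0.5                        # vertices move half speed
            W += rate[:, None] * (x - W)                   # eq. (3)
    return W
```

The initial and final values of µ and σ correspond to the ranges reported below for the vocal gestural data.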
At the end of the training process each output node Oj is
associated with a number Cj counting how many entries in the
training gestural data are associated with it (for which it is the
nearest node). Additionally we embed information about the
temporal unfolding of G in the lattice output nodes. This is
used in one of the operative modes described in the next
subsection. Since we are running a large number of training
iterations, we choose a relatively small value for the initial
learning rate µ(t0). To avoid local topology distortions, we set
the final value of the attraction σ(tMAX) to a relatively large
value, and a µ(tMAX) at least one order of magnitude smaller
than the initial one. In particular, values producing good results
for our vocal gestural data cases are: µ=[0.5,0.01] and
σ=[1.5,0.5]. Other gestural data acquisition systems may
require different parameter values to obtain a well-trained
system, but the selection strategy is still valid.
The idea behind the modification of the conventional training
is to avoid the topology distortions mentioned above, and at the
same time obtain a better overlap between the weights of the
SOM output nodes and the gestural data. In particular, we
focused on pulling the lattice from its vertices, stretching it
toward the gestural extrema, achieving at the same time lattice
edges enclosing the M-dimensional gestural data subspace with
higher fidelity. Moreover, the SOM training data is also used to
find the M-dimensional convex hull that encloses all the vectors
in G. The convex hull defines the valid region for new
incoming vectors during the real-time operation, so that the GC
does not produce an output if the vector is outside the region. In
Figure 1 we present two examples of SOM training on gestural
data, with M=2 and M=3, showing the training data (after
preprocessing), the weights of the SOM nodes at initialization
and after training, node mass Cj and node neighborhood based
on gestural data temporal unfolding. The weights of the SOM
nodes are represented in the normalized M-dimensional training
data space.
The preprocessing and training processes presented are
completely unsupervised. In case the user wants to explicitly
define the 2^M gestural extrema to be associated with the SOM
lattice vertices, we use a supervised dimensionality reduction
technique in the preprocessing stage. This requires an
additional training data set E that contains several instances of
labeled gestural extrema. We use a multiclass Linear
Discriminant Analysis (LDA) for the N to M dimensionality
reduction, which maximizes the between-class variance while
minimizing the within-class variance. The LDA transformation
is learned from the labeled data in E, and it replaces ISOMAP.
The LDA dimensionality reduction is then applied to G and to
the 2^M class centroids in E, representing the gestural extrema.
The rest of the preprocessing and SOM training procedure is
unchanged. With LDA, when projecting the gestural data into
the lower-dimensional space, the extrema are located
somewhere along the boundaries of the enclosing convex hull,
ensuring SOM topological preservation when associating the
extrema with the vertices of the output lattice.
3.2 GC Operational Modes
The SOM-based GC is implemented by applying the bijective
transformation f, composed of a series of steps in which we
apply the processing learned from G. For a new gestural data
vector d, we obtain the output vector p as follows:
1. subtract the mean gmean from d;
2. apply the NLDR and truncate the vector to the top M
dimensions if M is smaller than N’;
3. normalize each dimension to the range [-1;1];
4. rotate the vector by the angle αopt;
5. verify if the vector is within the valid convex hull;
6. find the closest wj to the obtained d* and output the
related Oj normalized grid indexes vector p.
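The six runtime steps above can be sketched as one function; the `model` dictionary and its field names are hypothetical containers for everything learned offline:

```python
import numpy as np

def gc_output(d, model):
    """Apply the learned transformation f to a new gestural vector d.
    `model` is assumed to hold: g_mean, a projection callable `nldr`,
    per-dimension min/max (lo, hi), rotation matrix R for alpha_opt,
    a convex-hull test `in_hull`, node weights W and grid positions."""
    x = d - model['g_mean']                                       # step 1
    x = model['nldr'](x)[:model['M']]                             # step 2
    x = 2 * (x - model['lo']) / (model['hi'] - model['lo']) - 1   # step 3
    x = model['R'] @ x                                            # step 4
    if not model['in_hull'](x):                                   # step 5
        return None          # outside the valid region: no output produced
    j = np.argmin(np.linalg.norm(model['W'] - x, axis=1))         # step 6
    return model['grid'][j]  # normalized grid indexes vector p
```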
In Figure 2 we summarize the learning process and the GC
processing steps. The operations in the square boxes within the
learning process learn and apply a transformation from G.
Those in the rounded boxes within the GC simply apply the
transformation based on the information learned.
The GC output p contains the indexes of the output node,
which is its position in the M-dimensional lattice. Usually the
indexes are integers, but we normalize them to the
range [0,1]. The resolution r in (1) directly defines the
resolution of the M dimensions of p, which are based upon a
user defined mapping strategy and used to drive DMI
parameters. If r is small, we use the Inverse Distance
Weighting (IDW) interpolation between the 2^M closest w_j to
provide virtually infinite output resolution. The SOM output
lattice can be used in more complex manners to implement
variations of the GC which may suit different instrument
interface philosophies.
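A minimal sketch of the IDW interpolation over the k nearest node weights (here k would be 2^M), blending their normalized grid positions P into a continuous output:

```python
import numpy as np

def idw_output(x, W, P, k, eps=1e-9):
    """Inverse Distance Weighting: blend the normalized grid positions P
    of the k node weights in W closest to x, weighted by inverse distance,
    to provide virtually continuous output on a coarse lattice."""
    d = np.linalg.norm(W - x, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + eps)          # inverse-distance weights (eps avoids /0)
    return (w[:, None] * P[idx]).sum(axis=0) / w.sum()
```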
To obtain a smoother d-to-p response, it is possible to
force the output to follow a connected path through the lattice
Moore neighborhood. The search for the w_j closest to d*(t) is
performed only on the 3^M neighbors of the O_j closest to d*(t-1),
including itself. If d(t-1) and d(t) are far apart due to a quick
performer gesture, the GC responds more slowly. This can also
help to compensate for potential performer error in actuating
the interface, or noise in the gestural acquisition system. If
d(t+1) is again close to d(t-1), such big variations in the
gestural data space will produce only a small perturbation in the
related temporal sequence of p.
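The Moore-neighborhood-constrained search can be sketched as follows, using each node's integer lattice coordinates and a Chebyshev distance of at most 1 from the previous winner:

```python
import numpy as np

def moore_step(x, W, grid_idx, prev_j):
    """Restrict the winner search to the 3^M Moore neighborhood (itself
    included) of the previous winner prev_j. grid_idx holds each node's
    integer lattice coordinates."""
    diff = np.abs(grid_idx - grid_idx[prev_j])
    neigh = np.where(np.all(diff <= 1, axis=1))[0]   # Chebyshev distance <= 1
    return int(neigh[np.argmin(np.linalg.norm(W[neigh] - x, axis=1))])
```

A distant input can then move the output by at most one lattice step per frame, which is what slows the response to quick gestures and damps noise.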
Figure 1. 2D and 3D lattice examples of an SOM trained with the proposed algorithm with different gestural data; a) gestural
training data; b) weight initialization; c) trained lattice output weights; d) trained lattice output weights with mass mapped to
node diameter; e) trained lattice with output node neighborhood based on gestural data temporal unfolding.
Figure 2. Learning Process and Gestural Controller functional schemes.
The SOM algorithm is supposed to organize the output node
weights also according to the training data density. Denser
areas will have more output nodes and vice versa, which ideally
means equal C_j across all the output nodes. However, in real
cases some minor differences may still be present. We can
consider the Cj as a mass value associated with Oj and replace
the Euclidean distance with a gravity force in the
neighborhood search as in (5)
F_j = G_const · C_j / ‖w_j − d*‖²    (5)
where Gconst represents a user-defined gravitational constant,
and we can omit the mass of d* because it will just scale
equally across all the Fj. With this approach, independently of
the neighborhood search space definition, the winner Oj is the
one giving the strongest gravitational attraction. The GC
response perturbed by the mass of the output nodes is non-
linear, presenting hills and valleys where a different
amount of gestural energy is required to change position.
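The winner selection of equation (5) can be sketched as:

```python
import numpy as np

def gravity_winner(x, W, C, G_const=1.0, eps=1e-9):
    """Eq. (5): pick the node whose gravitational pull
    F_j = G_const * C_j / ||w_j - x||^2 is strongest, so heavily visited
    nodes (large mass C_j) attract the output from further away."""
    d2 = np.sum((W - x) ** 2, axis=1) + eps   # eps guards exact coincidence
    return int(np.argmax(G_const * C / d2))
```

With uniform masses this reduces to the ordinary nearest-neighbor search; non-uniform masses produce the hills and valleys described above.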
So far, we have used the SOM-based GC with the global
gestural information learned from G. Therefore the system will
also respond to gestures different from those used for training,
interpreting them in terms of their multidimensional shape and
density in G. At the end of the learning process we find the set
of possible next nodes for each lattice output node, analyzing
the sequence of entries in G forward and backward. We can use
the possible next nodes associated to each Oj as a neighborhood
search subspace. In this way, we obtain a different operational
modality where the GC responds only to gestures consistent
with those used in training. This modality introduces strong
constraints between the GC input and output, and it is not used
in our VCI system.
4. VCI INTEGRATION & EVALUATION
In this section we discuss the Self-Organizing Gestures user
perspective. We briefly describe the integration with our VCI
system, and we present some numerical results. The proposed
method is completely unsupervised and allows the user to
obtain the GC transformation f by providing only some gestural
examples. Because the method is unsupervised, the implication
is that users have to learn the resulting gestural organization in
relation to the GC output. We provide two computationally
inexpensive possibilities for allowing the user to modify the
system response in real time. The first is the inversion of the
[0,1] range of any individual GC output to [1,0]; the second is
the application of a global scaling value to the w_j, expanding or
shrinking the output lattice in relation to the GC input space
while maintaining the same topology.
We have implemented the learning process of Section 3.1
using a set of MATLAB functions with a simple interface. The
gestural training data is read from a file, and imported into a
single matrix. During the learning process it provides some
intermediate text information and graphically shows the SOM
learning, plotting and updating the w_j grid over the G
data after the dimensionality reduction. Plotting is supported for
M less than 4. The learning function returns a single structure
including all the data necessary for the various operational
modes described in 3.2. For the NLDR we use the
Dimensionality Reduction Toolbox1. The real-time GC
operative modes are implemented in a separate MATLAB
function, which exchanges data through the Open Sound Control
(OSC) protocol, receiving the gestural data d and sending out
p. The real-time GC is able to change its behavior at runtime in
response to specific OSC messages. It is possible to change the
operative modality, invert the output ranges, and modify the
global scale factor for the w_j.

1 http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
The integration with our VCI MAX/Msp system [6] is
established through the OSC communication for the real time
part, while the gestural training data is collected into a matrix in
MAX/Msp and exported to a text file using FTM [19]. In the
VCI prototype we implemented a larger vocal feature
computation and a system for noisy feature rejection based on
vocal postures, as described in [20]. A mixture of MFCC and
PLP features represents the system’s gestural data. Even though
we tested the different operative modes described in 3.2, in the
evaluation experiments we used a search space limited to
the Moore neighborhood of the previous output node, which is
the modality that best suits a voice-driven GC.
The evaluation data presented here considers only the GC
performance. We probe the measurement data before our DMI
mapping system [21], which is integrated in the VCI. We ran
two categories of tests using 10 different vocal gestural training
data G. In the first category we verified the performance of the
learning process. For each G we trained two equivalent
SOMs 20 times, one using our proposed method and the other
one using the standard SOM method. We compared the two by
measuring the frequency of global topology (twisting, folding,
curling) and local topology distortions. These results are
presented in Table 1, reporting the average over the 10 cases.
In the second category of tests we gave the user 3 minutes to
learn and fine-tune the SOM-based GC, and a second GC
obtained with our previous VCI approach having equal
dimensionality and output resolution. The user was then asked
to perform two tasks. The first was to cover the highest number
of GC output combinations possible (each lattice output node
represents one combination) within 1 minute; the other was to
maintain the output at 2^M + 1 key positions (vertices and center)
for 5 seconds. We then measured the output stability in terms of
standard deviation. The results are presented in Table 2,
showing the average over the 10 cases. In these tests we
refrained from mapping the GC output to any DMI in order to
focus attention on the output signal itself that will be used for
mapping. The presence of sonic feedback can ease some of the
tasks, thereby biasing the measurements. Performance might
also have been influenced by the nonlinearities in the DMI
parameters-to-sound response. Thus the only feedback provided
in the user testing was the M on-screen continuous sliders and
an M-dimensional grid. For these experiments, the G
dimensionality N varies case by case between 7 and 28, while
the N’ is usually either 2 or 3, typical for vocal gestural data.
Table 1. Topology distortion performance comparison between the standard and the proposed learning process.

Test                          | Standard Training SOM | Modified Training SOM
% Global Topology Distortion  | 63.5%                 | 2.5%
% Local Topology Distortion   | 33%                   | 6%
Table 2. User task comparison between the VCI GC and the Self-Organizing Gestures (SOG) based VCI GC.

Task                                                | VCI GC | SOG-VCI GC
% Covered Output Combinations                       | 74.3%  | 96.5%
Key Positions Output Stability (standard deviation) | 0.14   | 0.06
Table 1 shows the improvements in preserving the topology
of the gestural training data. For the local distortion a negative
mark is given for any number of output nodes not preserving
the topology. In the 6% of cases showing local distortion
obtained with our learning process there is usually only one
node affected by a small topology distortion, resulting in only a
slight impact in the GC response. For the standard learning
process, when a local distortion is detected it usually affects a
large number of output nodes. Moreover, we observed that
global and local distortions are usually not simultaneous,
therefore distortion-free topology cases are below 10% for the
standard SOM training.
Table 2 shows better results on both tasks than our previous approach.
Thanks to the SOM-based geometrical transformation we obtain a
multidimensional GC where the percentage of reachable output
combinations is close to 100%, whereas in our previous approach
constraints across different dimensions resulted in forbidden regions.
For the cases with M=2 we always obtained 100% on the first task. We
also tried other existing methods to spread the gestural training data
coherently and uniformly throughout the M-dimensional hypercube, using
different ANNs to learn the transformation function; in more than half
of the 10 cases considered here, the ANN was unable to learn the
transformation function. ANN learning performance was measured as the
mean squared error on the gestural training data.
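At run time, the trained SOM acts purely as a geometric transformation:
a gestural sample is assigned to its best-matching unit (BMU), and the
BMU's lattice coordinates, normalized to the unit hypercube, form the
M-dimensional GC output. The sketch below illustrates this idea; the
function names and the toy lattice are assumptions for illustration, not
the paper's implementation.

```python
# Hypothetical sketch: using a trained SOM lattice as a geometric
# transformation from gestural samples to normalized GC outputs.

from itertools import product
from math import dist

def gc_output(sample, weights, shape):
    """Map `sample` to the normalized lattice coordinates of its BMU."""
    bmu = min(weights, key=lambda node: dist(sample, weights[node]))
    return tuple(c / (s - 1) for c, s in zip(bmu, shape))

# Toy 3x3 lattice spanning the gestural region [0, 4] x [0, 4]:
shape = (3, 3)
weights = {(i, j): (2.0 * i, 2.0 * j) for i, j in product(range(3), range(3))}

print(gc_output((0.1, 3.9), weights, shape))  # (0.0, 1.0)
```

Because the lattice nodes adapt to the density of the training data, a
uniform sweep over the lattice corresponds to a sweep over the regions
the gesture actually visits, which is what yields the near-complete
output coverage reported in Table 2.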
5. CONCLUSIONS AND FUTURE WORK
We presented an unsupervised method to obtain a Gestural Controller
from a set of gesture examples, which converts an arbitrarily
distributed N-dimensional manifold into a uniformly distributed
M-dimensional hypercube representing the GC output. Beyond the
performance reported in the tables above, the method is highly
consistent across different instances of the learning process,
producing extremely similar GCs. We developed this method to improve
our VCI, but it is generic enough to be employed with any gestural data
acquisition system. The various operational modes described in Section
3.2 will be integrated in Max/MSP to remove the MATLAB dependency for
the real-time component. Current work is also user-design oriented,
studying and evaluating aspects of the complete integrated VCI beyond
the numerical measurements. The visualizations of the training and
real-time data ease the learning curve for the user, but they remain an
issue when working with M larger than 3.
6. REFERENCES
[1] J. B. Rovan, M. M. Wanderley, S. Dubnov, and P. Depalle,
"Instrumental Gestural Mapping Strategies as Expressivity Determinants
in Computer Music Performance," in Proceedings of Kansei - The
Technology of Emotions Workshop, Genoa, Italy, 1997.
[2] A. Hunt and R. Kirk, "Mapping strategies for musical performance,"
in Trends in Gestural Control of Music, M. Wanderley and M. Battier,
Eds. Paris, France: Institut de Recherche et Coordination
Acoustique/Musique, Centre Pompidou, vol. 21, pp. 231-258, 2000.
[3] N. Orio, N. Schnell, and M. M. Wanderley, "Input devices for
musical expression: borrowing tools from HCI," in Proceedings of the
2001 conference on New Interfaces for Musical Expression, Seattle,
USA, 2001, pp. 1-4.
[4] M. M. Wanderley and P. Depalle, "Gestural Control of Sound
Synthesis," Proceedings of the IEEE, vol. 92, no. 4, pp. 632-644, 2004.
[5] A. Hunt, M. Wanderley, and R. Kirk, "Towards a model for
instrumental mapping in expert musical interaction," in Proceedings of
the 2000 International Computer Music Conference, Berlin, Germany,
2000, pp. 209-212.
[6] S. Fasciani and L. Wyse, "A Voice Interface for Sound Generators:
adaptive and automatic mapping of gestures to sound," in Proceedings
of the 2012 conference on New Interfaces for Musical Expression, Ann
Arbor, MI, USA, 2012.
[7] T. Kohonen, "Self-organized formation of topologically correct
feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59-69, Jan.
1982.
[8] M. Lee and D. Wessel, "Connectionist models for real-time control
of synthesis and compositional algorithms," in Proceedings of the 1992
International Computer Music Conference, San Jose, CA, 1992.
[9] R. A. Fiebrink, "Real-time Human Interaction with Supervised
Learning Algorithms for Music Composition and Performance," Ph.D.
Thesis, Princeton University, 2011.
[10] G. Odowichuk and G. Tzanetakis, "Browsing Music and Sound Using
Gestures in a Self-Organized 3D Space," in Proceedings of the 2012
International Computer Music Conference, Ljubljana, Slovenia, 2012.
[11] D. Stowell, "Making music through real-time voice timbre
analysis: machine learning and timbral control," Ph.D. Thesis, Queen
Mary University of London, 2010.
[12] K. Kiviluoto, "Topology preservation in self-organizing maps," in
IEEE International Conference on Neural Networks, Piscataway, NJ, USA,
1996, vol. 1, pp. 294-299.
[13] H. Guan, R. S. Feris, and M. Turk, "The isometric self-organizing
map for 3D hand pose estimation," in 7th International Conference on
Automatic Face and Gesture Recognition, Southampton, UK, 2006, pp.
263-268.
[14] P. Grassberger and I. Procaccia, "Measuring the strangeness of
strange attractors," Physica D: Nonlinear Phenomena, vol. 9, no. 1-2,
pp. 189-208, 1983.
[15] L. van der Maaten, E. Postma, and H. van den Herik,
"Dimensionality Reduction: A Comparative Review," Tilburg University
Technical Report, 2009.
[16] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global
geometric framework for nonlinear dimensionality reduction," Science,
vol. 290, no. 5500, pp. 2319-2323, 2000.
[17] S. T. Roweis and L. K. Saul, "Nonlinear Dimensionality Reduction
by Locally Linear Embedding," Science, vol. 290, no. 5500, pp.
2323-2326, 2000.
[18] N. Schnell, R. Borghesi, D. Schwarz, F. Bevilacqua, and R.
Muller, "FTM - Complex Data Structures for Max," in Proceedings of the
2005 International Computer Music Conference, Barcelona, Spain, 2005.
[19] S. Fasciani, "Voice Features For Control: A Vocalist Dependent
Method For Noise Measurement And Independent Signals Computation," in
Proceedings of the 15th International Conference on Digital Audio
Effects, York, UK, 2012.
[20] S. Fasciani and L. Wyse, "Adapting general purpose interfaces to
synthesis engines using unsupervised dimensionality reduction
techniques and inverse mapping from features to parameters," in
Proceedings of the 2012 International Computer Music Conference,
Ljubljana, Slovenia, 2012.