Proc. of the 15th Int. Conference on Digital Audio Effects (DAFx-12), York, UK, September 17-21, 2012
VOICE FEATURES FOR CONTROL: A VOCALIST DEPENDENT METHOD FOR NOISE
MEASUREMENT AND INDEPENDENT SIGNALS COMPUTATION
Stefano Fasciani
Graduate School for Integrative Sciences & Engineering
Arts and Creativity Laboratory, Interactive and Digital Media Institute
National University of Singapore
stefano17@nus.edu.sg
ABSTRACT
Information about the human spoken and singing voice is conveyed through the articulations of the individual’s vocal folds and vocal tract. The signal receiver, either human or machine, works at different levels of abstraction to extract and interpret only the relevant context specific information needed. Traditionally in the field of human machine interaction, the human voice is used to drive and control events that are discrete in terms of time and value. We propose to use the voice as a source of real-valued and time-continuous control signals that can be employed to interact with any multidimensional human-controllable device in real-time. The isolation of noise sources and the independence of the control dimensions play a central role. Their dependency on individual voice represents an additional challenge. In this paper we introduce a method to compute case specific independent signals from the vocal sound, together with an individual study of features computation and selection for noise rejection.
1. INTRODUCTION
The human voice is an extremely flexible sound generation mechanism and we use it primarily to transfer different categories of information through acoustic communication. The human brain interprets and understands it transparently, while for machines the task is challenging. The processing of the vocal signal, in most human machine application domains, begins with a stage of feature computation to obtain a compact representation of the audio signal, and is followed by the application of one or more statistical models to decode information. Other than well established and commonly used applications such as speech recognition, speaker identification, voice detection, music information retrieval (querying by voice) and voice transformation, there are more recent application domains in which the interaction is established at sub-verbal level.
The resulting interaction, when working directly with low level features of the voice, is more direct and immediate [1]. The “vocal joystick”, presented in [2] and studied in [3], computes energy, pitch and vowel quality (vowel recognition) to provide a 2-D pointer navigation system, underlining the importance of independence in vocal features selection. Extensions of this work are “VoiceDraw” [4] and the “VoiceBot” [5], where a similar technique is applied to screen drawing and to the manipulation of a robotic arm with 5 degrees of freedom. Motor-impaired subjects see these voice controlled interfaces as an accessible system, while others see them as a hands free extension to traditional controllers. However, the works mentioned above, even if working at sub-verbal level, still present limitations due to the presence of classifiers driving discrete events.
In [6] and [7] a richer set of low level vocal features, including the MFCC, is used to build a more complex voice controlled interface for a wah-wah pedal and for an audio mosaicing synthesizer respectively. The Gesture Follower¹ and the Wekinator [8] are two machine learning based systems for mapping generic human gestures to real-valued continuous parameters of any controllable device. The latter two are not designed to control any particular class of devices, and they can implement any kind of sub-verbal interface by computing low-level vocal features as a source of gestural input; however, noise issues and independence of the control dimensions remain largely unaddressed.
In this paper we present a generic method to analyze the low-
level features of the voice with the aim of improving the robust-
ness and the control capabilities of any vocal control system that
does not make use of classification techniques. The method can
be applied indiscriminately on spoken voice, singing voice and
pure sub-verbal sounds. We developed this generic technique to
improve the control capabilities of the sub-verbal vocal interface
for digital musical instruments that we proposed in [9].
The characteristics of the vocal folds and the vocal tract pre-
sent high variability among different speakers, as do speaking or
singing style and different individual native languages. The pro-
posed method minimizes the noise and maximizes the independ-
ence of the computed control signals over specific performances
of individual vocalists, rather than attempt to provide vocalist-
independent study results, which sacrifice optimization for generalization. In Section 2 we present the method for noise features rejection and independence measurements starting from the computation of a large feature set. Experimental results, on different vocalists in different environmental conditions, are described in Section 3. Conclusions and usability issues are discussed in Section 4.
2. NOISE AND INDEPENDENCE MEASUREMENT
Within this work we adapt some terminology from the non-verbal communication literature, defining: “vocal posture” as the action of uttering sound with invariant characteristics over time, and “vocal gesture” as the action of uttering sound with characteristics varying over time. We ground the study of noise and independence for the low level features computed from the vocal audio signal on two generic requirements for the sub-verbal interface:
• The control signals computed over a vocal posture must have constant values (static behaviour).
• The control signals computed over a vocal gesture must not be redundant (dynamic behaviour).
1 http://ftm.ircam.fr/index.php/Gesture_Follower
The absence of voice can be considered as a special case of vocal
posture. The inhibition of the control system when no voice is
present at the input can be considered as a further requirement.
There are techniques described in the literature which can be ex-
ploited for this purpose that are capable of detecting voice even
in challenging conditions such as the concurrent presence of
voice and music [10] [11].
The main issue related to vocal postures is the presence of
noise. Even though a vocalist has the perception of uttering in-
variant sound over time, such as sustained vowels, some low-
level features may still have a significant variance. Features af-
fected by this noise are different across individuals. Therefore we
define noisy features as those with a statistical dispersion above a
certain threshold when computed over a set of vocal postures
from a specific vocalist. An arbitrary number of control signals computed from a vocal gesture (selection or transformations of the low-level features) should present statistical independence, or at least low correlation. If not, the multiple control dimensions, presumably mapped over different device parameters, would vary similarly to one another, providing a trivial control system.
The vocal folds and the vocal tract can be approximated with a
source-filter system. These two systems are independent and
within each system there are independent sub-components as
well: energy and pitch in the vocal folds, and the first two for-
mants frequencies in the vocal tract representing the vowel space,
just to mention a few. But given a specific vocal gesture, this in-
dependence assumption may not be valid anymore. In our ap-
proach, we do not use any prior knowledge about the independ-
ence of the vocal features, but we perform a posterior study,
based on a vocalist-specific gesture (speech, singing or sub-
verbal sounds) using a method to find independence from the
non-noisy features retained in the system. In [12] Stowell and Plumbley present a study on the degradation and independence of voice timbre features subjected to acoustic degradations, providing a sort of feature ranking for general purpose usage. They provide vocalist-independent results, while defining three different vocal categories: singing, speaking and beatboxing. As mentioned above, in our approach the study is based on a specific individual voice and a specific performance, without introducing any categorization. We extend the study to a larger feature set and we explore several combinations of computing parameters. Moreover, the key concept of feature robustness is substantially different from previous work, because it addresses a specific feature computation purpose, which is the generation of robust, independent, time and value continuous control signals.
2.1. Parametric Low Level Features Computation
Since we assume no prior knowledge about the vocalist’s voice
characteristics and the kind of vocal gesture used for control pur-
poses, we initially compute a large set of features, including all
those commonly used in speech processing applications. This
feature set may present high redundancy, but we leave the feature
selection to a following stage. The features computed are:
• Energy;
• Pitch;
• Linear Predictive Coding coefficients (LPC);
• Mel Frequency Cepstrum Coefficients (MFCC);
• Perceptual Linear Predictive coefficients (PLP) [13];
• RelAtive SpecTrAl Perceptual Linear Predictive coefficients (RASTA-PLP) [14];
• Delta coefficients;
• Delta-delta (acceleration) coefficients.
The computation and the post processing of the low level features are implemented in MATLAB. Pitch, LPC and MFCC are computed using the Voicebox² package; PLP and RASTA-PLP are computed using the Rastamat³ package. Within the feature computation process there are several parameters to choose. These affect aspects of the eventual real-time voice controlled interface, such as the latency, the time resolution and the computational cost. At the same time they may affect the noise and independence of the control signals computed from the voice. Instead of choosing fixed parameters we perform a systematic study, testing different combinations of features computation parameters and picking the one resulting in the best performance. It has been previously shown [15] that even in a different application domain such as speaker independent speech recognition, the optimal performance is obtained for different computation parameters once the nature of the feature vector is fixed. In this study, two quality indicators measure the performance: one is computed within the noise detection and rejection phase, the other one after the independence analysis.
The parameters we expose to variation within the features
computation are:
• Window size;
• Window overlap;
• Pre-emphasis;
• Order of the various features vectors.
For the window size, we explore the range from 128 to 2048, considering only the powers of two. For the window overlap the tested values are 25%, 50% and 75%. There are three values for the pre-emphasis: high (0.97=H), mid (0.485=M) and zero (0.0=Z). By order we mean the number of LPC, MFCC, PLP and RASTA-PLP coefficients computed, corresponding to the number of spectral sub-bands for the cepstral coefficients. We vary the order in the range 8 to 16 with a step of 2. We discard the first LPC coefficient because it is constant and the first MFCC and PLP coefficients because they are redundant with the energy. Therefore the number of computed features depends on the order and is equal to:
$$\dim(f) = 3\,\big((\mathrm{order} \times 4) + 3\big) \qquad (1)$$
where f represents the feature vector computed for each window of the vocal audio signal. The feature computation described above leads to 225 different combinations to optimize over. Wider parameter ranges with a finer step help to find a solution closer to the absolute optimum. Since the optimal parameters depend on the vocalist and the specific gesture, in this paper we aim to present a methodology rather than derive parameters that work across general cases.
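As a rough illustration of the size of this search space, the following Python sketch enumerates the parameter combinations and evaluates the feature vector size. It assumes that the window sizes are the powers of two from 128 to 2048 and that (1) reads as 3 times (4 times the order plus 3). The names `parameter_grid` and `feature_dim` are introduced here for illustration only and are not part of the original MATLAB implementation.

```python
from itertools import product

# Parameter values as listed in the text; the 2048 upper bound is an assumption
# based on "powers of two" and on the window sizes reported in the tables.
WINDOW_SIZES = [128, 256, 512, 1024, 2048]         # samples
OVERLAPS = [0.25, 0.50, 0.75]                      # fraction of the window
PRE_EMPHASIS = {"H": 0.97, "M": 0.485, "Z": 0.0}   # label -> coefficient
ORDERS = [8, 10, 12, 14, 16]


def feature_dim(order):
    """Size of the feature vector f for a given order, as read from (1)."""
    return 3 * (4 * order + 3)


def parameter_grid():
    """Return every (window, overlap, pre-emphasis label, order) combination."""
    return list(product(WINDOW_SIZES, OVERLAPS, PRE_EMPHASIS, ORDERS))


if __name__ == "__main__":
    print(len(parameter_grid()))                 # number of combinations
    print([feature_dim(o) for o in ORDERS])      # feature vector sizes
```

The grid has 5 x 3 x 3 x 5 entries, which agrees with the 225 combinations mentioned in the text.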
The audio sampling rate is a flexible parameter in the system; however, for the experiment described in this paper it has been fixed to 16 kHz. Such a low audio sampling rate, common in speech processing applications, is a trade-off between low computational cost and loss of information at higher frequencies. In a real sub-verbal interface implementation, a low computational complexity of the processing chain is desirable, because the control signals are generated from the vocal signal in real-time. Even if most of the energy is concentrated below the 8 kHz Nyquist frequency, singing and speech may have frequency components up to 20 kHz. Historical and physical reasons for neglecting the band above 8 kHz in speech processing applications are discussed in [16], where the author investigates its audibility and perceptual significance.
2 http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
3 http://labrosa.ee.columbia.edu/matlab/rastamat/
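To make the link between window parameters, latency and time resolution concrete, the sketch below converts a window size and overlap into durations at the 16 kHz sampling rate used in the experiments. The function name `frame_timing` is hypothetical, and treating the window duration as a lower bound on latency is a simplifying assumption of this example rather than a statement from the paper.

```python
def frame_timing(window_size, overlap, sample_rate=16000):
    """Return (window duration, hop duration) in milliseconds.

    window_size is in samples, overlap is a fraction between 0 and 1.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    hop = window_size * (1.0 - overlap)            # samples between frames
    window_ms = 1000.0 * window_size / sample_rate
    hop_ms = 1000.0 * hop / sample_rate
    return window_ms, hop_ms


if __name__ == "__main__":
    for size in (128, 2048):
        for overlap in (0.25, 0.75):
            print(size, overlap, frame_timing(size, overlap))
```

Under these assumptions a 2048-sample window spans 128 ms and, with 75% overlap, produces a new control value every 32 ms, whereas a 128-sample window spans only 8 ms.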
2.2. Robust Features Selection Based On Noise Measurement
We mark a feature as noisy, and thus we discard it, if its statistical dispersion, computed over a perceptually invariant vocal posture, is above a certain threshold. As a measurement of statistical dispersion we use the Relative Mean Difference (RMD) (2), because it is scale invariant.
$$\mathrm{RMD} = \left( (n-1) \left| \sum_{i=1}^{n} x_i \right| \right)^{-1} \sum_{i=1}^{n} \sum_{j=1}^{n} \left| x_i - x_j \right| \qquad (2)$$
In (2) n represents the number of samples, which in our case corresponds to the number of analysis windows over a single vocal posture, while x represents the feature in question. The first summation in (2) comes from the arithmetic mean and taking its absolute value produces only positive RMD values, facilitating the subsequent operations. Given a database of voice recordings, each containing different vocal postures by an individual vocalist, we compute the low level features. For each feature we compute the RMD over each recording, and we compute the average of the RMDs to measure the statistical dispersion over the whole database. Features with the average RMD over the threshold are marked as noisy and thus discarded.
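The noise measurement described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the RMD is the sum of all pairwise absolute differences divided by (n-1) times the absolute value of the sum of the samples, and using the rejection threshold of 0.5 reported later in the experiments. The function names are hypothetical.

```python
def relative_mean_difference(x):
    """RMD of one feature track x (one value per analysis window)."""
    n = len(x)
    if n < 2:
        raise ValueError("at least two samples are required")
    total = abs(sum(x))
    if total == 0:
        raise ValueError("RMD is undefined when the samples sum to zero")
    pairwise = sum(abs(a - b) for a in x for b in x)
    return pairwise / ((n - 1) * total)


def average_rmd(recordings):
    """Average the per-recording RMD of one feature over a posture database."""
    return sum(relative_mean_difference(r) for r in recordings) / len(recordings)


def is_noisy(recordings, threshold=0.5):
    """A feature is marked as noisy when its average RMD exceeds the threshold."""
    return average_rmd(recordings) > threshold


if __name__ == "__main__":
    steady = [[10.0, 10.5, 9.5], [20.0, 20.0, 20.0]]   # nearly constant feature
    erratic = [[1.0, 2.0, 3.0], [0.5, 4.0, 1.0]]       # strongly varying feature
    print(is_noisy(steady), is_noisy(erratic))
```

Because the measure is a ratio, rescaling a feature by a constant factor leaves its RMD unchanged, which is the scale invariance mentioned in the text.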
We ran this study for all the possible combinations of the feature computation parameters. As a Robustness Quality Measure (RQM) of every combination we choose the inverse of the average RMD of all the non-rejected features, normalized by the number of non-noisy features (3),
$$\mathrm{RQM} = \left( \frac{E[\mathrm{RMD}]}{|\mathrm{feat.}|} \right)^{-1} \qquad (3)$$
where E[RMD] represents the average of the RMD of the robust features, and |feat.| represents the number of robust features. In (3) we promote cases in which the number of non-rejected features is higher because it may potentially increase the quantity of independent information in the subsequent study. In (3) we compute the inverse to obtain an RQM value growing with the overall quality. This method addresses the individual inability to utter sustained invariant sounds, called vocal postures in this work, even when having the subjective perception of doing so.
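A small sketch of the robustness quality measure follows. It assumes that (3) is the number of robust features divided by their average RMD, and that a feature is robust when its average RMD does not exceed the threshold. Returning 0 when no feature survives is a choice made for this example only.

```python
def robustness_quality(average_rmds, threshold=0.5):
    """RQM for one parameter combination.

    average_rmds holds one database-averaged RMD value per feature.
    """
    robust = [r for r in average_rmds if r <= threshold]
    if not robust:
        return 0.0                      # assumption: no robust feature, no quality
    mean_rmd = sum(robust) / len(robust)
    if mean_rmd == 0:
        return float("inf")             # perfectly constant robust features
    return len(robust) / mean_rmd


if __name__ == "__main__":
    # Two robust features with mean RMD 0.2; the third one is rejected.
    print(robustness_quality([0.1, 0.3, 0.9]))
```

With this definition, keeping more robust features at the same average RMD raises the score, which is the behaviour the text describes.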
2.3. Independent Control Signals Computation
The study of the independence is performed over an individual and specific vocal gesture consisting of speech, singing or pure sub-verbal sounds. The independence measurement retrieved from the training examples can then be applied to implement live vocal control, where the vocalist does not necessarily have to stick to the same training gesture in terms of temporal unfolding. We compute the low level features over the training recordings, then we keep only the ones marked as robust from the previous study and we apply the Independent Component Analysis (ICA) method for the independent signals computation. We repeat these measurements for five parameter combinations corresponding to the local maxima of the robustness quality measurement for the five different order ranges.
ICA is a statistical technique assuming a nongaussian distribution of the sources and their statistical independence [17]. In this case we assume that the P robust features f_i are a linear combination of J random variables s_j, statistically independent and nongaussian, as in (4). In (5) the linear combination is expressed in matrix notation.
$$f_i = a_{i1} s_1 + a_{i2} s_2 + \dots + a_{iJ} s_J, \qquad i = 1, \dots, P \qquad (4)$$
$$f = A s \qquad (5)$$
$$s = W f \qquad (6)$$
In (4) the a_ij are the mixing coefficients, which form the mixing matrix A in (5), while in (6) W is the unmixing matrix. The ICA algorithm iterates until convergence on a matrix W that gives the maximally nongaussian sources. The ICA requires a number of observations f at least equal to the number of sources s. This condition always holds in our study because we consider a maximum number of independent components at least 1 order of magnitude smaller than the number of feature computation windows. We use the FastICA⁴ [18] package for the MATLAB computation of the independent component analysis because of its efficiency.
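The toy Python example below illustrates only the linear model behind (5) and (6), not the ICA estimation itself. It assumes a known, hypothetical 2x2 mixing matrix A and takes W as its exact inverse, whereas in the method described here A is unknown and W is estimated from the robust features by FastICA.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def invert_2x2(matrix):
    """Exact inverse of a 2x2 matrix, used here as a stand-in for W."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


if __name__ == "__main__":
    A = [[1.0, 0.5], [0.3, 1.0]]        # hypothetical mixing matrix
    W = invert_2x2(A)                   # ideal unmixing matrix
    s = [0.2, -0.7]                     # one sample of two sources
    f = matvec(A, s)                    # observed features, f = A s
    print(matvec(W, f))                 # recovered sources, s = W f
```

In a live setting only the last step is needed for each analysis window: the robust feature vector is multiplied by the previously estimated W to obtain the control signals.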
2.4. Test and Evaluation Method
A global evaluation method, as well as a metric to compare the different feature computation parameter combinations, is based on three different testing conditions, from which we compute a triplet of quality parameters Q1, Q2 and Q3. We extract the independent control signals performing the feature computation, robust feature selection and the ICA unmixing on:
• Vocal postures, to measure how constant the control signals are (Q1);
• Vocal gestures similar to the one used for the W estimation, to measure control signal independence (Q2);
• Vocal gestures with a different temporal unfolding from the one used for the W estimation, to measure control signal independence (Q3).
The vocal recordings used for the testing differ from the ones used for the noise measurement and ICA. In the first test the independent signals are computed from a new set of vocal postures and we measure the RMD for each s_j, taking their sum as in (7).
$$Q_1 = \sum_{k=1}^{J} \mathrm{RMD}(s_k(t)) \qquad (7)$$
In the second and third tests we compute the s_j from a different instance of the same vocal gesture used to estimate W, and from other gestures which differ in some aspects. Other than changing the temporal unfolding we also tested other gesture variations, such as singing performances presenting different lyrics over the same score, or vice versa. In both tests we measure the independence of the obtained s_j. As independence measure we use the distance correlation, defined in (8), (9) and (10), which is a measure of the statistical dependence between two random vectors which do not necessarily share the same dimensionality [19]. A null distance correlation implies independence, while a null correlation implies independence only if the random variables are Gaussian, which is not our case. The distance correlation depends on the distance covariance, equivalent to the Brownian covariance [20].
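$$\mathrm{dCor}(X,Y) = \frac{\mathrm{dCov}(X,Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}} \qquad (8)$$
$$\mathrm{dVar}^2(X) := E\big[\|X - X'\|^2\big] + E^2\big[\|X - X'\|\big] - 2\,E\big[\|X - X'\|\,\|X - X''\|\big] \qquad (9)$$
$$\mathrm{dCov}^2(X,Y) := \mathrm{cov}\big(\|X - X'\|, \|Y - Y'\|\big) - 2\,\mathrm{cov}\big(\|X - X'\|, \|Y - Y''\|\big) \qquad (10)$$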
4 http://research.ics.tkk.fi/ica/fastica/
In (9) and (10), ||.|| represents the Euclidean norm, while (X,Y), (X’,Y’) and (X’’,Y’’) are independent and identically distributed random variables. We compute the distance correlation dCor(s_j,S_j) between every independent signal s_j and all the remaining ones, denoted by S_j.
$$Q_{2,3} = \sum_{k=1}^{J} \mathrm{dCor}(s_k(t), S_k(t)) \qquad (11)$$
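The independence score can be sketched as follows. The paper does not state which estimator of the distance correlation was used, so this example assumes the common sample estimator based on double-centered pairwise distance matrices. The function names are hypothetical and the quadratic cost in the number of windows is visible in the nested loops.

```python
import math


def _centered_distances(points):
    """Double-centered matrix of pairwise Euclidean distances."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row = [sum(r) / n for r in d]
    grand = sum(row) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation between two equally long lists of vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length, at least 2")
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x2 = sum(v * v for r in a for v in r) / n ** 2
    dvar_y2 = sum(v * v for r in b for v in r) / n ** 2
    denom = math.sqrt(dvar_x2 * dvar_y2)
    if denom == 0:
        return 0.0                       # a constant signal carries no dependence
    return math.sqrt(max(dcov2, 0.0) / denom)


def independence_score(signals):
    """Sum over k of dCor between signal k and all remaining signals."""
    n = len(signals[0])
    total = 0.0
    for k, s_k in enumerate(signals):
        own = [(v,) for v in s_k]
        rest = [tuple(sig[t] for i, sig in enumerate(signals) if i != k)
                for t in range(n)]
        total += distance_correlation(own, rest)
    return total
```

Each term lies between 0 and 1, so with J signals the score ranges from 0 (no measured dependence) to J (every signal fully determined by the others).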
From the way we defined the three quality parameters in (7) and (11), better performance is obtained for small values of Q1,2,3. Hence we define a global quality parameter ISQM (Independent Signals Quality Measurement) that grows with performance, as the inverse of the sum of Q1,2,3, as in (12).
$$\mathrm{ISQM} = (Q_1 + Q_2 + Q_3)^{-1} \qquad (12)$$
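As a quick numerical check of the global quality parameter, assuming it is simply the reciprocal of Q1 + Q2 + Q3, the helper below (the name `isqm` is hypothetical) can be applied to one triplet of values from Table 3.

```python
def isqm(q1, q2, q3):
    """Independent Signals Quality Measurement: inverse of the sum of Q1..Q3."""
    total = q1 + q2 + q3
    if total <= 0:
        raise ValueError("the sum of the quality parameters must be positive")
    return 1.0 / total


if __name__ == "__main__":
    # First Voc.1 speech entry of Table 3.
    print(round(isqm(0.29, 0.81, 0.85), 2))
```

For Q1 = 0.29, Q2 = 0.81 and Q3 = 0.85 the sum is 1.95 and the result is about 0.51, close to the 0.5 reported for that entry.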
3. EXPERIMENTAL RESULTS
In this section we present experimental results based on inde-
pendent signals computed on different performances of different
vocalists. Since the noise and independence study is based on
individual vocalists, we do not aim to provide generic results val-
id across vocalists, but we identify only recurrent trends. We use
three different vocalists in our experiments to highlight the capa-
bility of this method to adapt itself to the individual voice charac-
teristics, and to highlight differences across optimal settings
when changing the subject. The three vocalists, two adult males
(Voc.1, Voc.3) and one adult female (Voc.2), differ in their na-
tive language, which may influence the speaking or singing style
as well. None of them is a professional singer or speaker. We
used a MOTU UltraLite recording interface, and selected a dif-
ferent microphone for each vocalist (Crown CM311-A, Shure
SM58 and Rode NT55) to increase variation between speakers.
The recordings were performed in silent conditions; performance
degradations due to noisy recording environment are presented
later.
3.1. Robustness Analysis
The size of the vocal postures recordings database we use for each vocalist is different (22, 14 and 17 recordings). Each recording has a length of about 3 seconds. The vocal postures are chosen by the vocalists themselves and they may differ across the individual datasets. For the robust feature selection we use the whole database for each vocalist except 2 recordings, randomly chosen, which we use later on for testing purposes. As a threshold value for the feature rejection we choose 0.5, but it can be changed to comply with specific application requirements. Figure 1 shows the average RMD and the standard deviation for each feature for the worst and best cases across the 225 computation parameter combinations, computed on the vocalist 1 database. Each segment in Figure 1 (features in blue, delta in red, and delta-delta in green) contains features in the following order: energy, pitch, LPC, MFCC, PLP, RASTA-PLP. The difference in the x-axis is due to the different order between the worst and best RMD case, which generates feature vectors of different size. As expected, we observe across speakers that the delta and delta-delta differential features are very noisy and therefore not useful for this purpose. In Figure 1 their RMD is not visible when it exceeds the value of 10. Energy, pitch, low-order MFCC, PLP and RASTA-PLP are usually more robust than LPC. RASTA-PLP is the most robust feature set. In Figure 2 we show the RQM across the 225 cases, computed over the three vocalists’ datasets.
Figure 1: Worst case (a) and best case (b) features average RMD with standard deviation across the vocalist 1 dataset.
Figure 2: RQM for the three vocalists’ datasets over the 225 features computation parameter combinations.
In the computation we iterate over window overlap, window size, pre-emphasis, and features order parameters. The periodicity of the RQM in Figure 2 is due to the specific computation loop nesting. We observe that in general the local maxima for each order range are not shared across vocalists. This, as expected, demonstrates that an individual vocalist approach is necessary to find the optimal features computation parameters. Since RQM promotes cases with a higher number of robust features, we can observe a rising trend due to the increase in the order of the features. In general with higher order the probability of having a higher number of features with the RMD below the threshold is higher, hence we expected and obtained higher RQMs. As we will discuss and present later, the absolute maximum of the RQM may not coincide with the maximum of QT. Details of the RQM local maxima for each order are reported in Table 1. In Table 2 the robust features, again for each order’s local maximum, are categorized into their class and differential, and as mentioned, it is possible to observe that the delta and delta-delta differential features are not robust. Energy and pitch are never rejected, and usually the majority of the robust features belong to the RASTA-PLP group, followed by the MFCC and PLP. However, the specific composition of the robust feature vector is different across vocalists, again supporting the individual vocalist approach.
3.2. Independent Control Signal Analysis
We compute the ICA over three instances of the same vocal gesture, choosing a number of independent components equal to 4. Each instance has a length of about 10 seconds. As described in 2.4, after computing the unmixing matrix W, we run tests in three different conditions, computing a quality parameter for each case. Since the computational load of the distance correlation is extremely high, we ran these tests using only 20 different parameter configurations for the features computation, coming from the top four RQM over the five different orders. In Table 3 we report the ISQM results, as well as the Q1,2,3, obtained for different speech, singing and pure sub-verbal gestures. For the best ISQM in Table 3, we always obtain a low variance of the independent control signals over vocal postures, while the independence of the s_j, measured by the distance correlation, is typically below 0.3. In general we observed that we obtain better performance with a lower feature computation order and with large windows. However, the best parameter configuration is vocalist and gesture dependent. In Figure 3 we show the spectrogram versus the independent control signals over a speech gesture for vocalist 1 (a), with the corresponding stable control signals for vocal posture (c), and the best case of singing gesture for vocalist 2 (b).
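The pre-selection of configurations for the costly independence tests can be sketched as below: the top four RQM values are kept for each of the five orders, giving 20 candidates. The data layout (tuples of order, configuration and RQM) and the function name are assumptions made for this illustration.

```python
from collections import defaultdict


def top_configs_per_order(results, k=4):
    """Keep the k configurations with the highest RQM for each order.

    results is an iterable of (order, config, rqm) tuples, where config is
    any description of the remaining feature computation parameters.
    """
    by_order = defaultdict(list)
    for order, config, score in results:
        by_order[order].append((score, config))
    selected = {}
    for order, scored in by_order.items():
        scored.sort(key=lambda item: item[0], reverse=True)
        selected[order] = [config for _, config in scored[:k]]
    return selected


if __name__ == "__main__":
    # Synthetic scores: 45 configurations for each of the five orders.
    fake = [(order, ("cfg", i), float(i))
            for order in (8, 10, 12, 14, 16) for i in range(45)]
    chosen = top_configs_per_order(fake)
    print(sum(len(v) for v in chosen.values()))   # number of retained configurations
```

Only these retained configurations then need the distance correlation evaluation, which keeps the overall computation manageable.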
Table 1: Robustness Quality Measurement local maxima for each order, with relative feature computation parameters (w=window size, s=window step, p=pre-emphasis) and number of robust features (r.f.), over the three vocalists’ posture datasets. The rising trend of the RQM and r.f. with the feature computation order can be observed.
| Order | Voc.1 - RQM | Voc.2 - RQM | Voc.3 - RQM |
|---|---|---|---|
| 8 | 127.6 (20 r.f.; w=1024; s=25%; p=M) | 124.6 (25 r.f.; w=2048; s=75%; p=M) | 101.2 (19 r.f.; w=256; s=25%; p=H) |
| 10 | 130.3 (26 r.f.; w=2048; s=75%; p=Z) | 125.6 (27 r.f.; w=256; s=25%; p=H) | 101.0 (20 r.f.; w=256; s=25%; p=H) |
| 12 | 147.2 (31 r.f.; w=2048; s=75%; p=Z) | 134.1 (31 r.f.; w=256; s=25%; p=H) | 106.3 (21 r.f.; w=256; s=25%; p=H) |
| 14 | 150.1 (32 r.f.; w=2048; s=75%; p=M) | 147.1 (34 r.f.; w=2048; s=75%; p=M) | 109.3 (21 r.f.; w=256; s=25%; p=H) |
| 16 | 151.6 (33 r.f.; w=2048; s=75%; p=M) | 158.4 (36 r.f.; w=2048; s=75%; p=M) | 109.7 (22 r.f.; w=256; s=25%; p=H) |
Table 2: Percentage of robust features across feature classes (upper case columns, where EN.=Energy, PTC.=Pitch, R.PLP=RASTA-PLP) and differential classes (lower case columns, where feat.=static features, d.=delta, d.d.=delta-delta) for the local maxima for each order over the three vocalists’ datasets. RASTA-PLP has the highest percentage among the feature classes, followed by MFCC and PLP. Energy and Pitch are never rejected. Delta and delta-delta are always marked as noisy in the vocal postures. All values are percentages.
| Vocalist | Order | EN. | PTC. | LPC | MFCC | PLP | R.PLP | feat. | d. | d.d. |
|---|---|---|---|---|---|---|---|---|---|---|
| Voc.1 | 8 | 5.0 | 5.0 | 10.0 | 15.0 | 25.0 | 45.0 | 100.0 | 0.0 | 0.0 |
| Voc.1 | 10 | 3.8 | 3.8 | 11.5 | 23.0 | 26.9 | 30.7 | 100.0 | 0.0 | 0.0 |
| Voc.1 | 12 | 3.2 | 3.2 | 9.6 | 22.5 | 32.2 | 29.0 | 100.0 | 0.0 | 0.0 |
| Voc.1 | 14 | 3.1 | 3.1 | 6.2 | 25.0 | 31.2 | 31.2 | 100.0 | 0.0 | 0.0 |
| Voc.1 | 16 | 3.0 | 3.0 | 6.0 | 27.2 | 30.3 | 30.3 | 100.0 | 0.0 | 0.0 |
| Voc.2 | 8 | 4.0 | 4.0 | 16.0 | 24.0 | 20.0 | 32.0 | 100.0 | 0.0 | 0.0 |
| Voc.2 | 10 | 3.7 | 3.7 | 7.4 | 29.6 | 22.2 | 33.3 | 100.0 | 0.0 | 0.0 |
| Voc.2 | 12 | 3.2 | 3.2 | 6.4 | 29.0 | 25.8 | 32.2 | 100.0 | 0.0 | 0.0 |
| Voc.2 | 14 | 2.9 | 2.9 | 11.7 | 23.5 | 32.3 | 26.4 | 100.0 | 0.0 | 0.0 |
| Voc.2 | 16 | 2.7 | 2.7 | 11.1 | 25.0 | 30.5 | 27.7 | 100.0 | 0.0 | 0.0 |
| Voc.3 | 8 | 5.2 | 5.2 | 5.2 | 21.0 | 21.0 | 42.1 | 100.0 | 0.0 | 0.0 |
| Voc.3 | 10 | 5.0 | 5.0 | 5.0 | 20.0 | 15.0 | 50.0 | 100.0 | 0.0 | 0.0 |
| Voc.3 | 12 | 4.7 | 4.7 | 4.7 | 19.0 | 14.2 | 52.3 | 100.0 | 0.0 | 0.0 |
| Voc.3 | 14 | 4.7 | 4.7 | 4.7 | 19.0 | 14.2 | 52.3 | 100.0 | 0.0 | 0.0 |
| Voc.3 | 16 | 4.5 | 4.5 | 4.5 | 18.1 | 18.1 | 50.0 | 100.0 | 0.0 | 0.0 |
Table 3: Three independent control signal quality parameters over different vocal gestures (speech, singing and pure sub-verbal), including noisy environment conditions, of different vocalists. The best two ISQM for each gesture are presented with the relative parameter combinations (o=order, w=window size, s=window step, p=pre-emphasis). The best parameter combination is vocalist and vocal gesture dependent. In general, lower orders and larger windows give the best ISQM. Q2 and Q3 values show consistency in most of the test cases, while Q1 values are usually low.
| Vocalist and gesture | Q1 | Q2 | Q3 | ISQM | Params. |
|---|---|---|---|---|---|
| Voc.1 Speech | 0.29 | 0.81 | 0.85 | 0.5 | o=10; w=1024; s=75%; p=M |
| Voc.1 Speech | 0.65 | 0.95 | 0.78 | 0.41 | o=14; w=2048; s=75%; p=Z |
| Voc.1 Sub-verbal | 0.52 | 0.80 | 0.85 | 0.45 | o=10; w=1024; s=75%; p=M |
| Voc.1 Sub-verbal | 0.60 | 0.87 | 0.78 | 0.44 | o=8; w=1024; s=25%; p=M |
| Voc.2 Sing | 0.37 | 0.89 | 1.15 | 0.41 | o=8; w=2048; s=50%; p=M |
| Voc.2 Sing | 0.32 | 1.02 | 1.23 | 0.38 | o=14; w=2048; s=75%; p=M |
| Voc.2 Sub-verbal | 0.64 | 0.79 | 0.92 | 0.42 | o=12; w=128; s=25%; p=M |
| Voc.2 Sub-verbal | 0.65 | 1.21 | 1.56 | 0.29 | o=16; w=2048; s=75%; p=M |
| Voc.3 Speech | 0.31 | 0.71 | 0.69 | 0.58 | o=12; w=1024; s=75%; p=M |
| Voc.3 Speech | 0.47 | 0.98 | 0.88 | 0.42 | o=14; w=1024; s=50%; p=M |
| Voc.3 Sub-verbal | 0.37 | 0.67 | 0.64 | 0.45 | o=12; w=1024; s=75%; p=M |
| Voc.3 Sub-verbal | 0.53 | 0.89 | 0.94 | 0.41 | o=8; w=1024; s=50%; p=H |
| Voc.1 Speech Noisy | 0.18 | 0.94 | 0.88 | 0.49 | o=10; w=512; s=25%; p=H |
| Voc.1 Speech Noisy | 0.39 | 0.86 | 0.85 | 0.47 | o=8; w=2048; s=25%; p=H |
| Voc.2 Sing Noisy | 0.73 | 0.97 | 1.10 | 0.35 | o=10; w=2048; s=75%; p=Z |
| Voc.2 Sing Noisy | 0.98 | 0.93 | 1.08 | 0.33 | o=10; w=2048; s=50%; p=M |
Figure 3: Three details of spectrogram and independent control signals, obtained by the ICA unmixing matrix, computed on a
short interval of two vocal gestures (a) (b) and one vocal posture (c).
Table 4: Robustness Quality Measurement local maxima for each order, with relative feature computation parameters (w=window size, s=window step, p=pre-emphasis) and number of robust features (r.f.), over the vocalist 1 posture dataset, comparing silent and noisy recording conditions. The number of robust features and the RQM values are lower in the noisy condition, while the feature computation parameters are similar except for differences in the step size.
| Order | Voc.1 - RQM - Silent Env. | Voc.1 - RQM - Noisy Env. |
|---|---|---|
| 8 | 127.6 (20 r.f.; w=1024; s=25%; p=M) | 94.7 (20 r.f.; w=2048; s=50%; p=H) |
| 10 | 130.3 (26 r.f.; w=2048; s=75%; p=Z) | 101.6 (21 r.f.; w=2048; s=50%; p=H) |
| 12 | 147.2 (31 r.f.; w=2048; s=75%; p=Z) | 109.8 (24 r.f.; w=2048; s=75%; p=H) |
| 14 | 150.1 (32 r.f.; w=2048; s=75%; p=M) | 109.3 (28 r.f.; w=2048; s=75%; p=M) |
| 16 | 151.6 (33 r.f.; w=2048; s=75%; p=M) | 110.2 (29 r.f.; w=2048; s=75%; p=M) |
The method, tested with a different number of independent
components and with the threshold value changed by 0.5,
consistently identifies the vocalist- and gesture-dependent
configuration that yields the independent control signals with
the highest independence and lowest noise. In particular, with a
less severe threshold, such as 1, the number of robust features
increases by about 50%. The threshold determines the trade-off
between the value of Q1 and the values of Q2 and Q3. Moreover,
we observed that in general small orders produce higher Q1,
even if the performance on Q2 and Q3 is slightly better for
higher orders.
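The effect of relaxing the rejection threshold can be illustrated with a minimal sketch. It assumes that each feature carries a scalar noise score, that a feature counts as robust when its score does not exceed the threshold, and that 0.5 is the baseline threshold. The function name `select_robust` and the score values are hypothetical and serve only as an illustration; they are not measured data.

```python
def select_robust(noise_scores, threshold):
    """Return the indices of features whose noise score is within the threshold."""
    return [i for i, score in enumerate(noise_scores) if score <= threshold]


# Toy noise scores for eight features (illustrative values, not measurements).
toy_scores = [0.12, 0.31, 0.48, 0.55, 0.72, 0.95, 1.30, 2.10]

strict = select_robust(toy_scores, 0.5)   # assumed baseline threshold
relaxed = select_robust(toy_scores, 1.0)  # less severe threshold

print("robust features at 0.5:", len(strict))
print("robust features at 1.0:", len(relaxed))
```

Every feature accepted by the stricter threshold is also accepted by the relaxed one, so relaxing the threshold can only add features to the robust set.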
3.3. Noisy Environments Performances
The proposed method presents only a slight performance decrease
when the database is recorded in noisy environments. We
duplicated the database by repeating the same vocal gesture and
posture recordings under different environmental conditions.
Eight loudspeakers surrounding the vocalist simultaneously
reproduced music and the microphone signal. To increase the
randomness, the background music crossfaded rapidly among four
songs of different genres every 5 seconds. In Figure 4 we compare
the RQM for vocalist 1 over the 225 different cases, in silent
and noisy recording conditions. The decrease of the RQM in
absolute value is evident, while the rising trend and the local
maxima and minima are similar. In Table 4 we present a detailed
comparison of the RQM local maxima, for each order, for the two
recording conditions. Since we did not add artificial noise to
the database but performed new recordings in a noisy
environment, the consistency of the results for the two
environments supports the validity of this approach.
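To put a number on the RQM decrease, the following sketch computes the relative drop between the silent and noisy local maxima for each order, using only the values reported in Table 4. The helper name `relative_drop` is ours and not part of the method.

```python
# RQM local maxima from Table 4: order -> (silent, noisy).
rqm = {
    8: (127.6, 94.7),
    10: (130.3, 101.6),
    12: (147.2, 109.8),
    14: (150.1, 109.3),
    16: (151.6, 110.2),
}


def relative_drop(silent, noisy):
    """Percentage decrease of the noisy value with respect to the silent one."""
    return 100.0 * (silent - noisy) / silent


drops = {order: relative_drop(*values) for order, values in rqm.items()}

for order in sorted(drops):
    print(f"order {order}: RQM drop {drops[order]:.1f}%")
```

With these values the drop lies between roughly 22% and 27% for every order.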
Figure 4: Robustness Quality Measurement for the vocalist 1
silent and noisy datasets over 225 feature computation
parameter combinations.
In the performance measurements over the independent signals,
the Q1 values show a decrease due to a lower nominal RQM value,
as presented in the bottom part of Table 3. Noisy environments
therefore lead to performance degradation, as expected, but the
method is able to reject external sources of noise without an
excessive penalty on the overall performance.
4. CONCLUSIONS AND FUTURE WORK
We presented a method, tailored to the individual vocalist, to
compute a set of time- and value-continuous control signals with
high independence and low noise for particular performance
datasets. We ran a blind search over the computation parameters
to minimize four quality parameters. Experiments over different
vocalists and performances gave coherent results. Additional
experiments with different values of the feature rejection
threshold and different numbers of computed independent signals
were also consistent. In general we showed that feature
robustness depends on the individual voice, and that the best
computation parameters vary according to the vocalist. The
computation of independent signals strongly depends on the
specific vocal performance and must be tuned accordingly.
Moreover, we found and highlighted recurrent patterns in the RQM
and ISQM measurements across different features and feature
computation parameters. Additional work on alternative techniques to
generate independent signals may further improve this method.
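The blind search over computation parameters can be sketched as an exhaustive grid evaluation. The grid below is an assumption chosen to be consistent with the values in Tables 3 and 4 and with the 225 cases: five orders, five window sizes (the value 256 does not appear in the tables and is assumed), three step sizes and three pre-emphasis settings. The function `toy_cost` is a hypothetical placeholder for the actual quality measurements, which this sketch does not reproduce.

```python
from itertools import product

# Assumed parameter grid: 5 * 5 * 3 * 3 = 225 combinations.
ORDERS = (8, 10, 12, 14, 16)
WINDOWS = (128, 256, 512, 1024, 2048)  # 256 is an assumption
STEPS = (25, 50, 75)                   # window step, percent
PRE_EMPHASIS = ("Z", "M", "H")         # labels as used in the tables

GRID = list(product(ORDERS, WINDOWS, STEPS, PRE_EMPHASIS))


def blind_search(cost):
    """Evaluate every configuration and return the one with the lowest cost."""
    return min(GRID, key=lambda config: cost(*config))


def toy_cost(order, window, step, pre_emphasis):
    """Hypothetical placeholder cost; it ignores the pre-emphasis setting."""
    return abs(order - 12) + abs(window - 1024) / 1024 + abs(step - 50) / 100


best = blind_search(toy_cost)
print(len(GRID), "configurations evaluated; best:", best)
```

In practice the cost would be computed from the recorded vocal data for each configuration, so the search is repeated for every vocalist and performance type.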
The capability of computing noise-free and independent signals
from the voice cannot be taken as providing evidence of
their “human-controllability”. This important HCI issue is still
open and we will investigate it in the future. Moreover, since this
method can be considered unsupervised, it is necessary to
provide the user with information about how the independent signals
are extracted from the voice in order to support user control of
the system. To address these two issues we developed a real-time
application for the independent signal computation, but it needs
to cooperate with a system that provides feedback to the user in
acoustic and visual forms.
5. ACKNOWLEDGMENTS
This work was supported by a scholarship from the NUS Graduate
School for Integrative Sciences & Engineering.
6. REFERENCES
[1] T. Igarashi and J. F. Hughes, “Voice as sound: using non-verbal voice input for interactive control,” in Proceedings of the 14th annual ACM symposium on User interface software and technology, 2001, pp. 155-156.
[2] J. A. Bilmes, X. Li, J. Malkin, K. Kilanski, R. Wright, K. Kirchhoff, A. Subramanya, S. Harada, J. A, P. Dowden, and H. Chizeck, “The Vocal Joystick: A Voice-Based Human-Computer Interface for Individuals with Motor Impairments,” in Human Language Technology Conf. and Conf. on Empirical Methods in Natural Language Processing, Vancouver, Canada, 2005, pp. 995-1002.
[3] S. Harada, J. O. Wobbrock, J. Malkin, J. A. Bilmes, and J. A. Landay, “Longitudinal study of people learning to use continuous voice-based cursor control,” in Proceedings of the 27th international conference on Human factors in computing systems, 2009, pp. 347-356.
[4] S. Harada, J. O. Wobbrock, and J. A. Landay, “Voicedraw: a hands-free voice-driven drawing application for people with motor impairments,” in Proceedings of the 9th international ACM SIGACCESS conference on Computers and accessibility, 2007, pp. 27-34.
[5] B. House, J. Malkin, and J. Bilmes, “The VoiceBot: a voice controlled robot arm,” in Proceedings of the 27th international conference on Human factors in computing systems, 2009, pp. 183-192.
[6] A. Loscos and T. Aussenac, “The wahwactor: a voice controlled wah-wah pedal,” in Proceedings of the 2005 conference on New interfaces for musical expression, 2005, pp. 172-175.
[7] J. Janer and M. De Boer, “Extending voice-driven synthesis to audio mosaicing,” in 5th Sound and Music Computing Conference, Berlin, 2008, vol. 4.
[8] R. A. Fiebrink, “Real-time Human Interaction with Supervised Learning Algorithms for Music Composition and Performance,” Ph.D. Thesis, Princeton, 2011.
[9] S. Fasciani and L. Wyse, “A Voice Interface for Sound Generators: adaptive and automatic mapping of gestures to sound,” in Proceedings of the 2012 conference on New interfaces for musical expression, 2012.
[10] M. Rocamora and P. Herrera, “Comparing audio descriptors for singing voice detection in music audio files,” in Brazilian Symposium on Computer Music, 11th. San Pablo, Brazil, 2007, vol. 26, p. 27.
[11] H. Lukashevich, M. Gruhne, and C. Dittmar, “Effective singing voice detection in popular music using ARMA filtering,” in Workshop on Digital Audio Effects (DAFx’07), 2007.
[12] D. Stowell and M. D. Plumbley, “Robustness and independence of voice timbre features under live performance acoustic degradations,” in Proc. of the 11th Int. Conference on Digital Audio Effects, 2008.
[13] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990.
[14] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, Oct. 1994.
[15] C. Sanderson and K. K. Paliwal, “Effect of different sampling rates and feature vector sizes on speech recognition performance,” in TENCON’97. IEEE Region 10 Annual Conference. Speech and Image Technologies for Computing and Telecommunications., Proceedings of IEEE, 1997, vol. 1, pp. 161-164.
[16] B. B. Monson, “High-Frequency energy in singing and speech,” Ph.D. Thesis, University of Arizona, 2011.
[17] P. Comon, “Independent component analysis, a new concept?,” Signal Process., vol. 36, no. 3, pp. 287-314, Apr. 1994.
[18] A. Hyvarinen, “Fast and robust fixed-point algorithms for independent component analysis,” IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626-634, May 1999.
[19] G. J. Székely, M. L. Rizzo, and N. K. Bakirov, “Measuring and testing dependence by correlation of distances,” The Annals of Statistics, vol. 35, no. 6, pp. 2769-2794, Dec. 2007.
[20] A. Gretton, K. Fukumizu, and B. K. Sriperumbudur, “Discussion of: Brownian distance covariance,” The Annals of Applied Statistics, vol. 3, no. 4, pp. 1285-1294, Dec. 2009.