NOVEL METRIC LEARNING FOR NON-PARALLEL VOICE CONVERSION
Nirmesh J. Shah and Hemant A. Patil
Speech Research Lab,
Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar
Email: {nirmesh88_shah, hemant_patil}@daiict.ac.in
ABSTRACT
Obtaining aligned spectral pairs from non-parallel data for stand-alone Voice Conversion (VC) techniques is a challenging research problem. The unsupervised alignment algorithm, namely, the Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA), iteratively tries to align the spectral features by minimizing the Euclidean distance metric between the intermediate converted and the target spectral feature vectors. However, the Euclidean distance may not correlate well with the perceptual distance between two (sound or visual) patterns in a given feature space. In this paper, we propose to learn a distance metric using the Large Margin Nearest Neighbor (LMNN) technique, which gives a minimum distance for the same phoneme uttered by different speakers and a larger distance for different phonemes. This learned metric is then used for finding the NN pairs in the INCA. Furthermore, we propose to use this learned metric only for the first iteration of the INCA, since the intermediate converted features (which are not actual acoustic features) may not behave well w.r.t. the learned metric. We obtained on average a 7.93% relative improvement in Phonetic Accuracy (PA), which is reflected positively in the subjective and objective evaluations.
Index Terms— VC, INCA, Metric Learning, LMNN.
1. INTRODUCTION
Non-parallel Voice Conversion (VC) has been a focus of research for the last decade. Alignment is one of the key research issues in non-parallel VC [1]. Though adaptation and generation model-based techniques avoid the alignment step in non-parallel VC [2-8], aligned spectral feature pairs are still required to apply stand-alone VC techniques to non-parallel data [1]. Among the available alignment approaches for non-parallel VC, the state-of-the-art unsupervised algorithm is the Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA), which iteratively computes a mapping function from the Nearest Neighbor (NN)-aligned feature pairs [9, 10].
We acknowledge the authorities of DA-IICT Gandhinagar and MeitY, Govt. of India, for their kind support.
The key issue in the INCA is that it tries to minimize the Euclidean metric among acoustic features for the time alignment. However, the same phoneme uttered by two speakers may not have the minimum Euclidean distance [11-14]. Recently, in the area of keyword search, distance metric learning has been proposed as a replacement for the standard distance metric in DTW [15, 16]. In this paper, we propose to learn a metric that gives a minimum distance for the same phoneme and a maximum distance for different phonemes uttered by two different speakers. In particular, we propose to use this learned metric instead of the Euclidean distance for finding the NN pairs in the INCA. We learn the metric globally on the TIMIT database due to the availability of phone annotations.
Furthermore, the INCA is sensitive to initialization due to the alternating minimization nature of the algorithm [10, 17]. Except for iteration 1 of the INCA, the NN search is performed between the intermediate converted and the target spectral feature vectors. These feature vectors may not behave like actual spectral features. Hence, we propose to use the learned metric only for the initial iteration, where the spectral features are derived from both the source and target speakers.

Among the various available metric learning techniques [18], we used the state-of-the-art Large Margin Nearest Neighbor (LMNN) technique [19, 20]. The aligned feature pairs obtained using the Euclidean metric and the learned metric are further used to develop various VC systems.
2. METRIC LEARNING FOR ALIGNMENT TASK
2.1. Motivation for Metric Learning in VC
INCA consists of three steps, namely, initialization, NN search, and transformation [9]. These steps are repeated until convergence. In the literature, low Phonetic Accuracy (PA) is reported after the alignment step [17, 21]. To further investigate the possible reasons behind this low PA, we apply the t-distributed Stochastic Neighbor Embedding (t-SNE) visualization technique to the acoustic space of the source and target speakers [22]. We have taken one of the available speaker-pairs (namely, BDL-RMS (M-M)) from the CMU-ARCTIC database [23]. The acoustic space for a vowel, a stop, a nasal, and a fricative is shown in Fig. 1.
Fig. 1: Acoustic feature space visualization in 2-D using t-SNE for different speech sound classes, namely, (a) vowel, (b) stop, (c) nasal, and (d) fricative.
We can clearly see that the same phoneme uttered by the two speakers does not lie in a common neighborhood in the Euclidean space; rather, it is spread across the 2-D acoustic space. This is primarily due to differences in the vocal tract system (i.e., size and shape) and the excitation source (differences in the size of the glottis, vocal fold mass, and tension in the vocal folds and hence, the manner in which the glottis opens and closes, i.e., the glottal activity) across speakers (Chapter 3, p. 59) [24]. This motivated the authors to define a new metric that better represents the acoustic feature space. In this paper, we propose to use the learned metric for finding the NN pairs in iteration 1 of the INCA [9].
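For reproducibility, a minimal sketch of this visualization is given below. It assumes hypothetical arrays src_mcc and tgt_mcc holding per-frame MCC features of one phoneme from each speaker, and uses the scikit-learn implementation of t-SNE; it is an illustration of the procedure, not necessarily the authors' exact setup.

```python
# Minimal sketch: project MCC frames of one phoneme from two speakers
# into 2-D with t-SNE and plot them.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# src_mcc, tgt_mcc: (n_frames, 25) MCC arrays for the same phoneme uttered
# by the source and target speakers (hypothetical placeholders).
src_mcc = np.random.randn(200, 25)
tgt_mcc = np.random.randn(180, 25)

X = np.vstack([src_mcc, tgt_mcc])
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(emb[:len(src_mcc), 0], emb[:len(src_mcc), 1], label="source")
plt.scatter(emb[len(src_mcc):, 0], emb[len(src_mcc):, 1], label="target")
plt.legend()
plt.title("t-SNE of MCC frames for one phoneme, two speakers")
plt.show()
```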
2.2. Metric Learning
Metric learning is concerned with learning a distance function w.r.t. a particular task. It has been shown to be extremely useful when used along with NN methods [18]. Metric learning techniques can be broadly classified into linear (which use the Mahalanobis distance) and nonlinear approaches [18].
Let $X = [x_1, x_2, \ldots, x_n]$ be the matrix of all the data points. The mapping $d: X \times X \rightarrow \mathbb{R}$ is called a metric if it satisfies the following four conditions [25]:

1. $d(x_i, x_j) \geq 0$ (non-negativity),
2. $d(x_i, x_j) = 0 \Leftrightarrow x_i = x_j$ (identity of indiscernibles),
3. $d(x_i, x_j) = d(x_j, x_i)$ (symmetry),
4. $d(x_i, x_j) \leq d(x_i, x_r) + d(x_r, x_j)$, for all $x_i, x_j, x_r \in X$ (triangle inequality).
If condition (2) is dropped, the mapping is called a pseudo-metric [25]. In particular, a distance metric can be defined through an inner product space, e.g., $d^2(x, y) = \langle x - y, x - y \rangle = (x - y)^T (x - y)$. Hence, in general, a distance metric is defined as:

$$d_A(x, y) = (x - y)^T A (x - y). \qquad (1)$$

If $A = \Sigma^{-1}$, the distance is called the Mahalanobis distance [26], where $\Sigma$ is the covariance matrix of the data. In most cases, the true covariance is unknown and hence, the sample covariance is used. Here, $A$ must be positive semidefinite (PSD) (i.e., $A \succeq 0$, where the notation $\succeq$ indicates positive semidefiniteness) to satisfy the metric definition. Furthermore, if $A$ is PSD, it can be factorized as $A = G^T G$, which leads to $d_A(x, y) = \|Gx - Gy\|_2^2$ (where $\|\cdot\|_2$ is the $L_2$ norm). Hence, learning the metric can be viewed as learning a global linear transformation.
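As a concrete illustration of Eq. (1) and the factorization $A = G^T G$ (our sketch, not code from the paper), the following verifies that $d_A$ equals the squared Euclidean distance after the global linear map $G$:

```python
import numpy as np

def learned_distance(x, y, A):
    """Squared distance d_A(x, y) = (x - y)^T A (x - y) from Eq. (1)."""
    diff = x - y
    return float(diff @ A @ diff)

# Illustrative PSD matrix A built from a random G, so A = G^T G by construction.
rng = np.random.default_rng(0)
G = rng.standard_normal((25, 25))
A = G.T @ G

x, y = rng.standard_normal(25), rng.standard_normal(25)

# Equivalent view: d_A(x, y) = ||Gx - Gy||_2^2, i.e., the Euclidean distance
# after applying the global linear transformation G.
d_direct = learned_distance(x, y, A)
d_mapped = float(np.sum((G @ x - G @ y) ** 2))
assert np.isclose(d_direct, d_mapped)
```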
The idea of learning a Mahalanobis metric was proposed by Xing et al. [27]. Here, the desired metric should give a minimum squared distance for the pairs $(x_i, x_j) \in \mathcal{S}$ (where $\mathcal{S}$ is a set of similar pairs), subject to the constraint $\sum_{(x_i, x_j) \in \mathcal{D}} d_A(x_i, x_j) \geq 1$, where $\mathcal{D}$ is the set of dissimilar pairs. The objective function is given by [27]:

$$\arg\min_A \sum_{(x_i, x_j) \in \mathcal{S}} \|x_i - x_j\|_A^2, \qquad (2)$$

subject to

$$\sum_{(x_i, x_j) \in \mathcal{D}} \|x_i - x_j\|_A^2 \geq 1, \quad A \succeq 0. \qquad (3)$$
The above objective function is linear, and both constraints are convex; hence, convex optimization algorithms can be applied to obtain the globally optimal solution. In particular, gradient descent combined with iterative projections can be used to solve this convex optimization problem [27].
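For intuition, a gradient step can violate $A \succeq 0$, so iterative-projection solvers alternate gradient updates with a projection back onto the PSD cone. The snippet below is a minimal sketch of that projection step only, not the full optimizer of [27]:

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone by clipping
    negative eigenvalues to zero (the projection step used in
    iterative-projection solvers for metric learning)."""
    w, V = np.linalg.eigh((A + A.T) / 2.0)   # symmetrize, then eigendecompose
    return (V * np.clip(w, 0.0, None)) @ V.T  # rebuild with w_i := max(w_i, 0)
```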
Weinberger et al. proposed the LMNN technique, which uses relative distance constraints and is one of the most popular state-of-the-art metric learning techniques in the literature [19, 20]. The main aim of the LMNN technique is that a given feature should have the same label as its neighbors, while features having different labels should be kept far apart from it. The key idea behind the LMNN is illustrated in Fig. 2.
Fig. 2: Schematic representation of the LMNN technique (a) before and (b) after applying the LMNN technique. Adapted from [19].
Here, the target neighbors refer to features that have the same label as the given feature, while an impostor is a neighboring feature vector that has a different label. The goal of the LMNN technique is to minimize the number of impostors via relative distance constraints. The objective function is given by [19]:

$$\arg\min_{A \succeq 0} \sum_{(i,j) \in \mathcal{S}} d_A(x_i, x_j) + \lambda \sum_{(i,j,k) \in \mathcal{R}} \left[ 1 + d_A(x_i, x_j) - d_A(x_i, x_k) \right]_+, \qquad (4)$$

where $\mathcal{R}$ is the set of all triplets $(i, j, k)$ such that $x_i$ and $x_j$ are target neighbors and $x_k$ is an impostor, and $[\cdot]_+$ denotes the hinge function $\max(0, \cdot)$.
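To make Eq. (4) concrete, the sketch below (our own illustration on toy data) evaluates the LMNN objective for a given matrix $A$. In practice, $A$ is optimized over the PSD cone, e.g., with projected (sub)gradient steps or off-the-shelf LMNN solvers; this snippet only shows the loss being minimized.

```python
import numpy as np

def d_A(x, y, A):
    """Squared learned distance from Eq. (1)."""
    diff = x - y
    return float(diff @ A @ diff)

def lmnn_objective(X, S, R, A, lam=1.0):
    """Eq. (4): pull term over target-neighbor pairs S plus lam times the
    hinged push term over impostor triplets R = {(i, j, k)}."""
    pull = sum(d_A(X[i], X[j], A) for (i, j) in S)
    push = sum(max(0.0, 1.0 + d_A(X[i], X[j], A) - d_A(X[i], X[k], A))
               for (i, j, k) in R)
    return pull + lam * push

# Toy example (hypothetical 2-D data for readability).
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
S = [(0, 1)]        # x0 and x1 share a phone label (target neighbors)
R = [(0, 1, 2)]     # x2 carries a different label (impostor)
print(lmnn_objective(X, S, R, np.eye(2)))  # objective under the identity metric
```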
In this paper, we have used the TIMIT (American English) database for estimating the learned metric, since manual phone annotations obtained from highly trained human annotators are available [28]. We used the full phone-set labels. We randomly selected a small subset of the database to learn the metric using the LMNN technique. We extracted 25-D Mel Cepstral Coefficients (MCC) per frame (with 25 ms frame duration and 5 ms frame shift). In this paper, we learn the metric globally for the spectral features and use this learned metric for calculating the NN feature pairs. We considered three possible approaches to use the learned metric in the INCA; their schematic representations are given in Fig. 3. Proposed system A uses the learned metric in each iteration of the baseline INCA, as shown in Fig. 3 (a) and (b). On the other hand, proposed system C uses the learned metric only in iteration 1 of the INCA, as shown in Fig. 3 (c). Proposed system B first applies the global transformation learned via metric learning to the spectral features obtained from both the speakers, and then the baseline INCA is applied to the transformed features. A structural sketch of these variants is given below.
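The sketch below contrasts the variants, using proposed system C as the example: the learned metric drives the NN search in iteration 1 only, with the Euclidean metric thereafter. The names nn_align and train_mapping are our hypothetical stand-ins; the toy least-squares map replaces the actual conversion step used inside INCA.

```python
import numpy as np

def nn_align(src, tgt, A=None):
    """Nearest-neighbor search under d_A; A=None falls back to Euclidean.
    Builds all pairwise distances, so it is for illustration, not scale."""
    if A is None:
        A = np.eye(src.shape[1])
    diff = src[:, None, :] - tgt[None, :, :]           # (Ns, Nt, D)
    d = np.einsum('ntd,de,nte->nt', diff, A, diff)     # d_A for every pair
    return d.argmin(axis=1)                            # NN target index per frame

def train_mapping(x, y):
    """Toy conversion step: a least-squares linear map (a stand-in for the
    GMM-based conversion actually used inside INCA)."""
    W, *_ = np.linalg.lstsq(x, y, rcond=None)
    return lambda z: z @ W

def inca_system_c(src, tgt, A_learned, n_iter=10):
    """Proposed system C: learned metric in iteration 1, Euclidean afterwards."""
    conv = src.copy()                       # intermediate converted features
    for it in range(n_iter):
        A = A_learned if it == 0 else None  # LM only for the initial iteration
        pairs = nn_align(conv, tgt, A)      # NN search step
        conv = train_mapping(src, tgt[pairs])(src)  # conversion step
    return src, tgt[pairs]                  # final aligned feature pairs
```

System A would pass A_learned on every iteration, and system B would instead transform src and tgt by G once up front and then run the Euclidean baseline.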
3. EXPERIMENTAL RESULTS
We have used the CMU-ARCTIC database due to the availability of phone annotations, which are obtained using a speaker-dependent hidden Markov model (HMM) trained over 1132 utterances [23]. Here, we converted the reference phone annotations to frame-level labels. In this paper, 40 non-parallel utterances from each speaker-pair have been used to develop VC systems using the aligned features obtained via the baseline and the proposed techniques. Among the various available VC techniques, the state-of-the-art Joint Density Gaussian Mixture Model (JDGMM)-based VC has been selected [29]. The JDGMM-based method is selected since it uses the conditional expectation, which is the best minimum mean square error (MMSE) estimator [30]; hence, it leads to the minimum error between the converted and the target spectral features. 25-D MCC and 1-D F0 per frame (with 25 ms frame duration and 5 ms frame shift) have been extracted using AHOCODER. The number of mixture components has been varied, namely, m = 8, 16, 32, 64, and 128, and the system having the optimum Mel Cepstral Distortion (MCD) is selected for the subjective evaluation. Here, the Mean-Variance (MV) transformation has been used for F0 conversion [31].
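For completeness, a minimal sketch of the MV log-F0 transformation is given below. It is the standard formulation that matches the target speaker's log-F0 mean and variance; the voiced/unvoiced handling shown here is our assumption.

```python
import numpy as np

def mv_f0_transform(f0_src, f0_src_train, f0_tgt_train):
    """Mean-Variance log-F0 conversion: shift and scale the source log-F0
    to match the target's log-F0 statistics. Only voiced frames (F0 > 0)
    contribute to the statistics and are converted."""
    log_s = np.log(f0_src_train[f0_src_train > 0])
    log_t = np.log(f0_tgt_train[f0_tgt_train > 0])
    mu_s, sd_s = log_s.mean(), log_s.std()
    mu_t, sd_t = log_t.mean(), log_t.std()

    f0_conv = np.zeros_like(f0_src)
    voiced = f0_src > 0
    f0_conv[voiced] = np.exp(mu_t + (sd_t / sd_s) *
                             (np.log(f0_src[voiced]) - mu_s))
    return f0_conv
```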
Fig. 3: Schematic representation of (a) the baseline system, (b) proposed system A, and (c) proposed system C. Proposed system B is not shown here, since it applies the baseline technique to the transformed features obtained via the LM and hence, is similar to (a). EUCL: Euclidean metric, LM: Learned metric.
3.1. Analysis of Phonetic Accuracies
In the context of VC, if an aligned pair contains features from the same phoneme, it is considered a hit; otherwise, it is a false. From this, the Phonetic Accuracy (PA) is defined as [13]:

$$\text{PA (in \%)} = \frac{\text{Total no. of hits}}{\text{Total no. of frames}} \times 100, \qquad (5)$$

where Total no. of frames = Total no. of hits + Total no. of falses.
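Given frame-level phone labels for both speakers and an alignment, Eq. (5) can be computed as in the following sketch (the label arrays and the alignment are hypothetical placeholders):

```python
import numpy as np

def phonetic_accuracy(src_labels, tgt_labels, pairs):
    """Eq. (5): percentage of aligned pairs whose frame-level phone
    labels match (hits) over all aligned frames."""
    hits = np.sum(src_labels == tgt_labels[pairs])
    return 100.0 * hits / len(src_labels)

# Toy usage: source frame i is aligned to target frame pairs[i].
src_labels = np.array(["aa", "aa", "s", "n"])
tgt_labels = np.array(["aa", "s", "s", "m"])
pairs = np.array([0, 1, 2, 3])
print(phonetic_accuracy(src_labels, tgt_labels, pairs))  # 50.0 here
```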
Fig. 4: PA of different initialization techniques for non-parallel VC systems.
Fig. 4 shows the alignment accuracy obtained using the three proposed techniques. It is observed that there is, on average, a 0.71% relative reduction and a 0.07% relative improvement in the PA w.r.t. the baseline using proposed methods A and B, respectively. This is possibly due to the fact that the metric is learned globally over the entire TIMIT database. Broad phonetic classes, such as vowels, stops, fricatives, and nasals, behave very differently in the acoustic space due to the different manners of articulation required to produce these sounds [24]. Since the metric is learned for the true acoustic features and not for the intermediate converted acoustic features, we propose technique C, which performs consistently better (with, on average, a 7.93% relative improvement in PA) than the INCA.
3.2. Subjective and Objective Evaluations
Mean Opinion Score (MOS) and ABX tests have been performed for measuring the speech quality and the speaker similarity (SS) of the converted voices, respectively. Both subjective tests were taken by 16 subjects (5 females and 11 males, with no known hearing impairments, aged between 18 and 29 years), using a total of 288 samples. The results of the 5-point MOS test (1 (very bad) to 5 (very good)) are shown in Table 1, along with the 95% confidence intervals. It can be seen from Table 1 that the proposed system C is preferred over the baseline in terms of speech quality (i.e., naturalness) for the VC (except in the case of F-M).
Table 1: MOS analysis for the naturalness of converted voices. The number in brackets indicates the margin of error corresponding to the 95% confidence intervals for the VC systems.

                     M-M           M-F           F-M           F-F
Baseline             3.06 (0.27)   2.41 (0.29)   2.66 (0.28)   3.50 (0.26)
Proposed System C    3.31 (0.29)   2.81 (0.22)   2.53 (0.21)   3.50 (0.25)
In the ABX test for SS, the listeners were asked to select between randomly played A and B samples (generated with the baseline and the proposed system C) based on their SS with reference to the actual target speaker's speech signal X. Eight samples per approach were used in the ABX test. The subjects gave equal preference to both systems. This result indicates that more accurate alignment may not lead to a better converted voice in terms of SS; however, it does lead to better speech quality of the converted voice.
Table 2: MCD analysis. The number in brackets indicates the margin of error corresponding to the 95% confidence intervals.

                     M-M           M-F           F-M           F-F
Baseline             6.53 (0.34)   6.95 (1.00)   8.02 (1.29)   6.06 (0.93)
Proposed System C    6.41 (0.09)   6.76 (0.26)   7.85 (0.34)   6.02 (0.24)
The traditional Mel Cepstral Distortion (MCD) is used here as the objective measure [31]. It can be seen from Table 2 that the proposed system C performs better than the baseline in all cases.
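For reference, MCD between time-aligned converted and target MCC frames is commonly computed as below. Conventions vary (e.g., whether the 0th, energy, coefficient is excluded), so this is a sketch of the usual formulation rather than the exact script used here:

```python
import numpy as np

def mel_cepstral_distortion(mcc_conv, mcc_tgt):
    """Frame-averaged MCD (in dB) between time-aligned MCC sequences of
    shape (n_frames, n_coeffs). Skips the 0th (energy) coefficient,
    a common convention."""
    diff = mcc_conv[:, 1:] - mcc_tgt[:, 1:]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```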
Table 3 presents the Pearson Correlation Coefficient (PCC) of PA and MCD with the MOS and the SS. It is clear from Table 3 that the PCC between PA and MOS is 0.96, i.e., PA correlates strongly with MOS. This clearly indicates that better alignment leads to better speech quality. On the other hand, the PCC between PA and SS is lower than that between PA and MOS. For MCD, the PCC should ideally be -1, since a lower MCD value indicates a better-performing system. It is clearly seen that the traditional MCD does not correlate well with the MOS and the SS; low correlation between MCD and subjective scores has also been reported in the VC literature [32].
Table 3: PCC of % PA and MCD with the subjective scores.

        MOS     SS
PA      0.96    0.37
MCD    -0.30    0.10
4. SUMMARY AND CONCLUSIONS
In this study, we proposed to exploit a metric learning technique for finding the NN pairs in the INCA, instead of the standard Euclidean distance. Furthermore, we also proposed to use the learned metric only for the initial iteration of the INCA, since the metric is learned for actual acoustic features; during the other iterations of the INCA, the intermediate converted features may not represent true acoustic features. We compared our proposed system C with the baseline INCA and found that it performs better in terms of PA. Moreover, subjective as well as objective evaluations also confirm that the proposed system C performs better than the baseline system. In particular, the improvement in PA obtained with system C is clearly reflected in the MOS scores, with a PCC of 0.96. In the future, we plan to apply local metric learning in order to capture a local metric for each broad phonetic class.
5. REFERENCES

[1] Seyed Hamidreza Mohammadi and Alexander Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65-82, 2017.

[2] Tomi Kinnunen et al., "Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, 2017, pp. 5535-5539.

[3] Toru Nakashika, "Cab: An energy-based speaker clustering model for rapid adaptation in non-parallel voice conversion," in INTERSPEECH, Stockholm, Sweden, 2017, pp. 3369-3373.

[4] Chin-Cheng Hsu et al., "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," in INTERSPEECH, Stockholm, Sweden, 2017, pp. 3364-3368.

[5] Fuming Fang et al., "High quality nonparallel voice conversion based on cycle-consistent adversarial network," in ICASSP, Calgary, Canada, 2018, pp. 5279-5283.

[6] Nirmesh J. Shah, Maulik C. Madhavi, and Hemant A. Patil, "Unsupervised vocal tract length warped posterior features for non-parallel voice conversion," in INTERSPEECH, Hyderabad, India, 2018, pp. 1968-1972.

[7] Yuki Saito, Yusuke Ijima, Kyosuke Nishida, and Shinnosuke Takamichi, "Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors," in ICASSP, Calgary, Canada, 2018, pp. 5274-5278.

[8] Nirmesh J. Shah, Sreereaj R., Neil Shah, and Hemant A. Patil, "Novel inter mixture weighted GMM posteriorgram for DNN and GAN-based voice conversion," in Proceedings of Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference, Hawaii, USA, 2018, IEEE, pp. 1776-1781.

[9] D. Erro et al., "INCA algorithm for training voice conversion systems from nonparallel corpora," IEEE Trans. on Audio, Speech and Lang. Process., vol. 18, no. 5, pp. 944-953, 2010.

[10] Nirmesh J. Shah and Hemant A. Patil, "On the convergence of INCA algorithm," in APSIPA ASC, Kuala Lumpur, Malaysia, 2017, IEEE, pp. 559-562.

[11] Jui-Ting Huang et al., "Kernel metric learning for phonetic classification," in Workshop on Automatic Speech Recognition & Understanding (ASRU), Merano, Italy, 2009, pp. 141-145.

[12] Xiong Xiao et al., "Distance metric learning for kernel density-based acoustic model under limited training data conditions," in APSIPA ASC, Hong Kong, 2015, pp. 54-58.

[13] Nirmesh J. Shah and Hemant A. Patil, "Analysis of features and metrics for alignment in text-dependent voice conversion," in B. Uma Shankar et al. (Eds.), Lecture Notes in Computer Science (LNCS), Springer, PReMI, vol. 10597, pp. 299-307, 2017.

[14] Julian Martin Fernandez and Bart Farell, "Is perceptual space inherently non-Euclidean?," Journal of Mathematical Psychology, vol. 53, no. 2, pp. 86-91, 2009.

[15] Batuhan Gündoğdu and Murat Saraçlar, "Distance metric learning for posteriorgram based keyword search," in ICASSP, New Orleans, USA, 2017, pp. 5660-5664.

[16] Batuhan Gündoğdu et al., "Joint learning of distance metric and query model for posteriorgram based keyword search," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1318-1328, 2017.

[17] Nirmesh J. Shah and Hemant A. Patil, "Effectiveness of dynamic features in INCA and temporal context-INCA," in INTERSPEECH, Hyderabad, India, 2018, pp. 711-715.

[18] Brian Kulis et al., "Metric learning: A survey," Foundations and Trends® in Machine Learning, vol. 5, pp. 287-364, 2013.

[19] Kilian Q. Weinberger and Lawrence K. Saul, "Distance metric learning for large margin nearest neighbor classification," Journal of Machine Learning Research, vol. 10, no. Feb, pp. 207-244, 2009.

[20] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul, "Distance metric learning for large margin nearest neighbor classification," in Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, 2005, pp. 1473-1480.

[21] H. Benisty, D. Malah, and K. Crammer, "Non-parallel voice conversion using joint optimization of alignment by temporal context and spectral distortion," in ICASSP, Florence, Italy, 2014, pp. 7909-7913.

[22] Laurens van der Maaten and Geoffrey Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579-2605, 2008.

[23] John Kominek and Alan W. Black, "The CMU-ARCTIC speech databases," in 5th ISCA Workshop on Speech Synthesis, Pittsburgh, USA, 2004, pp. 223-224.

[24] Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, 1st Edition, Pearson Education India, 2006.

[25] E. Kreyszig, Introductory Functional Analysis with Applications, vol. 81, Wiley, New York, 1st edition, 1989.

[26] P. C. Mahalanobis, "Mahalanobis distance," in Proceedings National Institute of Science of India, 1936, vol. 49, pp. 234-256.

[27] Eric P. Xing et al., "Distance metric learning with application to clustering with side-information," in NIPS, Vancouver, Canada, 2002, vol. 15, p. 12.

[28] John S. Garofolo, "DARPA TIMIT acoustic-phonetic speech database," National Institute of Standards and Technology (NIST), USA, vol. 15, pp. 29-50, 1988.

[29] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in ICASSP, Seattle, WA, USA, 1998, pp. 285-288.

[30] Steven M. Kay, Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory, Upper Saddle River, New Jersey: Prentice Hall, 1998.

[31] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. on Audio, Speech and Lang. Process., vol. 15, no. 8, pp. 2222-2235, 2007.

[32] Avni Rajpal, Nirmesh J. Shah, Mohammadi Zaki, and Hemant A. Patil, "Quality assessment of voice converted speech using articulatory features," in ICASSP, New Orleans, USA, 2017, pp. 5515-5519.