A Comparison of Discrete Latent Variable Models
for Speech Representation Learning
Henry Zhou∗ (University of Toronto), Alexei Baevski (Facebook AI Research), Michael Auli (Facebook AI Research)
Abstract
Neural latent variable models enable the discovery of interesting structure in speech
audio data. This paper presents a comparison of two different approaches which
are broadly based on predicting future time-steps or auto-encoding the input signal.
Our study compares the representations learned by vq-vae and vq-wav2vec in terms
of sub-word unit discovery and phoneme recognition performance. Results show
that future time-step prediction with vq-wav2vec achieves better performance. The
best system achieves an error rate of 13.22 on the ZeroSpeech 2019 ABX phoneme
discrimination challenge.
1 Introduction
Speech contains structure that algorithms processing this data reason about in order to make good predictions. In the classical supervised learning paradigm, representation learning and making predictions based on these representations are intertwined. A major limitation of this approach is that it requires large amounts of labeled data. Because of this, there has been much recent interest in algorithms which learn good representations, or latent structure, of speech without supervision.
Latent variable models have been heavily studied for speech applications, including voice conversion [1] as well as speaker identification and verification [2]. Recently, discrete representation learning through quantization [3, 4] has been applied to inherently continuous data such as images and speech. If the latent structure is modeled as a discrete set of units, then it can be evaluated in terms of semantic meaning, such as in the ZeroSpeech challenge [5].
There are various methods for learning quantized latent features; in this study, we focus on two popular approaches. The first learns quantized latent features through autoencoding, which reconstructs the original signal, either the raw waveform or spectrogram features [4, 6]. The second learns latent features by predicting representations of future time-steps [7, 8, 9, 10].
In this study, we are interested in the quality of the discrete representations learnt by these two
methods. In particular, we perform pre-training with either vq-vae or vq-wav2vec and evaluate the
resulting representations in terms of phoneme recognition. This enables us to evaluate whether the
discrete representations are helpful in distinguishing phonemes. Next, we evaluate the discrete latents
on the ZeroSpeech ABX task to evaluate if the quantized features are able to discover phonetic
units. Our results show that representations learned by vq-wav2vec outperform vq-vae on both tasks.
Context prediction is therefore a better task to learn discrete latent speech representations compared
to autoencoding.
∗Work done while at Facebook during a Facebook AI Residency.
Preprint. Under review.
arXiv:2010.14230v1 [eess.AS] 24 Oct 2020
2 Approaches
In this section, we present the two methods and their architectures for unsupervised speech pre-training. The first approach is trained by reconstructing the input through a latent bottleneck between an encoder network and a decoder network [6], where the latent features serve as the discrete representation. The second approach learns by predicting the representations of future time-steps: the network is tasked with identifying true future time-steps among distractors, both of which are represented by discrete latent variables [8].
2.1 Autoencoding with vq-vae
vq-vae [6] learns vector quantized representations by reconstructing the input. The audio data is first encoded as a sequence of dense representations $z_e$ which is then vector quantized through online k-means clustering (see §2.3) to obtain discrete vectors $z_q$. Finally, an autoregressive decoder reconstructs the original waveform conditioned on past audio samples, the latent representation $z_q$, and optionally the speaker identity [6] (see Figure 1). To make the reconstruction feasible, the waveform is quantized to 256 levels through a mu-law transform [11]. The loss function for training vq-vae is

$$\mathcal{L}_{\text{vq-vae}} = -\log p(x \mid z_q(x)) + \mathcal{L}_{\text{k-means}} \quad (1)$$

The first term is the reconstruction loss of the audio, for which we minimize a negative log-likelihood. The second term denotes the loss for training the quantizer (§2.3).
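As a concrete illustration, the following is a minimal sketch of how the two terms of Equation (1) could be combined, assuming a PyTorch-style decoder that outputs logits over the 256 mu-law levels; the function and argument names are illustrative and not the authors' implementation.

```python
# Sketch of the vq-vae objective in Eq. (1); beta is a commitment weight (assumption).
import torch
import torch.nn.functional as F

def vqvae_loss(decoder_logits, target_mulaw, z_e, z_q, beta=0.25):
    """decoder_logits: (B, T, 256), target_mulaw: (B, T) integer mu-law levels,
    z_e: dense encoder output, z_q: selected codewords (same shape as z_e)."""
    # Reconstruction term: negative log-likelihood of the quantized waveform.
    recon = F.cross_entropy(decoder_logits.transpose(1, 2), target_mulaw)
    # k-means term from Eq. (3); detach() plays the role of the stop-gradient [.].
    commit = F.mse_loss(z_e, z_q.detach())      # (z_e - [z_q])^2: updates the encoder
    codebook = F.mse_loss(z_q, z_e.detach())    # ([z_e] - z_q)^2: updates the codewords
    return recon + commit + beta * codebook
```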
Figure 1: The vq-vae model is composed of an encoder which extracts dense features $z_e$ from the audio, a quantizer $q$ which maps the dense features to discrete features $z_q$, and a decoder $p(x_t \mid x_1, \ldots, x_{t-1}, z_q)$ which reconstructs the original audio.
2.2 Context-prediction with vq-wav2vec
The vq-wav2vec model [12] learns the quantized vectors by predicting the representations of future time-steps. It consists of two networks applied to the raw audio signal (Figure 2). An encoder network extracts dense features $z_e$ from the audio signal and a quantization module maps them to discrete features $z_q$. A context network then combines sequences of discrete features to obtain contextualized representations $c_i$. The model is trained to distinguish a true future dense sample $z_{e_{i+k}}$, up to $K$ time-steps ahead, from distractors $\tilde{z}$ drawn from a proposal distribution $p_n$ (Equation 2).

$$\mathcal{L}_{\text{wav2vec}} = -\sum_{i=1}^{T-k} \Big( \log \sigma\big(z_{e_{i+k}}^{\top} h_k(c_i)\big) + \lambda \, \mathbb{E}_{\tilde{z} \sim p_n} \big[ \log \sigma\big(-\tilde{z}^{\top} h_k(c_i)\big) \big] \Big) \quad (2)$$

Here $T$ is the sequence length, $\sigma(x) = 1/(1 + \exp(-x))$, and $\sigma\big(z_{e_{i+k}}^{\top} h_k(c_i)\big)$ is the probability that $z_{e_{i+k}}$ is the true sample.
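A rough sketch of the contrastive term in Equation (2) for a single step offset $k$ is given below, assuming distractors are sampled from other time-steps of the same utterance and that $h_k$ is a step-specific linear projection of the context vectors; all names are illustrative.

```python
# Sketch of one step-offset of the wav2vec contrastive loss in Eq. (2).
import torch
import torch.nn.functional as F

def wav2vec_step_loss(z, c, h_k, k, num_negatives=10, lam=1.0):
    """z: (T, D) dense targets, c: (T, D) context vectors, h_k: e.g. nn.Linear(D, D)."""
    T = z.size(0)
    preds = h_k(c[: T - k])                  # predictions h_k(c_i) for positions i + k
    pos = z[k:]                              # true future samples z_{i+k}
    pos_logits = (preds * pos).sum(-1)       # z_{i+k}^T h_k(c_i)
    loss = -F.logsigmoid(pos_logits).sum()
    # Distractors drawn uniformly from other positions (proposal p_n).
    for _ in range(num_negatives):
        idx = torch.randint(0, T, (T - k,))
        neg_logits = (preds * z[idx]).sum(-1)
        loss = loss - lam * F.logsigmoid(-neg_logits).sum() / num_negatives
    return loss
```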
Figure 2: vq-wav2vec uses an encoder and a quantizer to compute dense and discrete features $z_e$ and $z_q$, respectively. An aggregator combines past discrete features into context representations $c$. The loss is based on predicting future discrete latent speech representations from the context features.
2.3 Vector quantization
The above methods use either k-means [4] or the Gumbel-Softmax [13] to quantize high-dimensional dense vectors. A vector quantization layer maps a dense representation $z_e$ to a discrete representation $z_q$ from a codebook $\{e_i \in \mathbb{R}^D, i \in [1 \ldots K]\}$ with $K$ entries.
2.3.1 K-means
K-means chooses the vector in the codebook which has the smallest Euclidean distance to the input vector. During training, the loss is augmented by the following term:

$$\mathcal{L}_{\text{k-means}} = \big(z_e - [z_q]\big)^2 + \beta \big([z_e] - z_q\big)^2 \quad (3)$$

where $[x] \triangleq x$, $\frac{d}{dx}[x] \triangleq 0$ is the stop-gradient operator. During training, the gradient propagates through the dense vector $z_e$, while the selected codeword $z_q$ is updated to minimize its Euclidean distance to $z_e$.
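The following is a minimal sketch of this k-means quantization with a straight-through gradient, under the simplifying assumption of a single codebook; names are illustrative.

```python
# Sketch of nearest-codeword quantization with a straight-through estimator.
import torch

def kmeans_quantize(z_e, codebook):
    """z_e: (N, D) dense vectors, codebook: (K, D) codewords."""
    dists = torch.cdist(z_e, codebook)       # Euclidean distances, shape (N, K)
    idx = dists.argmin(dim=-1)               # nearest codeword per vector
    z_q = codebook[idx]
    # Straight-through: forward pass uses z_q, backward pass flows into z_e.
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, idx
```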
2.3.2 Gumbel-Softmax
The Gumbel-Softmax [13] version of vq-wav2vec hard-selects a codeword $z_q \in \mathbb{R}^D$ based on a linear projection $l \in \mathbb{R}^K$ and is fully differentiable. The probability of selecting the $j$-th codeword is

$$p_j = \frac{\exp\big((l_j + v_j)/\tau\big)}{\sum_{k=1}^{K} \exp\big((l_k + v_k)/\tau\big)} \quad (4)$$

in which $v = -\log(-\log(u))$ with $u \sim \mathcal{U}(0, 1)$, and $\tau$ is the Gumbel-Softmax temperature. During the forward pass, the hard index $\operatorname{argmax}_j p_j$ is selected, and during the backward pass, the true gradient with respect to the logits is used.
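A sketch of this selection using PyTorch's built-in Gumbel-Softmax is shown below; the hard one-hot sample is used in the forward pass while gradients follow the soft probabilities. This is illustrative only, not the authors' code.

```python
# Sketch of hard Gumbel-Softmax codeword selection as in Eq. (4).
import torch
import torch.nn.functional as F

def gumbel_select(logits, codebook, tau=1.0):
    """logits: (N, K) linear projection l of z_e, codebook: (K, D)."""
    # hard=True returns one-hot vectors whose gradient is that of the soft sample.
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)   # (N, K)
    z_q = one_hot @ codebook                                  # picks the argmax codeword
    return z_q
```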
2.3.3 Codebook
To avoid the mode collapse problem of discrete latent models [14], we use several codebooks, or groups [15]. A codebook is parametrized by $D$, the dimension of the vector, $G$, the number of groups in a codebook, and $K$, the number of codewords within a group.

The codebook $\{z_q^i \in \mathbb{R}^{D/G} \mid i \in [1 \ldots M]\}$, where $M = K \cdot G$, can represent $K^G$ distinct vectors of dimension $D$ and has a total of $K \cdot D$ parameters. We follow [12] and share codewords across groups. In this way, the effective codebook can be represented as a matrix of shape $K \times (D/G)$.
In a forward pass during inference, a dense feature $z_e \in \mathbb{R}^d$ is first organized into $G$ groups, giving the matrix form $z_e' \in \mathbb{R}^{G \times (d/G)}$. Each row $j$ of $z_e'$ is then converted into an integer index $i_j \in [1 \ldots K]$, either by nearest distance (vq-vae or vq-wav2vec k-means), $i_j = \operatorname{argmin}_{k \in [1 \ldots K]} \| z_{e_j}' - e_k \|_2$, or by the largest projected logit (vq-wav2vec Gumbel), $i_j = \operatorname{argmax}_{k} l_{j,k}$ (see Figure 3). During training with the Gumbel-Softmax, the true probability is used for backpropagation.
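The grouped forward pass described above might look roughly as follows for the k-means case, assuming a single codebook matrix shared across groups; shapes and names are illustrative.

```python
# Sketch of the grouped-codebook forward pass: split a dense vector into G groups
# and map each group to one of K shared codewords.
import torch

def grouped_indices(z_e, codebook, G):
    """z_e: (D,) dense feature, codebook: (K, D // G) shared across groups."""
    z = z_e.view(G, -1)                      # reshape to (G, D/G)
    dists = torch.cdist(z, codebook)         # distances to all codewords, (G, K)
    idx = dists.argmin(dim=-1)               # one integer index per group
    z_q = codebook[idx].reshape(-1)          # concatenate the selected codewords
    return z_q, idx
```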
Figure 3: Two different methods for vector quantization. Left: k-means selects a discrete latent variable based on the Euclidean distance to the dense representation $z_e$ in the vector space; both vq-vae and the k-means version of vq-wav2vec use this method. Right: the Gumbel-Softmax based approach; $z_e$ is projected onto a logit vector and the latent with the largest logit is output.
2.3.4 Codebook usage penalty
Similar to [16, 17], we apply a penalty to encourage good usage of the codebook. The softmax distribution $\bar{p}_g$ is encouraged to be uniform across the $K$ codewords within each codebook. The entropy loss maximizes the entropy of the average probability of selection $\bar{p}_{g,k}$ across groups $g \in \{1 \ldots G\}$ and codewords $k \in \{1 \ldots K\}$ within a training batch:

$$\mathcal{L}_{\text{diversity}} = -\frac{1}{GK} \sum_{g=1}^{G} H(\bar{p}_g) = \frac{1}{GK} \sum_{g=1}^{G} \sum_{k=1}^{K} \bar{p}_{g,k} \log \bar{p}_{g,k} \quad (5)$$
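A sketch of Equation (5) is shown below, assuming soft codeword probabilities are available for every group over a training batch (illustrative shapes and names).

```python
# Sketch of the diversity penalty in Eq. (5).
import torch

def diversity_loss(probs):
    """probs: (B*T, G, K) softmax over codewords for each group and time-step."""
    p_bar = probs.mean(dim=0)                         # average selection probability (G, K)
    G, K = p_bar.shape
    # Negative entropy: minimizing this term encourages uniform codebook usage.
    return (p_bar * torch.log(p_bar + 1e-7)).sum() / (G * K)
```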
3 Experimental Setup
3.1 Datasets
Models are pre-trained on Librispeech [18], unless otherwise mentioned. For evaluation we consider TIMIT [19], a 5-hour dataset with phoneme and speaker labels, and the ZeroSpeech 2019 [5] unit discovery dataset, which contains 20 hours of audio without alignments, text, or labels. For TIMIT, we apply the standard evaluation protocol and collapse the detailed phoneme labels into 39 classes.
3.2 Encoder Architecture
All approaches use the fairseq implementation [20] of the wav2vec encoder for raw audio feature extraction [7]. The encoder contains 8 layers with 512 channels each, with kernel sizes (10, 8, 4, 4, 4, 1, 1, 1) and strides (5, 4, 2, 2, 2, 1, 1, 1), yielding a total receptive field of about 30 ms.
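A small helper, included here only as a sanity check, reproduces the quoted receptive field from these kernels and strides: 465 samples, or roughly 29 ms at the 16 kHz sampling rate of Librispeech, with an overall stride of 160 samples (10 ms).

```python
# Verify the encoder geometry quoted above.
kernels = [10, 8, 4, 4, 4, 1, 1, 1]
strides = [5, 4, 2, 2, 2, 1, 1, 1]

rf, jump = 1, 1
for k, s in zip(kernels, strides):
    rf += (k - 1) * jump   # receptive field grows by (k - 1) times the current jump
    jump *= s              # jump (output stride in input samples) multiplies by the stride
print(rf, jump, rf / 16000 * 1000)  # 465 samples, stride 160, about 29 ms
```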
3.3 Training details
For vq-vae, we follow the same training scheme as [6] but use the encoder architecture of vq-wav2vec. We train for 300k updates using a cosine learning rate schedule [21] with an initial learning rate of $2 \times 10^{-4}$, which is annealed to $1 \times 10^{-6}$. The batch size is 64, with each audio sample having a length of 32 ms [6]. We experiment with inputting the speaker ID to the decoder in order to learn speaker-independent representations.
Table 1: Comparison of vq-wav2vec with different quantizers as well as vq-vae (with k-means quantizer) with and without speaker ID input to the decoder. We show the phoneme error rate on the TIMIT test set and the error rate for the ZeroSpeech ABX evaluation when training on the ZS training set (ZS) as well as when training on Librispeech (ZS(LS)).

Model | TIMIT | ZS(LS) | ZS
vq-cpc [25] | - | - | 13.4
vq-wav2vec (Gumbel) | 16.54 | 14.12 | 15.37
vq-wav2vec (k-means) | 17.64 | 12.72 | 13.22
vq-vae (w/ speaker) | 19.99 | 18.61 | 18.73
vq-vae (w/o speaker) | 19.34 | 18.45 | 19.29
For vq-wav2vec, we also train for 300k updates, warm up the learning rate from $1 \times 10^{-7}$ to $1 \times 10^{-4}$ over 500 updates, and then anneal it to $1 \times 10^{-5}$ with a cosine schedule. We use a smaller batch size of 20, since we use larger segments of 150k frames (about 9.3 seconds). Unless otherwise mentioned, we use $K = 320$, $G = 2$, which has been found to be effective in previous work [8].
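For illustration, a warmup-plus-cosine schedule matching the numbers quoted above for vq-wav2vec could be written as follows; the exact shape of the warmup is an assumption.

```python
# Illustrative warmup + cosine annealing schedule (assumed linear warmup).
import math

def lr_at(step, warmup=500, total=300_000,
          lr_init=1e-7, lr_peak=1e-4, lr_final=1e-5):
    if step < warmup:  # linear warmup from lr_init to lr_peak
        return lr_init + (lr_peak - lr_init) * step / warmup
    # cosine annealing from lr_peak down to lr_final
    progress = (step - warmup) / (total - warmup)
    return lr_final + 0.5 * (lr_peak - lr_final) * (1 + math.cos(math.pi * progress))
```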
To evaluate the quality of the latent representations for phoneme recognition, we use a wav2letter acoustic model [22], trained for 1000 epochs on 8 GPUs on TIMIT. The acoustic model is a convolutional neural network fed with raw audio and optimized with an auto segmentation criterion.
4 Results
4.1 Comparison of methods
We train vq-vae and vq-wav2vec either on all audio data of Librispeech (960 h) or on the training data of ZeroSpeech 2019 (20 h; [5]). In our setup, the $z_q$ of both models have the same receptive field. For vq-vae we use the k-means quantization scheme of [4] with a single codebook of 4096 entries and a latent frequency of 50 Hz (§3.2), which we found to work best on the validation set. For vq-wav2vec we consider both Gumbel ($G = 8$, $K = 8$) and k-means ($G = 2$, $K = 320$) quantization, also with a latent frequency of 50 Hz.
We evaluate the discrete representations output by the quantization modules of each model in several
setups: For TIMIT phoneme recognition we extract discrete features from the quantizer of vq-vae
and vq-wav2vec and feed them into the acoustic model. For the ABX task of ZeroSpeech 2019 we
consider models trained on all of Librispeech (ZS(LS)) or the much smaller training data provided
by the task (ZS). The task evaluates whether the learned representations capture acoustic units. We
average all representations for a given sequence and then use a cosine distance to measure similarity
between representations.
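A minimal sketch of this similarity computation is shown below; the actual ABX scoring follows the official ZeroSpeech evaluation, so this is only illustrative.

```python
# Sketch: average frame-level features per sequence and compare with cosine distance.
import torch
import torch.nn.functional as F

def sequence_distance(feats_a, feats_b):
    """feats_a, feats_b: (T, D) frame-level representations of two sequences."""
    mean_a = feats_a.mean(dim=0)
    mean_b = feats_b.mean(dim=0)
    return 1.0 - F.cosine_similarity(mean_a, mean_b, dim=0)
```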
Table 1 shows that learning latent discrete representations of speech with context prediction (vq-
wav2vec) performs generally better than autoencoding (vq-vae). Reasoning about temporal infor-
mation appears to result in better representations than reconstructing the raw audio. Self-supervised
work in natural language processing also heavily relies on objectives that require predicting future
information or missing parts of the sequence [23, 24].
For vq-vae, we do not see a large effect of using the speaker ID. We also compare to the recently introduced vq-cpc [25]. Similar to vq-wav2vec, vq-cpc combines vq-vae and CPC [26]. However, unlike vq-wav2vec, it takes log-mel filterbanks as input instead of raw speech, uses a much larger receptive field, and uses an RNN decoder instead of a WaveNet decoder. Finally, the authors perform model selection based on performance on the ZeroSpeech test set, whereas we select models based on TIMIT validation performance. vq-wav2vec (k-means) performs slightly better than their result and achieves a 13.22 error rate on the ABX task.
Table 2: ABX performance on ZS(LS), cf. Table 1, for different numbers of entries in each codebook (K) and numbers of codebooks or groups (G).

Model | Codebook (K x G) | ZS(LS)
vq-wav2vec (Gumbel) | 4 x 8 | 14.2
vq-wav2vec (Gumbel) | 8 x 8 | 14.12
vq-wav2vec (Gumbel) | 320 x 2 | 14.18
vq-wav2vec (Gumbel) | 512 x 1 | 14.77
vq-wav2vec (k-means) | 4 x 8 | 15.06
vq-wav2vec (k-means) | 8 x 8 | 13.71
vq-wav2vec (k-means) | 320 x 2 | 12.72
vq-wav2vec (k-means) | 512 x 1 | 18.41
vq-vae (w/ speaker) | 4 x 8 | 21.97
vq-vae (w/ speaker) | 8 x 8 | 24.93
vq-vae (w/ speaker) | 512 x 1 | 18.45
vq-vae (w/ speaker) | 320 x 2 | 19.59
Figure 4: Visualization of the co-occurrence between discrete latent speech representations and phonemes. We plot the conditional probability $P(\text{phoneme} \mid z_{q_t})$ on the TIMIT training data. The y-axis shows the 39 collapsed phoneme classes and the x-axis ranges over the different discrete latents.
4.2 Codebook architecture of the quantizer
In Table 2 we show that both models are sensitive to the choice of codebook architecture. In general, vq-wav2vec performs better when using multiple groups ($G > 1$), whereas vq-vae performs best when using a single codebook ($G = 1$).
4.3 Analysis
Next, we compute the co-occurrence between human-annotated phonemes and the discrete latent features produced by a vq-wav2vec model pre-trained on the 20-hour ZeroSpeech data. We extract the discrete features $z_q$ on the TIMIT training data without any fine-tuning. The TIMIT training data contains 3696 utterances with an average length of 13.6 sec, equivalent to 563k discrete latents.
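The co-occurrence statistic plotted in Figure 4 can be computed roughly as follows, assuming frame-level latent indices aligned with collapsed phoneme labels; names are illustrative.

```python
# Sketch: count latent/phoneme co-occurrences and normalize per latent
# to obtain P(phoneme | z_q).
import numpy as np

def cooccurrence(latent_ids, phoneme_ids, num_latents, num_phonemes=39):
    """latent_ids, phoneme_ids: aligned integer arrays of equal length."""
    counts = np.zeros((num_latents, num_phonemes))
    np.add.at(counts, (latent_ids, phoneme_ids), 1)
    # Conditional probability of each phoneme given the discrete latent.
    return counts / np.clip(counts.sum(axis=1, keepdims=True), 1, None)
```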
Figure 4 shows that many latents specialize in specific phonemes. A large number of latents correspond to phonemes containing vowels, e.g., aa, ae, ah. Similarly, many discrete latents co-occur with silence (bcl), which is a frequent label in the TIMIT data.
5 Conclusions
We presented a comparison of two prevalent methods for learning discrete latent speech representations, evaluated in terms of classical TIMIT phoneme recognition as well as the more recent ZeroSpeech ABX phoneme discrimination task. Results show that predicting future time-steps is a more effective training task for learning these representations. Future work includes exploring other objectives that require models to learn even richer representations.
References
[1] T. Nakashika, T. Takiguchi, and Y. Ariki. Voice conversion using speaker-dependent conditional restricted Boltzmann machine. EURASIP Journal on Audio, Speech, and Music Processing, 2015, December 2015. doi: 10.1186/s13636-014-0044-3.
[2] C.-I Lai. Contrastive predictive coding based feature for automatic speaker verification. CoRR, abs/1904.01575, 2019.
[3] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama. Learning discrete representations via information maximizing self-augmented training, 2017.
[4] A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Proc. of NeurIPS, 2017.
[5] S. Dieleman, A. van den Oord, and K. Simonyan. The challenge of realistic music generation: modelling raw audio at scale. CoRR, abs/1806.10474, 2018. URL http://arxiv.org/abs/1806.10474.
[6] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord. Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 27(12):2041–2053, December 2019. doi: 10.1109/TASLP.2019.2938863.
[7] S. Schneider, A. Baevski, R. Collobert, and M. Auli. wav2vec: Unsupervised pre-training for speech recognition. In Proc. of Interspeech, 2019.
[8] A. Baevski, S. Schneider, and M. Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. In Proc. of ICLR, 2020.
[9] A. H. Liu, T. Tu, H. Lee, and L. Lee. Towards unsupervised speech recognition and synthesis with quantized speech representation learning. arXiv, 2019.
[10] D. Harwath, W. Hsu, and J. Glass. Learning hierarchical discrete linguistic units from visually-grounded speech. In Proc. of ICLR, 2020.
[11] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.
[12] A. Baevski, S. Schneider, and M. Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. In Proc. of ICLR, 2020.
[13] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-Softmax, 2016.
[14] L. Kaiser, A. Roy, A. Vaswani, N. Parmar, S. Bengio, J. Uszkoreit, and N. Shazeer. Fast decoding in sequence models using discrete latent variables. CoRR, abs/1803.03382, 2018.
[15] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011.
[16] S. Dieleman, A. van den Oord, and K. Simonyan. The challenge of realistic music generation: modelling raw audio at scale. In Proc. of NeurIPS, 2018.
[17] A. Baevski, H. Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proc. of NeurIPS, 2020.
[18] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In Proc. of ICASSP, pages 5206–5210, 2015.
[19] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM, 1993.
[20] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proc. of NAACL System Demonstrations, 2019.
[21] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with restarts. CoRR, abs/1608.03983, 2016.
[22] V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve, V. Liptchinsky, and R. Collobert. wav2letter++: A fast open-source speech recognition system. In Proc. of ICASSP, 2019.
[23] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL, 2019.
[24] A. Baevski, S. Edunov, Y. Liu, L. Zettlemoyer, and M. Auli. Cloze-driven pretraining of self-attention networks. In Proc. of EMNLP, 2019.
[25] B. van Niekerk, L. Nortje, and H. Kamper. Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge. arXiv, 2020.
[26] A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv, abs/1807.03748, 2018.