Connectionist Temporal Classification: Labelling Unsegmented
Sequence Data with Recurrent Neural Networks
Alex Graves¹ (alex@idsia.ch)
Santiago Fernández¹ (santiago@idsia.ch)
Faustino Gomez¹ (tino@idsia.ch)
Jürgen Schmidhuber¹,² (juergen@idsia.ch)
¹ Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA), Galleria 2, 6928 Manno-Lugano, Switzerland
² Technische Universität München (TUM), Boltzmannstr. 3, 85748 Garching, Munich, Germany
Abstract
Many real-world sequence learning tasks re-
quire the prediction of sequences of labels
from noisy, unsegmented input data. In
speech recognition, for example, an acoustic
signal is transcribed into words or sub-word
units. Recurrent neural networks (RNNs) are
powerful sequence learners that would seem
well suited to such tasks. However, because
they require pre-segmented training data,
and post-processing to transform their out-
puts into label sequences, their applicability
has so far been limited. This paper presents a
novel method for training RNNs to label un-
segmented sequences directly, thereby solv-
ing both problems. An experiment on the
TIMIT speech corpus demonstrates its ad-
vantages over both a baseline HMM and a
hybrid HMM-RNN.
1. Introduction
Labelling unsegmented sequence data is a ubiquitous
problem in real-world sequence learning. It is partic-
ularly common in perceptual tasks (e.g. handwriting
recognition, speech recognition, gesture recognition)
where noisy, real-valued input streams are annotated
with strings of discrete labels, such as letters or words.
Currently, graphical models such as hidden Markov models (HMMs; Rabiner, 1989), conditional random fields (CRFs; Lafferty et al., 2001) and their variants, are the predominant framework for sequence labelling. While these approaches have proved successful for many problems, they have several drawbacks: (1) they usually require a significant amount of task-specific knowledge, e.g. to design the state models for HMMs, or choose the input features for CRFs; (2) they require explicit (and often questionable) dependency assumptions to make inference tractable, e.g. the assumption that observations are independent for HMMs; (3) for standard HMMs, training is generative, even though sequence labelling is discriminative.
Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).
Recurrent neural networks (RNNs), on the other hand,
require no prior knowledge of the data, beyond the
choice of input and output representation. They can
be trained discriminatively, and their internal state
provides a powerful, general mechanism for modelling
time series. In addition, they tend to be robust to
temporal and spatial noise.
So far, however, it has not been possible to apply
RNNs directly to sequence labelling. The problem is
that the standard neural network objective functions
are defined separately for each point in the training se-
quence; in other words, RNNs can only be trained to
make a series of independent label classifications. This
means that the training data must be pre-segmented,
and that the network outputs must be post-processed
to give the final label sequence.
At present, the most effective use of RNNs for sequence labelling is to combine them with HMMs in the so-called hybrid approach (Bourlard & Morgan, 1994; Bengio, 1999). Hybrid systems use HMMs to model the long-range sequential structure of the data, and neural nets to provide localised classifications. The HMM component is able to automatically segment the sequence during training, and to transform the network classifications into label sequences. However, as well as inheriting the aforementioned drawbacks of HMMs, hybrid systems do not exploit the full potential of RNNs for sequence modelling.
This paper presents a novel method for labelling se-
quence data with RNNs that removes the need for pre-
segmented training data and post-processed outputs,
and models all aspects of the sequence within a single
network architecture. The basic idea is to interpret
the network outputs as a probability distribution over
all possible label sequences, conditioned on a given in-
put sequence. Given this distribution, an objective
function can be derived that directly maximises the
probabilities of the correct labellings. Since the objec-
tive function is differentiable, the network can then be
trained with standard backpropagation through time
(Werbos, 1990).
In what follows, we refer to the task of labelling un-
segmented data sequences as temporal classification
(Kadous, 2002), and to our use of RNNs for this pur-
pose as connectionist temporal classification (CTC).
By contrast, we refer to the independent labelling of
each time-step, or frame, of the input sequence as
framewise classification.
The next section provides the mathematical formalism
for temporal classification, and defines the error mea-
sure used in this paper. Section 3 describes the output
representation that allows RNNs to be used as tempo-
ral classifiers. Section 4 explains how CTC networks
can be trained. Section 5 compares CTC to hybrid and
HMM systems on the TIMIT speech corpus. Section 6
discusses some key differences between CTC and other
temporal classifiers, giving directions for future work,
and the paper concludes with section 7.
2. Temporal Classification
Let $S$ be a set of training examples drawn from a fixed distribution $\mathcal{D}_{\mathcal{X} \times \mathcal{Z}}$. The input space $\mathcal{X} = (\mathbb{R}^m)^*$ is the set of all sequences of $m$-dimensional real-valued vectors. The target space $\mathcal{Z} = L^*$ is the set of all sequences over the (finite) alphabet $L$ of labels. In general, we refer to elements of $L^*$ as label sequences or labellings. Each example in $S$ consists of a pair of sequences $(\mathbf{x}, \mathbf{z})$. The target sequence $\mathbf{z} = (z_1, z_2, \ldots, z_U)$ is at most as long as the input sequence $\mathbf{x} = (x_1, x_2, \ldots, x_T)$, i.e. $U \le T$. Since the input and target sequences are not generally the same length, there is no a priori way of aligning them.
The aim is to use $S$ to train a temporal classifier $h : \mathcal{X} \mapsto \mathcal{Z}$ to classify previously unseen input sequences in a way that minimises some task-specific error measure.
2.1. Label Error Rate
In this paper, we are interested in the following error measure: given a test set $S' \subset \mathcal{D}_{\mathcal{X} \times \mathcal{Z}}$ disjoint from $S$, define the label error rate (LER) of a temporal classifier $h$ as the mean normalised edit distance between its classifications and the targets on $S'$, i.e.
$$\mathrm{LER}(h, S') = \frac{1}{|S'|} \sum_{(\mathbf{x},\mathbf{z}) \in S'} \frac{\mathrm{ED}(h(\mathbf{x}), \mathbf{z})}{|\mathbf{z}|} \qquad (1)$$
where $\mathrm{ED}(\mathbf{p}, \mathbf{q})$ is the edit distance between two sequences $\mathbf{p}$ and $\mathbf{q}$, i.e. the minimum number of insertions, substitutions and deletions required to change $\mathbf{p}$ into $\mathbf{q}$.
This is a natural measure for tasks (such as speech or
handwriting recognition) where the aim is to minimise
the rate of transcription mistakes.
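As an illustration of equation (1), the following minimal sketch computes the edit distance with the usual dynamic-programming table and averages the normalised distances over a test set. The function names and the list-of-lists data layout are assumptions made for this example; they do not come from the paper.

```python
def edit_distance(p, q):
    """Minimum number of insertions, substitutions and deletions turning p into q."""
    prev = list(range(len(q) + 1))  # distances from the empty prefix of p
    for i, a in enumerate(p, start=1):
        curr = [i]
        for j, b in enumerate(q, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def label_error_rate(predictions, targets):
    """Mean of ED(h(x), z) / |z| over paired predictions and targets."""
    ratios = [edit_distance(h, z) / len(z) for h, z in zip(predictions, targets)]
    return sum(ratios) / len(ratios)
```

For example, predicting `["dh", "ax", "s"]` for the target `["dh", "ax", "s", "aw"]` costs one insertion, so that utterance contributes 1/4 to the mean.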
3. Connectionist Temporal Classification
This section describes the output representation that allows a recurrent neural network to be used for CTC. The crucial step is to transform the network outputs into a conditional probability distribution over label sequences. The network can then be used as a classifier by selecting the most probable labelling for a given input sequence.
3.1. From Network Outputs to Labellings
A CTC network has a softmax output layer (Bridle,
1990) with one more unit than there are labels in L.
The activations of the first |L|units are interpreted as
the probabilities of observing the corresponding labels
at particular times. The activation of the extra unit
is the probability of observing a ‘blank’, or no label.
Together, these outputs define the probabilities of all
possible ways of aligning all possible label sequences
with the input sequence. The total probability of any
one label sequence can then be found by summing the
probabilities of its different alignments.
More formally, for an input sequence $\mathbf{x}$ of length $T$, define a recurrent neural network with $m$ inputs, $n$ outputs and weight vector $w$ as a continuous map $\mathcal{N}_w : (\mathbb{R}^m)^T \mapsto (\mathbb{R}^n)^T$. Let $\mathbf{y} = \mathcal{N}_w(\mathbf{x})$ be the sequence of network outputs, and denote by $y^t_k$ the activation of output unit $k$ at time $t$. Then $y^t_k$ is interpreted as the probability of observing label $k$ at time $t$, which defines a distribution over the set $L'^T$ of length-$T$ sequences over the alphabet $L' = L \cup \{blank\}$:
$$p(\pi|\mathbf{x}) = \prod_{t=1}^{T} y^t_{\pi_t}, \quad \forall \pi \in L'^T. \qquad (2)$$
[Figure 1 (image not reproduced): panels "Waveform", "Framewise" and "CTC" for the utterance "the sound of"; vertical axis: label probability, from 0 to 1; phoneme labels shown: dh, ax, s, aw, n, dcl, d, ix, v.]
Figure 1. Framewise and CTC networks classifying a speech signal. The shaded lines are the output activations, corresponding to the probabilities of observing phonemes at particular times. The CTC network predicts only the sequence of phonemes (typically as a series of spikes, separated by ‘blanks’, or null predictions), while the framewise network attempts to align them with the manual segmentation (vertical lines). The framewise network receives an error for misaligning the segment boundaries, even if it predicts the correct phoneme (e.g. ‘dh’). When one phoneme always occurs beside another (e.g. the closure ‘dcl’ with the stop ‘d’), CTC tends to predict them together in a double spike. The choice of labelling can be read directly from the CTC outputs (follow the spikes), whereas the predictions of the framewise network must be post-processed before use.
From now on, we refer to the elements of $L'^T$ as paths, and denote them $\pi$.
Implicit in (2) is the assumption that the network outputs at different times are conditionally independent, given the internal state of the network. This is ensured by requiring that no feedback connections exist from the output layer to itself or the network.
The next step is to define a many-to-one map $\mathcal{B} : L'^T \mapsto L^{\le T}$, where $L^{\le T}$ is the set of possible labellings (i.e. the set of sequences of length less than or equal to $T$ over the original label alphabet $L$). We do this by first merging consecutive repeated labels and then removing all blanks from the paths (e.g. $\mathcal{B}(a{-}ab{-}) = \mathcal{B}({-}aa{-}{-}abb) = aab$, where $-$ denotes the blank). Intuitively, this corresponds to outputting a new label when the network switches from predicting no label to predicting a label, or from predicting one label to another (cf. the CTC outputs in figure 1). Finally, we use $\mathcal{B}$ to define the conditional probability of a given labelling $\mathbf{l} \in L^{\le T}$ as the sum of the probabilities of all the paths corresponding to it:
$$p(\mathbf{l}|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} p(\pi|\mathbf{x}). \qquad (3)$$
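To make equations (2) and (3) concrete, here is a small sketch of the map B together with a brute-force evaluation of p(l|x) that enumerates every path. The symbol `"-"` for the blank, the dictionary layout of the outputs and the function names are assumptions of this example. The enumeration is exponential in T and is only meant for checking tiny cases by hand.

```python
from itertools import groupby, product

BLANK = "-"  # hypothetical symbol standing in for the blank unit


def collapse(path):
    """The map B: merge consecutive repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return tuple(s for s in merged if s != BLANK)


def labelling_probability(y, alphabet, labelling):
    """Brute-force p(l|x): sum p(pi|x) over every path pi with B(pi) = l.

    y[t] is a dict mapping each symbol (labels and BLANK) to its softmax
    output at frame t.
    """
    symbols = list(alphabet) + [BLANK]
    total = 0.0
    for path in product(symbols, repeat=len(y)):
        if collapse(path) == tuple(labelling):
            prob = 1.0
            for t, symbol in enumerate(path):
                prob *= y[t][symbol]  # equation (2)
            total += prob
    return total
```

With two frames and a single label `a`, the labelling `a` collects the three paths `aa`, `a-` and `-a`, while the empty labelling collects only `--`.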
3.2. Constructing the Classifier
Given the above formulation, the output of the classifier should be the most probable labelling for the input sequence:
$$h(\mathbf{x}) = \arg\max_{\mathbf{l} \in L^{\le T}} p(\mathbf{l}|\mathbf{x}).$$
Using the terminology of HMMs, we refer to the task of finding this labelling as decoding. Unfortunately, we do not know of a general, tractable decoding algorithm for our system. However, the following two approximate methods give good results in practice.
The first method (best path decoding) is based on the assumption that the most probable path will correspond to the most probable labelling:
$$h(\mathbf{x}) \approx \mathcal{B}(\pi^*) \qquad (4)$$
where $\pi^* = \arg\max_{\pi \in N^t} p(\pi|\mathbf{x})$.
Best path decoding is trivial to compute, since $\pi^*$ is just the concatenation of the most active outputs at every time-step. However, it is not guaranteed to find the most probable labelling.
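Best path decoding can be sketched in a few lines: take the most active output at each frame, then apply the collapsing map. As before, the blank symbol `"-"` and the dictionary layout of the outputs are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "-"  # hypothetical symbol standing in for the blank unit


def best_path_decode(y):
    """Pick the most active output per frame, then merge repeats and drop blanks.

    y[t] is a dict mapping each symbol (labels and BLANK) to its softmax
    output at frame t.
    """
    path = [max(frame, key=frame.get) for frame in y]
    merged = [symbol for symbol, _ in groupby(path)]
    return [s for s in merged if s != BLANK]
```

A two-frame case shows why the result is only approximate. If both frames give the blank probability 0.6 and the label `a` probability 0.4, the best path is `--` with probability 0.36, so the decoder returns the empty labelling. The labelling `a`, however, collects the paths `aa`, `a-` and `-a`, for a total of 0.16 + 0.24 + 0.24 = 0.64.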
The second method (prefix search decoding) relies on
the fact that, by modifying the forward-backward al-
gorithm of section 4.1, we can efficiently calculate the
probabilities of successive extensions of labelling pre-
fixes (figure 2).
Given enough time, prefix search decoding always finds
the most probable labelling. However, the maximum
number of prefixes it must expand grows exponentially
with the input sequence length. If the output distri-
bution is sufficiently peaked around the mode, it will
nonetheless finish in reasonable time. For the exper-
iment in this paper though, a further heuristic was
required to make its application feasible.
Observing that the outputs of a trained CTC network tend to form a series of spikes separated by strongly predicted blanks (figure 1), we divide the output sequence into sections that are very likely to begin and end with a blank. We do this by choosing boundary points where the probability of observing a blank label is above a certain threshold. We then calculate the most probable labelling for each section individually and concatenate these to get the final classification.
In practice, prefix search works well with this heuristic, and generally outperforms best path decoding. However, it does fail in some cases, e.g. if the same label is predicted weakly on both sides of a section boundary.
Figure 2. Prefix search decoding on the label alphabet X, Y. Each node either ends (‘e’) or extends the prefix at its parent node. The number above an extending node is the total probability of all labellings beginning with that prefix. The number above an end node is the probability of the single labelling ending at its parent. At every iteration the extensions of the most probable remaining prefix are explored. Search ends when a single labelling (here ‘XY’) is more probable than any remaining prefix.
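The sectioning heuristic can be sketched as follows. The text does not say exactly how boundary frames are handled, so this example makes one assumption: frames whose blank probability exceeds the threshold are treated as cut points and left out of the sections. The function name and return format are likewise hypothetical.

```python
def split_at_blanks(blank_probs, threshold=0.9999):
    """Return (start, end) frame ranges, end exclusive, between boundary frames.

    blank_probs[t] is the blank output at frame t. Frames above the
    threshold are assumed to be boundaries and are excluded from sections.
    """
    sections, start = [], None
    for t, p in enumerate(blank_probs):
        if p > threshold:
            if start is not None:
                sections.append((start, t))  # close the open section
                start = None
        elif start is None:
            start = t  # first non-boundary frame opens a section
    if start is not None:
        sections.append((start, len(blank_probs)))
    return sections
```

Each returned range could then be decoded on its own and the resulting labellings concatenated in order.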
4. Training the Network
So far we have described an output representation that
allows RNNs to be used for CTC. We now derive an
objective function for training CTC networks with gra-
dient descent.
The objective function is derived from the principle of
maximum likelihood. That is, minimising it maximises
the log likelihoods of the target labellings. Note that
this is the same principle underlying the standard neu-
ral network objective functions (Bishop, 1995). Given
the objective function, and its derivatives with re-
spect to the network outputs, the weight gradients can
be calculated with standard backpropagation through
time. The network can then be trained with any of
the gradient-based optimisation algorithms currently
in use for neural networks (LeCun et al., 1998; Schrau-
dolph, 2002).
We begin with an algorithm required for the maximum
likelihood function.
4.1. The CTC Forward-Backward Algorithm
We require an efficient way of calculating the condi-
tional probabilities p(l|x) of individual labellings. At
first sight (3) suggests this will be problematic: the
sum is over all paths corresponding to a given labelling,
and in general there are very many of these.
Fortunately the problem can be solved with a dy-
namic programming algorithm, similar to the forward-
backward algorithm for HMMs (Rabiner, 1989). The
key idea is that the sum over paths corresponding to
a labelling can be broken down into an iterative sum
over paths corresponding to prefixes of that labelling.
The iterations can then be efficiently computed with
recursive forward and backward variables.
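For some sequence $\mathbf{q}$ of length $r$, denote by $\mathbf{q}_{1:p}$ and $\mathbf{q}_{r-p:r}$ its first and last $p$ symbols respectively. Then for a labelling $\mathbf{l}$, define the forward variable $\alpha_t(s)$ to be the total probability of $\mathbf{l}_{1:s}$ at time $t$, i.e.
$$\alpha_t(s) \overset{\text{def}}{=} \sum_{\substack{\pi \in N^T: \\ \mathcal{B}(\pi_{1:t}) = \mathbf{l}_{1:s}}} \prod_{t'=1}^{t} y^{t'}_{\pi_{t'}}. \qquad (5)$$
As we will see, $\alpha_t(s)$ can be calculated recursively from $\alpha_{t-1}(s)$ and $\alpha_{t-1}(s-1)$.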
To allow for blanks in the output paths, we consider a modified label sequence $\mathbf{l}'$, with blanks added to the beginning and the end and inserted between every pair of labels. The length of $\mathbf{l}'$ is therefore $2|\mathbf{l}| + 1$. In calculating the probabilities of prefixes of $\mathbf{l}'$ we allow all transitions between blank and non-blank labels, and also those between any pair of distinct non-blank labels. We allow all prefixes to start with either a blank ($b$) or the first symbol in $\mathbf{l}$ ($l_1$).
This gives us the following rules for initialisation
$$\alpha_1(1) = y^1_b, \qquad \alpha_1(2) = y^1_{l_1}, \qquad \alpha_1(s) = 0, \ \forall s > 2$$
and recursion
$$\alpha_t(s) = \begin{cases} \bar\alpha_t(s)\, y^t_{l'_s} & \text{if } l'_s = b \text{ or } l'_{s-2} = l'_s \\ \left(\bar\alpha_t(s) + \alpha_{t-1}(s-2)\right) y^t_{l'_s} & \text{otherwise} \end{cases} \qquad (6)$$
where
$$\bar\alpha_t(s) \overset{\text{def}}{=} \alpha_{t-1}(s) + \alpha_{t-1}(s-1). \qquad (7)$$
Figure 3. Illustration of the forward-backward algorithm applied to the labelling ‘CAT’. Black circles represent labels, and white circles represent blanks. Arrows signify allowed transitions. Forward variables are updated in the direction of the arrows, and backward variables are updated against them.
Note that $\alpha_t(s) = 0 \ \forall s < |\mathbf{l}'| - 2(T - t) - 1$, because these variables correspond to states for which there are not enough time-steps left to complete the sequence (the unconnected circles in the top right of figure 3). Also $\alpha_t(s) = 0 \ \forall s < 1$.
The probability of $\mathbf{l}$ is then the sum of the total probabilities of $\mathbf{l}'$ with and without the final blank at time $T$:
$$p(\mathbf{l}|\mathbf{x}) = \alpha_T(|\mathbf{l}'|) + \alpha_T(|\mathbf{l}'| - 1). \qquad (8)$$
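The forward recursion (6)-(8) might be implemented along the following lines. This is a sketch without rescaling, so it is only suitable for short sequences. Indices are 0-based, so `alpha[t][s]` corresponds to α at time t+1 and position s+1 in the text. The blank symbol and the dictionary layout of the outputs are assumptions.

```python
BLANK = "-"  # hypothetical symbol standing in for the blank unit


def extend_with_blanks(labelling):
    """Build l': blanks at both ends and between every pair of labels."""
    extended = [BLANK]
    for label in labelling:
        extended += [label, BLANK]
    return extended


def ctc_forward(y, labelling):
    """Return (alpha, p), with p computed as in equation (8).

    y[t] is a dict mapping each symbol (labels and BLANK) to its softmax
    output at frame t.
    """
    ext = extend_with_blanks(labelling)
    T, S = len(y), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    for t in range(T):
        for s in range(S):
            if s < S - 2 * (T - t):
                continue  # too few frames left to finish the labelling
            if t == 0:
                total = 1.0 if s <= 1 else 0.0  # start in blank or first label
            else:
                total = alpha[t - 1][s]
                if s >= 1:
                    total += alpha[t - 1][s - 1]
                # skipping a blank is allowed only between distinct labels
                if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                    total += alpha[t - 1][s - 2]
            alpha[t][s] = total * y[t][ext[s]]
    p = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    return alpha, p
```

The cost is proportional to T times the length of l', in contrast to the exponential number of paths summed in (3).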
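Similarly, the backward variables $\beta_t(s)$ are defined as the total probability of $\mathbf{l}_{s:|\mathbf{l}|}$ at time $t$:
$$\beta_t(s) \overset{\text{def}}{=} \sum_{\substack{\pi \in N^T: \\ \mathcal{B}(\pi_{t:T}) = \mathbf{l}_{s:|\mathbf{l}|}}} \prod_{t'=t}^{T} y^{t'}_{\pi_{t'}} \qquad (9)$$
$$\beta_T(|\mathbf{l}'|) = y^T_b, \qquad \beta_T(|\mathbf{l}'| - 1) = y^T_{l_{|\mathbf{l}|}}, \qquad \beta_T(s) = 0, \ \forall s < |\mathbf{l}'| - 1$$
$$\beta_t(s) = \begin{cases} \bar\beta_t(s)\, y^t_{l'_s} & \text{if } l'_s = b \text{ or } l'_{s+2} = l'_s \\ \left(\bar\beta_t(s) + \beta_{t+1}(s+2)\right) y^t_{l'_s} & \text{otherwise} \end{cases} \qquad (10)$$
where
$$\bar\beta_t(s) \overset{\text{def}}{=} \beta_{t+1}(s) + \beta_{t+1}(s+1). \qquad (11)$$
Note that $\beta_t(s) = 0 \ \forall s > 2t$ (the unconnected circles in the bottom left of figure 3) and $\forall s > |\mathbf{l}'|$.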
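The backward recursion (9)-(11) mirrors the forward one and can be sketched in the same style, again without rescaling and with 0-based indices. The final line relies on a symmetry argument that the text does not spell out: by analogy with (8), summing the two admissible start states at the first frame should also give p(l|x). The blank symbol and data layout are assumptions.

```python
BLANK = "-"  # hypothetical symbol standing in for the blank unit


def ctc_backward(y, labelling):
    """Return (beta, p), where p is read off the first frame.

    y[t] is a dict mapping each symbol (labels and BLANK) to its softmax
    output at frame t.
    """
    ext = [BLANK]
    for label in labelling:
        ext += [label, BLANK]
    T, S = len(y), len(ext)
    beta = [[0.0] * S for _ in range(T)]
    for t in range(T - 1, -1, -1):
        for s in range(S):
            if s > 2 * t + 1:
                continue  # cannot be reached from the start by frame t
            if t == T - 1:
                total = 1.0 if s >= S - 2 else 0.0  # end in last label or blank
            else:
                total = beta[t + 1][s]
                if s + 1 < S:
                    total += beta[t + 1][s + 1]
                # skipping a blank is allowed only between distinct labels
                if s + 2 < S and ext[s] != BLANK and ext[s] != ext[s + 2]:
                    total += beta[t + 1][s + 2]
            beta[t][s] = total * y[t][ext[s]]
    p = beta[0][0] + (beta[0][1] if S > 1 else 0.0)
    return beta, p
```

Comparing this value with the one obtained from the forward variables is a cheap consistency check on an implementation.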
In practice, the above recursions will soon lead to underflows on any digital computer. One way of avoiding this is to rescale the forward and backward variables (Rabiner, 1989). If we define
$$C_t \overset{\text{def}}{=} \sum_s \alpha_t(s), \qquad \hat\alpha_t(s) \overset{\text{def}}{=} \frac{\alpha_t(s)}{C_t}$$
and use $\hat\alpha$ in place of $\alpha$ on the RHS of (6) and (7), the forward variables will remain in computational range. Similarly, for the backward variables we define
$$D_t \overset{\text{def}}{=} \sum_s \beta_t(s), \qquad \hat\beta_t(s) \overset{\text{def}}{=} \frac{\beta_t(s)}{D_t}$$
and use $\hat\beta$ in place of $\beta$ on the RHS of (10) and (11).
To evaluate the maximum likelihood error, we need the natural logs of the target labelling probabilities. With the rescaled variables these have a particularly simple form:
$$\ln(p(\mathbf{l}|\mathbf{x})) = \sum_{t=1}^{T} \ln(C_t)$$
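A rescaled forward pass could look like the sketch below, which accumulates ln C_t frame by frame. One detail matters for the final sum to equal ln p(l|x): states from which the labelling can no longer be completed must be set to zero explicitly, as noted after figure 3, so that all remaining mass at the last frame sits in the two final states. The blank symbol and data layout are assumptions.

```python
import math

BLANK = "-"  # hypothetical symbol standing in for the blank unit


def ctc_log_likelihood(y, labelling):
    """Return ln p(l|x) as the sum of ln C_t over all frames.

    y[t] is a dict mapping each symbol (labels and BLANK) to its softmax
    output at frame t.
    """
    ext = [BLANK]
    for label in labelling:
        ext += [label, BLANK]
    T, S = len(y), len(ext)
    alpha_hat = [0.0] * S  # rescaled forward variables of the previous frame
    log_p = 0.0
    for t in range(T):
        alpha = [0.0] * S
        for s in range(S):
            if s < S - 2 * (T - t):
                continue  # too few frames left to finish the labelling
            if t == 0:
                total = 1.0 if s <= 1 else 0.0
            else:
                total = alpha_hat[s]
                if s >= 1:
                    total += alpha_hat[s - 1]
                if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                    total += alpha_hat[s - 2]
            alpha[s] = total * y[t][ext[s]]
        c = sum(alpha)  # the rescaling factor C_t
        if c == 0.0:
            return float("-inf")  # the labelling is impossible for this input
        log_p += math.log(c)
        alpha_hat = [a / c for a in alpha]
    return log_p
```

With 2000 frames of uniform outputs over two symbols, the unscaled probability is far below the smallest positive double, yet the log-likelihood returned here stays finite.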
4.2. Maximum Likelihood Training
The aim of maximum likelihood training is to simultaneously maximise the log probabilities of all the correct classifications in the training set. In our case, this means minimising the following objective function:
$$O^{ML}(S, \mathcal{N}_w) = -\sum_{(\mathbf{x},\mathbf{z}) \in S} \ln\big(p(\mathbf{z}|\mathbf{x})\big) \qquad (12)$$
To train the network with gradient descent, we need to differentiate (12) with respect to the network outputs. Since the training examples are independent we can consider them separately:
$$\frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\}, \mathcal{N}_w)}{\partial y^t_k} = -\frac{\partial \ln(p(\mathbf{z}|\mathbf{x}))}{\partial y^t_k} \qquad (13)$$
We now show how the algorithm of section 4.1 can be
used to calculate (13).
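The key point is that, for a labelling $\mathbf{l}$, the product of the forward and backward variables at a given $s$ and $t$ is the probability of all the paths corresponding to $\mathbf{l}$ that go through the symbol $s$ at time $t$. More precisely, from (5) and (9) we have:
$$\alpha_t(s)\beta_t(s) = \sum_{\substack{\pi \in \mathcal{B}^{-1}(\mathbf{l}): \\ \pi_t = l_s}} y^t_{l_s} \prod_{t'=1}^{T} y^{t'}_{\pi_{t'}}.$$
Rearranging and substituting in from (2) gives
$$\frac{\alpha_t(s)\beta_t(s)}{y^t_{l_s}} = \sum_{\substack{\pi \in \mathcal{B}^{-1}(\mathbf{l}): \\ \pi_t = l_s}} p(\pi|\mathbf{x}).$$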
From (3) we can see that this is the portion of the total probability $p(\mathbf{l}|\mathbf{x})$ due to those paths going through $l_s$ at time $t$. We can therefore sum over all $s$ and $t$ to get:
$$p(\mathbf{l}|\mathbf{x}) = \sum_{t=1}^{T} \sum_{s=1}^{|\mathbf{l}|} \frac{\alpha_t(s)\beta_t(s)}{y^t_{l_s}}. \qquad (14)$$
Because the network outputs are conditionally independent (section 3.1), we need only consider the paths going through label $k$ at time $t$ to get the partial derivatives of $p(\mathbf{l}|\mathbf{x})$ with respect to $y^t_k$. Noting that the same label may be repeated several times in a single labelling $\mathbf{l}$, we define the set of positions where label $k$ occurs as $lab(\mathbf{l}, k) = \{s : l_s = k\}$, which may be empty. We then differentiate (14) to get:
$$\frac{\partial p(\mathbf{l}|\mathbf{x})}{\partial y^t_k} = \frac{1}{(y^t_k)^2} \sum_{s \in lab(\mathbf{l},k)} \alpha_t(s)\beta_t(s). \qquad (15)$$
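Observing that
$$\frac{\partial \ln(p(\mathbf{l}|\mathbf{x}))}{\partial y^t_k} = \frac{1}{p(\mathbf{l}|\mathbf{x})} \frac{\partial p(\mathbf{l}|\mathbf{x})}{\partial y^t_k}$$
we can set $\mathbf{l} = \mathbf{z}$ and substitute (8) and (15) into (13) to differentiate the objective function.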
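The derivative in (15) can be checked on a small case. The sketch below computes unscaled forward and backward variables and evaluates the right-hand side of (15). It assumes that positions s are taken in the blank-augmented sequence l', since that is what α and β are indexed by; this also lets the same function handle the blank output. Names, the blank symbol and the data layout are assumptions.

```python
BLANK = "-"  # hypothetical symbol standing in for the blank unit


def forward_backward(y, labelling):
    """Unscaled alpha and beta (0-based), plus p(l|x) from equation (8)."""
    ext = [BLANK]
    for label in labelling:
        ext += [label, BLANK]
    T, S = len(y), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[0.0] * S for _ in range(T)]
    alpha[0][0] = y[0][BLANK]
    beta[T - 1][S - 1] = y[T - 1][BLANK]
    if S > 1:
        alpha[0][1] = y[0][ext[1]]
        beta[T - 1][S - 2] = y[T - 1][ext[S - 2]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s] + (alpha[t - 1][s - 1] if s >= 1 else 0.0)
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * y[t][ext[s]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1][s] + (beta[t + 1][s + 1] if s + 1 < S else 0.0)
            if s + 2 < S and ext[s] != BLANK and ext[s] != ext[s + 2]:
                b += beta[t + 1][s + 2]
            beta[t][s] = b * y[t][ext[s]]
    p = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    return ext, alpha, beta, p


def grad_p(y, labelling, t, k):
    """Right-hand side of (15): derivative of p(l|x) w.r.t. y[t][k]."""
    ext, alpha, beta, _ = forward_backward(y, labelling)
    # lab(l', k): positions in the extended sequence holding symbol k
    total = sum(alpha[t][s] * beta[t][s] for s, sym in enumerate(ext) if sym == k)
    return total / y[t][k] ** 2
```

For two frames and the labelling `a`, p(l|x) is y⁰_a y¹_a + y⁰_a y¹_b + y⁰_b y¹_a (writing b for blank and using 0-based frames), so its derivative with respect to y⁰_a is y¹_a + y¹_b, which is what `grad_p` returns.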
Finally, to backpropagate the gradient through the softmax layer, we need the objective function derivatives with respect to the unnormalised outputs $u^t_k$. If the rescaling of section 4.1 is used, we have:
$$\frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\}, \mathcal{N}_w)}{\partial u^t_k} = y^t_k - \frac{Q_t}{y^t_k} \sum_{s \in lab(\mathbf{z},k)} \hat\alpha_t(s)\hat\beta_t(s) \qquad (16)$$
where
$$Q_t \overset{\text{def}}{=} D_t \prod_{t'=t+1}^{T} \frac{D_{t'}}{C_{t'}}.$$
Eqn (16) is the ‘error signal’ received by the network during training (figure 4).
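The error signal can be sketched without rescaling for short sequences. In that case the rescaled product Q_t α̂_t(s) β̂_t(s) equals α_t(s) β_t(s) / p(z|x), because the product of all C_t is p(z|x), so (16) becomes y minus the normalised occupancy divided by y. The code below uses that unscaled form, which is an assumption of this example rather than the paper's implementation; positions are taken in the blank-augmented sequence, and all names are hypothetical.

```python
import math

BLANK = "-"  # hypothetical symbol standing in for the blank unit


def softmax(frame):
    """Normalise a dict of unnormalised outputs u into probabilities y."""
    m = max(frame.values())
    exps = {k: math.exp(v - m) for k, v in frame.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def forward_backward(y, labelling):
    """Unscaled alpha and beta (0-based), plus p(z|x) from equation (8)."""
    ext = [BLANK]
    for label in labelling:
        ext += [label, BLANK]
    T, S = len(y), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[0.0] * S for _ in range(T)]
    alpha[0][0] = y[0][BLANK]
    beta[T - 1][S - 1] = y[T - 1][BLANK]
    if S > 1:
        alpha[0][1] = y[0][ext[1]]
        beta[T - 1][S - 2] = y[T - 1][ext[S - 2]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s] + (alpha[t - 1][s - 1] if s >= 1 else 0.0)
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * y[t][ext[s]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1][s] + (beta[t + 1][s + 1] if s + 1 < S else 0.0)
            if s + 2 < S and ext[s] != BLANK and ext[s] != ext[s + 2]:
                b += beta[t + 1][s + 2]
            beta[t][s] = b * y[t][ext[s]]
    p = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    return ext, alpha, beta, p


def error_signal(u, labelling):
    """Return (signal, loss): d(-ln p)/du[t][k] for every t and k, and -ln p."""
    y = [softmax(frame) for frame in u]
    ext, alpha, beta, p = forward_backward(y, labelling)
    signal = []
    for t, frame in enumerate(y):
        row = {}
        for k, y_tk in frame.items():
            occupancy = sum(alpha[t][s] * beta[t][s]
                            for s, sym in enumerate(ext) if sym == k)
            row[k] = y_tk - occupancy / (p * y_tk)
        signal.append(row)
    return signal, -math.log(p)
```

A finite-difference comparison against the returned loss is a simple way to validate such an implementation before adding rescaling.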
5. Experiments
We compared the performance of CTC with that of
both an HMM and an HMM-RNN hybrid on a real-
world temporal classification problem: phonetic la-
belling on the TIMIT speech corpus. More precisely,
the task was to annotate the utterances in the TIMIT
test set with the phoneme sequences that gave the low-
est possible label error rate (as defined in section 2.1).
To make the comparison fair, the CTC and hybrid networks used the same RNN architecture: bidirectional Long Short-Term Memory (BLSTM; Graves & Schmidhuber, 2005). BLSTM combines the ability of Long Short-Term Memory (LSTM; Hochreiter & Schmidhuber, 1997) to bridge long time lags with the access of bidirectional RNNs (BRNNs; Schuster & Paliwal, 1997) to past and future context. We stress that any other architecture could have been used instead. We chose BLSTM because our experiments with standard BRNNs and unidirectional networks gave worse results on the same task.
[Figure 4 (image not reproduced): rows (a), (b) and (c); columns "output" and "error".]
Figure 4. Evolution of the CTC Error Signal During Training. The left column shows the output activations for the same sequence at various stages of training (the dashed line is the ‘blank’ unit); the right column shows the corresponding error signals. Errors above the horizontal axis act to increase the corresponding output activation and those below act to decrease it. (a) Initially the network has small random weights, and the error is determined by the target sequence only. (b) The network begins to make predictions and the error localises around them. (c) The network strongly predicts the correct labelling and the error virtually disappears.
5.1. Data
TIMIT contains recordings of prompted English speech, accompanied by manually segmented phonetic transcripts. It has a lexicon of 61 distinct phonemes, and comes divided into training and test sets containing 4620 and 1680 utterances respectively. 5% (184) of the training utterances were chosen at random and used as a validation set for early stopping in the hybrid and CTC experiments. The audio data was preprocessed into 10 ms frames, overlapped by 5 ms, using 12 Mel-Frequency Cepstrum Coefficients (MFCCs) from 26 filter-bank channels. The log-energy was also included, along with the first derivatives of all coefficients, giving a vector of 26 coefficients per frame in total. The coefficients were individually normalised to have mean 0 and standard deviation 1 over the training set.
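The arithmetic behind the 26-dimensional input can be made explicit with a short sketch. The function names and defaults are hypothetical; the numbers are the ones given above.

```python
def feature_dim(n_mfcc=12, log_energy=1, derivative_orders=1):
    """Static coefficients plus one copy per derivative order."""
    static = n_mfcc + log_energy            # 12 MFCCs + log-energy = 13
    return static * (1 + derivative_orders)  # statics + first derivatives


def frames_per_second(frame_ms=10, overlap_ms=5):
    """Frames advance by the frame length minus the overlap."""
    hop_ms = frame_ms - overlap_ms
    return 1000 // hop_ms
```

With the defaults, `feature_dim()` gives 26, and the 5 ms hop implied by the overlap corresponds to 200 frames per second of audio.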
5.2. Experimental Setup
The CTC network used an extended BLSTM archi-
tecture with peepholes and forget gates (Gers et al.,
2002), 100 blocks in each of the forward and backward
hidden layers, hyperbolic tangent for the input and
output cell activation functions and a logistic sigmoid
in the range [0,1] for the gates.
The hidden layers were fully connected to themselves
and the output layer, and fully connected from the
input layer. The input layer was size 26, the soft-
max output layer size 62 (61 phoneme categories plus
the blank label), and the total number of weights was
114,662.
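The reported weight total can be reproduced under a few assumptions that the text does not state explicitly: one cell per block, a bias weight for every gate, cell input and output unit, three peephole weights per block, and recurrent connections within each hidden layer only. The function names are hypothetical.

```python
def lstm_layer_weights(n_inputs, n_blocks, peepholes_per_block=3):
    """Weights into one LSTM hidden layer (assumes one cell per block)."""
    # Each block has an input gate, forget gate, output gate and cell input,
    # each fed by the input layer, the layer's own recurrent outputs and a bias.
    fan_in = n_inputs + n_blocks + 1
    return 4 * n_blocks * fan_in + peepholes_per_block * n_blocks


def blstm_weights(n_inputs, n_blocks, n_outputs):
    """Forward and backward hidden layers plus a shared softmax output layer."""
    hidden = 2 * lstm_layer_weights(n_inputs, n_blocks)
    output = (2 * n_blocks + 1) * n_outputs  # both hidden layers + bias
    return hidden + output
```

Under these assumptions `blstm_weights(26, 100, 62)` gives 114,662, matching the figure above, and `blstm_weights(26, 100, 61)` gives 114,461, matching the figure reported for the hybrid network's 61-unit output layer.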
Training was carried out with backpropagation through time and online gradient descent (weight updates after every training example), using a learning rate of $10^{-4}$ and a momentum of 0.9. Network activations were reset to 0 at the start of each example. For prefix search decoding (section 3.2) the blank probability threshold was set at 99.99%. The weights were initialised with a flat random distribution in the range $[-0.1, 0.1]$. During training, Gaussian noise was added to the inputs with a standard deviation of 0.6 to improve generalisation.
The baseline HMM and hybrid systems were implemented as in (Graves et al., 2005). Briefly, baseline HMMs with context-independent and context-dependent three-state left-to-right models were trained and tested using the HTK Toolkit¹. Observation probabilities were modelled by a mixture of Gaussians. Both the number of Gaussians and the insertion penalty were chosen to obtain the best performance on the task. Neither linguistic information nor probabilities of partial phone sequences were included in the system. There were more than 900,000 parameters in total.
The hybrid system comprised an HMM and a BLSTM
network, and was trained using Viterbi-based forced-
alignment (Robinson, 1994). Initial estimation of tran-
sition and prior probabilities of the one-state 61 mod-
els was carried out using the correct transcription for
the training set. Network output probabilities were
divided by prior probabilities to obtain likelihoods for
the HMM. The insertion penalty was chosen to obtain
the best performance on the task.
The BLSTM architecture and parameters were identical to those used for CTC, with the following exceptions: (1) the learning rate for the hybrid network was $10^{-5}$; (2) the injected noise had standard deviation 0.5; (3) the output layer had 61 units instead of 62 (no blank label). The noise and learning rate were set for the two systems independently, following a rough search in parameter space. The hybrid network had a total of 114,461 weights, to which the HMM added 183 further parameters. For the weighted error experiment, the error signal was scaled to give equal weight to long and short phonemes (Robinson, 1991).
¹ http://htk.eng.cam.ac.uk/
Table 1. Label Error Rate (LER) on TIMIT. CTC and hybrid results are means over 5 runs, ± standard error. All differences were significant (p < 0.01), except between weighted error BLSTM/HMM and CTC (best path).
System                        LER
Context-independent HMM       38.85 %
Context-dependent HMM         35.21 %
BLSTM/HMM                     33.84 ± 0.06 %
Weighted error BLSTM/HMM      31.57 ± 0.06 %
CTC (best path)               31.47 ± 0.21 %
CTC (prefix search)           30.51 ± 0.19 %
5.3. Experimental Results
The results in table 1 show that, with prefix search
decoding, CTC outperformed both a baseline HMM
recogniser and an HMM-RNN hybrid with the same
RNN architecture. They also show that prefix search
gave a small improvement over best path decoding.
Note that the best hybrid results were achieved with a
weighted error signal. Such heuristics are unnecessary
for CTC, as its objective function depends only on
the sequence of labels, and not on their duration or
segmentation.
Input noise had a greater impact on generalisation for
CTC than the hybrid system, and a higher level of
noise was found to be optimal for CTC.
6. Discussion and Future Work
A key difference between CTC and other temporal
classifiers is that CTC does not explicitly segment its
input sequences. This has several benefits, such as re-
moving the need to locate inherently ambiguous label
boundaries (e.g. in speech or handwriting), and allow-
ing label predictions to be grouped together if it proves
useful (e.g. if several labels commonly occur together).
In any case, determining the segmentation is a waste of
modelling effort if only the label sequence is required.
For tasks where segmentation is required (e.g. protein secondary structure prediction), it would seem problematic to use CTC. However, as can be seen from figure 1, CTC naturally tends to align each label prediction with the corresponding part of the sequence. This should make it suitable for tasks like keyword spotting, where approximate segmentation is sufficient.
Another distinctive feature of CTC is that it does
not explicitly model inter-label dependencies. This is
in contrast to graphical models, where the labels are
typically assumed to form a kth order Markov chain.
Nonetheless, CTC implicitly models inter-label depen-
dencies, e.g. by predicting labels that commonly occur
together as a double spike (see figure 1).
One very general way of dealing with structured data
would be a hierarchy of temporal classifiers, where the
labellings at one level (e.g. letters) become inputs for
the labellings at the next (e.g. words). Preliminary
experiments with hierarchical CTC have been encour-
aging, and we intend to pursue this direction further.
Good generalisation is always difficult with maximum
likelihood training, but appears to be particularly so
for CTC. In the future, we will continue to explore
methods to reduce overfitting, such as weight decay,
boosting and margin maximisation.
7. Conclusions
We have introduced a novel, general method for tem-
poral classification with RNNs. Our method fits nat-
urally into the existing framework of neural network
classifiers, and is derived from the same probabilis-
tic principles. It obviates the need for pre-segmented
data, and allows the network to be trained directly for
sequence labelling. Moreover, without requiring any
task-specific knowledge, it has outperformed both an
HMM and an HMM-RNN hybrid on a real-world tem-
poral classification problem.
Acknowledgements
We thank Marcus Hutter for useful mathematical dis-
cussions. This research was funded by SNF grants
200021-111968/1 and 200020-107534/1.
References
Bengio, Y. (1999). Markovian models for sequential
data. Neural Computing Surveys,2, 129–162.
Bishop, C. (1995). Neural Networks for Pattern Recog-
nition, chapter 6. Oxford University Press, Inc.
Bourlard, H., & Morgan, N. (1994). Connectionist
speech recognition: A hybrid approach. Kluwer Aca-
demic Publishers.
Bridle, J. (1990). Probabilistic interpretation of feed-
forward classification network outputs, with re-
lationships to statistical pattern recognition. In
F. Soulie and J.Herault (Eds.), Neurocomputing: Al-
gorithms, architectures and applications, 227–236.
Springer-Verlag.
Gers, F., Schraudolph, N., & Schmidhuber, J. (2002).
Learning precise timing with LSTM recurrent net-
works. Journal of Machine Learning Research,3,
115–143.
Graves, A., Fernández, S., & Schmidhuber, J.
(2005). Bidirectional LSTM networks for improved
phoneme classification and recognition. Proceedings
of the 2005 International Conference on Artificial
Neural Networks. Warsaw, Poland.
Graves, A., & Schmidhuber, J. (2005). Framewise
phoneme classification with bidirectional LSTM and
other neural network architectures. Neural Net-
works,18, 602–610.
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-
Term Memory. Neural Computation,9, 1735–1780.
Kadous, M. W. (2002). Temporal classification: Ex-
tending the classification paradigm to multivariate
time series. Doctoral dissertation, School of Com-
puter Science & Engineering, University of New
South Wales.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Con-
ditional random fields: Probabilistic models for seg-
menting and labeling sequence data. Proc. 18th In-
ternational Conf. on Machine Learning (pp. 282–
289). Morgan Kaufmann, San Francisco, CA.
LeCun, Y., Bottou, L., Orr, G., & Muller, K. (1998).
Efficient backprop. Neural Networks: Tricks of the
trade. Springer.
Rabiner, L. R. (1989). A tutorial on hidden markov
models and selected applications in speech recogni-
tion. Proc. IEEE (pp. 257–286). IEEE.
Robinson, A. J. (1991). Several improvements
to a recurrent error propagation network phone
recognition system (Technical Report CUED/F-
INFENG/TR82). University of Cambridge.
Robinson, A. J. (1994). An application of recurrent
nets to phone probability estimation. IEEE Trans-
actions on Neural Networks,5, 298–305.
Schraudolph, N. N. (2002). Fast Curvature Matrix-
Vector Products for Second-Order Gradient De-
scent. Neural Comp.,14, 1723–1738.
Schuster, M., & Paliwal, K. K. (1997). Bidirectional
recurrent neural networks. IEEE Transactions on
Signal Processing,45, 2673–2681.
Werbos, P. (1990). Backpropagation through time:
What it does and how to do it. Proceedings of the
IEEE,78, 1550 – 1560.