
Connectionist Temporal Classification: Labelling Unsegmented

Sequence Data with Recurrent Neural Networks

Alex Graves¹ alex@idsia.ch

Santiago Fernández¹ santiago@idsia.ch

Faustino Gomez¹ tino@idsia.ch

Jürgen Schmidhuber¹,² juergen@idsia.ch

¹Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), Galleria 2, 6928 Manno-Lugano, Switzerland

²Technische Universität München (TUM), Boltzmannstr. 3, 85748 Garching, Munich, Germany

Abstract

Many real-world sequence learning tasks re-

quire the prediction of sequences of labels

from noisy, unsegmented input data. In

speech recognition, for example, an acoustic

signal is transcribed into words or sub-word

units. Recurrent neural networks (RNNs) are

powerful sequence learners that would seem

well suited to such tasks. However, because

they require pre-segmented training data,

and post-processing to transform their out-

puts into label sequences, their applicability

has so far been limited. This paper presents a

novel method for training RNNs to label un-

segmented sequences directly, thereby solv-

ing both problems. An experiment on the

TIMIT speech corpus demonstrates its ad-

vantages over both a baseline HMM and a

hybrid HMM-RNN.

1. Introduction

Labelling unsegmented sequence data is a ubiquitous

problem in real-world sequence learning. It is partic-

ularly common in perceptual tasks (e.g. handwriting

recognition, speech recognition, gesture recognition)

where noisy, real-valued input streams are annotated

with strings of discrete labels, such as letters or words.

Currently, graphical models such as hidden Markov

models (HMMs; Rabiner, 1989), conditional random

ﬁelds (CRFs; Laﬀerty et al., 2001) and their vari-

ants, are the predominant framework for sequence la-


belling. While these approaches have proved success-

ful for many problems, they have several drawbacks:

(1) they usually require a signiﬁcant amount of task

speciﬁc knowledge, e.g. to design the state models for

HMMs, or choose the input features for CRFs; (2)

they require explicit (and often questionable) depen-

dency assumptions to make inference tractable, e.g.

the assumption that observations are independent for

HMMs; (3) for standard HMMs, training is generative,

even though sequence labelling is discriminative.

Recurrent neural networks (RNNs), on the other hand,

require no prior knowledge of the data, beyond the

choice of input and output representation. They can

be trained discriminatively, and their internal state

provides a powerful, general mechanism for modelling

time series. In addition, they tend to be robust to

temporal and spatial noise.

So far, however, it has not been possible to apply

RNNs directly to sequence labelling. The problem is

that the standard neural network objective functions

are deﬁned separately for each point in the training se-

quence; in other words, RNNs can only be trained to

make a series of independent label classiﬁcations. This

means that the training data must be pre-segmented,

and that the network outputs must be post-processed

to give the ﬁnal label sequence.

At present, the most eﬀective use of RNNs for se-

quence labelling is to combine them with HMMs in the

so-called hybrid approach (Bourlard & Morgan, 1994;

Bengio, 1999). Hybrid systems use HMMs to model

the long-range sequential structure of the data, and

neural nets to provide localised classiﬁcations. The

HMM component is able to automatically segment

the sequence during training, and to transform the

network classiﬁcations into label sequences. However,

as well as inheriting the aforementioned drawbacks of


HMMs, hybrid systems do not exploit the full poten-

tial of RNNs for sequence modelling.

This paper presents a novel method for labelling se-

quence data with RNNs that removes the need for pre-

segmented training data and post-processed outputs,

and models all aspects of the sequence within a single

network architecture. The basic idea is to interpret

the network outputs as a probability distribution over

all possible label sequences, conditioned on a given in-

put sequence. Given this distribution, an objective

function can be derived that directly maximises the

probabilities of the correct labellings. Since the objec-

tive function is diﬀerentiable, the network can then be

trained with standard backpropagation through time

(Werbos, 1990).

In what follows, we refer to the task of labelling un-

segmented data sequences as temporal classiﬁcation

(Kadous, 2002), and to our use of RNNs for this pur-

pose as connectionist temporal classiﬁcation (CTC).

By contrast, we refer to the independent labelling of

each time-step, or frame, of the input sequence as

framewise classiﬁcation.

The next section provides the mathematical formalism

for temporal classiﬁcation, and deﬁnes the error mea-

sure used in this paper. Section 3 describes the output

representation that allows RNNs to be used as tempo-

ral classiﬁers. Section 4 explains how CTC networks

can be trained. Section 5 compares CTC to hybrid and

HMM systems on the TIMIT speech corpus. Section 6

discusses some key diﬀerences between CTC and other

temporal classiﬁers, giving directions for future work,

and the paper concludes with section 7.

2. Temporal Classification

Let $S$ be a set of training examples drawn from a fixed distribution $\mathcal{D}_{\mathcal{X}\times\mathcal{Z}}$. The input space $\mathcal{X} = (\mathbb{R}^m)^*$ is the set of all sequences of $m$-dimensional real-valued vectors. The target space $\mathcal{Z} = L^*$ is the set of all sequences over the (finite) alphabet $L$ of labels. In general, we refer to elements of $L^*$ as label sequences or labellings. Each example in $S$ consists of a pair of sequences $(\mathbf{x}, \mathbf{z})$. The target sequence $\mathbf{z} = (z_1, z_2, \ldots, z_U)$ is at most as long as the input sequence $\mathbf{x} = (x_1, x_2, \ldots, x_T)$, i.e. $U \leq T$. Since the input and target sequences are not generally the same length, there is no a priori way of aligning them.

The aim is to use $S$ to train a temporal classifier $h : \mathcal{X} \mapsto \mathcal{Z}$ to classify previously unseen input sequences in a way that minimises some task-specific error measure.

2.1. Label Error Rate

In this paper, we are interested in the following error measure: given a test set $S' \subset \mathcal{D}_{\mathcal{X}\times\mathcal{Z}}$ disjoint from $S$, define the label error rate (LER) of a temporal classifier $h$ as the mean normalised edit distance between its classifications and the targets on $S'$, i.e.

$$LER(h, S') = \frac{1}{|S'|} \sum_{(\mathbf{x},\mathbf{z}) \in S'} \frac{ED(h(\mathbf{x}), \mathbf{z})}{|\mathbf{z}|} \qquad (1)$$

where $ED(\mathbf{p}, \mathbf{q})$ is the edit distance between two sequences $\mathbf{p}$ and $\mathbf{q}$, i.e. the minimum number of insertions, substitutions and deletions required to change $\mathbf{p}$ into $\mathbf{q}$.

This is a natural measure for tasks (such as speech or

handwriting recognition) where the aim is to minimise

the rate of transcription mistakes.
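As a concrete illustration (ours, not from the paper), the edit distance in (1) can be computed with the standard dynamic-programming recurrence, and the LER is then a mean over the test set. A minimal Python sketch, assuming hypotheses and targets are sequences of label indices:

def edit_distance(p, q):
    # Minimum number of insertions, substitutions and deletions
    # needed to turn sequence p into sequence q (two-row DP).
    d = list(range(len(q) + 1))
    for i in range(1, len(p) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(q) + 1):
            prev, d[j] = d[j], min(
                d[j] + 1,                       # deletion
                d[j - 1] + 1,                   # insertion
                prev + (p[i - 1] != q[j - 1]))  # substitution
    return d[len(q)]

def label_error_rate(hypotheses, targets):
    # Mean normalised edit distance, as in equation (1).
    return sum(edit_distance(h, z) / len(z)
               for h, z in zip(hypotheses, targets)) / len(targets)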

3. Connectionist Temporal Classification

This section describes the output representation that

allows a recurrent neural network to be used for CTC.

The crucial step is to transform the network outputs

into a conditional probability distribution over label

sequences. The network can then be used as a classifier

by selecting the most probable labelling for a given

input sequence.

3.1. From Network Outputs to Labellings

A CTC network has a softmax output layer (Bridle,

1990) with one more unit than there are labels in L.

The activations of the first $|L|$ units are interpreted as

the probabilities of observing the corresponding labels

at particular times. The activation of the extra unit

is the probability of observing a ‘blank’, or no label.

Together, these outputs deﬁne the probabilities of all

possible ways of aligning all possible label sequences

with the input sequence. The total probability of any

one label sequence can then be found by summing the

probabilities of its diﬀerent alignments.

More formally, for an input sequence $\mathbf{x}$ of length $T$, define a recurrent neural network with $m$ inputs, $n$ outputs and weight vector $w$ as a continuous map $\mathcal{N}_w : (\mathbb{R}^m)^T \mapsto (\mathbb{R}^n)^T$. Let $\mathbf{y} = \mathcal{N}_w(\mathbf{x})$ be the sequence of network outputs, and denote by $y^t_k$ the activation of output unit $k$ at time $t$. Then $y^t_k$ is interpreted as the probability of observing label $k$ at time $t$, which defines a distribution over the set $L'^T$ of length-$T$ sequences over the alphabet $L' = L \cup \{blank\}$:

$$p(\pi|\mathbf{x}) = \prod_{t=1}^{T} y^t_{\pi_t}, \quad \forall \pi \in L'^T. \qquad (2)$$


Figure 1. Framewise and CTC networks classifying a speech signal. The shaded lines are the output activations,

corresponding to the probabilities of observing phonemes at particular times. The CTC network predicts only the

sequence of phonemes (typically as a series of spikes, separated by ‘blanks’, or null predictions), while the framewise

network attempts to align them with the manual segmentation (vertical lines). The framewise network receives an error

for misaligning the segment boundaries, even if it predicts the correct phoneme (e.g. ‘dh’). When one phoneme always

occurs beside another (e.g. the closure ‘dcl’ with the stop ‘d’), CTC tends to predict them together in a double spike.

The choice of labelling can be read directly from the CTC outputs (follow the spikes), whereas the predictions of the

framewise network must be post-processed before use.

From now on, we refer to the elements of $L'^T$ as paths, and denote them $\pi$.

Implicit in (2) is the assumption that the network out-

puts at diﬀerent times are conditionally independent,

given the internal state of the network. This is ensured

by requiring that no feedback connections exist from

the output layer to itself or the network.

The next step is to define a many-to-one map $\mathcal{B} : L'^T \mapsto L^{\leq T}$, where $L^{\leq T}$ is the set of possible labellings (i.e. the set of sequences of length less than or equal to $T$ over the original label alphabet $L$). We do this by simply removing all blanks and repeated labels from the paths (e.g. $\mathcal{B}(a{-}ab{-}) = \mathcal{B}({-}aa{-}{-}abb) = aab$).

Intuitively, this corresponds to outputting a new label

when the network switches from predicting no label

to predicting a label, or from predicting one label to

another (c.f. the CTC outputs in ﬁgure 1). Finally, we

use $\mathcal{B}$ to define the conditional probability of a given labelling $\mathbf{l} \in L^{\leq T}$ as the sum of the probabilities of all the paths corresponding to it:

$$p(\mathbf{l}|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} p(\pi|\mathbf{x}). \qquad (3)$$
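To make the map $\mathcal{B}$ and the sum in (3) concrete, here is a small Python sketch (ours, not part of the paper); the brute-force enumeration of paths is only feasible for tiny $T$, which is exactly why the dynamic-programming algorithm of section 4.1 is needed:

from itertools import product

BLANK = 0  # index assumed for the extra 'blank' output unit

def collapse(path):
    # The map B: merge repeated labels, then remove blanks,
    # e.g. (a,-,a,b,b,-) -> (a,a,b).
    out, prev = [], None
    for k in path:
        if k != prev and k != BLANK:
            out.append(k)
        prev = k
    return tuple(out)

def labelling_probability(y, l):
    # Brute-force p(l|x) from equation (3): sum the path
    # probabilities of equation (2) over every path that
    # collapses to l. y[t][k] is the output for label k at time t.
    T, n = len(y), len(y[0])
    total = 0.0
    for path in product(range(n), repeat=T):
        if collapse(path) == tuple(l):
            prob = 1.0
            for t, k in enumerate(path):
                prob *= y[t][k]
            total += prob
    return total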

3.2. Constructing the Classifier

Given the above formulation, the output of the classifier should be the most probable labelling for the input sequence:

$$h(\mathbf{x}) = \arg\max_{\mathbf{l} \in L^{\leq T}} p(\mathbf{l}|\mathbf{x}).$$

Using the terminology of HMMs, we refer to the task of

ﬁnding this labelling as decoding. Unfortunately, we do

not know of a general, tractable decoding algorithm for

our system. However the following two approximate

methods give good results in practice.

The ﬁrst method (best path decoding) is based on the

assumption that the most probable path will corre-

spond to the most probable labelling:

$$h(\mathbf{x}) \approx \mathcal{B}(\pi^*) \qquad (4)$$

where $\pi^* = \arg\max_{\pi \in L'^T} p(\pi|\mathbf{x})$.

Best path decoding is trivial to compute, since $\pi^*$ is

just the concatenation of the most active outputs at

every time-step. However it is not guaranteed to ﬁnd

the most probable labelling.
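Continuing the sketch above (and reusing its collapse function), best path decoding is just a per-frame argmax followed by $\mathcal{B}$:

def best_path_decode(y):
    # Equation (4): concatenate the most active output at each
    # time-step, then collapse the resulting path with B.
    path = [max(range(len(frame)), key=frame.__getitem__)
            for frame in y]
    return collapse(path)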

The second method (preﬁx search decoding) relies on

the fact that, by modifying the forward-backward al-

gorithm of section 4.1, we can eﬃciently calculate the

probabilities of successive extensions of labelling pre-

ﬁxes (ﬁgure 2).

Given enough time, preﬁx search decoding always ﬁnds

the most probable labelling. However, the maximum

number of preﬁxes it must expand grows exponentially

with the input sequence length. If the output distri-

bution is suﬃciently peaked around the mode, it will

nonetheless ﬁnish in reasonable time. For the exper-

iment in this paper though, a further heuristic was

required to make its application feasible.

Observing that the outputs of a trained CTC network


Figure 2. Prefix search decoding on the label alphabet {X, Y}. Each node either ends ('e') or extends the prefix

at its parent node. The number above an extending node

is the total probability of all labellings beginning with that

preﬁx. The number above an end node is the probability of

the single labelling ending at its parent. At every iteration

the extensions of the most probable remaining preﬁx are

explored. Search ends when a single labelling (here ‘XY’)

is more probable than any remaining preﬁx.

tend to form a series of spikes separated by strongly

predicted blanks (ﬁgure 1), we divide the output se-

quence into sections that are very likely to begin and

end with a blank. We do this by choosing boundary

points where the probability of observing a blank label

is above a certain threshold. We then calculate the

most probable labelling for each section individually

and concatenate these to get the ﬁnal classiﬁcation.

In practice, preﬁx search works well with this heuristic,

and generally outperforms best path decoding. How-

ever it does fail in some cases, e.g. if the same label is

predicted weakly on both sides of a section boundary.
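The boundary heuristic itself is straightforward to sketch (ours; the paper gives no pseudocode, the BLANK index is reused from the earlier sketch, and the threshold is a free parameter, set to 99.99% in section 5.2):

def split_at_blanks(y, threshold=0.9999):
    # Divide the outputs into sections bounded by frames where
    # the blank probability exceeds the threshold; prefix search
    # is then run on each section and the results concatenated.
    sections, current = [], []
    for frame in y:
        if frame[BLANK] > threshold:
            if current:
                sections.append(current)
            current = []
        else:
            current.append(frame)
    if current:
        sections.append(current)
    return sections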

4. Training the Network

So far we have described an output representation that

allows RNNs to be used for CTC. We now derive an

objective function for training CTC networks with gra-

dient descent.

The objective function is derived from the principle of

maximum likelihood. That is, minimising it maximises

the log likelihoods of the target labellings. Note that

this is the same principle underlying the standard neu-

ral network objective functions (Bishop, 1995). Given

the objective function, and its derivatives with re-

spect to the network outputs, the weight gradients can

be calculated with standard backpropagation through

time. The network can then be trained with any of

the gradient-based optimisation algorithms currently

in use for neural networks (LeCun et al., 1998; Schrau-

dolph, 2002).

We begin with an algorithm required for the maximum

likelihood function.

4.1. The CTC Forward-Backward Algorithm

We require an eﬃcient way of calculating the condi-

tional probabilities p(l|x) of individual labellings. At

ﬁrst sight (3) suggests this will be problematic: the

sum is over all paths corresponding to a given labelling,

and in general there are very many of these.

Fortunately the problem can be solved with a dy-

namic programming algorithm, similar to the forward-

backward algorithm for HMMs (Rabiner, 1989). The

key idea is that the sum over paths corresponding to

a labelling can be broken down into an iterative sum

over paths corresponding to preﬁxes of that labelling.

The iterations can then be eﬃciently computed with

recursive forward and backward variables.

For some sequence $\mathbf{q}$ of length $r$, denote by $\mathbf{q}_{1:p}$ and $\mathbf{q}_{r-p:r}$ its first and last $p$ symbols respectively. Then for a labelling $\mathbf{l}$, define the forward variable $\alpha_t(s)$ to be the total probability of $\mathbf{l}_{1:s}$ at time $t$, i.e.

$$\alpha_t(s) \overset{def}{=} \sum_{\substack{\pi \in L'^T:\\ \mathcal{B}(\pi_{1:t}) = \mathbf{l}_{1:s}}} \prod_{t'=1}^{t} y^{t'}_{\pi_{t'}}. \qquad (5)$$

As we will see, $\alpha_t(s)$ can be calculated recursively from $\alpha_{t-1}(s)$ and $\alpha_{t-1}(s-1)$.

To allow for blanks in the output paths, we consider a modified label sequence $\mathbf{l}'$, with blanks added to the beginning and the end and inserted between every pair of labels. The length of $\mathbf{l}'$ is therefore $2|\mathbf{l}| + 1$. In calculating the probabilities of prefixes of $\mathbf{l}'$ we allow all transitions between blank and non-blank labels, and also those between any pair of distinct non-blank labels. We allow all prefixes to start with either a blank ($b$) or the first symbol in $\mathbf{l}$ ($l_1$).

This gives us the following rules for initialisation

$$\alpha_1(1) = y^1_b, \qquad \alpha_1(2) = y^1_{l_1}, \qquad \alpha_1(s) = 0,\ \ \forall s > 2$$

and recursion

$$\alpha_t(s) = \begin{cases} \bar{\alpha}_t(s)\, y^t_{l'_s} & \text{if } l'_s = b \text{ or } l'_{s-2} = l'_s \\ \big(\bar{\alpha}_t(s) + \alpha_{t-1}(s-2)\big)\, y^t_{l'_s} & \text{otherwise} \end{cases} \qquad (6)$$

where

$$\bar{\alpha}_t(s) \overset{def}{=} \alpha_{t-1}(s) + \alpha_{t-1}(s-1). \qquad (7)$$


Figure 3. Illustration of the forward-backward algorithm applied to the labelling 'CAT'. Black circles represent labels, and white circles represent blanks. Arrows signify allowed transitions. Forward variables are updated in the direction of the arrows, and backward variables are updated against them.

Note that $\alpha_t(s) = 0\ \forall s < |\mathbf{l}'| - 2(T - t) - 1$, because these variables correspond to states for which there are not enough time-steps left to complete the sequence (the unconnected circles in the top right of figure 3). Also $\alpha_t(s) = 0\ \forall s < 1$.

The probability of $\mathbf{l}$ is then the sum of the total probabilities of $\mathbf{l}'$ with and without the final blank at time $T$:

$$p(\mathbf{l}|\mathbf{x}) = \alpha_T(|\mathbf{l}'|) + \alpha_T(|\mathbf{l}'| - 1). \qquad (8)$$
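The recursion (6)-(8) translates almost line for line into code. The following Python sketch (ours; 1-based $s$, $t$ mapped to 0-based arrays, a non-empty labelling assumed, and the BLANK index reused from section 3's sketch) computes $p(\mathbf{l}|\mathbf{x})$ for a single labelling, ignoring for now the underflow issue addressed below:

def forward_probability(y, l):
    # p(l|x) via the forward recursion (6)-(8).
    # y[t][k]: softmax outputs; l: labelling without blanks.
    lp = [BLANK]
    for k in l:
        lp += [k, BLANK]            # the modified sequence l'
    S, T = len(lp), len(y)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = y[0][BLANK]       # initialisation: alpha_1(1)
    alpha[0][1] = y[0][lp[1]]       # alpha_1(2)
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]     # alpha-bar of (7):
            if s >= 1:
                a += alpha[t - 1][s - 1]
            if s >= 2 and lp[s] != BLANK and lp[s] != lp[s - 2]:
                a += alpha[t - 1][s - 2]   # skip term of (6)
            alpha[t][s] = a * y[t][lp[s]]
    return alpha[T - 1][S - 1] + alpha[T - 1][S - 2]   # equation (8)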

Similarly, the backward variables $\beta_t(s)$ are defined as the total probability of $\mathbf{l}_{s:|\mathbf{l}|}$ at time $t$:

$$\beta_t(s) \overset{def}{=} \sum_{\substack{\pi \in L'^T:\\ \mathcal{B}(\pi_{t:T}) = \mathbf{l}_{s:|\mathbf{l}|}}} \prod_{t'=t}^{T} y^{t'}_{\pi_{t'}} \qquad (9)$$

with initialisation

$$\beta_T(|\mathbf{l}'|) = y^T_b, \qquad \beta_T(|\mathbf{l}'| - 1) = y^T_{l_{|\mathbf{l}|}}, \qquad \beta_T(s) = 0,\ \ \forall s < |\mathbf{l}'| - 1$$

and recursion

$$\beta_t(s) = \begin{cases} \bar{\beta}_t(s)\, y^t_{l'_s} & \text{if } l'_s = b \text{ or } l'_{s+2} = l'_s \\ \big(\bar{\beta}_t(s) + \beta_{t+1}(s+2)\big)\, y^t_{l'_s} & \text{otherwise} \end{cases} \qquad (10)$$

where

$$\bar{\beta}_t(s) \overset{def}{=} \beta_{t+1}(s) + \beta_{t+1}(s+1). \qquad (11)$$

$\beta_t(s) = 0\ \forall s > 2t$ (the unconnected circles in the bottom left of figure 3) and $\forall s > |\mathbf{l}'|$.

In practice, the above recursions will soon lead to underflows on any digital computer. One way of avoiding this is to rescale the forward and backward variables (Rabiner, 1989). If we define

$$C_t \overset{def}{=} \sum_s \alpha_t(s), \qquad \hat{\alpha}_t(s) \overset{def}{=} \frac{\alpha_t(s)}{C_t}$$

and substitute $\hat{\alpha}$ for $\alpha$ on the RHS of (6) and (7), the forward variables will remain in computational range. Similarly, for the backward variables we define

$$D_t \overset{def}{=} \sum_s \beta_t(s), \qquad \hat{\beta}_t(s) \overset{def}{=} \frac{\beta_t(s)}{D_t}$$

and substitute $\hat{\beta}$ for $\beta$ on the RHS of (10) and (11).

To evaluate the maximum likelihood error, we need the natural logs of the target labelling probabilities. With the rescaled variables these have a particularly simple form:

$$\ln(p(\mathbf{l}|\mathbf{x})) = \sum_{t=1}^{T} \ln(C_t)$$
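In code, the rescaling amounts to normalising the forward variables at every time-step and accumulating $\ln C_t$. A sketch (ours, mirroring forward_probability above; states with too few time-steps left to finish $\mathbf{l}$ are pruned to zero, so that $C_T$ sums only the two end states of (8) and the identity above holds exactly):

import math

def log_labelling_probability(y, l):
    # ln p(l|x) = sum over t of ln(C_t), with rescaled forward
    # variables held in a single vector. Assumes y is long
    # enough for the labelling l to fit.
    lp = [BLANK]
    for k in l:
        lp += [k, BLANK]
    S, T = len(lp), len(y)
    alpha = [0.0] * S
    alpha[0], alpha[1] = y[0][BLANK], y[0][lp[1]]
    log_p = 0.0
    for t in range(T):
        if t > 0:
            new = [0.0] * S
            # states below the pruning bound cannot reach the end
            for s in range(max(0, S - 2 * (T - t)), S):
                a = alpha[s] + (alpha[s - 1] if s >= 1 else 0.0)
                if s >= 2 and lp[s] != BLANK and lp[s] != lp[s - 2]:
                    a += alpha[s - 2]
                new[s] = a * y[t][lp[s]]
            alpha = new
        C = sum(alpha)                   # C_t
        log_p += math.log(C)
        alpha = [a / C for a in alpha]   # alpha-hat
    return log_p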

4.2. Maximum Likelihood Training

The aim of maximum likelihood training is to simul-

taneously maximise the log probabilities of all the cor-

rect classiﬁcations in the training set. In our case, this

means minimising the following objective function:

$$O^{ML}(S, \mathcal{N}_w) = -\sum_{(\mathbf{x},\mathbf{z}) \in S} \ln p(\mathbf{z}|\mathbf{x}) \qquad (12)$$

To train the network with gradient descent, we need to

diﬀerentiate (12) with respect to the network outputs.

Since the training examples are independent we can

consider them separately:

$$\frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\}, \mathcal{N}_w)}{\partial y^t_k} = -\frac{\partial \ln(p(\mathbf{z}|\mathbf{x}))}{\partial y^t_k} \qquad (13)$$

We now show how the algorithm of section 4.1 can be

used to calculate (13).

The key point is that, for a labelling $\mathbf{l}$, the product of the forward and backward variables at a given $s$ and $t$ is the probability of all the paths corresponding to $\mathbf{l}$ that go through the symbol $s$ at time $t$. More precisely, from (5) and (9) we have:

$$\alpha_t(s)\beta_t(s) = \sum_{\substack{\pi \in \mathcal{B}^{-1}(\mathbf{l}):\\ \pi_t = l_s}} y^t_{l_s} \prod_{t'=1}^{T} y^{t'}_{\pi_{t'}}.$$

Rearranging and substituting in from (2) gives

$$\frac{\alpha_t(s)\beta_t(s)}{y^t_{l_s}} = \sum_{\substack{\pi \in \mathcal{B}^{-1}(\mathbf{l}):\\ \pi_t = l_s}} p(\pi|\mathbf{x}).$$


From (3) we can see that this is the portion of the total probability $p(\mathbf{l}|\mathbf{x})$ due to those paths going through $l_s$ at time $t$. We can therefore sum over all $s$ and $t$ to get:

$$p(\mathbf{l}|\mathbf{x}) = \sum_{t=1}^{T} \sum_{s=1}^{|\mathbf{l}|} \frac{\alpha_t(s)\beta_t(s)}{y^t_{l_s}}. \qquad (14)$$

Because the network outputs are conditionally independent (section 3.1), we need only consider the paths going through label $k$ at time $t$ to get the partial derivative of $p(\mathbf{l}|\mathbf{x})$ with respect to $y^t_k$. Noting that the same label may be repeated several times in a single labelling $\mathbf{l}$, we define the set of positions where label $k$ occurs as $lab(\mathbf{l}, k) = \{s : l_s = k\}$, which may be empty. We then differentiate (14) to get:

$$\frac{\partial p(\mathbf{l}|\mathbf{x})}{\partial y^t_k} = \frac{1}{(y^t_k)^2} \sum_{s \in lab(\mathbf{l},k)} \alpha_t(s)\beta_t(s). \qquad (15)$$

Observing that

$$\frac{\partial \ln(p(\mathbf{l}|\mathbf{x}))}{\partial y^t_k} = \frac{1}{p(\mathbf{l}|\mathbf{x})} \frac{\partial p(\mathbf{l}|\mathbf{x})}{\partial y^t_k}$$

we can set $\mathbf{l} = \mathbf{z}$ and substitute (8) and (15) into (13) to differentiate the objective function.

Finally, to backpropagate the gradient through the softmax layer, we need the objective function derivatives with respect to the unnormalised outputs $u^t_k$. If the rescaling of section 4.1 is used, we have:

$$\frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\}, \mathcal{N}_w)}{\partial u^t_k} = y^t_k - \frac{Q_t}{y^t_k} \sum_{s \in lab(\mathbf{z},k)} \hat{\alpha}_t(s)\hat{\beta}_t(s) \qquad (16)$$

where

$$Q_t \overset{def}{=} D_t \prod_{t'=t+1}^{T} \frac{D_{t'}}{C_{t'}}.$$

Eqn (16) is the ‘error signal’ received by the network

during training (ﬁgure 4).
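For completeness, here is a sketch (ours) of the error signal for one example, written with the unrescaled variables; it is algebraically equivalent to (16), since $\hat{\alpha}_t(s)\hat{\beta}_t(s)\,Q_t = \alpha_t(s)\beta_t(s)/p(\mathbf{z}|\mathbf{x})$, and a practical implementation would use the rescaled form to avoid underflow. The sum runs over the positions of the modified sequence $\mathbf{z}'$, so the blank unit receives a gradient too:

def ctc_error_signal(y, z):
    # Returns grad[t][k] = dO/du^t_k for one training example,
    # computed as y^t_k - (1 / (p(z|x) y^t_k)) * sum of
    # alpha_t(s) beta_t(s) over positions s of z' with label k.
    lp = [BLANK]
    for k in z:
        lp += [k, BLANK]
    S, T, n = len(lp), len(y), len(y[0])

    alpha = [[0.0] * S for _ in range(T)]   # forward pass, (6)-(7)
    alpha[0][0], alpha[0][1] = y[0][BLANK], y[0][lp[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t-1][s] + (alpha[t-1][s-1] if s >= 1 else 0.0)
            if s >= 2 and lp[s] != BLANK and lp[s] != lp[s-2]:
                a += alpha[t-1][s-2]
            alpha[t][s] = a * y[t][lp[s]]

    beta = [[0.0] * S for _ in range(T)]    # backward pass, (10)-(11)
    beta[T-1][S-1], beta[T-1][S-2] = y[T-1][BLANK], y[T-1][lp[S-2]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t+1][s] + (beta[t+1][s+1] if s + 1 < S else 0.0)
            if s + 2 < S and lp[s] != BLANK and lp[s] != lp[s+2]:
                b += beta[t+1][s+2]
            beta[t][s] = b * y[t][lp[s]]

    p = alpha[T-1][S-1] + alpha[T-1][S-2]   # equation (8)
    grad = [[y[t][k] for k in range(n)] for t in range(T)]
    for t in range(T):
        for s in range(S):
            k = lp[s]
            grad[t][k] -= alpha[t][s] * beta[t][s] / (p * y[t][k])
    return grad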

5. Experiments

We compared the performance of CTC with that of

both an HMM and an HMM-RNN hybrid on a real-

world temporal classiﬁcation problem: phonetic la-

belling on the TIMIT speech corpus. More precisely,

the task was to annotate the utterances in the TIMIT

test set with the phoneme sequences that gave the low-

est possible label error rate (as deﬁned in section 2.1).

To make the comparison fair, the CTC and hybrid

networks used the same RNN architecture: bidirec-

tional Long Short-Term Memory (BLSTM; Graves &

Schmidhuber, 2005). BLSTM combines the ability


of Long Short-Term Memory (LSTM; Hochreiter &

Schmidhuber, 1997) to bridge long time lags with

the access of bidirectional RNNs (BRNNs; Schuster

& Paliwal, 1997) to past and future context. We

stress that any other architecture could have been

used instead. We chose BLSTM because our exper-

iments with standard BRNNs and unidirectional net-

works gave worse results on the same task.

Figure 4. Evolution of the CTC Error Signal During Training. The left column shows the output activations for the same sequence at various stages of training (the dashed line is the 'blank' unit); the right column shows the corresponding error signals. Errors above the horizontal axis act to increase the corresponding output activation and those below act to decrease it. (a) Initially the network has small random weights, and the error is determined by the target sequence only. (b) The network begins to make predictions and the error localises around them. (c) The network strongly predicts the correct labelling and the error virtually disappears.

5.1. Data

TIMIT contains recordings of prompted English speech,

accompanied by manually segmented phonetic tran-

scripts. It has a lexicon of 61 distinct phonemes, and

comes divided into training and test sets containing

4620 and 1680 utterances respectively. 5% (184) of

the training utterances were chosen at random and

used as a validation set for early stopping in the hy-

brid and CTC experiments. The audio data was pre-

processed into 10 ms frames, overlapped by 5 ms, us-

ing 12 Mel-Frequency Cepstrum Coeﬃcients (MFCCs)

from 26 ﬁlter-bank channels. The log-energy was also

included, along with the ﬁrst derivatives of all coeﬃ-

cients, giving a vector of 26 coeﬃcients per frame in

total. The coeﬃcients were individually normalised to

have mean 0 and standard deviation 1 over the train-

ing set.
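As a small illustration (ours; the MFCC extraction itself is left to a standard speech front-end, and NumPy is assumed), the per-coefficient normalisation is:

import numpy as np

def normalise(train_frames, frames):
    # Standardise each of the 26 coefficients to mean 0 and
    # standard deviation 1, using training-set statistics only.
    # train_frames, frames: arrays of shape (num_frames, 26).
    mu = train_frames.mean(axis=0)
    sigma = train_frames.std(axis=0)
    return (frames - mu) / sigma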


5.2. Experimental Setup

The CTC network used an extended BLSTM archi-

tecture with peepholes and forget gates (Gers et al.,

2002), 100 blocks in each of the forward and backward

hidden layers, hyperbolic tangent for the input and

output cell activation functions and a logistic sigmoid

in the range [0,1] for the gates.

The hidden layers were fully connected to themselves

and the output layer, and fully connected from the

input layer. The input layer was size 26, the soft-

max output layer size 62 (61 phoneme categories plus

the blank label), and the total number of weights was

114,662.

Training was carried out with backpropagation through time and online gradient descent (weight updates after every training example), using a learning rate of $10^{-4}$ and a momentum of 0.9. Network activa-

tions were reset to 0 at the start of each example. For

preﬁx search decoding (section 3.2) the blank proba-

bility threshold was set at 99.99%. The weights were

initialised with a ﬂat random distribution in the range

[−0.1,0.1]. During training, Gaussian noise was added

to the inputs with a standard deviation of 0.6 to im-

prove generalisation.

The baseline HMM and hybrid systems were imple-

mented as in (Graves et al., 2005). Brieﬂy, baseline

HMMs with context-independent and context-dependent three-state left-to-right models were trained and

tested using the HTK Toolkit1. Observation probabil-

ities were modelled by a mixture of Gaussians. Both

the number of Gaussians and the insertion penalty

were chosen to obtain the best performance on the

task. Neither linguistic information nor probabilities

of partial phone sequences were included in the system.

There were more than 900,000 parameters in total.

The hybrid system comprised an HMM and a BLSTM

network, and was trained using Viterbi-based forced-

alignment (Robinson, 1994). Initial estimation of tran-

sition and prior probabilities of the one-state 61 mod-

els was carried out using the correct transcription for

the training set. Network output probabilities were

divided by prior probabilities to obtain likelihoods for

the HMM. The insertion penalty was chosen to obtain

the best performance on the task.

The BLSTM architecture and parameters were identi-

cal to those used for CTC, with the following excep-

tions: (1) the learning rate for the hybrid network was

10−5; (2) the injected noise had standard deviation

0.5; (3) the output layer had 61 units instead of 62

¹ http://htk.eng.cam.ac.uk/

Table 1. Label Error Rate (LER) on TIMIT. CTC and hybrid results are means over 5 runs, ± standard error. All differences were significant ($p < 0.01$), except that between weighted error BLSTM/HMM and CTC (best path).

System                        LER
Context-independent HMM       38.85%
Context-dependent HMM         35.21%
BLSTM/HMM                     33.84 ± 0.06%
Weighted error BLSTM/HMM      31.57 ± 0.06%
CTC (best path)               31.47 ± 0.21%
CTC (prefix search)           30.51 ± 0.19%

(no blank label). The noise and learning rate were set

for the two systems independently, following a rough

search in parameter space. The hybrid network had

a total of 114,461 weights, to which the HMM added

183 further parameters. For the weighted error exper-

iment, the error signal was scaled to give equal weight

to long and short phonemes (Robinson, 1991).

5.3. Experimental Results

The results in table 1 show that, with preﬁx search

decoding, CTC outperformed both a baseline HMM

recogniser and an HMM-RNN hybrid with the same

RNN architecture. They also show that preﬁx search

gave a small improvement over best path decoding.

Note that the best hybrid results were achieved with a

weighted error signal. Such heuristics are unnecessary

for CTC, as its objective function depends only on

the sequence of labels, and not on their duration or

segmentation.

Input noise had a greater impact on generalisation for

CTC than the hybrid system, and a higher level of

noise was found to be optimal for CTC.

6. Discussion and Future Work

A key diﬀerence between CTC and other temporal

classiﬁers is that CTC does not explicitly segment its

input sequences. This has several beneﬁts, such as re-

moving the need to locate inherently ambiguous label

boundaries (e.g. in speech or handwriting), and allow-

ing label predictions to be grouped together if it proves

useful (e.g. if several labels commonly occur together).

In any case, determining the segmentation is a waste of

modelling eﬀort if only the label sequence is required.

For tasks where segmentation is required (e.g. protein

secondary structure prediction), it would seem prob-

lematic to use CTC. However, as can be seen from ﬁg-

ure 1, CTC naturally tends to align each label predic-


tion with the corresponding part of the sequence. This

should make it suitable for tasks like keyword spotting,

where approximate segmentation is suﬃcient.

Another distinctive feature of CTC is that it does

not explicitly model inter-label dependencies. This is

in contrast to graphical models, where the labels are

typically assumed to form a $k$th-order Markov chain.

Nonetheless, CTC implicitly models inter-label depen-

dencies, e.g. by predicting labels that commonly occur

together as a double spike (see ﬁgure 1).

One very general way of dealing with structured data

would be a hierarchy of temporal classiﬁers, where the

labellings at one level (e.g. letters) become inputs for

the labellings at the next (e.g. words). Preliminary

experiments with hierarchical CTC have been encour-

aging, and we intend to pursue this direction further.

Good generalisation is always diﬃcult with maximum

likelihood training, but appears to be particularly so

for CTC. In the future, we will continue to explore

methods to reduce overﬁtting, such as weight decay,

boosting and margin maximisation.

7. Conclusions

We have introduced a novel, general method for tem-

poral classiﬁcation with RNNs. Our method ﬁts nat-

urally into the existing framework of neural network

classiﬁers, and is derived from the same probabilis-

tic principles. It obviates the need for pre-segmented

data, and allows the network to be trained directly for

sequence labelling. Moreover, without requiring any

task-speciﬁc knowledge, it has outperformed both an

HMM and an HMM-RNN hybrid on a real-world tem-

poral classiﬁcation problem.

Acknowledgements

We thank Marcus Hutter for useful mathematical dis-

cussions. This research was funded by SNF grants

200021-111968/1 and 200020-107534/1.

References

Bengio, Y. (1999). Markovian models for sequential data. Neural Computing Surveys, 2, 129-162.

Bishop, C. (1995). Neural networks for pattern recognition, chapter 6. Oxford University Press, Inc.

Bourlard, H., & Morgan, N. (1994). Connectionist speech recognition: A hybrid approach. Kluwer Academic Publishers.

Bridle, J. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Soulie and J. Herault (Eds.), Neurocomputing: Algorithms, architectures and applications, 227-236. Springer-Verlag.

Gers, F., Schraudolph, N., & Schmidhuber, J. (2002). Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3, 115-143.

Graves, A., Fernández, S., & Schmidhuber, J. (2005). Bidirectional LSTM networks for improved phoneme classification and recognition. Proceedings of the 2005 International Conference on Artificial Neural Networks. Warsaw, Poland.

Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18, 602-610.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735-1780.

Kadous, M. W. (2002). Temporal classification: Extending the classification paradigm to multivariate time series. Doctoral dissertation, School of Computer Science & Engineering, University of New South Wales.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. 18th International Conf. on Machine Learning (pp. 282-289). Morgan Kaufmann, San Francisco, CA.

LeCun, Y., Bottou, L., Orr, G., & Muller, K. (1998). Efficient backprop. Neural Networks: Tricks of the Trade. Springer.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE (pp. 257-286). IEEE.

Robinson, A. J. (1991). Several improvements to a recurrent error propagation network phone recognition system (Technical Report CUED/F-INFENG/TR82). University of Cambridge.

Robinson, A. J. (1994). An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5, 298-305.

Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14, 1723-1738.

Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45, 2673-2681.

Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78, 1550-1560.