Communicated by Terrence Sejnowski

Hybrid Modeling, HMM/NN Architectures, and Protein Applications

Pierre Baldi
Division of Biology, California Institute of Technology, Pasadena, CA 91125 USA

Yves Chauvin
Net-ID, Inc., San Francisco, CA 94107 USA

Neural Computation 8, 1541-1565 (1996) © 1996 Massachusetts Institute of Technology
We describe a hybrid modeling approach where the parameters of a model are calculated and modulated by another model, typically a neural network (NN), to avoid both overfitting and underfitting. We develop the approach for the case of Hidden Markov Models (HMMs), by deriving a class of hybrid HMM/NN architectures. These architectures can be trained with unified algorithms that blend HMM dynamic programming with NN backpropagation. In the case of complex data, mixtures of HMMs or modulated HMMs must be used. NNs can then be applied both to the parameters of each single HMM, and to the switching or modulation of the models, as a function of input or context. Hybrid HMM/NN architectures provide a flexible NN parameterization for the control of model structure and complexity. At the same time, they can capture distributions that, in practice, are inaccessible to single HMMs. The HMM/NN hybrid approach is tested, in its simplest form, by constructing a model of the immunoglobulin protein family. A hybrid model is trained, and a multiple alignment derived, with less than a fourth of the number of parameters used with previous single HMMs.
1 Introduction: Hybrid Modeling
One fundamental step in scientific reasoning is the inference of parameterized probabilistic models to account for a given data set D. If we identify a model M(θ), in a given class, with its parameter vector θ, then the goal is to approximate the distribution P(θ|D), and often to find its mode max_θ P(θ|D). Problems, however, arise whenever there is a mismatch between the complexity of the model and the data. Too complex models result in overfitting; too simple models result in underfitting.

The hybrid modeling approach attempts to finesse both problems. When the model is too complex, it is reparameterized as a function of
a simpler parameter vector w, so that θ = f(w).¹ When the data are too complex, short of resorting to a different model class, the only solution is to model the data with several M(θ)s, with θ varying discretely or continuously across different regions of data space. Thus the parameters must be modulated as a function of input, or context, in the form θ = f(I). In the general case, both may be desirable, so that θ = f(w, I). This approach is hybrid, in the sense that the function f can belong to a different model class. Since neural networks (NNs) have well-known universal approximation properties, a natural approach is to compute f with an NN, but other representations are possible. This approach is also hierarchical, because model reparameterizations can easily be nested at several levels. Here, for simplicity, we confine ourselves to a single level of reparameterization.
For concreteness, we focus on a particular class of probabilistic models, namely Hidden Markov Models (HMMs), and their application in molecular biology. To overcome the limitations of simple HMMs, we propose to use hybrid HMM/NN architectures² that combine the expressive power of artificial NNs with the sequential time series aspect of HMMs.

It is, of course, not the first time HMMs and NNs are combined. Hybrid architectures have been used both in speech and cursive handwriting recognition (Bourlard and Morgan 1994; Cho and Kim 1995). In many of these applications, however, NNs are used as front end processors to extract features, such as strokes, characters, and phonemes. HMMs are then used in higher processing stages for word and language modeling.³ The HMM and NN components are often trained separately, although there are some exceptions (Bengio et al. 1995). A different type of hybrid architecture is also described in Cho and Kim (1995), where the NN component is used to classify the pattern of likelihoods produced by several HMMs. Here, in contrast, the HMM and NN components are inseparable. This yields, among other things, unified training algorithms where the HMM dynamic programming and the NN backpropagation blend together.

In what follows, we first briefly review HMMs, how they can be used to model protein families, and their limitations. In Section 3, we develop HMM/NN hybrid architectures for single models, to address the problem of parameter complexity and control of overfitting. Simulation results are presented in Section 4 for a simple HMM/NN hybrid architecture used
¹Classical Bayesian hierarchical modeling relies on the description of a parameterized prior P_α(w), where α are the hyperparameters. This is related to the present situation θ = f(w), provided a prior P(w) is defined on the new parameters.

²HMM/NN architectures were first described at a NIPS94 workshop (Vail, CO) and at the International Symposium on Fifth Generation Computer Systems (Tokyo, Japan), in December 1994. Preliminary versions were published in the Proceedings of the Symposium, and in the Proceedings of the ISMB95 Conference.

³In the molecular biology applications to be considered, NNs could conceivably be used to interpret the analog output of various sequencing machines, but this is definitely not the focus here.
to model a particular protein family (immunoglobulins). In Section 5, we discuss HMM/NN hybrid architectures for multiple models, to address the problem of long-range dependencies or underfitting.
2 HMMs of Protein Families
Many problems in computational molecular biology can be cast in terms of statistical pattern recognition and formal languages (Searls 1992). The increasing abundance of sequence data creates a favorable situation for machine learning approaches, where grammars are learned from the data. In particular, HMMs are equivalent to stochastic regular grammars and have been extensively used to model protein families and DNA coding regions (Baldi et al. 1994a,b; Krogh et al. 1994a; Baldi and Chauvin 1994a; Krogh et al. 1994b).

Proteins consist of polymer chains of amino acids. There are 20 important amino acids, so that proteins can be viewed as strings of letters over a 20-letter alphabet. Protein sequences with a common ancestor share functional and structural properties, and can be grouped into families. Aligning sequences in a family is important, for instance to detect highly conserved regions, or motifs, with particular significance. Multiple alignment of highly divergent families where, as a result of evolutionary insertions and deletions, pairs of sequences often share less than 20% amino acids, is a highly nontrivial task (Myers 1994).
A first-order discrete HMM can be viewed as a stochastic generative model defined by a set of states S, an alphabet A of M symbols, a probability transition matrix T = (t_{ij}), and a probability emission matrix E = (e_{iX}). The system randomly evolves from state to state, while emitting symbols from the alphabet. When the system is in a given state i, it has a probability t_{ij} of moving to state j, and a probability e_{iX} of emitting symbol X. As in the application of HMMs to speech recognition, a family of proteins can be seen as a set of different utterances of the same word, generated by a common underlying HMM. One of the standard HMM architectures for protein applications (Krogh et al. 1994a) is the left-right architecture depicted in Figure 1. The alphabet has M = 20 symbols, one for each amino acid (M = 4 for DNA or RNA models, one symbol per nucleotide). In addition to the start and end state, there are three classes of states: the main states, the delete states, and the insert states, with S = {start, m_1, ..., m_N, i_1, ..., i_{N+1}, d_1, ..., d_N, end}. N is the length of the model, typically equal to the average length of the sequences in the family. The main and insert states always emit an amino-acid symbol, whereas the delete states are mute. The linear sequence of state transitions start → m_1 → m_2 → ... → m_N → end is the backbone of the model. For each main state, corresponding insert and delete states are needed to model insertions and deletions. The self-loop on the insert states allows for multiple insertions at a given site.
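The generative view described above can be made concrete with a short sketch. The following Python fragment is an illustration of ours, not code from the paper: it samples sequences from a tiny toy HMM whose states, alphabet, and probabilities are invented, and it collapses the left-right backbone into ordinary emitting states, omitting the mute delete states and insert self-loops of Figure 1.

```python
import numpy as np

# Minimal sketch (not from the paper) of the generative view of a first-order
# discrete HMM: the system moves from state to state according to T and emits
# symbols according to E. Toy model: a 3-main-state left-right backbone.

rng = np.random.default_rng(0)

states = ["start", "m1", "m2", "m3", "end"]
alphabet = list("ACDE")                      # toy 4-letter alphabet (M = 4)

# T[i, j] = probability of moving from state i to state j (rows sum to 1).
T = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
])

# E[i, X] = probability that state i emits symbol X; start/end are mute here.
E = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])

def sample_sequence(T, E, emitting=(1, 2, 3)):
    """Random walk from start (state 0) to end (last state), emitting symbols."""
    i, symbols = 0, []
    while i != len(states) - 1:
        i = rng.choice(len(states), p=T[i])
        if i in emitting:
            symbols.append(rng.choice(alphabet, p=E[i]))
    return "".join(symbols)

print([sample_sequence(T, E) for _ in range(3)])
```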
Figure 1: Example of HMM architecture used in protein modeling. S is the start state, E the end state. d_i, m_i, and i_i denote delete, main, and insert states, respectively.
2.1 Learning Algorithms. Given a sample of K training sequences O_1, ..., O_K, the parameters of an HMM can be iteratively modified, in an unsupervised way, to optimize the data fit according to some measure, usually based on the likelihood of the data. Since the sequences can be considered as independent, the overall likelihood is equal to the product of the individual likelihoods. Two target functions, commonly used for training, are the negative log-likelihood:

Q = -Σ_{k=1}^{K} Q_k = -Σ_{k=1}^{K} ln P(O_k)     (2.1)

and the negative log-likelihood based on the optimal paths:

Q = -Σ_{k=1}^{K} Q_k = -Σ_{k=1}^{K} ln P[π(O_k)]     (2.2)

where π(O) is the most likely HMM production path for sequence O. π(O) can be computed efficiently by dynamic programming (Viterbi algorithm). Depending on the situation, the Viterbi path approach can be considered as a fast approximation to the full maximum likelihood, or as an algorithm in its own right. This can be the case in protein modeling where, as described below, the optimal paths play an important role.
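For reference, the Viterbi recursion mentioned above can be sketched as follows. This is the textbook dynamic programming algorithm for an HMM in which every state emits a symbol; handling the mute delete states of Figure 1 requires extra bookkeeping that is omitted here, and all names and the tiny demo model are illustrative, not the authors' implementation.

```python
import numpy as np

# Sketch of the Viterbi recursion used to compute the most likely production
# path pi(O) for a fully emitting HMM (no mute states).

def viterbi(obs, start_p, T, E):
    """obs: list of symbol indices; start_p: initial state distribution;
    T: transition matrix; E: emission matrix. Returns (best log-prob, path)."""
    n_states = T.shape[0]
    logT, logE = np.log(T + 1e-300), np.log(E + 1e-300)   # avoid log(0)
    delta = np.log(start_p + 1e-300) + logE[:, obs[0]]    # best log-prob so far
    psi = np.zeros((len(obs), n_states), dtype=int)       # backpointers
    for t in range(1, len(obs)):
        scores = delta[:, None] + logT                    # scores[i, j]
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logE[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return float(delta.max()), path[::-1]

# Tiny two-state demo with made-up probabilities.
T = np.array([[0.9, 0.1], [0.2, 0.8]])
E = np.array([[0.8, 0.2], [0.3, 0.7]])
print(viterbi([0, 0, 1, 1], np.array([0.5, 0.5]), T, E))
```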
When priors on the parameters are included, one can also add regularizer terms to the objective functions for maximum a posteriori (MAP) estimation.

Different algorithms are available for HMM training, including the Baum-Welch or expectation-maximization (EM) algorithm, and different forms of gradient descent and other generalized EM (GEM) algorithms (Dempster et al. 1977; Rabiner 1989; Baldi and Chauvin 1994a). In the Baum-Welch algorithm, the parameters are updated according to

e_{iX} = m_{iX} / m_i,    t_{ij} = n_{ij} / n_i     (2.3)

where m_i = Σ_X m_{iX} (respectively n_i = Σ_j n_{ij}), and m_{iX} (respectively n_{ij}) are the normalized⁴ expected emission (respectively transition) counts, induced by the data, that can be calculated using the forward-backward dynamic programming procedure (Rabiner 1989), or the Viterbi paths in Viterbi learning.
As for gradient descent, and other GEM algorithms, a useful reparameterization (Baldi and Chauvin 1994b), in terms of normalized exponentials, consists of

t_{ij} = exp(w_{ij}) / Σ_k exp(w_{ik}),    e_{iX} = exp(v_{iX}) / Σ_Y exp(v_{iY})     (2.4)

with w_{ij} and v_{iX} as the new variables. This reparameterization has two advantages: (1) modification of the w's and v's automatically preserves normalization constraints on emission and transition distributions; and (2) transition and emission probabilities can never reach the absorbing value 0. The on-line gradient descent equations on the negative log-likelihood are then

Δw_{ij} = η (n_{ij} - n_i t_{ij}),    Δv_{iX} = η (m_{iX} - m_i e_{iX})

where η is the learning rate. The variables n_{ij}, n_i, m_{iX}, m_i are again the expected counts derived by the forward-backward procedure, for each single sequence if the algorithm is to be used on-line. Similarly, in Viterbi learning, at each step along a Viterbi path, and for any state i on the path, the parameters of the model are updated according to

Δw_{ij} = η (T_{ij} - t_{ij}),    Δv_{iX} = η (E_{iX} - e_{iX})

where T_{ij} = 1 (respectively E_{iX} = 1) if the i → j transition (respectively the emission of X from i) is used, and 0 otherwise. The new parameters are therefore updated incrementally, using the discrepancy between the frequencies induced by the training data and the probability parameters of the model.
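The reparameterization of 2.4 and the discrepancy-style on-line update can be illustrated with a short sketch. The fragment below is ours, not the paper's code: the expected counts are taken as given (in a real run they would come from the forward-backward procedure or a Viterbi path), and the sizes and random numbers are placeholders.

```python
import numpy as np

# Sketch of the normalized-exponential reparameterization (eq 2.4) and a
# discrepancy-style on-line emission update, assuming the expected counts
# m_iX for one sequence are already available (forward-backward or Viterbi).

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n_states, M = 5, 20                     # emitting states, alphabet size
v = np.zeros((n_states, M))             # new variables; e = softmax(v) rowwise
eta = 0.1                               # learning rate

def online_emission_update(v, m_counts):
    """One on-line step: m_counts[i, X] are expected emission counts for the
    current sequence; m_i is their row sum."""
    e = softmax(v)                      # current emission probabilities
    m_i = m_counts.sum(axis=1, keepdims=True)
    return v + eta * (m_counts - m_i * e)

fake_counts = rng.random((n_states, M)) # stand-in for forward-backward output
v = online_emission_update(v, fake_counts)
print(softmax(v)[0].sum())              # rows remain valid distributions: 1.0
```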
⁴Unlike in Baldi and Chauvin (1994b), throughout this paper we use the more classical notation of Rabiner (1989), where the counts, for a given sequence, automatically incorporate a normalization by the probability P(O) of the sequence itself.
Regardless of the training method, once an HMM has been successfully trained on a family of sequences, it can be used in a number of tasks. For instance, for any given sequence, one can compute its most likely path, as well as its likelihood. A multiple alignment results immediately from aligning all the optimal paths. The likelihoods can be used for discrimination tests and data base searches (Krogh et al. 1994a; Baldi and Chauvin 1994b). In the case of proteins, HMMs have been successfully applied to several families such as globins, immunoglobulins, kinases, and G-protein-coupled receptors. In most cases, HMMs have performed well on all tasks yielding, for instance, multiple alignments that are comparable to those derived by human experts.
2.2 Limitations of HMMs. In spite of their success in various applications, HMMs can suffer from two weaknesses. First, they often have a large number of unstructured parameters. In the case of protein models, the architecture of Figure 1 has a total of approximately 49N parameters (40N emission parameters and 9N transition parameters). For a typical protein family, N is of the order of a few hundred, resulting immediately in models with over 10,000 free parameters. This can lead to overfitting when only a few sequences are available,⁵ not an uncommon situation in early stages of genome projects. Second, first-order HMMs are limited with respect to dependencies between hidden states, found in most interesting problems. Proteins, for instance, fold into complex 3D shapes, essential to their function. Subtle long-range correlations in their polypeptide chains may exist that are not accessible to a single HMM. For instance, assume that whenever X is found at position i, it is generally followed by Y at position j; and whenever X' is found at position i, it tends to be followed by Y' at j. A single HMM has typically two fixed emission vectors associated with the i and j positions. Therefore it cannot capture such correlations. Related problems are also the nonstationarity of complex time series, as well as the variability often encountered in "speaker-independent" recognition problems. Only a small fraction of distributions over the space of possible sequences, essentially the factorial distributions, can be represented by a reasonably constrained HMM.⁶
3 HMM/NN Hybrid Architectures: Single Model Case
3.1 Basic Idea. In a general HMM, an emission or transition vector θ is a function of the state i only: θ = f(i). The first basic idea is to have
⁵It should be noted, however, that a typical sequence provides on the order of 2N constraints, and 25 sequences or so provide a number of examples in the same range as the number of HMM parameters.
⁶Any distribution can be represented by a single exponential-size HMM, with a start state connected to different sequences of deterministic states, one for each possible alphabet sequence, with a transition probability equal to the probability of the sequence itself.
a NN on top of the HMM, for the computation of the HMM parameters, that is, for the computation of the function f. NNs are universal approximators, and, therefore, can represent any f. More importantly perhaps, NN representations enable the flexible introduction of many possible constraints. For simplicity, we discuss emission parameters only, but the approach extends immediately to transition parameters as well. In the reparameterization of 2.4, we can consider that each one of the HMM emission vectors is calculated by a small NN, with one input set to one (bias), no hidden layers, and 20 softmax output units (Fig. 2a). The connections between the input and the outputs are the v_{iX}. This can be generalized immediately by having arbitrarily complex NNs for the computation of the HMM parameters. The NNs associated with different states can also be linked with one or several common hidden layers, the overall architecture being dictated by the problem at hand. In the case of a discrete alphabet, however, such as for proteins, the emission of each state is a multinomial distribution, and, therefore, the output of the corresponding network should consist of M softmax units.

As a simple example, consider the hybrid HMM/NN architecture of Figure 2b consisting of the following:

1. Input layer: one unit for each state i. At each time, all units are set to 0, except one which is set to 1. If unit i is set to 1, the network computes e_{iX}, the emission distribution of state i.

2. Hidden layer: H hidden units indexed by h, each with transfer function f_h (logistic by default) and bias b_h (H < M).

3. Output layer: M softmax units or weighted exponentials, indexed by X, with bias b_X.

4. Connections: α = (α_{ih}) connects input position i to hidden unit h; β = (β_{Xh}) connects hidden unit h to output unit X.

For input i, the activity of the hth unit in the hidden layer is given by

f_h(α_{ih} + b_h)     (3.1)

The corresponding activity in the output layer is

e_{iX} = exp(Σ_h β_{Xh} f_h(α_{ih} + b_h) + b_X) / Σ_Y exp(Σ_h β_{Yh} f_h(α_{ih} + b_h) + b_Y)     (3.2)
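A compact sketch of this emission network may help fix ideas. The fragment below is ours, not the authors' implementation: it feeds a one-hot state index through H logistic hidden units and M softmax outputs exactly as in 3.1-3.2, using illustrative sizes and random weights, and it prints the rough parameter-count comparison discussed in the list that follows.

```python
import numpy as np

# Sketch of the single-hidden-layer emission network of equations 3.1-3.2:
# a one-hot HMM state index goes through H logistic hidden units and M
# softmax outputs to produce the emission vector e_i for every state i.

N, H, M = 117, 2, 20                      # states, hidden units, alphabet size
rng = np.random.default_rng(2)

alpha = rng.normal(0.0, 1.0, size=(N, H)) # input-to-hidden weights alpha_ih
b_h = np.zeros(H)                         # hidden biases
beta = np.ones((M, H))                    # hidden-to-output weights beta_Xh
b_X = np.zeros(M)                         # output biases

def emissions(alpha, b_h, beta, b_X):
    hidden = 1.0 / (1.0 + np.exp(-(alpha + b_h)))       # (N, H), eq 3.1
    scores = hidden @ beta.T + b_X                       # (N, M)
    scores -= scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)              # softmax rows, eq 3.2

E = emissions(alpha, b_h, beta, b_X)
print(E.shape, E[0].sum())                # (117, 20), each row sums to 1

# Rough parameter count, ignoring biases: H * (N + M) for the hybrid network
# versus N * M for a plain HMM emission table (cf. the first points below).
print(H * (N + M), N * M)                 # 274 vs 2340
```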
For hybrid HMM/NN architectures, a number of points are worth noticing:

• The HMM states can be partitioned into different groups, with different networks for different groups. In protein applications, for instance, one can use different NNs for insert states and for main states, or for different groups of states along the protein sequence corresponding, for instance, to different regions (hydrophobic, hydrophilic, alpha-helical, etc.).
Figure 2: (a) Schematic representation of the simple HMM/NN hybrid architecture used in Baldi et al. (1994b). Each HMM state has its own NN. Here, the NNs are extremely simple, with no hidden layer, and an output layer of softmax units computing the state emission, or transition, parameters. Only output emissions are represented for simplicity. (b) Schematic representation of an HMM/NN architecture where the NNs associated with different states (or different groups of states) are connected via one or several hidden layers.
• HMM parameter reduction can easily be achieved using small hidden layers with H hidden units, and H small compared to N or M. In the example of Figure 2b, with H hidden units and considering only main states, the number of parameters is H(N + M) in the HMM/NN architecture, versus NM in the corresponding simple HMM. For protein models, this yields roughly HN parameters for the HMM/NN architecture, versus 20N for the simple HMM. H = M is equivalent to 2.4.

• The number of parameters can be adaptively adjusted to variable training set sizes, merely by changing the number of hidden units. This is useful in environments with large variations in data base sizes, as in current molecular biology applications.

• The entire bag of connectionist tricks can be brought to bear on these architectures, such as radial basis functions, multiple hidden layers, sparse connectivity, weight sharing, gaussian priors, and hyperparameters. Several initializations and structures can be implemented in a flexible way. For instance, by allocating different numbers of hidden units to different subsets of emissions or transitions, it is easy to favor certain classes of paths in the models, when needed. In the HMM of Figure 1, for instance, one must introduce a bias favoring main states over insert states, prior to any learning. It is easy also to tie different regions of a protein that may have similar properties by weight sharing, and other types of long-range correlations. By setting the output bias to the proper values, the model can be initialized to the average composition of the training sequences, or any other useful distribution.

• Classical prior information in the form of substitution matrices, for instance, is easily incorporated. Substitution matrices (Altschul 1991) can be computed from data bases, and essentially produce a background probability matrix P = (p_{XY}), where p_{XY} is the probability that X be changed into Y over a certain evolutionary time. P can be implemented as a linear transformation in the emission NN.

• HMMs with continuous emission distributions are also easy to incorporate in the HMM/NN framework. The output emission distributions can be represented, for instance, in the form of samples, moments, and/or mixture coefficients. In the classical mixture of gaussians case, means, covariances, and mixture coefficients can be computed by the NN. Likewise, additional HMM parameters, such as exponential parameters to model the duration of stay in any given state, can be calculated by a NN.
With hybrid HMM/NN architectures, in general, the M step of the EM algorithm cannot be carried out analytically. One can still use, however, some form of gradient descent using the chain rule, by computing the derivatives of the target likelihood functions 2.1 or 2.2 with respect to the HMM parameters, and then the derivatives of the HMM parameters with respect to the NN parameters. For completeness, a derivation of the learning equations for the HMM/NN architecture described above is given in the Appendix. In the resulting learning equations (A.3 and A.7), the HMM dynamic programming and the NN backpropagation components are intimately fused. These algorithms can also be seen as GEM (generalized EM) algorithms (Dempster et al. 1977). They can easily be modified to MAP optimization with inclusion of priors.
3.2 Representation in Simple HMM/NN Architectures. Consider the particular HMM/NN described above, where a subset of the HMM states are fully connected to H hidden units, and the hidden units are fully connected to M softmax output units. The hidden unit bias is not really necessary, in the sense that for any HMM state i, any vector of biases b_h, and any vector of connections α_{ih}, there exists a new vector of connections α'_{ih} that produces the same vector of hidden unit activations with 0 bias. This is not true in the general case, for instance, as soon as there are multiple hidden layers, or if the input units are not fully interconnected to the hidden layer. We have left the biases for the sake of generality, and also because even if the biases do not enlarge the space of possible representations, they may still facilitate the learning procedure. Similar remarks hold more generally for the transfer functions. With an input layer fully connected to a single hidden layer, the same hidden layer activation can be achieved with different activation functions, by modifying the weights.

A natural question to ask is what is the representation used in the hidden layer, and what is the space of emission distributions achievable in this fashion? Each HMM state in the network can be represented by a point in the [-1, 1]^H hypercube. The coordinates of a point are the activities of the H hidden units. By changing its connections to the H hidden units, an HMM state can occupy any position in the hypercube. So, the space of achievable emission distributions is entirely determined by the connections from the hidden to the output layer. If these connections are held fixed, then each HMM state can select a corresponding optimal position in the hypercube, where its emission distribution, generated by the NN weights, is as close as possible to the truly optimal distribution, for instance in cross-entropy distance. During on-line learning, all parameters are learned at the same time, so this may introduce additional effects.
To further understand the space of achievable distributions, consider the transformation from hidden to output units. For notational convenience, we introduce one additional hidden unit, numbered 0, always set to 1, to express the output biases in the form b_X = β_{X0}. If, in this extended hidden layer, we turn a single hidden unit to 1, one at a time, we obtain H + 1 different emission distributions in the output layer, P^h = (p^h_X) (0 ≤ h ≤ H), with

p^h_X = exp(β_{Xh}) / Σ_Y exp(β_{Yh})     (3.3)

Consider now a general pattern of activity in the hidden layer of the form (1, h_1, ..., h_H). Using 3.2 and 3.3, the emission distribution in the output layer is then

p_X = exp(Σ_{h=0}^{H} β_{Xh} h_h) / Σ_Y exp(Σ_{h=0}^{H} β_{Yh} h_h),  with h_0 = 1     (3.4)

After simplifications, this yields

p_X = Π_{h=0}^{H} (p^h_X)^{h_h} / Σ_Y Π_{h=0}^{H} (p^h_Y)^{h_h}     (3.5)

Therefore, all the emission distributions achievable by the NN have the form of 3.5, and can be viewed as "combinations" of H + 1 fundamental distributions P^h associated with each single hidden unit. In general, this combination is different from a convex linear combination of the P^h's. It consists of three operations: (1) raising each component of P^h to the power h_h, the activity of the hth hidden unit, (2) multiplying all the corresponding vectors componentwise, and (3) normalizing. In this form, the hybrid HMM/NN approach is different from a mixture of Dirichlet distributions approach.
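The equivalence between the softmax form 3.4 and the product form 3.5 can be checked numerically. The snippet below is a small verification of ours, using arbitrary toy weights and hidden activities; it is not part of the original simulations.

```python
import numpy as np

# Quick numerical check (toy numbers) that the softmax of eq 3.4 equals the
# normalized componentwise product of eq 3.5: achievable emissions are
# "combinations" of the P^h raised to the hidden activities, renormalized.

rng = np.random.default_rng(3)
H, M = 3, 20
beta = rng.normal(size=(M, H + 1))                    # column 0 plays b_X
h = np.concatenate(([1.0], rng.uniform(-1, 1, H)))    # pattern (1, h_1..h_H)

# Fundamental distributions P^h of eq 3.3, one per (extended) hidden unit.
P = np.exp(beta) / np.exp(beta).sum(axis=0, keepdims=True)   # (M, H+1)

# Eq 3.4: softmax of the summed weighted inputs.
scores = beta @ h
p_softmax = np.exp(scores) / np.exp(scores).sum()

# Eq 3.5: componentwise product of P^h raised to the powers h_h, normalized.
prod = np.prod(P ** h, axis=1)
p_product = prod / prod.sum()

print(np.allclose(p_softmax, p_product))              # True
```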
4 Simulation Results
Here we demonstrate a simple application of the principles behind HMM/NN hybrid architectures on the immunoglobulin protein family. Immunoglobulins, or antibodies, are proteins produced by B cells that bind with specificity to foreign antigens in order to neutralize them, or target their destruction by other effector cells. The various classes of immunoglobulins are defined by pairs of light and heavy chains that are held together principally by disulfide bonds⁷ (Fig. 3). Each light and heavy chain molecule contains one variable (V) region, and one (light) or several (heavy) constant (C) regions. The V regions differ among immunoglobulins, and provide the specificity of the antigen recognition. About one-third of the amino acids of the V regions form the hypervariable sites, responsible for the diversity of the vertebrate immune response. Our data base is the same as the one used in Baldi et al. (1994b), and consists of human and mouse heavy chain immunoglobulin V region sequences, from the Protein Identification Resources (PIR) data base. It contains 224 sequences, with minimum length 90, average length N = 117, and maximum length 254.
⁷Disulfide bonds are covalent bonds between two sulfur atoms in different amino acids (typically cysteines) of a protein that are important in determining secondary and tertiary structure.
Figure 3: A model of the structure of a typical human antibody molecule, composed of two light and two heavy polypeptide chains. Interchain and intrachain disulfide bonds are indicated. Cysteine (C) residues are associated with the bonds. Two identical active sites for antigen binding, corresponding to the variable regions, are located in the arms of the molecule. (From Molecular Biology of the Gene, Vol. II, Fourth Edition, by Watson et al. Copyright © 1987 by James D. Watson. Published by The Benjamin/Cummings Publishing Company.)
For the immunoglobulin V regions, our results (Baldi et al. 1994b) were obtained by training a simple HMM, similar to the one in Figure 1, containing a total of 52N + 23 = 6107 adjustable parameters. Here we train a hybrid HMM/NN architecture with the following characteristics. The basic model is an HMM with the architecture of Figure 1. All the main state emissions are calculated by a common NN, with 2 hidden units. Likewise, all the insert state emissions are calculated by a common NN, with one hidden unit only. Each state transition distribution is calculated by a different softmax network, as in our previous work. With edge effects neglected, the total number of parameters of this HMM/NN architecture is 1507 (117 × 3 × 3 = 1053 for the transitions, 117 × 3 + 3 + 3 × 20 + 40 = 454 for the emissions, including biases). This architecture is not at all optimized: for instance, we suspect we could have significantly reduced the number of transition parameters. Our goal at this time is only to demonstrate the general HMM/NN principles, and test the learning algorithm.
The hybrid architecture is then trained on-line, using both gradient descent (A.3) and the Viterbi version (A.7). The training set consists of a random subset of 150 sequences, identical to the training set used previously. There, emission and transition parameters were initialized uniformly. Here, the input-to-hidden weights are initialized with independent gaussians, with mean 0 and standard deviation 1. The hidden-to-output weights are initialized to 1. This yields a uniform emission probability distribution on all the emitting states.⁸ Notice also that if all the weights are initialized to 1, including those from input to hidden layer, then the hidden units cannot differentiate from each other. The transition probabilities out of insert or delete states are initialized uniformly to 1/3. We introduce, however, a small bias along the backbone that favors main to main transitions, in the form of a Dirichlet prior. This prior is equivalent to introducing a regularization term in the objective function, equal to the logarithm of the backbone transition path. The regularization constant is set to 0.01, and the learning rate to 0.1. Typically, 10 training cycles are more than sufficient to reach equilibrium.
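The claim that equal hidden-to-output weights yield uniform emissions can be verified with a few lines. The check below is ours, with toy sizes and random inputs; it simply confirms that when every output unit receives the same total input, the softmax is uniform regardless of the hidden activities.

```python
import numpy as np

# Check of the initialization argument above: if all hidden-to-output weights
# (and output biases) are equal, the softmax output is uniform over the 20
# symbols, because every output unit receives the same total input.

rng = np.random.default_rng(4)
H, M = 2, 20
alpha_row = rng.normal(0.0, 1.0, size=H)        # input-to-hidden ~ N(0, 1)
beta = np.ones((M, H))                          # hidden-to-output set to 1
b_X = np.zeros(M)

hidden = np.tanh(alpha_row)                     # any squashing function works
scores = beta @ hidden + b_X                    # identical for every symbol X
e = np.exp(scores) / np.exp(scores).sum()
print(np.allclose(e, 1.0 / M))                  # True: uniform emissions
```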
In Figure 4, we display the multiple alignment of 20 immunoglobulin sequences, selected randomly from both the training and validation sets. The validation set consists of the remaining 74 sequences. This alignment is very stable between 5 and 10 epochs.⁹ It corresponds to a model trained by A.7 for 10 epochs. While there is currently no universally accepted measure of the quality of an alignment, the present alignment is similar to the previous one, derived with a simple HMM with more than four times as many parameters. The algorithm has been able to detect most of the salient features of the family. Most importantly, the cysteine residues (C) toward the beginning and the end of the region (positions 24 and 100 in this alignment), which are responsible for the disulfide bonds that hold the chains, are perfectly aligned. The only exception is the last sequence (PH0097), which has a serine (S) residue in its terminal portion. This is a rare but recognized exception to the conservation of this position. Some of the sequences in the family came with a "header" (transport signal peptide). We did not remove the headers
⁸With Viterbi learning, this is probably better than a nonuniform initialization, such as the average composition. A nonuniform initialization may introduce distortions in the Viterbi paths.
⁹Differences with the alignment published in the ISMB95 Proceedings result from differences in regularization, and not in the number of training cycles.
prior to training. The model is capable of detecting and accommodating these headers, by treating them as initial repeated inserts, as can be seen from the alignment of three of the sequences (S09711, A36194, S11239). This multiple alignment also contains a few isolated problems, related in part to the overuse of gaps and insert states. Interestingly, this is most evident in the hypervariable regions, for instance at positions 30-35 and 50-55. These problems should be eliminated with a more careful selection of hybrid architecture and/or regularization. Alignments did not improve using A.3 and/or a larger number of hidden units, up to 4.
In Figure 5, we display the activity of the two hidden units associated with each main state (see 3.2). For most states, at least one of the activities is saturated. The activities associated with the cysteine residues responsible for the disulfide bridges (main states 24 and 100) are all saturated, and in the same corner (-1, +1). Points close to the center (0, 0) correspond to emission distributions determined by the bias only. For the main states, the three emission distributions of equation 3.3, associated with the bias and the two hidden units, are given by

P^0 = (0.442, 0.000, 0.005, 0.000, 0.001, 0.000, 0.004, 0.002, 0.133, 0.000, 0.000, 0.000, 0.000, 0.113, 0.195, 0.000, 0.104, 0.001, 0.000, 0.000)

P^1 = (0.000, 0.000, 0.000, 0.036, 0.000, 0.900, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.037, 0.000, 0.000, 0.000, 0.000, 0.000, 0.027)

P^2 = (0.000, 0.040, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.942, 0.001, 0.000, 0.016, 0.000, 0.000, 0.000, 0.000, 0.001, 0.000, 0.000)

using alphabetical order on single-letter amino acid symbols.
5 Discussion: The Case of Multiple Models
The hybrid HMM/NN architectures described address the first limitation of HMMs: the control of model structure and complexity. No matter how complex the NN component, however, the final model so far remains a single HMM. Therefore the second limitation of HMMs, long-range dependencies and underfitting, remains. This obstacle cannot be overcome by simply resorting to higher-order HMMs. Most often these are computationally intractable.

A possible approach is to try to introduce a new state for each relevant context. This requires a systematic method for determining relevant contexts of variable lengths, directly from the data. Furthermore, one must
Figure 4: Multiple alignment of 20 immunoglobulin sequences, randomly extracted from the training and validation data sets. Validation sequences: F37262, G1HUDW, A36194, A31485, D33548, S11239, I27888, A33989, A30502. Alignment is obtained with a hybrid HMM/NN architecture trained for 10 cycles, with two hidden units for the main state emissions, and one hidden unit for the insert state emissions. Lower case letters correspond to emissions from insert states. Notice the initial header (transport signal peptide) on some of the sequences, captured as repeated transitions through the first insert state in the model. The cysteines (C), associated with the disulfide bridge, in columns 24 and 100, are perfectly aligned (PH0097 is a known biological exception).
Figure 5: Activity of the two hidden units associated with the emission of the main states. The two activities associated with the cysteines (C) are in the upper left corner, almost overlapping, with coordinates (-1, +1).
hope the number of relevant contexts remains small. An interesting approach along these lines can be found in Ron et al. (1994), where English is modeled as a Markov process with variable memory length of up to 10 letters or so.
To address the second limitation without resorting to a different model class, one must consider more general HMM/NN hybrid architectures, where the underlying statistical model is a set of HMMs. To see this, consider again the X → Y / X' → Y' problem. To capture such dependencies requires variable emission vectors at the corresponding locations, together with a linking mechanism. In this simple case, four different emission vectors are needed: e_i, e_j, e'_i, and e'_j. Each one of these vectors must assign a high probability to the letters X, Y, X', and Y', respectively. More importantly, there must be some kind of memory, so that e_i and e_j are used for sequence O, and e'_i and e'_j are used for sequence O'. The combination of e_i and e'_j (or e'_i and e_j) should be rare or not allowed, unless required by the data. Thus e_i and e_j must belong to a first HMM, and e'_i and e'_j to a second HMM, with the possibility of switching from one HMM to the other, as a function of input sequence. Alternatively,
there must be a single HMM, but with variable emission distributions, modulated again by some input.

In both cases, then, we consider that the emission distribution of a given state depends not only on the state itself, but also on an additional stream of information I. That is, now θ = f(i, I). In a multiple HMM/NN hybrid architecture, f can be computed again by a NN. Depending on the problem, the input I can assume different forms, and may be called "context" or "latent variable." When feasible, I may even be equal to the currently observed sequence O. Other inputs are, however, possible, over different alphabets. An obvious candidate in protein modeling tasks would be the secondary structure of the protein (α-helices, β-sheets, and coils). In general, I could also be any other array of numbers representing latent variables for the HMM modulation (MacKay 1994). We shall now describe, without any simulations, two simple but somewhat canonical architectures of this sort. Learning is briefly discussed in the Appendix.
5.1 Example 1: Mixtures of HMM Experts. A first possible approach is to put an HMM mixture distribution on the sequences. With M HMMs M_1, ..., M_M, the likelihood of a sequence O becomes

P(O) = Σ_{i=1}^{M} λ_i P_{M_i}(O)     (5.1)

where Σ_i λ_i = 1, and the λ_i are the mixture coefficients. Similarly, the Viterbi likelihood is max_i λ_i P[π_{M_i}(O)]. In generative mode, sequences are produced at random by each individual HMM, and M_i is selected with probability λ_i. Such a system can be viewed as a larger single HMM, with a starting state connected to each one of the HMMs M_i, with transition probability λ_i (Fig. 6). This type of model is used in Krogh et al. (1994a) for unsupervised classification of globin protein sequences. Notice that the parameters of each submodel can be computed by an NN to create an HMM/NN hybrid architecture. Since the HMM experts form a larger single HMM, the corresponding hybrid architecture is also identical to what we have seen in the section on single HMMs. The only peculiarity is that states have been replicated, or grouped, to form different submodels. One further step is to have variable mixture coefficients that depend on the input sequence, or some other relevant information. These mixture coefficients can be computed as softmax outputs of an NN, as in the mixture of experts architecture of Jacobs et al. (1991).
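A schematic sketch of such a mixture of HMM experts, with input-dependent gating, is given below. It is an illustration of ours: the per-expert likelihoods are produced by a stand-in function (in a real system they would come from each expert's forward algorithm), and the gating features and weights are invented.

```python
import numpy as np

# Sketch of a mixture of HMM experts (eq 5.1): the sequence likelihood is a
# weighted sum of the likelihoods under each component HMM, with the mixture
# coefficients lambda_i(I) computed by a softmax gating network.

rng = np.random.default_rng(5)
n_experts = 3

def expert_likelihoods(sequence):
    """Stand-in for P_{M_i}(O); a real version would run each expert's
    forward algorithm on the sequence."""
    return rng.uniform(0.0, 1.0, size=n_experts)

def gate(features, W):
    """Softmax gating network producing mixture coefficients lambda_i(I)."""
    z = W @ features
    z -= z.max()
    return np.exp(z) / np.exp(z).sum()

features = rng.normal(size=4)                 # hypothetical context input I
W = rng.normal(size=(n_experts, 4))           # gating weights (illustrative)

lam = gate(features, W)                       # lambda_i, sums to 1
P_experts = expert_likelihoods("ACDE...")     # P_{M_i}(O), stand-in values
P_mixture = float(lam @ P_experts)            # eq 5.1
P_viterbi_style = float((lam * P_experts).max())
print(P_mixture, P_viterbi_style)
```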
5.2 Example 2: Mixtures of Emission Experts. A different approach is to modulate a single HMM by considering that the emission parameters e_{iX} should also be a function of the additional input I. So e_{iX} = P(i, X, I).
Figure 6: Schematic representation of the type of multiple HMM architecture used in Krogh et al. (1994a) for discovering subfamilies within a protein family. Each "box," between the start and end states, corresponds to an HMM with the architecture of Figure 1.
Without any loss of generality, we can assume that P is a mixture of n emission experts P_j:

P(i, X, I) = Σ_{j=1}^{n} λ_j(i, X, I) P_j(i, X, I)     (5.2)

In many interesting cases, λ_j is independent of X, resulting in the probability vector equation, over the alphabet:

P(i, I) = Σ_{j=1}^{n} λ_j(i, I) P_j(i, I)     (5.3)

If n = 1 and P(i, I) = P(i), we are back to a single HMM. An important special case is derived by further assuming that λ_j does not depend on i, and P_j(i, X, I) does not depend on I explicitly. Then

P(i, I) = Σ_{j=1}^{n} λ_j(I) P_j(i)     (5.4)

This provides a principled way for designing the top layers of general hybrid HMM/NN architectures, such as the one depicted in Figure 7.
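The special case 5.4 can be sketched directly. The fragment below is ours, not the paper's: fixed emission experts are mixed, state by state, with context-dependent coefficients produced by a small softmax gate; all sizes, weights, and the context vector are illustrative.

```python
import numpy as np

# Sketch of eq 5.4: the emission vector of state i is a convex combination of
# n fixed emission experts P_j(i, .), with mixture coefficients lambda_j(I)
# computed from the external input or context I by a softmax gating network.

rng = np.random.default_rng(6)
n_states, M, n_experts, ctx_dim = 10, 20, 3, 5

experts = rng.dirichlet(np.ones(M), size=(n_experts, n_states))  # P_j(i, X)
W_gate = rng.normal(size=(n_experts, ctx_dim))                   # gating weights

def modulated_emissions(context):
    z = W_gate @ context
    z -= z.max()
    lam = np.exp(z) / np.exp(z).sum()            # lambda_j(I)
    # e[i, X] = sum_j lambda_j(I) * P_j(i, X)    (eq 5.4)
    return np.tensordot(lam, experts, axes=1), lam

context = rng.normal(size=ctx_dim)               # hypothetical input I
E, lam = modulated_emissions(context)
print(E.shape, np.allclose(E.sum(axis=1), 1.0))  # (10, 20) True
```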
Figure 7: Schematic representation of a general HMM/NN architecture, where the HMM parameters are computed by an NN of arbitrary complexity that operates on state information, but also on input or context. The input or context is used to modulate the HMM parameters, for instance, by switching or mixing different parameter experts. For simplicity, only emission parameters are represented, with three emission experts, and a single hidden layer. Connections from the HMM states to the control network, and from the input to the hidden layer, are also possible.
The components P_j are computed by a NN, and the mixture coefficients by another gating NN. Naturally, many variations are possible and, in the most general case, the switching network can depend on the state i, and the distributions P_j on the input I. In the case of protein modeling, for instance, if the switching depends on position i, the emission experts could correspond to different types of regions, such as hydrophobic and hydrophilic, rather than different subclasses within a protein family.
6 Conclusion
A large class of hybrid HMM/NN architectures has been described. These architectures improve on single HMMs in two complementary directions. First, the NN reparameterization provides a flexible tool for the control of overfitting, the introduction of priors, and the construction of an input-dependent mechanism for the modulation of the final model. Second, modeling a data set with multiple HMMs allows for the coverage of a larger set of distributions, and the expression of nonstationarity and correlations inaccessible to single HMMs. We recently found out that related ideas have been proposed independently in Bengio and Frasconi (1995), but from a different viewpoint, in terms of input/output HMMs. Not surprisingly, these ideas are also related to data compression, information complexity, factorial codes, autoencoding, and generative models [for instance, Dayan et al. (1995), and references therein].

The concept of hybrid HMM/NN architecture has been demonstrated, in its simplest form, by providing a model of the immunoglobulin family. The HMM/NN approach is meant to complement rather than substitute many of the already existing techniques for incorporating prior information in sequence models. Additional work is required to develop optimal architectures and learning algorithms, and to test them on more challenging protein families and other domains.

Two important issues for the success of a hybrid HMM/NN architecture on a real problem are the design of the NN architecture, and the selection of the external input or context. These issues are problem dependent and cannot be dealt with generally. We have described some examples of architectures using mixture ideas for the design of the NN component. Different input choices are possible, such as contextual information or latent variables, sequences over a different alphabet (for instance strokes versus letters in handwriting recognition), or just real vectors, in the case of manifold parameterization (MacKay 1994).

As pointed out in the introduction, the ideas presented here are not limited to HMMs, or to protein or DNA modeling. They can be viewed in a more general framework, where a class of parameterized models is first constructed for the data, and then the parameters of the models are calculated, and possibly modulated, by one or several other NNs (or any other flexible reparameterization). In fact, several examples of simple hybrid architectures can be found scattered throughout the literature. A classical case consists of binomial (respectively multinomial) classification models, where membership probabilities are calculated by a NN with a sigmoidal (respectively normalized exponential) output (Rumelhart et al. 1995). Other examples are the master-slave approach of Lapedes and Farber (1986), and the sigmoidal belief networks in Neal (1992), where NNs are used to compute the weights of another NN, or the conditional distributions of a belief network. Although the principle of hybrid modeling is not new, by exploiting it systematically in the case of HMMs, we have generated new classes of models. There are other classes where the principle has not been applied systematically yet. As an example, it is well known that HMMs are equivalent to stochastic regular grammars. The next level in the Chomsky hierarchy is context-free grammars, and their stochastic versions (SCFGs). One can consider hybrid SCFG/NN architectures, where a NN is used to compute the parameters of a SCFG, and/or to modulate or mix different SCFGs. Such hybrid grammars might be useful, for instance, in extending the work of Sakakibara et al. (1994) on RNA modeling. Finding optimal architectures for molecular biology applications and other domains, and developing a better understanding of how probabilistic models should be modulated as a function of input or context, are some of the main challenges for hybrid approaches.
7 Appendix
7.1 Learning for Simple HMM/NN Architectures. Here we give on-line equations (batch equations can be derived similarly). For a sequence O, we need to compute the partial derivatives of ln P(O), or ln P[π(O)], with respect to the parameters α, β, and b of the network.
7.1.1 Gradient Learning on Full Likelihood. Let Q(O) = ln P(O). If m_{iX}(O) is the normalized count for the emission of X from i for O, derived using the forward-backward algorithm (Baldi and Chauvin 1994b), then

∂Q(O)/∂e_{iX} = m_{iX}(O)/e_{iX}     (A.1)

so that, combining A.1 with the normalized exponential outputs of 3.2, the error signal at output unit X, for state i, is

m_{iX}(O) - m_i(O) e_{iX}     (A.2)

The partial derivatives with respect to the network parameters α, β, and b can be obtained by the chain rule, that is, by backpropagating through the network for each i. For each O and i, the resulting on-line learning equations are

Δβ_{Xh} = η [m_{iX} - m_i e_{iX}] f_h(α_{ih} + b_h)
Δb_X = η [m_{iX} - m_i e_{iX}]
Δα_{jh} = η δ_{ji} f'_h(α_{ih} + b_h) Σ_Y β_{Yh} [m_{iY} - m_i e_{iY}]
Δb_h = η f'_h(α_{ih} + b_h) Σ_Y β_{Yh} [m_{iY} - m_i e_{iY}]     (A.3)

with δ_{ii} = 1, and δ_{ji} = 0 for j ≠ i. The full gradient results by summing over all sequences, and all main states. For instance,

∂Q/∂α_{ih} = Σ_O f'_h(α_{ih} + b_h) Σ_Y β_{Yh} [m_{iY}(O) - m_i(O) e_{iY}]     (A.4)

and similarly for β and the biases. It is worth noticing that these equations are slightly different from those obtained by gradient descent on the local cross-entropy between the emission distribution e_{iX} and the target distribution m_{iX}/m_i.
7.1.2 Viterbi Learning. Here Q(O) = ln P[π(O)]. The component of this term that depends on emission from main states, and thus on α, β, and b, along the Viterbi path π = π(O), is given by

-Σ_{(i,X)∈π} ln e_{iX} = -Σ_{(i,X)∈π} Σ_Y T_{iY} ln e_{iY} = Σ_{(i,X)∈π} Σ_Y T_{iY} ln (T_{iY}/e_{iY})     (A.5)

where T_{iX} is the target: T_{iX} = 1 if X is emitted from main state i in π(O), and 0 otherwise. Thus computing the gradient of Q(O) = -ln P[π(O)] with respect to α, β, and b is equivalent to computing the gradient of the local cross-entropy

H(T_i, e_i) = -Σ_Y T_{iY} ln e_{iY}     (A.6)

between the target output and the output of the network, over all i in π. This cross-entropy error function, combined with the softmax output unit, is the standard NN framework for multinomial classification (Rumelhart et al. 1995). In summary, the relevant derivatives can be calculated on-line both with respect to the sequences O_1, ..., O_K and, for each sequence, with respect to the Viterbi path. For each sequence O, and for each main state i on the Viterbi path π = π(O), the corresponding contribution to the derivative can be obtained by standard backpropagation on H(T_i, e_i). The Viterbi on-line learning equations, similar to A.3, are given by

Δβ_{Xh} = η (T_{iX} - e_{iX}) f_h(α_{ih} + b_h)
Δb_X = η (T_{iX} - e_{iX})
Δα_{ih} = η f'_h(α_{ih} + b_h) Σ_Y β_{Yh} (T_{iY} - e_{iY})
Δb_h = η f'_h(α_{ih} + b_h) [β_{Xh}(1 - e_{iX}) - Σ_{Y≠X} β_{Yh} e_{iY}]     (A.7)

for (i, X) ∈ π(O), with T_{iX} = 1, and T_{iY} = 0 for Y ≠ X. The full gradient is obtained again by summing over all sequences, and all main states present in the corresponding Viterbi paths. For instance,

∂Q/∂α_{ih} = Σ_{O: i∈π(O)} f'_h(α_{ih} + b_h) Σ_Y β_{Yh} [T_{iY} - e_{iY}]     (A.8)

and similarly for β and the biases.
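A single Viterbi learning step can be sketched as the usual softmax-plus-cross-entropy backpropagation at one path position. The fragment below is ours, with logistic hidden units assumed and illustrative sizes; it mirrors the structure of A.7 but is not the authors' code.

```python
import numpy as np

# Sketch of one Viterbi on-line learning step at a main state i that emits
# symbol X on the Viterbi path: backpropagation of the local cross-entropy
# between the one-hot target T_i and the softmax emission e_i (cf. A.7).

rng = np.random.default_rng(7)
N, H, M, eta = 50, 2, 20, 0.1

alpha = rng.normal(size=(N, H)); b_h = np.zeros(H)
beta = rng.normal(size=(M, H)); b_X = np.zeros(M)

def viterbi_step(i, X):
    """Update the network parameters for one (state i, emitted symbol X) pair."""
    global alpha, b_h, beta, b_X
    pre_h = alpha[i] + b_h
    f = 1.0 / (1.0 + np.exp(-pre_h))                 # hidden activities f_h
    scores = beta @ f + b_X
    e = np.exp(scores - scores.max()); e /= e.sum()  # emission e_i (softmax)
    T = np.zeros(M); T[X] = 1.0                      # one-hot target
    delta_out = T - e                                # softmax + cross-entropy
    beta += eta * np.outer(delta_out, f)
    b_X += eta * delta_out
    back = (beta.T @ delta_out) * f * (1.0 - f)      # backprop through logistic
    alpha[i] += eta * back
    b_h += eta * back

viterbi_step(i=3, X=7)                               # e.g., state m_4 emits symbol 7
```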
7.2 Learning for General HMM/NN Architectures. For a given setting of all the parameters, for a given observation sequence, and for a given input vector I, the general HMM/NN hybrid architectures reduce to a single HMM. The likelihood of a sequence, or some other measure of its fitness, with respect to such an HMM, can be computed by dynamic programming. As long as it is differentiable in the model parameters, we can then backpropagate the gradient through the NN, including through the portion of the network depending on I, such as the gating network of Figure 7. With minor modifications, this leads to learning algorithms
similar to those described above. This form of learning encourages cooperation between the emission experts of Figure 7. As in the usual mixture of experts architecture of Jacobs et al. (1991), it may be useful to introduce some degree of competition between the experts, so that each one of them specializes, for instance, on a different subclass of sequences.

When the relevant input or hidden variable I is not known, it can be learned together with the model parameters using Bayesian inversion. Indeed, consider for instance the case where there is an input I associated with each observation sequence O, and a hybrid model with parameters w, so that we can compute P(O | I, w). Let P(I) and P(w) denote our priors on I and w. Then

P(I | O, w) = P(O | I, w) P(I) / P(O | w)     (A.9)

with

P(O | w) = ∫ P(O | I, w) P(I) dI     (A.10)

The probability of the model parameters, given the data, can then be calculated, using Bayes theorem again:

P(w | D) = P(w) Π_O P(O | w) / P(D)     (A.11)

assuming the observations are independent. These parameters can be optimized by gradient descent on -log P(w | D). The main step is the evaluation of the likelihood P(O | w) (A.10), and its derivatives with respect to w, which can be done by Monte Carlo sampling. The distribution on the latent variables I is calculated by A.9. The work of MacKay (1994) is an example of such a learning approach. The density network used for protein modeling can be viewed essentially as a special case of HMM/NN hybrid architecture, where each emission vector acts as a softmax transformation on a low-dimensional real "hidden" input I, with independent gaussian priors on I and w. The input I modulates the emission vectors, and therefore the underlying HMM, as a function of sequence.
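The Monte Carlo evaluation of A.10 can be sketched in a few lines. The snippet below is an illustration of ours: the conditional likelihood is a stand-in smooth function (in the hybrid architecture it would be the dynamic-programming likelihood of O under the I-modulated HMM), and the gaussian prior on I and the sample size are assumptions.

```python
import numpy as np

# Sketch of the Monte Carlo evaluation of the likelihood in A.10:
# P(O | w) = integral P(O | I, w) P(I) dI is approximated by averaging
# P(O | I_s, w) over samples I_s drawn from the prior P(I).

rng = np.random.default_rng(8)
latent_dim, n_samples = 2, 1000

def likelihood_given_latent(O, I, w):
    """Stand-in for P(O | I, w); a real version would run the forward
    algorithm on the I-modulated HMM. Here: an arbitrary positive function."""
    return float(np.exp(-0.5 * np.sum((I - w) ** 2)))

def monte_carlo_likelihood(O, w):
    samples = rng.normal(size=(n_samples, latent_dim))   # I_s ~ P(I) = N(0, 1)
    values = [likelihood_given_latent(O, I, w) for I in samples]
    return float(np.mean(values))                        # estimate of A.10

w = np.zeros(latent_dim)
print(monte_carlo_likelihood("ACDE...", w))
```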
7.3 Priors. There are many ways to introduce priors in HMMs. Additional work is required to compare them to the present methods. For instance, it is natural to use Dirichlet priors (Krogh et al. 1994a) on multinomial distributions, such as emission vectors over discrete alphabets. It is easy to check that if a multinomial distribution is calculated by a set of normalized exponential output units, a gaussian prior on the weights of these units is in general not equivalent to a Dirichlet prior on the outputs.
Acknowledgments

The work of P.B. is supported by a grant from the ONR. The work of Y.C. is supported in part by Grant R43 LM05780 from the National Library of Medicine. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the National Library of Medicine.
References

Altschul, S. F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219, 1-11.

Baldi, P., and Chauvin, Y. 1994a. Hidden Markov models of the G-protein-coupled receptor family. J. Comp. Biol. 1(4), 311-335.

Baldi, P., and Chauvin, Y. 1994b. Smooth on-line learning algorithms for hidden Markov models. Neural Comp. 6(2), 305-316.

Baldi, P., Brunak, S., Chauvin, Y., and Engelbrecht, J. 1994a. Hidden Markov models of human genes. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6, pp. 761-768. Morgan Kaufmann, San Mateo, CA.

Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. 1994b. Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. U.S.A. 91(3), 1059-1063.

Bengio, Y., and Frasconi, P. 1995. An input-output HMM architecture. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 7. Morgan Kaufmann, San Mateo, CA.

Bengio, Y., Le Cun, Y., and Henderson, D. 1995. Globally trained handwritten word recognizer using spatial representation, convolutional neural networks and hidden Markov models. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6. Morgan Kaufmann, San Mateo, CA.

Bourlard, H., and Morgan, N. 1994. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic, Boston.

Cho, S. B., and Kim, J. H. 1995. An HMM/MLP architecture for sequence recognition. Neural Comp. 7, 358-369.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. 1995. The Helmholtz machine. Neural Comp. 7(5), 889-904.

Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B39, 1-22.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.

Krogh, A., Brown, M., Mian, I. S., Sjolander, K., and Haussler, D. 1994a. Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235, 1501-1531.

Krogh, A., Mian, I. S., and Haussler, D. 1994b. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res. 22, 4768-4778.

Lapedes, A., and Farber, R. 1986. A self-optimizing, nonsymmetrical neural net for content addressable memory and pattern recognition. Physica 22D, 247-259.

MacKay, D. J. C. 1994. Bayesian neural networks and density networks. Proceedings of the Workshop on Neutron Scattering Data Analysis and Proceedings of the 1994 MaxEnt Conference, Cambridge, UK.

Myers, E. W. 1994. An overview of sequence comparison algorithms in molecular biology. Protein Sci. 3(1), 139-146.

Neal, R. M. 1992. Connectionist learning of belief networks. Artificial Intelligence 56, 71-113.

Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257-286.

Ron, D., Singer, Y., and Tishby, N. 1994. The power of amnesia. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6. Morgan Kaufmann, San Mateo, CA.

Rumelhart, D. E., Durbin, R., Golden, R., and Chauvin, Y. 1995. Backpropagation: The basic theory. In Backpropagation: Theory, Architectures and Applications, pp. 1-34. Lawrence Erlbaum, Hillsdale, NJ.

Sakakibara, Y., Brown, M., Hughey, R., Saira Mian, I., Sjolander, K., Underwood, R. C., and Haussler, D. 1994. The application of stochastic context-free grammars to folding, aligning and modeling homologous RNA sequences. UCSC Technical Report UCSC-CRL-94-14.

Searls, D. B. 1992. The linguistics of DNA. Am. Sci. 80, 579-591.
Received February 2, 1995; accepted February 8, 1996.
Article
Full-text available
Starting with a statistical domain model in the form of a stochastic grammar, one can derive neural network architectures with some of the expressive power of a semantic network and also some of the pattern recognition and learning capabilities of more conventional neural networks. For example in this paper a version of the "Frameville" architecture, and in particular its objective function and constraints, is derived from a stochastic grammar schema. Possible optimization dynamics for this architecture, and relationships to other recent architectures such as Bayesian networks and variable-binding networks, are also discussed.
... In the literature, a number of attempts have been made to combine HMMs and NNs to form hybrid models that combine the expressive power of artificial NNs with the sequential time series aspect of HMMs. The approach in Baldi and Chauvin (1996) suggests a class of hybrid architectures wherein the HMM and NN components are trained inseparably. In these architectures, the NN component is used to reparameterize and tune the HMM component. ...
Article
Full-text available
The self-organizing hidden Markov model map SOHMMM introduces a hybrid integration of the self-organizing map SOM and the hidden Markov model HMM. Its scaled, online gradient descent unsupervised learning algorithm is an amalgam of the SOM unsupervised training and the HMM reparameterized forward-backward techniques. In essence, with each neuron of the SOHMMM lattice, an HMM is associated. The image of an input sequence on the SOHMMM mesh is defined as the location of the best matching reference HMM. Model tuning and adaptation can take place directly from raw data, within an automated context. The SOHMMM can accommodate and analyze deoxyribonucleic acid, ribonucleic acid, protein chain molecules, and generic sequences of high dimensionality and variable lengths encoded directly in nonnumerical/symbolic alphabets. Furthermore, the SOHMMM is capable of integrating and exploiting latent information hidden in the spatiotemporal dependencies/correlations of sequences’ elements.
... 6.2). DP algorithms are also essential for systems that combine concepts of NNs and graphical models, such as Hidden Markov Models (HMMs) (Stratonovich, 1960;Baum and Petrie, 1966) and Expectation Maximization (EM) (Dempster et al., 1977;Friedman et al., 2001), e.g., (Bottou, 1991;Bengio, 1991;Bourlard and Morgan, 1994;Baldi and Chauvin, 1996;Jordan and Sejnowski, 2001;Bishop, 2006;Hastie et al., 2009;Poon and Domingos, 2011;Dahl et al., 2012;Hinton et al., 2012a;Wu and Shao, 2014). ...
Article
In recent years, deep neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.
... In the literature, a number of attempts have been made for combining HMMs and NNs to form hybrid models that combine the expressive power of artificial NNs with the sequential time series aspect of HMMs. The approach in Baldi and Chauvin (1996) suggests a class of hybrid architectures where the HMM and NN components are trained inseparably. In these architectures the NN component is used to reparameterize and tune the HMM component. ...
Article
A hybrid approach combining the Self-Organizing Map (SOM) and the Hidden Markov Model (HMM) is presented. The Self-Organizing Hidden Markov Model Map (SOHMMM) establishes a cross-section between the theoretic foundations and algorithmic realizations of its constituents. The respective architectures and learning methodologies are fused in an attempt to meet the increasing requirements imposed by the properties of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and protein chain molecules. The fusion and synergy of the SOM unsupervised training and the HMM dynamic programming algorithms bring forth a novel on-line gradient descent unsupervised learning algorithm, which is fully integrated into the SOHMMM. Since the SOHMMM carries out probabilistic sequence analysis with little or no prior knowledge, it can have a variety of applications in clustering, dimensionality reduction and visualization of large-scale sequence spaces, and also, in sequence discrimination, search and classification. Two series of experiments based on artificial sequence data and splice junction gene sequences demonstrate the SOHMMM's characteristics and capabilities.
Article
Deep learning methods applied to problems in chemoinformatics often require the use of recursive neural networks to handle data with graphical structure and variable size. We present a useful classification of recursive neural networks approaches into two classes, the Inner and Outer approach. The inner approach uses recursion inside the underlying graph, to essentially 'crawl' the edges of the graph, while the outer approach uses recursion outside the underlying graph, to aggregate information over progressively longer distances in an orthogonal direction. We illustrate the inner and outer approaches on several examples. More importantly, we provide open-source implementations$^1$ for both approaches in Tensorflow which can be used in combination with training data to produce efficient models for predicting the physical, chemical, and biological properties of small molecules. 1: www.github.com/Chemoinformatics/InnerOuterRNN and cdb.ics.uci.edu
Article
Full-text available
We present a new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases. The new procedure can be viewed either as a modular version of a multilayer supervised network, or as an associative version of competitive learning. It therefore provides a new link between these two apparently different approaches. We demonstrate that the learning procedure divides up a vowel discrimination task into appropriate subtasks, each of which can be solved by a very simple expert network.
Article
Molecular biologists frequently compare biosequences to see if any similarities can be found in the hope that what is true of one sequence either physically or functionally is true of its analogue. Such comparisons are made in a variety of ways, some via rigorous algo- rithms, others by manual means, and others by a combination of these two extremes. The topic of sequence comparison now has a rich history dating back over two decades. In this survey we review the now classic and most established technique: dynamic programming. Then a number of interesting variations of this basic problem are examined that are specifically motivated by applications in molecular biology. Finally, we close with a discus- sion of some of the most recent and future trends.
Article
A natural, collective neural model for Content Addressable Memory (CAM) and pattern recognition is described. The model uses nonsymmetrical, bounded synaptic connection matrices and continuous valued neurons. The problem of specifying a synaptic connection matrix suitable for CAM is formulated as an optimization problem, and recent techniques of Hopfield are used to perform the optimization. This treatment naturally leads to two interacting neural nets. The first net is a symmetrically connected net (master net) containing information about the desired fixed points or memory vectors. The second net is, in general, a nonsymmetric net (slave net), whose synapse values are determined by the master net, and is the net that actually performs the CAM task. The two nets acting together are an example of neural self-organization. Many advantages of this master/slave approach are described, one of which is that nonsymmetric synaptic matrices offer a greater potential for relating formal neural modeling to neurophysiology. In addition, it seems that this approach offers advantages in application to pattern recognition problems due to the new ability to sculpt basins of attraction. The simple structure of the master net connections indicates that this approach presents no additional problems in reduction to hardware when compared to single net implementations.
Article
This paper presents a hybrid architecture of hidden Markov models (HMMs) and a multilayer perceptron (MLP). This exploits the discriminative capability of a neural network classifier while using HMM formalism to capture the dynamics of input patterns. The main purpose is to improve the discriminative power of the HMM-based recognizer by additionally classifying the likelihood values inside them with an MLP classifier. To appreciate the performance of the presented method, we apply it to the recognition problem of on-line handwritten characters. Simulations show that the proposed architecture leads to a significant improvement in generalization performance over conventional approaches to sequential pattern recognition.
Article
Human factor D, an essential enzyme of the alternative pathway of complement activation, has been crystallized. Crystals were grown by vapor diffusion using polyethylene glycol 6000 and NaCl as precipitants. The factor D crystals are triclinic and the space group is P1 with unit cell dimensions , α = 101·0°, β = 109·7°, γ = 74·3°. The unit cell contains two molecules of factor D related by a non-crystallographic 2-fold axis. The crystals grow to dimensions of 0·8 mm × 0·5 mm × 0·2 mm within five days, are stable in the X-ray beam and diffract beyond 2·5 Å.
Article
Connectionist learning procedures are presented for “sigmoid” and “noisy-OR” varieties of probabilistic belief networks. These networks have previously been seen primarily as a means of representing knowledge derived from experts. Here it is shown that the “Gibbs sampling” simulation procedure for such networks can support maximum-likelihood learning from empirical data through local gradient ascent. This learning procedure resembles that used for “Boltzmann machines”, and like it, allows the use of “hidden” variables to model correlations between visible variables. Due to the directed nature of the connections in a belief network, however, the “negative phase” of Boltzmann machine learning is unnecessary. Experimental results show that, as a result, learning in a sigmoid belief network can be faster than in a Boltzmann machine. These networks have other advantages over Boltzmann machines in pattern classification and decision making applications, are naturally applicable to unsupervised learning problems, and provide a link between work on connectionist learning and work on the representation of expert knowledge.