Communicated by Terrence Sejnowski

Hybrid Modeling, HMM/NN Architectures, and Protein Applications

Pierre Baldi
Division of Biology, California Institute of Technology, Pasadena, CA 91125 USA

Yves Chauvin
Net-ID, Inc., San Francisco, CA 94107 USA

Neural Computation 8, 1541-1565 (1996) © 1996 Massachusetts Institute of Technology
We describe a hybrid modeling approach where the parameters of a model are calculated and modulated by another model, typically a neural network (NN), to avoid both overfitting and underfitting. We develop the approach for the case of Hidden Markov Models (HMMs), by deriving a class of hybrid HMM/NN architectures. These architectures can be trained with unified algorithms that blend HMM dynamic programming with NN backpropagation. In the case of complex data, mixtures of HMMs or modulated HMMs must be used. NNs can then be applied both to the parameters of each single HMM, and to the switching or modulation of the models, as a function of input or context. Hybrid HMM/NN architectures provide a flexible NN parameterization for the control of model structure and complexity. At the same time, they can capture distributions that, in practice, are inaccessible to single HMMs. The HMM/NN hybrid approach is tested, in its simplest form, by constructing a model of the immunoglobulin protein family. A hybrid model is trained, and a multiple alignment derived, with less than a fourth of the number of parameters used with previous single HMMs.
1 Introduction: Hybrid Modeling
One fundamental step in scientific reasoning is the inference of parameterized probabilistic models to account for a given data set D. If we identify a model M(θ), in a given class, with its parameter vector θ, then the goal is to approximate the distribution P(θ|D), and often to find its mode max_θ P(θ|D). Problems, however, arise whenever there is a mismatch between the complexity of the model and the data. Too complex models result in overfitting; too simple models result in underfitting.

The hybrid modeling approach attempts to finesse both problems. When the model is too complex, it is reparameterized as a function of
a simpler parameter vector w, so that θ = f(w).¹ When the data are too complex, short of resorting to a different model class, the only solution is to model the data with several M(θ)s, with θ varying discretely or continuously across different regions of data space. Thus the parameters must be modulated as a function of input, or context, in the form θ = f(I). In the general case, both may be desirable, so that θ = f(w, I). This approach is hybrid, in the sense that the function f can belong to a different model class. Since neural networks (NNs) have well-known universal approximation properties, a natural approach is to compute f with an NN, but other representations are possible. This approach is also hierarchical, because model reparameterizations can easily be nested at several levels. Here, for simplicity, we confine ourselves to a single level of reparameterization.
For concreteness, we focus on a particular class of probabilistic models, namely Hidden Markov Models (HMMs), and their application in molecular biology. To overcome the limitations of simple HMMs, we propose to use hybrid HMM/NN architectures² that combine the expressive power of artificial NNs with the sequential time series aspect of HMMs.

It is, of course, not the first time HMMs and NNs are combined. Hybrid architectures have been used both in speech and cursive handwriting recognition (Bourlard and Morgan 1994; Cho and Kim 1995). In many of these applications, however, NNs are used as front end processors to extract features, such as strokes, characters, and phonemes. HMMs are then used in higher processing stages for word and language modeling.³ The HMM and NN components are often trained separately, although there are some exceptions (Bengio et al. 1995). A different type of hybrid architecture is also described in Cho and Kim (1995), where the NN component is used to classify the pattern of likelihoods produced by several HMMs. Here, in contrast, the HMM and NN components are inseparable. This yields, among other things, unified training algorithms where the HMM dynamic programming and the NN backpropagation blend together.

In what follows, we first briefly review HMMs, how they can be used to model protein families, and their limitations. In Section 3, we develop HMM/NN hybrid architectures for single models, to address the problem of parameter complexity and control of overfitting. Simulation results are presented in Section 4 for a simple HMM/NN hybrid architecture used
¹Classical Bayesian hierarchical modeling relies on the description of a parameterized prior P_α(w), where α are the hyperparameters. This is related to the present situation θ = f(w), provided a prior P(w) is defined on the new parameters.

²HMM/NN architectures were first described at a NIPS94 workshop (Vail, CO) and at the International Symposium on Fifth Generation Computer Systems (Tokyo, Japan), in December 1994. Preliminary versions were published in the Proceedings of the Symposium, and in the Proceedings of the ISMB95 Conference.

³In the molecular biology applications to be considered, NNs could conceivably be used to interpret the analog output of various sequencing machines, but this is definitely not the focus here.
to model a particular protein family (immunoglobulins). In Section 5, we discuss HMM/NN hybrid architectures for multiple models, to address the problem of long-range dependencies or underfitting.
2 HMMs of Protein Families
Many problems in computational molecular biology can be cast in terms of statistical pattern recognition and formal languages (Searls 1992). The increasing abundance of sequence data creates a favorable situation for machine learning approaches, where grammars are learned from the data. In particular, HMMs are equivalent to stochastic regular grammars and have been extensively used to model protein families and DNA coding regions (Baldi et al. 1994a,b; Krogh et al. 1994a; Baldi and Chauvin 1994a; Krogh et al. 1994b).

Proteins consist of polymer chains of amino acids. There are 20 important amino acids, so that proteins can be viewed as strings of letters over a 20-letter alphabet. Protein sequences with a common ancestor share functional and structural properties, and can be grouped into families. Aligning sequences in a family is important, for instance to detect highly conserved regions, or motifs, with particular significance. Multiple alignment of highly divergent families where, as a result of evolutionary insertions and deletions, pairs of sequences often share less than 20% amino acids, is a highly nontrivial task (Myers 1994).
A first-order discrete HMM can be viewed as a stochastic generative model defined by a set of states S, an alphabet A of M symbols, a probability transition matrix T = (t_{ij}), and a probability emission matrix E = (e_{iX}). The system randomly evolves from state to state, while emitting symbols from the alphabet. When the system is in a given state i, it has a probability t_{ij} of moving to state j, and a probability e_{iX} of emitting symbol X. As in the application of HMMs to speech recognition, a family of proteins can be seen as a set of different utterances of the same word, generated by a common underlying HMM. One of the standard HMM architectures for protein applications (Krogh et al. 1994a) is the left-right architecture depicted in Figure 1. The alphabet has M = 20 symbols, one for each amino acid (M = 4 for DNA or RNA models, one symbol per nucleotide). In addition to the start and end state, there are three classes of states: the main states, the delete states, and the insert states, with S = {start, m_1, ..., m_N, i_1, ..., i_{N+1}, d_1, ..., d_N, end}. N is the length of the model, typically equal to the average length of the sequences in the family. The main and insert states always emit an amino-acid symbol, whereas the delete states are mute. The linear sequence of state transitions start → m_1 → m_2 → ... → m_N → end is the backbone of the model. For each main state, corresponding insert and delete states are needed to model insertions and deletions. The self-loop on the insert states allows for multiple insertions at a given site.
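The generative view described above can be made concrete with a short sketch. The following Python fragment is an illustration of ours, not code from the paper: it samples sequences from a tiny toy HMM whose states, alphabet, and probabilities are invented, and it collapses the left-right backbone into ordinary emitting states, omitting the mute delete states and insert self-loops of Figure 1.

```python
import numpy as np

# Minimal sketch (not from the paper) of the generative view of a first-order
# discrete HMM: the system moves from state to state according to T and emits
# symbols according to E. Toy model: a 3-main-state left-right backbone.

rng = np.random.default_rng(0)

states = ["start", "m1", "m2", "m3", "end"]
alphabet = list("ACDE")                      # toy 4-letter alphabet (M = 4)

# T[i, j] = probability of moving from state i to state j (rows sum to 1).
T = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
])

# E[i, X] = probability that state i emits symbol X; start/end are mute here.
E = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])

def sample_sequence(T, E, emitting=(1, 2, 3)):
    """Random walk from start (state 0) to end (last state), emitting symbols."""
    i, symbols = 0, []
    while i != len(states) - 1:
        i = rng.choice(len(states), p=T[i])
        if i in emitting:
            symbols.append(rng.choice(alphabet, p=E[i]))
    return "".join(symbols)

print([sample_sequence(T, E) for _ in range(3)])
```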
Figure 1: Example of HMM architecture used in protein modeling. S is the start state, E the end state. d_i, m_i, and i_i denote delete, main, and insert states, respectively.
2.1 Learning Algorithms. Given a sample of K training sequences O_1, ..., O_K, the parameters of an HMM can be iteratively modified, in an unsupervised way, to optimize the data fit according to some measure, usually based on the likelihood of the data. Since the sequences can be considered as independent, the overall likelihood is equal to the product of the individual likelihoods. Two target functions, commonly used for training, are the negative log-likelihood:

Q = -Σ_{k=1}^{K} Q_k = -Σ_{k=1}^{K} ln P(O_k)     (2.1)

and the negative log-likelihood based on the optimal paths:

Q = -Σ_{k=1}^{K} Q_k = -Σ_{k=1}^{K} ln P[π(O_k)]     (2.2)

where π(O) is the most likely HMM production path for sequence O. π(O) can be computed efficiently by dynamic programming (Viterbi algorithm). Depending on the situation, the Viterbi path approach can be considered as a fast approximation to the full maximum likelihood, or as an algorithm in its own right. This can be the case in protein modeling where, as described below, the optimal paths play an important role.
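For reference, the Viterbi recursion mentioned above can be sketched as follows. This is the textbook dynamic programming algorithm for an HMM in which every state emits a symbol; handling the mute delete states of Figure 1 requires extra bookkeeping that is omitted here, and all names and the tiny demo model are illustrative, not the authors' implementation.

```python
import numpy as np

# Sketch of the Viterbi recursion used to compute the most likely production
# path pi(O) for a fully emitting HMM (no mute states).

def viterbi(obs, start_p, T, E):
    """obs: list of symbol indices; start_p: initial state distribution;
    T: transition matrix; E: emission matrix. Returns (best log-prob, path)."""
    n_states = T.shape[0]
    logT, logE = np.log(T + 1e-300), np.log(E + 1e-300)   # avoid log(0)
    delta = np.log(start_p + 1e-300) + logE[:, obs[0]]    # best log-prob so far
    psi = np.zeros((len(obs), n_states), dtype=int)       # backpointers
    for t in range(1, len(obs)):
        scores = delta[:, None] + logT                    # scores[i, j]
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logE[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return float(delta.max()), path[::-1]

# Tiny two-state demo with made-up probabilities.
T = np.array([[0.9, 0.1], [0.2, 0.8]])
E = np.array([[0.8, 0.2], [0.3, 0.7]])
print(viterbi([0, 0, 1, 1], np.array([0.5, 0.5]), T, E))
```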
When priors on the parameters are included, one can also add regularizer terms to the objective functions for maximum a posteriori (MAP) estimation.

Different algorithms are available for HMM training, including the Baum-Welch or expectation-maximization (EM) algorithm, and different forms of gradient descent and other generalized EM (GEM) algorithms (Dempster et al. 1977; Rabiner 1989; Baldi and Chauvin 1994a). In the Baum-Welch algorithm, the parameters are updated according to

e_{iX} = m_{iX} / m_i,    t_{ij} = n_{ij} / n_i     (2.3)

where m_i = Σ_X m_{iX} (respectively n_i = Σ_j n_{ij}), and m_{iX} (respectively n_{ij}) are the normalized⁴ expected emission (respectively transition) counts, induced by the data, that can be calculated using the forward-backward dynamic programming procedure (Rabiner 1989), or the Viterbi paths in Viterbi learning.
As for gradient descent, and other GEM algorithms, a useful reparameterization (Baldi and Chauvin 1994b), in terms of normalized exponentials, consists of

t_{ij} = exp(w_{ij}) / Σ_k exp(w_{ik}),    e_{iX} = exp(v_{iX}) / Σ_Y exp(v_{iY})     (2.4)

with w_{ij} and v_{iX} as the new variables. This reparameterization has two advantages: (1) modification of the w's and v's automatically preserves normalization constraints on emission and transition distributions; and (2) transition and emission probabilities can never reach the absorbing value 0. The on-line gradient descent equations on the negative log-likelihood are then

Δw_{ij} = η (n_{ij} - n_i t_{ij}),    Δv_{iX} = η (m_{iX} - m_i e_{iX})

where η is the learning rate. The variables n_{ij}, n_i, m_{iX}, m_i are again the expected counts derived by the forward-backward procedure, for each single sequence if the algorithm is to be used on-line. Similarly, in Viterbi learning, at each step along a Viterbi path, and for any state i on the path, the parameters of the model are updated according to

Δw_{ij} = η (T_{ij} - t_{ij}),    Δv_{iX} = η (E_{iX} - e_{iX})

where T_{ij} = 1 (respectively E_{iX} = 1) if the i → j transition (respectively the emission of X from i) is used, and 0 otherwise. The new parameters are therefore updated incrementally, using the discrepancy between the frequencies induced by the training data and the probability parameters of the model.
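The reparameterization of 2.4 and the discrepancy-style on-line update can be illustrated with a short sketch. The fragment below is ours, not the paper's code: the expected counts are taken as given (in a real run they would come from the forward-backward procedure or a Viterbi path), and the sizes and random numbers are placeholders.

```python
import numpy as np

# Sketch of the normalized-exponential reparameterization (eq 2.4) and a
# discrepancy-style on-line emission update, assuming the expected counts
# m_iX for one sequence are already available (forward-backward or Viterbi).

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n_states, M = 5, 20                     # emitting states, alphabet size
v = np.zeros((n_states, M))             # new variables; e = softmax(v) rowwise
eta = 0.1                               # learning rate

def online_emission_update(v, m_counts):
    """One on-line step: m_counts[i, X] are expected emission counts for the
    current sequence; m_i is their row sum."""
    e = softmax(v)                      # current emission probabilities
    m_i = m_counts.sum(axis=1, keepdims=True)
    return v + eta * (m_counts - m_i * e)

fake_counts = rng.random((n_states, M)) # stand-in for forward-backward output
v = online_emission_update(v, fake_counts)
print(softmax(v)[0].sum())              # rows remain valid distributions: 1.0
```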
⁴Unlike in Baldi and Chauvin (1994b), throughout this paper we use the more classical notation of Rabiner (1989), where the counts, for a given sequence, automatically incorporate a normalization by the probability P(O) of the sequence itself.
Regardless of the training method, once an HMM has been successfully trained on a family of sequences, it can be used in a number of tasks. For instance, for any given sequence, one can compute its most likely path, as well as its likelihood. A multiple alignment results immediately from aligning all the optimal paths. The likelihoods can be used for discrimination tests and data base searches (Krogh et al. 1994a; Baldi and Chauvin 1994b). In the case of proteins, HMMs have been successfully applied to several families such as globins, immunoglobulins, kinases, and G-protein-coupled receptors. In most cases, HMMs have performed well on all tasks yielding, for instance, multiple alignments that are comparable to those derived by human experts.
2.2 Limitations of HMMs. In spite of their success in various applications, HMMs can suffer from two weaknesses. First, they often have a large number of unstructured parameters. In the case of protein models, the architecture of Figure 1 has a total of approximately 49N parameters (40N emission parameters and 9N transition parameters). For a typical protein family, N is of the order of a few hundred, resulting immediately in models with over 10,000 free parameters. This can lead to overfitting when only a few sequences are available,⁵ not an uncommon situation in early stages of genome projects. Second, first-order HMMs are limited with respect to dependencies between hidden states, found in most interesting problems. Proteins, for instance, fold into complex 3D shapes, essential to their function. Subtle long-range correlations in their polypeptide chains may exist that are not accessible to a single HMM. For instance, assume that whenever X is found at position i, it is generally followed by Y at position j; and whenever X' is found at position i, it tends to be followed by Y' at j. A single HMM has typically two fixed emission vectors associated with the i and j positions. Therefore it cannot capture such correlations. Related problems are also the nonstationarity of complex time series, as well as the variability often encountered in "speaker-independent" recognition problems. Only a small fraction of distributions over the space of possible sequences, essentially the factorial distributions, can be represented by a reasonably constrained HMM.⁶
3 HMM/NN Hybrid Architectures: Single Model Case
3.1 Basic Idea. In a general HMM, an emission or transition vector θ is a function of the state i only: θ = f(i). The first basic idea is to have
⁵It should be noted, however, that a typical sequence provides on the order of 2N constraints, and 25 sequences or so provide a number of examples in the same range as the number of HMM parameters.
⁶Any distribution can be represented by a single exponential-size HMM, with a start state connected to different sequences of deterministic states, one for each possible alphabet sequence, with a transition probability equal to the probability of the sequence itself.
a NN on top of the HMM, for the computation of the HMM parameters, that is, for the computation of the function f. NNs are universal approximators, and, therefore, can represent any f. More importantly perhaps, NN representations enable the flexible introduction of many possible constraints. For simplicity, we discuss emission parameters only, but the approach extends immediately to transition parameters as well. In the reparameterization of 2.4, we can consider that each one of the HMM emission vectors is calculated by a small NN, with one input set to one (bias), no hidden layers, and 20 softmax output units (Fig. 2a). The connections between the input and the outputs are the v_{iX}. This can be generalized immediately by having arbitrarily complex NNs for the computation of the HMM parameters. The NNs associated with different states can also be linked with one or several common hidden layers, the overall architecture being dictated by the problem at hand. In the case of a discrete alphabet, however, such as for proteins, the emission of each state is a multinomial distribution, and, therefore, the output of the corresponding network should consist of M softmax units.

As a simple example, consider the hybrid HMM/NN architecture of Figure 2b consisting of the following:

1. Input layer: one unit for each state i. At each time, all units are set to 0, except one which is set to 1. If unit i is set to 1, the network computes e_{iX}, the emission distribution of state i.

2. Hidden layer: H hidden units indexed by h, each with transfer function f_h (logistic by default) and bias b_h (H < M).

3. Output layer: M softmax units or weighted exponentials, indexed by X, with bias b_X.

4. Connections: α = (α_{ih}) connects input position i to hidden unit h; β = (β_{Xh}) connects hidden unit h to output unit X.

For input i, the activity of the hth unit in the hidden layer is given by

f_h(α_{ih} + b_h)     (3.1)

The corresponding activity in the output layer is

e_{iX} = exp(Σ_h β_{Xh} f_h(α_{ih} + b_h) + b_X) / Σ_Y exp(Σ_h β_{Yh} f_h(α_{ih} + b_h) + b_Y)     (3.2)
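A compact sketch of this emission network may help fix ideas. The fragment below is ours, not the authors' implementation: it feeds a one-hot state index through H logistic hidden units and M softmax outputs exactly as in 3.1-3.2, using illustrative sizes and random weights, and it prints the rough parameter-count comparison discussed in the list that follows.

```python
import numpy as np

# Sketch of the single-hidden-layer emission network of equations 3.1-3.2:
# a one-hot HMM state index goes through H logistic hidden units and M
# softmax outputs to produce the emission vector e_i for every state i.

N, H, M = 117, 2, 20                      # states, hidden units, alphabet size
rng = np.random.default_rng(2)

alpha = rng.normal(0.0, 1.0, size=(N, H)) # input-to-hidden weights alpha_ih
b_h = np.zeros(H)                         # hidden biases
beta = np.ones((M, H))                    # hidden-to-output weights beta_Xh
b_X = np.zeros(M)                         # output biases

def emissions(alpha, b_h, beta, b_X):
    hidden = 1.0 / (1.0 + np.exp(-(alpha + b_h)))       # (N, H), eq 3.1
    scores = hidden @ beta.T + b_X                       # (N, M)
    scores -= scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)              # softmax rows, eq 3.2

E = emissions(alpha, b_h, beta, b_X)
print(E.shape, E[0].sum())                # (117, 20), each row sums to 1

# Rough parameter count, ignoring biases: H * (N + M) for the hybrid network
# versus N * M for a plain HMM emission table (cf. the first points below).
print(H * (N + M), N * M)                 # 274 vs 2340
```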
For hybrid HMM/NN architectures, a number of points are worth noticing:

• The HMM states can be partitioned into different groups, with different networks for different groups. In protein applications, for instance, one can use different NNs for insert states and for main states, or for different groups of states along the protein sequence corresponding, for instance, to different regions (hydrophobic, hydrophilic, alpha-helical, etc.).
Figure 2: (a) Schematic representation of the simple HMM/NN hybrid architecture used in Baldi et al. (1994b). Each HMM state has its own NN. Here, the NNs are extremely simple, with no hidden layer, and an output layer of softmax units computing the state emission, or transition, parameters. Only output emissions are represented for simplicity. (b) Schematic representation of an HMM/NN architecture where the NNs associated with different states (or different groups of states) are connected via one or several hidden layers.
• HMM parameter reduction can easily be achieved using small hidden layers with H hidden units, and H small compared to N or M. In the example of Figure 2b, with H hidden units and considering only main states, the number of parameters is H(N + M) in the HMM/NN architecture, versus NM in the corresponding simple HMM. For protein models, this yields roughly HN parameters for the HMM/NN architecture, versus 20N for the simple HMM. H = M is equivalent to 2.4.

• The number of parameters can be adaptively adjusted to variable training set sizes, merely by changing the number of hidden units. This is useful in environments with large variations in data base sizes, as in current molecular biology applications.

• The entire bag of connectionist tricks can be brought to bear on these architectures, such as radial basis functions, multiple hidden layers, sparse connectivity, weight sharing, gaussian priors, and hyperparameters. Several initializations and structures can be implemented in a flexible way. For instance, by allocating different numbers of hidden units to different subsets of emissions or transitions, it is easy to favor certain classes of paths in the models, when needed. In the HMM of Figure 1, for instance, one must introduce a bias favoring main states over insert states, prior to any learning. It is easy also to tie different regions of a protein that may have similar properties by weight sharing, and other types of long-range correlations. By setting the output bias to the proper values, the model can be initialized to the average composition of the training sequences, or any other useful distribution.

• Classical prior information in the form of substitution matrices, for instance, is easily incorporated. Substitution matrices (Altschul 1991) can be computed from data bases, and essentially produce a background probability matrix P = (p_{XY}), where p_{XY} is the probability that X be changed into Y over a certain evolutionary time. P can be implemented as a linear transformation in the emission NN.

• HMMs with continuous emission distributions are also easy to incorporate in the HMM/NN framework. The output emission distributions can be represented, for instance, in the form of samples, moments, and/or mixture coefficients. In the classical mixture of gaussians case, means, covariances, and mixture coefficients can be computed by the NN. Likewise, additional HMM parameters, such as exponential parameters to model the duration of stay in any given state, can be calculated by a NN.
With hybrid HMM/NN architectures, in general, the M step of the EM algorithm cannot be carried out analytically. One can still use, however, some form of gradient descent using the chain rule, by computing the derivatives of the target likelihood functions 2.1 or 2.2 with respect to the HMM parameters, and then the derivatives of the HMM parameters with respect to the NN parameters. For completeness, a derivation of the learning equations for the HMM/NN architecture described above is given in the Appendix. In the resulting learning equations (A.3 and A.7), the HMM dynamic programming and the NN backpropagation components are intimately fused. These algorithms can also be seen as GEM (generalized EM) algorithms (Dempster et al. 1977). They can easily be modified to MAP optimization with inclusion of priors.
3.2 Representation in Simple HMM/NN Architectures. Consider the particular HMM/NN described above, where a subset of the HMM states are fully connected to H hidden units, and the hidden units are fully connected to M softmax output units. The hidden unit bias is not really necessary, in the sense that for any HMM state i, any vector of biases b_h, and any vector of connections α_{ih}, there exists a new vector of connections α'_{ih} that produces the same vector of hidden unit activations with 0 bias. This is not true in the general case, for instance, as soon as there are multiple hidden layers, or if the input units are not fully interconnected to the hidden layer. We have left the biases for the sake of generality, and also because even if the biases do not enlarge the space of possible representations, they may still facilitate the learning procedure. Similar remarks hold more generally for the transfer functions. With an input layer fully connected to a single hidden layer, the same hidden layer activation can be achieved with different activation functions, by modifying the weights.

A natural question to ask is what is the representation used in the hidden layer, and what is the space of emission distributions achievable in this fashion? Each HMM state in the network can be represented by a point in the [-1, 1]^H hypercube. The coordinates of a point are the activities of the H hidden units. By changing its connections to the H hidden units, an HMM state can occupy any position in the hypercube. So, the space of achievable emission distributions is entirely determined by the connections from the hidden to the output layer. If these connections are held fixed, then each HMM state can select a corresponding optimal position in the hypercube, where its emission distribution, generated by the NN weights, is as close as possible to the truly optimal distribution, for instance in cross-entropy distance. During on-line learning, all parameters are learned at the same time, so this may introduce additional effects.
To further understand the space of achievable distributions, consider the transformation from hidden to output units. For notational convenience, we introduce one additional hidden unit, numbered 0, always set to 1, to express the output biases in the form b_X = β_{X0}. If, in this extended hidden layer, we turn a single hidden unit to 1, one at a time, we obtain H + 1 different emission distributions in the output layer, P^h = (p^h_X) (0 ≤ h ≤ H), with

p^h_X = exp(β_{Xh}) / Σ_Y exp(β_{Yh})     (3.3)

Consider now a general pattern of activity in the hidden layer of the form (1, h_1, ..., h_H). Using 3.2 and 3.3, the emission distribution in the output layer is then

p_X = exp(Σ_{h=0}^{H} β_{Xh} h_h) / Σ_Y exp(Σ_{h=0}^{H} β_{Yh} h_h),  with h_0 = 1     (3.4)

After simplifications, this yields

p_X = Π_{h=0}^{H} (p^h_X)^{h_h} / Σ_Y Π_{h=0}^{H} (p^h_Y)^{h_h}     (3.5)

Therefore, all the emission distributions achievable by the NN have the form of 3.5, and can be viewed as "combinations" of H + 1 fundamental distributions P^h associated with each single hidden unit. In general, this combination is different from a convex linear combination of the P^h's. It consists of three operations: (1) raising each component of P^h to the power h_h, the activity of the hth hidden unit, (2) multiplying all the corresponding vectors componentwise, and (3) normalizing. In this form, the hybrid HMM/NN approach is different from a mixture of Dirichlet distributions approach.
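The equivalence between the softmax form 3.4 and the product form 3.5 can be checked numerically. The snippet below is a small verification of ours, using arbitrary toy weights and hidden activities; it is not part of the original simulations.

```python
import numpy as np

# Quick numerical check (toy numbers) that the softmax of eq 3.4 equals the
# normalized componentwise product of eq 3.5: achievable emissions are
# "combinations" of the P^h raised to the hidden activities, renormalized.

rng = np.random.default_rng(3)
H, M = 3, 20
beta = rng.normal(size=(M, H + 1))                    # column 0 plays b_X
h = np.concatenate(([1.0], rng.uniform(-1, 1, H)))    # pattern (1, h_1..h_H)

# Fundamental distributions P^h of eq 3.3, one per (extended) hidden unit.
P = np.exp(beta) / np.exp(beta).sum(axis=0, keepdims=True)   # (M, H+1)

# Eq 3.4: softmax of the summed weighted inputs.
scores = beta @ h
p_softmax = np.exp(scores) / np.exp(scores).sum()

# Eq 3.5: componentwise product of P^h raised to the powers h_h, normalized.
prod = np.prod(P ** h, axis=1)
p_product = prod / prod.sum()

print(np.allclose(p_softmax, p_product))              # True
```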
4 Simulation Results
Here we demonstrate a simple application of the principles behind HMM/NN hybrid architectures on the immunoglobulin protein family. Immunoglobulins, or antibodies, are proteins produced by B cells that bind with specificity to foreign antigens in order to neutralize them, or target their destruction by other effector cells. The various classes of immunoglobulins are defined by pairs of light and heavy chains that are held together principally by disulfide bonds⁷ (Fig. 3). Each light and heavy chain molecule contains one variable (V) region, and one (light) or several (heavy) constant (C) regions. The V regions differ among immunoglobulins, and provide the specificity of the antigen recognition. About one-third of the amino acids of the V regions form the hypervariable sites, responsible for the diversity of the vertebrate immune response. Our data base is the same as the one used in Baldi et al. (1994b), and consists of human and mouse heavy chain immunoglobulin V region sequences, from the Protein Identification Resources (PIR) data base. It contains 224 sequences, with minimum length 90, average length N = 117, and maximum length 254.
⁷Disulfide bonds are covalent bonds between two sulfur atoms in different amino acids (typically cysteines) of a protein that are important in determining secondary and tertiary structure.
Figure 3: A model of the structure of a typical human antibody molecule, composed of two light and two heavy polypeptide chains. Interchain and intrachain disulfide bonds are indicated. Cysteine (C) residues are associated with the bonds. Two identical active sites for antigen binding, corresponding to the variable regions, are located in the arms of the molecule. (From Molecular Biology of the Gene, Vol. II, Fourth Edition, by Watson et al. Copyright © 1987 by James D. Watson. Published by The Benjamin/Cummings Publishing Company.)
For the immunoglobulin V regions, our results (Baldi et al. 1994b) were obtained by training a simple HMM, similar to the one in Figure 1, containing a total of 52N + 23 = 6107 adjustable parameters. Here we train a hybrid HMM/NN architecture with the following characteristics. The basic model is an HMM with the architecture of Figure 1. All the main state emissions are calculated by a common NN, with 2 hidden units. Likewise, all the insert state emissions are calculated by a common NN, with one hidden unit only. Each state transition distribution is calculated by a different softmax network, as in our previous work. With edge effects neglected, the total number of parameters of this HMM/NN architecture is 1507 (117 × 3 × 3 = 1053 for the transitions, 117 × 3 + 3 + 3 × 20 + 40 = 454 for the emissions, including biases). This architecture is not at all optimized: for instance, we suspect we could have significantly reduced the number of transition parameters. Our goal at this time is only to demonstrate the general HMM/NN principles, and test the learning algorithm.
The hybrid architecture is then trained on-line, using both gradient descent (A.3) and the Viterbi version (A.7). The training set consists of a random subset of 150 sequences, identical to the training set used previously. There, emission and transition parameters were initialized uniformly. Here, the input-to-hidden weights are initialized with independent gaussians, with mean 0 and standard deviation 1. The hidden-to-output weights are initialized to 1. This yields a uniform emission probability distribution on all the emitting states.⁸ Notice also that if all the weights are initialized to 1, including those from input to hidden layer, then the hidden units cannot differentiate from each other. The transition probabilities out of insert or delete states are initialized uniformly to 1/3. We introduce, however, a small bias along the backbone that favors main to main transitions, in the form of a Dirichlet prior. This prior is equivalent to introducing a regularization term in the objective function, equal to the logarithm of the backbone transition path. The regularization constant is set to 0.01, and the learning rate to 0.1. Typically, 10 training cycles are more than sufficient to reach equilibrium.
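The claim that equal hidden-to-output weights yield uniform emissions can be verified with a few lines. The check below is ours, with toy sizes and random inputs; it simply confirms that when every output unit receives the same total input, the softmax is uniform regardless of the hidden activities.

```python
import numpy as np

# Check of the initialization argument above: if all hidden-to-output weights
# (and output biases) are equal, the softmax output is uniform over the 20
# symbols, because every output unit receives the same total input.

rng = np.random.default_rng(4)
H, M = 2, 20
alpha_row = rng.normal(0.0, 1.0, size=H)        # input-to-hidden ~ N(0, 1)
beta = np.ones((M, H))                          # hidden-to-output set to 1
b_X = np.zeros(M)

hidden = np.tanh(alpha_row)                     # any squashing function works
scores = beta @ hidden + b_X                    # identical for every symbol X
e = np.exp(scores) / np.exp(scores).sum()
print(np.allclose(e, 1.0 / M))                  # True: uniform emissions
```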
In Figure 4, we display the multiple alignment of 20 immunoglobulin sequences, selected randomly from both the training and validation sets. The validation set consists of the remaining 74 sequences. This alignment is very stable between 5 and 10 epochs.⁹ It corresponds to a model trained by A.7 for 10 epochs. While there is currently no universally accepted measure of the quality of an alignment, the present alignment is similar to the previous one, derived with a simple HMM with more than four times as many parameters. The algorithm has been able to detect most of the salient features of the family. Most importantly, the cysteine residues (C) toward the beginning and the end of the region (positions 24 and 100 in this alignment), which are responsible for the disulfide bonds that hold the chains, are perfectly aligned. The only exception is the last sequence (PH0097), which has a serine (S) residue in its terminal portion. This is a rare but recognized exception to the conservation of this position. Some of the sequences in the family came with a "header" (transport signal peptide). We did not remove the headers
⁸With Viterbi learning, this is probably better than a nonuniform initialization, such as the average composition. A nonuniform initialization may introduce distortions in the Viterbi paths.
⁹Differences with the alignment published in the ISMB95 Proceedings result from differences in regularization, and not in the number of training cycles.
prior to training. The model is capable of detecting and accommodating these headers, by treating them as initial repeated inserts, as can be seen from the alignment of three of the sequences (S09711, A36194, S11239). This multiple alignment also contains a few isolated problems, related in part to the overuse of gaps and insert states. Interestingly, this is most evident in the hypervariable regions, for instance at positions 30-35 and 50-55. These problems should be eliminated with a more careful selection of hybrid architecture and/or regularization. Alignments did not improve using A.3 and/or a larger number of hidden units, up to 4.
In Figure 5, we display the activity of the two hidden units associated with each main state (see 3.2). For most states, at least one of the activities is saturated. The activities associated with the cysteine residues responsible for the disulfide bridges (main states 24 and 100) are all saturated, and in the same corner (-1, +1). Points close to the center (0, 0) correspond to emission distributions determined by the bias only. For the main states, the three emission distributions of equation 3.3, associated with the bias and the two hidden units, are given by

P^0 = (0.442, 0.000, 0.005, 0.000, 0.001, 0.000, 0.004, 0.002, 0.133, 0.000, 0.000, 0.000, 0.000, 0.113, 0.195, 0.000, 0.104, 0.001, 0.000, 0.000)

P^1 = (0.000, 0.000, 0.000, 0.036, 0.000, 0.900, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.037, 0.000, 0.000, 0.000, 0.000, 0.000, 0.027)

P^2 = (0.000, 0.040, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.942, 0.001, 0.000, 0.016, 0.000, 0.000, 0.000, 0.000, 0.001, 0.000, 0.000)

using alphabetical order on single-letter amino acid symbols.
5 Discussion: The Case of Multiple Models
The hybrid HMM/NN architectures described address the first limitation of HMMs: the control of model structure and complexity. No matter how complex the NN component, however, the final model so far remains a single HMM. Therefore the second limitation of HMMs, long-range dependencies and underfitting, remains. This obstacle cannot be overcome by simply resorting to higher-order HMMs. Most often these are computationally intractable.

A possible approach is to try to introduce a new state for each relevant context. This requires a systematic method for determining relevant contexts of variable lengths, directly from the data. Furthermore, one must
Figure 4: Multiple alignment of 20 immunoglobulin sequences, randomly extracted from the training and validation data sets. Validation sequences: F37262, G1HUDW, A36194, A31485, D33548, S11239, I27888, A33989, A30502. Alignment is obtained with a hybrid HMM/NN architecture trained for 10 cycles, with two hidden units for the main state emissions, and one hidden unit for the insert state emissions. Lower case letters correspond to emissions from insert states. Notice the initial header (transport signal peptide) on some of the sequences, captured as repeated transitions through the first insert state in the model. The cysteines (C), associated with the disulfide bridge, in columns 24 and 100, are perfectly aligned (PH0097 is a known biological exception).
Figure 5: Activity of the two hidden units associated with the emission of the main states. The two activities associated with the cysteines (C) are in the upper left corner, almost overlapping, with coordinates (-1, +1).
hope the number of relevant contexts remains small. An interesting approach along these lines can be found in Ron et al. (1994), where English is modeled as a Markov process with variable memory length of up to 10 letters or so.
To address the second limitation without resorting to a different model class, one must consider more general HMM/NN hybrid architectures, where the underlying statistical model is a set of HMMs. To see this, consider again the X → Y / X' → Y' problem. To capture such dependencies requires variable emission vectors at the corresponding locations, together with a linking mechanism. In this simple case, four different emission vectors are needed: e_i, e_j, e'_i, and e'_j. Each one of these vectors must assign a high probability to the letters X, Y, X', and Y', respectively. More importantly, there must be some kind of memory, so that e_i and e_j are used for sequence O, and e'_i and e'_j are used for sequence O'. The combination of e_i and e'_j (or e'_i and e_j) should be rare or not allowed, unless required by the data. Thus e_i and e_j must belong to a first HMM, and e'_i and e'_j to a second HMM, with the possibility of switching from one HMM to the other, as a function of input sequence. Alternatively,
there must be a single HMM, but with variable emission distributions, modulated again by some input.

In both cases, then, we consider that the emission distribution of a given state depends not only on the state itself, but also on an additional stream of information I. That is, now θ = f(i, I). In a multiple HMM/NN hybrid architecture, f can be computed again by a NN. Depending on the problem, the input I can assume different forms, and may be called "context" or "latent variable." When feasible, I may even be equal to the currently observed sequence O. Other inputs are, however, possible, over different alphabets. An obvious candidate in protein modeling tasks would be the secondary structure of the protein (α-helices, β-sheets, and coils). In general, I could also be any other array of numbers representing latent variables for the HMM modulation (MacKay 1994). We shall now describe, without any simulations, two simple but somewhat canonical architectures of this sort. Learning is briefly discussed in the Appendix.
5.1 Example 1: Mixtures of HMM Experts. A first possible approach is to put an HMM mixture distribution on the sequences. With M HMMs M_1, ..., M_M, the likelihood of a sequence O becomes

P(O) = Σ_{i=1}^{M} λ_i P_{M_i}(O)     (5.1)

where Σ_i λ_i = 1, and the λ_i are the mixture coefficients. Similarly, the Viterbi likelihood is max_i λ_i P[π_{M_i}(O)]. In generative mode, sequences are produced at random by each individual HMM, and M_i is selected with probability λ_i. Such a system can be viewed as a larger single HMM, with a starting state connected to each one of the HMMs M_i, with transition probability λ_i (Fig. 6). This type of model is used in Krogh et al. (1994a) for unsupervised classification of globin protein sequences. Notice that the parameters of each submodel can be computed by an NN to create an HMM/NN hybrid architecture. Since the HMM experts form a larger single HMM, the corresponding hybrid architecture is also identical to what we have seen in the section on single HMMs. The only peculiarity is that states have been replicated, or grouped, to form different submodels. One further step is to have variable mixture coefficients that depend on the input sequence, or some other relevant information. These mixture coefficients can be computed as softmax outputs of an NN, as in the mixture of experts architecture of Jacobs et al. (1991).
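A schematic sketch of such a mixture of HMM experts, with input-dependent gating, is given below. It is an illustration of ours: the per-expert likelihoods are produced by a stand-in function (in a real system they would come from each expert's forward algorithm), and the gating features and weights are invented.

```python
import numpy as np

# Sketch of a mixture of HMM experts (eq 5.1): the sequence likelihood is a
# weighted sum of the likelihoods under each component HMM, with the mixture
# coefficients lambda_i(I) computed by a softmax gating network.

rng = np.random.default_rng(5)
n_experts = 3

def expert_likelihoods(sequence):
    """Stand-in for P_{M_i}(O); a real version would run each expert's
    forward algorithm on the sequence."""
    return rng.uniform(0.0, 1.0, size=n_experts)

def gate(features, W):
    """Softmax gating network producing mixture coefficients lambda_i(I)."""
    z = W @ features
    z -= z.max()
    return np.exp(z) / np.exp(z).sum()

features = rng.normal(size=4)                 # hypothetical context input I
W = rng.normal(size=(n_experts, 4))           # gating weights (illustrative)

lam = gate(features, W)                       # lambda_i, sums to 1
P_experts = expert_likelihoods("ACDE...")     # P_{M_i}(O), stand-in values
P_mixture = float(lam @ P_experts)            # eq 5.1
P_viterbi_style = float((lam * P_experts).max())
print(P_mixture, P_viterbi_style)
```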
5.2 Example 2: Mixtures of Emission Experts. A different approach is to modulate a single HMM by considering that the emission parameters e_{iX} should also be a function of the additional input I. So e_{iX} = P(i, X, I).
Figure 6: Schematic representation of the type of multiple HMM architecture used in Krogh et al. (1994a) for discovering subfamilies within a protein family. Each "box," between the start and end states, corresponds to an HMM with the architecture of Figure 1.
Without any loss of generality, we can assume that P is a mixture of n emission experts P_j:

P(i, X, I) = Σ_{j=1}^{n} λ_j(i, X, I) P_j(i, X, I)     (5.2)

In many interesting cases, λ_j is independent of X, resulting in the probability vector equation, over the alphabet:

P(i, I) = Σ_{j=1}^{n} λ_j(i, I) P_j(i, I)     (5.3)

If n = 1 and P(i, I) = P(i), we are back to a single HMM. An important special case is derived by further assuming that λ_j does not depend on i, and P_j(i, X, I) does not depend on I explicitly. Then

P(i, I) = Σ_{j=1}^{n} λ_j(I) P_j(i)     (5.4)

This provides a principled way for designing the top layers of general hybrid HMM/NN architectures, such as the one depicted in Figure 7.
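The special case 5.4 can be sketched directly. The fragment below is ours, not the paper's: fixed emission experts are mixed, state by state, with context-dependent coefficients produced by a small softmax gate; all sizes, weights, and the context vector are illustrative.

```python
import numpy as np

# Sketch of eq 5.4: the emission vector of state i is a convex combination of
# n fixed emission experts P_j(i, .), with mixture coefficients lambda_j(I)
# computed from the external input or context I by a softmax gating network.

rng = np.random.default_rng(6)
n_states, M, n_experts, ctx_dim = 10, 20, 3, 5

experts = rng.dirichlet(np.ones(M), size=(n_experts, n_states))  # P_j(i, X)
W_gate = rng.normal(size=(n_experts, ctx_dim))                   # gating weights

def modulated_emissions(context):
    z = W_gate @ context
    z -= z.max()
    lam = np.exp(z) / np.exp(z).sum()            # lambda_j(I)
    # e[i, X] = sum_j lambda_j(I) * P_j(i, X)    (eq 5.4)
    return np.tensordot(lam, experts, axes=1), lam

context = rng.normal(size=ctx_dim)               # hypothetical input I
E, lam = modulated_emissions(context)
print(E.shape, np.allclose(E.sum(axis=1), 1.0))  # (10, 20) True
```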
Figure 7: Schematic representation of a general HMM/NN architecture, where the HMM parameters are computed by an NN of arbitrary complexity that operates on state information, but also on input or context. The input or context is used to modulate the HMM parameters, for instance, by switching or mixing different parameter experts. For simplicity, only emission parameters are represented, with three emission experts, and a single hidden layer. Connections from the HMM states to the control network, and from the input to the hidden layer, are also possible.
The components P_j are computed by a NN, and the mixture coefficients by another gating NN. Naturally, many variations are possible and, in the most general case, the switching network can depend on the state i, and the distributions P_j on the input I. In the case of protein modeling, for instance, if the switching depends on position i, the emission experts could correspond to different types of regions, such as hydrophobic and hydrophilic, rather than different subclasses within a protein family.
6 Conclusion
A large class of hybrid HMM/NN architectures has been described. These architectures improve on single HMMs in two complementary directions. First, the NN reparameterization provides a flexible tool for the control of overfitting, the introduction of priors, and the construction of an input-dependent mechanism for the modulation of the final model. Second, modeling a data set with multiple HMMs allows for the coverage of a larger set of distributions, and the expression of nonstationarity and correlations inaccessible to single HMMs. We recently found out that related ideas have been proposed independently in Bengio and Frasconi (1995), but from a different viewpoint, in terms of input/output HMMs. Not surprisingly, these ideas are also related to data compression, information complexity, factorial codes, autoencoding, and generative models [for instance, Dayan et al. (1995), and references therein].

The concept of hybrid HMM/NN architecture has been demonstrated, in its simplest form, by providing a model of the immunoglobulin family. The HMM/NN approach is meant to complement rather than substitute many of the already existing techniques for incorporating prior information in sequence models. Additional work is required to develop optimal architectures and learning algorithms, and to test them on more challenging protein families and other domains.

Two important issues for the success of a hybrid HMM/NN architecture on a real problem are the design of the NN architecture, and the selection of the external input or context. These issues are problem dependent and cannot be dealt with generally. We have described some examples of architectures using mixture ideas for the design of the NN component. Different input choices are possible, such as contextual information or latent variables, sequences over a different alphabet (for instance strokes versus letters in handwriting recognition), or just real vectors, in the case of manifold parameterization (MacKay 1994).

As pointed out in the introduction, the ideas presented here are not limited to HMMs, or to protein or DNA modeling. They can be viewed in a more general framework, where a class of parameterized models is first constructed for the data, and then the parameters of the models are calculated, and possibly modulated, by one or several other NNs (or any other flexible reparameterization). In fact, several examples of simple hybrid architectures can be found scattered throughout the literature. A classical case consists of binomial (respectively multinomial) classification models, where membership probabilities are calculated by a NN with a sigmoidal (respectively normalized exponential) output (Rumelhart et al. 1995). Other examples are the master-slave approach of Lapedes and Farber (1986), and the sigmoidal belief networks in Neal (1992), where NNs are used to compute the weights of another NN, or the conditional distributions of a belief network. Although the principle of hybrid modeling is not new, by exploiting it systematically in the case of HMMs, we have generated new classes of models. There are other classes where the principle has not been applied systematically yet. As an example, it is well known that HMMs are equivalent to stochastic regular grammars. The next level in the Chomsky hierarchy is context-free grammars, and their stochastic versions (SCFGs). One can consider hybrid SCFG/NN architectures, where a NN is used to compute the parameters of a SCFG, and/or to modulate or mix different SCFGs. Such hybrid grammars might be useful, for instance, in extending the work of Sakakibara et al. (1994) on RNA modeling. Finding optimal architectures for molecular biology applications and other domains, and developing a better understanding of how probabilistic models should be modulated as a function of input or context, are some of the main challenges for hybrid approaches.
7 Appendix
7.1 Learning for Simple HMM/NN Architectures. Here we give on-line equations (batch equations can be derived similarly). For a sequence O, we need to compute the partial derivatives of ln P(O), or ln P[π(O)], with respect to the parameters α, β, and b of the network.
7.1.1 Gradient Learning on Full Likelihood. Let Q(O) = ln P(O). If m_{iX}(O) is the normalized count for the emission of X from i for O, derived using the forward-backward algorithm (Baldi and Chauvin 1994b), then

∂Q(O)/∂e_{iX} = m_{iX}(O)/e_{iX}     (A.1)

so that, combining A.1 with the normalized exponential outputs of 3.2, the error signal at output unit X, for state i, is

m_{iX}(O) - m_i(O) e_{iX}     (A.2)

The partial derivatives with respect to the network parameters α, β, and b can be obtained by the chain rule, that is, by backpropagating through the network for each i. For each O and i, the resulting on-line learning equations are

Δβ_{Xh} = η [m_{iX} - m_i e_{iX}] f_h(α_{ih} + b_h)
Δb_X = η [m_{iX} - m_i e_{iX}]
Δα_{jh} = η δ_{ji} f'_h(α_{ih} + b_h) Σ_Y β_{Yh} [m_{iY} - m_i e_{iY}]
Δb_h = η f'_h(α_{ih} + b_h) Σ_Y β_{Yh} [m_{iY} - m_i e_{iY}]     (A.3)

with δ_{ii} = 1, and δ_{ji} = 0 for j ≠ i. The full gradient results by summing over all sequences, and all main states. For instance,

∂Q/∂α_{ih} = Σ_O f'_h(α_{ih} + b_h) Σ_Y β_{Yh} [m_{iY}(O) - m_i(O) e_{iY}]     (A.4)

and similarly for β and the biases. It is worth noticing that these equations are slightly different from those obtained by gradient descent on the local cross-entropy between the emission distribution e_{iX} and the target distribution m_{iX}/m_i.
7.1.2 Viterbi Learning. Here Q(O) = ln P[π(O)]. The component of this term that depends on emission from main states, and thus on α, β, and b, along the Viterbi path π = π(O), is given by

-Σ_{(i,X)∈π} ln e_{iX} = -Σ_{(i,X)∈π} Σ_Y T_{iY} ln e_{iY} = Σ_{(i,X)∈π} Σ_Y T_{iY} ln (T_{iY}/e_{iY})     (A.5)

where T_{iX} is the target: T_{iX} = 1 if X is emitted from main state i in π(O), and 0 otherwise. Thus computing the gradient of Q(O) = -ln P[π(O)] with respect to α, β, and b is equivalent to computing the gradient of the local cross-entropy

H(T_i, e_i) = -Σ_Y T_{iY} ln e_{iY}     (A.6)

between the target output and the output of the network, over all i in π. This cross-entropy error function, combined with the softmax output unit, is the standard NN framework for multinomial classification (Rumelhart et al. 1995). In summary, the relevant derivatives can be calculated on-line both with respect to the sequences O_1, ..., O_K and, for each sequence, with respect to the Viterbi path. For each sequence O, and for each main state i on the Viterbi path π = π(O), the corresponding contribution to the derivative can be obtained by standard backpropagation on H(T_i, e_i). The Viterbi on-line learning equations, similar to A.3, are given by

Δβ_{Xh} = η (T_{iX} - e_{iX}) f_h(α_{ih} + b_h)
Δb_X = η (T_{iX} - e_{iX})
Δα_{ih} = η f'_h(α_{ih} + b_h) Σ_Y β_{Yh} (T_{iY} - e_{iY})
Δb_h = η f'_h(α_{ih} + b_h) [β_{Xh}(1 - e_{iX}) - Σ_{Y≠X} β_{Yh} e_{iY}]     (A.7)

for (i, X) ∈ π(O), with T_{iX} = 1, and T_{iY} = 0 for Y ≠ X. The full gradient is obtained again by summing over all sequences, and all main states present in the corresponding Viterbi paths. For instance,

∂Q/∂α_{ih} = Σ_{O: i∈π(O)} f'_h(α_{ih} + b_h) Σ_Y β_{Yh} [T_{iY} - e_{iY}]     (A.8)

and similarly for β and the biases.
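A single Viterbi learning step can be sketched as the usual softmax-plus-cross-entropy backpropagation at one path position. The fragment below is ours, with logistic hidden units assumed and illustrative sizes; it mirrors the structure of A.7 but is not the authors' code.

```python
import numpy as np

# Sketch of one Viterbi on-line learning step at a main state i that emits
# symbol X on the Viterbi path: backpropagation of the local cross-entropy
# between the one-hot target T_i and the softmax emission e_i (cf. A.7).

rng = np.random.default_rng(7)
N, H, M, eta = 50, 2, 20, 0.1

alpha = rng.normal(size=(N, H)); b_h = np.zeros(H)
beta = rng.normal(size=(M, H)); b_X = np.zeros(M)

def viterbi_step(i, X):
    """Update the network parameters for one (state i, emitted symbol X) pair."""
    global alpha, b_h, beta, b_X
    pre_h = alpha[i] + b_h
    f = 1.0 / (1.0 + np.exp(-pre_h))                 # hidden activities f_h
    scores = beta @ f + b_X
    e = np.exp(scores - scores.max()); e /= e.sum()  # emission e_i (softmax)
    T = np.zeros(M); T[X] = 1.0                      # one-hot target
    delta_out = T - e                                # softmax + cross-entropy
    beta += eta * np.outer(delta_out, f)
    b_X += eta * delta_out
    back = (beta.T @ delta_out) * f * (1.0 - f)      # backprop through logistic
    alpha[i] += eta * back
    b_h += eta * back

viterbi_step(i=3, X=7)                               # e.g., state m_4 emits symbol 7
```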
7.2 Learning for General HMM/NN Architectures. For a given setting of all the parameters, for a given observation sequence, and for a given input vector I, the general HMM/NN hybrid architectures reduce to a single HMM. The likelihood of a sequence, or some other measure of its fitness, with respect to such an HMM, can be computed by dynamic programming. As long as it is differentiable in the model parameters, we can then backpropagate the gradient through the NN, including through the portion of the network depending on I, such as the gating network of Figure 7. With minor modifications, this leads to learning algorithms
similar to those described above. This form of learning encourages cooperation between the emission experts of Figure 7. As in the usual mixture of experts architecture of Jacobs et al. (1991), it may be useful to introduce some degree of competition between the experts, so that each one of them specializes, for instance, on a different subclass of sequences.

When the relevant input or hidden variable I is not known, it can be learned together with the model parameters using Bayesian inversion. Indeed, consider for instance the case where there is an input I associated with each observation sequence O, and a hybrid model with parameters w, so that we can compute P(O | I, w). Let P(I) and P(w) denote our priors on I and w. Then

P(I | O, w) = P(O | I, w) P(I) / P(O | w)     (A.9)

with

P(O | w) = ∫ P(O | I, w) P(I) dI     (A.10)

The probability of the model parameters, given the data, can then be calculated, using Bayes theorem again:

P(w | D) = P(w) Π_O P(O | w) / P(D)     (A.11)

assuming the observations are independent. These parameters can be optimized by gradient descent on -log P(w | D). The main step is the evaluation of the likelihood P(O | w) (A.10), and its derivatives with respect to w, which can be done by Monte Carlo sampling. The distribution on the latent variables I is calculated by A.9. The work of MacKay (1994) is an example of such a learning approach. The density network used for protein modeling can be viewed essentially as a special case of HMM/NN hybrid architecture, where each emission vector acts as a softmax transformation on a low-dimensional real "hidden" input I, with independent gaussian priors on I and w. The input I modulates the emission vectors, and therefore the underlying HMM, as a function of sequence.
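The Monte Carlo evaluation of A.10 can be sketched in a few lines. The snippet below is an illustration of ours: the conditional likelihood is a stand-in smooth function (in the hybrid architecture it would be the dynamic-programming likelihood of O under the I-modulated HMM), and the gaussian prior on I and the sample size are assumptions.

```python
import numpy as np

# Sketch of the Monte Carlo evaluation of the likelihood in A.10:
# P(O | w) = integral P(O | I, w) P(I) dI is approximated by averaging
# P(O | I_s, w) over samples I_s drawn from the prior P(I).

rng = np.random.default_rng(8)
latent_dim, n_samples = 2, 1000

def likelihood_given_latent(O, I, w):
    """Stand-in for P(O | I, w); a real version would run the forward
    algorithm on the I-modulated HMM. Here: an arbitrary positive function."""
    return float(np.exp(-0.5 * np.sum((I - w) ** 2)))

def monte_carlo_likelihood(O, w):
    samples = rng.normal(size=(n_samples, latent_dim))   # I_s ~ P(I) = N(0, 1)
    values = [likelihood_given_latent(O, I, w) for I in samples]
    return float(np.mean(values))                        # estimate of A.10

w = np.zeros(latent_dim)
print(monte_carlo_likelihood("ACDE...", w))
```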
7.3 Priors. There are many ways to introduce priors in HMMs. Additional work is required to compare them to the present methods. For instance, it is natural to use Dirichlet priors (Krogh et al. 1994a) on multinomial distributions, such as emission vectors over discrete alphabets. It is easy to check that if a multinomial distribution is calculated by a set of normalized exponential output units, a gaussian prior on the weights of these units is in general not equivalent to a Dirichlet prior on the outputs.
Acknowledgments

The work of P.B. is supported by a grant from the ONR. The work of Y.C. is supported in part by Grant R43 LM05780 from the National Library of Medicine. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the National Library of Medicine.
References

Altschul, S. F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219, 1-11.

Baldi, P., and Chauvin, Y. 1994a. Hidden Markov models of the G-protein-coupled receptor family. J. Comp. Biol. 1(4), 311-335.

Baldi, P., and Chauvin, Y. 1994b. Smooth on-line learning algorithms for hidden Markov models. Neural Comp. 6(2), 305-316.

Baldi, P., Brunak, S., Chauvin, Y., and Engelbrecht, J. 1994a. Hidden Markov models of human genes. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6, pp. 761-768. Morgan Kaufmann, San Mateo, CA.

Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. 1994b. Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. U.S.A. 91(3), 1059-1063.

Bengio, Y., and Frasconi, P. 1995. An input-output HMM architecture. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 7. Morgan Kaufmann, San Mateo, CA.

Bengio, Y., Le Cun, Y., and Henderson, D. 1995. Globally trained handwritten word recognizer using spatial representation, convolutional neural networks and hidden Markov models. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6. Morgan Kaufmann, San Mateo, CA.

Bourlard, H., and Morgan, N. 1994. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic, Boston.

Cho, S. B., and Kim, J. H. 1995. An HMM/MLP architecture for sequence recognition. Neural Comp. 7, 358-369.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. 1995. The Helmholtz machine. Neural Comp. 7(5), 889-904.

Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B39, 1-22.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.

Krogh, A., Brown, M., Mian, I. S., Sjolander, K., and Haussler, D. 1994a. Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235, 1501-1531.

Krogh, A., Mian, I. S., and Haussler, D. 1994b. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res. 22, 4768-4778.

Lapedes, A., and Farber, R. 1986. A self-optimizing, nonsymmetrical neural net for content addressable memory and pattern recognition. Physica 22D, 247-259.

MacKay, D. J. C. 1994. Bayesian neural networks and density networks. Proceedings of the Workshop on Neutron Scattering Data Analysis and Proceedings of the 1994 MaxEnt Conference, Cambridge, UK.

Myers, E. W. 1994. An overview of sequence comparison algorithms in molecular biology. Protein Sci. 3(1), 139-146.

Neal, R. M. 1992. Connectionist learning of belief networks. Artificial Intelligence 56, 71-113.

Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257-286.

Ron, D., Singer, Y., and Tishby, N. 1994. The power of amnesia. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6. Morgan Kaufmann, San Mateo, CA.

Rumelhart, D. E., Durbin, R., Golden, R., and Chauvin, Y. 1995. Backpropagation: The basic theory. In Backpropagation: Theory, Architectures and Applications, pp. 1-34. Lawrence Erlbaum, Hillsdale, NJ.

Sakakibara, Y., Brown, M., Hughey, R., Saira Mian, I., Sjolander, K., Underwood, R. C., and Haussler, D. 1994. The application of stochastic context-free grammars to folding, aligning and modeling homologous RNA sequences. UCSC Technical Report UCSC-CRL-94-14.

Searls, D. B. 1992. The linguistics of DNA. Am. Sci. 80, 579-591.
Received February 2, 1995; accepted February 8, 1996.
Article
Full-text available
Starting with a statistical domain model in the form of a stochastic grammar, one can derive neural network architectures with some of the expressive power of a semantic network and also some of the pattern recognition and learning capabilities of more conventional neural networks. For example in this paper a version of the "Frameville" architecture, and in particular its objective function and constraints, is derived from a stochastic grammar schema. Possible optimization dynamics for this architecture, and relationships to other recent architectures such as Bayesian networks and variable-binding networks, are also discussed.
... In the literature, a number of attempts have been made to combine HMMs and NNs to form hybrid models that combine the expressive power of artificial NNs with the sequential time series aspect of HMMs. The approach in Baldi and Chauvin (1996) suggests a class of hybrid architectures wherein the HMM and NN components are trained inseparably. In these architectures, the NN component is used to reparameterize and tune the HMM component. ...
Article
Full-text available
The self-organizing hidden Markov model map SOHMMM introduces a hybrid integration of the self-organizing map SOM and the hidden Markov model HMM. Its scaled, online gradient descent unsupervised learning algorithm is an amalgam of the SOM unsupervised training and the HMM reparameterized forward-backward techniques. In essence, with each neuron of the SOHMMM lattice, an HMM is associated. The image of an input sequence on the SOHMMM mesh is defined as the location of the best matching reference HMM. Model tuning and adaptation can take place directly from raw data, within an automated context. The SOHMMM can accommodate and analyze deoxyribonucleic acid, ribonucleic acid, protein chain molecules, and generic sequences of high dimensionality and variable lengths encoded directly in nonnumerical/symbolic alphabets. Furthermore, the SOHMMM is capable of integrating and exploiting latent information hidden in the spatiotemporal dependencies/correlations of sequences’ elements.
... 6.2). DP algorithms are also essential for systems that combine concepts of NNs and graphical models, such as Hidden Markov Models (HMMs) (Stratonovich, 1960;Baum and Petrie, 1966) and Expectation Maximization (EM) (Dempster et al., 1977;Friedman et al., 2001), e.g., (Bottou, 1991;Bengio, 1991;Bourlard and Morgan, 1994;Baldi and Chauvin, 1996;Jordan and Sejnowski, 2001;Bishop, 2006;Hastie et al., 2009;Poon and Domingos, 2011;Dahl et al., 2012;Hinton et al., 2012a;Wu and Shao, 2014). ...
Article
In recent years, deep neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.
... In the literature, a number of attempts have been made for combining HMMs and NNs to form hybrid models that combine the expressive power of artificial NNs with the sequential time series aspect of HMMs. The approach in Baldi and Chauvin (1996) suggests a class of hybrid architectures where the HMM and NN components are trained inseparably. In these architectures the NN component is used to reparameterize and tune the HMM component. ...
Article
A hybrid approach combining the Self-Organizing Map (SOM) and the Hidden Markov Model (HMM) is presented. The Self-Organizing Hidden Markov Model Map (SOHMMM) establishes a cross-section between the theoretic foundations and algorithmic realizations of its constituents. The respective architectures and learning methodologies are fused in an attempt to meet the increasing requirements imposed by the properties of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and protein chain molecules. The fusion and synergy of the SOM unsupervised training and the HMM dynamic programming algorithms bring forth a novel on-line gradient descent unsupervised learning algorithm, which is fully integrated into the SOHMMM. Since the SOHMMM carries out probabilistic sequence analysis with little or no prior knowledge, it can have a variety of applications in clustering, dimensionality reduction and visualization of large-scale sequence spaces, and also, in sequence discrimination, search and classification. Two series of experiments based on artificial sequence data and splice junction gene sequences demonstrate the SOHMMM's characteristics and capabilities.
Article
Deep learning methods applied to problems in chemoinformatics often require the use of recursive neural networks to handle data with graphical structure and variable size. We present a useful classification of recursive neural networks approaches into two classes, the Inner and Outer approach. The inner approach uses recursion inside the underlying graph, to essentially 'crawl' the edges of the graph, while the outer approach uses recursion outside the underlying graph, to aggregate information over progressively longer distances in an orthogonal direction. We illustrate the inner and outer approaches on several examples. More importantly, we provide open-source implementations$^1$ for both approaches in Tensorflow which can be used in combination with training data to produce efficient models for predicting the physical, chemical, and biological properties of small molecules. 1: www.github.com/Chemoinformatics/InnerOuterRNN and cdb.ics.uci.edu
Article
Full-text available
We present a new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases. The new procedure can be viewed either as a modular version of a multilayer supervised network, or as an associative version of competitive learning. It therefore provides a new link between these two apparently different approaches. We demonstrate that the learning procedure divides up a vowel discrimination task into appropriate subtasks, each of which can be solved by a very simple expert network.
Article
Molecular biologists frequently compare biosequences to see if any similarities can be found in the hope that what is true of one sequence either physically or functionally is true of its analogue. Such comparisons are made in a variety of ways, some via rigorous algo- rithms, others by manual means, and others by a combination of these two extremes. The topic of sequence comparison now has a rich history dating back over two decades. In this survey we review the now classic and most established technique: dynamic programming. Then a number of interesting variations of this basic problem are examined that are specifically motivated by applications in molecular biology. Finally, we close with a discus- sion of some of the most recent and future trends.
Article
A natural, collective neural model for Content Addressable Memory (CAM) and pattern recognition is described. The model uses nonsymmetrical, bounded synaptic connection matrices and continuous valued neurons. The problem of specifying a synaptic connection matrix suitable for CAM is formulated as an optimization problem, and recent techniques of Hopfield are used to perform the optimization. This treatment naturally leads to two interacting neural nets. The first net is a symmetrically connected net (master net) containing information about the desired fixed points or memory vectors. The second net is, in general, a nonsymmetric net (slave net), whose synapse values are determined by the master net, and is the net that actually performs the CAM task. The two nets acting together are an example of neural self-organization. Many advantages of this master/slave approach are described, one of which is that nonsymmetric synaptic matrices offer a greater potential for relating formal neural modeling to neurophysiology. In addition, it seems that this approach offers advantages in application to pattern recognition problems due to the new ability to sculpt basins of attraction. The simple structure of the master net connections indicates that this approach presents no additional problems in reduction to hardware when compared to single net implementations.
Article
This paper presents a hybrid architecture of hidden Markov models (HMMs) and a multilayer perceptron (MLP). This exploits the discriminative capability of a neural network classifier while using HMM formalism to capture the dynamics of input patterns. The main purpose is to improve the discriminative power of the HMM-based recognizer by additionally classifying the likelihood values inside them with an MLP classifier. To appreciate the performance of the presented method, we apply it to the recognition problem of on-line handwritten characters. Simulations show that the proposed architecture leads to a significant improvement in generalization performance over conventional approaches to sequential pattern recognition.
Article
Human factor D, an essential enzyme of the alternative pathway of complement activation, has been crystallized. Crystals were grown by vapor diffusion using polyethylene glycol 6000 and NaCl as precipitants. The factor D crystals are triclinic and the space group is P1 with unit cell dimensions , α = 101·0°, β = 109·7°, γ = 74·3°. The unit cell contains two molecules of factor D related by a non-crystallographic 2-fold axis. The crystals grow to dimensions of 0·8 mm × 0·5 mm × 0·2 mm within five days, are stable in the X-ray beam and diffract beyond 2·5 Å.
Article
Connectionist learning procedures are presented for “sigmoid” and “noisy-OR” varieties of probabilistic belief networks. These networks have previously been seen primarily as a means of representing knowledge derived from experts. Here it is shown that the “Gibbs sampling” simulation procedure for such networks can support maximum-likelihood learning from empirical data through local gradient ascent. This learning procedure resembles that used for “Boltzmann machines”, and like it, allows the use of “hidden” variables to model correlations between visible variables. Due to the directed nature of the connections in a belief network, however, the “negative phase” of Boltzmann machine learning is unnecessary. Experimental results show that, as a result, learning in a sigmoid belief network can be faster than in a Boltzmann machine. These networks have other advantages over Boltzmann machines in pattern classification and decision making applications, are naturally applicable to unsupervised learning problems, and provide a link between work on connectionist learning and work on the representation of expert knowledge.