Efficient Bayesian estimation of Markov model transition matrices with given stationary distribution

Benjamin Trendelkamp-Schroer1, a) and Frank Noé1, b)

Institut für Mathematik und Informatik, FU Berlin, Arnimallee 6, 14195 Berlin

(Dated: December 12, 2013)

Direct simulation of biomolecular dynamics in thermal equilibrium is challenging due to the metastable nature of conformation dynamics and the computational cost of molecular dynamics. Biased or enhanced sampling methods may significantly improve the convergence of expectation values of equilibrium probabilities and expectation values of stationary quantities. Unfortunately, the convergence of dynamic observables such as correlation functions or timescales of conformational transitions relies on direct equilibrium simulations. Markov state models are well suited to describe both stationary properties and properties of slow dynamical processes of a molecular system, in terms of a transition matrix for a jump process on a suitable discretization of continuous conformation space. Here, we introduce statistical estimation methods that allow a priori knowledge of equilibrium probabilities to be incorporated into the estimation of dynamical observables. Both maximum likelihood methods and an improved Monte Carlo sampling method for reversible transition matrices with fixed stationary distribution are given. The sampling approach is applied to a toy example as well as to simulations of the MR121-GSGS-W peptide, and is demonstrated to converge much more rapidly than the previous approach in Ref. 1.

I. INTRODUCTION

Characterization of the conformational dynamics of proteins and other biomolecules in thermal equilibrium includes the identification of their metastable states and the quantification of their populations and transition rates. Such a characterization is essential to analyze and potentially manipulate biologically important conformational transitions, including folding, ligand binding, and aggregation. Unfortunately, a direct observation of dynamical processes with atomistic resolution is impossible because the scale of conformation dynamics lies well below the diffraction limit of optical methods. Spectroscopic methods that provide information in atomistic detail, such as X-ray crystallography, usually provide information only about static quantities. NMR spectroscopy methods provide only indirect observations of dynamical processes via relaxation dispersion correlations, whose interpretation is challenging, and do not provide direct structural information. Single-molecule spectroscopic methods can probe the dynamical fluctuations of one or two observables directly, but they do not reveal molecular structures.

The recent increase in computing power has enabled the study of conformation dynamics in atomistic detail via direct molecular dynamics simulations2–7. Nonetheless, the metastable nature of conformation dynamics8–12, in combination with the necessary explicit treatment of fast degrees of freedom in the numerical integration of the equations of motion, renders the spontaneous observation of rare events on millisecond or slower timescales difficult. As a result, one faces severe difficulties when trying to converge expectation values of observables depending on slow processes, such as the implied timescales of large-scale conformational changes13.

a) Electronic mail: benjamin.trendelkamp-schroer@fu-berlin.de

b) Electronic mail: frank.noe@fu-berlin.de; corresponding author

The recent years have seen the development of a host

of biased or enhanced sampling methods to accelerate

rare events, and thus to permit the eﬃcient exploration

of the system’s relevant conformations and estimation

of at least its thermodynamic quantities, such as the

stationary probabilities of states and stationary expec-

tation values. To name only some of the best-known

examples, replica exchange or parallel tempering meth-

ods facilitate the hopping over energetic barriers by ex-

changing molecular conformations between simulations

at diﬀerent temperatures14,15. Flooding methods obtain

stationary probabilities by ﬁlling up the free energy land-

scape according to the frequency of visits by the evolving

trajectory16,17 . Umbrella sampling18 proceeds by choos-

ing an appropriate re-weighting function restricting the

chain to a subspace relevant to the estimation of a cho-

sen observable. An improved version using the weighted

histogram analysis method19 guides the simulation along

a multidimensional hyper-surface speciﬁed by a set of a

priori chosen reaction coordinates20. For a short pedagogical overview of enhanced ensemble methods see Ref. 21. Applications include replica exchange folding studies of a small RNA hairpin22, single-copy tempering for trpzip2, trp-cage, and the villin headpiece23, as well as reconnaissance metadynamics for the binding of benzamidine to trypsin24. Other problems that have been treated successfully include first and second order phase transitions in lattice spin systems25.

arXiv:1301.2078v2 [physics.chem-ph] 22 Apr 2013

While biased or enhanced sampling methods can generate estimates of equilibrium quantities efficiently, they usually do not preserve the equilibrium dynamics. Thus, dynamical observables such as rates or time-correlation functions have to be estimated using other methods, chiefly from direct equilibrium molecular dynamics simulations. An approach frequently used to integrate and analyze molecular dynamics data is Markov

modeling12,26–32. Markov models approximate the con-

tinuous phase space dynamics in terms of a discrete space

Markov jump process. A particular advantage of this approach is that Markov processes have been extensively studied in mathematics, so that a large number of rigorous results are available. The construction of

Markov models proceeds through ﬁrst choosing a suit-

able discretization of conformation space and then es-

timating a transition probability matrix from counted

transitions between conformational subsets speciﬁed by

the discretization26. Choosing the discretization so as

to achieve an accurate Markov model is a topic of cur-

rent research30,33–35. As shown in33 the approximation

error can be bounded and vanishes as the discretiza-

tion gets ﬁner and the lag time is increased. A re-

cently outlined variational method36 can be employed to

approximate relevant spectral properties of the transi-

tion operator by an application of the famous Rayleigh-

Ritz principle. An approach using basis functions and variational inequalities makes it possible to connect to established methods from electronic structure calculations and may prove useful in iteratively improving conformation space discretization. For an overview of the Markov state model approach to conformation dynamics see Ref. 32. The Markov state model approach has been able

to reconstruct complex molecular processes such as pro-

tein folding2,12,26–32,37,38, natively unstructured protein

dynamics35, and protein-ligand binding39–43 from com-

puter generated trajectories. In addition the Markov

model framework allows the comparison of simulation

driven predictions with experimental ﬁndings in a con-

sistent manner44–47.

Since enhanced and biased sampling methods can sig-

niﬁcantly improve the convergence of stationary quanti-

ties in the presence of long timescales, while direct molec-

ular dynamics simulations can probe dynamical quanti-

ties depending on short timescales, it would be desirable

to combine the advantages of both approaches. A natu-

ral mathematical basis to foster this combination is de-

tailed balance of the dynamics. Detailed balance states

that under equilibrium conditions, the ratio of stationary

probabilities between two states is equal to the inverse ra-

tio of transition rates or probabilities between them. On

the microscopic scale, detailed balance is a natural conse-

quence of the time inversion invariance of the microscopic

equations of motion and the Gaussian white noise nature

of the stochastic ﬂuctuations48 (p. 88ﬀ.). When using

Markov models, microscopic detailed balance directly translates into detailed balance on the level of Markov states. Therefore, it would be desirable to include prior information about the stationary distribution into the estimation of dynamical observables such as correlation functions and timescales or rates of conformational changes.

One could for example use well converged equilibrium

probabilities estimated on conformational subsets consti-

tuting a suitable discretization from an extended ensem-

ble simulation and generate observations of equilibrium

ﬂuctuations from a standard equilibrium simulation. The

precise knowledge of the stationary probabilities could for

example be used to obtain sharper estimates of dynam-

ical quantities such as timescales for large scale confor-

mational transitions.

Detailed balance is now commonly used as a constraint to guide the maximum likelihood estimation of Markov model transition matrices from observed transition counts32,49. However, these existing approaches do not allow prior knowledge of the stationary distribution to be included explicitly. Beyond maximum likelihood estimates, the estimation of the statistical uncertainty stemming from the fact that only finitely many transition counts have been observed is crucial for a meaningful comparison with expectation values obtained from other simulations as well as with experimental observations50. Furthermore, quantification

of statistical uncertainties is a prerequisite to guide an

adaptive sampling approach that aims at reducing them

eﬃciently49,51,52. In Singhal et al.51 direct sampling of

transition matrices was applied to calculate the distri-

bution of mean ﬁrst passage times. A computationally

eﬃcient procedure to estimate the variance together with

the mean based on a Gaussian approximation of the dis-

tribution of transition matrices and a ﬁrst order Tay-

lor expansion of the target observable was also devel-

oped. In52 the method was extended to the estimation of

eigenvalues and eigenvectors. In53, a similar perturbation

method was used to evaluate the statistical error of com-

mittor probabilities. In54 a related approach based on

perturbation theory of spectral subspaces is developed in

order to achieve a reﬁnement of a grid-free conformation

space discretization. A full Bayesian approach for esti-

mating statistical errors including the detailed balance

constraint was introduced in Ref. 1. In a subsequent study, we extended the formalism by also including statistical uncertainties of spectroscopic observables50. Ref. 55 used a different approach, an edge-reinforced random walk, to sample reversible transition matrices. As yet, the Markov chain Monte Carlo approach in Ref. 1 is the only approach that allows prior knowledge of the stationary distribution to be included explicitly in the estimation of the probability distribution of transition matrices. However, this sampler has rather poor mixing properties, thus requiring many iterations and a high computational load before the probability distributions can be estimated reliably.

In the following we will introduce efficient methods to include prior knowledge of the stationary distribution into reversible transition matrix estimates: (1) Maximum likelihood estimation methods are given that either solve a constrained convex optimization problem using standard optimization libraries, or proceed via an iterative likelihood maximization algorithm. (2) An efficient Gibbs method is introduced to sample the conditional densities of individual transition matrix elements, offering improved convergence properties over the previous approach in Ref. 1. The estimation and sampling methods described here are implemented in the EMMA Markov model toolkit56. The maximum likelihood estimation for fixed stationary distribution can be performed using the EMMA command mm_estimate, and the Gibbs sampling of reversible transition matrices with fixed stationary distribution is available via the command mm_transitionMatrixSampling.

II. PROBABILITY DISTRIBUTIONS FOR TRANSITION MATRICES

If one has at hand only a finite observation $X_1, \dots, X_N$ of a Markov jump process, there are usually infinitely many transition matrices $P$ that are compatible with the given data. In the following we assume that one can directly observe transitions between individual micro states $i \in \{1, \dots, n\}$. A single entry $p_{ij}$ of a transition matrix quantifies the probability of making a transition to state $j$ given that the process is in state $i$,

$$p_{ij} = \mathbb{P}(X_{k+1} = j \mid X_k = i).$$

If the micro state jump process is Markovian, the probability of observing a certain realization $X_1, \dots, X_N$ of the process depends only on the number of transitions between pairs of states in $X_1, \dots, X_N$ together with the probability to start in $X_1$. Thus the matrix of transition counts $C$, together with the probability of the initial state, $p(X_1)$, completely determines the probability of a given observation for a fixed $P$,

$$p(X_1, \dots, X_N \mid P) = p(C \mid P)\, p(X_1). \tag{1}$$

As a result of Markovianity, the probability of observing transition counts $c_{ij}$ given a set of transition probabilities $p_{ij}$ is given by the multinomial distribution

$$p(C \mid P) \propto \prod_{i,j=1}^{n} p_{ij}^{c_{ij}}. \tag{2}$$
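As a minimal sketch of how the count matrix $C$ and the logarithm of the likelihood (2) could be evaluated for a discrete trajectory, consider the following Python fragment. The function names `count_matrix` and `log_likelihood` and the toy trajectory are illustrative assumptions, not part of the original work; transitions are counted at a lag of one step.

```python
import math

def count_matrix(traj, n):
    """Count observed transitions i -> j between consecutive frames (hypothetical helper)."""
    C = [[0] * n for _ in range(n)]
    for i, j in zip(traj[:-1], traj[1:]):
        C[i][j] += 1
    return C

def log_likelihood(C, P):
    """log p(C|P) up to an additive constant: sum_ij c_ij * log(p_ij), cf. eq. (2)."""
    total = 0.0
    for c_row, p_row in zip(C, P):
        for c, p in zip(c_row, p_row):
            if c > 0:
                if p <= 0.0:
                    return float("-inf")  # an observed transition has zero probability
                total += c * math.log(p)
    return total

# Toy trajectory on three states (illustrative data only).
traj = [0, 0, 1, 1, 0, 2, 2, 1]
C = count_matrix(traj, 3)
P_uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
ll = log_likelihood(C, P_uniform)
```

Terms with $c_{ij} = 0$ are skipped, so entries $p_{ij} = 0$ are harmless as long as the corresponding transition was never observed.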

However, we need the probability of a certain transition matrix given an observation of transition counts, $p(P \mid C)$. Bayes' theorem can be used to relate $p(C \mid P)$ to $p(P \mid C)$ via

$$p(P \mid C) \propto p(C \mid P)\, p(P).$$

Using a suitable conjugate prior with prior counts $b_{ij}$, as outlined in Ref. 32, we find that this probability is given by a product of Dirichlet distributions

$$p(P \mid C) \propto \prod_{i,j=1}^{n} p_{ij}^{c_{ij} + b_{ij}}. \tag{3}$$

Here the following normalization condition for row-stochasticity of $P$ is assumed to hold,

$$\sum_{k=1}^{n} p_{ik} = 1, \quad i = 1, \dots, n. \tag{4}$$

The structure of (3) makes it possible to generate independent Dirichlet distributed rows if no additional constraints on $P$ are imposed51,52,57,58. If one desires to restrict the space of all admissible transition matrices to those obeying a detailed balance condition

$$\pi_i p_{ij} = \pi_j p_{ji}, \tag{5}$$

the additional interdependence between rows prevents samples from (3) being generated by direct sampling of individual rows. In Ref. 1 a Metropolis-Hastings Monte Carlo method is developed to generate random transition matrices from (3) under the detailed balance constraint. In the following we will only consider the situation in which the stationary probabilities have already been computed using a different simulation algorithm. Note that fixing $\pi_1, \dots, \pi_n$ and requiring detailed balance reduces the number of independent variables $p_{ij}$ from $n(n-1)$ to $n(n-1)/2$. This is a 50% reduction in dimension and we expect that imposing this extra symmetry will have a large effect when comparing quantities estimated with and without these constraints. In the following we will use the normalization condition (4) to determine the diagonal of $P$ from the off-diagonal elements,

$$p_{ii} = 1 - \sum_{k \neq i} p_{ik}, \quad i = 1, \dots, n,$$

and the detailed balance condition (5) in combination with the fixed stationary vector to determine the lower triangular part of $P$ from the upper triangular one,

$$p_{ji} = \frac{\pi_i}{\pi_j}\, p_{ij}, \quad 1 \le i < j \le n.$$
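The parametrization by the upper triangular elements can be illustrated with a short sketch. The helper `assemble_reversible` and the numbers below are assumptions chosen for illustration; the code fills the lower triangle via detailed balance (5) and the diagonal via the normalization (4).

```python
def assemble_reversible(upper, pi):
    """Build the full matrix P from upper-triangular entries {(i, j): p_ij}, i < j.

    Hypothetical helper: the lower triangle follows from detailed balance,
    the diagonal from row normalization.
    """
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for (i, j), p in upper.items():
        P[i][j] = p
        P[j][i] = pi[i] / pi[j] * p  # p_ji = (pi_i / pi_j) * p_ij
    for i in range(n):
        P[i][i] = 1.0 - sum(P[i][k] for k in range(n) if k != i)
    return P

# Illustrative stationary vector and n(n-1)/2 = 3 free parameters.
pi = [0.5, 0.3, 0.2]
upper = {(0, 1): 0.1, (0, 2): 0.05, (1, 2): 0.2}
P = assemble_reversible(upper, pi)
```

Not every choice of upper-triangular values is admissible: the resulting diagonal elements must remain non-negative, which is exactly the coupling between elements that the conditional densities in the next section account for.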

This approach for incorporating a priori knowledge about stationary probabilities has a straightforward generalization to situations in which the stationary probabilities are not precisely known. If one has obtained a probability model for the stationary probabilities

$$p(\pi \mid E) \tag{6}$$

from an enhanced sampling method $E$, one can incorporate this prior knowledge of $\pi$ into a probability model for $P$. The probability model for $P$ given the evidence $C$ and $E$ is given by

$$p(P \mid C, E) = \int d\pi\; p(P \mid C, \pi)\, p(\pi \mid E). \tag{7}$$

Samples $P \sim p(P \mid C, E)$ can be generated by iteratively drawing $\pi$ from $p(\pi \mid E)$ and then $P$ from $p(P \mid C, \pi)$.

III. CONDITIONAL PROBABILITIES

The Gibbs sampling strategy facilitates sampling of a joint distribution by generating random variates from the conditionals. In the following we will show that for a fixed stationary vector $(\pi_1, \dots, \pi_n)$ all the conditionals of $p(P \mid C)$ have a simple analytical form. Furthermore we will outline a method to generate random variates efficiently from all conditionals for all possible configurations of $P$ and $\pi$. For the sake of brevity of notation we will often suppress the fixed observation $C$ when stating relations for the conditionals. There are only four factors in the joint probability (3) with an explicit dependence on the transition matrix element $p_{ij}$. The element $p_{ii}$ is linked to $p_{ij}$ by constraint (4), $p_{ji}$ is related to $p_{ij}$ by (5), and finally $p_{jj}$ depends on $p_{ij}$ through a combination of (4) and (5). For this reason the conditional probability for $p_{ij}$ is given conditioned on the following set of transition matrix elements

$$\{p_{11}, \dots, p_{nn}\} \setminus \{p_{ii}, p_{ij}, p_{ji}, p_{jj}\}.$$

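In a slight abuse of notation we indicate this conditioning on the above set by writing the conditional density for $p_{ij}$ as $p(p_{ij} \mid p_{k \neq i,j,\, l \neq i,j})$. It is given by

$$p(p_{ij} \mid p_{k \neq i,j,\, l \neq i,j}) \propto p_{ij}^{c_{ij}}\, p_{ji}^{c_{ji}}\, p_{ii}^{c_{ii}}\, p_{jj}^{c_{jj}}.$$

Plugging in the constraints (4), (5) we get

$$p(p_{ij} \mid p_{k \neq i,j,\, l \neq i,j}) \propto p_{ij}^{c_{ij} + c_{ji}} \times \Big( \big(1 - \sum_{k \neq i,j} p_{ik}\big) - p_{ij} \Big)^{c_{ii}} \times \Big( \big(1 - \sum_{k \neq j,i} p_{jk}\big) - \frac{\pi_i}{\pi_j}\, p_{ij} \Big)^{c_{jj}},$$

explicitly showing the univariate dependence on $p_{ij}$. Now we define

$$\Delta_{ij} = 1 - \sum_{k \neq i,j} p_{ik}, \tag{8}$$

$$\Lambda_{ij} = \frac{\pi_j}{\pi_i} \Big(1 - \sum_{k \neq j,i} p_{jk}\Big). \tag{9}$$

Using these we can rewrite the conditional density as

$$p(p_{ij} \mid p_{k \neq i,j,\, l \neq i,j}) \propto p_{ij}^{c_{ij} + c_{ji}}\, (\Delta_{ij} - p_{ij})^{c_{ii}}\, (\Lambda_{ij} - p_{ij})^{c_{jj}}. \tag{10}$$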

We assume that $\Delta_{ij} \le \Lambda_{ij}$. Then we can define

$$x = \frac{p_{ij}}{\Delta_{ij}}$$

and the following parameters,

$$a = c_{ij} + c_{ji}, \tag{11}$$

$$b = c_{ii}, \tag{12}$$

$$c = c_{jj}, \tag{13}$$

$$d = \frac{\Lambda_{ij}}{\Delta_{ij}}. \tag{14}$$

In the case $\Delta_{ij} > \Lambda_{ij}$ we switch the definitions of $b$ and $c$, and define $d = \Delta_{ij} / \Lambda_{ij}$ and $x = p_{ij} / \Lambda_{ij}$. It can be seen that in both cases $a, b, c \ge 0$, $d \ge 1$, and $0 \le x \le 1$. After a little algebra we get

$$p(x \mid a, b, c, d) \propto x^{a} (1 - x)^{b} (d - x)^{c} \tag{15}$$

with $0 \le x \le 1$. This means that if we can generate random variates from $p(x \mid a, b, c, d)$ efficiently for all admissible parameters, we can efficiently sample all conditional densities arising during a Gibbs sampling procedure. The dependence of the conditionals for $p_{ij}$ on both $c_{ij}$ and $c_{ji}$ clearly reflects the additional symmetry imposed by the detailed balance condition.
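The mapping from a pair $(i, j)$ to the parameters of (15) can be sketched as follows. The function name `conditional_parameters`, the returned `scale` value (the factor $\min(\Delta_{ij}, \Lambda_{ij})$ such that $p_{ij} = x \cdot \mathrm{scale}$), and the example matrices are illustrative assumptions.

```python
def conditional_parameters(P, C, pi, i, j):
    """Return (a, b, c, d, scale) for the conditional density of p_ij, i != j.

    scale = min(Delta_ij, Lambda_ij), so that p_ij = x * scale with 0 <= x <= 1.
    """
    n = len(pi)
    delta = 1.0 - sum(P[i][k] for k in range(n) if k not in (i, j))  # eq. (8)
    lam = pi[j] / pi[i] * (1.0 - sum(P[j][k] for k in range(n) if k not in (i, j)))  # eq. (9)
    a = C[i][j] + C[j][i]
    if delta <= lam:
        return a, C[i][i], C[j][j], lam / delta, delta
    # Delta > Lambda: the roles of b and c are switched, and d = Delta / Lambda.
    return a, C[j][j], C[i][i], delta / lam, lam

# Illustrative reversible matrix with stationary vector pi, and made-up counts.
pi = [0.5, 0.3, 0.2]
P = [[0.85, 0.10, 0.05],
     [0.5 / 3.0, 1.9 / 3.0, 0.20],
     [0.125, 0.30, 0.575]]
C = [[10, 2, 1], [3, 8, 2], [1, 2, 6]]
a, b, c, d, scale = conditional_parameters(P, C, pi, 0, 1)
x = P[0][1] / scale  # current value of the rescaled variable
```

In this example $\Delta_{01} = 0.95$ exceeds $\Lambda_{01} = 0.48$, so the second branch applies and $b$ and $c$ trade places.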

A. Log-concave densities

We can write the density $p(x \mid a, b, c, d)$ in the following way,

$$p(x \mid a, b, c, d) = e^{q(x \mid a, b, c, d)},$$

with

$$q(x \mid a, b, c, d) = a \log(x) + b \log(1 - x) + c \log(d - x).$$

The second derivative of $q(x \mid a, b, c, d)$ is given by

$$q''(x \mid a, b, c, d) = -\frac{a}{x^2} - \frac{b}{(1 - x)^2} - \frac{c}{(d - x)^2}.$$

It is easy to see that

$$q''(x \mid a, b, c, d) \le 0$$

for all $0 \le x \le 1$ and all parameters $a, b, c \ge 0$ and $d \ge 1$. This is a sufficient condition for $q(x \mid a, b, c, d)$ to be a concave function and therefore all conditionals $p(x \mid a, b, c, d)$ fall into the category of log-concave densities. There exist efficient approaches for the generation of random variates from a log-concave density given explicit knowledge of the mode point and the ability to evaluate the density $p(x)$ and the first derivative of its logarithm $q(x) = \log p(x)$. For an overview of methods to sample from log-concave densities see Ref. 58. The crucial feature employed by all these methods is that any concave function $q: \Omega \to \mathbb{R}$ is bounded from above by all its tangents, so that for all $x_0$ for which $q'(x_0)$ exists, the following holds

$$q(x) \le q(x_0) + q'(x_0)(x - x_0) \quad \forall x \in \Omega.$$

Since the exponential function is a monotone function we have

$$f(x) = e^{q(x)} \le e^{q(x_0) + q'(x_0)(x - x_0)}.$$

The global maximum or mode point of $p(x \mid a, b, c, d)$ is attained at $x_m$ with

$$q'(x_m \mid a, b, c, d) = 0$$

subject to the constraint $0 \le x_m \le 1$. We have

$$q'(x \mid a, b, c, d) = x^{-1} (1 - x)^{-1} (d - x)^{-1} \left\{ a(1 - x)(d - x) - b x (d - x) - c x (1 - x) \right\}.$$

It is obvious that it suffices to find the zeros of

$$a(1 - x)(d - x) - b x (d - x) - c x (1 - x).$$

This expression is at most quadratic in $x$ for all admissible parameters. Therefore extremal points of $q(x \mid a, b, c, d)$ are given by

$$x_{1,2} = \frac{1}{2(a + b + c)} \left[ (a + b) d + (a + c) \pm \sqrt{r} \right],$$

with

$$r = [(a + b) d + (a + c)]^2 - 4 (a + b + c)\, a d.$$

It is apparent that $p(x \mid a, b, c, d)$ has zeros at $x_0 = 0$, $x_0 = 1$ and $x_0 = d$. Recall that $d \ge 1$. This means that there is one extremal point in $[0, 1]$ and one extremal point in $[1, d]$. Therefore we conclude that $x_m$ corresponds to the smaller one of the two extremal points,

$$x_m = \frac{1}{2(a + b + c)} \left[ (a + b) d + (a + c) - \sqrt{r} \right]. \tag{16}$$

We note that the mode point need not lie in the interior of the unit interval, so that $x_m = 0$ and $x_m = 1$ are possible values.
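The mode formula (16) translates directly into code. The following sketch uses a hypothetical function name `mode`; the handling of the degenerate case $a = b = c = 0$, where the density is flat, is an arbitrary choice made here and not taken from the text.

```python
import math

def mode(a, b, c, d):
    """Mode of x^a (1-x)^b (d-x)^c on [0, 1] via eq. (16); assumes a, b, c >= 0, d >= 1."""
    s = a + b + c
    if s == 0:
        return 0.5  # flat density: every point is a mode (arbitrary choice, an assumption)
    t = (a + b) * d + (a + c)
    # r is non-negative in exact arithmetic; clip tiny negative rounding errors.
    r = max(t * t - 4.0 * s * a * d, 0.0)
    return (t - math.sqrt(r)) / (2.0 * s)

# Parameters of Figure 1.
xm = mode(8.0, 2.0, 4.0, 30.0)
```

For $a = 0$ the expression reduces to $x_m = 0$, in line with the remark that the mode may sit on the boundary of the unit interval.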

B. Optimal piecewise approximation

We will use a piecewise enveloping function $g(x \mid a, b, c, d)$ bounding $p(x \mid a, b, c, d)$, consisting of a uniform density around the mode point and exponential tails elsewhere. For log-concave densities $f(x)$ it is possible to use the following general approach to find an enveloping function $g(x)$ for $f(x)$. Let again $q(x) = \log f(x)$. Consider the following piecewise defined function $h(x)$,

$$h(x) = \begin{cases} q(x_l) + q'(x_l)(x - x_l), & -\infty < x < x_l, \\ q(x_m), & x_l \le x \le x_u, \\ q(x_u) + q'(x_u)(x - x_u), & x_u < x < \infty. \end{cases}$$

Here $x_l$ and $x_u$ denote the lower and the upper bound of a region around $x_m$ in which $f(x)$ will be bounded by a uniform density $f(x_m)\, \chi_{[x_l, x_u]}(x)$. As a consequence of concavity the function $h(x)$ is a valid dominating function for $q(x)$. Thus $g(x) = e^{h(x)}$ is a valid enveloping function for $f(x)$. Figure 1 shows $p(x)$ and the enveloping density $g(x)$. The optimal choice for $x_l$ and $x_u$ is the one that minimizes the area between $g(x)$ and $f(x)$, leading to the lowest possible rejection rate. One can show58 that a pair $x_l^* \le x_m$, $x_u^* \ge x_m$ is optimal if

$$f(x_l^*) = \frac{f(x_m)}{e}, \quad f(x_u^*) = \frac{f(x_m)}{e}.$$

Figure 1: The conditional density $p(x \mid a, b, c, d)$ for $a = 8.0$, $b = 2.0$, $c = 4.0$ and $d = 30.0$ (solid line) and the corresponding enveloping function $g(x)$ (dashed line). The density was scaled to the mode point value $p(x_m \mid a, b, c, d)$ to fit it into the range $[0, 1]$. (Axes: $x$ from 0 to 1, $p(x)$ from 0 to 1; the points $x_l$ and $x_u$ are marked.)

Here $e$ denotes Euler's number. If $f^{-1}$ is explicitly known, finding the optimal solution is straightforward. It is apparent that for unimodal (continuous) densities $f$ the inverse $f^{-1}$ is always unique on $x \le x_m$ and on $x \ge x_m$. Unfortunately we do not have an explicit expression for $p^{-1}(x \mid a, b, c, d)$. It is however always possible to choose suboptimal points $x_l$ and $x_u$ at the cost of a larger rejection rate. In the case that $x_m$ lies on the boundary of the domain of $f(x)$ the bounding function will be a single exponential function. In the case that $x_l$ or $x_u$ lie outside the domain of definition, one restricts $h(x)$ by truncating it at the boundary points so that only one exponential tail or only the constant part will survive.

C. Suboptimal piecewise approximation

Unfortunately we do not have the inverse to the conditional density (15). We will use additional information about $p(x \mid a, b, c, d)$ to make a good, although suboptimal, choice for $x_l$ and $x_u$. In the following let $0 < x_m < 1$. A second order Taylor expansion of $q(x \mid a, b, c, d)$ around the mode point yields a Gaussian approximation of the conditional density,

$$p(x) \approx \exp\left( q(x_m) + \frac{q''(x_m)}{2} (x - x_m)^2 \right).$$

The standard deviation is given by $\sigma = \sqrt{-1 / q''(x_m)}$. We simply set

$$x_l = x_m - \sigma, \quad x_u = x_m + \sigma.$$

A comparison between the optimal points and the ones obtained from the Gaussian approximation to $p(x)$ is shown in Figure 2.
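To make the construction concrete, the sketch below computes $x_l$ and $x_u$ from the Gaussian approximation and evaluates the dominating function $h(x)$. All function names are illustrative assumptions, and the code assumes an interior mode with $0 < x_l < x_u < 1$, as holds for the parameters of Figure 1.

```python
import math

def q(x, a, b, c, d):
    """Logarithm of the unnormalized conditional density, eq. (15)."""
    return a * math.log(x) + b * math.log(1.0 - x) + c * math.log(d - x)

def dq(x, a, b, c, d):
    """First derivative of q."""
    return a / x - b / (1.0 - x) - c / (d - x)

def d2q(x, a, b, c, d):
    """Second derivative of q (non-positive)."""
    return -a / x**2 - b / (1.0 - x)**2 - c / (d - x)**2

def mode(a, b, c, d):
    """Mode point, eq. (16)."""
    s, t = a + b + c, (a + b) * d + (a + c)
    return (t - math.sqrt(max(t * t - 4.0 * s * a * d, 0.0))) / (2.0 * s)

def envelope_points(a, b, c, d):
    """Return (x_l, x_m, x_u) with x_l, x_u one Gaussian standard deviation from the mode."""
    xm = mode(a, b, c, d)
    sigma = math.sqrt(-1.0 / d2q(xm, a, b, c, d))
    return xm - sigma, xm, xm + sigma

def h(x, xl, xm, xu, a, b, c, d):
    """Piecewise dominating function: tangent lines in the tails, constant q(x_m) in between."""
    if x < xl:
        return q(xl, a, b, c, d) + dq(xl, a, b, c, d) * (x - xl)
    if x > xu:
        return q(xu, a, b, c, d) + dq(xu, a, b, c, d) * (x - xu)
    return q(xm, a, b, c, d)

params = (8.0, 2.0, 4.0, 30.0)  # parameters of Figure 1
xl, xm, xu = envelope_points(*params)
gap = min(h(k / 1000.0, xl, xm, xu, *params) - q(k / 1000.0, *params)
          for k in range(1, 1000))
```

The quantity `gap` is the smallest value of $h(x) - q(x)$ on a grid in $(0, 1)$; concavity of $q$ implies that it cannot be negative beyond rounding error.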

Figure 2: The picture shows the location of the optimal points $x_l^*$ and $x_u^*$ for $p(x)$ (solid line) and the points $x_l$ and $x_u$ obtained from the Gaussian approximation (dashed line). The Gaussian approximation touches the density at the mode point. (Axes: $x$ from 0 to 1, $p(x)$ from 0 to 1; the levels $e^{-1}$ and $e^{-0.5}$ are marked.)

D. Rejection sampling using the envelope

It is straightforward to generate random variates from the individual pieces of the enveloping function. There are fast and reliable implementations for the generation of uniform as well as exponential random variates, and we can use rejection to sample from pieces of $p(x)$ individually. We can decompose $p(x)$ as follows

$$p(x) = p_1\, p(x)\, \chi_{[0, x_l]}(x) + p_2\, p(x)\, \chi_{[x_l, x_u]}(x) + p_3\, p(x)\, \chi_{[x_u, 1]}(x)$$

with $p_i$ denoting the weights of the individual pieces. We do not know the discrete probabilities $p_i$ a priori, but there is a simple and efficient algorithm circumventing the need for $p_i$ altogether. The algorithm only requires the weights $w_i$ of the individual pieces of $g(x)$. Due to the simple form of $g(x)$, obtaining analytic expressions for $w_i$ is straightforward. The following algorithm 1 is a variant of the modified composition method that can be found in Ref. 58 (p. 69). In the following let $0 < x_l < x_u < 1$. The treatment of special cases is straightforward but requires additional branches in the algorithm, complicating notation. The discrete probabilities $w_i$ of the individual dominating pieces are given by

$$w_i = \frac{r_i}{\sum_{k=1}^{3} r_k}, \tag{17}$$

with

$$r_1 = \int_{-\infty}^{x_l} dx\; p(x_l)\, e^{q'(x_l)(x - x_l)} = \frac{p(x_l)}{q'(x_l)}, \tag{18}$$

$$r_2 = \int_{x_l}^{x_u} dx\; p(x_m) = p(x_m)(x_u - x_l), \tag{19}$$

$$r_3 = \int_{x_u}^{\infty} dx\; p(x_u)\, e^{q'(x_u)(x - x_u)} = \frac{p(x_u)}{-q'(x_u)}. \tag{20}$$

In order to increase numerical stability for cases in which $a$, $b$, $c$, and $d$ have large values, (15) is usually scaled to the mode point value. Additionally the logarithm of the final acceptance condition in algorithm 1,

$$\log(U) + h(x) \le q(x),$$

can be tested instead of the condition $U g(x) < p(x)$. The case $d \gg 1$ can for example occur in situations in which the probabilities for states differ by orders of magnitude, $\pi_i \ll \pi_j$, since $d \propto \pi_j / \pi_i$.

Algorithm 1: Sample $p(x \mid a, b, c, d)$

    Input: a, b, c, d
    Output: x
    Compute x_m using (16)
    σ = sqrt(−1 / q''(x_m))
    x_l = x_m − σ
    x_u = x_m + σ
    Compute w_1, w_2, and w_3 using (17), (18), (19), (20)
    repeat
        Z ∼ χ_[0,1]
        U ∼ χ_[0,1]
        if Z < w_1 then
            repeat
                y ∼ e^{−q'(x_l) y}
                x = x_l − y
            until x ≥ 0
        else if Z < w_1 + w_2 then
            y ∼ χ_[0,1]
            x = x_l + (x_u − x_l) y
        else
            repeat
                y ∼ e^{q'(x_u) y}
                x = x_u + y
            until x ≤ 1
        end
    until U g(x) < p(x)
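A possible Python rendering of this rejection scheme is sketched below. It is not the EMMA implementation; the function names are illustrative, the code assumes $a, b > 0$ so that the mode lies in the interior of the unit interval, and it departs from the pseudocode in one respect that is a deliberate choice here: a tail proposal falling outside $[0, 1]$ is treated as an ordinary rejection of the outer loop, which keeps the untruncated weights (18) and (20) consistent with the proposal distribution. Points $x_l \le 0$ or $x_u \ge 1$ are handled by dropping the corresponding tail.

```python
import math
import random

def q(x, a, b, c, d):
    return a * math.log(x) + b * math.log(1.0 - x) + c * math.log(d - x)

def dq(x, a, b, c, d):
    return a / x - b / (1.0 - x) - c / (d - x)

def d2q(x, a, b, c, d):
    return -a / x**2 - b / (1.0 - x)**2 - c / (d - x)**2

def mode(a, b, c, d):
    s, t = a + b + c, (a + b) * d + (a + c)
    return (t - math.sqrt(max(t * t - 4.0 * s * a * d, 0.0))) / (2.0 * s)

def sample_conditional(a, b, c, d, rng=random):
    """Draw x with density proportional to x^a (1-x)^b (d-x)^c (assumes a, b > 0, d >= 1)."""
    args = (a, b, c, d)
    xm = mode(*args)
    sigma = math.sqrt(-1.0 / d2q(xm, *args))
    xl, xu = max(xm - sigma, 0.0), min(xm + sigma, 1.0)
    qm = q(xm, *args)
    # Piece weights r_1, r_2, r_3 of eqs. (18)-(20), scaled by the mode value p(x_m).
    if xl > 0.0:
        ql, dql = q(xl, *args), dq(xl, *args)
        r1 = math.exp(ql - qm) / dql
    else:
        ql = dql = r1 = 0.0  # no lower tail
    if xu < 1.0:
        qu, dqu = q(xu, *args), dq(xu, *args)
        r3 = math.exp(qu - qm) / (-dqu)
    else:
        qu = dqu = r3 = 0.0  # no upper tail
    r2 = xu - xl
    while True:
        z = rng.random() * (r1 + r2 + r3)
        if z < r1:  # lower exponential tail
            x = xl - rng.expovariate(dql)
            log_g = ql + dql * (x - xl)
        elif z < r1 + r2 or r3 == 0.0:  # uniform piece around the mode
            x = xl + (xu - xl) * rng.random()
            log_g = qm
        else:  # upper exponential tail
            x = xu + rng.expovariate(-dqu)
            log_g = qu + dqu * (x - xu)
        if not 0.0 < x < 1.0:
            continue  # outside the support: p(x) = 0, so the proposal is rejected
        u = 1.0 - rng.random()  # uniform on (0, 1]
        if math.log(u) + log_g <= q(x, *args):  # log form of U g(x) <= p(x)
            return x

rng = random.Random(1)
xs = [sample_conditional(8.0, 2.0, 4.0, 30.0, rng) for _ in range(20000)]
```

Working with $q$ and $h$ in logarithmic form avoids overflow for large counts, in the spirit of the scaling to the mode point value mentioned in the text.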

E. Modified rejection sampling for large d values

In the case $d \gg 1$ we can use an alternative strategy to generate samples according to (15). We can rewrite

$$p(x \mid a, b, c, d) \propto g(x \mid a, b)\, \psi(x \mid c, d)$$

with

$$g(x \mid a, b) = \frac{\Gamma(a + b + 2)}{\Gamma(a + 1)\, \Gamma(b + 1)}\, x^{a} (1 - x)^{b}$$

and

$$\psi(x \mid c, d) = \left( \frac{d - x}{d} \right)^{c}.$$

The density $g(x \mid a, b)$ is the usual beta density (with shape parameters $a + 1$ and $b + 1$), which can be efficiently sampled, and $\psi(x \mid c, d)$ is a $[0, 1]$ valued function. The modified rejection method58, algorithm 2, can be used to generate samples from $p(x \mid a, b, c, d)$. The algorithm is efficient for cases in which $\psi(x \mid c, d) \approx 1$ for all $x \in [0, 1]$. In the case $d \gg 1$ we obtain

$$\psi(x \mid c, d) \approx 1 - \frac{c x}{d}$$

using a Taylor expansion in $x / d$. Since $x \in [0, 1]$ the algorithm is efficient for $c / d \ll 1$. It is straightforward to see that the efficiency of algorithm 2 increases with growing $d$.

Algorithm 2: Modified rejection algorithm

    Input: a, b, c, d
    Output: x
    repeat
        x ∼ g(x | a, b)
        U ∼ χ_[0,1]
    until U < ψ(x | c, d)
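Algorithm 2 is short enough to be sketched with the Python standard library alone. The function name is an assumption; note that a density proportional to $x^a (1 - x)^b$ corresponds to shape parameters $a + 1$ and $b + 1$ in the parametrization used by `random.betavariate`.

```python
import random

def sample_modified_rejection(a, b, c, d, rng=random):
    """Draw x with density proportional to x^a (1-x)^b (d-x)^c via a beta proposal.

    The proposal x^a (1-x)^b is Beta(a + 1, b + 1); psi = ((d - x) / d)^c lies in [0, 1]
    and serves as the acceptance probability.
    """
    while True:
        x = rng.betavariate(a + 1.0, b + 1.0)
        if rng.random() < ((d - x) / d) ** c:
            return x

rng = random.Random(0)
xs = [sample_modified_rejection(8.0, 2.0, 4.0, 30.0, rng) for _ in range(20000)]
```

The sampler is exact for any $d \ge 1$; only its acceptance rate degrades when $c / d$ is not small, which is why the text recommends it for large $d$.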

IV. MAXIMUM LIKELIHOOD ESTIMATION

The maximum likelihood estimate $T^*$ is the optimal point of the likelihood function $p(C \mid P)$. In other words, the given observation $C$ is most likely to be generated by the optimal model $T^*$. The multinomial form of the likelihood and the linear nature of the constraints make it possible to reformulate the problem of finding the maximum likelihood estimate as a convex optimization problem. Thus the global optimum $T^*$ can be found efficiently. For a thorough introduction and exhaustive overview see Ref. 59.

We note that $\log$ is a strictly monotone function, so that finding the maximal point of $p(C \mid P)$ is equivalent to finding the maximal point of the log-likelihood function,

$$l(C \mid P) = \log p(C \mid P).$$

Finding the reversible transition matrix with given stationary distribution $\pi$ maximizing $l(C \mid P)$ can be stated as the following optimization problem,

$$\begin{aligned} \text{minimize} \quad & -\sum_{i,j=1}^{n} c_{ij} \log p_{ij}, \\ \text{subject to} \quad & -p_{ij} \le 0, && 1 \le i, j \le n, \\ & \sum_{k=1}^{n} p_{ik} = 1, && 1 \le i \le n, \\ & \pi_i p_{ij} - \pi_j p_{ji} = 0, && 1 \le i < j \le n. \end{aligned}$$

This is a constrained optimization problem in $n^2$ variables. A reduction in the number of independent variables can be achieved by eliminating constraints and explicitly incorporating them into the objective function and the remaining constraints. There exist a number of numerical libraries for the solution of convex optimization problems. We have used the freely available Python cvxopt module60. The numerical solution of a convex optimization problem usually proceeds by iteratively updating a suboptimal point through the solution of a system of linear equations containing the first and second order derivatives of the objective function and all non-linear constraints, as well as the matrices specifying the linear constraints. To start the iterative scheme one needs a valid initial point. In the following we will outline how we can compute a reversible transition matrix $P$ with fixed stationary distribution $\pi$ from any given, possibly non-reversible, transition matrix $Q$. Our method is similar to an approach outlined in Ref. 61. The guiding idea is the mechanism underlying the Metropolis-Hastings algorithm, which transforms an arbitrary transition matrix into one that is reversible with respect to a given stationary distribution. Denote by $a_{ij}$ the following weights,

$$a_{ij} = \min\left\{ 1, \frac{\pi_j q_{ji}}{\pi_i q_{ij}} \right\}. \tag{21}$$

These are precisely the weights in the Metropolis-Hastings algorithm. Define a new transition matrix $P^{(0)}$ by

$$p^{(0)}_{ij} = \begin{cases} a_{ij}\, q_{ij}, & i \neq j, \\ 1 - \sum_{k \neq i} a_{ik}\, q_{ik}, & i = j. \end{cases} \tag{22}$$

Observe that the diagonal elements never decrease under such a transformation, $p^{(0)}_{ii} \ge q_{ii}$ for all $i = 1, \dots, n$. We will use the transformation outlined above in order to generate a valid starting point for the likelihood maximization scheme. Since $Q$ is arbitrary we choose it to be the non-reversible maximum likelihood estimator,

$$q_{ij} = \frac{c_{ij}}{\sum_{k=1}^{n} c_{ik}}.$$

We generate $P^{(0)}$ by enforcing the reversibility condition with respect to $\pi$ using (21), (22).
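The construction of the starting point can be sketched in a few lines. The function names and the example counts are illustrative assumptions; the code assumes that every row of $C$ contains at least one count.

```python
def nonreversible_mle(C):
    """Row-normalized counts, q_ij = c_ij / sum_k c_ik."""
    return [[c / sum(row) for c in row] for row in C]

def enforce_detailed_balance(Q, pi):
    """Metropolis-Hastings style reversibilization of Q with respect to pi, eqs. (21), (22)."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and Q[i][j] > 0.0:
                a_ij = min(1.0, pi[j] * Q[j][i] / (pi[i] * Q[i][j]))
                P[i][j] = a_ij * Q[i][j]
        # The rejected probability mass is added to the diagonal.
        P[i][i] = 1.0 - sum(P[i][k] for k in range(n) if k != i)
    return P

# Illustrative counts and stationary vector.
C = [[10, 2, 1], [3, 8, 2], [1, 2, 6]]
pi = [0.5, 0.3, 0.2]
Q = nonreversible_mle(C)
P0 = enforce_detailed_balance(Q, pi)
```

By construction $\pi_i p^{(0)}_{ij} = \min\{\pi_i q_{ij}, \pi_j q_{ji}\}$, which is symmetric in $i$ and $j$, so detailed balance holds for any $Q$.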

As an alternative, the likelihood maximization can be performed by iteratively maximizing the conditional probabilities of $p_{ij}$ for $i < j$. This is either done by the reversible transition matrix estimator described in Ref. 32, or by the following iterative algorithm, algorithm 3. The former algorithm is implemented in EMMA56 by the mm_estimateFixedPi command.
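The iterative scheme of algorithm 3 amounts to a coordinate-wise maximization in which each $p_{ij}$ is set to the mode of its conditional density. The following sketch is one possible reading, not the EMMA code. The names, the tolerance `tol`, the iteration cap `max_iter`, and the simple starting matrix are assumptions, and strictly positive counts are assumed so that no matrix element is driven to zero.

```python
import math

def mode(a, b, c, d):
    """Mode point of x^a (1-x)^b (d-x)^c, eq. (16); assumes a + b + c > 0."""
    s, t = a + b + c, (a + b) * d + (a + c)
    return (t - math.sqrt(max(t * t - 4.0 * s * a * d, 0.0))) / (2.0 * s)

def log_likelihood(C, P):
    return sum(c * math.log(p) for cr, pr in zip(C, P) for c, p in zip(cr, pr) if c > 0)

def iterative_mle(C, pi, P0, tol=1e-12, max_iter=10000):
    """Sweep over all pairs i < j, replacing p_ij by the mode of its conditional."""
    n = len(pi)
    P = [row[:] for row in P0]
    ll_old = log_likelihood(C, P)
    for _ in range(max_iter):
        for i in range(n):
            for j in range(i + 1, n):
                delta = P[i][i] + P[i][j]                    # eq. (8)
                lam = pi[j] / pi[i] * (P[j][j] + P[j][i])    # eq. (9)
                a = C[i][j] + C[j][i]
                if delta <= lam:
                    b, c, d, scale = C[i][i], C[j][j], lam / delta, delta
                else:
                    b, c, d, scale = C[j][j], C[i][i], delta / lam, lam
                p_ij = mode(a, b, c, d) * scale
                P[i][j] = p_ij
                P[i][i] = delta - p_ij
                P[j][i] = pi[i] / pi[j] * p_ij
                P[j][j] = pi[i] / pi[j] * (lam - p_ij)
        ll_new = log_likelihood(C, P)
        if abs(ll_new - ll_old) < tol:
            break
        ll_old = ll_new
    return P

C = [[10, 2, 1], [3, 8, 2], [1, 2, 6]]   # illustrative counts
pi = [0.5, 0.3, 0.2]
n = len(pi)
# Simple reversible starting point: uniform proposal made reversible with respect to pi.
P0 = [[min(1.0, pi[j] / pi[i]) / n if i != j else 0.0 for j in range(n)] for i in range(n)]
for i in range(n):
    P0[i][i] = 1.0 - sum(P0[i])
P_star = iterative_mle(C, pi, P0)
```

Each elementary update maximizes the likelihood along one coordinate while preserving constraints (4) and (5), so the log-likelihood cannot decrease from one sweep to the next.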

Algorithm 3: Iterative maximum likelihood estimation with fixed stationary distribution

    Input: π, C
    Output: P*
    P = P^(0)(π, C)
    S = P
    δ = ∞
    while δ > ε do
        for i ∈ {1, ..., n} do
            for j ∈ {1, ..., n} do
                if i < j then
                    Compute parameters Δ_ij, Λ_ij, a, b, c, d.
                    Mode point x_m = x_m(a, b, c, d)
                    p_ij = x_m · min(Δ_ij, Λ_ij)
                    p_ii = Δ_ij − p_ij
                    p_ji = (π_i / π_j) p_ij
                    p_jj = (π_i / π_j) Λ_ij − p_ji
                end
            end
        end
        δ = |l(C|P) − l(C|S)|
        S = P
    end
    P* = P

V. A GIBBS SAMPLER FOR TRANSITION MATRICES WITH FIXED STATIONARY DISTRIBUTION

The Gibbs sampling approach as first presented in Ref. 62 achieves the following. Let $p(x_1, \dots, x_n)$ be a given joint probability distribution and denote by $p_i(x_i)$ the conditional distribution of the $i$-th variable given all others, $p_i(x_i) = p(x_i \mid x_{j \neq i})$. It can be shown that under certain conditions (Lemma 10.11 in Ref. 63) the ability to generate random variates from all conditionals $p_i(x_i)$ is sufficient to generate samples from the joint distribution $p(x_1, \dots, x_n)$.

The algorithm can be stated as follows. Denote by $(x_1^{(k)}, \dots, x_n^{(k)})$ the $k$-th random vector generated by the algorithm. Then a new sample is generated by "sweeping" through the vector, updating all coordinates from the respective conditional densities. In other words, for $i$ in $1, \dots, n$,

$$x_i^{(k+1)} \sim p\left(x_i \mid x_1^{(k+1)}, \dots, x_{i-1}^{(k+1)}, x_{i+1}^{(k)}, \dots, x_n^{(k)}\right).$$

There exist several variants of the Gibbs sampling algorithm. Especially popular is the "random scan" version, which picks $i$ from $\{1, \dots, n\}$ at random and returns a new sample after $n$ such updates instead of sweeping through all coordinates in succession.
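The sweep can be illustrated on a target whose conditionals are known in closed form. The following is a minimal sketch, assuming a standard bivariate normal with correlation `rho` as a stand-in target (it is not the transition matrix posterior of this paper); in that case $x_1 \mid x_2 \sim N(\rho x_2, 1 - \rho^2)$ and symmetrically for $x_2$. The function name is hypothetical.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, rng):
    """Systematic-scan Gibbs sampler for a standard bivariate normal."""
    x1, x2 = 0.0, 0.0
    s = np.sqrt(1.0 - rho ** 2)          # conditional standard deviation
    samples = np.empty((n_samples, 2))
    for k in range(n_samples):
        x1 = rng.normal(rho * x2, s)     # conditions on x2 from the previous sweep
        x2 = rng.normal(rho * x1, s)     # conditions on the x1 just updated
        samples[k] = x1, x2
    return samples

rng = np.random.default_rng(0)
samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000, rng=rng)
print(np.corrcoef(samples.T)[0, 1])      # should be close to rho
```

Only draws from the one-dimensional conditionals are needed; the joint density is never evaluated.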

Recall that the conditional distribution of $p_{ij}$ is given by (10) with parameters $\Delta_{ij}$ and $\Lambda_{ij}$ explicitly given by (8) and (9). Having obtained an explicit expression for $p(p_{ij} \mid p_{k \neq i,j,\, l \neq i,j})$ for all $i<j$, we can proceed to construct a Gibbs sampling algorithm to generate random variates $P$ from the joint distribution. In order to start the Markov chain we need a valid initial transition matrix $P^{(0)}$ obeying detailed balance with respect to the given stationary distribution $(\pi_i)_{1 \le i \le n}$. The matrix $P^{(0)}$ should also be irreducible. Choosing any irreducible transition matrix $q_{ij}$ and enforcing detailed balance with respect to $\pi$ according to (22) results in a valid initial transition matrix possessing the desired properties. However, we recommend using the maximum likelihood estimate obtained above as the starting point for the Gibbs chain, so that transition matrices are drawn from regions of high probability immediately. The computation of parameters during the Gibbs sampling procedure can be simplified by noting that

$$\Delta_{ij}^{(k)} = p_{ij}^{(k-1)} + p_{ii}^{(k-1)},$$

$$\Lambda_{ij}^{(k)} = \frac{\pi_j}{\pi_i}\left(p_{jj}^{(k-1)} + p_{ji}^{(k-1)}\right),$$

where $k$ denotes the step in the Gibbs sampling chain. In other words, coupling between elements is mediated only by the diagonal elements $p_{ii}$. This gives rise to algorithm 4. The usual procedure returns only after $l$ such elementary steps have been taken, resulting in a sequence $P^{(0)}, P^{(l)}, \dots, P^{(Nl)}$ of transition matrices. Here $l$ denotes the number of independent variables, $l = n(n-1)/2$.

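Algorithm 4: Gibbs sampling of $P$ with fixed stationary distribution $\pi$

Input: $\pi$, $C$, $P^{(k-1)}$
Output: $P^{(k)}$
repeat
    $i \sim \{1, \dots, n\}$
    $j \sim \{1, \dots, n\}$
until $i < j$
$\Delta_{ij}^{(k)} = p_{ij}^{(k-1)} + p_{ii}^{(k-1)}$
$\Lambda_{ij}^{(k)} = \frac{\pi_j}{\pi_i}\left(p_{jj}^{(k-1)} + p_{ji}^{(k-1)}\right)$
if $\Delta_{ij}^{(k)} \le \Lambda_{ij}^{(k)}$ then
    $d = \Lambda_{ij}^{(k)} / \Delta_{ij}^{(k)}$
    $b = C_{ii}$
    $c = C_{jj}$
else
    $d = \Delta_{ij}^{(k)} / \Lambda_{ij}^{(k)}$
    $b = C_{jj}$
    $c = C_{ii}$
end
Sample $x \sim x^a (1-x)^b (d-x)^c$ using algorithm 1
$p_{ij}^{(k)} = x \cdot \min(\Delta_{ij}^{(k)}, \Lambda_{ij}^{(k)})$
$p_{ii}^{(k)} = \Delta_{ij}^{(k)} - p_{ij}^{(k)}$
$p_{ji}^{(k)} = \frac{\pi_i}{\pi_j} p_{ij}^{(k)}$
$p_{jj}^{(k)} = \frac{\pi_i}{\pi_j} \Lambda_{ij}^{(k)} - p_{ji}^{(k)}$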
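One elementary step of algorithm 4 might be sketched as follows. Several parts are assumptions and not the authors' implementation: the exponent is taken as $a = c_{ij} + c_{ji}$ with no prior counts, $d$ is taken as the ratio of the larger to the smaller of $\Delta_{ij}$ and $\Lambda_{ij}$ so that $d \ge 1$, and the draw from $x^a(1-x)^b(d-x)^c$ uses plain rejection sampling with a Beta$(a+1, b+1)$ proposal in place of algorithm 1. The starting matrix `P` is a hand-built matrix that is reversible with respect to $\pi$ from (24), and all function names are hypothetical.

```python
import numpy as np

def sample_conditional(a, b, c, d, rng):
    """Draw x in [0, 1] with density proportional to x^a (1-x)^b (d-x)^c, d >= 1.
    Simple rejection sampler (NOT algorithm 1): propose from Beta(a+1, b+1)
    and accept with probability ((d - x) / d)^c <= 1."""
    while True:
        x = rng.beta(a + 1.0, b + 1.0)
        if rng.random() < ((d - x) / d) ** c:
            return x

def gibbs_step(P, pi, C, i, j, rng):
    """Update the coupled elements p_ij, p_ii, p_ji, p_jj in place (i != j)."""
    delta = P[i, j] + P[i, i]
    lam = pi[j] / pi[i] * (P[j, j] + P[j, i])
    a = C[i, j] + C[j, i]                      # assumption: no prior counts
    if delta <= lam:
        d, b, c = lam / delta, C[i, i], C[j, j]
    else:
        d, b, c = delta / lam, C[j, j], C[i, i]
    x = sample_conditional(a, b, c, d, rng)
    P[i, j] = x * min(delta, lam)
    P[i, i] = delta - P[i, j]
    P[j, i] = pi[i] / pi[j] * P[i, j]
    P[j, j] = pi[i] / pi[j] * lam - P[j, i]
    return P

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.01, 0.49])
C = np.array([[100, 5, 0], [20, 4, 20], [0, 8, 75]], dtype=float)
p21 = 0.4 * 0.01 / 0.49                        # chosen so that pi_1 p_12 = pi_2 p_21
P = np.array([[0.99, 0.01, 0.0], [0.5, 0.1, 0.4], [0.0, p21, 1.0 - p21]])
for _ in range(300):
    i, j = sorted(rng.choice(3, size=2, replace=False))
    gibbs_step(P, pi, C, i, j, rng)
print(P)
```

Each step preserves the row sums and detailed balance by construction. The Beta proposal becomes inefficient when $c$ is large and $d$ is close to one, which is one reason a tailored envelope is preferable for long chains.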

In the case that $\pi$ has some uncertainty specified by the probability model (6), the generation of a compatible ensemble of transition matrices can be achieved using (7), given that samples of $\pi$ can be generated according to $p(\pi|E)$. The following algorithm 5 generates $P \sim p(P|C, E)$. The number $k$ is usually taken as the minimal number of runs to decorrelate from the starting point $P^{(0)}$.


Algorithm 5: Gibbs sampling of $P$ with uncertain stationary distribution

Input: $C$, $E$, $k$
Output: $P$
$\pi \sim p(\pi|E)$
Compute $P^{(0)}$ according to (22) or via algorithm 3
for $i \in \{1, \dots, k\}$ do
    Generate $P^{(i)}$ via algorithm 4 using $\pi$, $C$, $P^{(i-1)}$
end
$P = P^{(k)}$

A. Enforcing sparsity - a prior for metastable dynamics

The equilibrium dynamics of proteins often exhibits metastability. Thus any transition matrix characterizing an approximation via a Markov jump process on conformation space should also exhibit traits of metastability. As discussed in [64], metastable Markov processes on discrete state spaces can be understood in terms of nearly uncoupled Markov chains, with small transition probabilities between blocks defining the dynamics within a single metastable subset. For finite observations of the metastable process, the small probabilities for transitions between metastable sets and the zero probabilities of forbidden transitions might become indistinguishable in an ensemble of transition matrices generated by a sampling approach with no prior information. If the uncertainties of small but non-zero transition probabilities are of the same order as those that correspond to forbidden transitions, it might not be possible to recover the desired metastable properties from the generated ensemble. In physical systems there are of course no forbidden transitions, since all transition probabilities are strictly positive. However, many of these might still be orders of magnitude smaller than the transition probabilities between metastable regions. In practice we will not observe any such transitions in a finite realization of our process, not even for a typical realization long enough to achieve sufficient sampling of metastable transitions. In this case we can treat them, to a very good approximation, as forbidden transitions.

We will show how one can enforce the generation of an ensemble of transition matrices compatible with the sparsity structure of the given observations. If one assumes detailed balance with respect to $\pi$, observed transitions $c_{ij}$ as well as observed transitions $c_{ji}$ indicate nonzero probabilities $p_{ij}$, $i<j$. Therefore we conclude that whenever $c_{ij} + c_{ji} > 0$ the probability of $p_{ij} > 0$ is positive, $\mathbb{P}(p_{ij} > 0) > 0$. To enforce sampling of metastable transition matrices we require that $p_{ij} = 0$ if $c_{ij} + c_{ji} = 0$, $i<j$, for all $P$ in the sample. In other words, the sparsity structure of $C + C^T$ is enforced for all $P$. This sparsity prior is equivalent to using a prior count of $-1$ on all $(i, j)$ with $c_{ij} + c_{ji} = 0$ in (3) (see supplementary information of [2]). The sparse prior is applied by restricting the sampling algorithm to those elements $p_{ij}$ for which $c_{ij} + c_{ji} > 0$. It is apparent that the matrix $Q$ with elements

$$q_{ij} = \frac{c_{ij} + c_{ji}}{\sum_k (c_{ik} + c_{ki})}$$

possesses the desired sparse structure. Furthermore, generating $P^{(0)}$ according to (22) does not change the sparsity structure of the off-diagonal elements of $Q$. Starting from a transition matrix with the desired sparse structure, it is straightforward to restrict the Gibbs sampling algorithm by updating only elements for which $i<j$ and $c_{ij} + c_{ji} > 0$. Denote this index set by

$$\theta = \{(i, j) \in \mathbb{N}^2 \mid 1 \le i < j \le n,\ c_{ij} + c_{ji} > 0\}.$$

In algorithm 6 we outline a method to generate a sample of reversible transition matrices with fixed stationary distribution obeying the sparse structure given by $\theta$.

Algorithm 6: Gibbs sampling of $P$ with fixed stationary distribution $\pi$ - Sparse version

Input: $\pi$, $C$, $P^{(k-1)}$, $\theta$
Output: $P^{(k)}$
Draw $(i, j)$ uniformly from $\theta$
Proceed as in algorithm 4
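The index set $\theta$ and the sparse matrix $Q$ can be computed directly from the counts. The following is a minimal sketch using the count matrix (23) with zero-based indices; the function names are hypothetical, and the sketch assumes every state has at least one observed count so that no row sum vanishes.

```python
import numpy as np

C = np.array([[100, 5, 0],
              [20, 4, 20],
              [0, 8, 75]], dtype=float)

def sparsity_set(C):
    """Pairs (i, j), i < j, with c_ij + c_ji > 0 (zero-based indices)."""
    S = C + C.T
    n = S.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if S[i, j] > 0]

def symmetrized_row_stochastic(C):
    """q_ij = (c_ij + c_ji) / sum_k (c_ik + c_ki)."""
    S = C + C.T
    return S / S.sum(axis=1, keepdims=True)

theta = sparsity_set(C)
Q = symmetrized_row_stochastic(C)

rng = np.random.default_rng(0)
i, j = theta[rng.integers(len(theta))]   # uniform draw from theta, as in algorithm 6
print(theta, (i, j))
```

For this count matrix the pair of states 1 and 3 (indices 0 and 2) has no observed transitions in either direction, so it is excluded from $\theta$ and the corresponding element of $Q$ is zero.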

VI. RESULTS

In the following we will show that the Gibbs sampling outlined above converges much faster than the Metropolis-Hastings approach developed in [1]. Recall that the general Metropolis-Hastings approach generates random variates from the density $p(x)$ by choosing proposals $y$, conditioned on the current state of the chain $x$, from a proposal density $q(x, y)$ and accepts proposed samples with the following acceptance probability,

$$a(x, y) = \min\left\{1, \frac{p(y)\,q(y, x)}{p(x)\,q(x, y)}\right\}.$$

The crucial difference between Gibbs sampling and Metropolis-Hastings sampling is that the Metropolis chain remains in the current state as long as the proposed value is rejected, while the Gibbs sampling approach generates a new sample at each step. This possibility of remaining in the current state usually leads to longer correlation times for the Metropolis chain than for the Gibbs chain. Thus one needs to run longer Metropolis chains than Gibbs chains to achieve an equal degree of convergence. On the other hand, Gibbs sampling requires that random variates can be generated from all conditionals efficiently, while the Metropolis chain can be advanced using a possibly very simple proposal density $q(x, y)$. In the following we will compare our current sampling approach with the one developed in [1] and demonstrate improved convergence and more rapidly decaying autocorrelation.
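The rejection behavior described above can be seen in a few lines. The following is a minimal sketch of a Metropolis-Hastings chain with a symmetric uniform proposal, so that $q(x, y) = q(y, x)$ cancels in the acceptance probability. The one-dimensional target $x^5 (1-x)^{100}$ on $(0, 1)$ and the step width are illustrative assumptions, not the sampler of [1].

```python
import numpy as np

def log_p(x):
    """Unnormalized log-density of x^5 (1-x)^100 on (0, 1)."""
    if 0.0 < x < 1.0:
        return 5.0 * np.log(x) + 100.0 * np.log1p(-x)
    return -np.inf

def metropolis_hastings(log_p, x0, n_steps, step, rng):
    x = x0
    chain = np.empty(n_steps)
    n_accept = 0
    for k in range(n_steps):
        y = x + step * rng.uniform(-1.0, 1.0)      # symmetric proposal
        log_a = min(0.0, log_p(y) - log_p(x))      # log of min{1, p(y) / p(x)}
        if np.log(rng.random()) < log_a:
            x = y
            n_accept += 1
        chain[k] = x                               # a rejection repeats the old state
    return chain, n_accept / n_steps

rng = np.random.default_rng(0)
chain, rate = metropolis_hastings(log_p, x0=0.05, n_steps=20000, step=0.05, rng=rng)
print(chain.mean(), rate)
```

Every rejected proposal duplicates the previous state in `chain`, which is the source of the longer correlation times compared with a sampler that draws a fresh value at each step.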


We start with a simple model using the following count matrix

$$C = \begin{pmatrix} 100 & 5 & 0 \\ 20 & 4 & 20 \\ 0 & 8 & 75 \end{pmatrix} \tag{23}$$

and stationary distribution

$$\pi = (0.5, 0.01, 0.49) \tag{24}$$

to assess the convergence properties of the two approaches.

A. Conditional distributions

We have generated a sample of $N = 10^6$ random variates from the conditional density $p(x|a, b, c, d)$ (15) for various choices of the parameters $a$, $b$, $c$, $d$ to demonstrate the ability of our new method to correctly generate random variates from all possible conditional densities. In Figure 3 we compare the shape of each histogram to the graph of the exact density function for the same parameter values. The figures clearly indicate that all densities have been correctly sampled.

B. Convergence of mean values and variances

In order to assess the quality of a Monte Carlo sampling procedure one usually computes the standard error of the mean of an observable $O$ estimated from a finite sample generated by evolving the chain for a finite number of steps. As observables we choose the value of individual matrix elements, $O = p_{ij}$, and the value of the second largest implied time scale, $O = t_2$. We have generated a maximum likelihood reversible transition matrix of the count matrix (23) with stationary distribution (24) using the algorithm in [56]. Then $n_{\mathrm{ensemble}} = 100$ independent Gibbs samplers using algorithm 4 were run, taking $N$ steps in the range $10^2 \dots 10^5$, estimating $E(p_{ij})$ and $E(t_2)$ for each $(n_{\mathrm{ensemble}}, N)$. Then we have estimated the standard deviation over the sample for each (fixed) $N$. See Figure 4a for a comparison of the convergence of $E(p_{ij})$ between the two sampling approaches. The slowest relaxation timescale $t_2$ is an example of a global observable with a functional dependence on all elements $p_{ij}$, so that the expectation value $E(t_2)$ is a suitable measure to assess the convergence of general observables. A comparison of the convergence behavior is shown in Figure 4b. We can also choose to observe the variance of individual matrix elements as well as the variance of the second largest implied timescale. The setup is identical to the one outlined for mean values. Figure 4 also shows the convergence of the variance $V(x)$ for a single matrix element, Figure 4c, as well as for the implied timescale, Figure 4d. The figures clearly indicate the improved convergence properties of the presented approach, which requires two orders of magnitude fewer sampling steps than the previous MCMC sampler in [1] to achieve a similar error level.
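The ensemble error estimate described above can be sketched as follows. As a loudly labeled assumption, each "sampler" is replaced by a synthetic AR(1) chain with a known amount of autocorrelation, and prefixes of the same chains are reused for the different lengths $N$ rather than starting fresh runs; only the bookkeeping (per-chain estimate, then standard deviation over the ensemble) mirrors the text.

```python
import numpy as np

def ar1_chains(n_chains, n_steps, phi, rng):
    """Stand-in for independent sampler runs: x_k = phi * x_{k-1} + noise,
    scaled to have stationary mean 0 and variance 1."""
    x = rng.normal(size=n_chains)
    out = np.empty((n_chains, n_steps))
    scale = np.sqrt(1.0 - phi ** 2)
    for k in range(n_steps):
        x = phi * x + scale * rng.normal(size=n_chains)
        out[:, k] = x
    return out

rng = np.random.default_rng(1)
chains = ar1_chains(n_chains=100, n_steps=10_000, phi=0.9, rng=rng)

lengths = [100, 1_000, 10_000]
# Standard deviation over the ensemble of the per-chain mean estimate E(O).
sigma_mean = [chains[:, :N].mean(axis=1).std(ddof=1) for N in lengths]
# The same for the per-chain variance estimate V(O).
sigma_var = [chains[:, :N].var(axis=1).std(ddof=1) for N in lengths]
for N, sm, sv in zip(lengths, sigma_mean, sigma_var):
    print(N, sm, sv)
```

Plotting `sigma_mean` and `sigma_var` against `lengths` on logarithmic axes gives curves of the kind shown in Figure 4; a chain with stronger autocorrelation shifts the curve upward.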

C. Autocorrelation functions

As another measure of the improved convergence properties we compare the mixing times of the Metropolis chain and the Gibbs chain. We have generated a sample of $10^5$ transition matrices for the above count matrix (23) and stationary distribution (24). Each sample was generated by advancing the chain using a single Gibbs or Metropolis step. Let $X_k$ be the value of the observable for the $k$-th sample. We have estimated the normalized autocorrelation function

$$\rho_X(n) = \frac{E[(X_k - \mu)(X_{k+n} - \mu)]}{\sigma^2}$$

using the following estimator for the sample autocorrelation,

$$\rho_X(n) = \frac{1}{\sigma^2 (N - l)} \sum_{k=0}^{N-1-l} (X_k - \mu)(X_{k+n} - \mu),$$

with $0 \le n \le l$. The autocorrelation function for $p_{13}$ in Figure 5a clearly indicates the faster decay of autocorrelations for the Gibbs sampler. The autocorrelation function for the second largest implied time scale demonstrates a significant improvement over the previous approach, see Figure 5b. The number of steps required in order to generate decorrelated samples, $n_{\mathrm{decorr}}$, has been estimated by assuming an exponential decay of the autocorrelation function,

$$\rho(n) = e^{-n / n_{\mathrm{decorr}}}.$$

The area under the graph of the autocorrelation function, which equals $n_{\mathrm{decorr}}$ for this exponential form, was computed using the trapezoidal rule and used as an estimate for $n_{\mathrm{decorr}}$. Values for $n_{\mathrm{decorr}}$ as well as for the corresponding decorrelation time, $t_{\mathrm{decorr}}$, for the observables $p_{13}$ and $t_2$ can be found in Table I. $n_{\mathrm{decorr}}$ is two orders of magnitude smaller for the Gibbs sampler than for the Metropolis approach. Due to the comparable speed of elementary sampling steps for both algorithms, the improved decorrelation constant $n_{\mathrm{decorr}}$ leads to a similar improvement in decorrelation time.
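The estimator and the trapezoidal area can be written compactly. The following is a minimal sketch, checked on a synthetic AR(1) series with coefficient 0.9 (an assumption standing in for sampler output), for which $\rho(n) = 0.9^n$ and the area under the autocorrelation function is close to 9.5. Function names are hypothetical.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """rho(n) for n = 0..max_lag, each averaged over the first N - max_lag terms."""
    x = np.asarray(x, dtype=float)
    mu, var = x.mean(), x.var()
    m = len(x) - max_lag
    return np.array([np.dot(x[:m] - mu, x[n:n + m] - mu) / (var * m)
                     for n in range(max_lag + 1)])

def decorrelation_steps(rho):
    """Area under rho(n) by the trapezoidal rule with unit spacing."""
    return 0.5 * (rho[:-1] + rho[1:]).sum()

rng = np.random.default_rng(0)
noise = rng.normal(size=200_000)
x = np.empty_like(noise)
x[0] = noise[0]
for k in range(1, len(x)):
    x[k] = 0.9 * x[k - 1] + noise[k]     # synthetic correlated series

rho = autocorrelation(x, max_lag=100)
n_decorr = decorrelation_steps(rho)
print(n_decorr)
```

The maximum lag should be chosen several times larger than the expected $n_{\mathrm{decorr}}$, otherwise the tail of the autocorrelation function is cut off and the area is underestimated.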

D. Application to simulation data

In order to demonstrate the performance of the transi-

tion matrix sampling method we have applied the pre-

sented transition matrix Gibbs sampling algorithm to

simulation data for the synthetic peptide MR121-GSGS-

W. Trajectories were obtained by standard equilibrium

dynamics simulations of a constant volume ensemble at

Figure 3: Conditional density $p(x)$ (solid line), plotted against $x \in [0, 1]$, for different parameters $a$, $b$, $c$, and $d$. The histograms show a sample of $N = 10^6$ random variates generated using the method outlined above. First row, $c = 4$, $d = 10$: (a) $a = 5$, $b = 0$, (b) $a = 5$, $b = 2$, (c) $a = 0$, $b = 2$. Second row, $c = 40$, $d = 100$: (d) $a = 100$, $b = 5$, (e) $a = 100$, $b = 100$, (f) $a = 5$, $b = 100$. Third row: (g) $a = 0$, $b = 0$, $c = 4$, $d = 10$, (h) $a = 0.5$, $b = 0.2$, $c = 40$, $d = 100$, (i) $a = 0$, $b = 3 \cdot 10^4$, $c = 4 \cdot 10^3$, $d = 10^4$.

293 K in explicit water with the Berendsen thermostat using the Gromacs [65] simulation software. Each of the two trajectories used has a total length of 4 µs with trajectory frames separated by a time step of 10 ps. A detailed description of the simulation setup can be found in the supplementary information of [46]. The trajectories were clustered using regular spatial clustering of RMSD distances using EMMA [56]. A spatial cutoff of 3.5 nm resulted in a clustering with 107 distinct micro states. In order to obtain an estimate for the stationary probabilities of each micro state, a Markov model with a lag time of 10 ns was generated using the reversible transition matrix estimator presented in [32]. The stationary distribution was obtained from the estimated transition matrix as the left eigenvector with eigenvalue 1. A corresponding matrix containing transition counts between individual micro states was obtained by counting transitions at the same lag time. The sparse prior for metastable dynamics presented above was used to generate an ensemble of transition matrices using both the Metropolis and the Gibbs sampling procedure. We have started $n_{\mathrm{ensemble}} = 100$ independent chains with $N$ steps in the range $10^5 \dots 10^7$ and estimated $E(t_2)$ and $V(t_2)$ for each $(n_{\mathrm{ensemble}}, N)$. The standard deviation was estimated over the sample for each (fixed) $N$. In order to speed up the computation we have estimated $t_2$ from a spectral decomposition only after $l$ elementary sampling steps. We have chosen $l$ as the number of nonzero independent transition probabilities

Figure 4: Results obtained for the model system with count matrix (23) and stationary distribution (24). Standard deviation for estimated mean and variance of observables is plotted against the number of elementary sampling steps $N$ (range $10^2$ to $10^5$). (a) mean transition matrix element $E(p_{13})$, (b) mean of the second largest implied time scale $E(t_2)$, (c) transition matrix element variance $V(p_{13})$, (d) variance of the second largest implied time scale $V(t_2)$. The Gibbs sampler introduced here (solid line) converges faster than the Metropolis chain from [1] (dashed line) by almost two orders of magnitude. For the mean second largest time scale $E(t_2)$, (b), the achieved speedup is more than one order of magnitude.

$p_{ij}$. Figure 6 clearly indicates the improved convergence properties of the presented approach. Here, the Gibbs procedure needs one order of magnitude fewer sampling steps to reach a similar error level. A comparison of the autocorrelation functions for $t_2$, Figure 7, shows an order of magnitude smaller decorrelation constant $n_{\mathrm{decorr}}$ for the Gibbs sampler compared to the Metropolis sampler. Table I shows $n_{\mathrm{decorr}}$ with the corresponding decorrelation time $t_{\mathrm{decorr}}$. Due to the comparable speed of elementary sampling steps for both algorithms, the improved decorrelation constant $n_{\mathrm{decorr}}$ leads to a one order of magnitude lower decorrelation time $t_{\mathrm{decorr}}$ for the Gibbs sampling algorithm.

E. Computational eﬃciency

Figure 5: Autocorrelation functions for the model system with count matrix (23) and stationary distribution (24), plotted against the lag $n$ (range 0 to 400). (a) autocorrelation function for the transition matrix element $p_{13}$, (b) autocorrelation function for the second largest implied time scale $t_2$. The number of steps to take until samples are decorrelated, $n_{\mathrm{decorr}}$, is two orders of magnitude smaller for the Gibbs sampling method (solid line), $n_{\mathrm{decorr}} = 3$ for $p_{13}$ as well as for $t_2$, compared to the Metropolis sampling method from [1] (dashed line), $n_{\mathrm{decorr}} = 123$ for $p_{13}$ and $n_{\mathrm{decorr}} = 135$ for $t_2$.

Figure 6: Results obtained for the synthetic peptide MR121-GSGS-W, plotted against the number of elementary sampling steps $N$ (range $10^5$ to $10^8$). Standard deviation of (a) the mean implied time scale $E(t_2)$, (b) the implied time scale variance $V(t_2)$. The Gibbs sampler (solid line) shows a faster convergence than the Metropolis sampler from [1] (dashed line) for mean and variance of the second largest implied timescale $t_2$.

For large entries in the count matrix, $c_{ij} \gg 1$, the affected conditionals will be sharply peaked, so that the uniform proposal densities used in [1] will have very low acceptance rates. This results in slowly mixing chains and large autocorrelation times for the Metropolis algorithm, so that much longer chains have to be run in order to achieve the same level of convergence. Because algorithm 4 only uses standard distributions to envelope the conditionals, generating a random variate by rejection sampling from the conditional density is efficient enough to allow the generation of long chains. The Gibbs sampling algorithm, algorithm 4, was implemented using the colt library [66]. The algorithm performs $10^8$ elementary sampling steps in 89.1 s on a 2 GHz Intel processor. Performing the same number of elementary Metropolis steps takes 83.4 s, with a 27% overall acceptance rate. The acceptance rate for the Metropolis step is also highly dependent on the specific element. The number of steps required in order to generate decorrelated samples, $n_{\mathrm{decorr}}$, as well as the wall-clock decorrelation time, $t_{\mathrm{decorr}}$, is shown in Table I as an indicator of computational efficiency. The decorrelation time is calculated as $n_{\mathrm{decorr}} \cdot t_{\mathrm{sample}}$ with $t_{\mathrm{sample}} = 0.9$ µs for the Gibbs sampling algorithm and $t_{\mathrm{sample}} = 0.8$ µs for the Metropolis algorithm.
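As a worked check of these numbers, the per-step times follow from the quoted wall-clock times for $10^8$ steps, and the decorrelation times for the peptide follow from the values of $n_{\mathrm{decorr}}$ reported in Table I. The rounding to one digit is assumed to match the quoted $t_{\mathrm{sample}}$ values.

```latex
\begin{align*}
t_{\mathrm{sample}}^{\mathrm{Gibbs}} &= \frac{89.1\,\mathrm{s}}{10^{8}} = 0.891\,\mu\mathrm{s} \approx 0.9\,\mu\mathrm{s}, &
t_{\mathrm{sample}}^{\mathrm{Metropolis}} &= \frac{83.4\,\mathrm{s}}{10^{8}} = 0.834\,\mu\mathrm{s} \approx 0.8\,\mu\mathrm{s}, \\
t_{\mathrm{decorr}}^{\mathrm{Gibbs}} &= 4600 \cdot 0.9\,\mu\mathrm{s} = 4.14\,\mathrm{ms}, &
t_{\mathrm{decorr}}^{\mathrm{Metropolis}} &= 33000 \cdot 0.8\,\mu\mathrm{s} = 26.4\,\mathrm{ms}.
\end{align*}
```

The ratio of the two decorrelation times is about 6.4, so the gain comes almost entirely from the smaller $n_{\mathrm{decorr}}$ and not from the cost of a single step.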

VII. COMPARISON WITH NONREVERSIBLE AND

REVERSIBLE ESTIMATION

We have used the following $3 \times 3$ transition matrix,

$$T = \begin{pmatrix} 0.99 & 0.01 & 0.0 \\ 0.45 & 0.1 & 0.45 \\ 0.0 & 0.01 & 0.99 \end{pmatrix},$$

Figure 7: Autocorrelation function $\rho_{t_2}(n)$ for the MR121-GSGS-W peptide count matrix, plotted against the lag $n$ (range 0 to $3.5 \times 10^5$). The number of steps to take until samples are decorrelated, $n_{\mathrm{decorr}}$, is an order of magnitude smaller for the Gibbs sampling method (solid line), $n_{\mathrm{decorr}} = 4600$, compared to the Metropolis sampling method from [1] (dashed line), $n_{\mathrm{decorr}} = 33000$.

Table I: Decorrelation times for estimated mean and variances of observables. The results were generated for the model system with count matrix (23) and stationary distribution (24) ($p_{13}$ and $t_2$) and for the MR121-GSGS-W peptide ($t_2$ only).

                                Gibbs sampler           Metropolis sampler
    Observable                  n_decorr   t_decorr     n_decorr   t_decorr
    3×3 count matrix from (23)
      p13                       3          2.7 µs       123        98.4 µs
      t2                        3          2.7 µs       135        108 µs
    MR121-GSGS-W peptide
      t2                        4600       4.14 ms      33000      26.4 ms

to generate transition counts $C$ by evolving a Markov chain for $N = 5000$ steps starting from micro state $i = 0$. To simulate the effect of having used an efficient enhanced sampling algorithm, the exact stationary distribution

$$\mu = (0.4945, 0.011, 0.4945)$$

and the observed transition counts were used to generate a sample of 100000 random transition matrices using the Gibbs sampling algorithm. For each of the randomly generated transition matrices, the second largest eigenvalue $\lambda_2$ and the corresponding implied time scale $t_2$ were computed. For comparison, the observed transition counts were used in a similar manner to generate a sample of implied time scales without prior knowledge of the stationary distribution, with and without explicitly enforcing a detailed balance condition. It is clearly visible in Figure 8 that the estimation procedure including knowledge about stationary probabilities gives a more accurate and a much sharper estimate of this quantity.
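The reference values for this comparison can be reproduced directly from the transition matrix. The following is a minimal sketch, assuming the third row of $T$ is $(0, 0.01, 0.99)$ so that every row sums to one, and assuming the standard relation $t_2 = -\tau / \ln \lambda_2$ with the lag time $\tau$ set to one step. Function names are hypothetical.

```python
import numpy as np

T = np.array([[0.99, 0.01, 0.00],
              [0.45, 0.10, 0.45],
              [0.00, 0.01, 0.99]])

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalized to sum 1."""
    w, v = np.linalg.eig(P.T)
    k = np.argmin(np.abs(w - 1.0))
    mu = np.real(v[:, k])
    return mu / mu.sum()

def second_implied_timescale(P, lag=1.0):
    """t_2 = -lag / ln(lambda_2), using the real parts of the eigenvalues."""
    lam = np.sort(np.real(np.linalg.eigvals(P)))[::-1]
    return -lag / np.log(lam[1])

mu = stationary_distribution(T)
t2 = second_implied_timescale(T)
print(mu, t2)   # lambda_2 = 0.99, so t2 is about 99.5 steps
```

Applying `second_implied_timescale` to each matrix of a sampled ensemble yields a histogram of the kind shown in Figure 8, to be compared against this exact value.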

VIII. DISCUSSION AND CONCLUSION

We have presented a Gibbs sampling algorithm for the generation of transition matrices fulfilling the detailed balance constraint with respect to a given stationary distribution. The presented algorithm shows a clear improvement in convergence speed and autocorrelation times over the algorithm presented in [1]. We believe that the presented algorithm will be a useful tool for Monte Carlo sampling of transition probabilities when a priori knowledge about stationary probabilities is available in addition to observed transition counts. As already pointed out in [1], enforcing the detailed balance condition can lead to an immense reduction in variance of certain off-diagonal elements, leading to sharper estimates for kinetically relevant observables. With a priori estimates of stationary distributions available from extended ensemble simulations and an increased interest in estimating dynamical quantities of molecular systems from Markov model based approaches, the outlined algorithm will hopefully become a useful statistical tool for the analysis of metastable systems.

There are several directions for future research. An improved scheme for the generation of random variates from the conditional density could further speed up the algorithm, allowing the generation of transition matrices for processes with larger state spaces. Larger acceptance rates could already be achieved by finding better approximations to the optimal boundary points $x_l^*$, $x_u^*$ in the definition of the piecewise enveloping function $h(x)$. In fact the only parameter of the conditional density that is updated after a new sample is drawn is $d$. All possible values for $a$, $b$, $c$ could in principle be calculated a priori, so that one might find a set of $n(n-1)/2$ optimal proposal densities each parametrized by $d$. Finding a transformation removing the parametric dependence of the conditionals on $d$ altogether seems unlikely but would of course open up the possibility for the design of even faster algorithms.

We are currently pursuing an application of the algo-

rithm to data sets obtained by standard molecular dy-

namics simulations together with estimates of the equi-

librium distribution obtained from enhanced sampling

algorithms such as meta-dynamics, generalized ensem-

ble simulations or umbrella sampling to obtain sharper

estimates of dynamical quantities.

ACKNOWLEDGMENTS

The authors would like to thank two anonymous referees for helpful comments and suggestions. One of the authors would like to thank Guillermo Perez, Han Wang and Ivan Kryven for discussions and helpful suggestions.

Figure 8: Histograms $p(t_2)$ for the implied time scale $t_2$ (range 0 to 200) corresponding to the second largest eigenvalue $\lambda_2$. The histogram of timescales generated by incorporating knowledge about stationary probabilities (a) in comparison to the histogram of timescales generated by the non-reversible (b) and reversible method (c). It is clearly visible that the sample mean (dashed line) gives a more accurate prediction of the true value (solid line) due to the additional information about stationary probabilities.

He thanks Luc Devroye for a suggestion concerning the modified rejection sampling approach. He would especially like to thank Sven Krönke for inspiring discussions. B. Trendelkamp-Schroer acknowledges funding by the DFG fund NO 825/3 and from the "Center of Supramolecular Interactions" at FU-Berlin. Frank Noé acknowledges funding from the research center Matheon.

REFERENCES

[1] F. Noé, J. Chem. Phys. 128, 244103 (2008).
[2] F. Noé, C. Schütte, E. Vanden-Eijnden, L. Reich, and T. Weikl, Proc. Natl. Acad. Sci. 106, 19011 (2009).
[3] D. E. Shaw, P. Maragakis, K. Lindorff-Larsen, S. Piana, R. O. Dror, M. P. Eastwood, J. A. Bank, J. M. Jumper, J. K. Salmon, Y. Shan, and W. Wriggers, Science 330, 341 (2010).
[4] V. A. Voelz, G. R. Bowman, K. Beauchamp, and V. S. Pande, J. Am. Chem. Soc. 132, 1526 (2010), PMID: 20070076.
[5] G. R. Bowman, V. A. Voelz, and V. S. Pande, J. Am. Chem. Soc. 133, 664 (2011).
[6] K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw, Science 334, 517 (2011).
[7] S. K. Sadiq, F. Noé, and G. De Fabritiis, Proc. Natl. Acad. Sci. 109, 20449 (2012).
[8] R. Elber, M. Karplus, et al., Science 235, 318 (1987).
[9] J. Honeycutt and D. Thirumalai, Proc. Natl. Acad. Sci. 87, 3526 (1990).
[10] G. Nienhaus, J. Mourant, and H. Frauenfelder, Proc. Natl. Acad. Sci. 89, 2902 (1992).
[11] C. Schütte and W. Huisinga, Handbook of Numerical Analysis 10, 699 (2003).
[12] F. Noé, I. Horenko, C. Schütte, and J. C. Smith, J. Chem. Phys. 126, 155102 (2007).
[13] J. Clarage, T. Romo, B. Andrews, B. Pettitt, and G. Phillips, Proc. Natl. Acad. Sci. 92, 3288 (1995).
[14] Y. Sugita and Y. Okamoto, Chem. Phys. Lett. 314, 141 (1999).
[15] F. Rao and A. Caflisch, J. Chem. Phys. 119, 4035 (2003).
[16] H. Grubmüller, Phys. Rev. E 52, 2893 (1995).
[17] A. Laio and M. Parrinello, Proc. Natl. Acad. Sci. 99, 12562 (2002).
[18] G. Torrie and J. Valleau, J. Comp. Phys. 23, 187 (1977).
[19] A. M. Ferrenberg and R. H. Swendsen, Phys. Rev. Lett. 63, 1195 (1989).
[20] S. Kumar, J. Rosenberg, D. Bouzida, R. Swendsen, and P. Kollman, J. Comp. Chem. 16, 1339 (1995).
[21] S. Trebst and M. Troyer, Computer Simulations in Condensed Matter Systems: From Materials to Chemical Biology Volume 1, 591 (2006).
[22] A. E. Garcia and D. Paschek, J. Am. Chem. Soc. 130, 815 (2008).
[23] C. Zhang and J. Ma, J. Chem. Phys. 132, 244101 (2010).
[24] P. Söderhjelm, G. Tribello, and M. Parrinello, Proc. Natl. Acad. Sci. 109, 5170 (2012).
[25] F. Wang and D. P. Landau, Phys. Rev. Lett. 86, 2050 (2001).
[26] C. Schütte, A. Fischer, W. Huisinga, and P. Deuflhard, J. Comp. Phys. 151, 146 (1999).
[27] W. C. Swope, J. W. Pitera, and F. Suits, J. Phys. Chem. B 108, 6571 (2004).
[28] N. Singhal, C. D. Snow, and V. S. Pande, J. Chem. Phys. 121, 415 (2004).
[29] V. Schultheis, T. Hirschberger, H. Carstens, and P. Tavan, J. Chem. Theory Comp. 1, 515 (2005).
[30] J. Chodera, N. Singhal, V. Pande, K. Dill, and W. Swope, J. Chem. Phys. 126, 155101 (2007).
[31] A. Pan and B. Roux, J. Chem. Phys. 129 (2008).
[32] J. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. Chodera, C. Schütte, and F. Noé, J. Chem. Phys. 134, 174105 (2011).
[33] M. Sarich, F. Noé, and C. Schütte, Multiscale Model. Sim. 8, 1154 (2010).
[34] J. D. Chodera and V. S. Pande, Proc. Natl. Acad. Sci. 108, 12969 (2011).
[35] G. Perez-Hernandez, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé, "Identification of slow molecular order parameters for Markov model construction," (in press).
[36] F. Noé and F. Nüske, "A variational approach to modeling slow processes in stochastic dynamical systems," (2012), arXiv:1211.7103.
[37] T. J. Lane, G. R. Bowman, K. Beauchamp, V. A. Voelz, and V. S. Pande, J. Am. Chem. Soc. 133, 18413 (2011).
[38] G. R. Bowman, V. A. Voelz, and V. S. Pande, Curr. Opin. Struc. Biol. 21, 4 (2011).
[39] K. Beauchamp, D. Ensign, R. Das, and V. Pande, Proc. Natl. Acad. Sci. 108, 12734 (2011).
[40] M. Held, P. Metzner, J. Prinz, and F. Noé, Biophys. J. 100, 701 (2011).
[41] I. Buch, T. Giorgino, and G. De Fabritiis, Proc. Natl. Acad. Sci. 108, 10184 (2011).
[42] D. Huang and A. Caflisch, PLoS Comput. Biol. 7, e1002002 (2011).
[43] G. Bowman and P. Geissler, Proc. Natl. Acad. Sci. 109, 11681 (2012).
[44] D. Sezer, J. H. Freed, and B. Roux, J. Phys. Chem. B 112, 11014 (2008).
[45] W. Zhuang, R. Z. Cui, D.-A. Silva, and X. Huang, J. Phys. Chem. B 115, 5415 (2011).
[46] F. Noé, S. Doose, I. Daidone, M. Löllmann, M. Sauer, J. Chodera, and J. Smith, Proc. Natl. Acad. Sci. 108, 4822 (2011).
[47] B. G. Keller, J.-H. Prinz, and F. Noé, Chem. Phys. 396, 92 (2012).
[48] T. Lelièvre, G. Stoltz, and M. Rousset, Free energy computations: A mathematical perspective (World Scientific, 2010).
[49] G. R. Bowman, K. A. Beauchamp, G. Boxer, and V. S. Pande, J. Chem. Phys. 131, 124101 (2009).
[50] J. D. Chodera and F. Noé, J. Chem. Phys. 133, 105102 (2010).
[51] N. Singhal and V. Pande, J. Chem. Phys. 123, 204909 (2005).
[52] N. Hinrichs and V. Pande, J. Chem. Phys. 126, 244101 (2007).
[53] J. Prinz, M. Held, J. Smith, and F. Noé, Multiscale Model. Sim. 9, 545 (2011).
[54] S. Röblitz, Statistical Error Estimation and Grid-free Hierarchical Refinement in Conformation Dynamics, Ph.D. thesis, FU-Berlin (2008).
[55] S. Bacallado, J. D. Chodera, and V. Pande, J. Chem. Phys. 131, 045106 (2009).
[56] M. Senne, B. Trendelkamp-Schroer, A. S. Mey, C. Schütte, and F. Noé, J. Chem. Theory Comp. 8, 2223 (2012).
[57] S. Kube, M. Weber, T. Simos, and C. Tsitouras, in AIP Conference Proceedings, Vol. 1048 (2008) p. 339.
[58] L. Devroye, Non-uniform random variate generation, Vol. 4 (Springer-Verlag New York, 1986).
[59] S. Boyd and L. Vandenberghe, Convex optimization (Cambridge University Press, 2004).
[60] M. Anderson, J. Dahl, Z. Liu, and L. Vandenberghe, "Interior-point methods for large-scale cone programming," (MIT Press, 2011) pp. 55–84.
[61] Q. Jiang, Construction of Transition Matrices of Reversible Markov Chains, Ph.D. thesis, University of Windsor (2009).
[62] S. Geman and D. Geman, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 721 (1984).
[63] C. Robert, G. Casella, and C. Robert, Monte Carlo statistical methods, Vol. 2 (Springer New York, 1999).
[64] P. Deuflhard, W. Huisinga, A. Fischer, and C. Schütte, Linear Algebra Appl. 315, 39 (2000).
[65] E. Lindahl, B. Hess, and D. Van Der Spoel, J. Mol. Model. 7, 306 (2001).
[66] W. Hoschek, "Uniform, versatile and efficient dense and sparse multi-dimensional arrays," (2000).