Article
Automated learning of generative models for subcellular location: Building blocks for systems biology
Center for Bioimage Informatics, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA.
Cytometry Part A (Impact Factor: 2.93). 12/2007; 71(12):97890. DOI: 10.1002/cyto.a.20487 Source: PubMed
Fulltext
Available from: Robert F MurphyAutomated Learning of Generative Models
for Subcellular Location: Building Blocks
for Systems Biology
Ting Zhao,
1,2
Robert F. Murphy
3,4,5
*
Abstract
The goal of location proteomics is the systematic and comprehensive study of protein
subcellular location. We have previously developed automated, quantitative methods to
identify protein subcellular location families, but there have been no effective means of
communicating their patterns to integrate them with other information for building
cell models. We built generative models of subcellular location that are learned from a
collection of images so that they not only represent the pattern, but also capture its var
iation from cell to cell. Our models contain three components: a nuclear model, a cell
shape model and a proteincontaining object model. We built models for six patterns
that consist primarily of discrete structures. To validate the generated images, we
showed that they are recognized with reasonable accuracy by a classifier trained on real
images. We also showed that the model parameters themselves can be used as features
to discriminate the classes. The models allow the synthesis of images with the expecta
tion that they are drawn from the same underlying statistical distribution as the images
used to train them. They can potentially be combined for many proteins to yield a high
resolution location map in support of systems biology.
' 2007 International Society for
Analytical Cytology
Key terms
location proteomics; generative models; pattern recognition; subcellular location; shape
models; medial axis models; microscope image analysis; cell modeling; systems biology
A cell is a complex system with an enormous number of different types of molecules
that form a large interacting network. The growing field of systems biology seeks to
understand how living systems function by modeling such networks at various levels,
including the interaction of molecules in cells (1–3). Building accurate models
requires not only the chemical properties of the molecules involved, but also their
spatial distributions. This is especially important for proteins because the subcellular
location of a protein is so critical to its function that the same protein can have dif
ferent functions at different locations (4). Thus, cell models will not yield accurate
predictions unless proteins are modeled at their proper locations.
However, it is not easy to integrate subcellular location information into systems
biology. In some cases, cell modeling can be done at a coarse level, such as by consid
ering each major organelle as a single compartment (5). Given that cells go to great
lengths to build complex subcellular structures, however, it is unlikely that coarse
modeling will be sufficient for all purposes.
Thus, we need approaches that can provide information on subcellular location
with as much resolution as possible. These can be divided into predictive methods
and determinative methods. There has been extensive work on prediction of subcellu
lar location from sequence (6–9). A range of methods have been described for learn
ing to make predictions using the sequences of proteins whose location is known,
including methods that use motifs, amino acid composition, homology, and combi
1
Center for Bioimage Informatics,
Carnegie Mellon University, Pittsburgh,
Pennsylvania 15213
2
Department of Biomedical Engineering,
Carnegie Mellon University, Pittsburgh,
Pennsylvania 15213
3
Molecular Biosensor and Imaging
Center, Carnegie Mellon University,
Pittsburgh, Pennsylvania 15213
4
Department of Biological Sciences,
Carnegie Mellon University, Pittsburgh,
Pennsylvania 15213
5
Department of Machine Learning,
Carnegie Mellon University, Pittsburgh,
Pennsylvania 15213
Received 11 September 2007; Accepted 8
October 2007
This work was presented at the XXIII
Congress of the International Society for
Analytical Cytology, Quebec City, Canada,
20—24 May 2006.
Grant sponsor: NSF; Grant number:
EF0331657; Grant sponsor: NIH; Grant
numbers: U54 RR022241, U54 DA021519.
*Correspondence to: Robert F. Murphy;
Carnegie Mellon University, 4400 Fifth
Avenue, Pittsburgh, PA 15213, USA
Email: murphy@cmu.edu
Published online 30 October 2007 in Wiley
InterScience (www.interscience.
wiley.com)
DOI: 10.1002/cyto.a.20487
© 2007 International Society for
Analytical Cytology
Rapid Publication
Cytometry Part A 71A: 978990, 2007
Page 1
nations thereof. The major limitation of current systems is the
resolution of the subcellular location assignments in the train
ing data. These have been at the level of a handful of major
organelles, and thus current systems are unable to predict the
distribution of proteins within subcellular structures. In addi
tion, while some systems can predict that proteins are located
in more than one structure, they are unable to make quantita
tive predictions of the distribution of proteins between those
structures. Lastly, current systems cannot predict dynamic
behaviors such as cycling of proteins between compartments
or changes in distribution resulting from stimuli. Thus, while
the sequencebased machine learning methods that have been
described are in theory well suited to the problem, their utility
will only be fully realized when adequate highresolution
training data are available.
That is the domain of determinative methods. Currently,
the best way to obtain highresolution location information for
many proteins is to acquire images by microscopy. Although
visual examination is widely used to capture information from
the resulting images, a more efficient way to extract and analyze
the location information is using computer vision and machine
learning methods (10,11). Previously we have designed Sub
cellular Location Features (SLF) to describe the patterns in
microscope images (12,13) and the SLF allowed us to develop
methods to determine locations automatically (12,14–17).
These methods can be divided into two categories, classifica
tion and clustering, where the main difference is whether the
set of location patterns is predetermined. In classification, any
input image will be assigned to one of the classes (e.g., major
organelles) that were used to train the classifier. In contrast,
clustering does not assume that the categories are known but
rather finds them by grouping similar patterns. With well
designed features, each category is expected to represent a sin
gle location pattern. Using a consensus clustering approach
that is designed to yield reproducible clusters, we have built a
subcellular location tree for 3D images of CDtagged 3T3 cells
(16) and shown that location patterns falling in the same
humanlabeled category can be separated into statistically dis
tinguishable groups. These groups can be considered as subcel
lular location families, by analogy with families of proteins that
share similar sequence (18) or structure (19).
However, learning what patterns are possible and which
proteins display them is not sufficient for integrating location
information into systems biology studies. Ultimately, location
information must be incorporated into cell models to capture
cell behaviors that depend on proper protein locations. To this
end, images of location patterns can be used directly by some
simulation programs after appropriate segmentation or recon
struction (20,21). However, this approach does not readily
permit the effects of variation in pattern on the results of
simulations to be considered in a systematic way. In other
words, in the absence of a model for the variation in a pattern,
the degree to which results of simulations depend on that
variation can only be assessed by performing simulations for
different input images. However, if a model is available, the
choice of which patterns (and how many patterns) to use for
simulations can be made in a principled manner taking into
account the modes of variation in the pattern. Furthermore,
the use of actual images as a base for simulations does not
permit multiple proteins to be included in the same simula
tion unless they have been simultaneously imaged (e.g., using
multiple fluorescent probes). As the number of proteins to be
included grows, it is increasingly unlikely that they will have
been imaged in the same cell unless specific experiments are
done in contemplation of simulation. Even so, the number of
proteins that can be simultaneously imaged in living cells is
currently less than 10. An important alternative made possible
by pattern models then is to combine models from separate
images. In view of the above, we describe here methods for
building probabilistic models that are learned from images of
a location pattern and which we propose can be used to effec
tively include location information in cell simulations. Since it
is straightforward to collect multichannel images in which
distinguishable fluorescent probes are used to detect a specific
protein in parallel with reference markers (such as DNA), we
design our models to utilize markers for nuclear and cell shape.
The goal of the work described here is to build models of
the distribution of a protein within a given cell type that are
automated in the sense that they are learned directly from a
set of microscope images,
generative in the sense that they are able to synthesize new
examples of the pattern observed in that set,
statistically accurate in the sense that they reflect the varia
tion in the pattern from cell to cell,
compact in the sense that they can be communicated with
significantly fewer bits than the training data.
With these definitions, we formalize the problem as
Given a set of threechannel microscope images containing
information in separate channels about the position of the
cell boundary (plasma membrane), the distribution of nu
clear DNA, and the distribution of a specific protein,
Build an automated, generative, statistically accurate, and
compact model of the distribution of that protein inside a
cell described by its nuclear DNA distribution and plasma
membrane location.
Our approach will be to construct a nested set of condi
tional models. We start by using a medial axis model to repre
sent nuclear shape and a texture model to represent DNA dis
tribution within that shape. Using the nucleus as a starting
point, we generate a cell shape model. Finally, the nuclear
shape and cell shape serve as a framework for locating specific
proteins. The work described here uses 2dimensional images
and models to illustrate the principles and feasibility of the
approach, and it is important to note that other choices are
possible for each component of the model and that no claim
of optimality is made. Work on building 3dimensional mod
els and using other model components is in progress. In this
initial work, we also consider only interphase nuclei to sim
plify the task; future work will focus on a dynamic model
incorporating variation across the cell cycle.
RAPID P UBLICATION
Cytometry Part A 71A: 978990, 2007 979
Page 2
MATERIALS AND METHODS
For development and testing of the algorithms described
here, the images from the 3D HeLa dataset described pre
viously (14) were used (available from http://murphylab.
web.cmu.edu/data/3Dhela_images.html). The dataset contains
three fluorescence channels for each field, reflecting the distri
butions of DNA, total protein, and one of nine specific pro
teins. Each field has been previously segmented into single cell
regions using a seeded watershed approach (14). Since the
models we describe here are twodimensional, we extracted
from each 3D stack the 2D slice that contained the largest total
intensity in the DNA channel. Of the 454 images in the data
set, 7 did not have a complete cell boundary in this slice. We
therefore ignored these images leaving 447 images for the stu
dies described here. Protein location models were created for
six of the proteins in the dataset, including giantin, gpp130
(both Golgi proteins), LAMP2 (a lysosomal protein), a mito
chondrial protein, nucleolin (a nucleolar protein) and trans
ferrin receptor (an endosomal protein).
The algorithms used in this work were implemented
using Matlab (7.0 R14, The MathWorks) and all code is avail
able from http://murphylab.web.cmu.edu/software. While the
specific algorithms for creating each component of the model
are described in the Results, additional implementation details
are presented here. For modeling nuclear shape, a principal
axis alignment was done for each nuclear (DNA) image as
described previously (22). Briefly, after thresholding using the
RidlerCalvard method, each nucleus was rotated to align its
major axis and rotated an additional 1808 if necessary to
match the sign of the skewness along the minor axis. For mod
eling nuclear texture, texture synthesis toolboxes were down
loaded from http://www.nealen.net/projects/texsynth/hts_code.
zip and http://www.cns.nyu.edu/lcv/texture. The hybrid tex
ture synthesis code was modified slightly to avoid searching
patches of background pixels so that background would not
be counted as a part of the texture. For modeling cell shape,
the total protein image was thresholded just above the most
common pixel intensity and the largest resulting object (the
cell) was then filled and outlined. For estimating Gaussian
object mixtures, EM code originally from the NETLAB library
(http://www.ncrg.aston.ac.uk/netlab) for Matlab was used.
Some modifications were made to support estimating the
weighted Gaussian mixture. For classify ing patterns, the code
to calculate feature set SLF7DNA from the SLIC library
(http://murphylab.web.cmu.edu/software) was used. The SDA
implementation in SLIC was also used. Code for training and
using support vector machines was obtained from the
LIBSVM library (http://www.csie.ntu.edu.tw/cjlin/libsvm).
A Gaussian kernel was used and the parameters were searched
automatically for best performance on the training set in each
trial.
RESULTS
Medial Axis Model of Nuclear shape
Building a model for nuclear shape is the first step in our
modeling procedure. This is important in its own right given
the critical role of the nucleus in duplicating and expressing
genetic information. In addition to indicating the state of the
cell cycle, the size and shape of the nucleus can affect gene
expression and protein synthesis (23).
Interphase nuclei typically have a shape similar to an ovoid
or ellipse (Fig. 1), which is a uniaxial object. We have therefore
used a method adapted from medial axis transformation (24–
26) to fit the shape. For a 2D shape, the traditional definition of
a medial axis is a set of the centers of the circles that support the
shape. This is the same as building a Voronoi graph for all the
points on the shape boundary in the 2D space. In our version of
medial axis representation, we restrict the distance calculation
of the Voronoi graph to one dimension, the xaxis. This can
avoid generating branches, which will make a shape much
more complicated to model. It also reduces the complexity of
computation, as is described below.
We consider the shape of a nucleus to be described by a
parametric curve [x(t), y (t)]. We can define another curve
h(u) 5 [y(t
1
) 1 y(t
2
)]/2, where t
1
and t
2
satisfy xðt
1
Þ¼
xðt
2
Þ¼u, yðt
1
Þ¼maxfyðtÞjxðtÞ¼ug and yðt
2
Þ¼min
fyðtÞjxðtÞ¼ug. These definitions find the highest and lowest
points along a series of lines perpendicular to the major axis
of the nucleus and then averages them to find h(u), which is
the medial axis. The width along the medial axis is w(u) 5
y(t
1
) 2 y(t
2
). For convenience, we normalize distances so that
u is in the interval [0,1]. Given the medial axis and the width,
we can easily reconstruct the shape. This definition also pro
vides an easy way to find the medial axis in an image. Given a
DNA image for a single cell region, we threshold it to yield an
M 3 N binary digital image f(i,j), where i 5 1,2,...,M, j 5
1,2,..,N. f(i,j) 5 1 only when the pixel (i,j) belongs to the nu
clear object (i.e., is abovethreshold); otherw ise f(i,j) 5 0.
Then the medial axis of the nucleus in this image is defined on
fij9j; f ði; jÞ¼1g and a point of the medial axis at position i is
ðminfjjf ði; jÞ¼1gþmaxfjjðf ði; jÞ¼1gÞ=2. Thus the width
is wðiÞ¼maxfjjðf ði; jÞ¼1gminfjjf ði; jÞ¼1g. Figures
2a–2d illustrate the steps we have used to convert a nuclear
image into a medial axis representation.
Each of the two parts of the medial axis representation,
the medial axis and width, can be represented by a curve. To
parameterize such a curve, we fit a fourth order Bspline with
one internal point; we chose this order because it gives very
good fits to the nuclear width (Fig. 2f) and reasonably accu
rate fits to the medial axis itself (Fig. 2e). (Some of the high
frequency variation in the medial axis may be due to digitiza
tion artifacts from rotation; the spline fit results in some
smoothing of this variation but may also smooth nuclear
blebs.) Combining the six parameters for each of the two
spline fits with a parameter for the range of the medial axis
along the xaxis gives a total of 13 parameters for nuclear
shape. We have observed the fitted values of the internal point
to be around 0.5 for both the medial axis and width curves
and to have little contribution to the variation of the shape
(data not shown). Since the curves are defined over the inter
nal [0,1], this suggests that nuclei are roughly symmetric
about their center. We therefore chose to take the internal
point as a constant, leaving 11 remaining free parameters.
RAPID PUBLICATION
980 Generative Models for Subcellular Location
Page 3
A statistical model is required to describe the variation of
parameters from nucleus to nucleus. As an initial approach,
we have chosen to use multivariate normal distributions to
represent this variation. Figure 3 shows quantile–quantile
plots of each parameter (a straight line on these plots indicates
close agreement to a normal distribution). The range of the
medial axis (Fig. 3a) and the parameters of the width curve
(Figs. 3g–3k) are all fit well by normal distributions, while the
parameters of the medial axis cur ve are not as well fit (we
intend to consider other distributions in future work). For
distribution estimation, we assume that the parameters of the
medial axis and width are independent. This increases the flex
ibility of shape distribution and reduces the number of para
meters. Subject to the assumption of normality for the para
meters, we can capture the entire nuclear shape model using
47 values: the 11 means for each parameter and the 36 entries
in the covariance matrices of the medial axis and width pa
rameters (the medial axis has six parameters giving 21 unique
entries in its covariance matrix, and the width distribution has
five parameters giving 15 entries in its covariance matrix). To
generate a nuclear shape, we can draw parameters from the
normal distribution (Fig. 3) and then construct a shape from
the drawn parameters.
Texture of Nucleus
Given an image of the DNA distribution of a cell, the nu
clear texture reflects the condensation of chromosomes w ithin
the nucleus and the variation in DNA content along each
chromosome. Nuclear texture analysis has shown to be signifi
cant for biomedical studies such as cell cycle examination (27)
or disease diagnosis (28,29). The next task we consider is
therefore building a model for chromatin texture that can be
used to synthesize realistic nuclear DNA images.
We chose to apply texture modeling and synthesis techni
ques frequently used for natural images (30). However, these
techniques work best for textures that are homogeneous
within a shape, which is usually not the case in a nucleus.
Among many reasons, this is because the average intensity
decreases from the center of a nucleus to its boundary (Fig. 4).
This is because of decreasing thickness of the threedimen
sional ellipsoid nucleus near its edge and also because regions
in the middle of the nucleus may contain more outoffocus
light from above and below the image plane. So before esti
mating a nuclear texture model, the pixel intensities must be
adjusted for variation in average intensity across the nucleus.
This variation can be fit well by a function derived from ellipse
projection (Fig. 4): f ðdÞ¼a þ bð1 d
2
Þ
1=2
þ cð1 d
2
Þ
1=4
,
where d is the normalized distance to the center of the nu
cleus. After finding the function parameters, the intensity
Iðx; yÞ at the position ðx; yÞ is scaled as Iðx; yÞ=f ðdðx; yÞÞ.
After intensity normalization, we used a neighborbased
method to extend the texture to remove all background (31).
The texture image is then modeled by a parametric model
using wavelets (32). Combining the medial axis shape model
with the texture model allows us to synthesize nuclear images
with proper shape and approximate chromatin texture
(Fig. 5).
Shape Model of Cells
By definition, subcellular location of a protein takes place
within the bounds of the plasma membrane. We therefore
Figure 2. Example of fitting the medial axis description of a nu
clear shape by Bsplines. The original image (a) con taining a nu
cleus was processed into a binarized image (b), in which the nu
clear object consists of the white pixels. The nuclear object was
rotated so that its major axis is vertical (c). It is then converted
into the medial axis representation (d). The horizontal positions of
the medial axis as a function of the fractional distance along it are
shown by the symbols in (e), along with a Bspline fit (solid
curve). The width as a function of fractional distance is shown by
the symbols in (f), along with the corresponding fit (solid curve).
Scale bar, 5 lm.
Figure 1. Examples of nuclear images. The 447 nuclei in the 3D HeLa dataset were ranked by their Mahalanobis distances to the mean
value of the 11 parameters describing the shape based on medial axis. The nuclei shown are (a) the most typical nucleus, (b) the 100th
most typical nucleus, (c) the 200th most typical nucleus and (d) the 400th most typical nucleus. Scale bar, 5 lm.
RAPID P UBLICATION
Cytometry Part A 71A: 978990, 2007 981
Page 4
next incorporate a model for cell shape. Different types of cells
can have very different shapes, which are often related to their
functions. For example, a neuron has a treelike structure for
signal conduction, while columnar epithelial cells are roughly
rectangular so that they can be tightly connected to separate
different environments. What we deal with here are shapes of
cultured HeLa cells, which adhere to glass surfaces and spread
their cell bodies out to take on a ‘‘fried egg’’ shape (Fig. 6).
Although some general shape models such as polygons
(33) or active shape models (34) have been used to model cell
shapes, for our purposes we wished to make the cell shape
model conditional on the nuclear model described above (that
is, we wished to consider the correlation between the shape of
a cell and its nucleus). Figure 7 shows the histogram of the dif
ferences between nuclear major axis ang le and cell major axis
angle and the histogram of the distances between the nuclear
Figure 3. Estimating normality of the distributions of the parameters of the medial axis representations for all nuclei. Quantile—quantile
plots comparing the distributions of each parameter across all 447 nuclei (on the vertical axis) to a Gaussian distribution (on the horizontal
axis) are shown. (a) The length of the medial axis. (b—f) The five parameters of the medial axis curve. (g—k) The five parameters of the
width curve.
RAPID PUBLICATION
982 Generative Models for Subcellular Location
Page 5
center and cell center. From Figure 7a we concluded that the
nucleus and cell are aligned at similar orientations. Similarly,
Figure 7b shows that the center of the nucleus and the center
of the cell are ty pically close to each other, with an average
distance of 2.2 lm. If we model the cell shape independently,
we then need to model the correlations between nuclear and
cell alignments, including positions and orientations. But if
we build a cell shape model that is conditional on the nuclear
shape model, this will not be necessar y.
The conditional shape model we build can be illustrated
in a polar coordinate system, of which the origin is at the cen
ter of the nucleus. The boundary of the nucleus and the
boundary of the cell are denoted as d
n
(h) and d
c
(h) respec
tively, where h is the angular coordinate and belongs to [0,2p].
Because we know d
n
(h), it would be sufficient to describe the
cell shape using the radial coordinate ratio between the two
shapes. We call this the shape ratio of a cell, which is also a
function of angles and denoted as r(h)5d
c
(h)/d
n
(h).
If we sample h over 3608 in 1 degree increments, a shape
will be represented by a vector of length 360. Estimating the
statistical distribution of the vectors will require much more
data than we have to guarantee accuracy. To solve the problem,
we used principal component analysis (PCA) to reduce the
dimensions, as was done in active shape models (34). First,
the average shape ratio was calculated by taking the mean of
the shape ratios of all the cells. The residuals (the differences
between a cell’s shape ratio vector and the average shape ratio
vector) were calculated for each cell. So the average ratio
rðhÞ
and the residual d
i
ðhÞ of the ith cell are calculated as
rðhÞ¼
1
N
P
N
i¼1
r
i
ðhÞ and d
i
ðhÞ¼r
i
ðhÞ
rðhÞ, where N is the
number of cells. The principal components representation of
d
i
ðhÞ is
P
N
j¼1
k
ij
e
j
ðhÞ, where e
j
is the j th principal component
and k
ij
is its coefficient from the ith cell. The indices are
arranged in decreasing order of their contribution to the over
all variation. We can discard some components without losing
the essential properties of the shape.
The k
ij
matrix can be modeled by a multivariate normal
distribution, from which we can draw samples to synthesize a
cell shape. In our implementation we used 10 components,
which contain about 90% of the variation (data not shown).
Figure 8 shows the average shape and shapes illustrating the
four highest modes of shape variation. An example shape
synthesized from the model with 10 components is also
shown. The cell shapes represent good approximations to real
cell morphologies, although fine structure in the cell bound
aries is not captured well.
Protein Object Modeling
Gaussian objects. Previously we have shown that subcellular
location images can be wellmodeled by combinations of indi
vidual objects, which are defined as contiguous regions of
nonzero pixels in a segmented image (35). We therefore focus
in this paper on patterns that are comprised mainly of small,
roughly ellipsoidal objects (such as lysosomes and endo
somes). To model these objects as seen in 2D images, we can
use 2D Gaussian distributions, N(l,S), where l ¼
l
x
l
y
and
R ¼
R
xx
R
xy
R
yx
R
yy
. These parameters can be directly calcu
lated from the image of a single object. However, organelles
such as vesicles can aggregate or overlap to form a larger object
in an image that is nonGaussian in shape. We therefore used
Gaussian mixture distributions to describe the large objects as
combinations of smaller objects.
Figure 4. Capturing nuclear intensity variation. The intensity in a
nucleus decreases from the center to the boundary (x) and it can
be fit by a simple function as described in the text (solid line).
Figure 5. Examples of synthesized nuclei. Each nucleus is synthesized with two parts, shape and texture.
RAPID P UBLICATION
Cytometry Part A 71A: 978990, 2007 983
Page 6
The probability density function (PDF) of a Gaussian
mixture distribution can be denoted as f ðx j #Þ¼
P
m
k¼1
p
k
gðxjl
k
; R
k
Þ, where # ¼fp
k
; l
k
; R
k
j 0 p
k
1;
k ¼ 1; ...; mg,
P
m
k¼1
p
k
¼ 1 and gðx j l
k
; R
k
Þ is the PDF of the
Gaussian distribution with mean l
k
and covariance matrix
P
k
. In fact, the Gaussian mixture distribution can describe
small objects as well because a Gaussian distribution is a Gaus
sian mixture distribution with one component.
The expectationmaximization (EM) algorithm can be
used to estimate a Gaussian mixture distribution (36). How
ever, the EM algorithm requires the number of components as
an input. To estimate this number for an object, we used a
lowband filter to smooth the object and the take the number
of local maxima of the object intensities as the number of
components. Since each data point has a weight, which is the
intensity of the pixel ðw
i
¼ Iðx
i
; y
i
ÞÞ, we used the weighted
EM algorithm as follows,
E step:
a
ik
¼
P
ðtÞ
k
gðx
i
jl
ðtÞ
k
;
P
ðtÞ
k
Þ
P
m
k¼1
P
ðtÞ
k
gðx
i
jl
ðtÞ
k
;
P
ðtÞ
k
Þ
M step:
For k from 1 to m
P
ðtþ1Þ
k
¼
1
P
N
i¼1
w
i
X
N
i¼1
w
i
a
ik
;
l
ðtþ1Þ
k
¼
1
P
N
i¼1
w
i
a
ik
X
N
i¼1
w
i
a
ik
x
i
;
X
ðtþ1Þ
k
¼
1
P
N
i¼1
w
i
a
ik
X
N
i¼1
w
i
a
ik
ðx
i
l
ðtþ1Þ
k
Þðx
i
l
ðtþ1Þ
k
Þ
T
:
The solutions were obtained when the iteration of the
two steps converged.
The representation of Gaussian mixture leads to a new
definition of objects, which are called Gaussian objects because
each object is defined as a density function of a 2D Gaussian
distribution multiplied by a total intensity (this is p
k
P
N
i¼1
w
i
for the kth component). In the general Gaussian object, all of
the elements of the covariance matrices are free parameters.
However, this can lead to fitted objects with very large or
highly elongated shapes not t ypical of subcellular organelles
like lysosomes and endosomes (Fig. 9). To minimize this
effect, we can require each Gaussian object to be circularly
Figure 6. Examples of cells with different shapes, including a cell with the smallest cytoplasmnucleus area ratio (a), a cell with the ratio
closest to the average ratio (b), and a cell with the largest ratio (c). Scale bar, 5 lm.
Figure 7. The correlation between the cell morphology and nuclear morphology is shown by (a) the histogram of the differences between
nuclear angle and cell angle and (b) the histogram of the distances between nuclear center and cell center.
RAPID PUBLICATION
984 Generative Models for Subcellular Location
Page 7
symmetric (by constraining the covariance matrix to have
equal values along the diagonal and zero values for the off
diagonal elements). This gave better results (as judged by
comparison with real images) than the full covariance ma
trix (data not shown).
To describe the statistics of the Gaussian objects, we
found that the standard deviation of the objects (which con
trols their size) can be fit by an exponential distribution (Fig.
10a). In addition, the relative intensity of objects, which we
define as the square root of the ratio between the intensity and
variance, can also be fit by a Gaussian distribution (Fig. 10b).
The distribution of the number of Gaussian objects in each
cell can be fit by a Gamma distribution (Fig. 10c). To deter
mine how many Gaussian objects exist in a cell, we draw a
number from the Gamma distribution and round it to the
nearest integer.
Object position model. In addition to the number and sizes
of objects in a cell, the positions of these objects are impor tant
for synthesizing a realistic pattern. Proteins can be readily
divided into cytoplasmic, nuclear, or membrane bound.
Therefore, we modeled the positions of proteincontaining
objects using two parameters describing their relationship to
the nuclear and plasma membranes. The first, r, is defined as
the ratio of the distance of a given object to the nuclear mem
brane to the sum of that distance and the distance to the cell
membrane. The second, a, is defined as the angle between a line
from the center of the nucleus to an object’s position and the
major axis of the nucleus. The distance of an object to the cell
boundary or nuclear boundary can be found by building dis
tance maps. For example, to calculate the distance of an object
to a nucleus, we first obtain a binary image that only contains
the edge of the nucleus. Then we build a distance map in which
Figure 8. Illustration of the conditional cell shape model. In the cell shape model described in this paper, the shape of a given cell is
described by the ratio of the cell size to the nuclear size at each of 360 angles. Variation in cell shape is captured in two parts, the average
shape ratio and the variation of the differences between specific cell shape ratios and the average shape ratio. The six figures shown here
are ( a) the cell morphology corresponding to the average shape ratio, (b—e) the cell shapes after adding each of the first four principal
components respectively, and (f) a cell shape synthesized with shape ratios drawn at random from the distributions. For figures (b—e), the
ith principal component was added with the coefficient 1.5 r
i
, where r
i
is the standard deviation of the coefficient of the ith component
(the square root of the eigenvalue of that principal component).
Figure 9. Example of fitting objects by 2D Gaussian mixture. (a) An image containing the original object is (b) smoothed by a Gaussian
lowpass filter. Then the number of Gaussian objects is decided by the number of local maxima in the smoothed images. We can use either
(c) spherical or (d) full covariance matrices while fitting the objects by the EM algorithm.
RAPID P UBLICATION
Cytometry Part A 71A: 978990, 2007 985
Page 8
the intensity of the pixel is the smallest distance of that pixel to
the nuclear edge. The distances to the cell membrane were
obtained in the same way. This permits the distribution of
object positions for a given cell to be converted into a distribu
tion of r,a values. We model this distribution or potential (the
probability that a given position is the center of an object) as
Pðr; aÞ¼
e
b
0
þb
1
rþb
2
r
2
þb
3
sinaþb
4
cosa
1 þ e
b
0
þb
1
rþb
2
r
2
þb
3
sinaþb
4
cosa
and determine the values of the parameters by logistic regres
sion. Here the angle is transformed into the linear combina
tion of sin and cos functions because its value is periodic, i.e.
a and a 1 2p are the same angle. The parameters in this model
can easily be interpreted. For example, b
1
shows whether the
protein is more likely overlapping (when b
1
is positive) or not
overlapping (when b
1
is negative) the nucleus. b
3
and b
4
determine whether the protein is more likely distributed along
the major axis (when b
3
 is small and b
4
 is large) or minor
axis (when b
3
 is large and b
4
 is small) of the nucleus.
Figure 11 shows examples of potentials learned from
images of the lysosomal protein LAMP2. To use the potentials
to predict object positions, we normalized them so that their
sum is 1. This makes them the probabilities of an object being
found at each location in the twodimensional image grid. To
obtain object locations for a synthetic image, we randomly
choose the number of objects for that image from the Gamma
distribution discussed above and then randomly draw that
many object positions from the multinomial distribution spe
cified by the position probabilities.
Given this model for protein object positions, we can
finally synthesize location images containing all three chan
nels; examples synthesized from the trained models for six
proteins are shown in Figure 12. Many additional example
images are available at http://murphylab.web.cmu.edu/data.
Evaluation of Synthesized Images
Having described how the generative model is created
and how it can be used to synthesize images, the natural ques
tion is: How good are the synthesized images? We expect that
the synthesized images from good generative models should
be similar to real images. A simple way to verify this is to
visually determine the degree of difference between the real
Figure 10. Statistics of Gaussian objects for lysosomal protein images. (a) The distribution of object sizes across all images and the corre
sponding fitted Gaussian distribution. (b) The distribution of relative intensity per object across all images and the corresponding fitted
Gaussian distribution. (c) The distribution of number of objects per cell and the corresponding fitted Gamma distribution.
RAPID PUBLICATION
986 Generative Models for Subcellular Location
Page 9
and synthesized images. However, this is not a suitable
approach for our generated images, which are visually distin
guishable from real images because a number of sources of
noise have been removed to estimate an ideal distribution. It
is also difficult to make quantitative estimates of similarity
using visual examination. We therefore chose to compare the
images using the Subcellular Location Features, which have
been shown to represent location patterns very well (13). This
can be done by training a classifier on SLF of real images and
then applying it to the SLF of synthesized images to see how
well the synthesized images can be classified. Here we used the
SLF7DNA feature set minus feature SLF7.79, the fraction of
cellular fluorescence not included in objects (since the synthe
sized images contained no fluorescence that was not in ob
jects). This left 89 features, including 13 texture features after
downsampling to a pixel size of 1.15 lm and 32 gray levels,
49 Zernike moment features, 5 object skeleton features, 8 mor
phological features, 6 DNA features, 5 edge features, and 3
convex hull features (12,13). We used stepwise discriminant
analysis (SDA) (37) to select the most informative features for
both the real and synthesized images. SDA returned 40 fea
tures ranked in order of their ability to distinguish the classes.
Support vector machines were trained on the real images using
increasing numbers of these features. These were applied to
the synthesized images to test how well they can be recognized.
We considered the DNA pattern as a class, represented by the
synthesized nuclear images.
Figure 13 shows the average accuracies of classifying real
and synthesized images using various numbers of features.
The average classification accuracies were calculated after mer
ging the two Golgi proteins, giantin and gpp130, into a single
class since their patterns are so similar. Variation in accuracy
occurs as additional features are added, due to at least two
sources. The first (small) source of variation is the sampling
variation that happens between averages of random crossvali
dation trials. The second, larger source of variation occurs
only in the case when the population of training images is not
expected to be identical to that of the testing images (e.g., for
the case of training using real images and testing using syn
thetic images). In this case, the addition of a feature that dis
tinguishes among the classes of the real images better than
among the synthetic images can lead the classifier to put
weight on that feature at the expense of the previous features,
and lead to a decline in performance. The decline can poten
Figure 11. Illustrations of object position potentials. The estimated potentials of object positions for lysosomal proteins are shown for two
cells as a surface in 3D space. The higher the pixel, the more likely it is for an object to be located at that position . Cell (blue) and nuclear
(red) outlines are shown and the protein is shown in green.
Figure 12. Synthesized images for different protein patterns. Red:
nucleus; Blue: cell membrane; Green: protein. The proteins are:
(a) giantin, (b) gpp130, (c) LAMP2 (lysosomal), (d) a mitochondrial
protein, (e) nucleolin, and (f) transferrin receptor (endosomal).
RAPID P UBLICATION
Cytometry Part A 71A: 978990, 2007 987
Page 10
tially be partially reversed if a new feature is added that distin
guishes among the classes well for both the real and synthetic
images.
Among the feature sets giving classification accuracies on
real images that were higher than 90%, a set of 16 features
resulted in the best classification accuracy for the synthetic
images, 71%. The confusion matrix for this case is shown in
Table 1. The accuracy of classification of real images of nine
classes is 95%, which means that these features contain almost
all necessary information to distinguish the major patterns.
The average accuracy of classification using only synthesized
images is 87%. Thus the images generated by each of the mod
els are clearly different from each other even if they are not
always correctly recognized by a classifier trained on real
images.
We further tested how well the model parameters could
be used to discriminate real images. According to the models,
each protein image has nine features, one parameter for the
number of Gaussian objects, one parameter of object size dis
tribution, two parameters of object intensity distribution and
five parameters of object position model. Using just these nine
features we obtained a classification accuracy for real images
of 88% (Table 2), which means that the models captured most
essential information to distinguish the six patterns. This is an
important result in that the generative model parameters may
be considered to be a more ‘‘natural’’ representation of the
image patterns than previously described features.
DISCUSSION
This paper presents a framework for building generative
models of location patterns. The ability to represent and gen
erate subcellular distributions for all proteins will be impor
tant for systems biology. An important aspect of our frame
work is that the parameters of the models are all learned from
real data, enabling them to be applied to large scale projects
that are analyzing thousands of proteins (38). A critical advan
tage of generative models over simple collections of images for
the purpose of representing subcellular patterns is that correla
tions between components of the model (such as possible
correlations between nuclear orientation and cell shape) that
might be difficult to perceive from visual inspection can be
identified and captured.
Beyond simply describing a system for building such
models, however, we have also described an approach for the
evaluation of the images generated by these models using clas
sifiers trained on real images. This is a critical advance, since
there are many possible approaches to model building that
could be considered. The results in Table 1 show that most
synthesized images were correctly classified and also indicate
which patterns need fur ther model improvement in future
work. The only pattern classified with low accuracy is the mi
tochondrial pattern, for which most images were classified as
the endosome pattern.
We note also that the use of generated images in simula
tion studies in the future will provide an additional (and
potentially better) way to evaluate them: how they affect the
agreement between simulation results and experimental
Figure 13. Evaluation of synthesized images. Classification
accuracies are shown as a function of the number of features
used under three conditions: classifying synthesized images by a
classifier trained on real images (x), training and testing on real
images using cross validation (2) and training and testing synthe
sized images by cross validation (2).
Table 1. Classification of synthesized images
a
TRUE CLASSIFICATION
OUTPUT OF CLASSIFIER
DNA ER ACTIN GIA GPP LYSO. MIT. NUC ENDO. TUB.
DNA 100 00 000 00 00
Gia 0 0 0 31 54 13 0 1 1 0
Gpp 0 0 0 24 62 11 0 2 1 0
Lyso. 0 0 0 7 4 50 7 0 32 0
Mit. 0 0 0 0 0 2 18 080 0
Nuc. 1 0 0 4 15 0 0 80 00
Endo. 0 2 0 0 0 1 2 0 91 4
a
A classifier was trained using 16 features of real images. One hundred images were generated for each pattern shown in the row
headings. The values shown are the percentage of synthesized images for each row that were classified as one of the 10 patterns shown in
the column headings. Boldface numbers indicate the percentage of correctly classified images.
RAPID PUBLICATION
988 Generative Models for Subcellular Location
Page 11
results. Such studies will also potentially indicate directions to
improve the models.
We note that lysosomes are observed to overlap the nu
cleus in both real images and synthesized images. This appears
to be an artifact of imaging (presence in the same optical sec
tion of lysosomes above or below a section of nucleus) that is
carried over into the models (since lysosomes cannot normally
enter the nucleus). True threedimensional models are
required to solve this problem, and we are currently pursuing
this direction. However, the 2D location models we have
described are likely to be useful for those cases where a model
is confined to 2D (e.g., for computational efficiency).
While our current models may be useful immediately,
there are two important additional characteristics needed to
build accurate cell simulations. The first is to build models
that specify the location of multiple proteins (and eventually
all proteins) in the same cell. Only a small number of proteins
can currently be visualized in live cells using fluorescence
microscopy. An important alternative is to use fixed cells and
obtain correlated distributions by repeated cycles of staining,
imaging and photobleaching (39). While this is thought to be
able to image up to 100 proteins in the same sample, it is unli
kely that it can be extended to simultaneously measure tens of
thousands of proteins in the same cell (and of course it cannot
be applied to living cells). Thus, methods for combining infor
mation from different cells are necessary, and generative mod
els can play this role. Proteins can first be grouped into high
resolution subcellular location families and then a generative
model can be built for each family. These can be combined to
synthesize cell models showing tens of thousands of proteins
under the assumption that all proteins in a family show
highlycorrelated distributions.
The second necessary characteristic of future models is
the ability to represent changes in protein distribution over
time, on time scales from below a second to greater than a
year. The abilit y to directly acquire information on the dy
Table 2. Classification of real images based on the model parameters
a
TRUE CLASSIFICATION
OUTPUT OF CLASSIFIER
GIA GPP LYSO MIT. NUC. ENDO.
Gia 86 860 0 0
Gpp 6 80 10 2 2 0
Lyso. 2 6 84 404
Mit. 0 0 4 88 08
Nuc. 0 0 0 0 100 0
Endo. 0 0 2 10 0 88
a
Generative model parameters were estimated for individual images and used to train and test classifiers using 10fold crossvalida
tion. The average accuracy was 88%. Boldface numbers indicate the percentage of correctly classified images.
Figure 14. Description of the models as Bayesian networks. The network representing the models built in this paper is illustrated by figure
(a). More accurate but complicated models can be obtained by adding edges to the network (b).
RAPID P UBLICATION
Cytometry Part A 71A: 978990, 2007 989
Page 12
namics of protein distribution is a critical advantage of fluo
rescence microscopy.
In this vein, we can consider ways of representing genera
tive models and choosing their characteristics. The models
proposed in this paper can be put into a directed probabilistic
graphical model framework, which is also known as Bayesian
network (Fig. 14). The advantage of using a graphical model is
that we can tune the model structure in a more intuitive way.
In the graphical model, each node is a component of the
model and each edge is the correlation between the nodes. The
arrow means the direction of determination. So the procedure
of model design becomes adding or removing nodes or edges.
If we consider each component as a set of random variables,
then the graph becomes a Bayesian network. Therefore we can
use welldeveloped techniques for Bayesian networks to do
inference and interpretation.
The goal of building the generative models is to provide
an interface between location proteomics and systems biology,
so we have begun implementing generative models in our
Protein Subcellular Localization Image Database (PSLID,
http://pslid.cbi.cmu.edu). We have also done some prelimi
nary work to convert the models into XML format, which we
hope to merge into standard cell modeling descriptions such
as SBML (40) and CELLML (41). This will make our models
easily transferable between programs. We expect shortly to
release software to permit training of models and synthesis of
images on a variety of platforms (I. CaoBerg, T. Zhao, and
R.F. Murphy, in preparation). We anticipate a wide applicabil
ity of these tools in systems biology studies, especially in simu
lations of cell behavior that require detailed models for subcel
lular location.
ACKNOWLEDGMENTS
We thank Eric Xing and Geoffrey Gordon for helpful dis
cussions and critical reading of this manuscript.
LITERATURE CITED
1. Ideker T, Galitski T, Hood L. A new approach to decoding life: Systems biology.
Annu Rev Genomics Hum Genet 2001;2:343–372.
2. Kitano H. Computational systems biology. Nature 2002;420:206–210.
3. Sauro HM, Hucka M, Finney A, Wellock C, Bolouri H, Doyle J, Kitano H.
Next generation simulation tools: the Systems Biology Workbench and BioSPICE
integration. OMICS 2003;7:355–372.
4. Faust M, Montenarh M. Subcellular localization of protein kinase CK2. A key to its
function? Cell Tissue Res 2000;301:329–340.
5. Ortoleva P, Berry E, Brun Y, Fan J, Fontus M, Hubbard K, Jaqaman K, Jary
mowycz L, Navid A, SayyedAhmad A, Shreif Z, Stanley F, Tuncay K, Weitzke E,
Wu LC. The karyote physicochemical genomic, proteomic, metabolic cell modeling
system. OMICS 2003;7:269–283.
6. Chou KC, Elrod DW. Protein subcellular location prediction. Prot Eng 1999;12:107–
118.
7. Chou KC, Cai YD. Prediction and classification of protein subcellular location
sequenceorder effect and pseudo amino acid composition. J Cell Biochem 2003;
90:1250–1260.
8. Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector
machines using compositions of amino acids and amino acid pairs. Bioinformatics
2003;19:1656–1663.
9. Pan YX, Zhang ZZ, Guo ZM, Feng GY, Huang ZD, He L. Application of
pseudo amino acid composition for predicting protein subcellular location: Stochas
tic signal processing approach. J Prot Chem 2003;22:395–402.
10. Chen X, Velliste M, Murphy RF. Automated interpretation of subcellular patterns in
fluorescence microscope images for location proteomics. Cytometry Part A 2006;
69A:631–640.
11. Glory E, Murphy RF. Automated subcellular location determination and high
throughput microscopy. Developmental Cell 2007;12:7–16.
12. Boland MV, Murphy RF. A neural network classifier capable of recognizing the pat
terns of all major subcellular structures in fluorescence microscope images of HeLa
cells. Bioinformatics 2001;17:1213–1223.
13. Murphy RF, Velliste M, Porreca G. Robust numerical features for description and
classification of subcellular location patterns in fluorescence microscope images.
J VLSI Signal Process 2003;35:311–321.
14. Velliste M, Murphy RF. Automated determination of protein subcellular locations from
3D fluorescence microscope images. In: Proceedings of the 2002 IEEE International Sym
posium on Biomedical Imaging, Washington, DC, 7–10 June 2002. pp 867–870.
15. Chen X, Velliste M, Weinstein S, Jarvik JW, Murphy RF. Location proteomics—
Building subcellular location trees from high resolution 3D fluorescence microscope
images of randomlytagged proteins. Proc SPIE 2003;4962:298–306.
16. Chen X, Murphy RF. Objective clustering of proteins based on subcellular location
patterns. J Biomed Biotechnol 2005;2005:87–95.
17. Hu Y, Carmona J, Murphy RF. Application of temporal texture features to automated
analysis of protein subcellular locations in time series fluorescence microscope
images. In: Proceedings of the 2006 IEEE International Symposium on Biomedical
Imaging, Arlington, VA, 6–9 April 2006. pp 1028–1031.
18. Krause A, Stoye J, Vingron M. Large scale hierarchical clustering of protein
sequences. BMC Bioinformatics 2005;6:15.
19. Balaji S, Srinivasan N. Use of a database of structural alignments and phylogenetic
trees in investigating the relationship between sequence and structural variability
among homologous proteins. Prot Eng 2001;14:219–226.
20. Loew LM, Schaff JC. The virtual cell: A software environment for computational
cell biology. Trends Biotechnol 2001;19:401–406.
21. Coggan JS, Bartol TM, Esquenazi E, Stiles JR, Lamont S, Martone ME, Berg
DK, Ellisman MH, Sejnowski TJ. Evidence for ectopic neurotransmission at a neu
ronal synapse. Science 2005;309:446–451.
22. Huang K, Murphy RF. Boosting accuracy of automated classification of fluorescence
microscope images for location proteomics. BMC Bioinformatics 2004;5:78.
23. Thomas CH, Collier JH, Sfeir CS, Healy KE. Engineering gene expression and pro
tein synthesis by modulation of nuclear shape. Proc Natl Acad Sci USA 2002;4:1972–
1977.
24. Blum H. Biological shape and visual science. J Theor Biol 1973;38:205–287.
25. Tam R, Heidrich W. Shape simplification based on the medial axis transform. In:
Proceedings of the 14th IEEE Conference on Visualization, 2003; Seattle, Washington,
USA. pp 481–488.
26. Hiransakolwong N, Vu K, Hua KA, Lang SD. Shape recognition based on the
medial axis approach. Proceedings of the 2004 IEEE International Conference on
Multimedia Exposition, 2004; Taipei, Taiwan. pp 257–260.
27. Murata Si, Herman P, Lakowicz JR. Texture analysis of fluorescence lifetime images
of AT and GCrich regions in nuclei. J Histochem Cytochem 2001;49:1443–1451.
28. Palcic B. Nuclear texture: Can it be used as a surrogate endpoint biomarker? J Cellu
lar Biochem 1994;19(Suppl):40–46.
29. Jørgensen T, Yogesan K, Tveter KJ, Skjørten F, Danielsen HE. Nuclear texture anal
ysis: A new prognostic tool in metastatic prostate cancer. Cytometry 1998; 24:277–283.
30. Zhu SC, Wu Y, Mumford D. Filters, random fields and maximum entropy (FRAME):
Towards a unified theory for texture modeling. Int J Comput Vision 1998;27:107–
126.
31. Nealen A, Alexa M. Hybrid texture synthesis. In: Proceedings of the 14th Euro
graphics Workshop Rendering, 2003; Leuven, Belgium. pp 97–105.
32. Portilla J, Simoncelli EP. A parametric texture model based on joint statistics of com
plex wavelet coefficients. Int J Computer Vision 2000;40:49–71.
33. Lehmussola A, Selinummi J, Ruusuvuori P, Niemisto
¨
, YliHarja O. Simulating fluo
rescent microscope images of cell populations. In: Proceedings of the 27 Annual Con
ference of the IEEE Engineering in Medicine and Biology Society, 2005; Shanghai,
China. pp 3153–3156.
34. Cootes TF, Taylor CJ, Cooper DH, Graham J. Active shape models—Their train
ing and application. Comput Vision Image Understanding 1995;61:38–59.
35. Zhao T, Velliste M, Boland MV, Murphy RF. Object type recognition for automated
analysis of protein subcellular location. IEEE Trans Image Process 2005;14:1351–1359.
36. Bilmes J. A gentle tutorial on the EM algorithm and its application to parameter esti
mation for Gaussion mixture and hidden Markov models. Berkeley, CA: Interna
tional Computer Science Institute; 1997. Report nr TR97–021.
37. Huang K, Velliste M, Murphy RF. Feature reduction for improved recognition of
subcellular location patterns in fluorescence microscope images. Proc SPIE
2003;4962:307–318.
38. Garcia Osuna E, Hua J, Bateman N, Zhao T, Berget P, Murphy R. Largescale
automated analysis of protein subcellular location patterns in randomlytagged 3T3
cells. Ann Biomed Eng 2007;35:1081–1087.
39. Schubert W, Bonnekoh B, Pmmer AJ, Philipsen L, Bockelmann R, Malykh Y,
Gollnick H, Friedenberger M, Bode M, Dress AWM. Analyzing proteome topol
ogy and function by automated multidimensional fluorescence microscopy. Nat Bio
technol 2006;24:1270–1278.
40. Hucka M, Finney A, Sauro H, Bolouri H, Doyle J, Kitano H, Arkin A, Bornstein B,
Bray D, CornishBowden A, Cuellar AA, Dronov S, Gilles ED, Ginkel M, Gor V,
Goryanin II, Hedley WJ, Hodgman TC, Hofmeyr JH, Hunter PJ, Juty NS, Kasberger
JL, Kremling A, Kummer U, Le Novere N, Loew LM, Lucio D, Mendes P, Minch E,
Mjolsness ED, Nakayama Y, Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro
BE, Shimizu TS, Spence HD, Stelling J, Takahashi K, Tomita M, Wagner J, Wang J.
The systems biology markup language (SBML): A medium for representation and
exchange of biochemical network models. Bioinformatics 2003;19:524–531.
41. Lloyd CM, Halstead MDB, Nielsen PF. CellML: Its future, present and past. Prog
Biophys Mol Biol 2004;85:433–450.
RAPID PUBLICATION
990 Generative Models for Subcellular Location
Page 13
Data provided are for informational purposes only. Although carefully collected, accuracy cannot be guaranteed. The impact factor represents a rough estimation of the journal's impact factor and does not reflect the actual current impact factor. Publisher conditions are provided by RoMEO. Differing provisions from the publisher's actual policy or licence agreement may be applicable.

 "The WESTPA software in turn manages ensembles of the MCell simulations, for either weighted ensemble or bruteforce sampling. CellOrganizer (CellOrganizer.org) is an open source tool for learning conditional generative models of cellular organization from images505152535455565758 . From these models, new cellular geometries can be generated from different parts of the " shape space " of the system. "
[Show abstract] [Hide abstract] ABSTRACT: The longterm goal of connecting scales in biological simulation can be facilitated by scaleagnostic methods. We demonstrate that the weighted ensemble (WE) strategy, initially developed for molecular simulations, applies effectively to spatially resolved cellscale simulations. The WE approach runs an ensemble of parallel trajectories with assigned weights and uses a statistical resampling strategy of replicating and pruning trajectories to focus computational effort on difficulttosample regions. The method can also generate unbiased estimates of nonequilibrium and equilibrium observables, sometimes with significantly less aggregate computing time than would be possible using standard parallelization. Here, we use WE to orchestrate particlebased kinetic Monte Carlo simulations, which include spatial geometry (e.g., of organelles, plasma membrane) and biochemical interactions among mobile molecular species. We study a series of models exhibiting spatial, temporal and biochemical complexity and show that although WE has important limitations, it can achieve performance significantly exceeding standard parallel simulationby orders of magnitude for some observables. 
 "Any selected cells were modeled as having an elliptical nuclei, by manually choosing four points on the nuclei borderline and applying the Hough transform [25] (seeFigure 3 and 4). The nucleus and the cell are proven to be approximately aligned [26] so the nuclear envelope is approximated by another ellipsis centered in the nuclei, with the same inclination and the axis twice as long. This rudimentary model provides us with the necessary segmentation to perform the cell tracking by simply creating a mask over the region where the ellipsis is located for any frame in the video. "
[Show abstract] [Hide abstract] ABSTRACT: The study of viral infections using live cell imaging (LCI) is an important area with multiple opportunities for new developments in computational cell biology. Here, this point is illustrated by the analysis of the subcellular distribution of mitochon‐ drium in cell cultures infected by Dengue virus (DENV) and in uninfected cell cul‐ tures (Mockinfections). Several videos were recorded from the overnight experiments performed in a confocal microscopy of spinning disk. The density dis‐ tribution of mitochondrium around the nuclei as a function of time and space ρ(r, θ, t) was numerically modeled as a smooth interpolation function from the image data and used in further analysis. A graphical study shows that the behavior of the mitochondrial density is substantially different when the infection is present. The DENVinfected cells show a more diffuse distribution and a stronger angular variation on it. This behavior can be quantified by using some usual image process‐ ing descriptors called entropy and uniformity. Interestingly, the marked difference found in the mitochondria density distribution for mock and for infected cell is present in every frame and not an evidence of time dependence was found, which indicate that from the start of the infections the cells are showing an altered subcel‐ lular pattern in mitochondrium distribution. Ulteriorly, it would be important to study by analysis of time series for clearing if there is some tendency or approxi‐ mate cycles. Those findings are suggesting that using the image descriptors entropy and uniformity it is possible to create a machine learning classifier that could recog‐ nize if a single selected cell in a culture has been infected or not. 
 "In the work described here, we extend the nonparametric models to 3D shapes and to the combination of cell and nuclear shapes. This eliminates the need to explicitly model the conditional dependency of one shape on the other, in contrast with the previous parametric models (Zhao and Murphy, 2007; Peng and Murphy, 2011). We also develop generative models of the dynamics of cell and nuclear shape. "
[Show abstract] [Hide abstract] ABSTRACT: Modeling cell shape variation is critical to our understanding of cell biology. Previous work has demonstrated the utility of nonrigid image registration methods for the construction of nonparametric nuclear shape models where pairwise deformation distances are measured between all shapes and are embedded into a low dimensional shape space. Using these methods we explore the relationship between cell shape and nuclear shape. We find that these are frequently dependent upon each other and use this as the motivation for the development of combined cell and nuclear shape space models, extending nonparametric cell representations to multiple component 3D cellular shapes and identifying modes of joint shape variation. We learn a firstorder dynamics model to predict cell and nuclear shapes given shapes at a previous time point. We use this to determine the effects of endogenous protein tags or drugs on the shape dynamics of cell lines, and show that tagged C1QBP reduces the correlation between cell and nuclear shape. To reduce the computational cost of learning these models, we demonstrate the ability to reconstruct shape spaces using a fraction of computed pairwise distances. The open source tools provide a powerful basis for future studies of the molecular basis of cell organization.