Page 1

Fast, single-molecule localization that achieves theoretically

minimum uncertainty

Carlas S. Smith1, Nikolai Joseph2, Bernd Rieger1, and Keith A. Lidke2

1 Department of Imaging Science & Technology, Delft University of Technology, Delft, The

Netherlands 2 Department of Physics & Astronomy, University of New Mexico, Albuquerque, NM ,

USA

We describe an iterative algorithm that converges to the maximum likelihood estimate of the

position and intensity of a single fluorophore. Our technique efficiently computes and

achieves the Cramér-Rao Lower Bound, an essential tool for parameter estimation. An

implementation of the algorithm on graphics processing unit hardware achieves more than

105 combined fits and Cramér-Rao Lower Bound calculations per second, enabling real-time

data analysis for super-resolution imaging and other applications.

In many single molecule fluorescence applications, it is often desired to find the position and

intensity of a single fluorophore as well as to estimate the accuracy and precision of these

parameters. Where accuracy is a measure of the systematic error or bias and precision is a

measure of the statistical error of an estimator1. In recent work that uses single-molecule

localization to generate super-resolution images2-6, single emitters are located and on the

mosaic of their found positions a two-dimensional Gaussian profile is placed to generate the

final super-resolution images. The width of the placed Gaussian blob, σ, is given by the

precision of the fluorophore position localization σ = (σ2x + σ2y)1/2 and in these super-

resolution techniques it is therefore necessary to both find the parameters and estimate their

precision. Reported values are in the range of 20-70 nm. In the application of super-

resolution imaging, it may be required to find the positions of more than 106 fluorophores in

order to generate one final image of a typical field-of-view of 50 × 50 μm. In many cases,

background rates of light detection may vary across the field of view and the fluorophore

emission rate of chemically identical fluorophores can vary due to effects such as uneven

illumination profile, dipole orientation or different optical path lengths. In this work, we

describe an iterative routine, implemented on a graphics processing unit (GPU) that

calculates the Maximum Likelihood Estimate (MLE) for the xy(z)-position, the photon

count of the fluorophore and the background fluorescence rate (Supplementary Note). We

show that our approach achieves the Cramér-Rao Lower Bound (CRLB) over a wide range

of parameters. The uncertainties of the fitted parameters are found by calculating their

CRLBs1, and in this sense the estimated σ for building up the super-resolution image is

optimal. We provide a software tool (www.diplib.org/home22266) that only requires an

inexpensive graphic card in order for single molecule fitting speed to be sufficient for real-

time data analysis.

Users may view, print, copy, download and text and data- mine the content in such documents, for the purposes of academic research,

subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms

Correspondence should be addressed to K.A.L. (klidke@unm.edu).

AUTHOR CONTRIBUTIONS

C.S.S. and K.A.L. worked out the algorithm and implementation, N.J. and K.A.L. performed the experiments, B.R. and K.A.L.

designed the research and wrote the paper

NIH Public Access

Author Manuscript

Nat Methods. Author manuscript; available in PMC 2010 November 1.

Published in final edited form as:

Nat Methods. 2010 May ; 7(5): 373–375. doi:10.1038/nmeth.1449.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 2

Since the speed and precision of single particle localization has long played an important

role in single particle tracking as well as in other single molecule biophysical techniques that

rely on fluorescent reporters, others have also considered these issues. Several algorithms

from the literature for finding particle positions are compared in7, but, without the context of

a statistical framework. The theoretically best-possible estimation precision of a fluorophore

position has been derived in8 by using the well established statistical method of finding the

CRLB in an unbiased parameter estimation problem. They considered many of the effects in

a real system including background fluorescence, finite camera pixel size, and camera

readout noise and they recently made a software tool available for estimation

(www4.utsouthwestern.edu/wardlab). Non-linear least-squares (NLLS) and MLE position

estimates were compared to the CRLB in9, and it was found that, in general, MLE gives

better precision than NLLS. The better precision of MLE is in concurrence with our results,

however in9 they were investigated with assumed known emission and background rates.

Here we demonstrate a robust, iterative routine that finds the particle position, the intensity

and the background count rate. We do not consider camera readout noise since for electron

multiplying (EM)CCD cameras, which are generally used for the fast frame rates desired in

super-resolution imaging, the readout noise is much less than 1 rms e- when using large EM

gain. As described further in the Supplementary Note, Supplementary Data and

Supplementary Figs. 1-4, the method presented is not restricted to 2D imaging with a

symmetric point spread function (PSF), but can be extended to handle super-resolution

techniques that encompass astigmatic imaging for z resolution as in10. In this case, the z

position is also calculated directly (not from intermediate σx, σy fits) and returned with

CRLB based uncertainties. The results of the iterative algorithm compared to CRLB-based

theoretical values are shown for a range of background rates and total collected photon

counts of the PSF (Fig. 1). We show results for σPSF = 1 with the size defined in unitless

back-projected pixels. The diffraction limit for high NA visible light imaging is > 200 nm

and σPSF > 90 nm 11. The algorithm both achieves and correctly reports the CRLB

uncertainties over a wide range of background and fluorophore intensities. Calculated

precision remains within a few percent of the theoretically achievable value even for less

than 100 collected photons. We find that in all conditions, when the reported CRLB is less

than σPSF / 2 (here 0.5), the reported CRLB matches the theoretical position, and the routine

achieves the CRLB. In typical super-resolution applications this corresponds to < 50 nm.

Addition of camera readout noise has effectively the same negative influence on the

parameter estimation as high background. Fortunately, this can be excluded for an EMCCD

for the reasons mentioned above. Example images of single fluorophores with intensities

and background rates near the σPSF / 2 value are shown in Supplementary Fig. 5. The

classical approach to solving the position fitting problem is via NNLS optimization (Fig.

1b). Here we chose a Levenberg-Marquardt (LM) optimization scheme with analytic and

computed first derivatives with respect to the optimization parameters. Note that it is

common practice to use computed derivatives only. The NNLS optimization clearly

performs worse in terms of precision than our iterative MLE approach due mainly to the

incorrect Gaussian-noise model implicitly present in any least-squares based optimization

scheme. We compare the predicted uncertainty of the fit as in12 (Eq. 14) with the theoretical

CRLB (Fig. 1c). Strikingly, they are identical for no background fluorescence but for any

non-zero background and low light conditions the deviation is almost a factor of two. This

means that in these cases the suggested uncertainty σ used to constitute the super-resolution

image is estimated at nearly half the CLRB based value (overly optimistic). The formula

presented in12 has the advantage that it can be readily calculated by hand from measurable

quantities. However, for a precise estimate, the use of the reported CRLB from our iterative

algorithm is preferred.

Our iterative update scheme is similar to that described in13. We show, however, that a

Gaussian approximation for the two-dimensional fluorophore PSF 11 and following

Smith et al.

Page 2

Nat Methods. Author manuscript; available in PMC 2010 November 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 3

localization leads to a compact analytical expression that allows for computationally fast

localization without compromising on the localization precision. Our approach achieves the

CRLB after a few (~ 10) iterations. It should be noted that the CRLB predicts the correct

precision only when the model function is correct, and the isotropic Gaussian model may not

be appropriate when imaging fluorophores with a fixed dipole orientation14, leading to

anisotropic emission. As described in the Supplementary Data, a ‘rule-of-thumb’ fitting

region size of 2 × 3σPSF + 1 is used. For z resolution imaging the Gaussian PSF model is

less reliable due to optical aberrations and the simplified model itself11.

GPU-based computation has the potential to increase floating point calculation speed by a

factor of 10-100 as compared to a modern CPU if the problem is amenable to a parallel

processing approach (www.nvidia.com). A generic C-like language interface is available for

simplified GPU programming (Nvidia CUDA), and a MATLAB (The Mathworks, USA)

interface has been developed. Operating with a fixed number of iterations complements the

GPU's single instruction multiple data strategy (SIMD). A GPU implementation of our

iterative method can perform 105 combined MLE and CRLB calculations per second of the

four parameter model needed to describe the emittance of a fluorophore (Fig. 2).

The CPU and GPU performance, characterized by the number of combined position fits and

CRLB calculations performed per second, are shown in Table 1. The slowest GPU tested

outperforms the CPU by more than one order of magnitude, with the fastest achieving

2.6·105 fits per second on a 7×7 fitting box size. We attribute this level of performance gain

over the CPU to the fact that this estimation problem is almost ideal for the GPU SIMD

architecture. Many iterations are performed on the same data, which are stored in local

shared memory, and each thread is independent, eliminating synchronization delays. We

also note that all CPU computation only ran on a single thread. In the right column of Table

1 the performance of a non-linear, least-squares fit (in C-code) on a CPU is shown for

reference. It is twice as slow as our iterative algorithm on a CPU. Commonly used

MATLAB LM optimization only computes about 5 fits/second. Given the readout rate of

current high-end (EM)CCD cameras (~ 10 Mhz) our GPU implementation performs

combined fits and uncertainty estimates at a speed sufficient for real-time analysis (see

Methods).

To summarize our findings, we have derived an iterative approach for making a maximum

likelihood estimate of the position and intensity of a single fluorophore as well as the

background count rate using a two-dimensional Gaussian PSF model and a Poisson noise

model. The iterative method achieves the minimal possible estimation uncertainty, as given

by the Cramér-Rao Lower Bound, over a wide range of emission and background rates that

could be found in single molecule experiments. Implementation of the iterative method on a

modern graphics processing unit yields more than 105 combined fits and CRLB calculations

per second, greatly facilitating the analysis of large data sets found in single molecule based

super-resolution techniques and is suitable for real-time data analysis.

METHODS

Methods and any associated references are available in the online version of the paper at

http://www.nature.com/naturemethods/.

Methods

GPU Implementation

The iterative method to solve the MLE problem described above is implemented on a GPU

using NVIDIA's compute unified device architecture (Nvidia CUDA), a C based

Smith et al.

Page 3

Nat Methods. Author manuscript; available in PMC 2010 November 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 4

programming language that makes it possible to readily program parallelized algorithms that

are executed on a GPU. High-end gaming and computing GPU's have a (card dependent)

large region of global device memory, usually several hundred MBytes. Execution is

performed on a number (card dependent) of multi-processors. Multi-processors contain eight

sub-processors and have 16 KBytes local memory that is shared between sub-processors.

Access to local shared memory is fast, whereas access to the device global memory incurs a

large performance penalty, and should be avoided when possible. The programming model

follows the GPU architecture in that parallelized execution is performed by breaking down

computations into ‘blocks’ and ‘threads’. Each block executes a set of threads using one

multi-processor. Typically, performance is optimal when multiples of 32 threads (called a

‘warp’) are scheduled and executed using the 8 sub-processors. However, we found for all

fit box sizes, the maximum fits/second occurred when the maximum fits/block were used,

which is limited by the available 16KB shared memory per block. This is likely due to

compiler optimization and the fact that each fit is independent of all others (no thread

synchronization is required).

We map our iterative algorithm on this programming model in the following way. A data

stack consisting of identically sized sub regions, which are centered single emitter images,

are input to the function. This data is copied from host (CPU) memory to device (GPU)

main memory. This data set is divided into blocks, which consist of the largest number of

images that can fit into shared memory. The execution of a block begins by copying the data

sub-set into local shared memory. Each thread then calculates a complete fit and CRLB

calculation for one image. The Fisher information matrix is calculated using Supplementary

Note Eqs. 9 and 11 and the CRLB is calculated using the analytical expression for the

inverse of a 4 × 4 matrix. The CPU performance was measured by replacing the block/

thread architecture with nested loops which call the same sub-function, and was compiled

using Microsoft Visual Studio Express 2008. Images were loaded in MATLAB (The

Mathworks, USA) as arrays and the C-Code and the CUDA GPU code were called via mex-

files. The CPU was a AMD Phenom II X3 720 @2.8 GHz and only one core was used for

the C-code.

Synthetic Data Generation and Analysis

The CUDA routine described above operates on a data set consisting of identically sized

images that contain images of (potential) single molecules. For the analysis of synthetic

data, a stack is generated using the same finite pixel approximation, including background as

described in the Supplementary Note. The center coordinate of each simulated emitter is

randomly shifted within the central pixel to prevent a biased result. After the generation of

the data stack, the images are corrupted with Poisson noise. This data stack is analyzed by

the routine, which returns the estimated position, intensity, and background and the CRLB

calculation for each 2D image in the stack. No camera read noise is added. Camera read

noise for electron multiplying (EM) CCD cameras, which are generally used for the fast

frame rates desired in super-resolution imaging, is much less than 1 rms e- with large EM

gain.

Levenberg-Marquardt Non-Linear Least-Squares Fitting

The standard way of fitting a Gaussian to data is via least-squares fitting. We used 1) an

existing implementation of MATLAB via the optimization toolbox (lsqcurefit) where we

enabled the Levenberg-Marquardt option as minimization scheme and 2) an implementation

from Numerical Recipes in C15. We ran the test on fits with computed Jacobian only and on

fits where we supplied analytic first derivatives. We used the following limits in the

stopping criterion: tolerance on the parameters 10−4, tolerance on the function 10−15 and

maximal 105 function evaluations. The stopping criterion was in all cases determined by the

Smith et al.

Page 4

Nat Methods. Author manuscript; available in PMC 2010 November 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 5

accuracy put on the parameters. The MATLAB routine is a lot slower (about two orders of

magnitude) than the C implementation although it makes automatic use of all available cores

on the CPU if multi-threading is enabled.

Single Molecule Imaging

Single molecule imaging experiments were performed in an epi-fluorescence microscope

setup consisting of an inverted microscope (IX71, Olympus America Inc.), 1.45 NA

objective (U-APO 150x NA 1.45, Olympus America Inc.), 635 nm diode laser (Radius 635,

Coherent Inc.), and an electron multiplying CCD camera (Luca DL6581-TIL, Andor

Technologies PLC.). The pixel size is 10 μm. The epi-fluorescence filter setup consisted of

a dichroic mirror (650 nm, Semrock) and an emission filter (692/40, Semrock). Individual

Cy5 molecules were immobilized on an amino-silane ((3-Aminopropyl)triethoxysilane,

Sigma-Aldrich) treated 8-well chambered cover slips (Lab-Tek II, Nunc) via an NHS-ester

linkage attached to the Cy5 (Cy5 Mono-reactive dye pack, GE Healthcare). An oxygen

scavenging system16 was used to extend fluorophore lifetimes and quench fluorophore

triplet states. This was necessary to perform repeated measurements of the same single

emitter for several frames while acquiring sufficient photons in order to address localization

accuracy. This is not necessary in a dedicated experiment.

Data was recorded by the CCD camera at either 10 or 20 frames per second. All data was

post-processed by 1) subtracting a pixel-dependent camera-offset, which was created by

averaging 300 dark frames, and 2) multiplying the resulting image by a gain factor to restore

correct Poisson statistics, as done in17. Single molecules candidates were identified in each

time frame as regions where the 2-D Gaussian filtered image (σ = σPSF) was greater than

one standard deviation of this image. Note that due to the speed of the GPU implementation,

a simple but fast method for identifying candidates is preferred as well as one that errs on

the side of including regions that do not contain single molecules. Square regions of a

specified number of pixels that included all identified regions in the time series were

collected into one stack and input to the GPU routine. The resulting found coordinates were

used in building trajectories only if they passed the following criterion: 1) Reported

localization accuracy was less than one fifth σPSF in each dimension, and 2) 1/N Σkln(L(xk|

θ))>-1, where N is the number of pixels, which essentially performs a shape test and can

rule out obvious cases of two proximate emitters. The remaining coordinates were connected

into ‘trajectories’ using an existing single particle tracking routine18. Only ‘trajectories’ that

showed little triplet state blinking were used in the final analysis, with a cut-off criterion that

var(I(t))<2*mean(I(t)) where I(t) is the sum over all pixels in the analyzed region in frame t.

‘Trajectories’ were adjusted to compensate for microscope stage drift by subtracting a linear

regression line from each single particle trajectory.

The width of the PSF used in the fitting routine was found by minimizing the mean square

error between the finite pixel model and the summed projection over a 100 frame time

series. The σPSF used in further analysis is the average found from analyzing the summed

projection of 5 different single emitters.

Astigmatic Imaging

The 3D astigmatic imaging was calibrated by imaging 100 nm red ( 690 nm emission) beads

(FluoSphere, Invitrogen) bound to a the bottom of an 8 well cover slip chamber (Lab-Tek II,

Nunc). The filter setup used was the same as that used for single molecule Cy5 imaging. We

imaged using a 60x 1.2 NA water objective. A 500 mm focal length cylindrical lens was

inserted in the emission beam path just after the first lens of a two color beam splitter

(OptoSplit II, Cairn Research, UK). A piezoelectric z-stage (Nano-LPS, Mad City Labs)

translated the focal plane in steps of 50 nm from −0.5 μm to 0.5 μm. At each focal plane, 20

Smith et al.

Page 5

Nat Methods. Author manuscript; available in PMC 2010 November 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript