Simulating spin systems on IANUS, an FPGA-based computer
ABSTRACT We describe the hardwired implementation of algorithms for Monte Carlo simulations of a large class of spin models. We have implemented these algorithms as VHDL codes and we have mapped them onto a dedicated processor based on a large FPGA device. The measured performance on one such processor is comparable to O(100) carefully programmed high-end PCs: it turns out to be even better for some selected spin models. We describe here codes that we are currently executing on the IANUS massively parallel FPGA-based system.
arXiv:0704.3573v1 [cond-mat.dis-nn] 26 Apr 2007
F. Belletti^{a,b}, M. Cotallo^{c,d}, A. Cruz^{c,d}, L. A. Fernández^{e,d},
A. Gordillo^{f,d}, A. Maiorano^{a,d}, F. Mantovani^{a,b,∗}, E. Marinari^{g},
V. Martín-Mayor^{e,d}, A. Muñoz-Sudupe^{e,d}, D. Navarro^{h,i},
S. Pérez-Gaviro^{c,d}, J.J. Ruiz-Lorenzo^{f,d}, S. F. Schifano^{a,b},
D. Sciretti^{c,d}, A. Tarancón^{c,d}, R. Tripiccione^{a,b}, J. L. Velasco^{c,d}
^a Dipartimento di Fisica, Università di Ferrara, I-44100 Ferrara (Italy)
^b INFN, Sezione di Ferrara, I-44100 Ferrara (Italy)
^c Departamento de Física Teórica, Facultad de Ciencias,
Universidad de Zaragoza, 50009 Zaragoza (Spain)
^d Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50009
^e Departamento de Física Teórica, Facultad de Ciencias Físicas,
Universidad Complutense, 28040 Madrid (Spain)
^f Departamento de Física, Facultad de Ciencia,
Universidad de Extremadura, 06071, Badajoz (Spain)
^g Dipartimento di Fisica, Università di Roma “La Sapienza”, I-00100 Roma (Italy)
^h Departamento de Ingenieria Electrónica y Comunicaciones,
Universidad de Zaragoza, CPS, Maria de Luna 1, 50018 Zaragoza (Spain)
^i Instituto de Investigación en Ingenieria de Aragón (I3A),
Universidad de Zaragoza, Maria de Luna 3, 50018 Zaragoza (Spain)
We describe the hardwired implementation of algorithms for Monte Carlo simu-
lations of a large class of spin models. We have implemented these algorithms as
VHDL codes and we have mapped them onto a dedicated processor based on a large
FPGA device. The measured performance on one such processor is comparable to
O(100) carefully programmed high-end PCs: it turns out to be even better for some
selected spin models. We describe here codes that we are currently executing on the
IANUS massively parallel FPGA-based system.
Key words: Spin models, Monte Carlo methods, reconfigurable computing.
PACS: 05.10.Ln, 05.10.−a, 07.05.Tp, 07.05.Bx.
Preprint submitted to Computer Physics Communications 1 February 2008
1 Introduction
Numerical simulations with Monte Carlo (MC) techniques of spin systems
that show complex behavior (for example because of frustrated quenched
disorder, as in the so-called spin glasses) require huge computational
efforts: the non-trivial structure of the energy landscape, the long
decorrelation time of the dynamics and the need to analyze several different
realizations of the system all conspire to make the problem very challenging
to study numerically. Reference  gives an introduction to the numerical study
of spin glass systems and discusses a number of relevant details.
One of the bottom lines is that traditional computers are not optimized for
the computational tasks that are relevant for discrete variables: a large part
of the CPU time is spent performing logical operations on individual bits, or
on variables that can only take a few states, in contrast to arithmetic on long
data words (32 or 64 bits), which is the typical workload for which computers
are optimized today. This
problem can be turned into an opportunity by the proposal to develop a ded-
icated computer optimized to handle the typical workload associated to these
applications. The use of Field Programmable Gate Arrays (FPGAs) adds flex-
ibility to a dedicated architecture: an FPGA based system can be configured
on-demand to perform with potentially very high efficiency on a variety of
The FPGA approach for the simulation of spin systems was proposed
several years ago , and is now revisited in the IANUS project, a massively
parallel modular system based on a building block of 16 high-performance FP-
GAs. The IANUS architectural concept has been described in , while details
of the hardware prototype, currently undergoing final tests, will be described
elsewhere . In this paper we focus on algorithm mapping: we explore several
avenues to map Monte Carlo algorithms for spin systems on FPGAs, provide
benchmark results for the performance of several associated implementations,
and present some very preliminary results of large scale numerical simulations,
quantifying the potential performance of full-scale IANUS systems.
This paper is structured as follows: Section 2 describes the spin models and the
algorithms we have implemented as our first application for IANUS. Section
3 gives details about the FPGA-based implementation of those models and
algorithms, covering various aspects of the VHDL design. In section 4 we
present some results and performance figures for the test simulations of two
different spin models. Section 5 draws the conclusions of the work developed
so far, and outlines prospects for the near future.

∗ Corresponding author. Filippo Mantovani (firstname.lastname@example.org), Dipartimento
di Fisica, Università di Ferrara, via Saragat 1, I-44100
+39 0532 974610.
2 Monte Carlo simulations of Spin Glass systems
IANUS has been designed as a multipurpose reprogrammable computer; its
first application is the simulation of spin models. We are interested in discrete
models whose variables (the spins) sit at the vertices of a D-dimensional
lattice (the sites of the system). The spin variable associated with site i (s_i) takes
only a discrete and finite set of values (in some cases, just two values).
We define an energy or cost function (the Hamiltonian H) that drives the
dynamics of the system. Once an equilibrium state has been reached, the
configurations that appear in the course of the dynamics are distributed
according to the probability function
$$P \sim e^{-\beta H}, \tag{1}$$
where β is the inverse of the temperature T and tunes the features of the
type of configurations that appear at equilibrium: when β becomes large only
configurations that minimize H are important (when β → ∞ one looks for
optimal configurations, i.e. minima of H), while when β → 0 the weight is not
important and spin configurations become equiprobable. Our local dynamics
will allow us, in this way, to determine important features of physical systems
or for example, in very strict analogy with it, of sets of equalities we want to
Each spin only interacts with its nearest neighbors, i.e. with spins sitting
at sites that are exactly one lattice spacing away. The strength of the
interaction of spins s_i and s_j is proportional to the value of a coupling J_ij,
which in some models (the classical Ising model) is constant over all the bonds
of the lattice (i.e. the connections between two nearest-neighbor sites), or can
vary randomly from pair to pair (in this case, for a given realization of the
model, J_ij depends on i and j: it is fixed when defining the realization of the
model and does not change during the dynamics). The model can be extended
by adding an external magnetic field h_i at every site (h_i can also be a random
variable), or by considering a diluted lattice (only certain sites
of the lattice are occupied by spins, while the others are empty, depending on
the value of the dilution, x_i = 0, 1).
A generic Hamiltonian for two-state (s_i = ±1) models has the form
$$H = -\sum_{\langle i,j \rangle} J_{ij}\, x_i x_j\, s_i s_j - \sum_i h_i x_i s_i, \tag{2}$$
where <i,j> means that the sum is taken over all pairs of neighboring sites
of the lattice.
Hamiltonians of the form (2) define several very interesting models. For
instance, the Edwards-Anderson (EA) spin glass  has x_i = 1 and h_i = 0 for all
sites i, while J_ij takes random values (±1 in our work) with both positive and
negative support. The random field Ising model (RFIM) [6,7] has x_i = 1 and
J_ij = 1 everywhere, but the field at each site takes random values h_i = ±|h|.
Another interesting case is the diluted antiferromagnet in a field (DAFF) ,
which has J_ij = −1 and h_i = h everywhere, while the dilution x_i randomly takes
the value 0 or 1.
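As an illustration, the three models can be written as parameter choices for one energy function. The sketch below assumes the Ising-like Hamiltonian has the form H = −Σ_<i,j> J_ij x_i x_j s_i s_j − Σ_i h_i x_i s_i, with periodic boundary conditions; the function names, the dictionary-based storage and the `occupation` parameter are hypothetical choices made for this example, not part of the IANUS codes.

```python
import random

def forward_neighbors(site, L):
    """The three 'forward' nearest neighbors of a site (periodic boundaries assumed)."""
    i, j, k = site
    return [((i + 1) % L, j, k), (i, (j + 1) % L, k), (i, j, (k + 1) % L)]

def make_model(kind, L, rng, field=1.0, occupation=0.7):
    """Return couplings J, fields h and dilutions x for the EA, RFIM or DAFF model."""
    sites = [(i, j, k) for i in range(L) for j in range(L) for k in range(L)]
    bonds = [(site, d) for site in sites for d in range(3)]
    if kind == "EA":      # random +-1 couplings, no field, no dilution
        J = {b: rng.choice((-1, 1)) for b in bonds}
        h = {s: 0 for s in sites}
        x = {s: 1 for s in sites}
    elif kind == "RFIM":  # ferromagnetic couplings, field of random sign
        J = {b: 1 for b in bonds}
        h = {s: rng.choice((-field, field)) for s in sites}
        x = {s: 1 for s in sites}
    elif kind == "DAFF":  # antiferromagnetic couplings, uniform field, random dilution
        J = {b: -1 for b in bonds}
        h = {s: field for s in sites}
        x = {s: 1 if rng.random() < occupation else 0 for s in sites}
    else:
        raise ValueError("unknown model: " + kind)
    return J, h, x

def energy(spins, J, h, x, L):
    """Assumed form: H = -sum_<ij> J_ij x_i x_j s_i s_j - sum_i h_i x_i s_i.
    Each bond is counted once, through the forward neighbors of every site."""
    total = 0
    for site, s in spins.items():
        for d, nb in enumerate(forward_neighbors(site, L)):
            total -= J[(site, d)] * x[site] * x[nb] * s * spins[nb]
        total -= h[site] * x[site] * s
    return total

if __name__ == "__main__":
    rng = random.Random(0)
    L = 4
    J, h, x = make_model("EA", L, rng)
    spins = {site: rng.choice((-1, 1)) for site in h}
    print("EA energy of a random configuration:", energy(spins, J, h, x, L))
```

Storing one coupling per site and forward direction is only one possible convention; it keeps each bond from being counted twice.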
Models with two-state variables associated with the Hamiltonian (2) are usually
referred to as Ising-like, and their implementation on our FPGA-based
computer is extensively discussed in this paper. Many other spin
models are very important: they have, for example, higher space dimensionality,
are defined on non-regular random graphs, or have longer-range interactions or
multivalued spin variables (for example the Potts models). In this note we also
discuss the implementation of the dynamics of a four-state glassy Potts model
, defined by the Hamiltonian
$$H = -\sum_{\langle i,j \rangle} \delta_{s_i,\, \pi_{i,j}(s_j)},$$
where the sum runs over first-neighbor sites, and the site variables s_i can take
four values. The π_{i,j} are quenched random permutations of (0,1,2,3) (there are
4! of them): the pair of first-neighbor spins (s_i, s_j) has nonzero energy
only if s_i = π_{i,j}(s_j). This model displays a number of features that are typical
of structural glasses, and could hopefully help describe the glassy state, which
remains difficult to understand.
Our goal is to analyze, by numerical Monte Carlo simulations, the properties
of the models described above. We have implemented for the IANUS processor
two well-known algorithms, namely Metropolis and Heat Bath.
Both algorithms update a single spin at a time: they sweep the entire lattice
and then start again. After a (long enough) number of steps one reaches, as
discussed before, an equilibrium state, and the spin configurations that appear
during the dynamics are typical of the probability distribution (1).
In the case of the Metropolis algorithm we propose to update a spin s_i, and
we calculate the corresponding energy change ∆E. If ∆E < 0, the update
lowers the energy function, and the change is accepted. Otherwise we do
not necessarily refuse the update (this would be a β = ∞ dynamics, where we
move to the closest local minimum of H) but we accept it with probability
e^{−β∆E}.
In the case of the Heat Bath algorithm we directly select the new value of the
spin with a probability proportional to the Boltzmann factor
$$P_{HB}(s_i = +1) = \frac{e^{-\beta E_+}}{e^{-\beta E_+} + e^{-\beta E_-}}, \tag{4}$$
where E_+ and E_− are the local energies of the two spin configurations for
spin s_i pointing up (s_i = +1) or down (s_i = −1), respectively. Since when we
change s_i only a few terms of the energy function change (the ones containing
spin s_i and its first neighbors), this is a fast and easy computation.
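The two single-spin rules can be sketched in a few lines. This is a minimal illustration, assuming an undiluted two-state model in which the local energy of spin s_i is E = −s_i·φ, with φ = Σ_j J_ij s_j + h_i the local field; the function names and the convention of passing the random number `r` (uniform in [0, 1)) as an argument are hypothetical choices for this example.

```python
import math

def metropolis_update(s_i, phi, beta, r):
    """Metropolis rule: propose s_i -> -s_i; accept if the energy decreases,
    otherwise accept with probability exp(-beta * delta_e)."""
    delta_e = 2 * s_i * phi          # E(-s_i) - E(s_i), with E(s) = -s * phi
    if delta_e < 0 or r < math.exp(-beta * delta_e):
        return -s_i
    return s_i

def heat_bath_update(phi, beta, r):
    """Heat Bath rule: choose s_i = +1 with probability
    exp(-beta*E_plus) / (exp(-beta*E_plus) + exp(-beta*E_minus))."""
    e_plus, e_minus = -phi, phi      # local energies for s_i = +1 and s_i = -1
    w_plus = math.exp(-beta * e_plus)
    w_minus = math.exp(-beta * e_minus)
    p_up = w_plus / (w_plus + w_minus)
    return 1 if r < p_up else -1

if __name__ == "__main__":
    # A spin aligned against a strong local field is always flipped by Metropolis.
    print(metropolis_update(+1, -6, beta=0.878, r=0.5))
    # With zero local field the Heat Bath rule picks up or down with equal probability.
    print(heat_bath_update(0, beta=0.878, r=0.25))
```

Note that the Heat Bath rule does not need the old value of the spin, while the Metropolis rule does.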
We define one full MC sweep to be the iteration of these simple steps for all
sites of the lattice. The spin configurations that appear during the dynamical
process we are simulating are correlated: a spin configuration depends on the
ones that appeared at earlier times, and only when two such configurations are
separated by a long time can we consider them as independent.
In this way we can define a correlation time (which depends on β and
characterizes the dynamics), roughly the number of Monte
Carlo sweeps it takes to make two spin configurations uncorrelated (see refs.
 and ). An estimate of this correlation time is usually calculated dur-
ing the simulation, taking configurations at various times and measuring their
Other algorithms are used in some simulations, as they offer higher efficiency
in decorrelating the spin configurations (see  for a review). On the one hand, no
very effective specialized algorithm exists, for example, for the very interesting
case of spin glasses (we have in mind here mainly cluster algorithms), and their
implementation on IANUS would probably not be very effective: so we do not
use this kind of algorithm, and stay with simple, local dynamics. On the
other hand, algorithms like Parallel Tempering  are crucial for simulating
complex systems like spin glasses, but their implementation on our FPGA-based
devices is a trivial add-on, so we do not discuss them here.
3 Hardware implementation
The guiding line of our implementation strategy is to try to express all the
parallelization opportunities allowed by the FPGA architecture, matching as
much as possible the potential for parallelism offered by spin systems. Let
us start by noticing that, because of the locality of the spatial interaction
, the lattice can be split in two halves in a checkerboard scheme (we are
dealing with a so-called bipartite lattice), allowing in principle the parallel
update of all white (or black) sites at once. Additionally, one can further boost
performance by updating in parallel more copies of the system. We do so by
updating at the same time two spin lattices (see later for further comments on
this point). Standard PCs cannot efficiently exploit all available parallelism for
several reasons, the most fundamental one being memory architecture, that
prevents the processor from gathering fast enough all variables associated
to the computation. Sharing the simulation between several computers is an
interesting parallel solution, but optimization has a bottleneck in the limited
bandwidth and large latency associated to communication patterns (see ).
The hardware structure of FPGAs allows exploitation of the full parallelism
available in the algorithm, with the only limit of logic resources. As we explain
below, the FPGAs that we use (Virtex4/LX160 and Virtex4/LX200, manu-
factured by Xilinx) have enough resources for the simultaneous update of half
the sites for lattices of up to 8^3 sites. For larger systems there are not enough
logic resources to generate all the random numbers needed by the algorithm
(one number per update, see below for details), so we need more than one
clock cycle to update the whole lattice. In other words, we are in the very
rewarding situation in which: i) the algorithm offers a large degree of allowed
parallelism, ii) the processor architecture does not introduce any bottleneck
to the actual exploitation of the available parallelism, iii) performance of the
actual implementation is only limited by the hardware resources contained in
We have developed a parallel update scheme, supporting 3-D lattices with
L ≥ 16, associated to the Hamiltonian of (2). One only has to tune a few
parameters to adjust the lattice size and the physical parameters defined in
H. We regard this as an important first step in the direction of creating flexible
enough libraries of application codes for FPGA-based computers.
The number of allowed parallel updates depends on the number of logic cells
available in the FPGAs. For the Ising-like codes developed so far, we update
up to 1024 sites per clock cycle on a Xilinx Virtex4-LX200, and up to 512
sites/cycle on the Xilinx Virtex4-LX160. The algorithm for the Potts model
requires more logic resources and larger memories, so performance drops to
256 updates/cycle on both the LX200 and LX160 FPGAs.
We now come to the description of the actual algorithmic architecture, shown
in fig. 1.
Fig. 1. Parallel update scheme. The spins that must be updated, their neighbors,
the couplings and all other relevant values are passed to the update cells where the
energy is computed. The result is used as a pointer to a Look-up Table (LUT).
The associated value is compared with a random number (RNG), and following the
comparison, the updated spin value is computed.
In short, we have a set of update cells (512 in the picture): they receive as
input all the variables and the parameters needed to perform all required
arithmetic and logic operations, and compute the updated value of the spin
variable. Data (variables and parameters) are kept in memories and are fed to
the appropriate update cell. Updated values are written back to memory, to
be used for subsequent updates.
The choice of an appropriate storage structure for data and the provision of
enough data channels to feed all update cells with the data they need is a
complex challenge; designing the update cells is a comparatively minor task.
Hence we describe first the memory structures of our codes, followed by some
details on the architecture of the update cells.
Virtex-4 FPGAs have several small RAM-blocks that can be grouped together
to form bigger memories. We use these blocks to store all data items: spins,
couplings, dilutions and external fields. The configurable logic blocks are used
for random number generators and update cells.
To update one spin of a three dimensional model we need to read its six near-
est neighbors, six couplings, the old spin value (for the Metropolis algorithm)
and some model-dependent information such as the magnetic field for RFIM
and the dilution for DAFF. All these data items must be moved to the appro-
priate update cells, in spite of the hardware bottleneck that only two memory
locations in each block can be read/written at each clock cycle.
Let us analyze first the Ising models, considering for definiteness L = 16. We
choose to use an independent memory of size L^3 for each variable. This is
actually divided into smaller memories, arranged so that reading one word
from each gives us all the data needed for a single update cycle. We need
16^3 = 4096 bits to store all the spins of one replica. We have 16 vertical
planes, and save each plane in a different memory of width 16 bits and height
16 (see Fig. 2). In this simple case the logic resources within the FPGA allow
us to update one whole horizontal plane in one clock cycle (because we mix the
two bipartite sublattices of two different copies of the system, see the following
discussion), and the reading rate matches requirements, as we need to read
only one word from each of the sixteen memories.
Fig. 2. Examples of the spin memory structure: L=16 and L=32.
The configuration is slightly more complex when the size of the lattice grows
and the update of a full plane in just one clock cycle is no longer possible.
In this case we split each plane into a variable number of blocks N_B, adjusted
so that all the spins of each block can be updated in one clock cycle. The
number of independent memories is L/N_B, as only these need to be read at
the same time. The data words still have width L, while the height is L × N_B
to compensate for the reduced number of memories. Considering L = 32,
for example, we have a plane made of 32^2 = 1024 spins, too large to be
updated in one cycle (in the Xilinx Virtex4-LX160). We split it into two blocks
of 32 × 16 = 512 spins each. To read 16 lines every clock cycle we store the
spins in 16 memories, each of width 32 bits and height 32 × 2: the total size
of the memory is still 32^3 bits.
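The arithmetic of this layout can be checked with a short sketch. The function name and the parameter `spins_per_cycle` (the number of spins of one plane that can be updated in a clock cycle) are hypothetical; the rules for the number of memories, their width and their height are the ones stated above.

```python
def spin_memory_layout(L, spins_per_cycle):
    """Return (n_blocks, n_memories, width_bits, height_words) for an L^3 spin lattice."""
    plane = L * L
    # Smallest number of blocks such that one block fits in one clock cycle.
    n_blocks = max(1, -(-plane // spins_per_cycle))   # ceiling division
    if L % n_blocks != 0:
        raise ValueError("L must be divisible by the number of blocks")
    n_memories = L // n_blocks   # memories that are read at the same time
    width = L                    # one lattice line per memory word
    height = L * n_blocks        # deeper memories compensate for their smaller number
    return n_blocks, n_memories, width, height

if __name__ == "__main__":
    for L in (16, 32):
        nb, nm, w, h = spin_memory_layout(L, spins_per_cycle=512)
        print(L, nb, nm, w, h, "total bits:", nm * w * h)
```

For L = 16 this gives 16 memories of 16 × 16 bits, and for L = 32 it gives 16 memories of width 32 and height 64; in both cases the total is L^3 bits.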
As already remarked, we simulate two different replicas in the same FPGA.
This trick bypasses the parallelism limit of our MC algorithms (nearest neigh-
bors cannot be updated at the same time, see  ). We mesh the spins of the
two replicas in a way that puts all the whites of one replica and the blacks of
the other in distinct memories that we call respectively P and Q (see Fig.3).
Every time we update one slice of P we handle one slice of whites for replica 1
and one slice of blacks for replica 2. Obviously the corresponding slice of mem-
ory Q contains all the black neighbors of replica 1 and all the white neighbors
of replica 2.
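The property that makes this meshing useful can be verified with a small script. The sketch below assumes a checkerboard coloring by coordinate parity and periodic boundaries with even L; the function names are hypothetical. It checks that every spin stored in P has all six neighbors of its own replica in Q, and vice versa, so that a whole slice of P can be updated at once.

```python
def color(site):
    """Checkerboard color by coordinate parity: 0 = white, 1 = black."""
    return sum(site) % 2

def memory_of(replica, site):
    """P holds the whites of replica 1 and the blacks of replica 2; Q holds the rest."""
    white = color(site) == 0
    if replica == 1:
        return "P" if white else "Q"
    return "Q" if white else "P"

def neighbors(site, L):
    """Six nearest neighbors on a periodic cubic lattice."""
    i, j, k = site
    return [((i + 1) % L, j, k), ((i - 1) % L, j, k),
            (i, (j + 1) % L, k), (i, (j - 1) % L, k),
            (i, j, (k + 1) % L), (i, j, (k - 1) % L)]

def meshing_is_consistent(L):
    """True if no spin shares a memory with any neighbor of the same replica."""
    sites = [(i, j, k) for i in range(L) for j in range(L) for k in range(L)]
    for replica in (1, 2):
        for site in sites:
            here = memory_of(replica, site)
            if any(memory_of(replica, nb) == here for nb in neighbors(site, L)):
                return False
    return True

if __name__ == "__main__":
    print(meshing_is_consistent(16))
```

Each lattice position contributes exactly one spin to P and one to Q, so both memories have the size of a single lattice.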
Fig. 3. Structure of spin configuration memories: meshing of replicas.
The amount of memory available in the FPGA limits the lattice size we can
simulate and the models we can implement. In both the Virtex4-LX160 and
LX200 it is possible to simulate EA, RFIM and DAFF models in 3D with
size up to L = 88 (not all smaller sizes are allowed). Because of the dramatic
critical slowing down of the dynamics of interesting complex spin models these
size limits are comfortably larger than what we can expect to be able to study (even
with the tremendous power made available by IANUS) in a reasonable amount
of (wall-clock) time: memory size is presently not a bottleneck.
Things are even more complicated when one considers multi-state variables,
as more bits are required to store the state of the system and all associated
parameters. In the four state Potts model (see sec. 2.1) the site variables need
two bits and the couplings eight bits. In order to keep a memory structure
similar to that outlined before we store each bit in a different memory. For
example a lattice with L = 16 requires 16 × 2 memories for the site variables
(they were sixteen in the Ising case), and 16 × 8 memories for the couplings.
The lattice meshing scheme is maintained. With our reference FPGAs we
can simulate a three-dimensional Potts model with at most L = 40 and a
four-dimensional Potts model with L = 16.
We now come to the description of the update cells. The Hamiltonian we have
written is homogeneous: the interaction has the same form for every site of
the lattice, and it only depends on the values of the couplings, the fields and
the dilutions. This means that we can write a standard update cell and use it
as a black box to update all sites: it will give us the updated value of the spin
(provided that we feed the correct inputs). This choice makes it easy to write
a parametric program, where we instantiate the same update cell as many
times as needed.
We have implemented two algorithms: Metropolis and Heat Bath. The update
cell receives as input the couplings, the nearest-neighbor spins, the field and the dilution and,
if appropriate, the old spin value (for the Metropolis dynamics). The cell uses
these values as specified by (2) and computes a numerical value between 0 and
15 (the range varies depending on the model) used as an input to a LUT. The
value read from the LUT is compared with a random number and the new spin
state is chosen depending on the result of the comparison. Once again, things
are slightly different for the Potts model due to the multi-state variables and
Our goal is to update in parallel as many variables as possible, which means
that we want to maximize the number of cells that will be accessing the LUT
at the same time. In order to avoid routing congestion at the hardware layer we
replicate the LUTs: each instance is read only by two update cells. The waste
in logic resources – the same information is replicated many times within the
processor – is compensated by the higher allowed clock frequency.
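A software analogue of an update cell may help fix ideas. The sketch below is only an illustration under stated assumptions, not the VHDL design: it takes the EA model with the Heat Bath rule, encodes spins and couplings as single bits (1 for +1, 0 for −1), uses as LUT index the number n of neighbors with J_ij s_j = +1 (so the local field is 2n − 6), and stores in the LUT the up-spin probability scaled to a 32-bit integer. All names are hypothetical.

```python
import math

def build_heat_bath_lut(beta, n_neighbors=6, bits=32):
    """LUT[n] = P(s_i = +1) scaled to an integer threshold, for n = 0..n_neighbors."""
    lut = []
    for n in range(n_neighbors + 1):
        phi = 2 * n - n_neighbors                    # local field sum_j J_ij s_j
        p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * phi))
        lut.append(min(int(p_up * 2 ** bits), 2 ** bits - 1))
    return lut

def update_cell(neighbor_bits, coupling_bits, lut, rnd):
    """Return the new spin bit given neighbor spins, couplings and a random integer.
    J_ij * s_j = +1 exactly when the two bits are equal (an XNOR in hardware)."""
    n = sum(1 for s, j in zip(neighbor_bits, coupling_bits) if s == j)
    return 1 if rnd < lut[n] else 0

if __name__ == "__main__":
    lut = build_heat_bath_lut(beta=0.878)
    # Three neighbors favor up and three favor down: the threshold is one half.
    print(lut[3] == 2 ** 31)
    print(update_cell([1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1], lut, rnd=12345))
```

In this encoding the whole exponential is precomputed, so the cell itself only counts bits, reads one table entry and performs one integer comparison.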
3.3 Random numbers
Monte Carlo methods depend strongly on the random numbers used to drive
the updates: this determines the imperative need to implement a very reliable
pseudo-random number generator (RNG), that produces a sequence of num-
bers under the selected distribution, with no known or evident pathologies.
We use the Parisi-Rapuano shift register method  defined by the rules:
$$I(k) = I(k-24) + I(k-55), \qquad R(k) = I(k) \otimes I(k-61), \tag{5}$$
where I(k−24), I(k−55) and I(k−61) are elements (32-bit wide) of a so called
wheel that we initialize with externally generated random values. I(k) is the
new element of the updated wheel, and R(k) is the generated pseudo-random
A straightforward implementation of this algorithm produces one random
number at each step, for each wheel that we maintain. A wheel uses many
hardware resources (in our case we use the three pointer values 24, 55 and 61
so we need to store 62 numbers), and the random number generator is a sys-
tem bottleneck, since the number of updates per clock cycle is limited by how
many random values we are able to produce. A large performance improve-
ment comes from the implementation of the wheel through logic (as opposed
to memory) blocks, as the former can be written in cascade-structured combi-
natorial logic that may be tuned to produce several numbers per clock cycle.
We can exploit this feature and use a limited number of wheels to produce
more numbers, thus increasing the number of updates per clock. Remember
that to produce one random number we must save the result of the sum of
two values and then perform the XOR with a third value. The wheel is then
shifted and the computed sum fills the empty position. All this is done with
combinatorial logic, so one can produce various pseudo-random numbers sim-
ply replicating these operations and, of course, increasing logic complexity. A
schematic representation of a simplified case is given in fig. 4.
Fig. 4. Hardware implementation of the Parisi-Rapuano RNG. For graphical reasons
the example refers to a wheel of only 20 numbers and following the reduced equations
I(k) = I(k − 10) + I(k − 14) and R(k) = I(k) ⊗ I(k − 20). The combinatorial logic
complexity grows when producing more numbers.
The logic complexity of the implementation depends on the parameters of
(5) and on the quantity of random numbers we need. We use one wheel to
generate up to 96 numbers per clock (so more wheels are active at the same
time to compute all needed random values).
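For reference, the recurrence can be written as a short software model. This is a sketch, not the hardware implementation: it assumes that the sum of the 32-bit elements wraps modulo 2^32 and reads ⊗ as the bitwise XOR mentioned in the text; the class and method names are hypothetical, and the wheel is a plain list of 62 values.

```python
MASK32 = 0xFFFFFFFF

class ParisiRapuanoWheel:
    """Software model of one wheel: I(k) = I(k-24) + I(k-55), R(k) = I(k) XOR I(k-61)."""

    def __init__(self, seed_values):
        if len(seed_values) != 62:
            raise ValueError("the wheel needs 62 initial values")
        # wheel[-n] holds I(k-n) for the next step k.
        self.wheel = [v & MASK32 for v in seed_values]

    def next_value(self):
        w = self.wheel
        new = (w[-24] + w[-55]) & MASK32   # 32-bit sum (wrap-around is an assumption)
        r = new ^ w[-61]                   # XOR with a third element of the wheel
        w.append(new)                      # the computed sum enters the wheel ...
        del w[0]                           # ... and the oldest element is shifted out
        return r

    def next_block(self, n):
        """Produce n numbers; the hardware computes them within one clock cycle."""
        return [self.next_value() for _ in range(n)]

if __name__ == "__main__":
    import random
    rng = random.Random(2007)              # stands in for externally generated seeds
    wheel = ParisiRapuanoWheel([rng.getrandbits(32) for _ in range(62)])
    print(wheel.next_block(4))
```

In software the numbers of a block are produced one after the other; the point of the combinatorial implementation is that the same chain of sums and XORs is unrolled in logic.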
To keep the wheel safely below its period limit we choose to reload the wheel
every now and then (for example every 10^7 MC sweeps).
With respect to the choice of 32-bit random numbers, we have verified that
this word size is sufficient for the models we want to simulate (our tests show
that 24-bit would be enough). Other models may require better random num-
bers. We do not address this issue here. We just note that generating random
numbers of larger size (e.g., 40 or even 64-bit) would be straightforward, at
the price, of course, of a larger resource usage.
All in all, our carefully handcrafted VHDL codes use a very large fraction
of the available FPGA resources, as measured by the number of used logic
blocks and RAM-blocks. The following table shows figures for the Ising-like
and Potts-model codes. Mapping has been performed with the ISE toolkit
made available by Xilinx. The Ising-like code is limited by logical resources,
while the Potts model, with its larger storage requirements, is limited by
available memory space.
Model            Resource     Number used  % (LX160)  % (LX200)
Ising-like       Log. blocks  157,649      (117%)     88%
(1024 updates)   RAM-blocks   160          56%        47%
Ising-like       Log. blocks  83,651       62%        46%
(512 updates)    RAM-blocks   80           28%        23%
Potts q = 4      Log. blocks  117,586      86%        66%
                 RAM-blocks   224          77%        67%
Table 1. Use of FPGA resources, as absolute values and as fraction of available blocks on
our FPGAs, for the Ising-like and Potts codes. In both cases, the 3-D lattice has a
linear size L = 32. The Ising-like code is limited by available logic resources, while
the Potts code is memory-limited.
4 Benchmark tests
4.1 Edwards-Anderson spin glass model
We have simulated an L = 32 3D system at β(= 1/T) = 0.878. The total number
of MC sweeps is 8 × 10^9. See reference  for previous simulations
done with the special purpose machine SUE on a lattice of size L = 20
Checking that thermalization has been reached is a common and non-trivial
problem in spin glass simulations. Here we provide only a short review of our
analysis: full details will be published elsewhere. In our early tests, configurations
were copied to the host computer every 10^6 MC sweeps, because, when
performing these tests, we had a very slow communication channel to the
host¹. This interval is too long to resolve clearly the evolution towards equilibrium
during the first sweeps. Fig. 5 shows the MC history of a physically meaningful
quantity, the squared overlap q^2; a zoom of the leftmost part of the plot (inset)
shows the drift from the initial value (0.045) to a value probably very
close to the equilibrium value in less than 50 × 10^6 sweeps.
Fig. 5. Evolution of q^2: the x-axis scale is 10^6 MC sweeps per file
We have analyzed the thermalization rate also with the standard log2L data
binning: we divide the data points into four groups of variable size (namely
the last half of the measures, then the previous quarter, the previous eighth
and the sixteenth before this) and then average over all samples in each group.
From the smaller 1/16th to the bigger 1/2 group the averaged value is expected
to shift toward its equilibrium value. Fig. 6 shows the behavior of the squared
overlap q^2. The time dependence we observe in the latest data is very small:
it shows no systematic drift and is surely far smaller than the
A clean visual representation of the system thermalization is also given via
the average overlap probability distribution P(q), which should be symmetric
at equilibrium (with no external field), as shown in fig. 7: this is obviously
only a necessary condition for thermalization, but it surely is a good sign.
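The data binning used in the thermalization test described above (last half of the measurements, then the previous quarter, eighth and sixteenth) can be sketched as follows. The function name and the list-based input are hypothetical; the sketch only averages one time series, while the analysis in the text also averages over samples.

```python
def log2_binning(measurements, n_groups=4):
    """Average the last 1/2, the previous 1/4, 1/8, ... of a time series.
    Groups are returned in time order, from the earliest (smallest) to the latest."""
    n = len(measurements)
    if n < 2 ** n_groups:
        raise ValueError("too few measurements for the requested number of groups")
    averages = []
    for g in range(1, n_groups + 1):
        start = n // 2 ** g          # group g covers [n / 2^g, n / 2^(g-1))
        end = n // 2 ** (g - 1)
        chunk = measurements[start:end]
        averages.append(sum(chunk) / len(chunk))
    averages.reverse()
    return averages

if __name__ == "__main__":
    # A series that relaxes towards 0.3: the last groups agree within a small drift.
    series = [0.3 + 0.2 * 0.5 ** t for t in range(64)]
    print(log2_binning(series))
```

If the system is thermalized, the averages of the last groups should agree within statistical errors.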
¹ The situation has now improved dramatically. The I/O interface to the host
computer is discussed in detail in 
Fig. 6. Thermalization test of the squared overlap q^2 as a function of β = 1/T.
Fig. 7. Distribution of the overlap q, showing a reasonably symmetric behavior,
within error bars.
The algorithms described in the previous section are mapped on the selected
FPGAs with a system clock of 62.5 MHz. At each clock cycle, 512 (1024) spins
are updated on an LX160 (LX200), corresponding to an average update time
of 32 ps/spin (16 ps/spin).
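These update times follow directly from the clock frequency and the number of spins updated per cycle, as the short check below shows (the function name is hypothetical). The exact values are 31.25 and 15.625 ps/spin, which the text rounds to 32 and 16 ps/spin.

```python
def update_time_ps(clock_hz, spins_per_cycle):
    """Average time per spin update, in picoseconds."""
    return 1e12 / (clock_hz * spins_per_cycle)

if __name__ == "__main__":
    print("LX160:", update_time_ps(62.5e6, 512), "ps/spin")
    print("LX200:", update_time_ps(62.5e6, 1024), "ps/spin")
```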
It is interesting to compare these figures with those appropriate for a PC.
Understanding what exactly has to be compared is not completely trivial.
The fastest PC code for spin model simulations available to us is the multi-
spin coding, which updates in parallel a large number of (up to 128) samples
of the system at the same time, using only one random number generator,
which is shared across all samples: this scheme is useful to obtain a large
number of configuration data, appropriate for statistical analysis. We call this
an asynchronous multi-spin coding (AMSC) as inside each sample there is no
As we have stated before, the biggest problem with the models we want to
study is the decorrelation time, and the large number of Monte Carlo sweeps
that it may take to bring a configuration to equilibrium. The AMSC procedure
has a serious problem here since each sample evolves for the same number of
sweeps as if it were the only one being simulated. In other words, efficient
codes on a PC achieve high overall performance by simulating for relatively
short MC time a large number of independent samples. A code that updates
in parallel more spins belonging to the same system would be more useful to
attain equilibrium, when working on large systems. The resulting algorithm,
synchronous MSC (SMSC), takes less time to simulate one sample, but the
global performance is lowered because of more complex operations involved
and the need to use an independent random number for each spin. The SMSC
PC-code available to us updates up to 128 spins of a single sample. Syn-
chronous codes are not commonly used in PC-based numerical simulations
because of their poor overall performance.
Generally speaking we think that comparison with a SMSC code is appropriate
for a single FPGA system, while comparison with an AMSC code is more
relevant when considering a massively parallel IANUS system (we plan to
build a system with 256 FPGA-based nodes). Here we simply present our
preliminary comparison data for both cases in table 2.
             LX160       LX200       PC (SMSC)     PC (AMSC)
Update rate  32 ps/spin  16 ps/spin  3000 ps/spin  700 ps/spin
Table 2. Comparing the performance of two Xilinx Virtex4 FPGAs and two different codes
running on a high-end PC.
The MSC values refer to an Intel Core2Duo (64 bit) 1.6 GHz processor.
Inspection of table 2 tells us that one LX160 runs as fast as 90 PCs, while
the LX200 performance is comparable to that of 180 PCs. In other words,
the 8 × 10^9 MC iterations required to thermalize a lattice of size L = 32 took
approximately 6 hours to be completed on just one Virtex4-LX200: they would
take 18 days on a PC running the SMSC algorithm.
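The PC-equivalent figures are simply ratios of the update times quoted for the FPGAs and for the PC codes. A minimal check, with a hypothetical function name and the quoted update times hard-coded:

```python
# Update times in ps/spin, as quoted in the text.
UPDATE_TIME_PS = {"LX160": 32, "LX200": 16, "PC (SMSC)": 3000, "PC (AMSC)": 700}

def pc_equivalents(fpga, pc_code):
    """How many PCs running pc_code match the update rate of one FPGA."""
    return UPDATE_TIME_PS[pc_code] / UPDATE_TIME_PS[fpga]

if __name__ == "__main__":
    for fpga in ("LX160", "LX200"):
        for code in ("PC (SMSC)", "PC (AMSC)"):
            print(fpga, "vs", code, ":", pc_equivalents(fpga, code))
```

Against the SMSC code the ratios are about 94 and 188, consistent with the rounded figures of 90 and 180 PCs; against the AMSC code they are about 22 and 44.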
Performance comparison with published work is difficult. As far as we know
the SMSC is not used for massive simulations, so performance data for
this algorithm are scarce. The AMSC is commonly used. Even though it is
considered almost a standard in spin glass simulations, we have not been able
to find a recent speed analysis. The seminal works on this algorithm  and
 are way too old in technology terms to allow a fair comparison. All in all,