arXiv:0804.4639v1 [astro-ph] 29 Apr 2008
Beowulf Analysis Symbolic INterface BASIN:
Interactive Parallel Data Analysis for Everyone∗
E. Vesperini1, D.M. Goldberg1, S. McMillan1, J. Dura2, D. Jones2
1Department of Physics, Drexel University, Philadelphia, PA
email@example.com, firstname.lastname@example.org, email@example.com
2Department of Computer Science, Drexel University, Philadelphia, PA
April 29, 2008
Submitted for publication to Computing in Science and Engineering
special issue on Computational Astrophysics
The advent of affordable parallel computers such as Beowulf PC clusters and, more
recently, of multi-core PCs has been highly beneficial for a large number of scientists
and smaller institutions that might not otherwise have access to substantial computing
facilities. However, there has been no analogous progress in the development and
dissemination of parallel software: scientists need the expertise to develop parallel
codes and have to invest a significant amount of time in the development of tools even
for the most common data analysis tasks. We describe the Beowulf Analysis Symbolic
INterface (BASIN), a multi-user parallel data analysis and visualization framework.
BASIN is aimed at providing scientists with a suite of parallel libraries for astrophysical
data analysis along with general tools for data distribution and parallel operations on
distributed data to allow them to easily develop new parallel libraries for their specific
1 You have a Supercomputer. Are you a
The first “Beowulf” PC cluster provided a proof of concept that sparked a revolution
in the use of off-the-shelf computers for numerical simulations, data mining, and statis-
tical analysis of large datasets. Because these clusters use commodity components, this
paradigm has provided an unbeatable ratio of computing power per dollar, and over the past
decade Beowulf-class systems have enabled a wide range of high-performance applications
on department-scale budgets.
Beowulfs have unquestionably been highly beneficial in our own field of Astrophysics, espe-
cially in smaller academic institutions that might not otherwise have access to substantial
computing facilities. Beowulfs have been successfully applied to numerical simulations that
use a variety of algorithmic approaches to study systems spanning the range of scales from
individual stars and supernovae to the horizon size of the universe.
Unfortunately, progress in the development of affordable parallel computers has not been
matched by analogous progress in the development and dissemination of parallel software
aimed at providing scientists with the tools needed to take advantage of the computing power
of parallel machines. Scientists with access to a Beowulf cluster still need the expertise to
develop parallel codes, and have to spend a tremendous amount of time in the development
and testing of tools, even for the most common data analysis tasks. This is in stark contrast to
the situation for serial data analysis and simulations, for which a large number of standard
general-purpose (e.g. Matlab, Maple, R/Splus, Mathematica, IDL) and specialized (e.g.
IRAF for astronomical data analysis) tools and libraries exist.
The recent advent of multi-core PCs has broadened still further the base of commodity
machines that enable computationally intensive parallel simulations and data analysis of
large datasets. However, this impressive technological advance serves also to make the lack
of general tools for parallel data analysis even more striking. The result is a significant
barrier to entry for users or developers of parallel computing applications.
As the gap between increasing parallel computing power and the availability of software
tools needed to exploit it has become more and more evident, a number of projects have
attempted to develop such tools. Our team has developed a package of parallel computational
tools—the Beowulf Analysis Symbolic INterface (BASIN)—to deal with precisely these
issues. BASIN is a suite of parallel computational tools for the management, analysis and
visualization of large datasets. It is designed with the idea that not all scientists need to
be specialists in numerics. Rather, a user should be able to interact with his or her data
in an intuitive way. In its current form, the package can be used either as a set of library
functions in a monolithic C++ program or interactively, from a Python shell, using the
BASIN Python interface.
BASIN is not the only package with this goal in mind. This magazine has recently presented
descriptions of Star-P (a commercial package aimed at providing environments such as
Matlab, Maple and others with seamless parallel capabilities) and PyBSP (a Python library
for the development of parallel programs following the bulk synchronous parallel model). As
an open-source project, BASIN's growth is to be driven by the needs and contributions of users
and developers both in the astrophysical community and, possibly, in other computationally
Figure 1: Schematic diagram illustrating the interaction between the main components of the
BASIN pipeline: the Data Analysis Engine, the User Interface, and the Graphics Display
Engine. Several remote users (labeled in blue) may connect to and issue commands on an
existing parallel compute engine running the BASIN kernel and visualization engine, and
may share the same datasets in a collaborative data analysis session.
The three basic components of BASIN are the Data Analysis Engine (hereafter DAE),
the User Interface (hereafter UI) and the Graphics Display Engine (hereafter GDE). Fig. 1
illustrates the relationships among these system components.
The User Interface (UI)
Although the entire set of BASIN functions and tools can be accessed from within a
C++ program using BASIN as a library, the primary mode of usage of BASIN is one
in which a user accesses its functionality from within an interactive Python shell, applying
the BASIN analysis and visualization tools to a “live” dataset during an interactive data
analysis session. A Python GUI that provides “point-and-click” access to all the
BASIN functionality is also available.
The communication between the client and the DAE is managed by the tools included in
IPython1. A particularly important feature of the IPython1 package is the capability of
allowing multiple clients to connect to the same remote engine. Taking advantage of this
feature, multiple distributed BASIN users can connect to the same DAE, share the same
data and collaborate on a data analysis session. The UI includes functionality to keep track
of and save part or the entire history of a data analysis session. For a multi-user session, the
history also includes information on which of the participating users issued each command.
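The per-user session history described above can be pictured with a small data structure. The following is only an illustrative sketch: the class and method names (SessionHistory, record, dump) are hypothetical and are not part of the BASIN or IPython1 interfaces.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HistoryEntry:
    user: str      # which participant issued the command
    command: str   # the command text as typed


@dataclass
class SessionHistory:
    """Hypothetical shared log for a multi-user session (not the BASIN API)."""
    entries: List[HistoryEntry] = field(default_factory=list)

    def record(self, user: str, command: str) -> None:
        """Append one command together with the name of its author."""
        self.entries.append(HistoryEntry(user, command))

    def dump(self, start: int = 0, stop: Optional[int] = None) -> str:
        """Return all or part of the history as text, one command per line."""
        chosen = self.entries[start:stop]
        return "\n".join(f"[{e.user}] {e.command}" for e in chosen)


history = SessionHistory()
history.record("alice", "x = attribute_random(1000000,seed)")
history.record("bob", "r = sqrt(pow(x,2)+pow(y,2))")
print(history.dump())       # the entire session
print(history.dump(1, 2))   # only the second command
```

Storing the author alongside each command is what allows either the whole session or a selected part of it to be saved with attribution intact.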
The Data Analysis Engine (DAE)
The DAE is the core of the BASIN package and is responsible for parallel data management
and all data analysis operations. It is implemented in C++/MPI-2 and Python wrappers
are created using the Boost.Python library.
The basic DAE objects visible to the user are Regions of data, the Data objects associated
with them and the individual Attributes comprising these Data objects. For example,
for data coming from a simulation, a Region might be a set of nested grids of density fields,
or arrays of particle properties. On the other hand, for a galaxy survey, a Region might
represent a physical region of the sky, or data taken over a particular period of time.
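The containment relation among Regions, Data objects and Attributes can be sketched as plain Python classes. This is a serial toy model meant only to show the hierarchy; the field names and constructors are assumptions and do not reproduce the actual BASIN classes, which are distributed C++ objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Attribute:
    """One named property, e.g. the x coordinates of all particles."""
    name: str
    values: List[float]


@dataclass
class Data:
    """A collection of Attributes describing the same set of elements."""
    attributes: Dict[str, Attribute] = field(default_factory=dict)

    def add(self, attr: Attribute) -> None:
        self.attributes[attr.name] = attr


@dataclass
class Region:
    """A named container of Data objects, e.g. one simulation snapshot."""
    name: str
    data: Dict[str, Data] = field(default_factory=dict)


# Hypothetical N-body snapshot holding three particles.
particles = Data()
particles.add(Attribute("x", [0.0, 1.0, 2.0]))
particles.add(Attribute("mass", [1.0, 1.0, 2.0]))
snapshot = Region("snapshot", {"particles": particles})

print(sorted(snapshot.data["particles"].attributes))
```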
Two distinct types of distributed data are supported: Grids describe regularly sampled
data, e.g. density values on a 3-D Cartesian grid from a CFD calculation. Lists repre-
sent irregularly sampled data, such as a set of arrays representing the properties of particles
in a snapshot from an N-body simulation, or the right ascensions, declinations, and red-
shifts of galaxies in a survey. Each individual property in a List or Grid is stored in an
Attribute which can be accessed by the user, and on which the user can perform a variety
of operations. Some of the main features of Attributes are:
1. Physically distributed data stored in an Attribute are accessed by the user as if they
were a single data structure on a shared-memory machine. The initial data distribution
is automatically managed by the class constructor. For example,
>>> x = attribute_random(1000000,seed)
will generate 10^6 random numbers uniformly distributed (by default) between 0 and
1, and automatically distribute them as evenly as possible across all nodes. As in a
shared-memory machine, global indexes can be used to access individual data elements.
All the complexities of data distribution and retrieval in a distributed-memory cluster
are handled internally and hidden from the user, who can operate as though working in
a shared-memory environment.
2. Attributes can store n-dimensional arrays of both primitive datatypes and user-
defined structures and objects.
3. All the math operators and the math library functions are overloaded to operate on
Attributes in a parallel vectorized form. For example, a new radius attribute may
be computed from a list of Cartesian components in a completely intuitive way:
>>> r = sqrt(pow(x,2)+pow(y,2))
4. All relational and logical operators are similarly overloaded to operate in a parallel
vectorized form on Attributes.
5. In order to take advantage of faster access to local data, the Attribute class allows
each process to locate its own block of data and, if possible, to work on just that block,
thus avoiding unnecessary inter-process data transfer.
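To make the ideas of even distribution, global indexing and local blocks concrete, the sketch below shows one common way to split N elements across P processes and to translate a global index into an owning process and a local offset. It assumes a simple contiguous block layout; the text does not specify the layout BASIN actually uses, and the function names block_bounds and locate are hypothetical.

```python
from typing import Tuple


def block_bounds(n: int, nprocs: int, rank: int) -> Tuple[int, int]:
    """Half-open global index range [start, stop) owned by process `rank`.

    The first n % nprocs processes receive one extra element, so block
    sizes differ by at most one.
    """
    base, extra = divmod(n, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def locate(global_index: int, n: int, nprocs: int) -> Tuple[int, int]:
    """Map a global index to (owning rank, local offset within its block)."""
    if not 0 <= global_index < n:
        raise IndexError(global_index)
    base, extra = divmod(n, nprocs)
    boundary = extra * (base + 1)   # first index owned by a smaller block
    if global_index < boundary:
        rank = global_index // (base + 1)
    else:
        rank = extra + (global_index - boundary) // base
    start, _ = block_bounds(n, nprocs, rank)
    return rank, global_index - start


n, nprocs = 10, 4
print([block_bounds(n, nprocs, r) for r in range(nprocs)])
# [(0, 3), (3, 6), (6, 8), (8, 10)]
print(locate(7, n, nprocs))
# (2, 1)
```

With such a mapping, an element-wise operation only ever touches the range returned by block_bounds for the local rank, while an access by global index first calls locate to find which process must supply the value.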
We have developed libraries for several specific astrophysical subfields. For example, the
StellarDynamics library includes many functions normally used in the analysis of N-body
simulations of galaxies and star clusters, such as the determination of the center of the stellar
distribution, measurement of characteristic scales (such as the core, half-mass, tidal, and
other Lagrangian radii), robust estimates of local densities, and the computation of radial
and other profiles of physically important quantities, such as density, velocity dispersion,
ellipticity, anisotropy, mass function, and so on. Similarly, the Cosmology library allows a
user to perform operations commonly encountered in cosmological contexts—for example,
read in a list of galaxy redshifts, select a cosmological model, then compute a host of model-
specific properties, such as lookback time and comoving distance. A user can transform
from observational parameters (α, δ, z) to Cartesian coordinates, and quickly and easily
view lengthy and/or complex datasets.
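As an illustration of the quantities mentioned for the Cosmology library, the following serial sketch computes the comoving distance and lookback time for a flat matter plus cosmological-constant model and converts (α, δ, z) to comoving Cartesian coordinates. The function names, default parameter values (H0 = 70 km/s/Mpc, Ωm = 0.3) and the use of Simpson's rule are assumptions made for this example; they are not the BASIN interface.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def E(z, omega_m, omega_l):
    """Dimensionless Hubble rate H(z)/H0 for a flat matter + Lambda model."""
    return math.sqrt(omega_m * (1.0 + z) ** 3 + omega_l)


def simpson(f, a, b, n=200):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0


def comoving_distance(z, h0=70.0, omega_m=0.3):
    """Line-of-sight comoving distance in Mpc (flat universe assumed)."""
    omega_l = 1.0 - omega_m
    integral = simpson(lambda zp: 1.0 / E(zp, omega_m, omega_l), 0.0, z)
    return (C_KM_S / h0) * integral


def lookback_time(z, h0=70.0, omega_m=0.3):
    """Lookback time in Gyr (flat universe assumed)."""
    omega_l = 1.0 - omega_m
    hubble_time_gyr = 977.8 / h0   # 1/H0 in Gyr for H0 in km/s/Mpc
    integral = simpson(
        lambda zp: 1.0 / ((1.0 + zp) * E(zp, omega_m, omega_l)), 0.0, z)
    return hubble_time_gyr * integral


def to_cartesian(ra_deg, dec_deg, z, **cosmo):
    """Convert (alpha, delta, z) to comoving Cartesian coordinates in Mpc."""
    d = comoving_distance(z, **cosmo)
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (d * math.cos(dec) * math.cos(ra),
            d * math.cos(dec) * math.sin(ra),
            d * math.sin(dec))


print(comoving_distance(1.0), lookback_time(1.0))
print(to_cartesian(150.0, 2.2, 1.0))
```

In a parallel setting each process would apply the same formulas to its own block of redshifts, since the calculation for one galaxy does not depend on any other.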
BASIN also contains more general libraries for important mathematical operations, such as
Fast Fourier Transforms and statistical analysis. As often as possible, we have incorporated
powerful existing packages into the BASIN framework. For example, we have imported
the “Fastest Fourier Transform in the West” (FFTW) package as our central FFT engine.
The BASIN internals handle all of the issues involved in packing and unpacking data, and
keeping track of whether the data are real or complex. The user simply invokes the transform
via a one-line command.
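One reason the real/complex bookkeeping matters is that the transform of a real signal of length N is Hermitian-symmetric, so only N/2 + 1 complex coefficients are independent and need to be stored. The sketch below demonstrates this packing and unpacking with a naive pure-Python transform. It is a stand-in for illustration only: it does not call FFTW, and the helper names are hypothetical rather than BASIN functions.

```python
import cmath
from typing import List, Sequence


def dft(samples: Sequence[complex]) -> List[complex]:
    """Naive O(N^2) discrete Fourier transform (for illustration only)."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * j / n)
            for j, x in enumerate(samples))
        for k in range(n)
    ]


def pack_real_spectrum(spectrum: Sequence[complex]) -> List[complex]:
    """Keep only the N//2 + 1 independent coefficients of a real signal."""
    return list(spectrum[: len(spectrum) // 2 + 1])


def unpack_real_spectrum(half: Sequence[complex], n: int) -> List[complex]:
    """Rebuild all N coefficients using X[N-k] = conj(X[k])."""
    full = list(half)
    for k in range(len(half), n):
        full.append(half[n - k].conjugate())
    return full


signal = [0.0, 1.0, 0.0, -1.0, 0.5, 0.25, -0.5, 2.0]
full = dft(signal)
half = pack_real_spectrum(full)
rebuilt = unpack_real_spectrum(half, len(signal))

print(len(full), len(half))
# Largest difference between the full and the rebuilt spectrum; it should
# be at the level of floating-point round-off.
print(max(abs(a - b) for a, b in zip(full, rebuilt)))
```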
Much of this functionality has been developed using the DAE tools for data distribution and
parallel operations, with minimal, or even no, explicit use of MPI. Users without advanced
knowledge of MPI can therefore still add new BASIN functions and create new libraries for
their own and general use. For example, the BASIN function for the parallel calculation of
the center of mass of a stellar system [defined as $\mathbf{r}_{cm} = \sum_i m_i \mathbf{r}_i / \sum_i m_i$, where $m_i$ and
$\mathbf{r}_i = (x_i, y_i, z_i)$ are the masses and coordinates of the stars in the stellar system] might be
implemented within BASIN as