Using Dynamic Condor-Based Services for Classifying Schizophrenia in Diffusion Tensor Images.
ABSTRACT Diffusion tensor imaging (DTI) provides insight into the white matter of the human brain, which is affected by schizophrenia. By comparing a patient group to a control group, the DTI-images are on average expected to be different for white matter regions. Principal component analysis (PCA) and linear discriminant analysis (LDA) are used to classify the groups. In this work, the number of principal components is optimised for obtaining the minimal classification error. A robust estimate of this error is computed in a cross-validation framework, using different partitions of the data into a training and a testing set. Previously, sequential runs were performed in MATLAB, resulting in long execution times. In this paper we describe an experiment where this application was run on a grid with minimal modifications and user effort. We have adopted a service-based approach that autonomously launches image analysis services onto a campus-wide Condor pool comprising volunteer resources. This allows high throughput analysis of our data in a dynamic resource pool. The challenge in adopting such an approach comes from the nature of the resources, which change randomly with time and thus require fault tolerance. Through this approach we have reduced the computation time of each dataset from 90 minutes to less than 10 minutes. A minimal classification error of 22% was obtained, using 15 principal components.
Simon Caton, Matthan Caan, Sílvia Olabarriaga, Omer Rana and Bruce Batchelor
Cardiff University, UK
Academic Medical Centre, University of Amsterdam and Delft University of Technology, NL
I. INTRODUCTION AND RELATED WORK
Over the past few years, diffusion tensor imaging (DTI) 
has provided important insights into the structure of the brain.
A still growing number of studies adopt DTI to determine
changes in brain structure in schizophrenia, a cognitive dis-
order affecting approximately 1% of the world’s population.
Typically, schizophrenia is studied through region of interest
(ROI)-analysis or voxel-based analysis (VBA) . To date,
many regions of the brain have been reported to be linked with
schizophrenia; however, exactly which ones are most relevant
is still unknown. Computerised image processing of brain
imagery has led to improvements in the understanding of
the disease, but translating the visual information into
usable data is complex because the brain structure is highly
In  an algorithm is presented that models correlations
between different brain regions using a pattern recognition
approach. Within the pattern recognition framework, a classi-
fication error is computed, indicating the ability to distinguish
between the groups. A robust estimate of the classification
error is computed in a series of independent jobs that could
run in parallel. However, parallel execution is hindered by
relatively high costs of MATLAB licenses. Therefore, com-
putations in  were performed sequentially, resulting in
excessively long running times.
A number of medical applications have been proposed ear-
lier, based on a parallel implementation of MATLAB scripts.
Statistical analysis of Alzheimer’s disease was performed
using Statistical Parametric Mapping, within MATLAB .
Here, MATLAB licenses are required for all nodes. In 
a problem solving environment automatically generated MAT-
LAB code and carried out simplified Grid computing, thus en-
abling Finite Element Modeling. A generalisable architecture
was presented in  for grid-enabled processing of pathologi-
cal images for computer-aided prognosis, by deploying parallel
MATLAB on a cluster.
The closest approach to our work is the use of
compiled MATLAB code, run on a Condor cluster, to speed up
Electrical Impedance Tomography
reconstruction algorithms . However, in this instance, the
code was pre-defined for homogeneous tasks and the standard
Condor system was utilised. In  Condor-G and Pegasus are
used to speed up MATLAB-based evaluation of tomographical
data. Here, inter-site network latencies and queue wait times
significantly impact upon performance. In  compiled MAT-
LAB scripts were deployed over a 180 node Condor system
to process vibro-acoustic data. Here, dramatic performance
improvements were observed as resources were not limited by
available MATLAB licenses. However, this does not address
the issue of queue time.
In ,  MATLAB-based Condor-G jobs are created
on the fly in response to the detection of heavy load on the
user’s workstation. The user’s machine is relieved of work
transparently, but the complexity of job generation means that
the approach is slower than regular Condor. MATLAB*G 
is a parallel MATLAB for the ALiCE Grid . Here, the
user submits a job interactively from a MATLAB environ-
ment. Each job is then decomposed into a number of tasks
and executed in fine-grained parallel manner on (licensed)
MATLAB slaves deployed as ALiCE jobs. JavaPorts  is a
distributed component framework that remaps MATLAB tasks
for heterogeneous nodes. In this work, tasks are compiled and
distributed to a heterogeneous cluster at runtime. Good speedup
can be achieved for a small number of nodes, but it saturates as
more nodes are added due to messaging overheads. In  a
review of 27 other parallel MATLAB approaches is presented.
Most of these approaches require either a license or the use of
parallel programming libraries (e.g. MPI and PVM) . The
logistics of distributed or multi-institutional software licensing
remain an unresolved issue ,  and thus we cannot use
such an approach.
We describe a method for porting the application in  to
an architecture of dynamic Condor-based services for generic
image analysis . Each service relates to a continuously
running, license free and remotely controllable compiled in-
stance of a MATLAB-based Image Processing Engine (IPE).
By using compiled MATLAB sessions we side-step most
of the queue-based latencies found in the above work.
A campus-wide Condor pool made up of volunteer resources
provides a source of raw computational power. As Condor
was not designed for a reliable infrastructure, we layer our
own infrastructure on top of Condor to improve performance
and reliability. The image analysis was carried out using the
2,500 node Cardiff University Condor Pool, of which a total
of 400 machines were used.
The paper is organised as follows: section II presents the
application and section III the grid infrastructure. The imple-
mentation is described in section IV. Results are presented in
section V, discussions and conclusions in section VI.
The application consists of an image analysis pipeline for
DTI-scans. The goal is to locate regions in which the brain
differs in schizophrenic patients and healthy controls. In case
of affected brain tissue, signal changes in DTI scans are to
be expected. As these changes are subtle, patient and control
groups are compared to gain statistical power. Still, imaging
artifacts, noise and heterogeneity within the patient and control
groups hamper the detection of pathological processes in the
data. The method proposed in  adopts a pattern classifi-
cation approach using supervised techniques. The system is
designed to identify if and where a significant difference exists
in the scans of the patient and the control groups. Note that
the relevance of the results obtained with the proposed method
resides in analysing the given population, for clinical research
purposes, and not (yet) for classifying new incoming patients.
The data adopted in this study consists of DTI scans of 34
schizophrenic patients and 24 controls. Scans were acquired
on a dedicated MR-scanner at the Amsterdam Medical Center
(AMC). After isotropic resampling, 3D-volumes were obtained
(128x128x48 voxels, resolution of 2x2x2 mm). Pre-processing
consists of spatially aligning (registration) and smoothing
of the data. This step is currently performed on a single
Classification is done by means of Principal Component
Analysis (PCA) and Linear Discriminant Analysis (LDA).
PCA determines representative features, or Principal Compo-
nents (PCs), out of the data. LDA then decides, using these
PCs, if a significant difference can be found between patients
A training set of a subset of DTI-volumes with known class-
labeling is used to train the LDA-classifier. This classifier is
subsequently applied to samples in an unseen test set. The
classification error is defined as the relative part of the test set
that is incorrectly classified. Randomly classifying the subjects
would induce an error of 50%. Based on the group sizes, an
error below an upper bound of 35% is taken to indicate
a significant difference between patients and healthy controls
The estimated classification error should be independent of
the composition of training and test sets; therefore a
cross-validation strategy is adopted. It consists of splitting the total
dataset (all DTI-volumes) into different groups, each of which is in turn
used as the test set, with the remainder of the volumes as the
training set. Different compositions of groups lead to repeated
cross-validation, and the computed classification errors are
averaged over all runs. Typically, a five-fold cross-validation is
performed, and repeated ten times.
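The repeated cross-validation described above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB/PRTools code used in the study; the function names `repeated_kfold` and `mean_cv_error`, the random seed, and the absence of class stratification are assumptions made for the example.

```python
import random


def repeated_kfold(n_subjects, n_folds=5, n_repeats=10, seed=0):
    """Yield (train, test) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_subjects))
        rng.shuffle(order)  # a new random composition of the groups per repeat
        for fold in range(n_folds):
            test = sorted(order[fold::n_folds])  # every n_folds-th shuffled subject
            held_out = set(test)
            train = [i for i in range(n_subjects) if i not in held_out]
            yield train, test


def mean_cv_error(n_subjects, error_fn, **kwargs):
    """Average error_fn(train, test) over all splits (hypothetical helper)."""
    errors = [error_fn(train, test)
              for train, test in repeated_kfold(n_subjects, **kwargs)]
    return sum(errors) / len(errors)


# 58 subjects, five folds, ten repeats: 50 independent train/test runs.
splits = list(repeated_kfold(58))
```

Each of the resulting splits is independent of the others, which is what makes the runs suitable for parallel execution.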
An essential parameter in the classification is nPCs, the
number of principal components used by the PCA/LDA algo-
rithm. A low nPCs yields limited information for the classifier,
resulting in a biased estimate. For a high nPCs, noise will
distort the estimation, such that incorrect brain regions will be
reported. The optimal nPCs depends on the application, and
needs to be found for each new study.
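As a rough illustration of the parameter search, the sketch below picks the nPCs value with the lowest mean cross-validated error. The helper name `select_npcs`, the tie-breaking rule (prefer fewer components) and the error values are assumptions for the example; the numbers are illustrative only and are not results from the study.

```python
def select_npcs(mean_error_by_npcs):
    """Return the nPCs with the lowest mean error; ties go to the smaller nPCs."""
    # Iterating over sorted keys makes min() return the smallest nPCs on a tie.
    return min(sorted(mean_error_by_npcs), key=lambda n: mean_error_by_npcs[n])


# Illustrative values only (hypothetical, not measured data).
example_errors = {1: 0.30, 5: 0.27, 10: 0.25, 15: 0.24, 20: 0.24}
best = select_npcs(example_errors)
```

Because every entry of the dictionary comes from its own set of cross-validation runs, the entries can be computed independently of each other.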
If the classification error obtained with the optimal param-
eter configuration is low, such that a significant difference
between patients and controls can be concluded, it is of clinical
interest to know which brain regions contribute most to this
difference. That is, we want to interpret the outputs of the pat-
tern recognition system. For that purpose, a map is calculated
by the PCA/LDA-algorithm. This map indicates, per voxel,
the amount of variation in DTI-values that contributed to
the optimal classification into patients and controls. Negative
values in the map relate to decreased DTI-values for patients,
and positive values indicate increased values. Clinically, it is
known that schizophrenia is related to only a few brain
regions; therefore the map is thresholded to extract those
regions where the map has the highest absolute value.
An example of such a map is given in figure 1d.
A. Challenges in distributing the application
Conventionally, the application presented in the previous
section was run in MATLAB with two external toolboxes:
PRTools  and DipLib . In this application, one
parameter needs to be tuned: nPCs. This means that the
cross-validation step has to be performed repeatedly for a large
range of parameter values, resulting in high running costs.
Such long computations require much effort for management
and logistics, and additionally prevent users from running
other tasks on their machines at the same time. As the
tasks involved are independent from each other, they can be
executed in parallel.
Fig. 1. Workflow of the application indicating the parallelised steps.
Porting this application onto distributed resources is a
challenging task. Normally a high-performance infrastructure
such as that provided by the VL-e project¹ would be useful to run
the computations. Unfortunately, running (licensed) MATLAB
programs is not possible in this infrastructure, which is also the case for
public grids in general. Moreover, mechanisms are needed for
efficient and meaningful data transportation and control.
It would be unreasonable and unrealistic to expect the adoption
of complex low-level parallel approaches, such as MPI, to
speed up this application . Firstly, it is possible to adopt
a coarse-grained parallel approach quickly and easily, with
minimal changes to application code. Secondly, we cannot
expect users to be experts in distributed computing. Hence,
preference is given to a system that is easy to use and has
reasonable performance over a system with peak performance
that is difficult to use .
As the complexity of an application grows, so too does the
necessity to use a significant number of resources to support
its execution . A single research group seldom
possesses such resources and must therefore draw on resources
from other sources. Such scenarios are common, and greatly
benefit from opportunistic approaches. Systems like Condor,
Nimrod , NetSolve , Legion  etc. may be used, but
Pancake  measured that users can spend approximately
10% of their time performing job setup. For large applications
this is likely to be a significant period of time, and for
inexperienced users far greater than 10%. It is often difficult
to predict the needs of an application in advance, particularly
if those needs change during runtime . This can occur for
several reasons: (1) resource requirements change due to the
availability of new and better resources , (2) resources are
shared among various users and no single user has control
over their allocation  or (3) one or more resources fail.
Consequently, the availability of resources varies with time
and it is difficult to create a stable environment . The final
result is that it is nearly impossible for users to map and then
manage the execution of their application by hand .
¹ Virtual Laboratory for enhanced Science, http://www.vl-e.nl
B. Running Compiled MATLAB Applications on a Condor
We used 400 nodes of the 2,500 node Windows Condor
Pool at Cardiff, which were prepared for compiled MATLAB by
installing the MATLAB Component Runtime (MCR) libraries.
Of these 400 nodes, on average 120 are available at
any given time. A consequence of a campus-wide Condor
pool with no explicit centralised control is that workstations
are both regularly and haphazardly upgraded and modified
from multiple sources. We use an extension of Condor’s
ClassAds to identify machines which have MCR installed.
However, over time, we cannot guarantee that this remains the
case, as workstation updates can induce anomalous behaviour.
This is manifested in one of four problems associated with
using the Condor system , but it is left up to the user
to determine why their job has failed. Condor provides no
distinction between failures due to application error and errors
which stem from the operating environment.
In addition to the above factors, the MCR presents further
challenges: (1) The machine owner has priority over the
Condor user. However, a compiled MATLAB job cannot be
successfully suspended and resumed (compiler version 4.7),
and as in the case above, Condor will not reschedule a job
when this occurs. We also cannot recover from such errors
without resubmission, as checkpointing is not supported for
Windows-based Condor pools. (2) The MCR can and often
does fail during initialisation. Even without these administra-
tive challenges compiled MATLAB applications can generate
other errors. Such errors can be application specific – such as
running out of memory, or can simply be irregular events such
as a class loader error or corrupted files. However, whatever
the cause, handling such errors in a generic way is a significant
A. Compiling the Image Processing Engine (IPE)
Our IPE is not a regular compiled MATLAB application.
Regular applications (e.g. ) cater for a limited number of
scenarios, can only be used in batch mode and consequently
must be initialised for every distinct job. Our engine differs in
the following ways: (1) It caters for many disparate scenarios,
as it is essentially a scaled-down version of MATLAB. (2) One
instance can serve multiple distinct tasks with only one
initialisation phase, and thus reduces the impact on performance
of unpredictable queueing times. This results in a higher job
throughput. (3) It is primarily used interactively but can also
be used in batch mode.
Compiling entire MATLAB toolkits requires a different
approach than compiling regular stand-alone applications.
Normally the compiler only needs to be told which function
to compile and in the majority of cases all dependencies are
handled for the user. In our case, we must explicitly tell the
compiler which functions are to be included in the standalone
application. When merging toolkits this quickly becomes a list
of substantial size, which in the case of PRTools and DipLib
is around 1,500 functions. This corresponds to a compiler call of
over 60,000 characters in length. Such a process cannot be
performed by hand, and therefore must be automated. For this
task we have created a Java-based interface to the MATLAB
Compiler, allowing a user to create simple definitions of what
to include and, when necessary, exclude from the compiled
B. System Overview
There are four core components that constitute this system
(see Fig. 2). (1) The Central Service Manager (CSM)
is responsible for allocating resources to user applications
and autonomously and dynamically determines the resource
requirements of the system as a whole. The CSM may also
run as a Condor job. (2) Multiple Image Analysis
Services (IASs), the number of which is determined by the available system
resources. Each IAS is an instance of the IPE controlled
through a Java Server and runs as a continuous Condor job,
terminating only on failure or eviction. (3) The Condor Resource
Manager (CRM) is a job submission and monitoring daemon
running on a Condor submit node. It attempts to realise the
requirements specified by the CSM. As previously mentioned,
there are many difficulties to be overcome in performing this
realisation. Hence, the CRM monitors each IAS job and the
Condor pool as a whole in order to determine: (i) to what
extent it can satisfy the requirements made by the CSM, (ii)
which workstations and IAS jobs are demonstrating anomalous
behaviour and (iii) whether resources should be released due
to a high demand from other Condor users. (4) One or more
Lookup Services (LS), which are used by user applications
and system components to locate the CSM. In order to avoid
confusion we do not use Task and Job synonymously. A Job
corresponds to an instance of an IAS running on a Condor
node. A Task corresponds to a MATLAB command sequence,
executed by an IAS.
Due to the nature of this approach it is inevitable that
IASs join the system haphazardly. In order to reduce system
overhead, tasks are buffered in remote IAS queues, which
enables the IAS to download any data whilst a previous
task executes. However, as more resources become available,
task reallocation may be required in order to take advantage
of new resources. This means that our scheduler must act
opportunistically, with likely future task reallocations. Hence,
it is important not to plan too far ahead. To facilitate this
we autonomously vary the size of IAS queues in relation to
the number of unallocated tasks. Therefore, new resources do
not result in multiple task reallocations, which would in turn
increase system overhead. To determine from which IAS a
task is taken for reallocation, the scheduler uses the following
heuristic: $\mathrm{Min}(\mathrm{Avg}(\mathrm{ExecutionTime}(T_{sim})) - \mathrm{IAS}_{T_e})$,
where $T_e$ is the currently executing Task for a given IAS, and
$T_{sim}$ is a similar Task, which has been monitored in the past.
We then take the Task that was most recently submitted to the
IAS for reallocation.
Fig. 2. System overview. The shaded area relates to the system components
known to the user’s application – for further detail see: 
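One possible reading of this heuristic is sketched below in Python. The interpretation of the second term as the time already spent on the currently executing Task, the restriction to IASs that have buffered Tasks, and all names (`IAS`, `pick_donor`, `reallocate_one`) are assumptions made for illustration; the text does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class IAS:
    name: str
    elapsed_current: float  # assumed meaning of IAS_Te: seconds spent on the executing Task
    avg_similar: float      # Avg(ExecutionTime(T_sim)) from past monitoring, in seconds
    queue: list = field(default_factory=list)  # buffered Tasks, oldest first


def pick_donor(services):
    """Return the IAS minimising avg_similar - elapsed_current, or None."""
    candidates = [s for s in services if s.queue]  # only IASs with buffered Tasks
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.avg_similar - s.elapsed_current)


def reallocate_one(services, new_ias):
    """Move the most recently submitted buffered Task of the donor to new_ias."""
    donor = pick_donor(services)
    if donor is None:
        return None
    task = donor.queue.pop()  # the last entry is the most recently submitted Task
    new_ias.queue.append(task)
    return task
```

Taking the most recently submitted Task means that the Task moved is the one furthest from execution in the donor's queue.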
In order to control MATLAB sessions we define meta-
interpreters (MI), that: (1) execute a command or command
sequence, (2) construct results, (3) perform any error handling,
and (4) return any result or error events to user applications.
Results are returned by the MI in MATLAB’s .mat format so
that they can be easily aggregated into the complete analysis
workflow – a Task Specification Object (TSO) was defined
relating to this MI.
C. Porting the application
In order to port any application into a parallel context it
is inevitable that some changes to code will be required.
This was also true for this application, although the changes
required were not very significant. The following change was
necessary: all data loads were separated from the execution
code. The reasons for encouraging this approach are threefold:
(1) Our IPE is a modular image processing package in which
algorithms are constructed using sequences of interchangeable
function calls. (2) It encourages further additions to the
package to keep this modular approach, with the assumption
that islands of functionality are less likely to occur. (3) Files
transferred to and from IASs are transparently
allocated globally unique names to prevent naming conflicts
and to ensure that an IAS does not perform the same data
transfer twice. As such, explicit loads made by the programmer
would fail. All console and visual output was also suppressed
for improved performance. No further modifications were
Fig. 3 shows an activity diagram illustrating how the application
is executed. This process can be summarised as
follows: (1) The user application creates TSOs
specifying the desired MATLAB sequence, the data on which
it is to be performed, and a Task Identifier (TID). Each TSO
is then passed through the client interface, where MATLAB
syntax is validated and data is placed into an HTTP server root
for transfer, and is then sent to the scheduler. (2) The CSM is located,
and the application registers with it and defines the quantity of resources required. (3) The
CSM informs the CRM of the new resource requirements.
(4) The CSM checks the number of free IASs. (5) IASs are
allocated if possible, otherwise the CSM waits for new IASs to
become available. (6) The client interface receives information
about discovered IAS(s). (7) The client interface schedules all
the buffered Tasks. (8) For each subsequent IAS, one Task is
reallocated using the above heuristic.
Fig. 3. Activity diagram showing interactions between components and the
execution cycle of a user application.
As Tasks complete, the resource requirements of the ap-
plication reduce, and the Client Interface keeps the CSM
informed accordingly. As results are received, they are down-
loaded from the IAS’s HTTP daemon and saved under the
name specified in the TSO’s TID.
D. Fault Tolerance
The necessity for strong fault tolerance in an approach such
as this has been motivated in Section III-B. In order to handle
the range of errors that can occur, two distinct mechanisms
are required: (1) Infrastructural and (2) Task oriented.
Infrastructural fault tolerance is provided by the CRM.
In order to detect that an MCR installation is exhibiting
anomalous behaviour, every instance of an IAS job is closely
monitored. This monitoring data is used to determine how
long an IAS should require for initialisation, once the Condor
job has started. The decision making process employed by the
CRM is based on the data from over 5,000 successful IAS
initialisations. A Job will be terminated if: JobRuntime >
Max(10 minutes, avg(initTime) × 5). There are two possible
reasons for this to occur: (1) the workstation has poor
performance or is heavily loaded, in which case it is not a
suitable choice of workstation, or (2) an MCR related error has
occurred. In  we determined that the average initialisation
time for an IAS was 3 minutes, with a minimum of 24 seconds.
The IAS job’s initialisation sequence also contains an MCR
integrity check. Should this fail, the job will terminate with
a specific error code. The CRM monitors the Condor logs in
order to detect if an IAS Condor Job has been suspended, or if
a specific error has occurred with a job during startup. When a
job is suspended the CSM is informed and the job terminated.
If, however, the CRM terminates a job for a reason other than
suspension, or the Job log reports an unexpected error code,
the specific workstation receives 1 fail point. If a workstation
receives more than a specific number of fail points it is blacklisted
for 24 hours for every subsequent point. This period
of time was chosen with the assumption that the workstation
will be restarted within this time, allowing the network
management software to remedy any software abnormalities.
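The termination rule and the fail-point penalty can be sketched as follows. This is a minimal Python illustration, not the CRM's code: the function names are hypothetical, the fail-point `threshold` is a parameter because the text only refers to "a specific value", and the reading of "24 hours for every subsequent point" as a linearly growing penalty is an assumption.

```python
def should_terminate(job_runtime_s, init_times_s, floor_s=600.0, factor=5.0):
    """Apply JobRuntime > Max(10 minutes, avg(initTime) x 5) to an initialising job."""
    avg_init = sum(init_times_s) / len(init_times_s)  # from past successful initialisations
    return job_runtime_s > max(floor_s, avg_init * factor)


def blacklist_hours(fail_points, threshold, hours_per_point=24):
    """Hours of blacklisting: 24 hours per fail point above the (hypothetical) threshold."""
    return hours_per_point * max(0, fail_points - threshold)
```

With the reported average initialisation time of 3 minutes, the limit evaluates to max(10, 15) = 15 minutes, so a job that is still initialising after 15 minutes would be terminated under this rule.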
By employing these infrastructural fault measures we know
that by the time an IAS becomes available it will function
correctly. However, specific user Tasks can still result in error
and hence we also require some fault tolerance within the
scheduler. Our error model adopts two tiers: (1) IAS Error
(due to an IAS failure) and (2) MATLAB Error. In the first case
the task is rescheduled. The second is application specific
and relates directly to a fault within a specific Task. These
errors are simply presented to the user at the end of execution and
are currently not resubmitted. Application specific logic is needed
to handle these errors, as they are specific to the particular
type of application being executed in MATLAB.
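A minimal sketch of this two-tier handling, assuming a scheduler that keeps a queue of pending Tasks and a list of errors to report at the end of execution. The constants, the function name `route_error` and the data structures are hypothetical and only illustrate the routing described above.

```python
from collections import deque

IAS_ERROR = "ias"        # tier 1: the IAS itself failed
MATLAB_ERROR = "matlab"  # tier 2: the Task's MATLAB code raised an error


def route_error(task_id, error_kind, pending, reported):
    """Reschedule Tasks hit by an IAS failure; collect MATLAB errors for the user."""
    if error_kind == IAS_ERROR:
        pending.append(task_id)   # the Task goes back to the scheduler queue
    elif error_kind == MATLAB_ERROR:
        reported.append(task_id)  # shown to the user at the end, not resubmitted
    else:
        raise ValueError(f"unknown error kind: {error_kind!r}")


pending = deque(["t3"])
reported = []
route_error("t1", IAS_ERROR, pending, reported)
route_error("t2", MATLAB_ERROR, pending, reported)
```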
A. Application output
In figure 4, the mean classification error is plotted against
the number of principal components for the ten-times repeated
five-fold cross-validation (yielding 50 runs in total). For this
particular dataset with 58 subjects, an error below 35% can
be considered significant.
Mean errors decrease from 28% for 1 PC to 22% for 15
PCs, after which they stay approximately constant; for larger
nPCs, error reduction is not significant, so we choose 15
PCs as optimal for further classification. The corresponding
thresholded map is overlaid on the FA volume and displayed
in figure 5. This figure is presented to the radiologist for
B. Application Execution
When executed sequentially on a single workstation the
analysis of a single dataset requires 90 minutes. This consists
of 20 independent iterations, the results of which are later
aggregated. We have examined three scenarios for the analysis
of a given dataset: (1) no IAS jobs have been submitted by
the CRM before the application begins, (2) the system has
some resources, and (3) enough IASs are available for use
by the application. These three scenarios have been chosen
to illustrate the opportunistic qualities of the system and the
potential speed-up achievable using this approach. Whilst the
second and third scenarios may seem a waste of resources,
the strategy of spare capacity is inseparable from the need to
accommodate high demand on short notice, which in turn is
the raison d’être for grid and cluster computing .
Fig. 4. Mean classification error as a function of nPCs, with error bars denoting
the standard deviation respective to the mean.
Fig. 5. The thresholded map for nPCs=15, displayed as white and black
regions for increased and decreased DTI-values for patients respectively, with
anatomy on the background. Slices are displayed from top to bottom.
Figure 6 shows the analysis of one dataset when no system
resources were available when the application started. In this
scenario the system must wait for Condor to schedule some
IAS jobs and for these IASs to successfully initialise. In this
instance some IASs were initialised within 30 seconds, with
further resources becoming available in the following 100
seconds. It is clear from the graph that IASs are acquired
sporadically, which emphasizes the change in availability of
these nodes. We can also see that as the number of IASs
reaches 10, task reallocation begins to take advantage of new
resources, as no tasks remain in the queue, but are either
complete, executing or buffered in remote IAS queues.
Fig. 6. Plot of application execution when no resources were available when
the application began (plot labels: No of IASs, No of reallocations).
The user waited 15 minutes for their application to com-
plete. Considering that a maximum of 13 nodes were used
(the 14th was not utilised) this does not initially show good
performance. However, consider that the problem size is not
a perfect fit for this number of nodes and that with time new
resources were gained and released; on average 8 IASs were
used. Also, consider the heterogeneity of the workstations be-
ing used. The average task completion time was in the order of
4 minutes, with minimum and maximum execution times
of 3 and 7 minutes respectively. With these considerations a
completion time of 15 minutes is respectable.
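A rough back-of-the-envelope check of this claim is sketched below. It is a deliberately crude model, not a measurement: it assumes the 20 tasks are spread evenly over a constant pool of 8 IASs and ignores IAS initialisation and data transfer. The helper name `rounds_needed` is hypothetical.

```python
import math


def rounds_needed(n_tasks, n_workers):
    """Number of task 'waves' when tasks are spread evenly over the workers."""
    return math.ceil(n_tasks / n_workers)


n_tasks = 20      # independent iterations per dataset
avg_workers = 8   # average number of IASs in use
rounds = rounds_needed(n_tasks, avg_workers)

estimate_avg = rounds * 4  # minutes, using the average task time
estimate_max = rounds * 7  # minutes, if every wave hits the slowest time
speedup = 90 / 15          # sequential time over observed wait

print(rounds, estimate_avg, estimate_max, speedup)
```

Under these assumptions the observed 15 minutes lies between the 12-minute estimate at the average task time and the 21-minute estimate at the slowest task time, and corresponds to a six-fold speed-up over the sequential run.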
Figure 7 compares the execution of the application for the
three scenarios above. For scenarios 2 and 3 there were 11
and more than 20 IASs initialised, respectively. The scheduler
initially buffers tasks and as such reallocation occurs 4 and 10
times, respectively. It is clear that the system's performance in
either of these two scenarios greatly impacts the overall wait time
for the user, scenario 3 in particular. In scenario 3 task reallocation
occurred within 30 seconds (note that the dataset was also
transferred in this time). The effects of task and resource
heterogeneity are also most apparent in scenario 3. However,
using this scenario we have reduced the user's waiting time
from 90 minutes to less than 10, despite using unreliable
Fig. 7. Plot of application execution when a handful of resources were available:
(a) Number of IASs; (b) Tasks completed.
C. Infrastructure Performance
The evaluation focused on (1) IAS initialisation times and
(2) the effectiveness of the fault tolerance mechanisms. On aver-
age an IAS requires 220 seconds to initialise. During the
development of this work the CRM has submitted more than
25,000 IAS jobs. Of these 25,000 the CRM logged 5,000
successful initialisations, 3,500 of which were later terminated
in response to suspension. The
remaining 1,500 can be accounted for during the testing of
Of interest is the number of IAS job failures detected by the
CRM. In total 9,000 failures occurred, which can be attributed
to: (1) an MCR error, (2) no MCR installation despite the work-
station's ClassAd reporting otherwise, or (3) an initialisation
time exceeding the threshold value mentioned in section IV-
D. Of these, (3) is by far the most frequent, accounting for
approximately 7,500 jobs. Without the use of the CRM and
other fault monitoring measures it would not be possible
to achieve the above performance in such an environment.
These 9,000 jobs were distributed over the 400 MCR-ready
The CRM database is configured using the workstation name
and IP address, which are extracted from the relevant ClassAds.
In order to log an event, the CRM must know both the IP
address of the machine and its domain name, as not all IP
addresses relate to only one Condor node. It is also possible for
workstations to be upgraded and IP addresses to be reassigned.
The dynamic nature of the Condor pool means that the CRM
cannot acquire ClassAds for all machines at any given time,
although over 2,500 records exist. Hence, not all events can
be recorded, and there is therefore a difference between the
number of jobs recorded and the total number submitted.
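The need for a two-part key, and the way events can go unrecorded, can be sketched as follows. This is a toy model rather than the CRM's actual database schema: the class `EventLog`, its methods and the host names and addresses are all hypothetical.

```python
from collections import defaultdict


class EventLog:
    """Toy event log keyed by (hostname, ip); not the CRM's actual schema."""

    def __init__(self):
        self.known = set()               # machines seen in a ClassAd
        self.events = defaultdict(list)  # (hostname, ip) -> list of events
        self.unrecorded = 0              # events for machines never registered

    def register(self, hostname, ip):
        """Add a machine whose name and IP were extracted from a ClassAd."""
        self.known.add((hostname, ip))

    def log(self, hostname, ip, event):
        """Record an event; return False if the machine has no record."""
        key = (hostname, ip)
        if key not in self.known:
            # No ClassAd-derived record for this machine: the event is lost.
            self.unrecorded += 1
            return False
        self.events[key].append(event)
        return True


log = EventLog()
log.register("lab-pc-01", "10.0.0.5")
log.register("lab-pc-02", "10.0.0.5")  # same IP, different Condor node
log.log("lab-pc-01", "10.0.0.5", "init_ok")
log.log("lab-pc-02", "10.0.0.5", "init_timeout")
log.log("lab-pc-99", "10.0.0.9", "init_ok")  # never registered
print(dict(log.events), log.unrecorded)
```

Keying on the IP address alone would merge the events of the first two machines. If the 5,000 logged successes and 9,000 logged failures are the only recorded outcomes, roughly 11,000 of the more than 25,000 submissions would fall into the unrecorded category.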
VI. DISCUSSION AND CONCLUSIONS
A grid-enabled MATLAB application for performing stud-
ies to discriminate among population groups, in this case
schizophrenic patients and controls, is described. The ul-
timate goal of this application is to help the clinical re-
searcher identify brain regions that differ the most between
the groups, enabling evidence-based decisions. We describe
how the MATLAB-based application was ported to a campus
Grid infrastructure, with minor modifications. The applica-
tion workload, consisting basically of independent tasks, was
transparently distributed among a 400 machine Condor Pool.
As pre-compiled MATLAB is used, licenses are not required
to run the application in parallel. From the user point of
view this is very important since it enables running any
MATLAB application on grids without the high cost of cluster
licenses. The grid-enabled implementation provided a latency
reduction from 90 minutes to less than 10 minutes per dataset,
enabling the realization of complex and time-consuming
experiments in reasonable time. The reduction in latency was
obtained through (1) compiling MATLAB code, (2) parallel
execution, and (3) an improved job submission and management
scheme.
Our use of interactive services permits us to side-step the
Condor queuing strategy for most tasks. Instead, our man-
agement component (CRM) can submit IAS jobs in response
to a global demand. The scheduler can then interactively
and dynamically allocate tasks to the available resources
during the execution of the application as a whole, performing
reallocation as required. This means that we can perform tasks
near to their normal execution times should IASs be preloaded,
and thus greatly improve performance.
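A minimal sketch of this kind of dynamic allocation is given below. It is not the scheduler's implementation: the class `Scheduler`, the `buffer_size` parameter and the rule of taking one waiting task from the fullest buffer are assumptions chosen to illustrate buffering and reallocation once the global queue is empty.

```python
from collections import deque


class Scheduler:
    """Toy task scheduler with per-service buffers and reallocation."""

    def __init__(self, tasks):
        self.queue = deque(tasks)  # tasks not yet sent to any service
        self.buffers = {}          # service id -> deque of buffered tasks
        self.reallocations = 0

    def add_service(self, service_id, buffer_size=2):
        """Register a newly available service and give it work."""
        buf = self.buffers.setdefault(service_id, deque())
        while self.queue and len(buf) < buffer_size:
            buf.append(self.queue.popleft())
        if not buf:
            self._steal_for(service_id)

    def _steal_for(self, service_id):
        # Queue is empty: move one waiting task from the fullest buffer.
        donor = max(self.buffers, key=lambda s: len(self.buffers[s]))
        # Only take a task that is waiting behind another one (assumed rule).
        if donor != service_id and len(self.buffers[donor]) > 1:
            self.buffers[service_id].append(self.buffers[donor].pop())
            self.reallocations += 1


sched = Scheduler(range(4))
sched.add_service("a")  # receives tasks 0 and 1
sched.add_service("b")  # receives tasks 2 and 3
sched.add_service("c")  # queue is empty, so one buffered task is reallocated
print(dict(sched.buffers), sched.reallocations)
```

Here the third service arrives after all tasks have been handed out, so it takes over a task that was still waiting in another service's buffer instead of remaining idle.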
With this Condor-based system, we are able to compare
results with an earlier implementation, demonstrating
correctness, but with a reduction in time for data analysis.
The success of this experiment opens up the possibility of
reducing latency in other parts of the image analysis pipeline,
or performing more complex performance assessments within
large parameter ranges. In our current workflow, as depicted in
figure 1, the interpretation and visualisation of the PCA/LDA-
mapping are not included in the Condor framework. Instead, a
naive thresholding is applied, to find regions where brain tissue
of schizophrenics is affected. Future work will focus on shaving
the PCA/LDA mapping, to delineate the regions of interest
more precisely. In shaving, non-discriminating voxels are iteratively
discarded, each time updating the PCA/LDA-mapping. We
would like to optimise the shaving step size and/or stopping
criterion. The interpretation step of the workflow will thereby
be migrated into the Condor workflow as well. Even more
interesting is the joint optimisation of the number of principal
components and the shaving parameters, as it yields an even
higher-dimensional parameter space.
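The structure of such a shaving loop, with its two tunable parameters, can be sketched as follows. This is only an outline of the iteration: a simple per-voxel difference of group means stands in for the PCA/LDA mapping, so unlike the real mapping its scores do not change when other voxels are removed. The names `mean_difference_map`, `shave`, `step` and `min_voxels` are hypothetical, as are the toy data.

```python
def mean_difference_map(patients, controls, voxels):
    """Stand-in for the PCA/LDA mapping: |mean(patients) - mean(controls)|."""
    scores = {}
    for v in voxels:
        p = sum(subject[v] for subject in patients) / len(patients)
        c = sum(subject[v] for subject in controls) / len(controls)
        scores[v] = abs(p - c)
    return scores


def shave(patients, controls, step=0.5, min_voxels=1):
    """Iteratively discard the least discriminating voxels.

    step: fraction of remaining voxels removed per iteration (assumed parameter).
    min_voxels: stopping criterion (assumed parameter).
    """
    voxels = list(range(len(patients[0])))
    while len(voxels) > min_voxels:
        # Recompute the mapping on the voxels that are still present.
        scores = mean_difference_map(patients, controls, voxels)
        n_drop = max(1, int(len(voxels) * step))
        n_keep = max(min_voxels, len(voxels) - n_drop)
        # Keep the highest-scoring voxels, preserving their original order.
        ranked = sorted(voxels, key=lambda v: scores[v], reverse=True)[:n_keep]
        voxels = sorted(ranked)
    return voxels


# Toy data: two subjects per group, four voxels each.
patients = [[1.0, 5.0, 2.0, 9.0], [1.2, 5.2, 2.1, 8.0]]
controls = [[1.1, 3.0, 2.0, 4.0], [1.1, 3.2, 2.2, 5.0]]
print(shave(patients, controls, step=0.5, min_voxels=2))
```

Each combination of `step` and `min_voxels`, together with the number of principal components, would form one independent job in a parameter sweep.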
There are new challenges in preparing the Condor pool
for this migration. Hardware requirements will be increased,
as more memory is needed to perform the computations.
Besides, scheduling the jobs on the Condor grid becomes more
challenging, as the job length will be longer and dependent on
the chosen shaving parameters.
The proposed framework can be generalised to other
pathologies, among which are brain diseases such as Multiple
Sclerosis (MS) and Amyotrophic Lateral Sclerosis (ALS).
Also functional MRI clinical studies for neurosurgery planning
of oncology patients could benefit from the proposed approach.
For each new case, the type of analysis described here must
be repeated, so there is interest in porting the current solution
to the grid infrastructure available in The Netherlands.
We kindly thank dr. B.D. Peters and dr. F.M. Vos for col-
laboration in the medical application. We are very grateful
to James Osbourne for his continuous support with Condor.
Caan and Olabarriaga are funded by the VL-e Project in a
BSIK grant from the Dutch Ministry of Education, Culture
and Science (OC&W), as part of the ICT innovation program
of the Ministry of Economic Affairs (EZ).
ALiCE grid computing project. http://www.comp.nus.edu.sg/ teoym/alice.htm.
D. Abramson, R. Sosic, J. Giddy, and B. Hall. Nimrod: A tool for
performing parameterised simulations using distributed workstations. In
Proc of the Fourth IEEE Int Symp on High Performance Distributed
Computing, pages 122 – 131, 1995.
Sudesh Agrawal, Jack Dongarra, Keith Seymour, and Sathish Vadhiyar.
Netsolve: past, present, and future; a look at a grid enabled server.
S. Bagnasco, F. Beltrame, et al. Early diagnosis of alzheimer's disease
using a grid implementation of statistical parametric analysis.
Proceedings of Healthgrid, pages 69 – 81, 2006.
P.J. Basser, J. Mattiello, and D. Le Bihan. Estimation of the effec-
tive self-diffusion tensor from the NMR spin echo. J Magn Reson,
Haresh S. Bhatt, V. H. Patel, and A. K. Aggarwal.
client-server model for development environment of distributed image
processing. LNCS 1971, GRID 2000 R. Buyya and M. Baker (Eds.:),
pages 135 – 145, 2000.
Daniel Bunford-Jones, Omer F. Rana, David W. Walker, Matthew Addis,
Mike Surridge, and Ken Hawick. Resource discovery for dynamic
clusters in computational grids. In Proceedings of The 15th International
Parallel and Distributed Processing Symposium, pages 759 – 767, 2001.
Rajkumar Buyya, Manzur Murshed, David Abramson, and Srikumar
Venugopal. Scheduling parameter sweep applications on global grids:
a deadline and budget constrained cost-time optimization algorithm.
Software - Practice And Experience, 35:491 – 512, 2003.
M.W.A. Caan, K.A. Vermeer, et al. Shaving diffusion tensor images in
discriminant analysis: A study into schizophrenia. Med Im Anal, 10:841
B. Cambazoglu, O. Sertel, et al. Efficient processing of pathological
images using the grid: Computer-aided prognosis of neuroblastoma.
In Proc 5th IEEE workshop on Challenges of large applications in
distributed environments, pages 35 – 41, 2007.
Simon Caton, Omer Rana, and Bruce Batchelor. Dynamic condor-based
services for distributed image analysis. In CCGRID '07: Proceedings of
the Seventh IEEE International Symposium on Cluster Computing and
the Grid, pages 49 – 56, 2007.
Ying Chen and Suan Fong Tan. Matlab*g: A grid-based parallel matlab.
Ron Choy and Alan Edelman. Parallel MATLAB: Doing it right.
Proceedings of the IEEE, VOL. 93, NO. 2:331 – 341, 2005.
Pawel Czarnul, Andrzej Ciereszko, and Marcin Frcaczak.
efficient parallel image processing on cluster grids using gimp. M. Bubak
et al. (Eds.): ICCS 2004, LNCS 3037, pages 451 – 458, 2004.
Ewa Deelman, Tevfik Kosar, Carl Kesselman, and Miron Livny. What
makes workflows work in an opportunistic environment? Concurrency
and Computation: Practice and Experience, 18:1187 – 1199, 2006.
Pattern Recognition Group Delft University. PRTools, a statistical
pattern recognition toolbox. http://www.prtools.org/.
Quantitative Imaging Group Delft University. DIPLib, a MATLAB tool-
box for scientific image processing and analysis. http://www.diplib.org/.
J. Fritschy, L. Horesh, et al. Using the GRID to improve the compu-
tation speed of electrical impedance tomography (EIT) reconstruction
algorithms. Physiol. Meas., 26:209–215, 2005.
C.Q. Howard, C.H. Hansen, and A.C. Zander. Optimisation of design
and location of acoustic and vibration absorbers using a distributed
computing network. In Proc. of ACOUSTICS, pages 173–178, 2005.
Z. Jun and Y. Umetani. A problem solving environment for automatic
matlab 3d finite element code generation and simplified grid computing.
In Proceedings of the Second IEEE International Conference on e-
Science and Grid Computing (e-Science'06), pages 99 – 104, 2006.
V. Kalogeraki, P. M. Melliar-Smith, and L. E. Moser. Using multiple
feedback loops for object profiling, scheduling and migration in soft
real-time distributed object systems. IEEE Int Symp on Object-Oriented
Real-Time Distributed Computing, pages 291 – 300, 1999.
R.A. Kanaan, J.S. Kim, et al. Diffusion tensor imaging in schizophrenia.
Biol Psychiatry, 58(12):921–929, 2005.
Stephen D. Kleban and Scott H. Clearwater. Fair share on high
performance computing systems: What does fair really mean?
Proceedings of the 3rd IEEE/ACM International Symposium on Cluster
Computing and the Grid (CCGRID03), pages 146 – 153, 2003.
Adam Lathers, Mei-Hui Su, Alex Kulungowski, Abel W. Lin, Gaurang
Mehta, Steven T. Peltier, Ewa Deelman, and Mark H. Ellisman. Enabling
parallel scientific applications with workflow tools. IEEE Challenges of
Large Applications in Distributed Environments, pages 55–60, 2006.
Mike Lewis and Andrew Grimshaw. The core legion object model. In
Proceedings of HPDC-5, pages 551 – 561, 1996.
Elias S. Manolakos, Demetris G. Galatopoullos, and Andrew P. Funk.
Distributed matlab based signal and image processing using javaports.
In Proceedings of IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP'04), pages 217 – 220, 2004.
Cherri M. Pancake and Curtis Cook. What users need in parallel tool
support: Survey results and analysis. In Proc IEEE Scalable High-
Performance Computing Conference, pages 40 – 47, 1994.
Raihan Ur Rasool and Qingping Guo. A pro-middleware for grid
computing. I. Stojmenovic et al. (Eds.): ISPA 2007, LNCS 4742, pages
556 – 562, 2007.
Raihan Ur Rasool and Guo Qingping. Users-grid matlab plug-in:
Enabling matlab for the grid. IEEE International Conference on e-
Business Engineering (ICEBE'06), pages 473 – 478, 2006.
Douglas Thain, Todd Tannenbaum, and Miron Livny. Distributed
computing in practice: The condor experience. Concurrency and
Computation: Practice and Experience, 17:323 – 359, 2005.