ORIGINAL RESEARCH ARTICLE
published: 03 April 2012
doi: 10.3389/fninf.2012.00007
The pipeline system for Octave and Matlab (PSOM):
a lightweight scripting framework and execution
engine for scientific workflows
Pierre Bellec1,2*, Sébastien Lavoie-Courchesne1,2,3, Phil Dickinson1,3, Jason P. Lerch4,5,
Alex P. Zijdenbos6 and Alan C. Evans3
1Centre de Recherche de l’Institut Universitaire de Gériatrie de Montréal, Montréal, QC, Canada
2Département d’Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, QC, Canada
3McConnell Brain Imaging Centre, Montreal Neurological Institute, McGill University, Montréal, QC, Canada
4Mouse Imaging Centre, The Hospital for Sick Children, Toronto, ON, Canada
5Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
6Biospective Incorporated, Montréal, QC, Canada
Edited by:
Andrew P. Davison, Centre National
de la Recherche Scientifique, France
Reviewed by:
Ivo Dinov, University of California,
USA
Yann Cointepas, CEA - NeuroSpin,
France
*Correspondence:
Pierre Bellec, Centre de Recherche
de l’Institut Universitaire de
Gériatrie de Montréal, 4545 chemin
Queen-Mary, Montréal, QC H3W
1W5, Canada.
e-mail: pierre.bellec@criugm.qc.ca
The analysis of neuroimaging databases typically involves a large number of
inter-connected steps called a pipeline. The pipeline system for Octave and Matlab (PSOM)
is a flexible framework for the implementation of pipelines in the form of Octave or
Matlab scripts. PSOM does not introduce new language constructs to specify the steps
and structure of the workflow. All steps of analysis are instead described by a regular
Matlab data structure, documenting their associated command and options, as well as
their input, output, and cleaned-up files. The PSOM execution engine provides a number
of automated services: (1) it executes jobs in parallel on a local computing facility as long
as the dependencies between jobs allow for it and sufficient resources are available; (2) it
generates a comprehensive record of the pipeline stages and the history of execution,
which is detailed enough to fully reproduce the analysis; (3) if an analysis is started multiple
times, it executes only the parts of the pipeline that need to be reprocessed. PSOM is
distributed under an open-source MIT license and can be used without restriction for
academic or commercial projects. The package has no external dependencies besides
Matlab or Octave, is straightforward to install and supports a variety of operating systems
(Linux, Windows, Mac). We ran several benchmark experiments on a public database
including 200 subjects, using a pipeline for the preprocessing of functional magnetic
resonance images (fMRI). The benchmark results showed that PSOM is a powerful
solution for the analysis of large databases using local or distributed computing resources.
Keywords: pipeline, workflow, Octave, Matlab, open-source, parallel computing, high-performance computing,
neuroimaging
1. INTRODUCTION
The rapid development of public databases in neuroimaging (e.g.,
Evans, 2006; Biswal et al., 2010; Burton, 2011) is opening exciting
avenues for data mining. The analysis of a neuroimaging database
typically involves a large number of inter-connected processing
steps, collectively referred to as a pipeline (or workflow)
(Deelman et al., 2009). Neuroimaging pipelines can be implemented
as a Matlab script, e.g., DPARSF (Chao-Gan and Yu-Feng,
2010), fMRIstat1 (Worsley et al., 2002), SPM2 (Ashburner, 2011),
or brainstorm3 (Tadel et al., 2011). Matlab is a programming
language for general scientific computing, well-adapted to the
rapid prototyping of new algorithms. It can also wrap heteroge-
neous tools implemented in a variety of languages. To facilitate
the inclusion of these computational tools in complex scientific
workflows, we developed a general-purpose pipeline system in
1http://www.math.mcgill.ca/keith/fmristat/
2www.fil.ion.ucl.ac.uk/spm/
3http://neuroimage.usc.edu/brainstorm/
Octave and Matlab (PSOM)4. To contrast PSOM against alter-
native projects, we reviewed key features of popular packages
within four areas of a pipeline life cycle (Deelman et al., 2009):
(1) composition of the pipeline; (2) mapping of the pipeline to the
underlying resources; (3) execution of the pipeline; (4) recording
of the metadata and provenance.
1.1. PIPELINE COMPOSITION
The composition of a pipeline is the generation of a (possibly
abstract) representation of all steps of analysis and associated
dependencies, including access to datasets. Many extensions of
existing languages have been developed for that purpose, such as
matlabbatch5 for Matlab, or Nipype6 (Gorgolewski et al., 2011)
and the Soma-workflow7 (Laguitton et al., 2011) for Python.
4http://code.google.com/p/psom/
5http://sourceforge.net/apps/trac/matlabbatch/wiki
6nipy.org/nipype
7http://brainvisa.info/soma-workflow
Frontiers in Neuroinformatics www.frontiersin.org
April 2012 | Volume 6 | Article 7 | 1
NEUROINFORMATICS
Bellec et al. The pipeline system for Octave and Matlab
Some scripting languages were also developed specifically to compose
pipelines, e.g., DAGMan8, Swift9 (Wilde et al., 2011) and
Pegasus (Deelman et al., 2005). All these systems differ by the
way the dependencies between jobs are encoded. DAGMan and
Soma-workflow are both based on an explicit declaration of
dependencies between jobs by users. The pipeline thus takes
the form of a directed acyclic graph (DAG) with jobs as nodes
and dependencies as (directed) edges. The Pegasus package also
uses a DAG as input, yet this DAG is represented in an XML
format called DAX. DAX graphs can be generated by any script-
ing language. By contrast, Nipype, Swift, and PSOM build on
the notion of futures (Baker and Hewitt, 1977), i.e., a list of
datasets (or variables) that will be generated by a job at run-time.
The data-flow then implicitly defines the dependencies: all the
inputs of a job have to exist before it can be started. An alter-
native to scripting approaches for pipeline composition is to rely
on graphical abstractions. A number of projects offer sophisti-
cated interfaces based on “box and arrow” graph representations,
e.g., Kepler10 (Ludäscher et al., 2006), Triana11 (Harrison et al.,
2008), Taverna12 (Oinn et al., 2006), VisTrails13 (Callahan et al.,
2006), Galaxy (Goecks et al., 2010) and LONI pipeline14 (Dinov
et al., 2009). Because the graph representations can get really
large, various mechanisms have been developed to keep the representation
compact, such as encapsulation (the ability to represent
a sub-pipeline as one box) and the use of control operations,
e.g., iteration of a module over a grid of parameters, instead of
a pure data-flow dependency system. Note that complex control
mechanisms are also necessary in systems geared toward data-flow
dependencies to give the ability to, e.g., branch between pipelines
or iterate a subpart of the pipeline until a data-dependent con-
dition is satisfied. Finally, systems that put a strong emphasis on
pipeline composition and re-use, such as Taverna, Nipype, and
LONI pipeline, critically depend on the availability of a library
of modules to build pipelines. Taverna claims to have over 3500
such modules, developed in a variety of domains such as bioinformatics
or astronomy. Nipype and LONI both offer an extensive
application catalogue for neuroimaging analysis.
1.2. PIPELINE MAPPING
When a pipeline representation has been generated, it needs to
be mapped onto available resources. For example, in grid com-
puting, multiple production sites may be available, and a subset
of sites where the pipeline will run has to be selected. This selec-
tion process can simply be a choice left to the user, e.g., Kepler,
Taverna, VisTrails, Soma-workflow. It can also be automatically
performed based on the availability and current workload at each
registered production site, e.g., CBRAIN (Frisoni et al., 2011)
and Pegasus, as well as quality of service issues. Another typ-
ical mapping task is the synchronization of the datasets across
8http://research.cs.wisc.edu/condor/dagman/
9http://www.ci.uchicago.edu/swift/
10kepler-project.org
11http://www.trianacode.org/
12taverna.org.uk
13http://www.vistrails.org/
14http://pipeline.loni.ucla.edu/
multiple data servers to the production site(s), an operation that
can itself involve some interactions through web services with a
database system, such as XNAT (Marcus et al., 2007) or LORIS
(Das et al., 2012). The Pegasus project recomposes pipelines at the
mapping stage. This feature proceeds by grouping tasks in order
to limit the overhead related to job submission and, more generally,
to optimize the pipeline for the infrastructure where it will be
executed. Such a mapping operation is central to achieving high
performance in grid or cloud computing settings. Note that some
pipeline systems have no or limited mapping capabilities. The
PSOM project as well as matlabbatch, Nipype, and DAGMan for
example were designed to work locally on the production server.
The Soma-workflow can map pipelines to remote execution sites,
but does not recompose the pipeline to optimize the performance
of execution as Pegasus does. On the other end of the spec-
trum, CBRAIN is essentially a mapping/execution/provenance
tool where pipelines have to be first composed in another system
(such as PSOM).
1.3. PIPELINE EXECUTION
A dedicated execution engine is used to run the pipeline after
mapping on computational resources. It will detect the degree of
parallelism present in the pipeline at any given time, and process
jobs in parallel depending on available computational resources.
All pipeline systems reviewed here, including PSOM, can exe-
cute jobs in parallel on a multi-core machine or a supercomputer
through submissions to a queuing mechanism such as SGE qsub,
after a proper configuration has been set. Some of them (e.g.,
Taverna, Triana, Pegasus, CBRAIN) can also run jobs concur-
rently on one or multiple supercomputers in a computing grid,
and are able to accommodate the variety of queuing mechanisms
found across production sites. Some execution engines, e.g.,
Nipype, will support a pipeline that builds dynamically, for example
with a data-dependent branching in the pipeline. Fault tolerance
is also an important feature. A first level of fault tolerance is
the notification of errors to the user, coupled with the ability to
restart the pipeline where it stopped (e.g., PSOM, Nipype,
Soma-workflow). The execution engine can also check that the expected
output files have properly been generated (e.g., Pegasus, PSOM).
In addition, after an error has occurred, an execution engine may resubmit
a job a number of times before considering that it has
definitely failed (e.g., Swift, PSOM), because some random failures
can occur due to, e.g., improper configuration, or memory or disk
space exhaustion on one execution node. An execution engine can
also feature the ability to perform a “smart update,” i.e., restart a
pipeline while re-using the results from prior executions as much
as possible (e.g., Kepler, Nipype, PSOM).
1.4. PIPELINE PROVENANCE
The final stage of a pipeline life cycle is provenance tracking,
which represents the comprehensive recording of the process-
ing steps applied to the datasets. This can also be extended to
the archiving of the computing environment used for production
(e.g., the version of the software that was used for process-
ing), and the origin of the datasets that were used as inputs
(MacKenzie-Graham et al., 2008). Provenance is a critical step
to achieve reproducible research, which is itself considered as a
cornerstone of the scientific method (Mesirov, 2010). A competition
on provenance generation demonstrated that several pipeline
systems captured similar information (Bose et al., 2006). How
this information can be accessed easily and shared remains an
area of development15. The quality of provenance tracking also
depends on the quality of the interface between the pipeline sys-
tem and the tools applied by each job: a comprehensive list of
underlying parameters has to be generated before it is recorded.
The PSOM development framework was notably designed to
facilitate the systematic recording of the default job parameters as
part of the provenance, in a way that scales well with the number
of parameters. An innovative feature introduced by the VisTrails
package is the capacity to graphically represent the changes made
to a pipeline, not only providing a provenance mechanism for the
pipeline execution but also for the steps of pipeline generation
and/or variations in employed parameters.
1.5. PSOM FEATURES
The PSOM is a lightweight scripting solution for pipeline
composition, execution, and provenance tracking. The pack-
age is intended for scientists who prototype new algorithms
and pipelines using Octave or Matlab (O/M). PSOM has been actively
developed since 2008, and it has been inspired by several PERL
pipeline systems (called RPPL, PCS, and PMP) used at the
McConnell Brain Imaging Centre, Canada, over the past fifteen
years (Zijdenbos et al., 1998). PSOM is based on a new stan-
dard to represent all steps of a pipeline analysis as a single O/M
variable. This representation defines dependencies between pro-
cessing steps implicitly by the data-flow. We established a limited
number of scripting guidelines with the goal of maintaining
a concise and modular code. These guidelines are suggestions
rather than mandates, and the pipeline representation can be generated
using any coding strategy. PSOM comes with a generic
pipeline execution engine offering the following services:
1. Parallel computing: Automatic detection and execution of parallel
components in the pipeline. The same code can run in a
single Matlab session, on a multi-core machine or on a distributed
architecture with hundreds of execution nodes just by
changing the PSOM configuration.
2. Provenance tracking: Generation of a comprehensive record of
the pipeline stages and the history of execution. These records
are detailed enough to fully reproduce an analysis, and profile
the components of the pipeline.
3. Fault tolerance: Multiple attempts will be made to run each job
before it is considered as failed. Failed jobs can be automatically
re-started by the user after termination of the pipeline.
4. Smart updates: When an analysis is started multiple times,
the parts of the pipeline that need to be reprocessed are
automatically detected and those parts only are executed.
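The fault-tolerance service (item 3) can be pictured as a bounded retry loop. The following is a minimal sketch in plain O/M, not PSOM's actual code: the variable max_attempts and the simulated transient failure are hypothetical and were introduced only for illustration.

```matlab
% Sketch of a bounded retry policy (illustrative, not PSOM's implementation).
max_attempts = 3;    % hypothetical limit on the number of attempts
nb_calls = 0;        % counts how many times the job was tried
status = 'failed';   % status kept if every attempt fails
for attempt = 1:max_attempts
    try
        nb_calls = nb_calls + 1;
        if nb_calls < 2
            % simulate a transient failure on the first attempt only
            error('sketch:transient', 'simulated transient failure');
        end
        status = 'finished';
        break
    catch err
        fprintf('Attempt %d failed: %s\n', attempt, err.message);
    end
end
fprintf('Job status: %s after %d attempt(s)\n', status, nb_calls);
```

In this sketch the job is only tagged as failed once all attempts are exhausted, which mirrors the behavior described in the list above.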
1.6. COMPARISON BETWEEN PSOM AND OTHER PACKAGES
As reviewed above, there are several alternatives with broader
functionality than PSOM, such as LONI pipeline, VisTrails,
Pegasus, Kepler, Triana, Galaxy, and Taverna. These systems
15www.w3.org/2011/prov/
notably support a graphical composition of the pipeline, database
interfaces, and mapping capabilities. They, however, require users
to write dedicated interfaces for importing computational modules.
The DAGMan and Soma-workflow systems even leave to the
user the task to generate the dependency graph of the pipeline
using a third-party software, and concentrate mainly on the
pipeline mapping, execution, and provenance. The aim of the
PSOM project was to propose a single environment where computational
modules and pipelines could be developed jointly. This
is achieved by building a pipeline representation using native data
structures of O/M. As our intended audience is developers, a
graphical tool for pipeline composition was not a priority and is
not currently available. PSOM also does not offer pipeline map-
ping capabilities because PSOM pipelines can be easily interfaced
after the development phase with projects specifically focused
on pipeline mapping, such as CBRAIN. By contrast, PSOM fea-
tures powerful pipeline execution capabilities, in terms of fault
tolerance and smart updates. Thanks to these features, users
can modify, debug, or optimize the computational modules of
a PSOM pipeline at the same time they are implementing (and
testing) it.
The closest alternatives to PSOM are matlabbatch and Nipype.
Both offer a simple scripting strategy to implement complex
pipelines using data structures that are native to Matlab and
Python, respectively. The pipeline composition is based on a set
of dedicated scripting constructs, which may result in a highly
concise code. Two projects have recently pursued this idea even
further by adding coding constructs inspired by the Swift scripting
language to Python, the PYDflow (Armstrong, 2011) package,
and R, the SwiftR16 package. PSOM pipelines are not as concise
as the ones implemented with these systems, but they can be
constructed with common O/M operations only. This choice was
made to limit the learning curve for new users, who will hopefully
find PSOM syntax very intuitive if they are already familiar with
O/M. The distinctive features of PSOM are:
1. Minimally invasive: No new programming construct is intro-
duced to script a pipeline.
2. Portable: PSOM is distributed under an MIT open-source
license, granting the rights to modify, use and redistribute the
code, free of charge, as part of any academic or commercial
project. Moreover, the installation of PSOM is straightforward:
it has no dependency and does not require compilation. Any
system that supports Matlab or Octave (i.e., Linux, Windows,
and Mac OS X) will run PSOM.
3. Dual O/M compatibility: PSOM users can benefit from the comfort
of the Matlab environment for development purposes
(graphical debugging tools, advanced profiling capabilities)
and of the free open-source Octave interpreter to execute a
code on hundreds of cores.
1.7. PAPER OUTLINE
The standard representation of a pipeline using an O/M variable is
first presented in Section 2. Section 3 provides an overview of the
key features of the execution engine on simple examples, while
16http://people.cs.uchicago.edu/∼tga/swiftR/
Section 4 details how these features were implemented. Section 5
provides further coding guidelines designed to keep the genera-
tion of pipelines concise, re-usable, and readable. Finally, Section
6 reviews some neuroinformatics projects that were implemented
with PSOM. A preprocessing pipeline for functional magnetic
resonance imaging (fMRI) was used for a benchmark evaluation
of PSOM execution performance with several computing environments
and execution configurations. The paper ends with a
discussion of current PSOM capabilities and directions for future
developments.
2. PIPELINE REPRESENTATION
A pipeline is a collection of jobs, which is implemented using
the so-called O/M structure data type. The fields used in the
pipeline are arbitrary, unique names for the jobs. Each job
can have up to five fields, of which all but the first one are
optional:
• command: (mandatory) a string describing the command that
will be executed by the job.
• files_in: (optional) a list of input files.
• files_out: (optional) a list of output files.
• files_clean: (optional) a list of files that will be deleted by
the job.
• opt: (optional) some arbitrary parameters.
The jobs are executed by PSOM in a protected environment
where the only available variables are files_in, files_out,
files_clean, and opt. The following code is a toy example
of a simple pipeline:
% Job "sample" : No input, generate a random vector a
command = 'a = randn([opt.nb_samps 1]); save(files_out,''a'')';
pipeline.sample.command      = command;
pipeline.sample.files_out    = 'sample.mat';
pipeline.sample.opt.nb_samps = 10;
% Job "quadratic" : Compute a.^2 and save the results
command = 'load(files_in); b = a.^2; save(files_out,''b'')';
pipeline.quadratic.command   = command;
pipeline.quadratic.files_in  = pipeline.sample.files_out;
pipeline.quadratic.files_out = 'quadratic.mat';
The first job, named sample, does not take any input file, and
will generate one output file called 'sample.mat'. It takes one
parameter nb_samps, equal to 10. The field opt can be of any
of the O/M data types. The second job, named quadratic, uses
the output of sample as its input (quadratic.files_in
is filled using sample.files_out). This convention avoids
the generation of file names at multiple places in the script.
It also makes explicit the dependence between sample
and quadratic when reading the code: as the input of
quadratic is the output of sample, sample has to be
completed before quadratic can be started. This type of
dependency between jobs, called “file-passing,” is translated into
a directed dependency graph, see Figure 1A. The dependency
graph dictates the order of job execution. It can be represented
using the following command:
psom_visu_dependencies(pipeline)
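To make the "file-passing" rule concrete, the sketch below derives the dependency matrix of the two-job toy pipeline using only standard O/M operations. It is an illustration of the rule stated above, not the code used inside PSOM; the variable names deps, in_i, and out_j are hypothetical, and the commands are omitted because only file names matter here.

```matlab
% Toy pipeline from the text (commands omitted, only file names are needed)
pipeline = struct();
pipeline.sample.files_out    = 'sample.mat';
pipeline.quadratic.files_in  = pipeline.sample.files_out;
pipeline.quadratic.files_out = 'quadratic.mat';

jobs = fieldnames(pipeline);
nb_jobs = numel(jobs);
% deps(i,j) is true when job i reads a file produced by job j
deps = false(nb_jobs, nb_jobs);
for i = 1:nb_jobs
    if ~isfield(pipeline.(jobs{i}), 'files_in'), continue, end
    in_i = cellstr(pipeline.(jobs{i}).files_in);
    for j = 1:nb_jobs
        if ~isfield(pipeline.(jobs{j}), 'files_out'), continue, end
        out_j = cellstr(pipeline.(jobs{j}).files_out);
        deps(i,j) = ~isempty(intersect(in_i, out_j));
    end
end
% List the directed edges of the dependency graph
for i = 1:nb_jobs
    for j = 1:nb_jobs
        if deps(i,j)
            fprintf('%s depends on %s\n', jobs{i}, jobs{j});
        end
    end
end
```

For this pipeline the only edge reported is that quadratic depends on sample, which corresponds to Figure 1A.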
Let’s now assume that the output of sample is regarded as an
intermediate file that does not need to be retained. A new job
cleanup is added to delete the output of sample, which is
declared using the field files_clean:
% Adding a job "cleanup" : delete the output of "sample"
pipeline.cleanup.command     = 'delete(files_clean)';
pipeline.cleanup.files_clean = pipeline.sample.files_out;
Because cleanup will delete the input file of quadratic,
it is mandatory to wait until quadratic is successfully executed
before cleanup is started. This type of dependency, called
“cleanup”, is again included as a directed link in the dependency
graph, see Figure 1B.
FIGURE 1 | Examples of dependency graphs. In panel (A), the input file
of the job quadratic is an output of the job sample; sample thus needs
to be completed before starting quadratic. This type of dependency
(“file-passing”) can be represented as a directed dependency graph. In
panel (B), the job cleanup deletes an input file of quadratic;
quadratic thus needs to be completed before starting cleanup.
Note that such “cleanup” dependencies may involve more than two
jobs: if cleanup deletes some input files used by both quadratic
and cubic, cleanup depends on both of them (panel C). The
same property holds for “file-passing” dependencies: if sum is using the
outputs of both quadratic and cubic, sum depends on both jobs
(panel D).
The order in which the jobs are added to the pipeline does not
have any implications on the dependency graph, and is thus independent
of the order of their execution. For example, if a new job
cubic is added:
% Adding a job "cubic" : Compute a.^3 and save the results
command = 'load(files_in); c = a.^3; save(files_out,''c'')';
pipeline.cubic.command   = command;
pipeline.cubic.files_in  = pipeline.sample.files_out;
pipeline.cubic.files_out = 'cubic.mat';
the job cleanup will be dependent upon quadratic and
cubic, because the latter jobs are using the output of sample
as an input, a file that is deleted by cleanup (Figure 1C).
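The "cleanup" rule can be sketched in the same spirit: a job that deletes a file has to wait for every job that reads that file. The snippet below is an illustrative reading of that rule in plain O/M, not PSOM's implementation; the variables to_clean and parents are hypothetical names.

```matlab
% Toy pipeline with the cleanup and cubic jobs (file names only)
pipeline = struct();
pipeline.sample.files_out    = 'sample.mat';
pipeline.quadratic.files_in  = 'sample.mat';
pipeline.quadratic.files_out = 'quadratic.mat';
pipeline.cleanup.files_clean = 'sample.mat';
pipeline.cubic.files_in      = 'sample.mat';
pipeline.cubic.files_out     = 'cubic.mat';

jobs = fieldnames(pipeline);
to_clean = cellstr(pipeline.cleanup.files_clean);
parents = {};   % jobs that must finish before cleanup can start
for j = 1:numel(jobs)
    job = pipeline.(jobs{j});
    if strcmp(jobs{j}, 'cleanup') || ~isfield(job, 'files_in')
        continue
    end
    % a job reading a file that cleanup deletes becomes a parent of cleanup
    if ~isempty(intersect(cellstr(job.files_in), to_clean))
        parents{end+1} = jobs{j};
    end
end
fprintf('cleanup must wait for: %s\n', strjoin(parents, ', '));
```

Although cubic is added after cleanup in the structure, it is still found as a parent of cleanup, which illustrates that the order of declaration does not affect the dependency graph.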
The type of files_in, files_out, and files_clean
is highly flexible. It can be a string, a cell of strings, or a nested
structure whose terminal fields are strings or cells of strings. The
following job for example uses two inputs, generated by two
different jobs (see Figure 1D):
% Adding a job "sum" : Compute a.^2+a.^3 and save the results
command = 'load(files_in{1}); load(files_in{2}); d = b+c; save(files_out,''d'')';
pipeline.sum.command     = command;
pipeline.sum.files_in{1} = pipeline.quadratic.files_out;
pipeline.sum.files_in{2} = pipeline.cubic.files_out;
pipeline.sum.files_out   = 'sum.mat';
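Because files_in, files_out, and files_clean may be strings, cells of strings, or nested structures, any dependency analysis first needs a flat list of file names. The sketch below shows one possible way to unpack such a variable with an explicit stack. It is an assumption about how this could be done, not a description of PSOM internals, and the nested example and its file names are made up.

```matlab
% Hypothetical nested description of input files (names are made up)
files_in = struct();
files_in.anat = 'anat.mat';
files_in.func = {'run1.mat', 'run2.mat'};
files_in.masks.brain = 'brain.mat';

stack = {files_in};   % items that still need to be unpacked
list = {};            % flat list of file names
while ~isempty(stack)
    item = stack{end};
    stack(end) = [];
    if ischar(item)
        list{end+1} = item;                                   % terminal string
    elseif iscell(item)
        stack = [stack, reshape(item, 1, [])];                % unpack cell entries
    elseif isstruct(item)
        stack = [stack, reshape(struct2cell(item), 1, [])];   % unpack field values
    end
end
list = sort(list);    % the stack order is arbitrary, so sort for readability
disp(list)
```

The sketch assumes scalar structures; the flat list could then be compared across jobs as in the dependency examples of this section.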
3. PIPELINE EXECUTION
3.1. A FIRST PASS THROUGH THE TOY PIPELINE
When a pipeline structure has been generated by the user,
PSOM offers a generic command to execute the pipeline:
psom_run_pipeline(pipeline,opt_pipe)
where opt_pipe is a structure of options that can be used
to set the configuration of PSOM, see Section 4.6. The main
configuration option is the name of a folder used to store the logs
of the pipeline, which is the “memory” of the pipeline system. When
invoked, PSOM first determines which jobs need to be restarted
using the logs folder. The jobs are then executed in independent
sessions, as soon as all their dependencies are satisfied. The next
section (Section 4) describes the implementation of all stages
of pipeline execution in detail. This section outlines the key
mechanisms using simple examples, starting with the toy pipeline
presented in the last section without the cleanup job (see Figure 2).
Initially, only one job (sample) can be started because it does
not have any parent in the dependency graph (Figure 2A). As
FIGURE 2 | Pipeline execution: a first pass through the toy pipeline.
Each panel represents one step in the execution of the toy pipeline
presented in Section 2, without the cleanup job. This example assumes
that at least two jobs can run in parallel, and that the pipeline was not
executed before. All jobs are executed as soon as all of their dependencies
are satisfied, possibly with some jobs running in parallel.
soon as this job has been successfully completed, its two children
(quadratic and cubic) are started. This is assuming of course
that the configuration allows PSOM to execute at least two jobs
in parallel (e.g., background execution on a dual-core machine),
see Figure 2B. The job sum is started only when both of its
dependencies have been satisfied, see Figures 2C,D. When all jobs
are completed, the pipeline manager finally exits (Figure 2E).
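The sequence of Figure 2 can be reproduced with a small simulation. The sketch below is a simplified, synchronous model written for illustration: jobs are processed in batches and always succeed, whereas the actual engine starts each job as soon as its dependencies are satisfied. The limit max_parallel is a hypothetical stand-in for the PSOM configuration.

```matlab
jobs = {'sample', 'quadratic', 'cubic', 'sum'};
% deps(i,j) is true when job i must wait for job j (toy pipeline, no cleanup)
deps = false(4);
deps(2,1) = true;   % quadratic needs sample
deps(3,1) = true;   % cubic needs sample
deps(4,2) = true;   % sum needs quadratic
deps(4,3) = true;   % sum needs cubic

max_parallel = 2;   % hypothetical limit on simultaneous jobs
done = false(1, 4);
step = 0;
schedule = {};
while ~all(done)
    % a job is ready when it is not done and none of its parents is pending
    ready = find(~done & ~any(deps(:, ~done), 2)');
    if isempty(ready)
        error('sketch:cycle', 'no job can be started (cyclic dependencies?)');
    end
    batch = ready(1:min(max_parallel, numel(ready)));
    step = step + 1;
    schedule{step} = jobs(batch);
    fprintf('Step %d: %s\n', step, strjoin(jobs(batch), ', '));
    done(batch) = true;   % assume every job in the batch succeeds
end
```

With two parallel slots the simulation runs sample first, then quadratic and cubic together, then sum. Setting max_parallel to 1 in this sketch serializes quadratic and cubic.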
3.2. UPDATING A PIPELINE (WITH A BUG)
This next example shows how the pipeline manager deals with
the update of a pipeline. That is to say that a pipeline is submitted
for execution after it was previously executed using the same logs
folder. If one of the jobs has changed since the last submission,
this job along with all of its children in the dependency graph
are scheduled to be reprocessed. Here, the job quadratic is
modified to introduce a bug, before restarting the pipeline:
% Changing the job quadratic to introduce a bug
pipeline.quadratic.command = 'BUG!';
% Restart the pipeline
psom_run_pipeline(pipeline,opt_pipe)
FIGURE 3 | Pipeline management, example 2: updating a pipeline
(with one bug). Each panel represents one step in the execution of the toy
pipeline presented in Section 2, without the cleanup job. This example
assumes that at least two jobs can run in parallel, and that the pipeline has
already been executed once as outlined in Figure 2. The pipeline is first
started after changing the job quadratic to introduce a bug (panels A–B).
When the execution of the pipeline fails, the job quadratic is modified to
fix the bug. The pipeline is then restarted and completes successfully
(panels C–E).
The pipeline manager first restarts the job quadratic because a
change is detected in its description (Figure 3A). After the execution
of the job is completed, the job is tagged with a “failed” status
(panel B). The job sum is not started because it has a dependency
that cannot be solved, and the pipeline manager simply exits. It is
then possible to access the logs of the failed job, i.e., a text description
of the job, start time, user name, system used as well as end
time and all text outputs:
>> psom_pipeline_visu(opt.path_logs,'log','quadratic');
***********************************
Log of the (octave) job : quadratic
Started on 19-Jul-2011 16:01:36
User: pbellec
host : sorbier
system : unix
***********************************
command     = BUG!
files_in    = /home/pbellec/database/demo_psom/sample.mat
files_out   = /home/pbellec/database/demo_psom/quadratic.mat
files_clean = {}(0x0)
opt         = {}(0x0)
********************
The job starts now !
********************
Something went bad ... the job has FAILED !
The last error message occured was :
parse error:
syntax error
>>> BUG!
File /home/pbellec/svn/psom/trunk/psom_run_job.m at line 110
****************
Checking outputs
****************
The output file or directory ...
/home/pbellec/database/demo_psom/quadratic.mat has not been generated!
*******************************************
19-Jul-2011 16:01:36 : The job has FAILED
Total time used to process the job : 0.00 sec.
*******************************************
The pipeline is then modified to fix the bug in quadratic.
After restarting the pipeline, the jobs quadratic and sum run
sequentially and are successfully completed (Figures 3C–E).
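The update rule used in this example (a modified job is reprocessed along with all of its children) can be sketched as follows. This is an illustrative model rather than PSOM's code: pipe_old stands for the job descriptions recorded in the logs folder during the previous run, and the commands are replaced by short placeholder strings.

```matlab
% Job descriptions recorded at the previous run (placeholder commands)
pipe_old = struct();
pipe_old.sample.command    = 'generate a';
pipe_old.quadratic.command = 'b = a.^2';
pipe_old.cubic.command     = 'c = a.^3';
pipe_old.sum.command       = 'd = b+c';
% New submission: only the job quadratic was modified
pipe_new = pipe_old;
pipe_new.quadratic.command = 'BUG!';

jobs = fieldnames(pipe_new);   % sample, quadratic, cubic, sum
% deps(i,j) is true when job i must wait for job j
deps = false(4);
deps(2,1) = true; deps(3,1) = true; deps(4,2) = true; deps(4,3) = true;

% Step 1: a job is restarted when it is new or its description has changed
restart = false(1, numel(jobs));
for i = 1:numel(jobs)
    restart(i) = ~isfield(pipe_old, jobs{i}) || ...
                 ~isequal(pipe_old.(jobs{i}), pipe_new.(jobs{i}));
end
% Step 2: propagate the restart flag to children until nothing changes
changed = true;
while changed
    children = any(deps(:, restart), 2)';
    changed = any(children & ~restart);
    restart = restart | children;
end
restarted = reshape(jobs(restart), 1, []);
fprintf('Jobs to reprocess: %s\n', strjoin(restarted, ', '));
```

Only quadratic and sum are flagged, while sample and cubic keep their previous results, consistent with Figure 3.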
3.3. ADDING A JOB
Updating the pipeline is not solely restricted to changing the
description of a job that was previously a part of the pipeline. It is
also possible to add new jobs and resubmit the pipeline. Figure 4
shows the steps of resolution of the full toy pipeline (including
the cleanup job) when the subpipeline (not including the cleanup
job) had already been successfully completed prior to
submission. In that case, there is no job that depends on the out-
puts of cleanup, so the only job that needs to be processed
is cleanup itself and the pipeline is successfully completed
immediately after this job is finished.
3.4. RESTARTING A JOB AFTER CLEAN UP
It is sometimes useful to force a job to restart, for example a job
that executes a modified script while the job description remains
identical. PSOM is not able to detect this type of change in the
pipeline (it assumes that all libraries are identical across multi-
ple runs of the pipeline). The following option will force a job to
restart:
opt_pipe.restart = {'quadratic'};
psom_run_pipeline(pipeline,opt_pipe);
In this example, all jobs whose name includes quadratic will
be restarted by the pipeline manager. Further we will assume that
the full toy pipeline (including the cleanup job) has already
FIGURE 4 | Pipeline management, example 3: adding a (cleanup) job.
This example assumes that the toy pipeline (without the cleanup job) had
already been successfully completed. The full toy pipeline (with the
cleanup job) is then submitted for execution. The only job that is not yet
processed is cleanup, and the pipeline execution ends after cleanup
successfully completes.
been completed. In the absence of the cleanup job, the job
quadratic would be restarted as well as all of its children. The
inputsofquadratic, however,havebeendeletedbycleanup.
Itis therefore, not possibleto restart the pipeline atthis stage. The
pipelinemanagerwillautomaticallydetect thatthemissinginputs
can be re-generated by restarting the job sample. It will thus
restart this job as well as all of its children, including cubic (see
Figure5 for a step-by-step resolution of the pipeline). Note that
this behavior is iterative, such that if some inputs from sample
had been missing, the pipeline manager would look for jobs that
could be restarted to generate those files.
FIGURE 5 | Pipeline management, example 4: restarting a job after its
inputs have been cleaned up. This example assumes that the full toy
pipeline (including the cleanup job) has already been successfully
completed. The same pipeline is then submitted for a new run and the job
quadratic is forced to be restarted. Because the inputs of quadratic
(generated by sample) have been deleted by cleanup, the pipeline
manager also restarts the job sample (panel A). Because all jobs depend
indirectly on sample, all jobs in the pipeline have to be reprocessed
(panels B–D).
3.5. PIPELINE HISTORY
When PSOM is solving a pipeline, it does not generate a color-coded
graph such as those presented in Figures 2–5. Rather, it
outputs a text summary of all operations, such as job submission,
job completion, and job failure. Each event is reported along
with the time of its occurrence. This is presented in the following
example for the first execution of the toy pipeline (Figure 2):
*****************************************
The pipeline PIPE is now being processed.
Started on 21-Jul-2011 09:37:45
user: pbellec, host: berry, system: unix
*****************************************
21-Jul-2011 09:37:45 -
...The job sample has been submitted to the
queue (1 jobs in queue).
21-Jul-2011 09:37:48 -
...The job sample has been successfully
completed (0 jobs in queue).
21-Jul-2011 09:37:48 -
...The job quadratic has been submitted to
the queue (1 jobs in queue).
21-Jul-2011 09:37:48 -
...The job cubic has been submitted to the
queue (2 jobs in queue).
21-Jul-2011 09:37:52 -
...The job quadratic has been successfully
completed (1 jobs in queue).
21-Jul-2011 09:37:52 -
...The job cubic has been successfully
completed (0 jobs in queue).
21-Jul-2011 09:37:52 -
...The job sum has been submitted to the
queue (1 jobs in queue).
21-Jul-2011 09:37:55 -
...The job sum has been successfully
completed (0 jobs in queue).
*******************************************
The processing of the pipeline is
terminated.
See report below for job completion status.
21-Jul-2011 09:37:55
*******************************************
All jobs have been successfully completed.
These logs are concatenated across all instances of pipeline
executions, and they are saved in the logs folder. They can be
accessed using a dedicated M-command:
psom_pipeline_visu(opt_pipe.path_logs,'monitor')
The logs of individual jobs can also be accessed with the same
command, using a different option:
psom_pipeline_visu(opt_pipe.path_logs,'log',JOB_NAME)
as shown in Section 3.2. Finally, it is possible to get access to the
execution time for all jobs from the pipeline, which can be useful
for benchmarking purposes:
>> psom_pipeline_visu(opt_pipe.path_logs,'time','')
**********
cleanup   : 0.07 s, 0.00 mn, 0.00 hours, 0.00 days.
cubic     : 0.07 s, 0.00 mn, 0.00 hours, 0.00 days.
quadratic : 0.08 s, 0.00 mn, 0.00 hours, 0.00 days.
sample    : 0.13 s, 0.00 mn, 0.00 hours, 0.00 days.
sum       : 0.11 s, 0.00 mn, 0.00 hours, 0.00 days.
**********
Total computation time : 0.46 s, 0.01 mn, 0.00 hours, 0.00 days.
4. IMPLEMENTATION OF THE PIPELINE EXECUTION ENGINE
4.1. OVERVIEW
At the user level, PSOM requires two objects to be specified: (1) a
pipeline structure which describes the jobs, see Section 2;
(2) an opt_pipe structure which configures how the jobs will
be executed, see Section 4.6. The configuration notably includes
the name of a so-called logs folder, where a comprehensive record
of the pipeline execution is kept. The pipeline execution itself is
initiated by a call to the function psom_run_pipeline, which
comprises three distinct modules:
1. The initialization stage starts off with basic viability checks.
If the same logs folder is used multiple times, the current
pipeline is compared against older records. This determines
which jobs need to be (re)started.
2. When the initialization stage is finished, a process called the
pipeline manager is started. The pipeline manager remains
active as long as the pipeline is running. Its role is to create
small scripts to run individual jobs, and then submit those
scripts for execution as soon as their dependencies are satisfied
and sufficient resources, as determined by the configuration,
become available.
3. Each job is executed in an independent session by a job man-
ager. Upon termination of the job, the completion status
(“failed” or “finished”) is checked and reported to the pipeline
manager using a “tag file” mechanism.
This section describes the implementation of each module, as well
as the configuration of PSOM and the content of the logs folder.
An overview is presented in Figure 6.
4.2. PIPELINE INITIALIZATION
The initialization of pipeline execution includes the following
steps:
1. Check that the (directed) dependency graph of the pipeline is
acyclic. A dependency graph that includes a cycle is impossible
to solve.
2. Check that all of the output files are generated only once (otherwise
the results of the pipeline may depend on an arbitrary
order of job executions).
3. If available, retrieve the history of previous pipeline execu-
tions. Determine which jobs need to be processed based on
the history. Update the pipeline history accordingly. This step
will be further detailed below.
4. Check that all of the input files that are not generated as part
of the pipeline are present on the disk. If not, issue a warning
because some jobs may fail when input files are missing. This,
however, depends on the behavior of the commands specified
by the user and cannot be tested by PSOM. The decision to
continue is thus left to the user who may decide to interrupt
the execution at this stage.
5. Create all the necessary folders for output files. This feature
circumvents the repetitive task of coding the creation of the
output folder(s) inside each individual job.
6. If some of the output files already exist, delete them. This
step is intended to avoid possible errors in the pipeline
execution due to some jobs not overwriting the output
files.
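The acyclicity check of step 1 can be illustrated with a minimal Octave/Matlab sketch. This is not PSOM's actual implementation: the logical matrix deps and the variable names are assumptions made for illustration, and the graph is the toy pipeline of Section 3 (without the cleanup job). The idea is to repeatedly remove the jobs that have no remaining parent; if some jobs can never be removed, they belong to a cycle.

```matlab
% Toy pipeline: sample -> {quadratic, cubic} -> sum
jobs = {'sample', 'quadratic', 'cubic', 'sum'};
% Assumed encoding: deps(i,j) is true when job j needs an output of job i
deps = false(4);
deps(1,2) = true;   % quadratic reads the output of sample
deps(1,3) = true;   % cubic reads the output of sample
deps(2,4) = true;   % sum reads the output of quadratic
deps(3,4) = true;   % sum reads the output of cubic

% Repeatedly remove the jobs that have no parent left in the graph
remaining = true(1, numel(jobs));
changed = true;
while changed
    % Number of parents still present, for each job
    n_parents = sum(deps(remaining, :), 1);
    ready = remaining & (n_parents == 0);
    changed = any(ready);
    remaining(ready) = false;
end
% Jobs that could never be removed are part of (or depend on) a cycle
is_acyclic = ~any(remaining);
```

With the graph above every job is eventually removed, so is_acyclic is true. Adding a dependency from sum back to sample (deps(4,1) = true) would leave all four jobs in remaining.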
To determine what jobs from the pipeline actually need to
be processed, the jobs submitted for execution are compared
with those previously executed in the same logs folder (if any),
along with their completion status. There are three possible status
results:
• ’none’ means that the job has never been started (this is the
default if no previous status exists).
• ’finished’ means that the job was previously executed and
successfully completed.
• ’failed’ means that the job was previously executed and
had failed.
A job will be added to the “to-do list” (i.e., will be executed by the
pipeline manager) if it meets one of the following conditions:
• the job has a ’failed’ status.
• the job has a ’none’ status.
• the description of the job has changed.
• the user forced a restart of the job using opt_pipe.restart.
See Section 3.4.
Every time a job A is added to the to-do list, the following actions
are taken:
• Change the status of the job A to ’none’.
• Add all jobs with a dependency on A to the “to-do list”.
• If an input file of A is missing and a job of the pipeline can
generate this file, add this last job to the “to-do list”.
FIGURE 6 | Overview of the PSOM implementation. On the user’s side
(left panel), a structure pipeline is built to describe the list of jobs, and a
structure opt_pipe is used to configure PSOM. The memory of the pipeline
is a logs folder located on the disk space (right panel), in which a series of
files are stored to provide a comprehensive record of multiple runs of pipeline
execution. The PSOM proceeds in three stages (center panel). At the
initialization stage, the current pipeline is compared with previous executions
to set up a “to-do” list of jobs that need to be (re)started. Then, the pipeline
manager is started, which constantly submits jobs for execution and
monitors the status of on-going jobs. Finally, each job is executed
independently by a job manager which reports the completion status upon
termination (either “failed” or “finished”).
Note that the process of adding a job to the to-do list is recursive
and it can lead to restarting a job with a ’finished’ status,
e.g., if that job has changed or if it is dependent on a job that has
changed.
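The propagation of the to-do list to dependent jobs can be sketched as follows. This is an illustrative sketch only, not the PSOM code: the matrix deps, the flags flag_changed and flag_restart, and the previous status values are hypothetical, and the rule about re-generating missing input files is left out for brevity. The example mimics the situation of Section 3.2, where quadratic failed and sum was never started.

```matlab
jobs = {'sample', 'quadratic', 'cubic', 'sum'};
% Assumed encoding: deps(i,j) is true when job j needs an output of job i
deps = false(4);
deps(1,2) = true; deps(1,3) = true; deps(2,4) = true; deps(3,4) = true;
% Status recorded in a (hypothetical) previous run
status = {'finished', 'failed', 'finished', 'none'};
% Jobs whose description changed, and jobs the user forces to restart
flag_changed = [false false false false];
flag_restart = [false false false false];

% Seed the to-do list with the four conditions listed in the text
todo = strcmp(status, 'failed') | strcmp(status, 'none') ...
       | flag_changed | flag_restart;
% Add the children of every job in the list until nothing changes
n_before = -1;
while sum(todo) ~= n_before
    n_before = sum(todo);
    todo = todo | any(deps(todo, :), 1);
end
% Every job in the to-do list goes back to the 'none' status
status(todo) = {'none'};
```

Here only quadratic and sum end up in the to-do list, while sample and cubic keep their ’finished’ status. Setting flag_restart(1) = true would instead propagate to all four jobs.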
4.3. PIPELINE MANAGER
After the pipeline has been initialized, a small process called the
“pipeline manager” is started. The pipeline manager is essentially
a long loop that constantly monitors the state of the pipeline,
and submits jobs for execution. The pipeline manager as well as
the individual jobs can run within the current O/M session, or
in an independent session running either locally (on the same
machine) or remotely (on another computer/node). At any given
point in time, the pipeline manager submits all of the jobs that do
not have an unsatisfied dependency, as long as there are enough
resources available to process the jobs. The following rules apply
to determine if the dependencies of a job are satisfied:
1. If a job has been successfully completed, the dependencies
to all the children in the dependency graph are considered
satisfied.
2. Conversely, the dependencies of a job are all satisfied if there
are no dependencies in the first place or if the parents in the
dependency graph all have a ’finished’ status.
Depending on the selected configuration, there may also be a
limit to the maximal number of jobs that can be submitted for
execution simultaneously. This was implemented because some
high-performance computing facilities impose such a limit. Upon
completion or failure, the jobs report their status using tag files
located in the logs folder. A tag file is an empty file with a name
of the form JOB_NAME.failed or JOB_NAME.finished,
which indicates the completion status. If the pipeline system were
fully based on tag files to store status, a pipeline with thousands
of jobs would create thousands of tag files. This would
cause substantial delays when accessing the file system. The
pipeline manager thus monitors these tag files and removes them
as soon as their presence is detected. The tag files are used to
update a single O/M file coding for the status of all jobs in the
pipeline. As the tag files are empty files, there is no possible race
condition between their creation and their subsequent removal
by the pipeline manager. The pipeline manager also adds status
updates in a plain text “history” file which can be monitored while
being updated in a terminal or from O/M through the dedicated
command psom_pipeline_visu.
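The tag file mechanism can be pictured with the short sketch below. It only illustrates the idea and should not be read as the PSOM implementation: the temporary folder, the status structure and the job states are assumptions made for the example. Two empty tag files are created, as a job manager would do, and one pass of a simplified pipeline manager turns them into status updates and deletes them.

```matlab
% Hypothetical logs folder, created in a temporary location for this sketch
path_logs = tempname();
mkdir(path_logs);
% Status table kept by the pipeline manager (one field per job)
status = struct('sample', 'finished', 'quadratic', 'running', 'cubic', 'running');

% A job manager signals completion by creating an empty tag file
fclose(fopen(fullfile(path_logs, 'quadratic.finished'), 'w'));
fclose(fopen(fullfile(path_logs, 'cubic.failed'), 'w'));

% One pass of the manager: collect the tags, update the status, delete the tags
tags = {'finished', 'failed'};
for num_t = 1:numel(tags)
    list_files = dir(fullfile(path_logs, ['*.' tags{num_t}]));
    for num_f = 1:numel(list_files)
        % The job name is the file name without its extension
        [tmp, job_name] = fileparts(list_files(num_f).name);
        status.(job_name) = tags{num_t};
        delete(fullfile(path_logs, list_files(num_f).name));
    end
end
% The tag files are gone; only the status structure keeps the information
tag_left = exist(fullfile(path_logs, 'quadratic.finished'), 'file') ~= 0;
rmdir(path_logs);
```

In a real run the status structure would then be written to a single file, so that the number of files in the logs folder stays small regardless of the number of jobs.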
4.4. JOB MANAGER
When a job is submitted for execution by the pipeline man-
ager, the command specified by the user is always executed by
a generic job manager. The job manager is a Matlab function
(psom_run_job) which automates the generation of a job profile,
logs, as well as the tag files that are used to report the
completion status to the pipeline manager. This function notably
executes the command in a try ... catch block, which
means that errors in the command will not crash the job manager.
When the command has finished running, the job manager
will check that all of the output files have been properly gener-
ated. If an error occurs, or if one of the output files is missing,
then the job is marked as ’failed’. Otherwise it is considered
’finished’. The job manager reports back the completion
status of the job to the pipeline manager using a tag file mech-
anism already described in Section 4.3. The job manager also
automatically generates logs, i.e., a text record of the execution
of the command, as well as other automatically generated information
such as the user name, the date, the time, and the type of
operating system, see Section 3.5 for an example. Finally, the job
manager measures and saves the execution time of the command
for profiling purposes.
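The behavior described above can be summarized by the following sketch, written under simplifying assumptions: it is not the code of psom_run_job, the command and file names are hypothetical, and log capture and tag file creation are omitted. The command is evaluated inside a try ... catch block, the expected output is checked, and the execution time is measured.

```matlab
% Hypothetical job: the command saves one variable in one output file
path_work = tempname();
mkdir(path_work);
files_out = fullfile(path_work, 'result.mat');
command = 'x = 42; save(files_out, ''x'');';

tic;
try
    eval(command);
    flag_failed = false;
catch
    % An error in the command is caught, so the job manager itself survives
    flag_failed = true;
end
% The job also fails when an expected output file is missing
flag_failed = flag_failed || ~exist(files_out, 'file');
elapsed = toc;
if flag_failed
    job_status = 'failed';
else
    job_status = 'finished';
end

% Remove the temporary files of this sketch
if exist(files_out, 'file')
    delete(files_out);
end
rmdir(path_work);
```

Replacing the command with one that raises an error, or one that does not write files_out, would set job_status to ’failed’ without interrupting the session.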
4.5. LOGS FOLDER
The logs folder contains the following files:
• PIPE_history.txt: A plain text file with the history of
the execution of the pipeline manager (see Section 3.5 for an
example).
• PIPE_jobs.mat: An O/M file where each job is saved as a
variable. This structure includes the latest version of all jobs
executed from the logs folder.
• PIPE_status.mat: An O/M file where the status of each
job is saved as one (string) variable.
• PIPE_logs.mat: An O/M file where the logs of each job are
saved as one (string) variable.
• PIPE_profile.mat: An O/M file where each job appears
as a variable. This variable is an O/M structure, notably including
the execution time of the command.
• PIPE.mat: An O/M file where PSOM configuration variables
are saved.
Importantly, using PIPE_jobs.mat, it is possible to re-execute
the pipeline from scratch at any point in time, or to access
any of the parameters that were used for the analysis. The logs
folder thus contains enough information to fully reproduce the
results of the pipeline. Moreover, with this information being
stored in the form of an M-structure, it is easy to access and
fully scalable. This can support jobs with potentially hundreds
or even thousands of parameters. Octave and Matlab both use
the HDF5 file format (Poinot, 2010). This format offers internal
compression, yet still allows PSOM to read or write individ-
ual variables without accessing the rest of the file. This is a
key technical feature that enables PSOM to quickly update the
logs/status/profile files for each job, regardless of the size of the
pipeline. Note that the logs folder also contains other files generated
temporarily as part of the pipeline submission/execution
process, as well as backup files in the event the main files are
corrupted.
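Access to one variable at a time can be illustrated with the standard save and load functions. The sketch below uses a temporary file rather than a real PIPE_status.mat, the job names and status values are made up for the example, and the on-disk format is whatever the default of the running Octave or Matlab session is (no format option is set here).

```matlab
% Temporary file standing in for a status file with one variable per job
file_status = [tempname() '.mat'];
sample = 'finished';
quadratic = 'failed';
cubic = 'finished';
save(file_status, 'sample', 'quadratic', 'cubic');

% Read back the status of a single job; the other variables are not loaded
data = load(file_status, 'quadratic');
job_status = data.quadratic;
delete(file_status);
```

Because load returns a structure containing only the requested variable, the cost of querying one job does not require holding the status of all other jobs in memory.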
4.6. PSOM CONFIGURATION
The only necessary option to start a pipeline is setting where to
store the logs folder:
>> opt_pipe.path_logs = '/home/pbellec/database/demo_psom/logs/';
It is highly recommended that the logs folder be used solely for the
purposes of storing the history of the pipeline. Another important,
yet optional parameter is setting how the individual jobs of
the pipeline are executed:
>> opt_pipe.mode = 'batch';
Five execution modes are available:
• ’session’: The jobs are executed in the current O/M ses-
sion, one after the other.
• ’background’: This is the default. Each job is executed
in the background as an independent O/M session, using
an “asynchronous” system call. If the user’s session is inter-
rupted, the pipeline manager and the jobs are interrupted
as well.
• ’batch’: Each job is executed in the background as an independent
O/M session, using the at command on Linux and
the start command on Windows. If the user’s session is interrupted,
the pipeline manager and the jobs are not interrupted.
This mode is less robust than background and may not be
available on some platforms.
• ’qsub’: The jobs are executed on a remote execution server
through independent submissions to a queuing scheduler using
a qsub command (either torque, SGE, or PBS). Such queuing
schedulers are in general available in high-performance computing
facilities. They need to be installed and configured by a
system administrator.
• ’msub’: The jobs are executed on a remote execution server
through independent submissions to a queuing scheduler
using a msub command (MOAB). This is essentially equivalent
to the qsub mode.
Additional options are available to control the bash environment
variables, as well as O/M start-up options, among others. A function
called psom_config can be used to assess whether the
configuration of PSOM is correct. This procedure includes multiple
tests to validate that each stage of a job submission is working
properly. It will provide some environment-specific suggestions
to fix the configuration when a problem is detected. PSOM release
0.9 has been tested in a variety of platforms (Linux, Windows, Mac
OSX) and execution modes. More details can be found in PSOM
online resources, see the discussion section for links.
5. CODING GUIDELINES FOR MODULES AND PIPELINES
The pipeline structure that is used in PSOM is very flexible, as it
does not impose any constraints on the way the code executed
by each job is implemented or on the way the pipeline struc-
ture itself is generated. Additional coding guidelines and tools
have been developed to keep the code concise and scalable, in
the sense that it can be used to deal with functions with tens
or hundreds of parameters and thousands of jobs. These guide-
lines also facilitate the combination of multiple pipelines while
keeping track of all parameters: a critical feature to ensure full
provenance of a pipeline analysis. A generic tool is available to
test the presence of mandatory parameters and set up default
parameter values. Another tool is the so-called “brick” function
type, which can be used to run jobs. A last set of guidelines
and tools have been developed to generate the pipeline structures
themselves.
5.1. SETTING THE JOB PARAMETERS
There is no strict framework to set the default of the input
arguments in Octave/Matlab. We developed our own guidelines,
which have several advantages over a more traditional method
consisting in passing each parameter one by one. As can be seen
in the attributes of a job, our method consists of passing all
parameters as fields of a single structure opt. A generic function
psom_struct_defaults can be used to check for the pres-
ence of mandatory input arguments, set default values, and issue
warnings for unknown arguments. The following example shows
how to set the input arguments of a function using that approach:
opt.order  = [1 3 5 2 4 6];
opt.slic   = 1;
opt.timing = [0.2,0.2];
list_fields   = { 'method' , 'order' , 'slice' , 'timing' , 'verb' };
list_defaults = { 'linear' , NaN , [] , NaN , true };
opt = psom_struct_defaults(opt,list_fields,list_defaults)
warning: The following field(s) were ignored in the structure : slic
opt = {
method = linear
order = [1 3 5 2 4 6]
slice = [](0x0)
timing = [0.20000 0.20000]
verb = 1
}
Note that only three lines of code are used to set all the defaults,
and that a warning was automatically issued for the typo slic
instead of slice. Such unlisted fields are simply ignored. Also,
the default value NaN can be used to indicate a mandatory argu-
ment (an error will be issued if this field is absent). This approach
will scale up well with a large number of parameters. It also facil-
itates the addition of extra parameters in future developments
while maintaining backwards compatibility. As long as a new
parameter is optional, a code written for old specifications will
remain functional.
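To make the convention concrete, the logic described above can be approximated in a few lines of plain Octave/Matlab. The sketch below is an illustration of the idea and not the source of psom_struct_defaults: the variable names opt_clean and ignored are hypothetical, and the way unlisted fields are reported is simplified.

```matlab
opt = struct('order', [1 3 5 2 4 6], 'slic', 1, 'timing', [0.2 0.2]);
list_fields   = {'method', 'order', 'slice', 'timing', 'verb'};
list_defaults = {'linear', NaN, [], NaN, true};

opt_clean = struct();
for num_f = 1:numel(list_fields)
    name = list_fields{num_f};
    default = list_defaults{num_f};
    if isfield(opt, name)
        % The user provided a value: keep it
        opt_clean.(name) = opt.(name);
    elseif isnumeric(default) && isscalar(default) && isnan(default)
        % A NaN default marks a mandatory field
        error('The mandatory field %s is missing', name);
    else
        % Optional field: fall back on the default value
        opt_clean.(name) = default;
    end
end
% Fields that are not listed (here the typo slic) are reported and dropped
ignored = setdiff(fieldnames(opt), list_fields);
if ~isempty(ignored)
    warning('The following field(s) were ignored: %s', sprintf('%s ', ignored{:}));
end
```

Running this sketch issues a warning about slic and returns a structure with the five listed fields, where slice is empty because the misspelled field was not recognized.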
5.2. BUILDING MODULES FOR A PIPELINE : THE “BRICK”
FUNCTION TYPE
The bricks are a special type of O/M function which take files
as inputs and outputs, along with a structure to describe some
options. In brief, a brick precisely mimics the structure of a job
in a pipeline, except for the files_clean field. The command
used to call a brick always follows the same syntax:
[files_in,files_out,opt] = brick_name(files_in,files_out,opt)
where files_in, files_out and opt play the same roles as
the fields of a job. The key mechanism of a brick is that there will
always be an option called opt.flag_test which allows the
programmer to make a test, or dry-run. If that (boolean) option is
true, the brick will not do anything but update the default parameters
and file names in its three arguments. Using this mechanism,
it is possible to use the brick itself to generate an exhaustive list of
the brick parameters, and test if a subset of parameters are acceptable
to run the brick. In addition, if a change is made to the default
parameters of a brick, this change will be apparent to any piece of
code that is using a test to set the parameters, without a need to
change the code.
When the file names files_in or files_out are struc-
tures, a missing field will be interpreted either as a missing
input which can be replaced by a default dataset, or an output
that does not need to be generated. If the field is present but
empty, then a default file name is generated. Note that an option
opt.folder_out can generally be used to specify in which
folder the default outputs should be generated. Finally, if a field is
present and non-empty, the file names specified by the users are
used to generate the outputs. These conventions allow complete
control over the number of output files generated by the brick,
and the flexibility to use default names. The following example
is a dry-run with a real brick implemented in the neuroimaging
analysis kit17 (NIAK) (Bellec et al., 2011):
files_in = '/database/func_motor_subject1.mnc';
files_out.filtered_data = '';
files_out.var_low = '';
opt.hp = 0.01;
opt.folder_out = '/database/filtered_data/';
opt.flag_test = true;
>> [files_in,files_out,opt] = niak_brick_time_filter(files_in,files_out,opt)
files_in = /database/func_motor_subject1.mnc
files_out =
{
filtered_data = /database/filtered_data//func_motor_subject1_f.mnc
var_high = gb_niak_omitted
var_low = /database/filtered_data//func_motor_subject1_var_low.mnc
beta_high = gb_niak_omitted
beta_low = gb_niak_omitted
dc_high = gb_niak_omitted
dc_low = gb_niak_omitted
}
opt =
{
hp = 0.010000
folder_out = /database/filtered_data/
flag_test = 1
flag_mean = 1
flag_verbose = 1
tr = -Inf
lp = Inf
}
17 code.google.com/p/niak
The default output names have been generated in opt.folder_out,
and some of the outputs will not be generated
(they are associated with the special tag ’gb_niak_omitted’).
A large number of other parameters that were not
used in the call have been assigned some default values.
5.3. PIPELINE IMPLEMENTATION
A so-called pipeline generator is a function that, starting from a
minimal description of a file collection and some options, gen-
erates a full pipeline. Because a pipeline can potentially create a
very large number of outputs, it is difficult to implement a generic
system that is as flexible as a brick in terms of output selection.
Instead, the organization of the output of the pipeline will fol-
low some canonical, well-structured pre-defined organization. As
a consequence, the pipeline generator only takes two input argu-
ments, files_in and opt (similar to those of a job), and does
not feature files_out. The following example shows how to
set files_in for niak_pipeline_corsica, implemented
in NIAK:
%% Subject 1
files_in.subject1.fmri{1} = '/demo_niak/func_motor_subject1.mnc';
files_in.subject1.fmri{2} = '/demo_niak/func_rest_subject1.mnc';
files_in.subject1.transf = '/demo_niak/transf_subject1.xfm';
%% Subject 2
files_in.subject2.fmri{1} = '/demo_niak/func_motor_subject2.mnc';
files_in.subject2.fmri{2} = '/demo_niak/func_rest_subject2.mnc';
files_in.subject2.transf = '/demo_niak/transf_subject2.xfm';
The argument opt will include the following standard fields:
• opt.folder_out: Name of the folder where the outputs
of the pipeline will be generated (possibly organized into
subfolders).
• opt.size_output: This parameter can be used to vary the
amount of outputs generated by the pipeline (e.g., ’all’:
generate all possible outputs; ’minimum’: clean all intermediate
outputs, etc.).
• opt.brick1: All the parameters of the first brick used in the
pipeline.
• opt.brick2: All the parameters of the second brick used in
the pipeline.
• ...
Inside the code of the pipeline template, adding a job to the
pipeline will typically involve a loop similar to the following
example:
% Initialize the pipeline to a structure with no field
pipeline = struct();
% Get the list of subjects from files_in
list_subject = fieldnames(files_in);
% Loop over subjects
for num_s = 1:length(list_subject)
    % Plug the 'fmri' input files of the subjects in the job
    job_in = files_in.(list_subject{num_s}).fmri;
    % Use the default output name
    job_out = '';
    % Force a specific folder organization for outputs
    opt.fmri.folder_out = [opt.folder_out list_subject{num_s} filesep];
    % Give a name to the jobs
    job_name = ['fmri_' list_subject{num_s}];
    % The name of the employed brick
    brick = 'brick_fmri';
    % Add the job to the pipeline
    pipeline = psom_add_job(pipeline,job_name,brick,job_in,job_out,opt.fmri);
    % The outputs of this brick are just intermediate outputs :
    % clean these up as soon as possible
    pipeline = psom_add_clean(pipeline,[job_name '_clean'],pipeline.(job_name).files_out);
end
The command psom_add_job first runs a test with the brick
to update the default parameters and file names, and then adds
the job with the updated input/output files and options. By
virtue of the “test” mechanism, the brick is itself defining all
the defaults. The coder of the pipeline does not actually need
to know which parameters are used by the brick. Any mod-
ification made to a brick will immediately propagate to all
pipelines, without changing one line in the pipeline genera-
tor. Moreover, if a mandatory parameter has been omitted by
the user, or if a parameter name is not correct, an appropri-
ate error or warning will be generated at this stage, prior to
any work actually being performed by the brick. The command
psom_add_clean adds a cleanup job to the pipeline, which
deletes the specified list of files. Because the jobs can be speci-
fied in any order, it is possible to add a job and its associated
cleanup at the same time. Finally, it is very simple to combine
pipelines together: the command psom_merge_pipeline
simply combines the fields of two structures pipeline1 and
pipeline2.
6. APPLICATIONS IN NEUROIMAGING
The PSOM project is just reaching the end of its beta testing
phase, and as such it has only been adopted by a couple
of laboratories as a development framework. There have
nonetheless been several successful applications, including the generation of
simulated fMRI (Bellec et al., 2009), clustering in resting-state
fMRI (Bellec et al., 2010a,b), clustering in event-related fMRI
(Orban et al., 2011), simulations in electroencephalography and
optical imaging (Machado et al., 2011), reconstruction of fiber
tracts (Kassis et al., 2011), as well as non-parametric permutation
testing (Ganjavi et al., 2011). The PSOM framework has also
been used for the development of an open-source software
package called NIAK18 (Bellec et al., 2011). This software
package, which relies on the PSOM execution engine, has been
used in a number of recent studies (Dansereau et al., 2011;
Moeller et al., 2011; Schoemaker et al., 2011; Carbonell et al.,
2012). We used the fMRI preprocessing pipeline from the NIAK
package to run benchmarks of the parallelization efficiency of
the PSOM execution engine. This pipeline has been integrated
into the CBRAIN computing platform (Frisoni et al., 2011),
where it has been used to preprocess and publicly release19
fMRI datasets collected for about 1000 children and adolescents,
as part of the ADHD-200 initiative20 (Lavoie-Courchesne et al.,
2012).
6.1. THE NIAK FMRI PREPROCESSING PIPELINE
The NIAK fMRI preprocessing pipeline applies the follow-
ing operations to each functional and structural dataset in a
database. The first 10 s of the acquisition are suppressed to
allow the magnetization to reach equilibrium. The fMRI volumes
are then corrected for inter-slice differences in acquisition
time, rigid body motion, slow time drifts, and physiological
noise (Perlbarg et al., 2007). For each subject, the mean
motion-corrected volume of all the datasets is coregistered with
a T1 individual scan using minctracc (Collins et al., 1994), which
is itself non-linearly transformed to the Montreal Neurological
Institute (MNI) non-linear template (Fonov et al., 2011) using
the CIVET pipeline (Ad-Dab’bagh et al., 2006). The functional
volumes are then re-sampled in the stereotaxic space and spatially
smoothed.
Most operations are implemented through generic medical
image processing modules, the MINC tools21. These tools are
coded in a mixture of C and C++ languages, as well as some PERL
scripts, and usually operate through the command line. Simple
PSOM-compliant “brick” wrappers have been implemented in
NIAK for the required MINC tools. Other bricks are also pure
O/M implementations for original methods or a port from other
O/M projects. Finally, some of the operations (motion correction,
correction of physiological noise) are themselves pipelines
involving several steps, see Figure 7 for an example of a full
dependency graph. The code of the individual NIAK fMRI preprocessing
pipeline is 735 lines long, and only 321 lines after
excluding header comments and variable initialization. The code
is thus concise enough to be easily reviewed, quality-checked, and
modified.
18 www.nitrc.org/projects/niak
19 http://www.nitrc.org/plugins/mwiki/index.php/neurobureau:NIAKPipeline
20 http://fcon_1000.projects.nitrc.org/indi/adhd200/
21 http://en.wikibooks.org/wiki/MINC
6.2. BENCHMARKS
We used the Cambridge resting-state fMRI database for the
benchmark, which is publicly available as part of the 1000 func-
tional connectome project22. This database (Liu et al., 2009)
includes 198 subjects with one structural MRI and one fMRI run
each (119 volumes, TR = 3 s). The processing was done in various
computing environments and execution modes to test the
scalability of PSOM:
• peuplier-n: A machine with an Intel(R) Core(TM) i7 CPU
(four computing cores, eight threads), 16 GB of memory, a
local file system and an Ubuntu operating system. For n = 1,
both the pipeline manager and individual jobs were executed
sequentially in a single Octave session. For n > 1, the pipeline
manager and individual jobs were executed in the background
in independent Octave sessions using an at command, with
up to n jobs running in parallel.
• magma-n: A machine with four six-core AMD Opteron(TM)
Processor 8431 (for a total of 24 computing cores), 64 GB of
memory, an NTFS mounted file system and an openSUSE
operating system. For n = 1, both the pipeline manager and
individual jobs were executed sequentially in a single Octave
session. For n > 1, the pipeline manager ran in the background
using an at command and individual jobs were executed in
the background in independent Octave sessions using an SGE
qsub command, with up to n jobs running in parallel.
• guillimin-n: A supercomputer with 14400 Intel Westmere-EP
cores distributed across 1200 compute nodes located at the
CLUMEQ-McGill data centre, Ecole de Technologie Superieure
in Montreal, Canada. guillimin ranked 83rd in the top
500 list of the most powerful supercomputers, released in
November, 201123. Included in the facility is nearly 2 PB of disk
storage using the general parallel file system (GPFS). For n = 1,
both the pipeline manager and individual jobs were executed
sequentially in a single Octave session. For n > 1, the pipeline
manager ran in the background using an at command and
individual jobs were executed on distributed computing nodes
in independent Octave sessions using a MOAB msub command,
with up to n jobs running in parallel.
We investigated the performance of PSOM on
peuplier-8, magma-{8,16,24,40} and guillimin-
{24,50,100,200}. For experiments on peuplier and
magma, Octave release 3.2.4 was used, with PSOM release 0.8.9
and NIAK release 0.6.4.3. On guillimin, Octave release 3.4.2
was available and some development versions of NIAK (v1270
on the subversion repository) and PSOM (v656 of the subversion
repository) were used because they implemented some bug fixes
for this release. During the time of the experiment, the PSOM
jobs were the only ones running on the execution servers for
22 http://fcon_1000.projects.nitrc.org/
23 www.top500.org/lists/2011/11
FIGURE 7 | An example of dependency graph for the NIAK fMRI
preprocessing pipeline. This example includes two subjects with
two fMRI datasets each. The pipeline includes close to 100 jobs,
and cleanup jobs have been removed to simplify the represen-
tation. Colors have been used to code the main stages of the
preprocessing.
peuplier and magma, while guillimin had about 75% of its
processors in use.
6.3. RESULTS
The raw Cambridge database had a size of 7.7GB, with a
total of 21GB generated by the pipeline (output/input ratio
of 273%). The NIAK pipeline included 5153 jobs featuring
8348 unique input/output files (not including temporary files).
Figure 8A shows the distribution of execution times for all jobs
on peuplier-8. The pipeline included about 1500 “cleanup”
jobs deleting intermediate files, with an execution time of less
than 0.2s. The other jobs lasted anywhere between a few sec-
onds and 15min, with hundreds of jobs of less than 2min.
Because of the large number of very short jobs, the pipeline
manager was not able to constantly submit enough jobs to
use n cores at all times, even when it would have been pos-
sible in theory. This effect was small on peuplier, magma
or guillimin-{24,50}, see Figure 8B–C. It became pro-
nounced on guillimin-{100,200}, see Figure 8D. The
serial execution time of the pipeline (sum of execution time of all
jobs) varied a lot from one configuration to the other: from 120h
(5 days) on guillimin-24 to almost double (220h, 9 days) on
magma-8. The serial execution time, however, increased quickly
on guillimin with an increasing n, see Figure 8E. Despite this
effect, and thanks to parallelization, the parallel execution time
(time elapsed between the beginning and the end of the pipeline
processing) steadily decreased with an increasing n, see Figure 8F.
The speed-up factor (defined as the ratio between the serial and
parallel execution time) still departed from the optimal value n.
Consistent with our observations on the effective number of cores
used on average, the departure between the speed-up factor and n
increased with n, and became pronounced for n greater than 100,
see Figure 8G. This result can be expressed as a parallelization
efficiency, defined as the ratio between the empirical speed-up
factor and n. Parallelization efficiency was excellent (over 90%)
on peuplier-8 and gradually decreased with an increasing n
to reach about 80% on peuplier-24 or guillimin-24 and
60% on guillimin-200. In this last setting, the fMRI datasets
and structural scans of about 200 subjects were still processed in
a little bit more than 2h.
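The speed-up factor and parallelization efficiency defined above can be computed with a few lines of code. The following Python sketch is only an illustration: the function names are hypothetical, and the numerical values are round numbers chosen to be compatible with the orders of magnitude reported for guillimin-200 (about 60% efficiency, a little more than 2h of parallel time), not measurements from the benchmark.

```python
def speedup(serial_hours, parallel_hours):
    """Speed-up factor: serial execution time divided by parallel (wall-clock) time."""
    if parallel_hours <= 0:
        raise ValueError("parallel_hours must be positive")
    return serial_hours / parallel_hours


def efficiency(serial_hours, parallel_hours, n):
    """Parallelization efficiency: empirical speed-up divided by the ideal speed-up n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return speedup(serial_hours, parallel_hours) / n


# Hypothetical round numbers, NOT values measured in the benchmark.
serial_hours = 240.0    # sum of the execution times of all jobs
parallel_hours = 2.0    # time elapsed between start and end of the pipeline
n = 200                 # user-specified maximum number of concurrent jobs

s = speedup(serial_hours, parallel_hours)
e = efficiency(serial_hours, parallel_hours, n)
print(f"speed-up = {s:.0f}, efficiency = {e:.0%}")
```

Because the serial time is the sum of all job durations, the speed-up factor is also the average number of jobs running at any given time over the whole execution, which is why it tracks the effective number of cores used.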
7. DISCUSSION
7.1. OVERVIEW
We propose a new pipeline system, PSOM, to implement, run, and re-run pipeline
analyses on large databases. Our approach is well-suited for
pipelines involving heterogeneous tools that can communicate
through a file system in a largely parallel fashion. This notably
matches the constraints found in neuroimaging. The PSOM
FIGURE 8 | Benchmark experiments with the NIAK fMRI preprocessing
pipeline. The distribution of execution time for all jobs on one server
(peuplier) is shown in panel (A). The number of jobs running at any given
time across the whole execution of the pipeline (averaged on 5min time
windows) is shown in panels (B–D) for servers peuplier, magma and
guillimin, respectively. The user-specified maximum number of
concurrent jobs is indicated by a straight line. The serial execution time of the
pipeline, i.e., the sum of execution times for all jobs, is shown in panel (E).
The parallel execution time, i.e., the time elapsed between the beginning and
the end of the pipeline processing, is shown in panel (F). The speed-up
factor, i.e., serial time divided by parallel time, is presented in panel (G), along
with the ideal speed-up, equal to the user-specified maximal number of
concurrent jobs. Finally, the parallelization efficiency (i.e., the ratio between
the empirical speed-up and the ideal speed-up) is presented in panel (H).
coding standards produce concise, readable code which in our
experience is easy to maintain and develop over time. It is also
highly scalable: a pipeline can incorporate thousands of jobs, each
one featuring tens to hundreds of parameters. From a developer’s
perspective, using PSOM does not limit the scope of distribution
of the software, as pipelines can be executed inside an O/M ses-
sion as would any regular O/M code. The very same code can
also be deployed on a multi-core machine or in a supercomputing
environment simply by changing the PSOM configuration.
7.2. ONLINE DOCUMENTATION
The main body of documentation is available on a wiki hosted
online by Google Code, see Table 1. This resource is updated for
each new release of PSOM. It covers selected topics such as the
Table 1 | Online resources for PSOM.
Resource               URL
Developer’s site       code.google.com/p/psom
User’s site            nitrc.org/projects/psom/
Downloads              nitrc.org/frs/?group_id=316
Forum                  nitrc.org/forum/forum.php?forum_id=1316
Wiki overview          code.google.com/p/psom/w/list
PSOM short tutorial    code.google.com/p/psom/wiki/HowToUsePsom
Coding guidelines      code.google.com/p/psom/wiki/CodingGuidelines
PSOM configuration     code.google.com/p/psom/wiki/ConfigurationPsom
PSOM tests             code.google.com/p/psom/wiki/TestPsom
configuration of the pipeline manager more extensively than this
paper. The “short PSOM tutorial” reproduces step-by-step all the
experiments reported in Section 3.
7.3. THE BENEFITS OF PIPELINE ANALYSIS
Parallel computing is a central feature of PSOM, as it reduces the
time necessary to complete an analysis. The pipeline
system can be beneficial even when used within a single session.
PSOM automatically keeps a record of all the steps and parame-
ters of the pipeline. These logs are detailed enough to reproduce
an entire analysis (as long as the production environment itself
can be reproduced). This is an essential feature from the perspec-
tive of reproducible research. The pipeline logs can also be used
for profiling the execution time of the whole pipeline as well as
its subparts. This can be useful to run a benchmark or to iden-
tify computational bottlenecks. Finally, it is possible to restart the
pipeline at any stage, or even to add stages or change parameters.
Over multiple executions, PSOM will restart only the pipeline
stages impacted by the changes. This ability to properly handle
pipeline updates is critical in the development phase, and can
also be useful to test alternative choices of parameter/algorithmic
selection.
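The idea of restarting only the stages impacted by a change can be illustrated with a toy example. The Python sketch below is not PSOM's implementation: the job names, file names, and the `files_in`/`files_out` keys are hypothetical, and it assumes that a job depends on another whenever one of its input files is an output file of that other job. Given a set of modified jobs, it returns those jobs together with everything downstream of them.

```python
from collections import deque


def build_dependencies(jobs):
    """Map each job to the jobs that consume at least one of its output files."""
    producer = {}
    for name, job in jobs.items():
        for path in job["files_out"]:
            producer[path] = name
    children = {name: set() for name in jobs}
    for name, job in jobs.items():
        for path in job["files_in"]:
            parent = producer.get(path)
            if parent is not None and parent != name:
                children[parent].add(name)
    return children


def jobs_to_rerun(jobs, changed):
    """Return the changed jobs plus every job downstream of them."""
    children = build_dependencies(jobs)
    to_run = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in to_run:
                to_run.add(child)
                queue.append(child)
    return to_run


# Hypothetical pipeline: two subjects, two stages each, one group-level job.
jobs = {
    "motion_s1": {"files_in": ["raw_s1.mnc"], "files_out": ["mc_s1.mnc"]},
    "smooth_s1": {"files_in": ["mc_s1.mnc"], "files_out": ["sm_s1.mnc"]},
    "motion_s2": {"files_in": ["raw_s2.mnc"], "files_out": ["mc_s2.mnc"]},
    "smooth_s2": {"files_in": ["mc_s2.mnc"], "files_out": ["sm_s2.mnc"]},
    "group_avg": {"files_in": ["sm_s1.mnc", "sm_s2.mnc"], "files_out": ["avg.mnc"]},
}

# Changing the smoothing parameters of subject 1 leaves subject 2 untouched.
print(sorted(jobs_to_rerun(jobs, ["smooth_s1"])))
```

In this toy pipeline, modifying `smooth_s1` triggers only `smooth_s1` and `group_avg`, while the three other jobs keep their existing outputs.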
7.4. PARALLEL COMPUTATION CAPABILITIES
The benchmark experiments demonstrated that PSOM is able to
handle pipelines featuring thousands of jobs and tens of giga-
bytes of data. It can also dramatically reduce the execution time:
an fMRI database including almost 200 subjects could be pre-
processed in less than 3h. The parallelization efficiency was