with electroencephalography (MEG/EEG). A battery of behavioral
and cognitive tests will also be included along with the collection
of genetic material. This endeavor will yield valuable informa-
tion about brain connectivity, its relationship to behavior, and
the contributions of genetic and environmental factors to indi-
vidual differences in brain circuitry. The data generated by the
WU-Minn HCP consortium will be openly shared with the sci-
The HCP has a broad informatics vision that includes support
for the acquisition, analysis, visualization, mining, and sharing of
connectome-related data. As it implements this agenda, the con-
sortium seeks to engage the neuroinformatics community through
open source software, open programming interfaces, open-access
data-sharing, and standards-based development. The HCP infor-
matics approach includes three basic domains.
• Data support components include tools and services that
manage data (e.g., data uploads from scanners and other data
collection devices); execution and monitoring of quality assu-
rance, image processing, and analysis pipelines and routines;
secure long-term storage of acquired and processed data;
search services to identify and select subsets of the data; and
download mechanisms to distribute data to users around the
The past decade has seen great progress in the reﬁnement of non-
invasive neuroimaging methods for assessing long-distance con-
nections in the human brain. This has given rise to the tantalizing
prospect of systematically characterizing human brain connectivity,
i.e., mapping the connectome (Sporns et al., 2005). The eventual
elucidation of this amazingly complex wiring diagram should reveal
much about what makes us uniquely human and what makes each
person different from all others.
The NIH recently funded two consortia under the Human
Connectome Project (HCP)1. One is led by Washington University
and University of Minnesota and involves seven other institu-
tions (the “WU-Minn HCP consortium”)2. The other, led by
Massachusetts General Hospital and UCLA (the MGH/UCLA HCP
consortium), focuses on building and reﬁning a next-generation 3T
MR scanner for improved sensitivity and spatial resolution. Here,
we discuss informatics aspects of the WU-Minn HCP consortium’s
plan to map human brain circuitry in 1,200 healthy young adults
using cutting-edge non-invasive neuroimaging methods. Key
imaging modalities will include diffusion imaging, resting-state
fMRI, task-evoked fMRI, and magnetoencephalography combined
Informatics and data mining tools and strategies for the
Human Connectome Project
Daniel S. Marcus1*, John Harwell2, Timothy Olsen1, Michael Hodge1, Matthew F. Glasser2, Fred Prior1,
Mark Jenkinson3, Timothy Laumann4, Sandra W. Curtiss2 and David C. Van Essen2†
1 Department of Radiology, Washington University School of Medicine, St. Louis, MO, USA
2 Department of Anatomy and Neurobiology, Washington University School of Medicine, St. Louis, MO, USA
3 Oxford Centre for Functional Magnetic Resonance Imaging of the Brain, University of Oxford, John Radcliffe Hospital, Oxford, UK
4 Department of Neurology, Washington University School of Medicine, St. Louis, MO, USA
The Human Connectome Project (HCP) is a major endeavor that will acquire and analyze
connectivity data plus other neuroimaging, behavioral, and genetic data from 1,200 healthy
adults. It will serve as a key resource for the neuroscience research community, enabling
discoveries of how the brain is wired and how it functions in different individuals. To fulﬁll its
potential, the HCP consortium is developing an informatics platform that will handle: (1) storage
of primary and processed data, (2) systematic processing and analysis of the data, (3) open-
access data-sharing, and (4) mining and exploration of the data. This informatics platform will
include two primary components. ConnectomeDB will provide database services for storing
and distributing the data, as well as data analysis pipelines. Connectome Workbench will provide
visualization and exploration capabilities. The platform will be based on standard data formats
and provide an open set of application programming interfaces (APIs) that will facilitate broad
utilization of the data and integration of HCP services into a variety of external applications.
Primary and processed data generated by the HCP will be openly shared with the scientiﬁc
community, and the informatics platform will be available under an open source license. This
paper describes the HCP informatics platform as currently envisioned and places it into the
context of the overall HCP vision and agenda.
Keywords: connectomics, Human Connectome Project, XNAT, caret, resting state fMRI, diffusion imaging, network
analysis, brain parcellation
Trygve B. Leergaard, University of
Jan G. Bjaalie, University of Oslo,
Russell A Poldrack, University of
Daniel S. Marcus, Washington
University School of Medicine, 4525
Scott Avenue, Campus Box 8225,
St. Louis, MO, USA.
†David C. Van Essen for the WU-Minn
Frontiers in Neuroinformatics www.frontiersin.org June 2011 | Volume 5 | Article 4 | 1
published: 27 June 2011
• Visualization components include a spectrum of tools to view
anatomic and functional brain data in volumetric and surface
representations and also using network and graph-theoretic
representations of the connectome.
• Discovery components are an especially important category of
the HCP’s informatics requirements, including user interfaces
(UI) for formulating database queries, linking between related
knowledge/database systems, and exploring the relationship of
an individual’s connectome to population norms.
The HCP is expected to generate approximately 1 PB of data, which
will be made accessible via a tiered data-sharing strategy. Besides
the sheer amount of data, there will be major challenges associated
with handling the diversity of data types derived from the various
modalities of data acquisition, the complex analysis streams asso-
ciated with each modality, and the need to cope with individual
variability in brain shape as well as brain connectivity, which is
especially dramatic for cerebral cortex.
To support these needs, the HCP is developing a comprehensive
informatics platform centered on two interoperable components:
ConnectomeDB, a data management system, and Connectome
Workbench (CWB), a software suite that provides visualization
and discovery capabilities.
ConnectomeDB is based on the XNAT imaging informatics plat-
form, a widely used open source system for managing and sharing
imaging and related data (Marcus et al., 2007)3. XNAT includes an
open web services application programming interface (API) that
enables external client applications to query and exchange data with
XNAT hosts. This API will be leveraged within the HCP informatics
platform and will also help externally developed applications con-
nect to the HCP. CWB is based on Caret software, a visualization
and analysis platform that handles structural and functional data
represented on surfaces and volumes and on individuals and atlases
(Van Essen et al., 2001). The HCP also beneﬁts from a variety of
processing and analysis software tools, including FreeSurfer, FSL,
Here, we provide a brief overview of the HCP, then describe
the HCP informatics platform in some detail. We also provide a
sampling of the types of scientiﬁc exploration and discovery that
it will enable.
OVERVIEW OF THE HUMAN CONNECTOME PROJECT
INFERRING LONG-DISTANCE CONNECTIVITY FROM IN VIVO IMAGING
The two primary modalities for acquiring information about
human brain connectivity in vivo are diffusion imaging (dMRI),
which provides information about structural connectivity, and
resting-state functional MRI (R-fMRI), which provides information
about functional connectivity. The two approaches are complementary,
and each is very promising. However, each has significant
limitations that warrant brief comment.
Diffusion imaging relies on anisotropies in water diffusion to
determine the orientation of ﬁber bundles within white matter.
Using High Angular Resolution Diffusion Imaging (HARDI), mul-
tiple ﬁber orientations can be identiﬁed within individual voxels.
This enables tracking of connections even in regions where multiple
ﬁber bundles cross one another. Probabilistic tractography inte-
grates information throughout the white matter and can reveal
detailed information about long-distance connectivity patterns
between gray-matter regions (Johansen-Berg and Behrens, 2009;
Johansen-Berg and Rushworth, 2009). However, uncertainties aris-
ing at different levels of analysis can lead to both false positives
and false negatives in tracking connections. Hence, it is impor-
tant to continue reﬁning the methods for dMRI data acquisition
R-fMRI is based on spatial correlations of the slow ﬂuctuations
in the BOLD fMRI signal that occur at rest or even under anesthesia
(Fox and Raichle, 2007). Studies in the macaque monkey demon-
strate that R-fMRI correlations tend to be strong for regions known
to be anatomically interconnected, but that correlations can also
occur between regions that are linked only indirectly (Vincent et al.,
2007). Thus, while functional connectivity maps are not a pure
indicator of anatomical connectivity, they represent an invaluable
measure that is highly complementary to dMRI and tractography,
especially when acquired in the same subjects.
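The point that correlations can arise between regions linked only indirectly can be illustrated with a small sketch. It uses synthetic time series (not HCP data) in which two hypothetical regions, A and C, are each driven by a third region B but have no direct link. The full correlation between A and C is high, whereas the partial correlation, which removes the linear contribution of B, is close to zero. The region names, noise levels, and number of time points are assumptions chosen for illustration only.

```python
import math
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data (not HCP data): regions A and C are each driven by B,
# but have no direct link to one another.
rng = random.Random(0)
n_timepoints = 2000
region_b = [rng.gauss(0, 1) for _ in range(n_timepoints)]
region_a = [b + rng.gauss(0, 0.5) for b in region_b]
region_c = [b + rng.gauss(0, 0.5) for b in region_b]

full_ac = pearson(region_a, region_c)
partial_ac = partial_corr(region_a, region_c, region_b)
print(f"full correlation A-C:    {full_ac:.2f}")
print(f"partial correlation A-C: {partial_ac:.2f}")
```

In this toy setting the full correlation reflects the shared input from B, which is one reason functional connectivity maps are best interpreted alongside tractography rather than as a direct readout of anatomical connections.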
The HCP will carry out a “macro-connectome” analysis of long-
distance connections at a spatial resolution of 1–2 mm. At this scale,
each gray-matter voxel contains hundreds of thousands of neurons and
hundreds of millions of synapses. Complementary efforts to chart the
“micro-connectome” at the level of cells, dendrites, axons, and synapses
aspire to reconstruct domains up to a cubic millimeter (Briggman and
Denk, 2006; Lichtman et al., 2008), so that the macro-connectome and
micro-connectome domains will barely overlap in their spatial scales.
A TWO-PHASE HCP EFFORT
Phase I of the 5-year WU-Minn HCP consortium grant is focused
on additional reﬁnements and optimization of data acquisition
and analysis stages and on implementing a robust informatics plat-
form. Phase II, from mid-2012 through mid-2015, will involve data
acquisition from the main cohort of 1,200 subjects as well as con-
tinued reﬁnement of the informatics platform and some analysis
methods. This section summarizes key HCP methods relevant to
the informatics effort and describes some of the progress already
made toward Phase I objectives. A more detailed description of our
plans will be published elsewhere.
We plan to study 1,200 subjects (300 healthy twin pairs and available
siblings) between the ages of 22 and 35. This design, coupled with
collection of subjects’ DNA, will yield invaluable information about
(i) the degree of heritability associated with speciﬁc components
of the human brain connectome; and (ii) associations of speciﬁc
genetic variants with these components in healthy adults. It will
also enable genome-wide testing for additional associations (e.g.,
Visscher and Montgomery, 2009).
All 1,200 subjects will be scanned at Washington University on
a dedicated 3 Tesla (3T) Siemens Skyra scanner. The scanner
will be customized to provide a maximum gradient strength
of ∼100 mT/m, more than twice the standard 40 mT/m for
the Skyra. A subset of 200 subjects will also be scanned at the
University of Minnesota using a new 7T scanner, which is
Marcus et al. HCP informatics
and function. It will also provide a starting point for future stud-
ies that examine how abnormalities in structural and functional
connectivity play a role in neurological and psychiatric disorders.
The HCP will use a battery of reliable and well-validated meas-
ures that assess a wide range of human functions, including cogni-
tion, emotion, motor and sensory processes, and personality. The
core of this battery will be from the NIH Toolbox for Assessment
of Neurological and Behavioral function4. This will enable federa-
tion of HCP data with other large-scale efforts to acquire neu-
roimaging and behavioral data and will facilitate comparison of
brain-behavior relationships across studies (Gershon et al., 2010).
Additional tests that are currently being piloted will be drawn from
Blood samples collected from each subject during their visit
will be sent to the Rutgers University Cell and DNA Repository
(RUCDR), where cell lines will be created and DNA will be
extracted. Genetic analysis will be conducted in early 2015, after
all Phase II subjects have completed in-person testing. Performing
the genotyping in the later stages of the project will allow the
HCP to take advantage of future developments in this rapidly
advancing ﬁeld, including the availability of new sequencing
technologies and decreased costs of whole-genome sequencing.
Genetic data and de-identiﬁed demographic and phenotype data
will be entered into the dbGAP database in accordance with NIH
data-sharing policies. Summary data look-up by genotype will be
possible via ConnectomeDB.
The collection of this broad range of data types from multiple
family groups will necessitate careful coordination of the various
tests during in-person visits. Figure 1 illustrates the data collection
workﬂow planned for the high-throughput phase of the HCP. All
1,200 subjects in the main cohort will be scanned at Washington
University on the dedicated 3T scanner. A subset of 200 subjects
(100 same-sex twin pairs, 50% monozygotic) will also be scanned
at University of Minnesota using 7T MRI (HARDI, R-fMRI, and
T-fMRI) and possibly also 10.5 T. Another subset of 100 (50
same-sex twin pairs, all monozygotic) will be scanned at St. Louis
University (SLU) using MEG/EEG. Many data management and
quality control (QC) steps will be taken to maximize the quality
and reliability of these datasets (see Data Workﬂow and Quality
THE HCP INFORMATICS APPROACH
Our HCP informatics approach includes components related to
data support and visualization. The Section “Data Support” dis-
cusses key data types and representations plus aspects of data pro-
cessing pipelines that have major informatics implications. This
leads to a discussion of ConnectomeDB and the computational
resources and infrastructure needed to support it, as well as our
data-sharing plans. The Section “Visualization” describes CWB
and its interoperability with ConnectomeDB. These sections also
include examples of potential exploratory uses of HCP data.
expected to provide improved signal-to-noise ratio and better
spatial resolution, but is less well established for routine, high-
throughput studies. Some subjects may also be scanned on a
10.5 T scanner currently under development at the University
of Minnesota. Having higher-field scans of individuals also
scanned at 3T will let us use the higher-resolution data to con-
strain and better interpret the 3T data.
Each subject will have multiple MR scans, including HARDI,
R-fMRI (Resting-state fMRI), T-fMRI (task-evoked fMRI), and
standard T1-weighted and T2-weighted anatomical scans. Advances
in pulse sequences are planned in order to obtain the highest reso-
lution and quality of data possible in a reasonable period of time.
Already, new pulse sequences have been developed that accelerate
image acquisition time (TR) by sevenfold while maintaining or
even improving the signal-to-noise ratio (Feinberg et al., 2010).
The faster temporal resolution for both R-fMRI and T-fMRI made
possible by these advances will increase the amount of data acquired
for each subject and increase the HCP data storage requirements, a
point that exempliﬁes the many interdependencies among various
HCP project components.
Task-fMRI scans will include a range of tasks aimed at providing
broad coverage of the brain and identifying as many functionally
distinct parcels as possible. The results will aid in validating and
interpreting the results of the connectivity analyses obtained using
resting-state fMRI and diffusion imaging. These “functional local-
izer” tasks will include measures of primary sensory processes (e.g.,
vision, motor function) and a wide range of cognitive and affective
processes, including stimulus category representations, working
memory, episodic memory, language processing, emotion process-
ing, decision-making, reward processing and social cognition. The
speciﬁc tasks to be included are currently being piloted; ﬁnal task
selection will be based on multiple criteria, including sensitivity,
reliability and brain coverage.
A subset of 100 subjects will also be studied with combined
MEG/EEG, which provides vastly better temporal resolution
(milliseconds instead of seconds) but lower spatial resolution than
MR (between 1 and 4 cm). Mapping MEG/EEG data to cortical
sources will enable electrical activity patterns among neural popula-
tions to be characterized as functions of both time and frequency.
As with the fMRI, MEG/EEG will include both resting-state and
task-evoked acquisitions. The behavioral tasks will be a matched
subset of the tasks used in fMRI. The MEG/EEG scans will
be acquired at St. Louis University using a Magnes 3600 MEG
(4DNeuroimaging, San Diego, CA, USA) with 248 magnetometers,
23 MEG reference channels (5 gradiometers and 18 magnetometers),
and 64 EEG voltage channels. These data will be analyzed in
both sensor space and source space, using state-of-the-art source
localization methods (Wipf and Nagarajan, 2009; Ou et al., 2010) and
subject-specific head models derived from anatomic MRI. Analyses
of band-limited power (BLP) will provide measures that reﬂect the
frequency-dependent dynamics of resting and task-evoked brain
activity (de Pasquale et al., 2010; Scheeringa et al., 2011).
BEHAVIORAL, GENETIC, AND OTHER NON-IMAGING MEASURES
Measuring behavior in conjunction with mapping of structural and
functional networks in HCP subjects will enable the analysis of the
functional correlates of variations in “typical” brain connectivity
4www.nihtoolbox.org
or time-series values will be stored in the binary portion of the
NIFTI-2 format. Datasets whose brainordinates include both vox-
els and surface vertices pose special metadata requirements that
are being addressed for the HCP and for other software platforms
by a “CIFTI” working group (with “C” indicating connectivity).
A description of CIFTI data types including example ﬁle formats
has been reviewed by domain experts and is available for pub-
lic comment6. CIFTI ﬁle formats will support metadata that map
matrix rows and columns to brainordinates, parcels (see below),
and/or time points, in conformance with NIFTI conventions for
Individuals, atlases, and registration. The anatomical substrates
on which HCP data are analyzed and visualized will include individ-
ual subjects as well as atlases. In general, quantitative comparisons
across multiple subjects require registering data from individuals
to an atlas. Maximizing the quality of inter-subject registration
(alignment) is a high priority but also a major challenge. This is
especially the case for cerebral cortex, owing to the complexity
and variability of its convolutions. Several registration methods
and atlases are under consideration for the HCP, including popu-
lation-average volumes and population-average cortical surfaces
based on registration of surface features. Major improvements in
inter-subject alignment may be attainable by invoking constraints
related to function, architecture, and connectivity, especially for
cerebral cortex (e.g., Petrovic et al., 2007; Sabuncu et al., 2010). This
is important for the HCP informatics effort, insofar as improved
atlas representations that emerge in Phase II may warrant support
by the HCP.
Parcellations. The brain can be subdivided into many subcorti-
cal nuclei and cortical areas (“parcels”), each sharing common
characteristics based on architectonics, connectivity, topographic
organization, and/or function. Expression of connectivity data
as a matrix of connection weights between parcels will enable
data to be stored very compactly and transmitted rapidly. Also,
Volumes, surfaces, and representations. MR images are acquired
in a 3-D space of regularly spaced voxels, but the geometric rep-
resentations useful for subsequent processing depend upon brain
structure. Subcortical structures are best processed in standard
volumetric (voxel) coordinates. The complex convolutions of the
cortical sheet make it advantageous for many purposes to model
the cortex using explicit surface representations – a set of vertices
topologically linked into a 2D mesh for each hemisphere. However,
for other purposes it remains useful to analyze and visualize cortical
structures in volume space. Hence, the HCP will support both volu-
metric and surface representations for analysis and visualization.
For some connectivity data types, it is useful to represent sub-
cortical volumetric coordinates and cortical surface vertices in a
single ﬁle. This motivates introduction of a geometry-independent
terminology. Speciﬁcally, a “brainordinate” (brain coordinate) is a
spatial location within the brain that can be either a voxel (i, j, k
integer values) or a surface vertex (x, y, z real-valued coordinates
and a “node number”); a “grayordinate” is a voxel or vertex within
gray matter (cortical or subcortical); a “whiteordinate” is a voxel
within white matter or a vertex on the white matter surface. These
terms (brainordinate, grayordinate, and whiteordinate) are espe-
cially useful in relation to the CIFTI data ﬁles described in the
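The geometry-independent terminology introduced here can be sketched as a small data structure. The class and field names below are hypothetical and chosen only for illustration; they are not part of any HCP or CIFTI specification. The sketch shows how a single collection can hold both voxels (integer i, j, k indices) and surface vertices (real-valued x, y, z coordinates plus a node number), with gray and white matter distinguished by a tissue label.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Voxel:
    """A volumetric location given by integer (i, j, k) indices."""
    ijk: Tuple[int, int, int]


@dataclass(frozen=True)
class SurfaceVertex:
    """A surface location: real-valued (x, y, z) plus a node number."""
    xyz: Tuple[float, float, float]
    node: int


@dataclass(frozen=True)
class Brainordinate:
    """A voxel or a surface vertex, tagged with its tissue class."""
    location: Union[Voxel, SurfaceVertex]
    tissue: str  # "gray" or "white" (labels chosen for this sketch)

    def is_grayordinate(self) -> bool:
        return self.tissue == "gray"

    def is_whiteordinate(self) -> bool:
        return self.tissue == "white"


# A single list can mix subcortical voxels and cortical vertices.
brainordinates = [
    Brainordinate(Voxel((45, 60, 32)), "gray"),
    Brainordinate(SurfaceVertex((-34.2, 12.7, 48.1), node=10234), "gray"),
    Brainordinate(Voxel((50, 50, 40)), "white"),
]
grayordinates = [b for b in brainordinates if b.is_grayordinate()]
print(len(grayordinates))
```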
When feasible, the HCP will use standard NIFTI-1 (volumetric)
and GIFTI (surfaces) formats. Primary diffusion imaging data will
be stored using the MiND format recently developed by Patel et al.
(2010). By conforming to these existing formats, datasets gener-
ated using one software platform can be read by other platforms
without the need to invoke ﬁle conversion utilities. Several types
of connectivity-related data will exceed the size limits supported
by NIFTI-1 and GIFTI and will instead use the recently adopted
NIFTI-2 format5. NIFTI-2 is similar to NIFTI-1, but has dimension
indices increased from 16-bit to 64-bit integers, which will be use-
ful for multiple purposes and platforms. For the HCP, connectivity
FIGURE 1 | HCP subject workﬂow.
5http://www.nitrc.org/forum/message.php?msg_id=3738
6http://www.nitrc.org/projects/cifti
data model includes a standard experiment hierarchy, including
projects, subjects, visits, and experiments. On top of this basic
hierarchy, speciﬁc data type extensions can be added to represent
speciﬁc data, including imaging modalities, derived imaging meas-
ures, behavioral tests, and genetics information. The Data Service
provides mechanisms for incorporating these extensions into the
XNAT infrastructure, including the database backend, middleware
data access objects, and frontend reports and data entry forms.
Finally, the Search Service allows complex queries to be executed
on the database.
All of XNAT’s services are accessible via an open web services API
that follows the REpresentational State Transfer (REST) approach
(Fielding, 2000). By utilizing the richness of the HTTP protocol,
REST web services allow requests between client and server to be
speciﬁed using browser-like URLs. The REST API provides speciﬁc
URLs to create, access, and modify every resource under XNAT’s
management. The URL structures follow the organizational hier-
archy of XNAT data, making it intuitive to navigate the API either
manually (rarely) or programmatically. HCP will use this API for
interactions between ConnectomeDB and CWB, for moving data
into and out of processing pipelines, and as a conduit between
external software applications and HCP datasets. External libraries
and tools that can interact with the XNAT API include pyxnat – a
Python library for interfacing with XNAT repositories7; 3D Slicer –
an advanced image visualization and analysis environment8; and
LONI Pipeline – a GUI-based pipelining environment9.
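As a rough illustration of how URL structures can mirror the organizational hierarchy of the data, the sketch below assembles resource URLs for a project, a subject within it, and an experiment within that subject. The host name, project name, and subject and experiment labels are hypothetical, and the path layout (projects, then subjects, then experiments) is an assumption modeled on the hierarchy described above rather than a statement of the exact ConnectomeDB routes. The code only builds URL strings and makes no network requests.

```python
from urllib.parse import quote

BASE_URL = "https://connectomedb.example.org/data"  # hypothetical host


def resource_url(project, subject=None, experiment=None):
    """Build a hierarchical resource URL: project -> subject -> experiment."""
    if experiment is not None and subject is None:
        raise ValueError("an experiment must be addressed under a subject")
    parts = ["projects", project]
    if subject is not None:
        parts += ["subjects", subject]
    if experiment is not None:
        parts += ["experiments", experiment]
    # Percent-encode each path segment so labels cannot break the URL.
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


print(resource_url("HCP_Phase2"))
print(resource_url("HCP_Phase2", "subj001"))
print(resource_url("HCP_Phase2", "subj001", "subj001_3T"))
```

Because each level of the hierarchy appears as a path segment, a client that knows a subject's URL can derive the URLs of that subject's experiments without consulting a separate lookup service.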
API extensions. The HCP is developing additional services to sup-
port connectome-related queries. A primary initial focus is on
a service that enables spatial queries on connectivity measures.
This service will calculate and return a connectivity map or a
task-evoked activation map based on speciﬁed spatial, subject, and
calculation parameters. The spatial parameter will allow queries to
specify the spatial domain to include in the calculation. Examples
include a single brainordinate (see above), a cortical or subcorti-
cal parcel, or some other region of interest (collection of brain-
ordinates). This type of search will beneﬁt from registering each
subject’s data onto a standard surface mesh and subcortical atlas
parcellation. The subject parameter will allow queries to specify the
subject or subject groups to include in the calculation. Examples
include an individual subject ID, one or more lists of subject IDs,
subject characteristics (e.g., subjects with IQ > 120, subjects with
a particular genotype at a particular genetic locus), and contrasts
(e.g., subjects with IQ > 110 vs. subjects with IQ < 90). Finally,
the calculation parameter will allow queries to specify the speciﬁc
connectivity or task-evoked activation measure to calculate and
return. Basic connectivity measures will include those based on
resting-state fMRI (functional connectivity) and diffusion imag-
ing (structural connectivity). Depending on the included subject
parameter, the output connectivity measure might be the indi-
vidual connectivity maps for a speciﬁc subject, the average map
for a group of subjects, or the average difference map between
two groups. When needed, the requested connectivity information
graph-theoretic network analyses (see below) will be more tractable
and biologically meaningful on parcellated data. However, this will
place a premium on the ﬁdelity of the parcellation schemes. Data
from the HCP should greatly improve the accuracy with which the
brain can be subdivided, but over a time frame that will extend
throughout Phase II. Hence, just as for atlases, improved parcel-
lations that emerge in Phase II may warrant support by the HCP.
Networks and modularity. Brain parcels can often be grouped
into spatially distributed networks and subnetworks that subserve
distinct functions. These can be analyzed using graph-theoretic
approaches that model networks as nodes connected by edges
(Sporns, 2010). In the context of HCP, graph nodes can be brain-
ordinates or parcels, and edges can be R-fMRI correlations (full
correlations or various types of partial correlations), tractography-
based estimates of connection probability or strength, or other
measures of relationships between the nodes. The HCP will use
several categories of network-related measures, including meas-
ures of segregation such as clustering and modularity (Newman,
2006); measures of integration, including path length and global
efﬁciency; and measures of inﬂuence to identify subsets of nodes
and edges central to the network architecture such as hubs or
bridges (Rubinov and Sporns, 2010).
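Two of the measures named above, global efficiency (integration) and the clustering coefficient (segregation), can be sketched on a toy network. The example assumes an unweighted, undirected graph, as would result from thresholding a connectivity matrix; weighted variants of these measures exist but are not shown. The six-node graph is invented for illustration, and this is a minimal sketch rather than the analysis code the HCP will use.

```python
from collections import deque
from itertools import combinations


def shortest_path_lengths(graph, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def global_efficiency(graph):
    """Mean inverse shortest-path length over all ordered node pairs."""
    n = len(graph)
    total = 0.0
    for src in graph:
        dist = shortest_path_lengths(graph, src)
        total += sum(1.0 / d for node, d in dist.items() if node != src)
    return total / (n * (n - 1))


def clustering(graph, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return links / (k * (k - 1) / 2)


# Toy undirected graph: triangles (A, B, C) and (D, E, F) joined by edge C-D.
graph = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
print(global_efficiency(graph))
print({node: clustering(graph, node) for node in sorted(graph)})
```

In this toy network, nodes C and D act as a bridge between two tightly clustered modules, so their clustering coefficients are lower than those of the other nodes even though they are central to communication between modules.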
Processing pipelines and analysis streams. Generation of the
various data types for each of the major imaging modalities will
require extensive processing and analysis. Each analysis stream
needs to be carried out in a systematic and well-documented way.
For each modality, a goal is to settle on customized processing
streams that yield the highest-quality and most informative types
of data. During Phase I, this will include systematic evaluation of
different pipelines and analysis strategies applied to the same sets
of preliminary data. Minimally processed versions for each data
modality will also remain available, which will enable investigators
to explore alternative processing and analysis approaches.
XNAT foundation. ConnectomeDB is being developed as a custom-
ized version of the XNAT imaging informatics platform (Marcus
et al., 2007). XNAT is a highly extensible, open source system for
receiving, archiving, managing, processing, and sharing both imag-
ing and non-imaging study data. XNAT includes ﬁve services that
are critical for ConnectomeDB operations. The DICOM Service
receives and stores data from DICOM devices (scanners or gate-
ways), imports relevant metadata from DICOM tags to the data-
base, anonymizes sensitive information in the DICOM ﬁles, and
converts the images to NIFTI formatted ﬁles. The Pipeline Service
for deﬁning and executing automated and semi-automated image
processing procedures allows computationally intensive process-
ing and analysis jobs to be ofﬂoaded to compute clusters while
managing, monitoring and reporting on the execution status of
these jobs through its application interface. The Quality Control
Service enables both manual and automated review of images and
subsequent markup of speciﬁc characteristics (e.g., motion arti-
facts, head positioning, signal to noise ratio) and overall usability
of individual scans and full imaging sessions. The Data Service
allows study data to be incorporated into the database. The default
of dynamic user interaction, and portability across client systems
(browsers, desktop applications, mobile devices). The interface will
include two main tracks. The Download track emphasizes rapid
identiﬁcation of data of interest and subsequent download. The
most straightforward downloads will be pre-packaged bundles,
containing high interest content from each quarterly data release
(see Data-Sharing below). Alternatively, browsing and search
interfaces will allow users to select individual subjects and subject
groups by one or more demographic, genetic, or behavioral
criteria. The Visualization & Discovery track will include an embed-
ded version of CWB, which will allow users to explore connectivity
data on a rendered 3D surface (see Visualization below). Using a
faceted search interface, users will build subject groups that are
dynamically rendered by CWB.
The HCP informatics platform will support high-throughput data
collection and open-access data-sharing. Data collection require-
ments include uploading acquired data from multiple devices
and study sites, enforcing rigorous QC procedures, and executing
standardized image processing. Data-sharing requirements include
supporting world-wide download of very large data sets and high
volumes of API service requests. The overall computing and data-
base strategy for supporting these requirements is illustrated in
Figure 3 and detailed below.
Computing infrastructure. The HCP computing infrastructure
(Table 1) includes two complementary systems, an elastically
expandable virtual cluster and a high performance computing system
(HPCS). The virtual cluster has a pool of general purpose servers
managed by VMware ESXi. Specific virtual machines (VMs)
for web servers, database servers, and compute nodes are allocated
from the VMware cluster and can be dynamically provisioned to
(e.g., average difference maps) will be dynamically generated. Task-evoked
activation measures will include key contrasts for each
task and options to view activation maps for a particular task
in a specific subject, the average map for a group of subjects, or
a comparison of two groups.
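The subject and calculation parameters of the spatial query service can be sketched with a few lines of code. The subject records, IQ values, and map values below are invented, and the function names are hypothetical; the sketch only illustrates how a subject criterion selects a group and how an average map or an average difference map could then be computed from per-subject maps.

```python
from statistics import mean

# Hypothetical per-subject records: an IQ score and a connectivity map
# (one value per brainordinate; three brainordinates for brevity).
subjects = {
    "s01": {"iq": 125, "map": [0.50, 0.20, 0.10]},
    "s02": {"iq": 115, "map": [0.60, 0.30, 0.20]},
    "s03": {"iq": 85, "map": [0.30, 0.10, 0.40]},
    "s04": {"iq": 80, "map": [0.20, 0.20, 0.30]},
}


def select(predicate):
    """Return IDs of subjects whose record satisfies the predicate."""
    return [sid for sid, rec in subjects.items() if predicate(rec)]


def average_map(subject_ids):
    """Average the maps of the given subjects, brainordinate by brainordinate."""
    maps = [subjects[sid]["map"] for sid in subject_ids]
    return [mean(values) for values in zip(*maps)]


def difference_map(group_a, group_b):
    """Average map of group A minus average map of group B."""
    return [a - b for a, b in zip(average_map(group_a), average_map(group_b))]


high = select(lambda rec: rec["iq"] > 110)
low = select(lambda rec: rec["iq"] < 90)
print(average_map(high))
print(difference_map(high, low))
```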
Importantly, connectivity results will be accessible either as dense
connectivity maps, which will have ﬁne spatial resolution but will
be slower to compute and transmit, or as parcellated connectivity
maps, which will be faster to process and in some situations may be
pre-computed. Additional features that are planned include options
to access time courses for R-fMRI data, ﬁber trajectories for structural
connectivity data, and individual subject design ﬁles and time courses
for T-fMRI data. Other approaches such as regression analysis will
also be supported. For example, this may include options to deter-
mine the correlation between features of particular pathways or net-
works and particular behavioral measures (e.g., working memory).
When a spatial query is submitted, ConnectomeDB will parse
the parameters, search the database to identify the appropriate
subjects, retrieve the necessary ﬁles from its ﬁle store, and then
execute the necessary calculations. By executing these queries on
the database server and its associated computing cluster, only the
ﬁnal connectivity or activation map will need to be transferred back
to the user. While this approach increases the computing demands
on the HCP infrastructure, it will dramatically reduce the amount
of data that needs to be transferred over the network. CWB will be
a primary consumer of this service, but as with all services in the
ConnectomeDB API, it will be accessible to other external clients,
including other visualization environments and related databases.
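The server-side computation described above can be sketched as follows. This is a toy illustration under stated assumptions, not the ConnectomeDB implementation: the in-memory `STORE` stands in for the file store, the time series are tiny invented parcellated series, and the maps are combined by a plain arithmetic mean of correlation coefficients. Only the final averaged map would be returned to the client.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (assumes non-zero variance)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def seed_map(timeseries: List[List[float]], seed: int) -> List[float]:
    """Correlate the seed parcel's time course with every parcel."""
    return [pearson(timeseries[seed], ts) for ts in timeseries]


def group_average_map(
    store: Dict[str, List[List[float]]], subject_ids: List[str], seed: int
) -> List[float]:
    """Compute one seed map per subject and average them parcel by parcel."""
    maps = [seed_map(store[sid], seed) for sid in subject_ids]
    n_parcels = len(maps[0])
    return [sum(m[p] for m in maps) / len(maps) for p in range(n_parcels)]


# Hypothetical store: subject ID -> parcels x timepoints.
STORE = {
    "S01": [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]],
    "S02": [[1, 3, 2, 4], [1, 3, 2, 4], [4, 2, 3, 1]],
}

result = group_average_map(STORE, ["S01", "S02"], seed=0)
print([round(v, 3) for v in result])
```

The payload sent back scales with the number of parcels rather than with the number of subjects or timepoints, which is the point of executing the query next to the data.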
User interface. The ConnectomeDB UI is being custom devel-
Figure 2). Building on advanced web technologies has several
advantages, including streamlined access to remote data, high levels
FIGURE 2 | The Connectome UI. (Left) This mockup of the Visualization &
Discovery track illustrates key concepts that are being implemented,
including a faceted search interface to construct subject groups and an
embedded version of Connectome Workbench. Both the search interface
and Workbench view are fed by ConnectomeDB’s open API. (Right) This
mockup of the Download track illustrates the track’s emphasis on guiding
users quickly to standard download packages and navigation to
Marcus et al. HCP informatics
Frontiers in Neuroinformatics www.frontiersin.org June 2011 | Volume 5 | Article 4 | 6
HCP’s capacity during peak load. During extremely high load, we
may also utilize commercial cloud computing services to elastically
expand the cluster’s computing capacity.
To support the project’s most demanding processing streams,
we have partnered with the WU Center for High Performance
Computing (CHPC), which operates an IBM HPCS that com-
menced operating in 2010. Pipelines developed for the HCP greatly
match changing load conditions. Construction of the VMs is man-
aged by Puppet (Puppet Labs), a systems management platform that
enables IT staff to manage and deploy standard system configura-
tions. The initial Phase 1 cluster includes four 6-core physical CPUs
that will be expanded in project years 3 and 5. We will partner with
the WU Neuroimaging Informatics and Analysis Center (NIAC),
which runs a similar virtual cluster, to dynamically expand the
Table 1 | The HCP computing infrastructure.
Component | Device | Notes
Virtual cluster | 2 Dell PowerEdge R610s managed by VMware | Additional nodes will be added in years 3 and 5. Dynamically expandable using NIAC cluster.
Web servers | VMs running Tomcat 6.0.29 and XNAT 1.5 | Load-balanced web servers host XNAT system and handle all API requests. Monitored by Pingdom and Google Analytics.
Database servers | VMs running Postgres 9.0.3 | Postgres 9 is run in synchronous multi-master replication mode, enabling high availability and load balancing.
Compute cluster | VMs running Sun Grid Engine-based queuing | Executes pipelines and on-the-fly computations that require short latencies.
Data storage | Scale-out NAS (Vendor TBD) | Planned 1 PB capacity will include tiered storage pools and 10 Gb connectivity to cluster and HPCS.
Load balancing | Kemp Technologies LoadMaster 2600 | Distributes web traffic across multiple servers and provides hardware-accelerated SSL encryption.
HPCS | IBM system in WU's CHPC | The HPC will execute computationally intensive processing including "standard" pipelines and user-submitted jobs.
DICOM gateway | Shuttle XS35-704 Intel Atom D510 | The gateway uses CTP to manage secure transmission of scans from UMinn scanner to ConnectomeDB.
Partner institutions, cloud computing | | Mirror data sites will ease bottlenecks during peak traffic periods. Elastic computing strategies will automatically detect stress on compute cluster and recruit additional resources.
The web servers, database servers, and compute cluster are jointly managed as a single VMware ESXi cluster for efﬁcient resource utilization and high availability.
The underlying servers each include 48-GB memory and dual 6-core processors. Each node in the VMware cluster is redundantly tied back in to the storage system
for VM storage. All nodes run 64-bit CentOS 5.5. The HPCS includes an iDataPlex cluster (168 nodes with dual quad core Nehalem processors and 24-GB RAM), an
e1350 cluster (7 SMP servers, each with 64 cores and 256-GB RAM), a 288-port Qlogic Inﬁniband switch to interconnect all processors and storage nodes, and 9 TB
of high-speed storage. Connectivity to the system is provided by a 4 × 10 Gb research network backbone.
FIGURE 3 | ConnectomeDB architecture, including data transfer
components. ConnectomeDB will utilize the Tomcat servlet container as
the application server and use the enterprise grade, open source
PostgreSQL database for storage of non-imaging data, imaging session
meta-data, and system data. Actual images and other binary content are
stored on a ﬁle system rather than in the database, improving performance
and making the data more easily consumable by external software
review, they will be de-identiﬁed, including removal of sensitive
ﬁelds from the DICOM headers and obscuring facial features in
the high-resolution anatomic scans, transferred to a public-facing
database, and shared with the public according to the data-sharing
plan described below. All processing and analysis pipelines will be
executed on the public-facing system so that these operations are
performed on de-identiﬁed data only.
MRI data acquired at Washington University will be uploaded
directly from the scanner to ConnectomeDB over the DICOM
protocol on a secure private network. MRI data acquired at the
University of Minnesota will be sent from the scanners to an on-site
DICOM gateway conﬁgured with RSNA’s Clinical Trial Processor
(CTP) software. The CTP appliance will receive the data over
the DICOM protocol, which is non-encrypted, and relay it to
ConnectomeDB over the secure HTTPS protocol. Once the data
have been uploaded, several actions will be triggered. First, XNAT’s
DICOM service will import metadata from the DICOM header
fields into the database and place the files into its file repository.
Next, a notiﬁcation will be sent to HCP imaging staff to complete
manual inspection of the data. Finally, a series of pipelines will
be executed to generate sequence-speciﬁc automated QC metrics
with ﬂags to the HCP imaging staff regarding problematic data,
and to validate metadata ﬁelds for protocol compliance. We aim to
complete both manual and automated QA within 1 h of acquisi-
tion, which will enable re-scanning of individuals while they are
MEG/EEG data will be uploaded to ConnectomeDB via a dedi-
cated web form in native 4D format that will ensure de-identifi-
cation and secure transport via HTTPS. QC procedures will ensure
proper linkage to other information via study-specific subject IDs.
EEG data will be converted to European Data Format (EDF)10 while
MEG data will remain in source format.
Demographic and behavioral data will be entered into
ConnectomeDB, either through import mechanisms or direct data
entry. Most of the behavioral data will be acquired on the NIH
Toolbox testing system, which includes its own database. Scripts
are being developed to extract the test results from the Toolbox
database and upload them into ConnectomeDB via XML docu-
ments. Additional connectome-speciﬁc forms will be developed
for direct web-based entry into ConnectomeDB, via desktop or
Quality control. Initial QC of imaging data will be performed
by the technician during acquisition of the data by reviewing the
images at the scanner console. Obviously flawed data will be imme-
diately reacquired within the scan session. Once imaging studies
have been uploaded to the internal ConnectomeDB, several QC
and pre-processing procedures will be triggered and are expected
to be completed within an hour, as discussed above. First, the scans
will be manually inspected in more detail by trained technicians.
The manual review process will use a procedure similar to that
used by the Alzheimer’s Disease Neuroimaging Initiative, which
includes evaluation of head positioning, susceptibility artifacts,
motion, and other acquisition anomalies along a 4-point scale
(Jack et al., 2008). Speciﬁc extensions will be implemented for
beneﬁt from the ability to run in parallel across subjects and take
advantage of the vast amount of memory available in the HPCS
nodes. Already, several neuroimaging packages including FreeSurfer,
FSL, and Caret have been installed on the platform and are in active
use by the HCP. The system utilizes a MOAB/TORQUE scheduling
system that manages job priority. While the CHPC’s HPCS is a
shared resource openly available to the University’s research com-
munity, the HCP will have assured priority on the system to ensure
that the project has sufﬁcient resources to achieve its goals.
The two HCP computing systems are complementary in that the
virtual cluster provides rapid response times and can be dynami-
cally expanded to match load. The HPCS, on the other hand, has
large computing power but is a shared resource that queues jobs.
The virtual cluster is therefore best for on-the-ﬂy computing, such
as is required to support web services, while the HPCS is best for
computationally intensive pipelines that are less time sensitive.
The total volume of data produced by the HCP will likely be
multiple petabytes (1 petabyte = 1,000,000 gigabytes). We are cur-
rently evaluating data storage solutions that handle data at this scale
to determine the best price/performance ratio for the HCP. Based
on preliminary analyses, we are expecting to deploy 1 PB of stor-
age, which will require signiﬁcant compromises in deciding which
of the many data types generated will be preserved. Datasets to be
stored permanently will include primary data plus the outputs of
key pre-processing and analysis stages. These will be selected on the
basis of their expected utility to the community and on the time
that would be needed to recompute or regenerate intermediate
A driving consideration in selecting a storage solution is close
integration with the HPCS. Four 10-Gb network connections
between the two systems will enable high-speed data transmis-
sion, which will put serious strain on the storage device. Given
these connections and the HPCS’s architecture, at peak usage, the
storage system will need to be able to sustain up to 200,000 input/
output operations per second, a benchmark achievable by a number
of available scale-out NAS (Network Attached Storage) systems. To
meet this benchmark, we expect to design a system that includes
tiered storage pools with dynamic migration between tiers.
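The figures quoted above can be related to one another with some back-of-the-envelope arithmetic. The sketch below uses the numbers from the text (four 10-Gb links, 200,000 operations per second, 1 PB of storage) and adds assumptions of our own: decimal units, fully saturated links, and no protocol overhead. Under those assumptions the links carry 5 GB/s in aggregate, which at 200,000 operations per second corresponds to a mean of 25 kB per operation, and moving a full petabyte across the links would take a little over two days.

```python
# Figures taken from the text.
LINK_COUNT = 4
LINK_GBIT_PER_S = 10
TARGET_IOPS = 200_000
CAPACITY_BYTES = 10**15  # 1 PB in decimal units (an assumption)


def aggregate_bytes_per_second(link_count: int, gbit_per_s: int) -> float:
    """Aggregate bandwidth in bytes/s, assuming saturated links and no overhead."""
    return link_count * gbit_per_s * 10**9 / 8


def mean_bytes_per_operation(bytes_per_s: float, iops: int) -> float:
    """Mean I/O size at which the given IOPS rate would fill the links."""
    return bytes_per_s / iops


def days_to_transfer(capacity_bytes: int, bytes_per_s: float) -> float:
    """Days needed to move the full capacity at the aggregate bandwidth."""
    return capacity_bytes / bytes_per_s / 86_400


bandwidth = aggregate_bytes_per_second(LINK_COUNT, LINK_GBIT_PER_S)
io_size = mean_bytes_per_operation(bandwidth, TARGET_IOPS)
days = days_to_transfer(CAPACITY_BYTES, bandwidth)

print(f"aggregate bandwidth: {bandwidth / 10**9:.1f} GB/s")
print(f"mean I/O size at target IOPS: {io_size / 1000:.1f} kB")
print(f"time to move 1 PB: {days:.2f} days")
```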
In addition to this core storage system, we are also planning for
backup, disaster recovery, and mirror sites. Given the scale of the
data, it will be impossible to back up all of the data, so we will prior-
itize data that could not be regenerated, including the raw acquired
data and processed data that requires signiﬁcant computing time.
We will utilize both near-line backups for highest priority data and
offsite storage for catastrophic disaster recovery. As described below,
our data-sharing plan includes quarterly data releases throughout
Phase 2. To reduce bottlenecks during peak periods after these
releases, we aim to mirror the current release on academic partner
sites and commercial cloud systems. We are also exploring distri-
bution through the BitTorrent model (Langille and Eisen, 2010).
Data workflow. All data acquired within the HCP will be uploaded
or entered directly into ConnectomeDB. ConnectomeDB itself
includes two separate database systems. Initially, data are entered
into an internal-facing system that is accessible only to a small
group of HCP operations staff who are responsible for review-
ing data quality and project workflow. Once data pass quality
10 http://www.edfplus.info/
BOLD and diffusion imaging. Second, automated programs will
be run to assess image quality. Speciﬁc quality metrics are cur-
rently being developed for each of the HCP imaging modalities
and behavioral paradigms. The resulting metrics will be com-
pared with the distribution of values from previous acquisitions
to determine whether each is within an expected range. During
the initial months of data acquisition, the number of HCP scans
contributing to these norm values will be limited, so we will seed
the database with values extracted from data obtained in similar
studies and during the pilot phase. As the study database expands,
more sophisticated approaches will become available, including
metrics specific to individual fMRI tasks (which may vary in the
amount of head motion). Speciﬁc QC criteria for each metric will
be developed during Phase I.
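One simple way to compare a new metric with the distribution of previous values is a z-score test against a running reference set that is seeded with pilot data. The sketch below is an assumption-laden illustration: the text does not specify the statistical criterion, so the z-score rule, the limit of 3, the `MetricNorms` class, and the example motion values are all hypothetical.

```python
import statistics
from typing import List


class MetricNorms:
    """Running reference distribution for a single QC metric."""

    def __init__(self, seed_values: List[float]) -> None:
        # Seed values stand in for pilot-phase or similar-study data.
        # At least two distinct values are needed for a usable standard deviation.
        self.values = list(seed_values)

    def add(self, value: float) -> None:
        """Add an accepted acquisition to the reference distribution."""
        self.values.append(value)

    def z_score(self, value: float) -> float:
        mean = statistics.mean(self.values)
        sd = statistics.stdev(self.values)
        return (value - mean) / sd

    def in_range(self, value: float, z_limit: float = 3.0) -> bool:
        """True if the value lies within z_limit standard deviations of the mean."""
        return abs(self.z_score(value)) <= z_limit


# Hypothetical head-motion summaries (arbitrary units) from pilot scans.
norms = MetricNorms([0.10, 0.12, 0.11, 0.09, 0.13])

for new_value in (0.12, 0.30):
    flagged = not norms.in_range(new_value)
    print(new_value, "flag for review" if flagged else "ok")
    if not flagged:
        norms.add(new_value)
```

As more scans pass review, the reference set grows and the seed values carry progressively less weight.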
Data quality will be recorded in the database at the imaging
session level and for each scan within the session. The database
will include a binary pass/fail determination as well as ﬁelds for the
aforementioned manual review criteria and the automated numeric
QC metrics. Given the complexity and volume of image data being
acquired in the HCP protocol, we anticipate that individual scans
within each imaging visit will vary in quality. A single fMRI run, for
example, might include an unacceptable level of motion, whereas
other scans for that subject are acceptable in quality. In such cases,
data re-acquisition is unlikely. The appropriate strategy for han-
dling missing datasets will be dependent on exactly which data
Pipeline execution. The various processing streams described
above are complex and computationally demanding. In order to
ensure that they are run consistently and efﬁciently across all sub-
jects, we will utilize XNAT’s pipeline service to execute and monitor
the processing. XNAT’s pipeline approach uses XML documents
to formally deﬁne the sequence of steps in a processing stream,
including the executable, execution parameters, and input data.
As a pipeline executes, the pipeline service monitors its execution
and updates its status in the database. When a pipeline exits, noti-
ﬁcations will be sent to HCP staff to review the results, following
pipeline-speciﬁc QC procedures similar to those used to review
the raw data. Pipelines that require short latency (such as those
associated with initial QC) will be executed on the HCP cluster,
while those that are more computationally demanding but less time
sensitive will be executed on the HPCS.
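The idea of defining a processing stream as an XML document and routing it by latency can be sketched as follows. The XML layout here is a simplified, hypothetical schema invented for illustration; it is not XNAT's actual pipeline descriptor format, and the executable names, the `latency` attribute, and the target labels are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical pipeline definition; not the real XNAT pipeline schema.
PIPELINE_XML = """
<pipeline name="example_qc" latency="short">
  <step id="1" executable="compute_motion">
    <param name="input">scan.nii.gz</param>
    <param name="threshold">0.5</param>
  </step>
  <step id="2" executable="write_report">
    <param name="format">xml</param>
  </step>
</pipeline>
"""


def load_pipeline(xml_text: str) -> Dict[str, object]:
    """Parse the definition into an ordered list of command lines."""
    root = ET.fromstring(xml_text.strip())
    steps: List[List[str]] = []
    for step in root.findall("step"):
        args = [f"--{p.get('name')}={p.text}" for p in step.findall("param")]
        steps.append([step.get("executable")] + args)
    return {"name": root.get("name"), "latency": root.get("latency"), "steps": steps}


def choose_target(pipeline: Dict[str, object]) -> str:
    """Short-latency pipelines go to the virtual cluster, the rest to the HPCS."""
    return "virtual_cluster" if pipeline["latency"] == "short" else "hpcs"


pipeline = load_pipeline(PIPELINE_XML)
print(pipeline["name"], "->", choose_target(pipeline))
for command in pipeline["steps"]:
    print(" ".join(command))
```

Because the sequence of steps lives in a document rather than in code, the same definition can be executed identically for every subject and recorded alongside the results.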
Provenance. Given the complexity of the data analysis streams
described above, it will be crucial to keep accurate track of the
history of processing steps for each generated ﬁle. Provenance
records will be generated at two levels. First, a record of the com-
putational steps executed to generate an image or connectivity map
will be embedded within a NIFTI header extension. This record
will contain sufﬁcient detail that the image could be regenerated
from the included information. Second, higher level metadata,
such as pipeline version and execution date, will be written into
an XCEDE-formatted XML document (Gadde et al., 2011) and
imported into ConnectomeDB. This information will be used to
maintain database organization as pipelines develop over time.
Data-sharing. The majority of the data collected and stored by the
HCP will be openly shared using the open-access model recom-
mended by the Science Commons11. The only data that will be with-
held from open access are those that could identify individual study
participants, which will be made available only for group analyses
submitted through ConnectomeDB. Data will be distributed in a
rolling fashion through quarterly releases over the course of Phase
2. Data will be released in standard formats, including DICOM,
NIFTI, GIFTI, and CIFTI.
Given the scope and scale of the datasets, our aim of open and
rapid data-sharing represents a signiﬁcant challenge. To address this
challenge, the HCP will use a tiered distribution strategy (Figure 4).
The ﬁrst tier includes dynamic access to condensed representations
of connectivity maps and related data. The second distribution tier
will allow users to download bundled subsets of the data. These
bundles will be conﬁgured to be of high scientiﬁc value while still
being small enough to download within a reasonable time. A third
tier will allow users to request a portable hard drive populated by a
more extensive bundle of HCP data. Finally, users needing access
to extremely large datasets that are impractical to distribute will be
FIGURE 4 | HCP data distribution tiers.
Figure 5 illustrates how CWB allows concurrent visualization
of multiple brain structures (left and right cerebral hemispheres
plus the cerebellum) in a single window. Subcortical structures
will be viewable concurrently with surfaces or as volume slices in
Connectome Workbench will include options to display
the results of various network analyses. For example, this may
include concurrent visualization of network nodes in their 3D
location in the brain as well as in a spring-embedded network,
where node position reflects the strength and pattern of con-
nectivity. The connection strength of graph edges will be rep-
resented using options of thresholding, color, and/or thickness.
As additional methods are developed for displaying complex
connectivity patterns among hundreds of nodes, the most useful
of these will be incorporated either directly into CWB or via
third party software.
Both the dense time-series and the parcellated time-series ﬁles
provide temporal information related to brain activity. A visualiza-
tion mode that plays “movies” by sequencing through and display-
ing each of the timepoints will be implemented. Options to view
results of Task-fMRI paradigms will include both surface-based
and volume-based visualization of individual and group-average
data. Given that Task-fMRI time courses can vary signiﬁcantly
across regions (e.g., Nelson et al., 2010), options will also be avail-
able to view the average time course for any selected parcel or
MEG and EEG data collected as part of the HCP will entail addi-
tional visualization requirements. This will include visualization in
both sensor space (outside the skull) and after source localization to
cortical parcels whose size respects the attainable spatial resolution.
Representations of time course data will include results of power
spectrum and BLP analyses.
able to obtain direct access to the HPCS to execute their computing
tasks. This raises issues of prioritization, cost recovery, and user
qualiﬁcation that have yet to be addressed.
Some of the data acquired by the HCP could potentially be used
to identify the study participants. We will take several steps to miti-
gate this risk. First, as mentioned above, sensitive DICOM header fields
will be redacted and facial features in the images will be obscured.
Second, the precision of sensitive data ﬁelds will be reduced in the
open-access data set, in some cases binning numeric ﬁelds into
categories. Finally, we will develop web services that will enable
users to submit group-wise analyses that would operate on sensitive
genetic data without providing users with direct access to individual
subject data. For example, users could request connectivity differ-
ence maps of subjects carrying the ApoE4 allele versus ApoE2/3.
The resulting group-wise data would be scientiﬁcally useful while
preventing individual subject exposure. This approach requires care
to ensure that requested groups are of sufﬁcient size and the number
of overall queries is constrained to prevent computationally driven
approaches from extracting individual subject information.
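The two safeguards named above, a minimum group size and a cap on the number of queries, can be sketched as a small gatekeeper that runs before any group-wise analysis. The thresholds, the class name, and the per-user counting scheme are hypothetical choices made for illustration; the text does not specify them.

```python
from typing import Dict, List, Tuple


class GroupQueryGuard:
    """Accept a group-wise query only if groups are large enough and budget remains."""

    def __init__(self, min_group_size: int = 20, max_queries_per_user: int = 100) -> None:
        # Both limits are illustrative assumptions, not values from the text.
        self.min_group_size = min_group_size
        self.max_queries_per_user = max_queries_per_user
        self._counts: Dict[str, int] = {}

    def authorize(self, user: str, groups: List[List[str]]) -> Tuple[bool, str]:
        used = self._counts.get(user, 0)
        if used >= self.max_queries_per_user:
            return False, "query budget exhausted"
        for group in groups:
            # Count distinct subjects so that duplicated IDs cannot pad a group.
            if len(set(group)) < self.min_group_size:
                return False, "group too small"
        self._counts[user] = used + 1
        return True, "ok"


guard = GroupQueryGuard(min_group_size=3, max_queries_per_user=2)
carriers = ["S01", "S02", "S03"]
non_carriers = ["S04", "S05", "S06", "S07"]

print(guard.authorize("alice", [carriers, non_carriers]))
print(guard.authorize("alice", [["S01", "S01", "S02"], non_carriers]))
print(guard.authorize("alice", [carriers, non_carriers]))
print(guard.authorize("alice", [carriers, non_carriers]))
```

Rejected requests do not consume budget in this sketch; a production service might count them as well, since repeated probing with slightly different groups is itself a signal.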
The complexity and diversity of connectivity-related data types
described above result in extensive visualization needs for the HCP.
To address these needs, CWB, developed on top of Caret software
(Van Essen et al., 2001)12, will include both browser and desktop
versions. The browser-based version will allow users to quickly
view data from ConnectomeDB, while the desktop version will
allow users to carry out more demanding visualization and analysis
steps on downloaded data.
Connectome Workbench is based on Caret6, a prototype Java-
based version of Caret, and will run on recent versions of Linux,
Mac OS X, and Windows. It will use many standard Caret features
for visualizing data on surfaces and volumes. This includes multi-
ple viewing windows and many display options. Major visualiza-
tion options will include (i) data overlaid on surfaces or volume
slices in solid colors to display parcels and other regions of interest
(ROIs); (ii) continuous scalar variables to display fMRI data, shape
features, connectivity strengths, etc., each using an appropriate pal-
ette; (iii) contours projected to the surface to delineate boundaries
of cortical areas and other ROIs; (iv) foci that represent centers of
various ROIs projected to the surface; and (v) tractography data
represented by needle-like representations of ﬁber orientations
in each voxel.
A “connectivity selector” option will load functional and struc-
tural connectivity data from the appropriate connectivity matrix
ﬁle (dense or parcellated) and display it on the user-selected surface
and/or volume representations (e.g., as in Figure 2). Because dense
connectivity ﬁles will be too large and slow to load in their entirety,
connectivity data will be read in from disk by random access when
the user requests a connectivity map for a particular brainordinate
or patch of brainordinates. For functional connectivity data, it may
be feasible to use the more compact time-series datasets and to calcu-
late on the fly the correlation coefficients representing connectivity.
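Random access to a single connectivity map can be illustrated with a flat binary file in which each row of the matrix occupies a fixed number of bytes, so that the row for a given brainordinate can be located by a seek rather than by reading the whole file. The layout assumed here (row-major, little-endian float32, no header) is a simplification chosen for the sketch and is not the actual connectivity file format.

```python
import os
import struct
import tempfile
from typing import List

FLOAT_SIZE = 4  # bytes per little-endian float32 value


def write_dense_matrix(path: str, matrix: List[List[float]]) -> None:
    """Write the matrix row by row as raw float32 values (assumed layout)."""
    with open(path, "wb") as f:
        for row in matrix:
            f.write(struct.pack(f"<{len(row)}f", *row))


def read_connectivity_row(path: str, index: int, n: int) -> List[float]:
    """Read only row `index` of an n-by-n matrix by seeking to its byte offset."""
    with open(path, "rb") as f:
        f.seek(index * n * FLOAT_SIZE)
        raw = f.read(n * FLOAT_SIZE)
    return list(struct.unpack(f"<{n}f", raw))


# Toy 3 x 3 matrix; values are exactly representable in float32.
MATRIX = [
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.75],
    [0.25, 0.75, 1.0],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dense.bin")
    write_dense_matrix(path, MATRIX)
    row = read_connectivity_row(path, 1, 3)

print(row)
```

The cost of one lookup is proportional to the length of a single row, independent of how many rows the file holds.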
FIGURE 5 | Connectome Workbench visualization of the inﬂated atlas
surfaces for the left and right cerebral hemispheres plus the cerebellum.
Probabilistic architectonic maps are shown of area 18 on the left hemisphere
and area 2 on the right hemisphere.
By the end of Phase II, the WU-Minn HCP consortium anticipates
having acquired an unparalleled neuroimaging dataset, linking
functional, structural, behavioral, and genetic information in a large
cohort of normal human subjects. The potential neuroscientiﬁc
insights to be gained from this dataset are great, but in many ways
unforeseeable. An overarching goal of the HCP informatics effort
is to facilitate discovery by helping investigators formulate and test
hypotheses by exploring the massive search space represented by
its multi-modal data structure.
The HCP informatics approach aims to provide a platform
that will allow for basic visualization of the dataset’s constituent
parts, but will also encourage users to dynamically and efﬁciently
make connections between the assembled data types. Users will
be able to easily explore the population-average structural con-
nectivity map, determine if the strength of a particular con-
nection is correlated with a speciﬁc behavioral characteristic or
genetic marker, or carry out a wide range of analogous queries.
If the past decade’s experience in the domain of genome-related
bioinformatics is a guide, data discovery is likely to take new and
unexpected directions soon after large HCP datasets become
available, spurring a new generation of neuroinformatics tools
that are not yet imagined. We will be responsive to new meth-
odologies when possible and will allow our interface to evolve
as new discoveries emerge.
The HCP effort is ambitious in many respects. Its success in
the long run will be assessed in many ways – by the number and
impact of scientiﬁc publications drawing upon its data, by the
utilization of tools and analysis approaches developed under its
auspices, and by follow-up projects that explore brain connectiv-
ity in development, aging, and a myriad of brain disorders. From
the informatics perspective, key issues will be whether HCP data
are accessed widely and whether the tools are found to be suitably
powerful and user-friendly. During Phase I, focus groups will be
established to obtain suggestions and feedback on the many facets
of the informatics platform and help ensure that the end product
meets the needs of the target users. The outreach effort will also
include booths and other presentations at major scientiﬁc meetings
(OHBM, ISMRM, and SfN), webinars and tutorials, a regularly
updated HCP website15, and publications such as the present one.
In addition to the open-access data that will be distributed by
the HCP, the HCP informatics platform itself will be open source
and freely available to the scientiﬁc community under a non-viral
license. A variety of similar projects will likely emerge in the com-
ing years that will beneﬁt from its availability. We also anticipate
working closely with the neuroinformatics community to make
the HCP informatics system interoperable with the wide array of
informatics tools that are available and under development.
While signiﬁcant progress has been made since funding com-
menced for the HCP, many informatics challenges remain to be
addressed. Many of the processing and analysis approaches to be
used by the HCP are still under development and will undoubtedly
evolve over the course of the project. How do we best handle the
myriad of potential forks in processing streams? Can superseded
pipelines be retired midway through the project or will users prefer
Querying ConnectomeDB from Connectome Workbench. While
users will often analyze data already downloaded to their own
computer, CWB will also be able to access data residing in the
Connectome database. Interactions between the two systems
will be enabled through ConnectomeDB’s web services API.
CWB will include a search interface to identify subject groups
in ConnectomeDB. Once a subject group has been selected,
users can then visually explore average connectivity maps for
this group by clicking on locations of interest on an atlas surface
in CWB. With each click, a request to ConnectomeDB’s spatial
query service will be submitted. Similar interactive explorations
will be possible for all measures of interest, e.g., behavioral test-
ing results or task performances from Task-fMRI sessions, with
the possibility of displaying both functional and structural con-
Browser-based visualization and querying ConnectomeDB.
Users will also be able to view connectivity patterns and other search
results via the ConnectomeDB UI so that they can quickly visualize
processed data without having to download data – and even view
results on tablets and smart phones. To support this web-based
visualization, we will develop a distributed CWB system in which
the visualization component is implemented as a web-embeddable
The computational components of CWB will be deployed as a
set of additional web services within the Connectome API. These
workbench services will act as an intermediary between the viewer
and ConnectomeDB, examining incoming visualization requests
and converting them into queries on the data services API. Data
retrieved from the database will then be processed as needed and
sent to the viewer.
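The intermediary role of these workbench services can be sketched as a translation step from a viewer request to a query on the data services. Everything specific in the example is hypothetical: the request fields, the `/api/spatial-query` path, and the parameter names are invented for illustration, since the actual API routes are not given here.

```python
from typing import Dict, List, Union
from urllib.parse import urlencode

# Hypothetical route; not a documented ConnectomeDB endpoint.
SPATIAL_QUERY_PATH = "/api/spatial-query"

Request = Dict[str, Union[str, int, List[str]]]


def to_data_query(request: Request) -> str:
    """Convert a viewer request (e.g., a click on a surface vertex) into a query URL."""
    params = {
        "subjects": ",".join(sorted(request["subject_ids"])),
        "measure": request["measure"],
        "seed": request["seed_vertex"],
    }
    return SPATIAL_QUERY_PATH + "?" + urlencode(params)


click = {"subject_ids": ["S02", "S01"], "measure": "functional", "seed_vertex": 1234}
print(to_data_query(click))
```

Keeping this translation in a service layer means the viewer needs to know only about visualization requests, while the data services remain usable by other clients.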
Links to external databases
Providing close links to other databases that contain extensive
information about the human brain will further enhance the util-
ity of HCP-related datasets. For example, the Allen Human Brain
Atlas (AHBA)13 contains extensive data on gene expression patterns
obtained by postmortem analyses of human brains coupled to a pow-
erful and ﬂexible web interface for data mining and visualization. The
gene expression data (from microarray analyses and in situ hybridiza-
tion analyses) have been mapped to the individual subject brains in
stereotaxic space and also to cortical surface reconstructions. We plan
to establish bi-directional spatially based links between CWB and the
AHBA. This would enable a user of CWB interested in a particular
ROI based on connectivity-related data to link to the AHBA and
explore gene expression data related to the same ROI. Conversely,
users of AHBA interested in a particular ROI based on gene expres-
sion data would be able to link to ConnectomeDB/Workbench and
analyze connectivity patterns in the same ROI. A similar strategy
will be useful for other resources, such as the SumsDB searchable
database of stereotaxic coordinates from functional imaging studies14.
Through the HCP’s outreach efforts, links to additional databases
will be developed over the course of the project.
Anderson, C. H. (2001). An inte-
grated software suite for surface-based
analyses of cerebral cortex. J. Am. Med.
Inform. Assoc. 8, 443–459.
Vincent, J. L., Patel, G. H., Fox, M. D.,
Snyder, A. Z., Baker, J. T., Van Essen,
D. C., Zempel, J. M., Snyder, L. H.,
Corbetta, M., and Raichle, M. E.
(2007). Intrinsic functional archi-
tecture in the anaesthetized monkey
brain. Nature 447, 83–86.
Visscher, P. M., and Montgomery, G.
W. (2009). Genome-wide asso-
ciation studies and human disease:
from trickle to flood. JAMA 302,
Wipf, D., and Nagarajan, S. A. (2009).
Uniﬁed Bayesian framework for MEG/
EEG source imaging. NeuroImage 44,
Conflict of Interest Statement: The
authors declare that the research was
conducted in the absence of any com-
mercial or financial relationships that
could be construed as a potential conﬂict
Received: 18 March 2011; accepted: 08 June
2011; published online: 27 June 2011.
Citation: Marcus DS, Harwell J, Olsen T,
Hodge M, Glasser MF, Prior F, Jenkinson M,
Laumann T, Curtiss SW and Van Essen DC
(2011) Informatics and data mining tools
and strategies for the Human Connectome
Project. Front. Neuroinform. 5:4. doi:
Copyright © 2011 Marcus, Harwell, Olsen,
Hodge, Glasser, Prior, Jenkinson, Laumann,
Curtiss and Van Essen for the WU-Minn
HCP Consortium. This is an open-access
article subject to a non-exclusive license
between the authors and Frontiers Media
SA, which permits use, distribution and
reproduction in other forums, provided the
original authors and source are credited and
other Frontiers conditions are complied with.
scheme for human left lateral parietal
cortex. Neuron 67, 156–170.
Newman, M. E. (2006). Modularity
and community structure in net-
works. Proc. Natl. Acad. Sci. USA 103,
Ou, W., Nummenmaa, A., Ahveninen, J.,
Belliveau, J. W., Hämäläinen, M. S.,
and Golland, P. (2010). Multimodal
functional imaging using fMRI-
informed regional EEG/MEG source
estimation. NeuroImage 52, 97–108.
Patel, V., Dinov, I. D., Van Horn, J. D.,
Thompson, P. M., and Toga, A. W.
(2010). LONI MiND: metadata in NIfTI
for DWI. Neuroimage 51, 665–676.
Petrovic, V. S., Cootes, T. F., Mills, A. M.,
Twining, C. J., and Taylor, C. J. (2007).
Automated analysis of deformable struc-
ture in groups of images. Proc. British
Machine Vision Conference 2, 1060–1069.
Rubinov, M., and Sporns, O. (2010).
Complex network measures of brain
connectivity: uses and interpretations.
Neuroimage 52, 1059–1069.
Sabuncu, M. R., Singer, B. D., Conroy, B.,
Bryan, R. E., Ramadge, P. J., and Haxby,
J. V. (2010). Function-based inter-
subject alignment of human cortical
anatomy. Cereb. Cortex 20, 130–140.
Scheeringa, R., Fries, P., Petersson, K.-M.,
Oostenveld, R., Grothe, I., Norris, D.
G., Hagoort, P., and Bastiaansen, M.
C. M. (2011). Neuronal dynamics
underlying high- and low-frequency
EEG oscillations contribute indepen-
dently to the human BOLD signal.
Neuron 69, 572–583.
Sporns, O., Tononi, G., and Kötter, R.
(2005). The human connectome: a
structural description of the human
brain. PLoS Comput. Biol. 1: e42.
Sporns, O. (2010). Networks of the Brain.
Cambridge, MA: MIT Press, 375 pp.
Van Essen, D. C., Drury, H. A., Dickson,
J., Harwell, J., Hanlon, D., and
Wagster, M. V. (2010). Assessment of
neurological and behavioural func-
tion: the NIH Toolbox. Lancet Neurol.
Jack, C. R. Jr., Bernstein, M. A., Fox, N. C.,
Thompson, P., Alexander, G., Harvey,
D., Borowski, B., Britson, P. J., Whitwell,
J. L., Ward, C., Dale, A. M., Felmlee, J.
P., Gunter, J. L., Hill, D. L., Killiany,
R., Schuff, N., Fox-Bosetti, S., Lin, C.,
Studholme, C., DeCarli, C. S., Krueger,
G., Ward, H. A., Metzger, G. J., Scott,
K. T., Mallozzi, R., Blezek, D., Levy, J.,
Debbins, J. P., Fleisher, A. S., Albert, M.,
Green, R., Bartzokis, G., Glover, G.,
Mugler, J., and Weiner, M. W. (2008).
The Alzheimer’s Disease Neuroimaging
Initiative (ADNI): MRI methods. J.
Magn. Reson. Imaging 27, 685–691.
Johansen-Berg, H., and Behrens,
T. E. (2009). Diffusion MRI: From Quantitative
Measurement to In-vivo Neuroanatomy.
Boston, MA: Elsevier.
Johansen-Berg, H., and Rushworth, M.
F. (2009). Using diffusion imaging to
study human connectional anatomy.
Annu. Rev. Neurosci. 32, 75–94.
Langille, M. G. I., and Eisen, J. A. (2010).
BioTorrents: a file sharing service for
scientific data. PLoS ONE 5(4): e10071.
Lichtman, J. W., Livet, J., and Sanes, J. R.
(2008). A technicolour approach to
the connectome. Nat. Rev. Neurosci.
Marcus, D. S., Olsen, T. R., Ramaratnam,
M., and Buckner, R. L. (2007). The
Extensible Neuroimaging Archive
Toolkit: an informatics platform for
managing, exploring, and sharing
neuroimaging data. Neuroinformatics
Nelson, S. M., Cohen, A. L., Power, J. D.,
Wig, G. S., Miezin, F. M., Wheeler,
M. E., Velanova, K., Donaldson, D.
I., Phillips, J. S., Schlaggar, B. L., and
Petersen, S. E. (2010). A parcellation
Beckmann, M., Johansen-Berg, H.,
and Rushworth, M. F. (2009).
Connectivity-based parcellation of
human cingulate cortex and its rela-
tion to functional specialization. J.
Neurosci. 29, 1175–1190.
Briggman, K. L., and Denk, W. (2006).
Towards neural circuit reconstruc-
tion with volume electron microscopy
techniques. Curr. Opin. Neurobiol. 16,
de Pasquale, F., Della Penna, S., Snyder, A.
Z., Lewis, C., Mantini, D., Marzetti, L.,
Belardinelli, P., Ciancetta, L., Pizzella,
V., Romani, G. L., and Corbetta, M.
(2010). Temporal dynamics of spon-
taneous MEG activity in brain net-
works. Proc. Natl. Acad. Sci. U.S.A.
Feinberg, D. A., Moeller, S., Smith, S. M.,
Auerbach, E., Ramanna, S., Glasser,
M. F., Miller, K. L., Ugurbil, K., and
Yacoub, E. (2010). Multiplexed echo
planar imaging for sub-second whole
brain FMRI and fast diffusion imag-
ing. PLoS ONE 5, e15710. doi: 10.1371/
Fielding, R. T. (2000). Architectural Styles
and The Design of Network-Based
Software Architectures. Doctoral dis-
sertation, University of California,
Irvine. Available at: http://www.ics.
Fox, M. D., and Raichle, M. E. (2007).
Spontaneous fluctuations in brain
activity observed with functional
magnetic resonance imaging. Nat.
Rev. Neurosci. 8, 700–711.
Gadde, S., Aucoin, N., Grethe, J. S., Keator,
D. B., Marcus, D. S., and Pieper, S.
(2011). XCEDE: an extensible schema
for biomedical data. Neuroinformatics
Gershon, R. C., Cella, D., Fox, N. A.,
Havlik, R. J., Hendrie, H. C., and
for them to remain operational? What if a pipeline is found to
be flawed? These and other data processing issues will require an
active dialog with the user community over the course of the project.
Subject privacy is another issue that requires both technical
and ethical consideration. How do we minimize the risk of subject
exposure while maximizing the utility of the data to the scientific
community? Finally, what disruptive technologies may emerge over
the 5 years of the HCP? How do we best maintain focus on our core
deliverables while retaining agility to adopt important new tools
that could further the scientific aims of the project? History suggests
that breakthroughs can come from unlikely quarters. We anticipate
that the HCP’s open data and software sharing will encourage such
breakthroughs and contribute to the nascent field of connectome
science and discovery.
Funded in part by the Human Connectome Project
(1U54MH091657-01) from the 16 NIH Institutes and Centers
that support the NIH Blueprint for Neuroscience Research, by
the McDonnell Center for Systems Neuroscience at Washington
University, and by grant NCRR 1S10RR022984-01A1 for the CHCP;
1R01EB009352-01A1 and 1U24RR02573601 for XNAT support;
and 2P30NS048056-06 for the NIAC. Members of the WU-Minn
HCP Consortium are listed at
http://www.humanconnectome.org/about/hcp-investigators.html and
http://www.humanconnectome.org/about/hcp-colleagues.html. We thank Steve Petersen,
Olaf Sporns, Jonathan Power, Andrew Heath, Deanna Barch, Jon
Schindler, Donna Dierker, Avi Snyder, and Steve Smith for valuable
comments and suggestions on the manuscript.