Tractography and machine learning: Current state
and open challenges
Philippe Poulin1| Daniel Jörgens2| Pierre-Marc
Jodoin*1| Maxime Descoteaux*1
1Department of Computer Science,
Université de Sherbrooke, Sherbrooke,
Québec, Canada
2Department of Biomedical Engineering
and Health Systems, KTH Royal Institute of
Technology, Stockholm, Sweden
*Co-last author
Philippe Poulin, Department of Computer
Science, Université de Sherbrooke,
Sherbrooke, Québec, Canada
Funding information
FRQNT; Université de Sherbrooke Institutional Chair in Neuroinformatics; NSERC Discovery grants of Pr Descoteaux and Pr Jodoin
Supervised machine learning (ML) algorithms have recently been proposed as an alternative to traditional tractography methods in order to address some of their weaknesses. They can be path-based and local-model-free, and easily incorporate anatomical priors to make contextual and non-local decisions that should help the tracking process. ML-based techniques have thus shown promising reconstructions of larger spatial extent of existing white matter bundles, promising reconstructions with fewer false positives, and promising robustness to known position and shape biases of current tractography techniques. But as of today, none of these ML-based methods have shown conclusive performances or have been adopted as a de facto solution to tractography. One reason for this might be the lack of well-defined and extensive frameworks to train, evaluate, and compare these methods.
In this paper, we describe several datasets and evaluation tools that contain useful features for ML algorithms, along with the various methods proposed in recent years. We then discuss the strategies that are used to evaluate and compare those methods, as well as their shortcomings. Finally, we describe the particular needs of ML tractography methods and discuss tangible solutions for future works.
Diffusion MRI, tractography, machine learning, benchmark
arXiv:1902.05568v1 [q-bio.NC] 14 Feb 2019
1 | INTRODUCTION

In the field of diffusion magnetic resonance imaging (dMRI), tractography refers to the process of inferring streamline structures that are locally aligned with the underlying white matter (WM) dMRI measurements [ ]. A simple approach to obtain such streamlines is an iterative process in which, starting from a seed point, an estimate of the local tissue orientation is determined and followed for a certain step length before repeating the orientation estimation at the new position. The tracking procedure may be deterministic [ ] (at each point, the algorithm follows the strongest orientation) or probabilistic [ ] (at each point, the algorithm samples a direction closely aligned with the strongest orientation). Tracking may also be global, as some methods recover streamlines all at once [ ]. In between the local and global methods is the category of shortest-path methods, including front evolution, simulated diffusion, geodesic, and graph-based approaches [ ]. Ultimately, the collection of all trajectories created in that way is called a tractogram.
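The iterative procedure just described can be sketched in a few lines. The following is a minimal illustration only; the `get_orientations` callback, its interface, and all parameter values are our own assumptions, not part of any cited method:

```python
import numpy as np

def track(seed, get_orientations, step_size=0.5, max_steps=1000,
          probabilistic=False, rng=None):
    """Grow one streamline from a seed point.

    `get_orientations` maps a 3-D position to candidate unit directions
    and their weights; it stands in for any local orientation model.
    """
    if rng is None:
        rng = np.random.default_rng()
    points = [np.asarray(seed, dtype=float)]
    prev_dir = None
    for _ in range(max_steps):
        dirs, weights = get_orientations(points[-1])
        if len(dirs) == 0:          # e.g. the tracking mask was left
            break
        # Keep candidate orientations consistent with the previous step.
        if prev_dir is not None:
            dirs = np.where((dirs @ prev_dir)[:, None] < 0, -dirs, dirs)
        if probabilistic:           # sample a direction according to weight
            d = dirs[rng.choice(len(dirs), p=weights / weights.sum())]
        else:                       # deterministic: follow the strongest peak
            d = dirs[np.argmax(weights)]
        points.append(points[-1] + step_size * d)
        prev_dir = d
    return np.array(points)
```

For instance, a toy model returning a constant orientation along x yields a straight streamline growing by one step length per iteration.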
In traditional methods, the estimate of the local tissue fiber orientation is usually inferred from an explicit and local model which fits the (local) diffusion data. These local models include diffusion tensor models [ ], multi-tensor models [ ], and other methods that aim at reconstructing the fiber orientation distribution function (fODF), like constrained spherical deconvolution (CSD) [ ], to name a few. However, the choice of the best model is by itself difficult [ ], as it depends on various factors such as the data acquisition protocol or the targeted WM regions, and therefore has a direct influence on the quality of an obtained tractogram [ ]. Moreover, traditional methods based on local orientation alone are prone to common mistakes, such as missing the full spatial extent of bundles and producing a great amount of false positive connections [15].
Another important factor for the performance of a tractography method is the set of rules that govern the progression of a single step, as well as simple global properties of an individual streamline. Traditional methods may define several engineered, or "manually-defined", high-level rules with the aim of improving the anatomical plausibility of the recovered tractogram. Instances of these are constraints on streamline length (i.e. filtering streamlines that are too long or too short), streamline shape (e.g. filtering streamlines with sharp turns), or progression rules that make streamlines "bounce off" the WM border when they are about to leave the WM mask at a certain angle [ ]. Just like modeling noise and artifacts and defining the right local model, the design of these high-level rules has a direct impact on the performance of a tractography method [16, 15].
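Such engineered rules are simple to state programmatically. A minimal sketch of length and curvature filtering follows; the thresholds are arbitrary illustrative values, not taken from any cited method:

```python
import numpy as np

def plausible(streamline, min_length=10.0, max_length=200.0,
              max_angle_deg=60.0):
    """Illustrative high-level filtering rules for a candidate streamline.

    Rejects streamlines that are too short, too long, or that contain
    sharp turns. `streamline` is an (N, 3) array of points; thresholds
    are in mm and degrees.
    """
    pts = np.asarray(streamline, dtype=float)
    segs = np.diff(pts, axis=0)
    length = np.linalg.norm(segs, axis=1).sum()
    if not (min_length <= length <= max_length):
        return False
    # Angle between consecutive segments must stay below the threshold.
    u = segs / np.linalg.norm(segs, axis=1, keepdims=True)
    cos = np.clip((u[:-1] * u[1:]).sum(axis=1), -1.0, 1.0)
    angles = np.degrees(np.arccos(cos))
    return bool(np.all(angles <= max_angle_deg))
```

A straight 20 mm streamline passes these checks, while one with a 90-degree turn or a 1 mm total length is rejected.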
To address these inherent difficulties, recent proposals suggest that machine learning (ML) algorithms, supervised or unsupervised, may be used to implicitly learn a local, global or contextual fiber orientation model as well as the tracking procedure. Approaches ranging from the application of self-organizing maps (SOM) [ ], random forests (RF) [ ], Multilayer Perceptrons (MLP) [ ], Gated Recurrent Units (GRU) [ ], as well as Convolutional Neural Networks (CNN) [ ] and Autoencoders [ ], have been employed at the core of tractography to drive streamline progression. Apart from the differences in their underlying architecture, these ML methods differ substantially in aspects of the exact problem formulation, e.g. the definition of the input data to the model, modeling the predictions as a regression [ ] or classification problem [ ], or even the general tractography approach, i.e. whole-brain [ ] or bundle-specific [ ]. The fact alone that these approaches differ in several aspects makes it difficult to draw conclusions on the value of each of the individual modeling choices.
Furthermore, while the above-mentioned approaches constitute the main ideas for applying ML directly to the process of tractography, machine learning and especially deep learning (DL) methods have also been applied in related fields. Stacked U-Nets were proposed to segment the volume of individual white matter bundles from images of fODF peaks [ ]. It was also suggested to predict fiber orientations from raw diffusion data based on convolutional neural networks (CNN) [ ]. Several ideas for streamline clustering or streamline segmentation have been proposed, including a CNN based on landmark distances [ ], a long short-term memory (LSTM)-based siamese network for rotation-invariant streamline segmentation [ ], and a CNN approach for streamline clustering based on the sequence of their coordinates [ ]. Even though the mentioned works are closely related to tractography and contribute to the common goal of improved analysis of the white matter anatomy of the human brain, we restrict our focus exclusively to the direct application of ML (and especially DL) for tractography, with the explicit goal of producing streamlines and addressing the weaknesses of traditional methods. For that reason, we refer the interested reader to the respective references for more details.
An important factor for effectively advancing this field of research is a common and appropriate methodology for training and evaluating the performance of different approaches, which is currently lacking. Over the years, multiple challenges have been proposed to assess the performance of conventional tractography methods, and a clear and exhaustive review is provided by Schilling et al. [ ]. However, we argue that the design of these challenges is typically inappropriate for ML methods. In fact, the 2015 ISMRM Tractography Challenge [15] (along with the Tractometer evaluation tool [16]) has been adopted as the tool of choice for benchmarking new ML tractography pipelines [ ]. Unfortunately, several inherent flaws arising specifically in the context of ML make it difficult to perform a fair comparison between the results obtained from different ML pipelines. In particular, diffusion data preprocessing is left to participants (dissimilar inputs), tracking seeds and a tracking mask are not always given (varying test environment), the test diffusion volume is sometimes used for training (data contamination), training streamlines are not provided (disparate training data), and testing on a single synthetic subject means that any computed estimator of a model's performance is unreliable (small sample size). Against the background of a prospectively increasing number of ML-based approaches tackling the problem of tractography, a carefully designed evaluation framework that appropriately addresses the specific requirements of ML methods has the potential to support and facilitate research in this field in the upcoming years.
In this paper, we follow a threefold strategy. First, we introduce the currently available datasets and evaluation tools, along with their useful features and weaknesses with regard to machine learning. Then, we provide a comprehensive review of existing ML-based tractography approaches and derive a set of key concepts distinguishing them from each other. Subsequently, we discuss the strategies for the evaluation of tractography pipelines and identify issues and limitations arising when they are applied to ML-based tractography methods. We finally describe the important features of an appropriate evaluation framework that the community ought to adopt in the near future, and point out its potential advantages for research in data-driven streamline tractography.
2 | DATASETS AND EVALUATION TOOLS

Over the years, many diffusion MRI datasets were produced and annotated, either as part of a challenge or of research papers. In this section, we overview several datasets that have been used to train and/or validate supervised learning algorithms for tractography. Specifically, we selected datasets that offer both diffusion data and streamlines. Selected datasets also needed to have either clearly defined evaluation metrics, or to be large enough (more than 50 subjects) to be considered as standalone training sets. We include datasets that are either publicly available or simply mentioned in a research paper without a public release.

We excluded datasets or challenges focused on non-human anatomy (e.g. rat or macaque), where the ground truth is harder to define and results might be harder to generalize to human anatomy (for data-driven algorithms), like the 2018 VOTEM Challenge [ ]. Moreover, we left out datasets focused only on pathological cases like the 2015 DTI Challenge [ ], because we consider it too early for data-driven tractography algorithms, at least until more conclusive results are obtained on healthy subjects. We also excluded tractography atlases when tracking was done on a single diffusion volume, usually averaged over multiple subjects (e.g. HCP842 [ ]), because results tend to be overly smooth and unsuited for ML methods. However, we include a recent case where tracking was done for each subject: the 100-subjects WM atlas of Zhang et al. [39].
While all the selected datasets are useful in one way or another for data-driven methods, they differ in multiple ways, which are detailed in the following subsections and summarized in Table 1. The listed properties are the following:

Name: The dataset name and reference
Year: The year of publication of the dataset or of the paper using the dataset
Public: Is the dataset (diffusion data and streamlines) publicly available?
Real: Is the diffusion data a real acquisition or is it simulated?
Human: Does the diffusion data represent human brain anatomy?
Subjects: The number of subjects or acquisitions
Bundles: The number of bundles or tracts (if streamlines are available)
GT: Is a ground truth known? For real acquisitions, streamlines validated by a human expert (e.g. a neuroanatomist) are considered as GT, despite the fact that these annotations are subject to inter-rater and intra-rater variations.
Metrics: Are well-defined evaluation metrics available with this dataset?
Split: Is the dataset split into a training and a testing set that future works can rely on?
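The listed properties amount to one record per dataset. As an illustration, the selection criteria stated above could be encoded as follows (class, field, and function names are our own shorthand, not part of any cited work):

```python
from dataclasses import dataclass

@dataclass
class TractographyDataset:
    """One row of Table 1; field names are our own shorthand."""
    name: str
    year: int
    public: bool        # diffusion data and streamlines publicly available
    real: bool          # real acquisition (vs. simulated signal)
    human: bool         # human brain anatomy
    subjects: int       # number of subjects or acquisitions
    bundles: int        # number of bundles or tracts
    ground_truth: bool  # GT known, or streamlines validated by an expert
    metrics: bool       # well-defined evaluation metrics available
    split: bool         # predefined training/testing split

def meets_selection_criteria(d: TractographyDataset,
                             min_subjects: int = 50) -> bool:
    # Selection rule used in this review: clearly defined metrics,
    # or enough subjects to serve as a standalone training set.
    return d.metrics or d.subjects > min_subjects

fibercup = TractographyDataset("FiberCup", 2009, public=True, real=True,
                               human=False, subjects=1, bundles=7,
                               ground_truth=True, metrics=True, split=False)
```

Encoding the catalog this way makes the inclusion rule explicit and lets one filter candidate datasets programmatically.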
Note that the notion of "ground truth" refers to an indisputable, biologically-validated label assigned to an observed variable. In medical imaging, such a ground truth may be obtained with a biopsy [ ], through careful complementary analysis [ ], or by having several experts agree on a given diagnosis [ ]. Unfortunately, such a restrictive definition of a ground truth is unreachable most of the time, especially for white matter tracts obtained from tractography, where no expert can truly assess the existence (or non-existence) of a given streamline in a human brain from MRI images only. In fact, only synthetically-generated streamlines or man-made phantoms can be considered as real "ground truth". Despite that, for the purpose of this paper, we also use the term "ground truth" for any data that has been manually validated by a human expert, typically a neuroanatomist. In the medical imaging field, such annotated data would be called a gold standard, while in the artificial intelligence community, it might be called weakly annotated data. Although such annotations do not meet the fundamental definition of a ground truth, they are nonetheless widely accepted by the medical imaging AI community [43].
2.1 | The FiberCup dataset and the Tractometer tool
Original FiberCup Tractography Contest (2009)

Fillard et al. proposed the FiberCup Tractography Contest [44] in conjunction with the 2009 MICCAI conference. The goal was to quantitatively compare tractography methods and algorithms using a clear and reproducible methodology. They built a realistic diffusion MR 7-bundle phantom with varying configurations (crossing, kissing, splitting, bending). The organizers acquired diffusion images with b-values of 2000, 4000, and 6000 s/mm², and used isotropic resolutions of 3mm and 6mm, resulting in 6 different diffusion datasets. Contestants were provided all datasets (but not the ground truth) and were free to apply any preprocessing they wanted on the diffusion images. Evaluation was done by choosing 16 specific voxels, or seed points, in which a unique fiber bundle is expected. Participants were expected to submit a single fiber bundle for each of those seed voxels. Quantitative evaluation was done by comparing the 16 pairs of candidate and ground truth fibers using a symmetric Root Mean Square Error (sRMSE).
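A symmetric distance of this kind is commonly computed by averaging directed closest-point RMS distances. The sketch below illustrates the idea only; the exact FiberCup definition (e.g. its resampling of points) may differ:

```python
import numpy as np

def srmse(fiber_a, fiber_b):
    """Symmetric root-mean-square error between two polylines.

    For each point of one fiber, take the distance to the closest point
    of the other fiber; RMS these values, then average the two directed
    results. A sketch of the idea, not the official contest definition.
    """
    a = np.asarray(fiber_a, dtype=float)
    b = np.asarray(fiber_b, dtype=float)
    # Pairwise point distances, shape (len(a), len(b)).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    rms_ab = np.sqrt((d.min(axis=1) ** 2).mean())  # a -> closest in b
    rms_ba = np.sqrt((d.min(axis=0) ** 2).mean())  # b -> closest in a
    return 0.5 * (rms_ab + rms_ba)
```

Identical fibers score 0, and two parallel straight fibers offset by 1 mm score 1, which matches the intuition of an average point-wise error.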
While the FiberCup Tractography Contest makes a good test case for simple configurations, it does not represent true human anatomy and does not impose a choice of b-value and preprocessing, which can induce significant differences
TABLE 1  Annotated datasets.

Name                      Year  Public  Real  Human  Subjects   Bundles   GT  Metrics  Split
Fibercup [44]             2009    ✓      ✓     –         1         7      ✓     ✓       –
Simulated Fibercup [45]   2012    ✓      –     –         1         7      ✓     ✓       –
Tracula [46]              2011    –      ✓     ✓        67        18      ✓     –       –
HARDI 2012 [47]           2012    ✓      –     –         2         7      ✓     ✓       ✓
HARDI 2013 [48]           2013    ✓      –     –         2        20      ✓     ✓       ✓
ISMRM 2015 [15, 16]       2015    ✓      –     ✓         1        25      ✓     ✓       –
HAMLET [29]               2018    –      ✓     ✓        83        12      –     –       ✓
PyT (BIL&GIN) [49]        2018    –      ✓     ✓       410         2      ✓     –       –
BST (BIL&GIN) [50]        2018    –      ✓     ✓        39         5      ✓     ✓       –
TractSeg (HCP) [30]       2018    ✓      ✓     ✓       105        72      ✓     –       –
Zhang et al. (HCP) [39]   2018    –      ✓     ✓       100     58 + 198   ✓     –       –
in data-driven methods. Also, it does not provide any training streamlines, and is thus useful only as a validation tool for ML-based methods. Furthermore, the fact that it contains only one subject makes it hard to evaluate the true generalization capability of an ML method trained and tested on that dataset. However, it is the only dataset that provides seed points in order to have a uniform test environment, which is of utmost importance when comparing ML-based algorithms. In the end, it is unclear whether, for ML-based methods, a good performance on the FiberCup contest would correlate with good performance on human anatomy.
Tractometer evaluation tool (2013)

In 2013, Côté et al. developed the Tractometer evaluation tool, to be used alongside the original FiberCup data, with the aim of providing quantitative measures that better reflect brain connectivity studies. Using a Region of Interest (ROI)-based filtering method, a complete tractogram can be evaluated with global connectivity metrics, such as the number of valid and invalid bundles. Furthermore, they proposed two seeding masks: a complete mask (mimicking a brain WM mask) and an ROI mask (mimicking the GM-WM interface). The Tractometer was designed to address the fact that "metrics are too local and vulnerable to the seeds given, and, as a result, do not capture the global connectivity behavior of the fiber tracking algorithm" [16].
Simulated FiberCup (2014)

In 2014, Neher et al. proposed a simulated version of the FiberCup, allowing new tracking algorithms to be tested using multiple acquisition parameters [45]. The simulated data can be used alongside the Tractometer tool designed for the original phantom.

Wilkins et al. also developed a synthetic version of the FiberCup dataset, but did not publicly release the data [ ]. Unfortunately, with regard to ML methods, the simulated FiberCup dataset suffers from the same shortcomings as the original FiberCup dataset, as it contains only one non-human subject whose data is not split a priori into a training and a testing set.
2.2 | Tracula (2011)

Yendiki et al. [46] published the Tracula method for automated probabilistic reconstruction of 18 major WM pathways. It uses prior information on the anatomy of bundles from a set of training subjects. The training set was built from 34 schizophrenia patients and 33 healthy controls, using a 1.5T Siemens scanner as part of the multi-site MIND Clinical Imaging Consortium [ ]. The diffusion images include 60 gradient directions acquired with a b-value of 700 s/mm², along with 10 b=0 images, at an isotropic resolution of 2mm. Whole-brain deterministic tracking was performed, followed by expert manual labeling using ROIs for 18 major WM bundles. The dataset also includes a measure of the inter-rater and intra-rater variability for the left and right uncinate.

To our knowledge, this is the earliest appearance of a large-scale human dataset with expert annotation of streamlines. It is also the only dataset that includes a measure of inter-rater and intra-rater variability, which is a desirable feature for ML methods (also discussed later in Section 4.4). Unfortunately, the complete set of diffusion images and streamlines has been incorporated into the method and is not public.
2.3 | HARDI Reconstruction Challenges

HARDI Reconstruction Challenge (2012)

Daducci et al. organized the 2012 HARDI Reconstruction Challenge [47] at the ISBI 2012 conference. The goal of the challenge was to quantitatively assess the quality of intra-voxel reconstructions by measuring the predicted number of fiber populations and the angular accuracy of the predicted orientations. A training set was released prior to the challenge, and a test set was used to score the algorithms. As such, the 2012 HARDI dataset contains diffusion images but no streamlines.

Participants could request a custom acquisition (only once) by sending a list of sampling coordinates in q-space, and the organizers would then produce a simulated signal for the given parameters. A 16 × 16 × 5 volume was then produced, containing seven different bundles attempting to recreate realistic 3-D configurations. The metrics proposed by the authors are ill-posed for ML-based methods because of the limited context available and the focus on local performance. Like the FiberCup, this dataset would only be useful as a validation tool, given the lack of training streamlines, a limited number of bundles (only seven), and a limited number of non-human subjects (only two).
HARDI Reconstruction Challenge (2013)

The 2013 HARDI Reconstruction Challenge [48] was organized one year later at the ISBI 2013 conference. For ML-based methods, three improvements are relevant compared to the 2012 challenge: a more realistic simulation of the diffusion signal, a new evaluation system based on connectivity analyses, and a larger set of 20 bundles. Indeed, data-driven methods try to learn an implicit representation without imposing a model on the signal, which means that the signal used for training and testing should be as close as possible to that found in clinical practice. Furthermore, the main benefit of data-driven methods is the ability to use context in order to make good predictions in a multitude of configurations, which means they have the potential to particularly improve connectivity analyses. Therefore, this dataset would be a better validation tool for ML-based methods than the 2012 HARDI Reconstruction Challenge. Nonetheless, it suffers from an inherent limitation, as it contains only two non-human subjects.
FIGURE 1  2015 ISMRM Tractography Challenge data generation process (taken from [ ]).
2.4 | ISMRM Tractography Challenge (2015)

This dataset was designed for a tractography challenge organized in conjunction with the 2015 ISMRM conference [15]. During the challenge, participants were asked to reconstruct streamlines from a synthetic human-like diffusion-weighted MR dataset, which was simulated with the aim of replicating a realistic, clinical-like acquisition, including noise and artifacts. The available data consists of a diffusion dataset with 32 b=1000 s/mm² images and one b=0 image at 2mm isotropic resolution, as well as a T1-like image with 1mm isotropic resolution. Since all data was generated from an expert segmentation of 25 bundles, in theory, a perfect tracking algorithm should produce exactly these specific bundles. Unfortunately, as for the HARDI and FiberCup datasets, the 2015 ISMRM Tractography Challenge contains data from a limited number of subjects (only one) and lacks a clear separation between training and testing data. Nonetheless, in combination with the Tractometer tool [16], this dataset has often been used to assess ML-based tractography methods. Figure 1 shows the data generation process for the challenge.
Once a tractogram has been generated using the challenge diffusion data, the Tractometer tool uses a "bundle recognition algorithm" [ ] to cluster the streamlines into bundles. The generated bundles are then compared to the ground truth, producing groups of "valid bundles" and "invalid bundles", depending on which regions of the brain the streamlines connect. Streamlines that do not correspond to a ground truth bundle are classified as "no connection" streamlines. The metrics computed by the modified Tractometer for the Tractography Challenge are as follows:

Valid Bundles (VB): The number of correctly reconstructed ground truth bundles.
Invalid Bundles (IB): The number of reconstructed bundles that do not match any ground truth bundle.
Valid Connections (VC): The ratio of streamlines in valid bundles over the total number of produced streamlines.
Invalid Connections (IC): The ratio of streamlines in invalid bundles over the total number of produced streamlines.
No Connections (NC): The ratio of streamlines that are either too short or do not connect two regions of the cortex over the total number of produced streamlines.
Bundle Overlap (OL): The ratio of ground truth voxels traversed by at least one streamline over the total number of ground truth voxels.
Bundle Overreach (OR): The ratio of voxels traversed by at least one streamline that do not belong to the ground truth over the total number of ground truth voxels.
F1-score (F1): The harmonic mean of recall (OL) and precision (1-OR).
The definition of streamline-oriented metrics (VB, IB, VC, IC, NC) and volume-oriented metrics (OL, OR, F1) means that there is no single number that can fully assess the performance of an algorithm. For example, deterministic methods often score higher on streamline-oriented metrics compared to probabilistic methods. As such, a thorough review of all scores must be performed in order to properly compare algorithms, and in many cases, the choice of one algorithm over another may depend on a specific use-case (e.g. bundle reconstruction vs. connectivity analysis).
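Given streamline counts and voxel sets, these ratios reduce to straightforward arithmetic. A sketch of the computation follows (variable names are ours; the bundle-recognition step that produces the counts is assumed to have run already):

```python
def tractometer_scores(n_valid, n_invalid, n_total,
                       gt_voxels, traversed_voxels):
    """Sketch of the streamline- and volume-oriented ratios listed above.

    `gt_voxels` and `traversed_voxels` are sets of voxel indices; the
    counts refer to streamlines. Definitions follow the list in the text.
    """
    vc = n_valid / n_total                             # Valid Connections
    ic = n_invalid / n_total                           # Invalid Connections
    nc = (n_total - n_valid - n_invalid) / n_total     # No Connections
    ol = len(gt_voxels & traversed_voxels) / len(gt_voxels)   # Overlap
    orr = len(traversed_voxels - gt_voxels) / len(gt_voxels)  # Overreach
    precision = 1.0 - orr
    f1 = 2 * ol * precision / (ol + precision)         # harmonic mean
    return {"VC": vc, "IC": ic, "NC": nc, "OL": ol, "OR": orr, "F1": f1}
```

For example, 70 valid and 20 invalid streamlines out of 100, with 8 of 10 ground-truth voxels covered and 2 extra voxels traversed, give VC = 0.7, OL = 0.8, OR = 0.2 and F1 = 0.8.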
2.5 | HAMLET (2018)

To validate their method, Reisert et al. [29] used a dataset of 83 human subjects from two independent cohorts. The first cohort comprises 55 healthy volunteers, all scanned with a Siemens 3T TIM PRISMA MRI scanner. The second cohort has 28 volunteers scanned with a Siemens TIM TRIO. The first cohort was used for training while the second one was used for testing. Subjects in the second cohort were scanned twice for test-retest experiments, a characteristic unique to this dataset. The reference streamlines were obtained by first tracking the whole brain with global tractography, and then segmenting the streamlines into 12 bundles with a selection algorithm in MNI space. Unfortunately, the recovered streamlines have not been manually validated by an expert.
2.6 | Datasets based on the BIL&GIN database

Bundle-Specific Tractography (2018)

Rheault et al. proposed a bundle-specific tracking method based on anatomical priors that improves tracking in the centrum semiovale crossing regions [50]. Using multiple tractography algorithms, they tracked and segmented five bundles (Arcuate Fasciculus - AF left/right, Corpus Callosum - CC, Pyramidal Tracts - PyT left/right) in 39 subjects from the BIL&GIN database [ ]. To compare algorithms, they used an automatic bundle segmentation method based on clear anatomical definitions. In addition, they defined several performance metrics, such as bundle volume, ratio of valid streamlines, and efficiency. However, the tractograms and the automatic bundle segmentation procedure were neither made public nor validated by an expert. Such a dataset, along with its evaluation procedure, could be extremely useful to assess whether data-driven methods can reliably learn the structure of a specific bundle and reconstruct it in unseen subjects.

A population-based atlas of the human pyramidal tract (2018)

Chenot et al. created a streamline dataset of the left and right PyT based on a population of 410 subjects [49], also from the BIL&GIN database [ ]. To do so, they combined manual ROIs along the bundles' pathway and the bundle-specific tractography algorithm of Rheault et al. [50]. The quality of the segmentations and the high number of subjects would make this a noteworthy training dataset for data-driven methods. Unfortunately for ML methods, only two bundles were examined. Furthermore, while the probability maps of the atlas have been made public, the tractograms are still unavailable.
2.7 | Datasets based on the HCP database

TractSeg (2018)

Wasserthal et al. proposed a data-driven method for fast WM tract segmentation without tractography [30]. In doing so, they built an impressive dataset of 72 manually-validated bundles for 105 subjects from the Human Connectome Project (HCP) diffusion database [56, 57]. Tractograms were obtained via a four-step semi-automatic approach:

1. Tractography (Multi-Shell Multi-Tissue CSD [5])
2. Initial tract extraction (TractQuerier [58])
3. Tract refinement (manual ROIs [59] + QuickBundles [60])
4. Manual quality control and cleanup

To the best of our knowledge, this is the largest public database to include both diffusion data and reference streamlines. No further preprocessing of the diffusion data is needed because of the standard procedure of [ ]. The authors defined volume-oriented metrics such as the Dice score [ ], but did not offer any streamline-oriented metrics, as their method predicts a volume segmentation. The high number of subjects and bundles makes this a remarkable training set.

In a subsequent paper, the same authors re-used a subset of 20 bundles of the TractSeg dataset to train and validate their TOM ML algorithm [ ]. However, as for the original 72-bundle dataset, the TOM dataset does not come with a predefined set of training and testing data, and no formal evaluation protocol that users could rely on has been proposed.
Zhang et al. (2018)

Zhang et al. [39] built a WM fiber atlas using 100 HCP subjects. They first generated streamlines for all subjects using a two-tensor unscented Kalman filter method [ ], and sampled 10,000 streamlines from each subject after a tractography registration step. Then, using a hierarchical clustering method, the authors generated an initial WM fiber atlas of 800 clusters. Finally, an expert neuroanatomist reviewed the annotations in order to accept or reject each cluster, and provided the correct annotations when the initial annotation was rejected. The final, proposed atlas is comprised of 58 bundles (each composed of multiple clusters), along with "198 short and medium range superficial fiber clusters organized into 16 categories according to the brain lobes they connect" [39].

While the atlas is public, the sampled streamlines from the 100 subjects are all merged into the single template. In order for ML methods to benefit from this dataset, the streamlines would need to be separated back into the space of the original subjects. For this reason, we do not consider this dataset to be "public" in the context of machine learning.
3 | MACHINE LEARNING METHODS FOR TRACTOGRAPHY

For this review, we consider all supervised machine learning methods published in peer-reviewed journals, conferences, or on arXiv and bioRxiv. We added the requirement that methods needed to be specifically designed for tractography, i.e. with the purpose of predicting a contextual streamline direction (and not reconstructing a local, non-conditional fODF or clustering streamlines). This criterion includes whole-brain as well as bundle-specific tractography methods. A summary of the main properties of all reviewed methods is provided in Table 2.
TABLE 2  Main properties of data-driven methods for tractography.

Method                  Model     Temporal input     Spatial input   Signal         Prediction      Implicit stop
Neher et al. [18]       RF        1 last direction   50 samples      Resampled DWI  Classification       ✓
Poulin et al. [24]      GRU       Full               1x1x1 voxel     SH             Regression           –
Poulin et al. [26]      GRU       Full               1x1x1 voxel     SH             Regression           –
Benou et al. [25]       GRU       Full               1x1x1 voxel     Resampled DWI  Classification       ✓
Jörgens et al. [22]     MLP       2 last directions  1x1x1 voxel     Raw DWI        Regression           –
Wegmayr et al. [23]     MLP       4 last directions  3x3x3 voxels    SH             Regression           –
Wasserthal et al. [27]  CNN       N/A                Entire WM       fODF peaks     Regression           –
Reisert et al. [29]     CNN-like  N/A                Entire WM       SH             Regression           ✓

RF: Random Forest; MLP: Multilayer Perceptron; GRU: Gated Recurrent Unit; CNN: Convolutional Neural Network; SH: Spherical Harmonics coefficients; fODF: fiber Orientation Distribution Function; Implicit stop: indicates whether a method learns its tracking stopping criterion or relies on a usual explicit criterion.
Random Forest classier
To the best of our knowledge, Neher et al. were the first to propose a machine learning algorithm for (deterministic) tractography [21]. They employ an RF classifier to learn a mapping from raw diffusion measurements to a directional proposal for streamline continuation. After collecting several such proposals in a local neighborhood of the current streamline position (radius: 25% of the smallest side length of a voxel), these are aggregated in a voting scheme to finally arrive at a single direction in which to grow the streamline.
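This aggregation step can be sketched with a small numpy example (our own illustration, not the authors' implementation; the angular threshold, the weighting, and the helper name are assumptions):

```python
import numpy as np

def aggregate_votes(proposals, weights, prev_dir, max_angle_deg=60.0):
    """Aggregate directional proposals from neighborhood samples into one step.

    proposals : (n, 3) array of candidate unit directions
    weights   : (n,) confidence weights (e.g. classifier probabilities)
    prev_dir  : (3,) previous streamline direction (unit norm)
    Returns the weighted mean of the proposals whose angle to prev_dir is
    below the threshold, or None if none is admissible (stop tracking).
    """
    cos_thr = np.cos(np.deg2rad(max_angle_deg))
    # Flip proposals that point "backwards" relative to the previous direction.
    flip = np.sign(proposals @ prev_dir)
    flip[flip == 0] = 1.0
    oriented = proposals * flip[:, None]
    admissible = (oriented @ prev_dir) >= cos_thr
    if not np.any(admissible):
        return None
    mean = np.average(oriented[admissible], axis=0, weights=weights[admissible])
    return mean / np.linalg.norm(mean)
```

Returning None when no proposal is admissible acts here as a simple curvature-based stopping rule.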
To define reference streamlines for their experiments, the authors employ several tractography pipelines and train their classifier on each of the resulting tractograms. They determine the best trained model by evaluating the performance of each on a replication of the FiberCup phantom (based on the Tractometer metrics of [16]). Finally, comparing the performance of the latter to all other reference pipelines, they report a superior performance of their tracking model over all other approaches. While tractograms were scored on a simulated phantom (i.e. no real anatomy), extended experiments presented in a subsequent paper [18] confirm the superiority of their approach on the 2015 ISMRM Tractography Challenge dataset (simulated data of a human anatomy).
Gated Recurrent Unit (GRU) Tracking
Hypothesizing “that there are high-order dependencies between” the local orientation at a point of a streamline and the orientations at all other points on the same streamline, Poulin et al. [24] proposed a recurrent neural network (RNN) based on a GRU to learn the tracking process. Their method implements an implicit model mapping diffusion measurements to local streamline orientations which depends not only on measurements in a local context, but on all data previously seen along the extent of a particular streamline. As opposed to [18], the RNN model is implemented as a regression approach. In their experiments, the authors show that a recurrent model (when trained on reference streamlines obtained using deterministic CSD-based tractography) was able to outperform most of the original submissions to the 2015 ISMRM Tractography Challenge with respect to the Tractometer scores (discussed in section 2.4).
In a subsequent paper, Poulin et al. [26] again suggested using a GRU, but in a bundle-specific fashion. While the model architecture is very similar, it was trained on a dataset of 37 real subjects, each with a curated set of streamlines for several bundles. After training a single model for each of the selected bundles, the authors showed promising results compared to existing methods, perhaps indicative that the difficult task of learning to track streamlines necessitates more data than previously thought.
More recently, Benou and Riklin-Raviv [25] proposed a GRU-based recurrent neural network similar to that of Poulin et al. [24]. In their method, they directly use the resampled diffusion signal as input to the model (like Neher et al. [18]) in order to estimate a discrete, streamline-specific fODF representation which they refer to as “conditional fODF” (CfODF). Instead of predicting a 3D orientation vector using a regression approach, the authors implement their model as a classifier, enabling them to interpret the probabilities obtained for discretely sampled directions (i.e. the classes) as the mentioned CfODF. This fODF-based formulation further allows for an inherently defined criterion for streamline termination based on the entropy of the CfODF. The proposed model can be employed for both deterministic and probabilistic tractography.
Like Poulin et al. [24], the authors trained and tested their method on the 2015 ISMRM Tractography Challenge dataset. They report results after training their method on the dataset ground truth as well as on streamlines obtained with the MITK diffusion tool [63].
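The entropy-based termination idea can be illustrated with a short sketch (our own minimal example, not the DeepTract code; the threshold convention and function name are assumptions):

```python
import numpy as np

def entropy_stop(class_probs, frac=0.9):
    """Decide whether to stop tracking from a categorical direction distribution.

    class_probs : (k,) probabilities over k discretized directions (a
                  "conditional fODF"). Tracking stops when the Shannon
                  entropy exceeds `frac` times the maximum possible entropy
                  log(k), i.e. when the distribution is close to uniform
                  and the next direction is essentially undetermined.
    """
    p = np.clip(np.asarray(class_probs, dtype=float), 1e-12, 1.0)
    p = p / p.sum()
    entropy = -np.sum(p * np.log(p))
    return entropy > frac * np.log(len(p))
```

A sharply peaked distribution (one confident direction) keeps tracking going, while a near-uniform one triggers termination.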
Multi-Layer Perceptron Point-Wise Prediction
Jörgens et al. [22] propose a multi-layer perceptron (MLP) to predict the next step of a streamline. Like [18], their method takes the diffusion signal as input and thus avoids explicit dMRI model-fitting. The authors implemented different configurations of their proposed MLP, such as three different input scenarios (point-wise input vs. region-wise input, with and without considering previous orientations), different approaches to aggregate the output (maximum likelihood, mathematical expectation of the categorical prediction, and regression), as well as the voting scheme proposed by Neher et al. [18]. Results reveal that the best configurations are those having the previous two directions included in the input of the network, thus showing that temporal context is a key component of data-driven tractography. Also, the regression and classification approaches led to similar results, and the use of region-wise information did not provide any substantial improvement over the use of point-wise information.
Like Poulin et al. [24] and Benou and Riklin-Raviv [25], the authors trained and tested their method on the 2015 ISMRM Tractography Challenge dataset (but did not use the Tractometer tool). Unfortunately, they did not estimate the tracking capabilities of their method, as they only measured point-wise angular errors when predicting the next step of a streamline.
Multi-Layer Perceptron Regression Tracking
A similar approach suggested by Wegmayr et al. [23] employs an MLP to predict the next direction of a streamline through regression. At each point, the input of the model is given by all diffusion measurements in a cubic neighborhood, along with a certain number of previous steps of the current streamline. In that way, the authors directly provide the ML model with diffusion information in a local neighborhood (spatial context) as well as a notion of “history” of the current streamline (temporal context). Defining their reference streamlines as tractograms obtained with a standard tractography method from in vivo datasets, they train their model on three subjects from the HCP database. Experimental validations on the 2015 ISMRM Tractography Challenge dataset reveal that their model outperforms earlier ML methods [18, 24] in most Tractometer metrics. However, as demonstrated by low overlap scores, the authors acknowledge that their model produces “rather confined bundles with little spread”, especially in contrast to [24]. While the strength of this model is to explicitly provide information from a local neighborhood, like for Jörgens et al. [22], the notion of context along the streamline is limited and needs to be defined before training. Since the ideal temporal context (in terms of streamline length, or steps) is still unknown, this could potentially prohibit the model from taking advantage of all information relevant to streamline continuation.
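The fixed spatial and temporal context described above can be sketched as an input-construction step (a hypothetical helper, not the authors' code; the array layout, zero-padding, and neighborhood handling are assumptions):

```python
import numpy as np

def build_mlp_input(dwi, voxel_idx, prev_steps, k=4):
    """Build the input vector for a point-wise MLP tracking step.

    dwi       : (X, Y, Z, C) diffusion volume (C measurements per voxel)
    voxel_idx : (3,) integer coordinates of the current point; assumed to be
                an interior voxel so the 3x3x3 neighborhood is in bounds
    prev_steps: list of past unit step directions, most recent last
    k         : number of previous steps to include (zero-padded if fewer)
    Returns a 1-D vector: 3x3x3 neighborhood values followed by k directions.
    """
    x, y, z = voxel_idx
    patch = dwi[x-1:x+2, y-1:y+2, z-1:z+2, :]   # spatial context
    history = np.zeros((k, 3))                  # temporal context, zero-padded
    recent = prev_steps[-k:]
    if recent:
        history[-len(recent):] = recent
    return np.concatenate([patch.ravel(), history.ravel()])
```

The fixed value of k is exactly the design choice criticized above: the temporal context must be chosen before training.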
Tract orientation mapping using an encoder-decoder CNN
Wasserthal et al. [27] proposed a data-driven, bundle-specific tracking method. As opposed to the other ML methods reported in this paper, the authors do not try to directly reconstruct streamlines per se. Instead, their proposed Tract Orientation Mapping (TOM) method predicts bundle-specific fODF peaks that are then used by a deterministic tracking method. First, CSD is used to extract three principal directions in all WM voxels. Then, a U-Net CNN is trained to map these fODF peaks to bundle-specific peaks, i.e. peaks that are only relevant for the streamlines of a given bundle. Their CNN takes as input 9 channels (the three fODF peaks) and outputs 60 channels, i.e. a 3D bundle-specific fODF vector for each of the 20 bundles they are looking to recover. While the recovered bundle-specific peaks can be used in different ways, the authors show that using them directly as input to deterministic MITK diffusion tractography gives some of the best results. The approach was trained and tested on 105 HCP subjects, each with reference streamlines produced by a semi-automatic dissection of 20 large WM bundles (which they recently made public [30]).
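The 60-channel output can be illustrated by how such a prediction would be split back into 20 per-bundle peak maps (a hypothetical reshaping helper; the channel ordering is an assumption):

```python
import numpy as np

def split_tom_output(output, n_bundles=20):
    """Reshape a TOM-style network output into per-bundle peak maps.

    output : (X, Y, Z, 3 * n_bundles) array, three peak components per bundle
    Returns a dict {bundle_index: (X, Y, Z, 3) peak field} that could serve
    as input to a deterministic tracking algorithm.
    """
    *spatial, c = output.shape
    assert c == 3 * n_bundles
    peaks = output.reshape(*spatial, n_bundles, 3)
    return {b: peaks[..., b, :] for b in range(n_bundles)}
```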
In a similar line of thought, in their HAMLET project (Hierarchical Harmonic Filters for Learning Tracts from Diffusion MRI), Reisert et al. [29] map raw spherical harmonics of order 2 to a spherical tensor field. In that sense, like Wasserthal et al. [27], their ML method does not output streamlines but instead voxel-wise bundle-specific tensors that can subsequently be used as input to a classical tractography method. The magnitude of the produced tensor indicates the presence of a specific bundle, whereas the tensor orientation predicts the local streamline direction. Their method implements a multi-resolution CNN with rotation-covariant convolution operations. They trained and tested their method on two in-house datasets comprising a total of 83 human subjects. The 12 bundles and their associated reference streamlines were obtained with global tractography and an automatic bundle selection method. Unfortunately, the reference data was not manually validated by a human expert, and they did not perform any comparisons against other tractography methods.
4.1 | Results on the 2015 ISMRM Tractography Challenge

The 2015 ISMRM Tractography Challenge is the only dataset that has been used to assess the performance of several data-driven tractography methods and is thus, as of today, the only available common ground on which to compare methods. It was used by four different papers, namely the Random-Forest of Neher et al. [18], the GRU of Poulin et al. [24], the GRU (DeepTract) of Benou and Riklin-Raviv [25], and the MLP of Wegmayr et al. [23]. Experimental results reported by the authors have been transcribed in Table 3 and compared with the original submissions in Figure 2. Note that the metrics marked as not available (N/A) are those the authors did not report in their original paper.
TABLE 3  Tractometer results. Valid bundles (VB), invalid bundles (IB) and the connection rates (VC, IC, NC) are streamline-oriented metrics, whereas the average-bundle overlap, overreach and F1-score are volume-oriented metrics.

Model                | VB | IB  | VC (%) | IC (%) | NC (%) | Overlap (%) | Overreach (%) | F1-score (%)
Random-Forest [18]   | 23 | 94  | 52     | N/A    | N/A    | 59          | 37            | 61
GRU [24]             | 23 | 130 | 42     | 46     | 13     | 64          | 35            | 65
MLP [23]             | 23 | 57  | 72     | N/A    | N/A    | 16          | 28            | 26
GRU (DeepTract) [25] | 23 | 51  | 41     | 33     | 23     | 34          | 17            | 44

FIGURE 2  2015 ISMRM Tractography Challenge original submissions (1-20) and new results (21-24).

As can be seen, results vary a lot and there is no clear trend showing which method performs best, especially given the nature of the evaluation metrics. As mentioned in section 2.4, methods can be evaluated using both streamline-oriented and volume-oriented metrics, which are not always correlated. For example, a method may have a large number of valid connections but a low overlap (like the MLP of Wegmayr et al.), which means that although the model was able to recover most valid bundles, the generated streamlines do not properly cover the spatial extent of those bundles. Also, a method can be more conservative and score best in terms of invalid connections and overreach, like the GRU of Benou and Riklin-Raviv, but at the same time have a low ratio of valid connections and a poor bundle overlap. On the other hand, the Random-Forest of Neher et al. does not score best in any category, but is competitive according to all metrics (its large F1-score underlines that it is a more balanced method compared to the MLP and DeepTract). On top of that, all methods were trained and evaluated differently, so any comparison based on the reported results should be done with extreme care.
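For reference, the volume-oriented scores in Table 3 can be computed from binary bundle masks roughly as follows (a sketch using one common convention; the exact Tractometer definitions may differ in details such as the overreach normalization):

```python
import numpy as np

def volume_metrics(pred_mask, gt_mask):
    """Volume-oriented bundle scores from binary voxel masks.

    overlap  : fraction of ground-truth voxels reached by the reconstruction
    overreach: voxels reconstructed outside the ground truth, relative to
               the ground-truth volume (one common convention)
    f1       : Dice overlap between the two masks
    """
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    tp = np.sum(pred & gt)   # true-positive voxels
    fp = np.sum(pred & ~gt)  # voxels outside the ground truth
    overlap = tp / gt.sum()
    overreach = fp / gt.sum()
    f1 = 2 * tp / (pred.sum() + gt.sum())
    return overlap, overreach, f1
```

The example makes the trade-off above concrete: a conservative tracker lowers overreach at the cost of overlap, and the F1-score balances the two.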
TABLE 4  Differences in data used by the four evaluated methods.

Method             | Preprocessing           | WM mask      | Training subjects | Reference streamlines
Random-Forest [18] | dwidenoise + dwipreproc | Not needed   | 5 HCP subjects    | CSD (Deterministic)
GRU [24]           | None                    | Ground truth | Challenge subject | CSD (Deterministic)
MLP [23]           | dwipreproc              | N/A          | 3 HCP subjects    | iFOD (Probabilistic)
DeepTract [25]     | N/A                     | Not needed   | Challenge subject | Q-Ball (Probabilistic)
4.2 | The 2015 ISMRM Tractography Challenge as an evaluation tool for ML algorithms

As mentioned before, the 2015 ISMRM Tractography Challenge has been adopted as the de facto evaluation tool to compare ML tractography methods. However, the strengths and weaknesses of that tool should be thoroughly reviewed to understand and trust any technique reporting results with it. In this section, we present what we consider to be important issues with the way in which this tool has been used to assess the performance of data-driven methods. In particular, we detail the discrepancies between the four ML-based methods, differences that may explain some of the results in Table 3 and potentially undermine any conclusion that one could draw from it. Let us mention that some of these issues with the 2015 ISMRM dataset are typical for the field of tractography as a whole.

Table 4 presents a summary of the differences in how the tool is used. Note that the not available (N/A) mark is used for any information the authors did not mention in their original paper.
Dissimilar inputs

The four ML methods use different preprocessing pipelines. Among the proposed algorithms, two applied MRtrix's dwidenoise or dwipreproc [6], another one applied denoising and corrected for eddy currents and head motion, and another one did not apply any preprocessing at all. Moreover, some used the diffusion signal directly as input, while others resampled it to a specific number of gradient directions. In some cases, spherical harmonics were fitted to the signal and the SH coefficients were fed as input to the model. Finally, the non-recurrent models are also given a variable number of previous streamline directions as input.

The output of each of these pipelines contains various degrees of information. For example, fODF peaks are in theory already aligned with the major WM pathways, and information may be lost depending on the specific model used to recover the peaks from the diffusion signal. On the other hand, the raw diffusion signal might contain more information but is more difficult to understand and process, and thus a data-driven model might require more capacity to use such an input. Without a thorough investigation of the information contained in each output, any variations in the Tractometer results could be attributed to the variations in preprocessing. Since we currently do not have any indication of what is useful for data-driven algorithms, it is impossible to compare ML methods if they do not use the same input data.
Varying test environment

Since no white matter mask is provided, it must be computed by each participant in case it is needed for tracking. Out of the four ML methods that were evaluated on the challenge, two needed WM masks; one used the ground truth mask, and the other did not mention how the mask was computed. Furthermore, since no tracking seeds are supplied with the data either, their arrangement entirely depends on the WM mask (and on the number of seeds per voxel, which is also not specified).

Given the nature of streamline tractography, small variations of the tracking mask or the tracking seeds could have a substantial impact on the resulting streamlines, and by that also on the obtained evaluation metrics. Also, even though computing a stopping criterion within the algorithm is a worthy improvement, it is a different task than tracking and should be evaluated separately. Consequently, all methods should be provided with the same tracking mask and seeds to reduce as much as possible the number of free variables during evaluation.
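To illustrate how much the test environment depends on these choices, seed placement from a binary mask can be sketched as follows (a generic example, not tied to any of the reviewed methods; the seeding density and helper name are arbitrary):

```python
import numpy as np

def seeds_from_mask(mask, seeds_per_voxel=2, seed=0):
    """Place random tracking seeds inside a binary white matter mask.

    mask : (X, Y, Z) boolean array
    Returns an (n, 3) array of continuous voxel coordinates, with
    `seeds_per_voxel` seeds drawn uniformly inside each mask voxel.
    """
    rng = np.random.default_rng(seed)
    voxels = np.argwhere(mask)                         # (m, 3) voxel indices
    reps = np.repeat(voxels, seeds_per_voxel, axis=0)
    return reps + rng.random(reps.shape)               # uniform offset in [0, 1)
```

Two groups using different masks or densities in this step already start from different seed sets, before any model is even involved.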
Data contamination

The use of ML methods requires special care when dealing with available data. Since machine learning models are obtained by deriving implicit rules directly from given data (i.e. training data), testing the true generalization capabilities of these rules must be done using a different and unseen set of data (i.e. test data).

Two methods suffer from data contamination, or leakage: the GRU in [24] and the MLP in [22]. Here, data contamination refers to the usage of the same diffusion data for training and testing. This means that the true generalization capabilities of the tested method on new, thus unseen, subjects are still unknown, since the model has already seen the specific diffusion patterns that are needed in order to “explore” at test time, and has therefore been given an “unfair” advantage.
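Avoiding such leakage amounts to splitting at the subject level rather than at the streamline level, e.g. (a minimal sketch with a hypothetical helper name):

```python
import random

def subject_wise_split(streamlines_by_subject, test_fraction=0.2, seed=0):
    """Split a dataset at the subject level to avoid data contamination.

    streamlines_by_subject : dict {subject_id: list of streamlines}
    All streamlines of a given subject end up on the same side of the
    split, so the test diffusion data is never seen during training.
    """
    ids = sorted(streamlines_by_subject)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    test_ids = ids[:n_test]
    train = {s: streamlines_by_subject[s] for s in ids[n_test:]}
    test = {s: streamlines_by_subject[s] for s in test_ids}
    return train, test
```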
Disparate training data

All methods used different reference streamlines and subjects for training. As mentioned earlier, some employed the test diffusion data directly, while others relied on a varying number of subjects from the HCP database. Two methods used deterministic CSD tracking to generate reference streamlines, one used probabilistic Q-Ball tracking, and the last one used iFOD tracking [5] (also probabilistic). In order to provide a uniform basis for comparison, the same comprehensive streamline training set should be available to every algorithm.
Simulation as a substitute for human acquisition

While the diffusion signal of the 2015 ISMRM dataset is typical of that of a human brain, it is nonetheless obtained through simulation. As such, results on that dataset should not be seen as a measure of future performance on real human subjects, at least not without further empirical evaluation. Furthermore, at the given resolution and using this particular configuration of 25 bundles, false positive streamlines that would otherwise be plausible given the underlying anatomy of a real scan might be impossible to avoid. Indeed, some authors tried training their models using the ground truth bundles, and still produced over 50 invalid bundles in both cases [18, 25].
Small sample size

The 2015 ISMRM Tractography Challenge dataset has only one subject, which makes it hard to assess the future performance of a data-driven algorithm. In order to compute unbiased estimates of future performance, a richer test set with more subjects is needed. Also, given more subjects, bootstrapping methods (i.e. sampling with replacement) could help to build more accurate estimators.
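With a multi-subject test set, a percentile bootstrap over per-subject scores could look like this (a generic sketch, not tied to any of the reviewed methods):

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for a mean per-subject score.

    scores : per-subject evaluation metric (e.g. bundle overlap)
    Resamples subjects with replacement and returns the (alpha/2,
    1 - alpha/2) percentile interval of the resampled means.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    return tuple(np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)]))
```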
4.3 | Other results

Some authors report local performance measures, such as the mean angular error [22]. However, local metrics do not take into account compounding errors, which can have a major effect on global structure. Consequently, global evaluation metrics should be preferred.
Tractography papers often report a visual evaluation on unseen, in vivo subjects as a qualitative evaluation. For example, Figures 3 and 4 compare some of the proposed data-driven approaches with standard tractography methods on white matter bundles with known anatomy. However, in the absence of a ground truth or the expertise of a neuroanatomist, it is hard to draw definitive conclusions on the quality of such results. In addition, Reisert et al. [29] presented correlation plots to assess reproducibility, but only offered qualitative comparisons with the reference streamlines without any quantitative results. To gain trust in these data-driven methods, a more rigorous approach is needed.

FIGURE 3  Comparison between the RF of Neher et al. (top row) and classical deterministic CSD streamline tractography (bottom row). Results obtained on HCP subject 992774. (Taken from [18] with authorization from the authors.)

FIGURE 4  Comparison of various tracking methods: A: Deterministic, B: Deterministic Bundle-Specific (DET-BST) [50], C: Probabilistic particle filter BST (PROB-PF-BST) [17], D: DeepTracker [26]. Results obtained on a BIL&GIN subject. (Taken from [26] with authorization from the authors.)
Finally, most ML methods offer a reduction in computation time compared to traditional methods. This is a non-negligible benefit, should these methods be adopted in practice.
4.4 |Proposed guidelines for a data-driven tractography evaluation framework
Considering the ML tractography evaluation issues previously underlined, we discuss in this section the fundamental
elements of a better framework we believe the community should adopt in the upcoming years. We start with the
essential characteristics such a framework should have, followed with useful features.
Essential characteristics

First and foremost, an ideal data-driven tractography evaluation framework should come with a public and free-to-use dataset that anyone could easily rely on. The dataset should include images of real human acquisitions along with a careful expert selection of ground truth streamlines. It is important to avoid any bias towards a specific tractography algorithm. In order to achieve this, the streamlines could first be generated by a large number of different (and ideally orthogonal) deterministic, probabilistic and global algorithms, and then segmented by expert annotators according to strict anatomical definitions for a given number of bundles. While such manual annotation would be tedious, time consuming and even error prone, we consider this an indispensable step towards building a realistic and useful dataset for ML-based development. The need for such a gold standard that quantifies human variability is well known in other fields, such as automatic image segmentation, cell counting or machine learning. Despite the fact that simulated brain images come with a pixel-accurate set of ground truth streamlines that can be generated in a matter of seconds, by definition synthetic diffusion signals are over-simplistic pictures of real data and, as such, cannot provide any guarantee of subsequent performance for data-driven methods on real data.
Although there is no consensus regarding the most desirable features an ML tractography algorithm should have and how it should be evaluated, by its very nature, any ML evaluation framework should aim at measuring how faithfully an algorithm can reproduce the task it was trained for. As such, a reasonable dataset should include a sufficiently large number of well-separated training and testing images. Thus, statistics resulting from such a dataset would not suffer from contamination, and the reported metrics would be reliable and unbiased estimates of the true generalization power of an ML algorithm. In addition, to ensure that the observed differences between multiple algorithms result from the intrinsic properties of the models and are not caused by some feature of the evaluation framework, the number of free variables should be reduced to a minimum. Consequently, the tracking masks and seeds should be provided together with clearly preprocessed diffusion data, so that the proposed methods can be evaluated in equal conditions. There should be multiple "classes" of input data, depending on whether an algorithm supports DWI samples, SH coefficients or fODF peaks. Furthermore, the initial diffusion signal should have the same statistical properties for the training and the testing set. Finally, the images should ideally be acquired on different MRI scanners with different acquisition protocols in order to avoid overfitting issues.
Evaluation metrics should also be bound to the purpose of tractography algorithms. Considering that tractography is mostly used for bundle reconstruction, tractometry studies and connectivity analyses, an ideal evaluation framework should include two sets of metrics: 1) metrics measuring how faithfully an ML method can reproduce a set of predefined bundles it was trained to recover (tractometry), and 2) metrics measuring how well it can connect matching regions of the brain, i.e. produce valid connections (connectivity). Furthermore, since many applications use tractography algorithms to produce a large number of streamlines (with many false positives), which are then filtered out by a post-processing algorithm such as RecoBundles, the framework should report results before and after post-processing. This would underline the true recall power of a data-driven algorithm, which is a fundamental characteristic of tract-based and connectivity-based applications [15].
Lastly, the size of an ideal dataset is of primary importance. While a small dataset could be prone to overfitting, it would be costly to create a very large dataset, and difficult to ensure a coherent manual annotation. One rule of thumb that can be used to identify the "correct" size of a dataset is the inspection of the learning curves of several ML models. These curves show model performance as a function of the training sample size. Typically, the performance of several models saturates for a sufficient dataset size. Although imperfect, this procedure is a good heuristic for estimating the size of the dataset.
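A simple version of this heuristic, assuming validation scores have already been measured at several training sizes (the tolerance value and helper name are arbitrary choices):

```python
import numpy as np

def saturation_size(train_sizes, scores, tol=0.01):
    """Pick a dataset size at which the learning curve saturates.

    train_sizes : increasing training-set sizes
    scores      : validation score measured at each size
    Returns the first size after which the score never improves by more
    than `tol`, or the largest size if no plateau is reached.
    """
    scores = np.asarray(scores, dtype=float)
    for i in range(len(scores) - 1):
        if np.all(np.diff(scores[i:]) <= tol):
            return train_sizes[i]
    return train_sizes[-1]
```

Running this for several candidate models and taking the largest returned size would give a rough target for the dataset.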
Other useful features

Despite any thorough manual annotation protocol, manually annotated bundles can be subject to non-negligible inter-rater and intra-rater variability. As such, a useful characteristic of an ML tractography dataset would be a measure of those variations. This could be obtained by having several experts annotate the dataset, with at least one expert annotating it two or more times. Such measures would provide a minimal bound beyond which a data-driven algorithm could be considered "as good as an expert". Another very useful tool would be an openly accessible online evaluation system. Given such a system, people could upload their test results in order to compare them with the test ground truth. In that way, an automatic ranking procedure similar to that of Kaggle could be used to sort various ML algorithms based on their achieved scores. While no ranking method is perfect, it would nonetheless provide a common evaluation framework that people could rely on.
An ideal dataset would also cover the whole field of diffusion MRI acquisition protocols, from HCP-like research acquisitions to clinical acquisitions. It would include single b-value as well as multiple b-value data, along with more sophisticated acquisition protocols such as b-tensor encoding. It would also need low-resolution images together with high-resolution images. Since data harmonization is also a problem for data-driven algorithms, acquisitions from several sites are needed for test-retest studies. Annotated pathological cases would complete the dataset by allowing careful preliminary studies on how ML-based methods can be relied on in unhealthy patients.

Finally, since tractography is used more and more in pre-clinical applications, a subset of manually annotated rodent or macaque brains would be of great interest to train and test future ML algorithms (like the 2018 VOTEM Challenge, for example).
This is, of course, the ultimate wish list. But, in the era of open data and open science, it needs to be done by
the community, for the community. We can already see this work in progress with more and more accessible and
reproducible data being published every year.
In this paper, we provided an exhaustive review of the current state of the art of machine learning methods in the field of tractography. We described the existing datasets that comprise both diffusion data and reference streamlines, which could generally be useful for new tracking methods based on ML. In particular, we thoroughly examined the widely used evaluation tool for data-driven tracking methods, the 2015 ISMRM Tractography Challenge, and detailed its flaws and shortcomings when used to assess data-driven algorithms. Based on our findings, we suggested good practices that we believe would foster the development of a new evaluation framework for ML-based tractography methods, with the potential to effectively advance this field of research.

There is no doubt that machine learning tractography will have an important role to play in the future in solving some of the open problems of tractography. At the moment, however, all existing methods have shown only theoretical potential, and in limited test cases. Methods have yet to make solid demonstrations of their performance and efficiency in practice. There is still no ML-based tractography tool that is scalable and usable on any given diffusion MRI dataset. This is true for healthy datasets, but even more so for pathological brains. Hence, it is fair to say that ML-based tractography is still in its infancy and not ready for "prime time", but it is nonetheless a very fertile field of research in which to make meaningful contributions to the field of connectivity mapping.
[1] Jeurissen B, Descoteaux M, Mori S, Leemans A. Diffusion MRI fiber tractography of the brain. NMR in Biomedicine
2017;p. e3785.
[2] Yeh FC, Verstynen TD, Wang Y, Fernández-Miranda JC, Tseng WYI. Deterministic diffusion fiber tracking improved by
quantitative anisotropy. PloS one 2013;8(11):e80713.
[3] Basser PJ, Pajevic S, Pierpaoli C, Duda J, Aldroubi A. In vivo ber tractography using DT-MRI data. Magnetic resonance
in medicine 2000;44(4):625–632.
[4] Behrens TE, Berg HJ, Jbabdi S, Rushworth MF, Woolrich MW. Probabilistic diffusion tractography with multiple fibre
orientations: What can we gain? Neuroimage 2007;34(1):144–155.
[5] Tournier JD, Calamante F, Connelly A. Improved probabilistic streamlines tractography by 2nd order integration over
fibre orientation distributions. In: Proceedings of the international society for magnetic resonance in medicine, vol. 18;
2010. p. 1670.
[6] Tournier JD, Calamante F, Connelly A. MRtrix: diffusion tractography in crossing fiber regions. International Journal of
Imaging Systems and Technology 2012;22(1):53–66.
[7] Reisert M, Mader I, Anastasopoulos C, Weigel M, Schnell S, Kiselev V. Global fiber reconstruction becomes practical.
Neuroimage 2011;54(2):955–962.
[8] Mangin JF, Fillard P, Cointepas Y, Le Bihan D, Frouin V, Poupon C. Toward global tractography. Neuroimage 2013;80:290–
[9] Jbabdi S, Woolrich MW, Andersson JL, Behrens T. A Bayesian framework for global tractography. Neuroimage
[10] Pierpaoli C, Jezzard P, Basser PJ, Barnett A, Di Chiro G. Diffusion tensor MR imaging of the human brain. Radiology
[11] Caan MW, Khedoe HG, Poot DH, Arjan J, Olabarriaga SD, Grimbergen KA, et al. Estimation of diffusion properties in
crossing fiber bundles. IEEE transactions on medical imaging 2010;29(8):1504–1515.
[12] Tournier JD, Calamante F, Gadian DG, Connelly A. Direct estimation of the fiber orientation density function from
diffusion-weighted MRI data using spherical deconvolution. NeuroImage 2004;23(3):1176–1185.
[13] Descoteaux M, Deriche R, Knösche TR, Anwander A. Deterministic and probabilistic tractography based on complex
fibre orientation distributions. IEEE transactions on medical imaging 2009 feb;28(2):269–86.
[14] Schilling KG, Daducci A, Maier-Hein K, Poupon C, Houde JC, Nath V, et al. Challenges in diffusion MRI tractography–
Lessons learned from international benchmark competitions. Magnetic resonance imaging 2018;.
[15] Maier-Hein KH, Neher PF, Houde JC, Côté MA, Garyfallidis E, Zhong J, et al. The challenge of mapping the human con-
nectome based on diffusion tractography. Nature communications 2017;8(1):1349.
[16] Côté MA, Girard G, Boré A, Garyfallidis E, Houde JC, Descoteaux M. Tractometer: towards validation of tractography
pipelines. Medical image analysis 2013;17(7):844–857.
[17] Girard G, Whittingstall K, Deriche R, Descoteaux M. Towards quantitative connectivity analysis: reducing tractography
biases. Neuroimage 2014;98:266–278.
[18] Neher PF, Côté MA, Houde JC, Descoteaux M, Maier-Hein KH. Fiber tractography using machine learning. NeuroImage
2017 sep;158:417–429.
[19] Duru DG, Ozkan M. SOM Based Diffusion Tensor MR Analysis. In: Image and Signal Processing and Analysis, 2007. ISPA
2007. 5th International Symposium on IEEE; 2007. p. 403–406.
[20] Duru DG, Ozkan M. Self-organizing maps for brain tractography in MRI. In: Neural Engineering (NER), 2013 6th Interna-
tional IEEE/EMBS Conference on IEEE; 2013. p. 1509–1512.
[21] Neher PF, Götz M, Norajitra T, Weber C, Maier-Hein KH. A Machine Learning Based Approach to Fiber Tractography
Using Classifier Voting. Springer, Cham; 2015. p. 45–52.
[22] Jörgens D, Smedby Ö, Moreno R. Learning a Single Step of Streamline Tractography Based on Neural Networks. In:
Computational Diffusion MRI. Springer; 2018. p. 103–116.
[23] Wegmayr V, Giuliari G, Holdener S, Buhmann J. Data-driven fiber tractography with neural networks. In: 2018 IEEE 15th
International Symposium on Biomedical Imaging (ISBI 2018) IEEE; 2018. p. 1030–1033.
[24] Poulin P, Côté MA, Houde JC, Petit L, Neher PF, Maier-Hein KH, et al. Learn to Track: Deep Learning for Tractography.
Springer, Cham; 2017. p. 540–547.
[25] Benou I, Riklin-Raviv T. DeepTract: A Probabilistic Deep Learning Framework for White Matter Fiber Tractography. arXiv
preprint arXiv:1812.05129 2018;.
[26] Poulin P, Rheault F, St-Onge E, Jodoin PM, Descoteaux M. Bundle-Wise Deep Tracker: Learning to track bundle-specific
streamline paths. In: Proceedings of the International Society for Magnetic Resonance in Medicine ISMRM-ESMRMB;
2018.
[27] Wasserthal J, Neher PF, Maier-Hein KH. Tract orientation mapping for bundle-specific tractography. In: International
Conference on Medical Image Computing and Computer-Assisted Intervention Springer; 2018. p. 36–44.
[28] Lucena OASd, Deep Learning for Brain Analysis in MR Imaging. São Paulo, Brazil: [sn]; 2018. http://repositorio.
[29] Reisert M, Coenen VA, Kaller C, Egger K, Skibbe H. HAMLET: Hierarchical Harmonic Filters for Learning Tracts from
Diffusion MRI. arXiv preprint arXiv:1807.01068 2018;.
[30] Wasserthal J, Neher P, Maier-Hein KH. TractSeg - Fast and accurate white matter tract segmentation. NeuroImage
[31] Koppers S, Merhof D. Direct Estimation of Fiber Orientations Using Deep Learning in Diffusion Imaging. Springer, Cham;
2016. p. 53–60.
[32] Ngattai Lam PD, Belhomme G, Ferrall J, Patterson B, Styner M, Prieto JC. TRAFIC: Fiber Tract Classification Using Deep
Learning. Proceedings of SPIE–the International Society for Optical Engineering 2018 feb;10574.
[33] Patil SM, Nigam A, Bhavsar A, Chattopadhyay C. Siamese LSTM based Fiber Structural Similarity Network (FS2Net) for
Rotation Invariant Brain Tractography Segmentation; 2017.
[34] Gupta V, Thomopoulos SI, Rashid FM, Thompson PM. FiberNET: An Ensemble Deep Learning Framework for Clustering
White Matter Fibers. Springer, Cham; 2017. p. 548–555.
[35] Gupta V, Thomopoulos SI, Corbin CK, Rashid F, Thompson PM. FIBERNET 2.0: An automatic neural network based tool
for clustering white matter fibers in the brain. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI
2018) IEEE; 2018. p. 708–711.
[36] Thomas C, Ye FQ, Irfanoglu MO, Modi P, Saleem KS, Leopold DA, et al. Anatomical accuracy of brain connections derived from diffusion MRI tractography is inherently limited. Proceedings of the National Academy of Sciences
[37] Pujol S, Wells W, Pierpaoli C, Brun C, Gee J, Cheng G, et al. The DTI challenge: toward standardized evaluation of diffu-
sion tensor imaging tractography for neurosurgery. Journal of Neuroimaging 2015;25(6):875–882.
[38] Yeh FC, Panesar S, Fernandes D, Meola A, Yoshino M, Fernandez-Miranda JC, et al. Population-averaged atlas of the
macroscale human structural connectome and its network topology. NeuroImage 2018;178:57–68.
[39] Zhang F, Wu Y, Norton I, Rigolo L, Rathi Y, Makris N, et al. An anatomically curated fiber clustering white matter atlas for
consistent white matter tract parcellation across the lifespan. NeuroImage 2018;.
[40] Thon A, Teichgräber U, Tennstedt-Schenk C, Hadjidemetriou S, Winzler S, Malich A, et al. Computer aided detection in
prostate cancer diagnostics: A promising alternative to biopsy? A retrospective study from 104 lesions with histological
ground truth. PLOS ONE 2017 10;12(10):1–21.
[41] Clinic C, Alzheimer’s Disease: Overview of Diagnostic Tests; 2014. [Online; accessed 3-January-2019]. https://my. disease-overview-of-diagnostic-tests/.
[42] Bernard O, Bosch JG, Heyde B, Alessandrini M, Barbosa D, Camarasu-Pop S, et al. Standardized evaluation system for left
ventricular segmentation algorithms in 3D echocardiography. IEEE transactions on medical imaging 2016;35(4):967–
[43] Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, et al. The multimodal brain tumor image segmenta-
tion benchmark (BRATS). IEEE transactions on medical imaging 2015;34(10):1993.
[44] Fillard P, Descoteaux M, Goh A, Gouttard S, Jeurissen B, Malcolm J, et al. Quantitative evaluation of 10 tractography
algorithms on a realistic diffusion MR phantom. Neuroimage 2011;56(1):220–234.
[45] Wilkins B, Lee N, Singh M. Development and evaluation of a simulated FiberCup phantom. In: International Symposium
on Magnetic Resonance in Medicine (ISMRM’12); 2012. p. 1938.
[46] Yendiki A, Panneck P, Srinivasan P, Stevens A, Zöllei L, Augustinack J, et al. Automated probabilistic reconstruction of
white-matter pathways in health and disease using an atlas of the underlying anatomy. Frontiers in neuroinformatics
[47] Daducci A, Canales-Rodríguez EJ, Descoteaux M, Garyfallidis E, Gur Y, Lin YC, et al. Quantitative comparison of
reconstruction methods for intra-voxel fiber recovery from diffusion MRI. IEEE transactions on medical imaging
[48] Daducci A, Caruyer E, Descoteaux M, Houde J, Thiran J. HARDI reconstruction challenge 2013. In: Proceedings of the
IEEE International Symposium on Biomedical Imaging (ISBI), San Francisco, CA; 2013.
[49] Chenot Q, Tzourio-Mazoyer N, Rheault F, Descoteaux M, Crivello F, Zago L, et al. A population-based atlas of the human
pyramidal tract in 410 healthy participants. Brain Structure and Function 2018;p. 1–14.
[50] Rheault F, St-Onge E, Sidhu J, Maier-Hein K, Tzourio-Mazoyer N, Petit L, et al. Bundle-specific tractography with incorporated anatomical and orientational priors. NeuroImage 2018;.
[51] Poupon C, Laribiere L, Tournier G, Bernard J, Fournier D, Fillard P, et al. A diffusion hardware phantom looking like a
coronal brain slice. In: Proceedings of the International Society for Magnetic Resonance in Medicine, vol. 18; 2010. p.
[52] Neher PF, Laun FB, Stieltjes B, Maier-Hein KH. Fiberfox: facilitating the creation of realistic white matter software
phantoms. Magnetic resonance in medicine 2014;72(5):1460–1470.
[53] White T, Magnotta VA, Bockholt HJ, Williams S, Wallace S, Ehrlich S, et al. Global white matter abnormalities in
schizophrenia: a multisite diffusion tensor imaging study. Schizophrenia bulletin 2009;37(1):222–232.
[54] Garyfallidis E, Côté MA, Rheault F, Sidhu J, Hau J, Petit L, et al. Recognition of white matter bundles using local and global
streamline-based registration and clustering. NeuroImage 2018;170:283–295.
[55] Mazoyer B, Mellet E, Perchey G, Zago L, Crivello F, Jobard G, et al. BIL&GIN: a neuroimaging, cognitive, behavioral, and
genetic database for the study of human brain lateralization. Neuroimage 2016;124:1225–1231.
[56] Van Essen DC, Smith SM, Barch DM, Behrens TE, Yacoub E, Ugurbil K, et al. The WU-Minn human connectome project:
an overview. Neuroimage 2013;80:62–79.
[57] Glasser MF, Sotiropoulos SN, Wilson JA, Coalson TS, Fischl B, Andersson JL, et al. The minimal preprocessing pipelines
for the Human Connectome Project. Neuroimage 2013;80:105–124.
[58] Wassermann D, Makris N, Rathi Y, Shenton M, Kikinis R, Kubicki M, et al. The white matter query language: a novel
approach for describing human white matter anatomy. Brain Structure and Function 2016;221(9):4705–4721.
[59] Stieltjes B, Brunner RM, Fritzsche K, Laun F. Diffusion tensor imaging: introduction and atlas. Springer Science & Busi-
ness Media; 2013.
[60] Garyfallidis E, Brett M, Correia MM, Williams GB, Nimmo-Smith I. Quickbundles, a method for tractography simplification. Frontiers in neuroscience 2012;6:175.
[61] Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC medical
imaging 2015;15(1):29.
[62] Reddy CP, Rathi Y. Joint multi-fiber NODDI parameter estimation and tractography using the unscented information
filter. Frontiers in neuroscience 2016;10:166.
[63] MITK, MITK Diffusion Imaging; 2018. [Online; accessed 3-January-2019].
[64] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Medical
Image Computing and Computer-Assisted Intervention (MICCAI), vol. 9351 of LNCS; 2015. p. 234–241.
[65] Manjón JV, Coupé P, Concha L, Buades A, Collins DL, Robles M. Diffusion weighted image denoising using overcomplete
local PCA. PloS one 2013;8(9):e73021.
[66] Kaufman S, Rosset S, Perlich C, Stitelman O. Leakage in data mining: Formulation, detection, and avoidance. ACM
Transactions on Knowledge Discovery from Data (TKDD) 2012;6(4):15.
[67] Aganj I, Lenglet C, Sapiro G. ODF reconstruction in q-ball imaging with solid angle consideration. In: Biomedical Imaging:
From Nano to Macro, 2009. ISBI’09. IEEE International Symposium on IEEE; 2009. p. 1398–1401.
[68] Raudys SJ, Jain AK. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE
Transactions on Pattern Analysis & Machine Intelligence 1991;(3):252–264.
[69] Efron B, Tibshirani RJ. An introduction to the bootstrap. CRC press; 1994.
[70] Kleesiek J, Petersen J, Döring M, Maier-Hein K, Köthe U, Wick W, et al. Virtual raters for reproducible and objective
assessments in radiology. Scientific reports 2016;6:25007.
[71] Entis JJ, Doerga P, Barrett LF, Dickerson BC. A reliable protocol for the manual segmentation of the human amygdala
and its subregions using ultra-high resolution MRI. Neuroimage 2012;60(2):1226–1235.
[72] Boccardi M, Bocchetta M, Apostolova LG, Barnes J, Bartzokis G, Corbetta G, et al. Delphi definition of the EADC-ADNI
Harmonized Protocol for hippocampal segmentation on magnetic resonance. Alzheimer’s & Dementia 2015;11(2):126–
[73] Piccinini F, Tesei A, Paganelli G, Zoli W, Bevilacqua A. Improving reliability of live/dead cell counting through automated
image mosaicing. Computer methods and programs in biomedicine 2014;117(3):448–463.
[74] Beleites C, Neugebauer U, Bocklitz T, Krafft C, Popp J. Sample size planning for classification models. Analytica Chimica
Acta 2013;760:25–33.