Tractography and machine learning: Current state
and open challenges
Philippe Poulin1 | Daniel Jörgens2 | Pierre-Marc Jodoin*1 | Maxime Descoteaux*1
1Department of Computer Science, Université de Sherbrooke, Sherbrooke, Québec, Canada
2Department of Biomedical Engineering and Health Systems, KTH Royal Institute of Technology, Stockholm, Sweden
Correspondence: Philippe Poulin, Department of Computer Science, Université de Sherbrooke, Sherbrooke, Québec, Canada
Funding information: FRQNT; Université de Sherbrooke Institutional Chair in Neuroinformatics; NSERC Discovery grants of Profs. Descoteaux and Jodoin
Supervised machine learning (ML) algorithms have recently been proposed as an alternative to traditional tractography methods in order to address some of their weaknesses. They can be path-based and local-model-free, and easily incorporate anatomical priors to make contextual and non-local decisions that should help the tracking process. ML-based techniques have thus shown promising reconstructions of larger spatial extents of existing white matter bundles, fewer false-positive connections, and robustness to the known position and shape biases of current tractography techniques. Yet, as of today, none of these ML-based methods has shown conclusive performance or been adopted as a de facto solution for tractography. One reason for this might be the lack of well-defined and extensive frameworks to train, evaluate, and compare these methods.
In this paper, we describe several datasets and evaluation tools that contain useful features for ML algorithms, along with the various methods proposed in recent years. We then discuss the strategies used to evaluate and compare those methods, as well as their shortcomings. Finally, we describe the particular needs of ML tractography methods and discuss tangible solutions for future work.
Keywords: Diffusion MRI, tractography, machine learning, benchmark
arXiv:1902.05568v1 [q-bio.NC] 14 Feb 2019
PHILIPPE POULIN ET AL.
In the field of diffusion magnetic resonance imaging (dMRI), tractography refers to the process of inferring streamline structures that are locally aligned with the underlying white matter (WM) dMRI measurements [ ]. A simple approach to obtain such streamlines is an iterative process in which, starting from a seed point, an estimate of the local tissue orientation is determined and followed for a certain step length before repeating the orientation estimation at the new position. The tracking procedure may be deterministic [ ] (at each point, the algorithm follows the strongest orientation) or probabilistic [ ] (at each point, the algorithm samples a direction closely aligned with the strongest orientation). Tracking may also be global, as some methods recover streamlines all at once [ ]. In between the local and global methods is the category of shortest-path methods, including front evolution, simulated diffusion, geodesic, and graph-based approaches [ ]. Ultimately, the collection of all trajectories created in that way is called a tractogram.
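The iterative procedure described above can be sketched in a few lines. Here `get_peak` (a local orientation estimate, e.g. the strongest fODF peak) and `in_mask` (a WM tracking mask test) are hypothetical stand-ins, not functions of any particular library:

```python
import numpy as np

def track(seed, get_peak, in_mask, step=0.5, max_steps=1000):
    """Grow one streamline from `seed` by following local orientations."""
    streamline = [np.asarray(seed, dtype=float)]
    prev_dir = None
    for _ in range(max_steps):
        pos = streamline[-1]
        if not in_mask(pos):                      # stopping criterion: left the WM mask
            break
        d = np.asarray(get_peak(pos), dtype=float)
        d = d / np.linalg.norm(d)
        if prev_dir is not None and np.dot(d, prev_dir) < 0:
            d = -d                                # keep a consistent orientation
        streamline.append(pos + step * d)         # advance by one step length
        prev_dir = d
    return np.array(streamline)
```

A deterministic tracker follows the strongest peak as above; a probabilistic one would instead sample `d` from a distribution of directions around that peak.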
In traditional methods, the estimate of the local tissue fiber orientation is usually inferred from an explicit and local model which fits the (local) diffusion data. These local models include diffusion tensor models [ ], multi-tensor models [ ], and other methods that aim at reconstructing the fiber orientation distribution function (fODF), like constrained spherical deconvolution (CSD) [ ], to name a few. However, the choice of the best model is by itself difficult [ ], as it depends on various factors such as the data acquisition protocol or the targeted WM regions, and therefore has a direct influence on the quality of an obtained tractogram [ ]. Moreover, traditional methods based on local orientation alone are prone to common mistakes, such as missing the full spatial extent of bundles and producing a great amount of false positive connections.
Another important factor for the performance of a tractography method is the set of rules that govern the progression of a single step as well as simple global properties of an individual streamline. Traditional methods may define several engineered, or “manually-defined”, high-level rules with the aim of improving the anatomical plausibility of the recovered tractogram. Instances of these are constraints on streamline length (i.e. filtering streamlines that are too long or too short), streamline shape (e.g. filtering streamlines with sharp turns), or progression rules that make streamlines “bounce off” the WM border when they are about to leave the WM mask at a certain angle [ ]. In the same way as modeling noise and artifacts, and defining the right local model, the design of these high-level rules also has a direct impact on the performance of a tractography method [16, 15].
To address these inherent difficulties, recent proposals suggest that machine learning (ML) algorithms, supervised or unsupervised, may be used to implicitly learn a local, global or contextual fiber orientation model as well as the tracking procedure. Approaches ranging from the application of self-organizing maps (SOM) [ ], random forests [ ], Multilayer Perceptrons (MLP) [ ], Gated Recurrent Units (GRU) [ ], as well as Convolutional Neural Networks (CNN) [ ] and Autoencoders [ ], have been employed at the core of tractography to drive streamline progression. Apart from the differences in their underlying architecture, these ML methods differ substantially in aspects of the exact problem formulation, e.g. the definition of the input data to the model, modeling the predictions as a regression [ ] or classification problem [ ], or even the general tractography approach, i.e. whole-brain [ ] or bundle-specific [ ]. The fact alone that these approaches differ in several aspects makes it difficult to draw conclusions on the value of each of the individual modeling choices.
Furthermore, while the above-mentioned approaches constitute the main ideas for applying ML directly to the process of tractography, machine learning and especially deep learning (DL) methods have also been applied in related fields. Stacked U-Nets were proposed to segment the volume of individual white matter bundles from images of fODF peaks [ ]. It was also suggested to predict fiber orientations from raw diffusion data based on convolutional neural networks (CNN) [ ]. Several ideas for streamline clustering or streamline segmentation have been proposed, including a CNN based on landmark distances [ ], a long short-term memory (LSTM)-based siamese network for rotation-invariant streamline segmentation [ ], and a CNN approach for streamline clustering based on the sequence of their points [ ]. Even though the mentioned works are closely related to tractography and contribute to the common goal of improved analysis of the white matter anatomy of the human brain, we restrict our focus exclusively to the direct application of ML (and especially DL) for tractography, with the explicit goal of producing streamlines and addressing the weaknesses of traditional methods. For that reason, we refer the interested reader to the respective references for more details.
An important factor for effectively advancing this field of research is a common and appropriate methodology for training and evaluating the performance of different approaches, which is currently lacking. Over the years, multiple challenges have been proposed to assess the performance of conventional tractography methods, and a clear and exhaustive review is provided by Schilling et al. [ ]. However, we argue that the design of these challenges is typically inappropriate for ML methods. In fact, the 2015 ISMRM Tractography Challenge [ ] (along with the Tractometer evaluation tool [ ]) has been adopted as the tool of choice for benchmarking new ML tractography pipelines [ ]. Unfortunately, several inherent flaws arising specifically in the context of ML make it difficult to perform a fair comparison between the results obtained from different ML pipelines. In particular, diffusion data preprocessing is left to the participants, tracking seeds and a tracking mask are not always given (varying test environment), the test diffusion volume is sometimes used for training, training streamlines are not provided (disparate training data), and testing on a single synthetic subject means that any computed estimator of a model's performance is unreliable (small sample size). Against the background of a prospectively increasing number of ML-based approaches tackling the problem of tractography, a carefully designed evaluation framework that appropriately addresses the specific requirements of ML methods has the potential to support and facilitate research in this field in the upcoming years.
In this paper, we follow a threefold strategy. First, we introduce the currently available datasets and evaluation tools, along with their useful features and weaknesses regarding machine learning. Then, we provide a comprehensive review of existing ML-based tractography approaches and derive a set of key concepts distinguishing them from each other. Subsequently, we identify and discuss the strategies used to evaluate tractography pipelines, along with the issues and limitations arising when they are applied to ML-based methods. We finally describe the important features of an appropriate evaluation framework that the community ought to adopt in the near future to better promote data-driven streamline tractography, and point out its potential advantages for future research.
2 | ANNOTATED DATASETS AND EVALUATION TOOLS
Over the years, many diffusion MRI datasets were produced and annotated, either as part of a challenge or research
papers. In this section, we overview several datasets that have been used to train and/or validate supervised learning
algorithms for tractography. Speciﬁcally, we selected datasets that offer both diffusion data and streamlines. Selected
datasets also needed to have either clearly deﬁned evaluation metrics, or to be large enough (more than 50 subjects) to
be considered as standalone training sets. We include datasets that are either publicly available or simply mentioned in
a research paper without a public release.
We excluded datasets or challenges focused on non-human anatomy (e.g. rat or macaque), where the ground truth is harder to define and results might be harder to generalize to human anatomy (for data-driven algorithms), like the 2018 VOTEM Challenge [ ]. Moreover, we left out datasets focused only on pathological cases, like the 2015 DTI Challenge [ ], because we consider it too early for data-driven tractography algorithms, at least until more conclusive results are obtained on healthy subjects. We also excluded tractography atlases when tracking was done on a single diffusion volume, usually averaged over multiple subjects (e.g. HCP842 [ ]), because results tend to be overly smooth and unsuited for ML methods. However, we include a recent case where tracking was done for each subject: the 100-subject WM atlas of Zhang et al. [ ].
While all the selected datasets are useful in one way or another for data-driven methods, they differ in multiple
ways, which are detailed in the following subsections and summarized in Table 1. The listed properties are the following:
•Name: The dataset name and reference
•Year: The year of publication of the dataset or paper using the dataset
•Public: Is the dataset (diffusion data and streamlines) publicly available?
•Real: Is the diffusion data a real acquisition or is it simulated?
•Human: Does the diffusion data represent the human brain anatomy?
•Subjects: The number of subjects or acquisitions
•Bundles: The number of bundles or tracks (if streamlines are available)
•GT: Is a ground truth known? For real acquisitions, streamlines validated by a human expert (e.g. a neuroanatomist) are considered as GT, despite the fact that these annotations are subject to inter-rater and intra-rater variations.
•Metrics: Are well-defined evaluation metrics available with this dataset?
•Split: Is the dataset split into a training and testing set that future works can rely on?
Note that the notion of "ground truth" refers to an indisputable, biologically-validated label assigned to an observed variable. In medical imaging, such ground truth may be obtained with a biopsy [ ], through careful complementary examinations [ ], or by having several experts agree on a given diagnosis [ ]. Unfortunately, such a restrictive definition of a ground truth is unreachable most of the time, especially for white matter tracts obtained from tractography, where no expert can truly assess the existence (or non-existence) of a given streamline in a human brain from MRI images alone. In fact, only synthetically-generated streamlines or man-made phantoms can be considered as real "ground truth". Despite that, for the purpose of this paper, we also use the term "ground truth" for any data that has been manually validated by a human expert, typically a neuroanatomist. In the medical imaging field, this annotated data would be called a gold standard, while in the artificial intelligence community, it might be called weakly annotated data. Although such annotations do not meet the fundamental definition of a ground truth, they are nonetheless widely accepted by the medical imaging AI community [ ].
2.1 |The FiberCup dataset and the Tractometer tool
Original FiberCup Tractography Contest (2009)
Fillard et al. proposed the FiberCup Tractography Contest [ ] in conjunction with the 2009 MICCAI conference. The goal was to quantitatively compare tractography methods and algorithms using a clear and reproducible methodology. They built a realistic diffusion MR 7-bundle phantom with varying configurations (crossing, kissing, splitting, bending). The organizers acquired diffusion images with b-values of 2000, 4000, and 6000 s/mm², and used isotropic resolutions of 3 mm and 6 mm, resulting in 6 different diffusion datasets. Contestants were provided all datasets (but not the ground truth) and were free to apply any preprocessing they wanted on the diffusion images. Evaluation was done by choosing 16 specific voxels, or seed points, in which a unique fiber bundle is expected. Participants were expected to submit a single fiber bundle for each of those seed voxels. Quantitative evaluation was done by comparing the 16 pairs of candidate and ground truth fibers using a symmetric Root Mean Square Error (sRMSE).
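One common way to define such a symmetric point-to-curve RMSE is sketched below; the contest's exact formula may differ in detail (e.g. point correspondence or resampling), so this is only an illustration:

```python
import numpy as np

def srmse(a, b):
    """Symmetric RMSE between two streamlines given as (N,3) and (M,3) point arrays."""
    # Pairwise Euclidean distances between every point of `a` and every point of `b`.
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
    rms_ab = np.sqrt((d.min(axis=1) ** 2).mean())  # each point of a -> closest point of b
    rms_ba = np.sqrt((d.min(axis=0) ** 2).mean())  # each point of b -> closest point of a
    return 0.5 * (rms_ab + rms_ba)                 # symmetrize the two directed errors
```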
While the FiberCup Tractography Contest makes a good test case for simple configurations, it does not represent true human anatomy and does not impose a choice of b-value and preprocessing, which can induce significant differences in data-driven methods.

TABLE 1  Annotated datasets.

Name                  Year  Public  Real  Human  Subjects  Bundles   GT  Metrics  Split
FiberCup              2009    X      X             1          7      X      X
Simulated FiberCup    2012    X                    1          7      X      X
Tracula               2011           X      X      67        18      X
HARDI 2012            2012    X                    2          7      X      X       X
HARDI 2013            2013    X                    2         20      X      X       X
ISMRM 2015 [15, 16]   2015    X             X      1         25      X      X
HAMLET                2018           X      X      83        12                     X
PyT (BIL&GIN)         2018           X      X      410        2      X
BST (BIL&GIN)         2018           X      X      39         5      X      X
TractSeg (HCP)        2018    X      X      X      105       72      X
Zhang et al. (HCP)    2018           X      X      100     58 + 198  X

Also, it does not provide any training streamlines, and is thus useful only as a validation tool
for ML-based methods. Furthermore, the fact that it contains only one subject makes it hard to evaluate the true
generalization capability of an ML method trained and tested on that dataset. However, it is the only dataset that
provides seed points in order to have a uniform test environment, which is of utmost importance when comparing
ML-based algorithms. In the end, it is unclear whether, for ML-based methods, good performance on the FiberCup contest would correlate with good performance on human anatomy.
Tractometer evaluation tool (2013)
In 2013, Côté et al. developed the Tractometer evaluation tool, to be used alongside the original FiberCup data, with
the aim of providing quantitative measures that better reﬂect brain connectivity studies. Using a Region of Interest
(ROI)-based ﬁltering method, a complete tractogram can be evaluated on global connectivity metrics, such as the
number of valid and invalid bundles. Furthermore, they propose two seeding masks: a complete mask (mimicking a brain
WM mask), and a ROI mask (mimicking GM-WM interfaces). The Tractometer was designed to address the fact that
“metrics are too local and vulnerable to the seeds given, and, as a result, do not capture the global connectivity behavior
of the ﬁber tracking algorithm”.
Simulated FiberCup (2014)
In 2014, Neher et al. proposed a simulated version of the FiberCup, allowing new tracking algorithms to be tested using multiple acquisition parameters [ ]. The simulated data can be used alongside the Tractometer tool designed for the original FiberCup. Wilkins et al. also developed a synthetic version of the FiberCup dataset, but did not publicly release the data [ ]. Unfortunately, with regard to ML methods, the simulated FiberCup dataset suffers from the same shortcomings as the original FiberCup dataset, as it contains only one non-human subject whose data is not split a priori into a training and testing set.
2.2 |Tracula (2011)
Yendiki et al. [ ] published the Tracula method for automated probabilistic reconstruction of 18 major WM pathways.
It uses prior information on the anatomy of bundles from a set of training subjects. The training set was built from 34
schizophrenia patients and 33 healthy controls, using a 1.5T Siemens scanner as part of a multi-site MIND Clinical
Imaging Consortium [ ]. The diffusion images include 60 gradient directions acquired with a b-value of 700 s/mm², along with 10 b=0 images, at an isotropic resolution of 2 mm. Whole-brain deterministic tracking was performed,
followed by expert manual labeling using ROIs for 18 major WM bundles. The dataset also includes a measure of the
inter-rater and intra-rater variability for the left and right uncinate.
To our knowledge, this is the earliest appearance of a large-scale human dataset with expert annotation of streamlines.
It is also the only dataset that includes a measure of inter-rater and intra-rater variability, which is a desirable feature for
ML methods (also discussed later in Section 4.4). Unfortunately, the complete set of diffusion images and streamlines
has been incorporated into the method and is not public.
2.3 |HARDI Reconstruction Challenges
HARDI Reconstruction Challenge (2012)
Daducci et al. organized the 2012 HARDI Reconstruction Challenge [ ] at the ISBI 2012 conference. The goal of the
challenge was to quantitatively assess the quality of intra-voxel reconstructions by measuring the predicted number
of ﬁber populations and the angular accuracy of the predicted orientations. A training set was released prior to the
challenge, and a test set was used to score the algorithms. As such, the 2012 HARDI dataset contains diffusion images
but no streamlines.
Participants could request a custom acquisition (only once) by sending a list of sampling coordinates in q-space, and the organizers would then produce a simulated signal for the given parameters. A 16 × 16 × 5 volume was then produced, containing seven different bundles attempting to recreate realistic 3-D configurations. The metrics proposed by the authors are ill-suited to ML-based methods because of the limited context available and the focus on local performance. Like the FiberCup, it would only be useful as a validation tool, given the lack of training streamlines, the limited number of bundles (only seven) and the limited number of non-human subjects (only two).
HARDI Reconstruction challenge (2013)
The 2013 HARDI Reconstruction Challenge [ ] was organized one year later at the ISBI 2013 conference. For ML-based
methods, three improvements are relevant compared to the 2012 challenge: a more realistic simulation of the diffusion
signal, a new evaluation system based on connectivity analyses and a larger set of 20 bundles. Indeed, data-driven
methods try to learn an implicit representation without imposing a model on the signal, which means that the signal
used for training and testing should be as close as possible to that in clinical practice. Furthermore, the main beneﬁt of
data-driven methods is the ability to use context in order to make good predictions in a multitude of conﬁgurations,
which means they have the potential to particularly improve connectivity analyses. Therefore, it would be a better
validation tool for ML-based methods than the 2012 HARDI Reconstruction Challenge. Nonetheless, the dataset suffers
from an inherent limitation as it contains only two non-human subjects.
FIGURE 1 2015 ISMRM Tractography Challenge data generation process (Taken from www.tractometer.org)
2.4 |ISMRM Tractography Challenge (2015)
This dataset has been designed for a tractography challenge organized in conjunction with the 2015 ISMRM conference [ ]. During the challenge, participants were asked to reconstruct streamlines from a synthetic human-like diffusion-weighted MR dataset which was simulated with the aim of replicating a realistic, clinical-like acquisition, including noise and artifacts. The available data consists of a diffusion dataset with 32 b=1000 s/mm² images and one b=0 image, with 2 mm isotropic resolution, as well as a T1-like image with 1 mm isotropic resolution. Since all data was
generated from an expert segmentation of 25 bundles, in theory, a perfect tracking algorithm should only produce
exactly these speciﬁc bundles. Unfortunately, as for the HARDI and FiberCup datasets, the 2015 ISMRM Tractography
Challenge contains data from a limited number of subjects (only one) and lacks a clear separation between training and
testing data. Nonetheless, in combination with the Tractometer tool [ ], this dataset has often been used to assess
ML-based tractography methods. Figure 1 shows the data generation process for the challenge.
Once a tractogram has been generated using the challenge diffusion data, the Tractometer tool uses a “bundle recognition algorithm” [ ] to cluster the streamlines into bundles. The generated bundles are then compared to the
] to cluster the streamlines into bundles. The generated bundles are then compared to the
ground truth, producing groups of “valid bundles” and “invalid bundles”, depending on which regions of the brain the
streamlines connect. Streamlines that do not correspond to a ground truth bundle are classiﬁed as “No connections”
streamlines. The metrics computed by the modiﬁed Tractometer for the Tractography Challenge are as follows:
•Valid bundles (VB): The number of correctly reconstructed ground truth bundles.
•Invalid Bundles (IB): The number of reconstructed bundles that do not match any ground truth bundles.
•Valid Connections (VC): The ratio of streamlines in valid bundles over the total number of produced streamlines.
•Invalid Connections (IC): The ratio of streamlines in invalid bundles over the total number of produced streamlines.
•No Connections (NC): The ratio of streamlines that are either too short or do not connect two regions of the cortex over the total number of produced streamlines.
•Overlap (OL): The ratio of ground truth voxels traversed by at least one streamline over the total number of ground truth voxels.
•Overreach (OR): The ratio of voxels traversed by at least one streamline that do not belong to a ground truth voxel over the total number of ground truth voxels.
•F1-score (F1): The harmonic mean of recall (OL) and precision (1-OR).
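The volume-oriented scores above (OL, OR, and the derived F1) follow directly from their definitions; a minimal sketch, assuming binary numpy volumes for the ground truth and the voxels traversed by the candidate streamlines:

```python
import numpy as np

def volume_scores(gt, track):
    """OL, OR and F1 for binary masks `gt` (ground truth) and `track` (tracked voxels)."""
    gt = gt.astype(bool)
    track = track.astype(bool)
    n_gt = gt.sum()
    ol = (gt & track).sum() / n_gt        # overlap: fraction of GT voxels reached
    orr = (track & ~gt).sum() / n_gt      # overreach: extra voxels, normalized by GT size
    precision = 1.0 - orr                 # precision as defined for the F1 above
    f1 = 2 * precision * ol / (precision + ol)
    return ol, orr, f1
```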
The deﬁnition of streamline-oriented metrics (VB, IB, VC, IC, NC) and volume-oriented metrics (OL, OR, F1) means
that there is no single number that can fully assess the performance of an algorithm. For example, deterministic methods
often score higher on streamline-oriented metrics compared to probabilistic methods. As such, a thorough review of all
scores must be performed in order to properly compare algorithms, and in many cases, the choice of an algorithm over
another may depend on a speciﬁc use-case (e.g. bundle reconstruction vs. connectivity analysis).
2.5 |HAMLET (2018)
To validate their method, Reisert et al. [ ] used a dataset of 83 human subjects from two independent cohorts. The first
cohort comprises 55 healthy volunteers, all scanned by a Siemens 3T TIM PRISMA MRI scanner. The second cohort has
28 volunteers scanned with a Siemens TIM TRIO. The ﬁrst cohort was used for training while the second one was used
for testing. Subjects in the second cohort were scanned twice for test-retest experiments, a characteristic unique to that dataset. The reference streamlines were obtained by first tracking the whole brain with global tractography,
and then by segmenting the streamlines for 12 bundles with a selection algorithm in MNI space. Unfortunately, the
recovered streamlines have not been manually validated by an expert.
2.6 |Datasets based on the BIL&GIN database
Bundle-Speciﬁc Tractography (2018)
Rheault et al. proposed a bundle-specific tracking method based on anatomical priors that improves tracking in the centrum semiovale crossing regions [ ]. Using multiple tractography algorithms, they tracked and segmented five bundles (Arcuate Fasciculus - AF left/right, Corpus Callosum - CC, Pyramidal Tracts - PyT left/right) in 39 subjects from the BIL&GIN database [ ]. To compare algorithms, they used an automatic bundle segmentation method based on clear anatomical definitions. In addition, they defined several performance metrics, such as bundle volume, ratio of valid streamlines, and efficiency. However, the tractograms and the automatic bundle segmentation procedure were neither made
public nor validated by an expert. Such a dataset, along with the evaluation procedure, could be extremely useful to
assess if data-driven methods can reliably learn the structure of a speciﬁc bundle and reconstruct it in unseen subjects.
A population-based atlas of the human pyramidal tract (2018)
Chenot et al. created a streamline dataset of the left and right PyT based on a population of 410 subjects [ ], also from the BIL&GIN database [ ]. To do so, they combined manual ROIs along the bundles' pathway with the bundle-specific tractography algorithm of Rheault et al. [ ]. The quality of the segmentations and the high number of subjects would make this a noteworthy training dataset for data-driven methods. Unfortunately for ML methods, only two bundles were examined. Furthermore, while the probability maps of the atlas have been rendered public, the tractograms are not.
2.7 |Datasets based on the HCP database
TractSeg (2018)
Wasserthal et al. proposed a data-driven method for fast WM tract segmentation without tractography [ ]. In doing
so, they built an impressive dataset of 72 manually-validated bundles for 105 subjects from the Human Connectome
Project (HCP) diffusion database [56, 57]. Tractograms were obtained via a four-step semi-automatic approach:
1. Tractography (Multi-Shell Multi-Tissue CSD [ ])
2. Initial tract extraction (TractQuerier [ ])
3. Tract refinement (manual ROIs [ ] + QuickBundles [ ])
4. Manual quality control and cleanup
To the best of our knowledge, this is the largest public database to include both diffusion data and reference streamlines. No further preprocessing of the diffusion data is needed because of the standard procedure of [ ]. The authors defined volume-oriented metrics such as the Dice score [ ], but did not offer any streamline-oriented metrics as their method predicts a volume segmentation. The high number of subjects and bundles makes this a remarkable training dataset.
In a subsequent paper, the same authors re-used a subset of 20 bundles of the TractSeg dataset to train and validate their TOM ML algorithm [ ]. However, as for the original 72-bundle dataset, the TOM dataset does not come with a predefined set of training and testing data, and no formal evaluation protocol that users could rely on has been proposed.
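For reference, the Dice score mentioned above measures the overlap of two binary volumes; a minimal sketch:

```python
import numpy as np

def dice(a, b):
    """Dice overlap of two binary volumes: 2|A∩B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    # Convention: two empty masks overlap perfectly.
    return 2.0 * (a & b).sum() / denom if denom else 1.0
```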
Zhang et al. (2018)
Zhang et al. [ ] built a WM fiber atlas using 100 HCP subjects. They first generated streamlines for all subjects
using a two-tensor unscented Kalman filter method [ ], and sampled 10,000 streamlines from each subject after a
tractography registration step. Then, using a hierarchical clustering method, the authors generated an initial WM ﬁber
atlas of 800 clusters. Finally, an expert neuroanatomist reviewed the annotations in order to accept or reject each
cluster, and provided the correct annotations when the initial annotation was rejected. The ﬁnal, proposed atlas is
comprised of 58 bundles (each composed of multiple clusters), along with “198 short and medium range superficial fiber clusters organized into 16 categories according to the brain lobes they connect” [ ].
While the atlas is public, the sampled streamlines from the 100 subjects are all merged into the single template. In order for ML methods to benefit from this dataset, the streamlines would need to be separated back into the space of the particular original subjects. For this reason, we do not consider this dataset to be "public" in the context of machine learning.
3 | MACHINE LEARNING METHODS FOR TRACTOGRAPHY
For this review, we consider all supervised machine learning methods published in peer-reviewed journals, at conferences, or on arXiv (arxiv.org) and bioRxiv (biorxiv.org). We added the requirement that methods needed to be specifically designed for tractography, i.e. with the purpose of predicting a contextual streamline direction (and not reconstructing a local, non-conditional fODF or clustering streamlines). This criterion includes whole-brain as well as bundle-specific tractography methods. A summary of the main properties of all reviewed methods is provided in Table 2.
TABLE 2  Main properties of data-driven methods for tractography.

Method             Model     Temporal input     Spatial input  Input signal   Prediction      Implicit stop
Neher et al.       RF        1 last direction   50 samples     Resampled DWI  Classification  X
Poulin et al.      GRU       Full               1x1x1 voxel    SH             Regression
Poulin et al.      GRU       Full               1x1x1 voxel    SH             Regression
Benou et al.       GRU       Full               1x1x1 voxel    Resampled DWI  Classification  X
Jörgens et al.     MLP       2 last directions  1x1x1 voxel    Raw DWI        Regression
Wegmayr et al.     MLP       4 last directions  3x3x3 voxels   SH             Regression
Wasserthal et al.  CNN       N/A                Entire WM      fODF peaks     Regression
Reisert et al.     CNN-like  N/A                Entire WM      SH             Regression      X

RF: Random Forest; MLP: Multilayer perceptron; GRU: Gated recurrent unit; CNN: Convolutional neural network; SH: Spherical harmonics coefficients; fODF: fiber Orientation Distribution Function; Implicit stop: indicates whether a method learns its tracking stopping criterion or relies on a usual explicit criterion.
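The regression-vs-classification distinction summarized in Table 2 amounts to the choice of prediction target: a regressor outputs a continuous 3-D direction, while a classifier scores a fixed set of discrete unit directions. A sketch of the label construction for the classification case, with an assumed `sphere_dirs` array of unit vectors:

```python
import numpy as np

def direction_to_class(true_dir, sphere_dirs):
    """Label for a classifier: index of the discrete direction closest to `true_dir`."""
    true_dir = true_dir / np.linalg.norm(true_dir)
    # For unit vectors, the largest dot product identifies the closest direction.
    return int(np.argmax(sphere_dirs @ true_dir))

def class_to_direction(label, sphere_dirs):
    """Inverse mapping used at tracking time: class index -> unit direction."""
    return sphere_dirs[label]
```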
Random Forest classiﬁer
To the best of our knowledge, Neher et al. were the first to propose a machine learning algorithm for (deterministic) tractography [ ]. They employ an RF classifier to learn a mapping from raw diffusion measurements to a directional proposal for streamline continuation. After collecting several such proposals in a local neighborhood of the current streamline position (radius: 25% of the smallest side length of a voxel), these are aggregated in a voting scheme to finally arrive at a single direction in which to grow the streamline.
To define reference streamlines for their experiments, the authors employ several tractography pipelines and train their classifier on each of the resulting tractograms. They determine the best trained model by evaluating the performance of each on a replication of the FiberCup phantom (based on the Tractometer metrics of [ ]). By comparing the performance of the latter to all other reference pipelines, they report a superior performance of their tracking model over all other approaches. While tractograms were scored on a simulated phantom (i.e. no real anatomy), extended experiments presented in a subsequent paper [ ] confirm the superiority of their approach on the 2015 ISMRM Tractography Challenge dataset (simulated data of a human anatomy).
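The neighborhood-voting step described above can be sketched as follows; `clf` (a trained classifier mapping diffusion features to a discrete direction class) and `sample_features` (a feature extractor at a position) are hypothetical stand-ins, not Neher et al.'s actual implementation:

```python
import numpy as np

def vote_step(pos, clf, sample_features, directions, radius, step, n_samples=10):
    """One tracking step: gather direction proposals around `pos`, vote, advance."""
    rng = np.random.default_rng(0)
    # Draw proposal points on a small sphere of radius `radius` around `pos`.
    offsets = rng.normal(size=(n_samples, 3))
    offsets *= radius / np.maximum(np.linalg.norm(offsets, axis=1, keepdims=True), 1e-8)
    # Each neighborhood sample proposes a discrete direction class.
    votes = [clf.predict(sample_features(pos + o)) for o in offsets]
    # Aggregate proposals: keep the most frequently proposed class.
    best = np.bincount(votes).argmax()
    return pos + step * directions[best]
```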
Gated Recurrent Unit (GRU) Tracking
Hypothesizing “that there are high-order dependencies between” the local orientation at a point of a streamline and the orientations at all other points on the same streamline, Poulin et al. proposed a recurrent neural network (RNN) based on a GRU [ ] to learn the tracking process. Their method implements an implicit model mapping diffusion measurements to local streamline orientations, which depends not only on measurements in a local context but on all data previously seen along the extent of a particular streamline. As opposed to [ ], the RNN model is implemented as a regression approach. In their experiments, the authors show that a recurrent model (when trained on reference streamlines obtained using deterministic CSD-based tractography [ ]) was able to outperform most of the original submissions in the 2015 ISMRM Tractography Challenge with respect to the Tractometer scores (discussed in section 2.4).
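The iterative use of a recurrent model can be sketched as follows; the GRU weights are random (untrained) and `features_at` is a placeholder for interpolated diffusion data, so this only illustrates the mechanics, not the trained model of Poulin et al.:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUTracker:
    """Minimal numpy GRU mapping per-point features to the next unit direction;
    the hidden state carries the streamline's 'history'."""
    def __init__(self, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        def w(rows, cols):
            return 0.1 * rng.normal(size=(rows, cols))
        self.Wz, self.Uz = w(hid_dim, in_dim), w(hid_dim, hid_dim)
        self.Wr, self.Ur = w(hid_dim, in_dim), w(hid_dim, hid_dim)
        self.Wh, self.Uh = w(hid_dim, in_dim), w(hid_dim, hid_dim)
        self.Wout = w(3, hid_dim)   # linear head: hidden state -> 3D direction
        self.h = np.zeros(hid_dim)

    def step(self, x):
        z = sigmoid(self.Wz @ x + self.Uz @ self.h)            # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ self.h)            # reset gate
        h_new = np.tanh(self.Wh @ x + self.Uh @ (r * self.h))  # candidate state
        self.h = (1 - z) * self.h + z * h_new
        d = self.Wout @ self.h
        n = np.linalg.norm(d)
        return d / n if n > 0 else np.array([1.0, 0.0, 0.0])

def track(features_at, seed_point, n_steps=20, step_size=0.5, in_dim=8):
    """Grow one streamline; features_at(p) stands in for interpolated dMRI data."""
    model = GRUTracker(in_dim, hid_dim=16)
    points = [np.asarray(seed_point, dtype=float)]
    for _ in range(n_steps):
        direction = model.step(features_at(points[-1]))
        points.append(points[-1] + step_size * direction)
    return np.array(points)
```

The key design point is that the hidden state is never reset along a streamline, so every prediction can in principle depend on all previously visited positions.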
In a subsequent paper, Poulin et al. again suggested using a GRU, but in a bundle-specific fashion. While the model architecture is very similar, it was trained on a dataset of 37 real subjects, each with a curated set of streamlines for a number of bundles. After training a single model for each of the selected bundles, the authors showed promising results compared to existing methods, perhaps indicating that the difficult task of learning to track streamlines requires more data than previously thought.
More recently, Benou and Riklin-Raviv proposed a GRU-based recurrent neural network similar to that of Poulin et al. In their method, they directly use the resampled diffusion signal as input to the model (like Neher et al.) in order to estimate a discrete, streamline-specific fODF representation which they refer to as the “conditional fODF” (CfODF). Instead of predicting a 3D orientation vector using a regression approach, the authors implement their model as a classifier, enabling them to interpret the probabilities obtained for the discrete sampled directions (i.e., the classes) as the mentioned CfODF. This fODF-based formulation further allows for an inherently defined criterion for streamline termination based on the entropy of the CfODF. The proposed model can be employed for both deterministic and probabilistic tracking.
Like Poulin et al., the authors trained and tested their method on the 2015 ISMRM Tractography Challenge dataset. They report results after training their method on the dataset ground truth as well as on streamlines obtained with the MITK diffusion tool.
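The entropy-based termination criterion is easy to state concretely; the fraction-of-maximum-entropy threshold below is an illustrative choice of ours, not the published value:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete probability vector."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def should_stop(class_probs, max_entropy_frac=0.8):
    """Entropy-based termination in the spirit of DeepTract: stop when the
    predicted distribution over discrete directions is close to uniform,
    i.e. when the model is too uncertain to keep tracking."""
    h_max = np.log(len(class_probs))  # entropy of the uniform distribution
    return entropy(class_probs) >= max_entropy_frac * h_max
```

A sharply peaked CfODF (confident prediction) keeps the tracker going, while a near-uniform one triggers termination without any external mask.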
Multi-Layer Perceptron Point-Wise Prediction
Jörgens et al. propose a multi-layer perceptron (MLP) to predict the next step of a streamline. Like [ ], their method takes as input the diffusion signal and thus avoids explicit dMRI model-fitting. The authors implemented different configurations of their proposed MLP, such as three different input scenarios (point-wise input vs. region-wise input, with and without considering previous orientations), different approaches to aggregate the output (maximum likelihood, mathematical expectation of the categorical prediction, and regression), as well as the voting scheme proposed by Neher et al. Results reveal that the best configurations are those that include the previous two directions in the input of the network, showing that temporal context is a key component of data-driven tractography. Also, the regression and classification approaches led to similar results, and the use of region-wise information did not provide any substantial improvement over point-wise information.
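The two categorical aggregation schemes can be written down directly; the direction bins and probabilities below are toy values used only to show the difference between picking the mode and taking the expectation:

```python
import numpy as np

def aggregate_max_likelihood(probs, bins):
    """Pick the single most probable discrete direction (the mode)."""
    return bins[np.argmax(probs)]

def aggregate_expectation(probs, bins):
    """Probability-weighted mean of the bin directions, renormalised to a
    unit vector. Sensible only when the probability mass is not split
    across near-opposite bins, where the mean can cancel out."""
    d = np.asarray(probs) @ np.asarray(bins)
    return d / np.linalg.norm(d)
```

The expectation variant smooths between neighboring bins, while the maximum-likelihood variant is quantised to the bin resolution.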
Like Poulin et al. and Benou and Riklin-Raviv, the authors trained and tested their method on the 2015 ISMRM Tractography Challenge dataset (but did not use the Tractometer tool). Unfortunately, they did not estimate the tracking capabilities of their method, as they only measured point-wise angular errors when predicting the next step of a streamline.
Multi-Layer Perceptron Regression Tracking
A similar approach suggested by [ ] employs an MLP to predict the next direction of a streamline through regression. At each point, the input of the model is given by all diffusion measurements in a cubic neighborhood, along with a certain number of previous steps of the current streamline. In that way, the authors provide the ML model directly with diffusion information in a local neighborhood (spatial context) as well as a notion of “history” of the current streamline (temporal context). Defining their reference streamlines as tractograms obtained with a standard tractography method from in vivo datasets, they train their model on three subjects from the HCP database. Experimental validation on the 2015 ISMRM Tractography Challenge dataset reveals that their model outperforms some ML methods [ ] on most Tractometer metrics. However, as demonstrated by low overlap scores, the authors acknowledge that their model produces “rather confined bundles with little spread”, especially in contrast to [ ]. While the strength of this model is to explicitly provide information from a local neighborhood, like for Jörgens et al., the notion of context along the streamline is limited and needs to be defined before training. Since the ideal temporal context (in terms of streamline length, or steps) is still unknown, this could prevent the model from taking advantage of all information relevant to streamline continuation.
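Building such an input vector might look as follows; the array layout, the nearest-voxel lookup (instead of trilinear interpolation) and the zero-padding near the seed are our simplifying assumptions:

```python
import numpy as np

def build_input(dwi, voxel, prev_dirs, k=4, r=1):
    """Concatenate the diffusion measurements in a (2r+1)^3 cubic
    neighbourhood (spatial context) with the last k streamline directions
    (temporal context, zero-padded near the seed).

    dwi: (X, Y, Z, G) diffusion volume; voxel: integer (x, y, z) index;
    prev_dirs: list of 3-vectors, oldest first."""
    x, y, z = voxel
    patch = dwi[x - r:x + r + 1, y - r:y + r + 1, z - r:z + r + 1].ravel()
    history = np.zeros((k, 3))
    tail = prev_dirs[-k:]
    if len(tail):
        history[-len(tail):] = tail   # most recent direction ends up last
    return np.concatenate([patch, history.ravel()])
```

The fixed value of `k` is exactly the limitation discussed above: the amount of temporal context must be chosen before training, unlike in the recurrent models.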
Tract orientation mapping using an encoder-decoder CNN
Wasserthal et al. proposed a data-driven, bundle-specific tracking method. As opposed to the other ML methods reported in this paper, the authors do not try to directly reconstruct streamlines per se. Instead, their proposed Tract Orientation Mapping (TOM) method predicts bundle-specific fODF peaks that are then used by a deterministic tracking method. First, CSD is used to extract three principal directions in all WM voxels. Then, a U-Net CNN [ ] is trained to map these fODF peaks to bundle-specific peaks, i.e., peaks that are only relevant for the streamlines of a given bundle. Their CNN takes as input 9 channels (the three fODF peaks) and outputs 60 channels, i.e., a 3D bundle-specific peak vector for each of the 20 bundles they aim to recover. While the recovered bundle-specific peaks can be used in different ways, the authors show that using them directly as input to deterministic MITK diffusion tractography gives some of the best results. The approach was trained and tested on 105 HCP subjects, each with reference streamlines produced by a semi-automatic dissection of 20 large WM bundles (which they recently made public).
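The channel layout is easy to make concrete; `split_bundle_peaks` and the magnitude threshold below are illustrative helpers of ours, not part of the published TOM code:

```python
import numpy as np

def split_bundle_peaks(tom_out, n_bundles=20):
    """Reshape a (X, Y, Z, 3*n_bundles) network output into one 3-vector
    peak per voxel per bundle, returned as (n_bundles, X, Y, Z, 3)."""
    x, y, z, c = tom_out.shape
    assert c == 3 * n_bundles
    return np.moveaxis(tom_out.reshape(x, y, z, n_bundles, 3), 3, 0)

def bundle_mask(peaks, thr=0.1):
    """Voxels where the bundle-specific peak is long enough to track through;
    the threshold is an illustrative choice."""
    return np.linalg.norm(peaks, axis=-1) > thr
```

Each per-bundle peak field can then be handed to any deterministic tracker, which is what makes the approach agnostic to the streamline-propagation step itself.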
In a similar line of thought, in their HAMLET project (Hierarchical Harmonic Filters for Learning Tracts from Diffusion MRI), Reisert et al. map raw spherical harmonics of order 2 to a spherical tensor field. In that sense, like Wasserthal et al., their ML method does not output streamlines but instead voxel-wise bundle-specific tensors that can subsequently be used as input to a classical tractography method. The magnitude of the produced tensor indicates the presence of a specific bundle, whereas the tensor orientation predicts the local streamline direction. Their method implements a multi-resolution CNN with rotation-covariant convolution operations. They trained and tested their method on two in-house datasets comprising a total of 83 human subjects. The 12 bundles and their associated reference streamlines were obtained with global tractography and an automatic bundle selection method. Unfortunately, the reference data was not manually validated by a human expert, and they did not perform any comparisons against other tractography methods.
4 | RESULTS & DISCUSSION
4.1 | Results on the 2015 ISMRM Tractography Challenge
The 2015 ISMRM Tractography Challenge is the only dataset that has been used to assess the performance of several data-driven tractography methods and is thus, as of today, the only available common ground on which to compare methods. It was used by four different papers, namely the Random Forest of Neher et al., the GRU of Poulin et al., the GRU (DeepTract) of Benou and Riklin-Raviv, and the MLP of Wegmayr et al. Experimental results reported by the authors have been transcribed in Table 3 and compared with the original submissions in Figure 2. Note that the metrics marked as not available (N/A) are those the authors did not report in their original paper.
TABLE 3  Tractometer results. The Bundles and Connections (%) metrics are streamline-oriented metrics, whereas the Avg. bundle (%) metrics are volume-oriented metrics.

Model           | Valid bundles | Invalid bundles | Valid conn. (%) | Invalid conn. (%) | No conn. (%) | Overlap (%) | Overreach (%) | F1-score (%)
Random-Forest   | 23 | 94  | 52 | N/A | N/A | 59 | 37 | 61
GRU             | 23 | 130 | 42 | 46  | 13  | 64 | 35 | 65
MLP             | 23 | 57  | 72 | N/A | N/A | 16 | 28 | 26
GRU (DeepTract) | 23 | 51  | 41 | 33  | 23  | 34 | 17 | 44

FIGURE 2  2015 ISMRM Tractography Challenge original submissions (1-20) and new results (21-24)

As can be seen, results vary considerably and there is no clear trend showing which method performs best, especially given the nature of the evaluation metrics. As mentioned in section 2.4, methods can be evaluated using both streamline-oriented metrics and volume-oriented metrics, which are not always correlated. For example, a method may have a large number of valid connections but a low overlap (like the MLP of Wegmayr et al.), which means that although the model was able to recover most valid bundles, the generated streamlines do not properly cover the spatial extent of those bundles. Also, a method can be more conservative and score best in terms of invalid connections and overreach, like the GRU of Benou and Riklin-Raviv, but at the same time have a low ratio of valid connections and a poor bundle overlap. On the other hand, the Random Forest of Neher et al. does not score best in any category, but is competitive according to all metrics (its large F1-score underlines that it is a more balanced method compared to the MLP and DeepTract). On top of that, all methods were trained and evaluated differently, so any comparison based on the reported results should be done with extreme care.
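For reference, the volume-oriented scores discussed here can be computed from binary bundle masks roughly as follows; exact definitions vary slightly across papers, so this sketch follows the common overlap/overreach convention rather than the Tractometer code itself:

```python
import numpy as np

def volume_scores(candidate, reference):
    """Voxel-wise bundle scores: overlap (OL), overreach (OR, expressed
    relative to the reference volume) and an F1 score combining voxel
    precision with overlap (recall)."""
    c = np.asarray(candidate, dtype=bool)
    r = np.asarray(reference, dtype=bool)
    tp = np.sum(c & r)                     # correctly covered reference voxels
    overlap = tp / np.sum(r)
    overreach = np.sum(c & ~r) / np.sum(r)
    precision = tp / max(np.sum(c), 1)
    f1 = 0.0 if (precision + overlap) == 0 else 2 * precision * overlap / (precision + overlap)
    return overlap, overreach, f1
```

This makes the trade-off in Table 3 explicit: a conservative tracker lowers overreach at the cost of overlap, and F1 rewards balancing the two.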
TABLE 4  Differences in data.

Method        | Preprocessing           | WM mask      | Training subjects | Reference streamlines
Random-Forest | dwidenoise + dwipreproc | Not needed   | 5 HCP subjects    | CSD (deterministic)
GRU           | None                    | Ground truth | Challenge subject | CSD (deterministic)
MLP           | dwipreproc              | N/A          | 3 HCP subjects    | iFOD (probabilistic)
DeepTract     | N/A                     | Not needed   | Challenge subject | Q-Ball (probabilistic)
4.2 | The 2015 ISMRM Tractography Challenge as an evaluation tool for ML algorithms
As mentioned before, the 2015 ISMRM Tractography Challenge has been adopted as the de facto evaluation tool to
compare ML tractography methods. However, the strengths and weaknesses of that tool should be thoroughly reviewed
to understand and trust any technique reporting results with it. In this section, we present what we consider to be
important issues with the way in which this tool has been used to assess the performance of data-driven methods. In
particular, we detail the discrepancies between the four ML-based methods, differences that may explain some of the
results in Table 3 and potentially undermine any conclusion that one could draw from it. Let us mention that some of
these issues with the 2015 ISMRM dataset are typical for the ﬁeld of tractography as a whole.
Table 4 presents a summary of the differences in how the tool is used. Note that the not available (N/A) mark is used
for any information the authors did not mention in their original paper.
The four ML methods use different preprocessing pipelines. Among the proposed algorithms, two applied MRtrix's dwidenoise or dwipreproc [ ], another denoised using [ ] and corrected for eddy currents and head motion, and another did not apply any preprocessing at all. Moreover, some used the diffusion signal directly as input, while others resampled it to a specific number of gradient directions. In some cases, spherical harmonics were fitted to the signal and the SH coefficients were fed as input to the model. Finally, the non-recurrent models are also given a variable number of previous streamline directions as input.
The output of each of these pipelines contains various degrees of information. For example, fODF peaks are in theory already aligned with the major WM pathways, and information may be lost depending on the specific model used to recover the peaks from the diffusion signal. On the other hand, the raw diffusion signal might contain more information but is more difficult to interpret and process, and thus a data-driven model might require more capacity to use such an input. Without a thorough investigation of the information contained in each output, any variations in the Tractometer results could be attributed to the variations in preprocessing. Since we currently do not have any indication of what is useful for data-driven algorithms, it is impossible to compare ML methods if they do not use the same input.
Varying test environment
Since no white matter mask is provided, it must be computed by each participant in case it is needed for tracking. Out of
the four ML methods that were evaluated on the challenge, two needed WM masks; one used the ground truth mask,
and the other did not mention how the mask was computed. Furthermore, since no tracking seeds are supplied with the
data either, their arrangement entirely depends on the WM mask (and on the number of seeds per voxel, which is also not specified).
Given the nature of streamline tractography, small variations of the tracking mask or the tracking seeds could have
a substantial impact on the resulting streamlines and, in turn, on the obtained evaluation metrics. Also, even though
computing a stopping criterion within the algorithm is a worthy improvement, it is a different task than tracking, and
should be evaluated separately. Consequently, all methods should be provided the same tracking mask and seeds to
reduce as much as possible the number of free variables during evaluation.
The use of ML methods requires special care when dealing with the available data. Since machine learning models are obtained by deriving implicit rules directly from given data (i.e., training data), testing the true generalization capabilities of these rules must be done using a different and unseen set of data (i.e., test data).
Two methods suffer from data contamination, or leakage [ ]: the GRU in [ ] and the MLP in [ ]. Here, data contamination refers to the usage of the same diffusion data for training and testing. This means that the true generalization capabilities of the tested method on new, thus unseen, subjects are still unknown, since the model has already seen the specific diffusion patterns that are needed in order to “explore” at test time, and therefore has been given an unfair advantage.
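A subject-level split is the standard remedy; the helper below is a generic sketch of ours, not tied to any of the reviewed methods:

```python
import numpy as np

def subject_split(subject_ids, test_fraction=0.3, seed=0):
    """Split sample indices at the subject level, so that no diffusion
    volume contributes to both training and testing (avoiding the leakage
    discussed above). subject_ids gives the subject of each sample."""
    rng = np.random.default_rng(seed)
    subjects = np.array(sorted(set(subject_ids)))
    rng.shuffle(subjects)
    n_test = max(1, int(round(test_fraction * len(subjects))))
    test_subjects = set(subjects[:n_test])
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    return train_idx, test_idx
```

The essential property is that the split is decided per subject, never per streamline or per voxel.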
Disparate training data
All methods used different reference streamlines and subjects for training. As mentioned earlier, some employed the test diffusion data directly, while others relied on a varying number of subjects from the HCP database. Two methods used deterministic CSD tracking [ ] to generate reference streamlines, one used QBI tracking [ ] (probabilistic), and the last one used iFOD tracking [ ] (also probabilistic). In order to provide a uniform basis for comparison, the same comprehensive streamline training set should be available to every algorithm.
Simulation as a substitute for human acquisition
While the diffusion signal of the 2015 ISMRM dataset is typical of that of a human brain, it is nonetheless obtained
through simulation. As such, results on that dataset should not be seen as a measure of future performance on real
human subjects, at least not without further empirical evaluation. Furthermore, at the given resolution and using this
particular conﬁguration of 25 bundles, false positive streamlines that would otherwise be plausible given the underlying
anatomy of a real scan might be impossible to avoid. Indeed, some authors tried training their models using the ground
truth bundles, and still produced over 50 invalid bundles in both cases [18, 25].
Small sample size
The 2015 ISMRM Tractography Challenge dataset has only one subject, which makes it hard to assess the future performance of a data-driven algorithm [ ]. In order to compute unbiased estimates of future performance, a richer test set with more subjects is needed. Also, given more subjects, bootstrapping methods [ ] (i.e., sampling with replacement) could help to build more accurate estimators.
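A percentile-bootstrap estimate over per-subject scores could look as follows (a generic sketch; it only becomes meaningful once several test subjects exist, which is exactly the point made above):

```python
import numpy as np

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of
    per-subject scores (sampling subjects with replacement)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```

With a single test subject the interval degenerates to a point, which is one way to see why a one-subject benchmark cannot quantify uncertainty.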
4.3 | Other results
Some authors report local performance measures, such as the mean angular error [ ]. However, local metrics do not take into account compounding errors, which can have a major effect on the global structure. Consequently, global evaluation metrics should be preferred.
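The contrast is easy to demonstrate with toy values: a per-step angular error can be small while the streamline endpoint drifts far from its target. Both quantities are sketched below (our own illustration, not a metric from the reviewed papers):

```python
import numpy as np

def angular_error_deg(u, v):
    """Angle between two fibre orientations in degrees, ignoring sign
    (orientations are antipodally symmetric)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(np.degrees(np.arccos(np.clip(abs(float(u @ v)), 0.0, 1.0))))

def endpoint_drift(n_steps=100, bias_deg=5.0, step=1.0):
    """Compounding error in 2D: a constant small per-step angular bias
    (a 'good' local score) sends the endpoint far from the straight-line
    target (a 'bad' global outcome)."""
    theta, p = 0.0, np.zeros(2)
    for _ in range(n_steps):
        theta += np.radians(bias_deg)
        p = p + step * np.array([np.cos(theta), np.sin(theta)])
    return float(np.linalg.norm(p - np.array([n_steps * step, 0.0])))
```

A 5-degree bias per step, negligible point-wise, curls the trajectory into a loop and leaves the endpoint nowhere near the intended target, which is what global metrics capture and local ones do not.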
Tractography papers often report a visual evaluation on unseen, in vivo subjects as a qualitative evaluation. For example, Figures 3 and 4 compare some of the proposed data-driven approaches with standard tractography methods on white matter bundles with known anatomy. However, in the absence of a ground truth or the expertise of a neuroanatomist, it is hard to draw definitive conclusions about the quality of such results. In addition, Reisert et al. used plots to assess reproducibility, but only offered qualitative comparisons with the reference streamlines, without any quantitative results. To gain trust in these data-driven methods, a more rigorous approach is needed.

FIGURE 3  Comparison between the RF of Neher et al. (top row) and classical deterministic CSD streamline tractography (bottom row). Results obtained on HCP subject 992774. (Taken from [ ] with authorization from the authors.)

FIGURE 4  Comparison of various tracking methods. A: Deterministic; B: Deterministic Bundle-Specific (DET-BST); C: Probabilistic particle filter BST (PROB-PF-BST); D: DeepTracker. Results obtained on a BIL&GIN subject. (Taken from [ ] with authorization from the authors.)
Finally, most ML methods offer a reduction in computation time compared to traditional methods. This is a non-negligible benefit, should these methods be adopted in practice.
4.4 | Proposed guidelines for a data-driven tractography evaluation framework
Considering the ML tractography evaluation issues previously underlined, we discuss in this section the fundamental
elements of a better framework we believe the community should adopt in the upcoming years. We start with the
essential characteristics such a framework should have, followed by useful features.
First and foremost, an ideal data-driven tractography evaluation framework should come with a public and free-to-use dataset that anyone can easily rely on. The dataset should include images of real human acquisitions along with a careful expert selection of ground truth streamlines. It is important to avoid any bias towards a specific tractography algorithm. To achieve this, the streamlines could first be generated by a large number of different (and ideally orthogonal) deterministic, probabilistic and global algorithms, and then segmented by expert annotators according to strict anatomical definitions for a given number of bundles. While such manual annotation would be tedious, time-consuming and even error-prone, we consider it an indispensable step towards building a realistic and useful dataset for ML-based development. The need for such a gold standard that quantifies human variability is well known in other fields, such as automatic image segmentation, cell counting, or machine learning in general [ ]. Although simulated brain images come with a pixel-accurate set of ground truth streamlines that can be generated in a matter of seconds, synthetic diffusion signals are by definition over-simplistic pictures of real data and, as such, cannot provide any guarantee of subsequent performance for data-driven methods on real data.
Although there is no consensus regarding the most desirable features an ML tractography algorithm should have and how it should be evaluated, by its very nature, any ML evaluation framework should aim at measuring how faithfully an algorithm can reproduce the task it was trained for. As such, a reasonable dataset should include a sufficiently large number of well-separated training and testing images. Statistics resulting from such a dataset would then not suffer from contamination, and the reported metrics would be reliable and unbiased estimates of the true generalization power of an ML algorithm. In addition, to ensure that observed differences between multiple algorithms result from the intrinsic properties of the models and are not caused by some feature of the evaluation framework, the number of free variables should be reduced to a minimum. Consequently, the tracking masks and seeds should be provided together with clearly preprocessed diffusion data, so that the proposed methods can be evaluated under equal conditions. There should be multiple "classes" of input data, depending on whether an algorithm supports DWI samples, SH coefficients or fODF peaks. Furthermore, the diffusion signal should have the same statistical properties in the training and the testing set. Finally, the images should ideally be acquired on different MRI scanners with different acquisition protocols in order to avoid overfitting issues.
Evaluation metrics should also be bound to the purpose of tractography algorithms. Considering that tractography is mostly used for bundle reconstruction, tractometry studies and connectivity analyses, an ideal evaluation framework should include two sets of metrics: (1) metrics measuring how faithfully an ML method can reproduce a set of predefined bundles it was trained to recover (tractometry), and (2) metrics measuring how well it can connect matching regions of the brain, i.e., produce valid connections (connectivity). Furthermore, since many applications use tractography algorithms to produce a large number of streamlines (with many false positives), which are then filtered by a post-processing algorithm such as RecoBundles [ ], the framework should report results before and after post-processing. This would underline the true recall power of a data-driven algorithm, which is a fundamental characteristic of tract-based and connectivity-based applications.
Lastly, the size of an ideal dataset is of primary importance. While a small dataset could be prone to overfitting, a very large dataset would be costly to create, and coherent manual annotation would be difficult to ensure. One rule of thumb that can be used to identify the "correct" size of a dataset is the inspection of the learning curves of several ML models [ ]. These curves show model performance as a function of the training sample size. Typically, the performance of several models saturates beyond a sufficient dataset size. Although imperfect, this procedure is a good heuristic for estimating the required size of the dataset.
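Such a learning-curve check can be sketched generically; `train_and_score`, the size grid and the saturation tolerance below are all placeholder assumptions of ours:

```python
import numpy as np

def learning_curve(train_and_score, n_total, sizes=None):
    """Score a model at increasing training-set sizes.
    train_and_score(n) is assumed to train on the first n samples and
    return a test score; the log-spaced grid is an arbitrary choice."""
    if sizes is None:
        sizes = np.unique(np.logspace(0, np.log10(n_total), 6).astype(int))
    return [(int(n), train_and_score(int(n))) for n in sizes]

def saturated(curve, tol=0.01):
    """Heuristic from the text: the dataset is 'large enough' once the
    score stops improving by more than tol between consecutive sizes."""
    scores = [s for _, s in curve]
    return len(scores) >= 2 and abs(scores[-1] - scores[-2]) <= tol
```

In practice one would plot several models' curves together and grow the dataset until all of them flatten out.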
Other useful features
Despite any thorough manual annotation protocol, manually annotated bundles can be subject to non-negligible inter-rater and intra-rater variability. As such, a useful characteristic of an ML tractography dataset would be a measure of those variations. This could be obtained by having several experts annotate the dataset, with at least one expert annotating it twice or more. Such measures would provide a minimal bound beyond which a data-driven algorithm could be considered "as good as an expert". Another very useful tool would be an openly accessible online evaluation system. Given such a system, people could upload their test results in order to compare them with the test ground truth. In that way, an automatic ranking procedure similar to that of Kaggle could be used to sort various ML algorithms based on their achieved scores. While no ranking method is perfect, it would nonetheless provide a common evaluation framework that people could rely on.
An ideal dataset would also cover the whole field of diffusion MRI acquisition protocols, from HCP-like research acquisitions to clinical acquisitions. It would include single b-value as well as multiple b-value data, along with more sophisticated acquisition protocols such as b-tensor encoding. It would also need low-resolution images together with high-resolution images. Since data harmonization is also a problem for data-driven algorithms, acquisitions from several sites are needed for test-retest studies. Annotated pathological cases would complete the dataset by allowing careful preliminary studies of how ML-based methods can be relied on for unhealthy patients.
Finally, since tractography is used more and more in pre-clinical applications, a subset of manually annotated rodent or macaque brains would be of great interest to train and test future ML algorithms (like the 2018 VOTEM Challenge [ ]).
This is, of course, the ultimate wish list. But, in the era of open data and open science, it needs to be done by
the community, for the community. We can already see this work in progress with more and more accessible and
reproducible data being published every year.
5 | CONCLUSION
In this paper, we provided an exhaustive review of the current state of the art of machine learning methods in the field of tractography. We described the existing datasets that comprise both diffusion data and reference streamlines, which could generally be useful for new ML-based tracking methods. In particular, we thoroughly examined the widely used evaluation tool for data-driven tracking methods, the 2015 ISMRM Tractography Challenge, and detailed its flaws and shortcomings when used to assess data-driven algorithms. Based on our findings, we suggested good practices that we believe would foster the development of a new evaluation framework for ML-based tractography methods, with the potential to effectively advance this field of research.
There is no doubt that machine learning tractography will play an important role in the future in solving some of the open problems of tractography. At the moment, however, all existing methods show only theoretical potential, demonstrated in limited test cases. Methods have yet to make solid demonstrations of their performance and efficiency in practice. There is still no ML-based tractography tool that is scalable and usable on any given diffusion MRI dataset. This is true for healthy datasets, but even more so for pathological brains. Hence, it is fair to say that ML-based tractography is still in its infancy and not ready for "prime time", but it is nonetheless a very fertile field of research in which to make meaningful contributions to connectivity mapping.
 Jeurissen B, Descoteaux M, Mori S, Leemans A. Diffusion MRI ﬁber tractography of the brain. NMR in Biomedicine
 Yeh FC, Verstynen TD, Wang Y, Fernández-Miranda JC, Tseng WYI. Deterministic diffusion ﬁber tracking improved by
quantitative anisotropy. PloS one 2013;8(11):e80713.
 Basser PJ, Pajevic S, Pierpaoli C, Duda J, Aldroubi A. In vivo ﬁber tractography using DT-MRI data. Magnetic resonance
in medicine 2000;44(4):625–632.
 Behrens TE, Berg HJ, Jbabdi S, Rushworth MF, Woolrich MW. Probabilistic diffusion tractography with multiple ﬁbre
orientations: What can we gain? Neuroimage 2007;34(1):144–155.
 Tournier JD, Calamante F, Connelly A. Improved probabilistic streamlines tractography by 2nd order integration over
ﬁbre orientation distributions. In: Proceedings of the international society for magnetic resonance in medicine, vol. 18;
2010. p. 1670.
 Tournier JD, Calamante F, Connelly A. MRtrix: diffusion tractography in crossing ﬁber regions. International Journal of
Imaging Systems and Technology 2012;22(1):53–66.
 Reisert M, Mader I, Anastasopoulos C, Weigel M, Schnell S, Kiselev V. Global ﬁber reconstruction becomes practical.
 Mangin JF, Fillard P, Cointepas Y, Le Bihan D, Frouin V, Poupon C. Toward global tractography. Neuroimage 2013;80:290–
 Jbabdi S, Woolrich MW, Andersson JL, Behrens T. A Bayesian framework for global tractography. Neuroimage
 Pierpaoli C, Jezzard P, Basser PJ, Barnett A, Di Chiro G. Diffusion tensor MR imaging of the human brain. Radiology
 Caan MW, Khedoe HG, Poot DH, Arjan J, Olabarriaga SD, Grimbergen KA, et al. Estimation of diffusion properties in
crossing ﬁber bundles. IEEE transactions on medical imaging 2010;29(8):1504–1515.
 Tournier JD, Calamante F, Gadian DG, Connelly A. Direct estimation of the ﬁber orientation density function from
diffusion-weighted MRI data using spherical deconvolution. NeuroImage 2004;23(3):1176–1185.
 Descoteaux M, Deriche R, Knösche TR, Anwander A. Deterministic and probabilistic tractography based on complex
ﬁbre orientation distributions. IEEE transactions on medical imaging 2009 feb;28(2):269–86.
 Schilling KG, Daducci A, Maier-Hein K, Poupon C, Houde JC, Nath V, et al. Challenges in diffusion MRI tractography–
Lessons learned from international benchmark competitions. Magnetic resonance imaging 2018;.
 Maier-Hein KH, Neher PF, Houde JC, Côté MA, Garyfallidis E, Zhong J, et al. The challenge of mapping the human con-
nectome based on diffusion tractography. Nature communications 2017;8(1):1349.
 Côté MA, Girard G, Boré A, Garyfallidis E, Houde JC, Descoteaux M. Tractometer: towards validation of tractography
pipelines. Medical image analysis 2013;17(7):844–857.
 Girard G, Whittingstall K, Deriche R, Descoteaux M. Towards quantitative connectivity analysis: reducing tractography
biases. Neuroimage 2014;98:266–278.
 Neher PF, Côté MA, Houde JC, Descoteaux M, Maier-Hein KH. Fiber tractography using machine learning. NeuroImage
 Duru DG, Ozkan M. SOM Based Diffusion Tensor MR Analysis. In: Image and Signal Processing and Analysis, 2007. ISPA
2007. 5th International Symposium on IEEE; 2007. p. 403–406.
 Duru DG, Ozkan M. Self-organizing maps for brain tractography in MRI. In: Neural Engineering (NER), 2013 6th Interna-
tional IEEE/EMBS Conference on IEEE; 2013. p. 1509–1512.
 Neher PF, Götz M, Norajitra T, Weber C, Maier-Hein KH. A Machine Learning Based Approach to Fiber Tractography Using Classifier Voting. Springer, Cham; 2015. p. 45–52.
 Jörgens D, Smedby Ö, Moreno R. Learning a Single Step of Streamline Tractography Based on Neural Networks. In: Computational Diffusion MRI. Springer; 2018. p. 103–116.
 Wegmayr V, Giuliari G, Holdener S, Buhmann J. Data-driven fiber tractography with neural networks. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE; 2018. p. 1030–1033.
 Poulin P, Côté MA, Houde JC, Petit L, Neher PF, Maier-Hein KH, et al. Learn to Track: Deep Learning for Tractography. Springer, Cham; 2017. p. 540–547.
 Benou I, Riklin-Raviv T. DeepTract: A Probabilistic Deep Learning Framework for White Matter Fiber Tractography. arXiv preprint arXiv:1812.05129; 2018.
 Poulin P, Rheault F, St-Onge E, Jodoin PM, Descoteaux M. Bundle-Wise Deep Tracker: Learning to track bundle-speciﬁc
streamline paths. In: Proceedings of the International Society for Magnetic Resonance in Medicine ISMRM-ESMRMB;
 Wasserthal J, Neher PF, Maier-Hein KH. Tract orientation mapping for bundle-speciﬁc tractography. In: International
Conference on Medical Image Computing and Computer-Assisted Intervention Springer; 2018. p. 36–44.
 Lucena OASd, Deep Learning for Brain Analysis in MR Imaging. São Paulo, Brazil: [sn]; 2018. http://repositorio.
 Reisert M, Coenen VA, Kaller C, Egger K, Skibbe H. HAMLET: Hierarchical Harmonic Filters for Learning Tracts from Diffusion MRI. arXiv preprint arXiv:1807.01068; 2018.
 Wasserthal J, Neher P, Maier-Hein KH. TractSeg-Fast and accurate white matter tract segmentation. NeuroImage
 Koppers S, Merhof D. Direct Estimation of Fiber Orientations Using Deep Learning in Diffusion Imaging. Springer, Cham;
 Ngattai Lam PD, Belhomme G, Ferrall J, Patterson B, Styner M, Prieto JC. TRAFIC: Fiber Tract Classiﬁcation Using Deep
Learning. Proceedings of SPIE–the International Society for Optical Engineering 2018 feb;10574.
 Patil SM, Nigam A, Bhavsar A, Chattopadhyay C. Siamese LSTM based Fiber Structural Similarity Network (FS2Net) for Rotation Invariant Brain Tractography Segmentation; 2017.
Gupta V, Thomopoulos SI, Rashid FM, Thompson PM. FiberNET: An Ensemble Deep Learning Framework for Clustering White Matter Fibers. Springer, Cham; 2017. p. 548–555.
Gupta V, Thomopoulos SI, Corbin CK, Rashid F, Thompson PM. FIBERNET 2.0: An automatic neural network based tool for clustering white matter fibers in the brain. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE; 2018. p. 708–711.
Thomas C, Frank QY, Irfanoglu MO, Modi P, Saleem KS, Leopold DA, et al. Anatomical accuracy of brain connections derived from diffusion MRI tractography is inherently limited. Proceedings of the National Academy of Sciences.
Pujol S, Wells W, Pierpaoli C, Brun C, Gee J, Cheng G, et al. The DTI challenge: toward standardized evaluation of diffusion tensor imaging tractography for neurosurgery. Journal of Neuroimaging 2015;25(6):875–882.
 Yeh FC, Panesar S, Fernandes D, Meola A, Yoshino M, Fernandez-Miranda JC, et al. Population-averaged atlas of the
macroscale human structural connectome and its network topology. NeuroImage 2018;178:57–68.
Zhang F, Wu Y, Norton I, Rigolo L, Rathi Y, Makris N, et al. An anatomically curated fiber clustering white matter atlas for consistent white matter tract parcellation across the lifespan. NeuroImage 2018;.
Thon A, Teichgräber U, Tennstedt-Schenk C, Hadjidemetriou S, Winzler S, Malich A, et al. Computer aided detection in prostate cancer diagnostics: A promising alternative to biopsy? A retrospective study from 104 lesions with histological ground truth. PLOS ONE 2017 Oct;12(10):1–21.
Cleveland Clinic. Alzheimer's Disease: Overview of Diagnostic Tests; 2014. [Online; accessed 3-January-2019]. https://my.clevelandclinic.org/health/diagnostics/9176-alzheimers-disease-overview-of-diagnostic-tests/.
Bernard O, Bosch JG, Heyde B, Alessandrini M, Barbosa D, Camarasu-Pop S, et al. Standardized evaluation system for left ventricular segmentation algorithms in 3D echocardiography. IEEE transactions on medical imaging 2016;35(4):967–.
 Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, et al. The multimodal brain tumor image segmenta-
tion benchmark (BRATS). IEEE transactions on medical imaging 2015;34(10):1993.
 Fillard P, Descoteaux M, Goh A, Gouttard S, Jeurissen B, Malcolm J, et al. Quantitative evaluation of 10 tractography
algorithms on a realistic diffusion MR phantom. Neuroimage 2011;56(1):220–234.
 Wilkins B, Lee N, Singh M. Development and evaluation of a simulated FiberCup phantom. In: International Symposium
on Magnetic Resonance in Medicine (ISMRM’12); 2012. p. 1938.
Yendiki A, Panneck P, Srinivasan P, Stevens A, Zöllei L, Augustinack J, et al. Automated probabilistic reconstruction of white-matter pathways in health and disease using an atlas of the underlying anatomy. Frontiers in neuroinformatics.
Daducci A, Canales-Rodríguez EJ, Descoteaux M, Garyfallidis E, Gur Y, Lin YC, et al. Quantitative comparison of reconstruction methods for intra-voxel fiber recovery from diffusion MRI. IEEE transactions on medical imaging.
Daducci A, Caruyer E, Descoteaux M, Houde J, Thiran J. HARDI reconstruction challenge 2013. In: Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), San Francisco, CA; 2013.
Chenot Q, Tzourio-Mazoyer N, Rheault F, Descoteaux M, Crivello F, Zago L, et al. A population-based atlas of the human pyramidal tract in 410 healthy participants. Brain Structure and Function 2018; p. 1–14.
Rheault F, St-Onge E, Sidhu J, Maier-Hein K, Tzourio-Mazoyer N, Petit L, et al. Bundle-specific tractography with incorporated anatomical and orientational priors. NeuroImage 2018;.
 Poupon C, Laribiere L, Tournier G, Bernard J, Fournier D, Fillard P, et al. A diffusion hardware phantom looking like a
coronal brain slice. In: Proceedings of the International Society for Magnetic Resonance in Medicine, vol. 18; 2010. p.
 Neher PF, Laun FB, Stieltjes B, Maier-Hein KH. Fiberfox: facilitating the creation of realistic white matter software
phantoms. Magnetic resonance in medicine 2014;72(5):1460–1470.
 White T, Magnotta VA, Bockholt HJ, Williams S, Wallace S, Ehrlich S, et al. Global white matter abnormalities in
schizophrenia: a multisite diffusion tensor imaging study. Schizophrenia bulletin 2009;37(1):222–232.
 Garyfallidis E, Côté MA, Rheault F, Sidhu J, Hau J, Petit L, et al. Recognition of white matter bundles using local and global
streamline-based registration and clustering. NeuroImage 2018;170:283–295.
 Mazoyer B, Mellet E, Perchey G, Zago L, Crivello F, Jobard G, et al. BIL&GIN: a neuroimaging, cognitive, behavioral, and
genetic database for the study of human brain lateralization. Neuroimage 2016;124:1225–1231.
 Van Essen DC, Smith SM, Barch DM, Behrens TE, Yacoub E, Ugurbil K, et al. The WU-Minn human connectome project:
an overview. Neuroimage 2013;80:62–79.
 Glasser MF, Sotiropoulos SN, Wilson JA, Coalson TS, Fischl B, Andersson JL, et al. The minimal preprocessing pipelines
for the Human Connectome Project. Neuroimage 2013;80:105–124.
 Wassermann D, Makris N, Rathi Y, Shenton M, Kikinis R, Kubicki M, et al. The white matter query language: a novel
approach for describing human white matter anatomy. Brain Structure and Function 2016;221(9):4705–4721.
Stieltjes B, Brunner RM, Fritzsche K, Laun F. Diffusion tensor imaging: introduction and atlas. Springer Science & Business Media; 2013.
Garyfallidis E, Brett M, Correia MM, Williams GB, Nimmo-Smith I. QuickBundles, a method for tractography simplification. Frontiers in neuroscience 2012;6:175.
 Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC medical
Reddy CP, Rathi Y. Joint multi-fiber NODDI parameter estimation and tractography using the unscented information filter. Frontiers in neuroscience 2016;10:166.
MITK. MITK Diffusion Imaging; 2018. [Online; accessed 3-January-2019]. http://www.mitk.org/wiki/
 Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Medical
Image Computing and Computer-Assisted Intervention (MICCAI), vol. 9351 of LNCS; 2015. p. 234–241.
 Manjón JV, Coupé P, Concha L, Buades A, Collins DL, Robles M. Diffusion weighted image denoising using overcomplete
local PCA. PloS one 2013;8(9):e73021.
Kaufman S, Rosset S, Perlich C, Stitelman O. Leakage in data mining: Formulation, detection, and avoidance. ACM Transactions on Knowledge Discovery from Data (TKDD) 2012;6(4):15.
Aganj I, Lenglet C, Sapiro G. ODF reconstruction in q-ball imaging with solid angle consideration. In: Biomedical Imaging: From Nano to Macro, 2009. ISBI'09. IEEE International Symposium on. IEEE; 2009. p. 1398–1401.
Raudys SJ, Jain AK. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Transactions on Pattern Analysis & Machine Intelligence 1991;(3):252–264.
 Efron B, Tibshirani RJ. An introduction to the bootstrap. CRC press; 1994.
Kleesiek J, Petersen J, Döring M, Maier-Hein K, Köthe U, Wick W, et al. Virtual raters for reproducible and objective assessments in radiology. Scientific reports 2016;6:25007.
 Entis JJ, Doerga P, Barrett LF, Dickerson BC. A reliable protocol for the manual segmentation of the human amygdala
and its subregions using ultra-high resolution MRI. Neuroimage 2012;60(2):1226–1235.
Boccardi M, Bocchetta M, Apostolova LG, Barnes J, Bartzokis G, Corbetta G, et al. Delphi definition of the EADC-ADNI Harmonized Protocol for hippocampal segmentation on magnetic resonance. Alzheimer's & Dementia 2015;11(2):126–.
 Piccinini F, Tesei A, Paganelli G, Zoli W, Bevilacqua A. Improving reliability of live/dead cell counting through automated
image mosaicing. Computer methods and programs in biomedicine 2014;117(3):448–463.
 Beleites C, Neugebauer U, Bocklitz T, Krafft C, Popp J. Sample size planning for classiﬁcation models. Analytica Chimica