Automated SSIM Regression for Detection and
Quantification of Motion Artefacts in Brain MR
Images
Alessandro Sciarra, Soumick Chatterjee, Max Dünnwald, Giuseppe Placidi, Andreas Nürnberger, Oliver Speck and Steffen Oeltze-Jafra
Abstract—Motion artefacts in magnetic resonance brain images are a crucial issue. The assessment of MR image quality is fundamental before proceeding with the clinical diagnosis. If motion artefacts alter a correct delineation of structures and substructures of the brain, lesions, tumours and so on, the patient needs to be re-scanned. Otherwise, neuro-radiologists could report an inaccurate or incorrect diagnosis. The first step right after scanning a patient is therefore the "image quality assessment", in order to decide whether the acquired images are diagnostically acceptable. An automated image quality assessment based on the regression of the structural similarity index (SSIM) through a residual neural network is proposed here, with the additional possibility of classifying images into groups defined by SSIM sub-ranges. The method predicts the SSIM value of an input image in the absence of a reference ground-truth image. The networks were able to detect motion artefacts, and the best performance for both the regression and classification tasks was consistently achieved by ResNet-18 with contrast augmentation. The mean and standard deviation of the residual distribution were µ = −0.0009 and σ = 0.0139, respectively, while for the classification task with 3, 5 and 10 classes the best accuracies were 97, 95 and 89%, respectively. The obtained results show that the proposed method could be a tool to support neuro-radiologists and radiographers in evaluating image quality before diagnosis.
Index Terms—Motion artefacts, SSIM, image quality as-
sessment, ResNet, regression, classification.
I. INTRODUCTION
Image quality assessment (IQA) is a fundamental step in the evaluation of MR images [1]–[3].
Alessandro Sciarra, Max Dünnwald and Steffen Oeltze-Jafra are with the Medicine and Digitalization - MedDigit group, Department of Neurology, Medical Faculty, University Hospital Magdeburg, Germany
Alessandro Sciarra, Soumick Chatterjee and Oliver Speck are with
the Department of Biomedical Magnetic Resonance, Faculty of Natural
Sciences, Otto von Guericke University, Germany
Soumick Chatterjee and Andreas Nürnberger are with the Data and Knowledge Engineering Group, Faculty of Computer Science, Otto von Guericke University, Germany
Max Dünnwald is with the Faculty of Computer Science, Otto von Guericke University, Germany
Giuseppe Placidi is with the Department of Life, Health, and Environ-
mental Sciences, University of L’Aquila, Italy
Andreas Nürnberger, Oliver Speck and Steffen Oeltze-Jafra are with the Center for Behavioral Brain Sciences, Magdeburg, Germany
Oliver Speck and Steffen Oeltze-Jafra are with the German Center for
Neurodegenerative Disease, Magdeburg, Germany
The main purpose of this process is to determine whether the image quality guarantees that the images are diagnostically reliable and free from artefacts, in order to avoid a possibly unreliable diagnosis [4], [5]. The evaluation process often requires time and is also subjectively dependent on the observer in charge of carrying it out [6]. Furthermore, different levels of expertise and experience of the readers (the experts designated to perform the IQA) could lead to inconsistent assessments. Another intrinsic issue of IQA for MR images is the absence of a reference image. No-reference IQA techniques, with and without the support of machine and deep learning, have been proposed in recent years for the evaluation of visual image quality [3], [4], [7]–[14]. These techniques are able
to detect and quantify the level of blurriness or corruption
with different levels of accuracy and precision. However, there are many factors to take into consideration when choosing which technique to apply; the most important are [15]–[17]: data requirements - deep learning requires a large dataset, while traditional (non-deep-learning) machine learning techniques can be trained on less data; accuracy - deep learning generally provides higher accuracy than traditional machine learning; training time - deep learning takes longer to train than traditional machine learning; and hyperparameter tuning - deep learning can be tuned in many different ways, and it is not always possible to find the best parameters, while traditional machine learning offers limited tuning capabilities. In addition, when choosing traditional machine learning techniques, the fundamental step of feature extraction must be considered.
Although the list of traditional machine learning and deep
learning techniques used for regression and classification tasks
is constantly updated [18]–[21], there is still no gold standard
IQA for MR images [2]. The aim of this work is to create an automated IQA tool that is able to detect the presence of motion artefacts and quantify the level of corruption or distortion relative to an "artefact-free" counterpart, based on the regression of the structural similarity index (SSIM) [22]. The tool has been designed to work for a large variety of MR image contrasts, such as T1-, T2-, PD- and FLAIR-weighted images, and independently of the resolution and orientation of the considered image. Additionally, a contrast augmentation step has been introduced in order to increase the range of variability of the weighting. In practice, when MRIs are acquired, if there are any artefacts in
the image, "artefact-free" counterparts are not available to compare the image against for quality assessment. However, the SSIM calculation always requires two images (the corrupted image and its motion-artefact-free counterpart). For this reason, in this work the corrupted images were artificially created using two different algorithms: one implemented by Shaw et al. [23] (part of the TorchIO library [24]) and a second algorithm developed in-house [25]. Furthermore, when training a neural network model in a fully-supervised manner, as in this case, a large amount of labelled or annotated data is typically required [26]. Here, the regression labels for training were created by comparing the artificially corrupted images against the original artefact-free images using SSIM; those SSIM values then served as the regression labels.
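For reference, the SSIM between a corrupted image $x$ and its artefact-free reference $y$ is defined in [22] as

$$\mathrm{SSIM}(x,y)=\frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2+\mu_y^2+C_1)(\sigma_x^2+\sigma_y^2+C_2)},$$

where $\mu_x$ and $\mu_y$ are local means, $\sigma_x$ and $\sigma_y$ local standard deviations, $\sigma_{xy}$ the local cross-covariance, and $C_1$, $C_2$ small constants that stabilise the division; the index is computed over local windows and averaged, yielding 1 for identical images.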
II. METHODOLOGY
The proposed automatic IQA tool relies on residual neural networks (ResNet) [27], [28]. Two different versions of ResNet were used, with 18 (ResNet-18) and 101 (ResNet-101) layers. Each model was trained twice: with and without the contrast augmentation step. The following steps are executed during training (Figure 1):
1) Given a 3D input volume, one random slice (2D image) is selected from one of the possible orientations: axial, sagittal, or coronal. In the case of an anisotropic volume, the slice is selected only along the original acquisition orientation.
2) If contrast augmentation is enabled, one of the following algorithms is selected randomly and applied to the input image:
•Random gamma adjustment [29]
•Random logarithmic adjustment [30]
•Random sigmoid adjustment [31]
•Random adaptive histogram adjustment [32]
3) Motion corruption is applied to the 2D image using one of these two options:
•TorchIO [23], [24], Figure 2 (a)
•in-house algorithm, Figure 2 (b)
4) The SSIM is calculated between the 2D input image and the corresponding corrupted one.
5) The calculated SSIM value and the corrupted image are passed to the chosen model for training (a minimal sketch of this pipeline is shown below).
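The following is a minimal Python sketch of this sample-generation pipeline. It assumes scikit-image for the contrast augmentations and TorchIO's RandomMotion as the corruption step; the function name, parameter names and the gamma range are illustrative assumptions, not the paper's actual implementation (which also includes the in-house corruption algorithm [25]).

```python
import random

import numpy as np
import torch
import torchio as tio
from skimage import exposure
from skimage.metrics import structural_similarity


def make_training_sample(volume, anisotropic=False, augment=True):
    """Build one (corrupted slice, SSIM label) pair; names are illustrative."""
    # 1) random slice from a random orientation; for anisotropic volumes only
    #    the acquisition orientation (assumed here to be the last axis) is used
    axis = 2 if anisotropic else random.randrange(3)
    index = random.randrange(volume.shape[axis])
    img = np.take(volume, index, axis=axis).astype(np.float32)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # normalise to [0, 1]

    # 2) one randomly chosen contrast augmentation (gamma, log, sigmoid, CLAHE)
    if augment:
        img = random.choice([
            lambda x: exposure.adjust_gamma(x, gamma=random.uniform(0.5, 2.0)),
            exposure.adjust_log,
            exposure.adjust_sigmoid,
            exposure.equalize_adapthist,
        ])(img)

    # 3) motion corruption: TorchIO's RandomMotion, applied here to the slice
    #    as a single-slice pseudo-volume, stands in for the two options above
    tensor = torch.from_numpy(img.astype(np.float32))[None, ..., None]  # (C, H, W, 1)
    corrupted = tio.RandomMotion()(tensor)[0, ..., 0].numpy()

    # 4) the SSIM between clean and corrupted slice is the regression label
    label = structural_similarity(img, corrupted, data_range=1.0)
    return corrupted, label
```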
Three datasets - train, validation, and test sets - were used for this work (Table I). For training, 200 volumes were used, while 50 were used for validation and 50 for testing. The first group of 68 volumes was selected from the public IXI dataset¹, the second group (Table I, Site-A) of 114 volumes was acquired with a 3T scanner, the third group (Table I, Site-B) of 93 volumes was acquired at 7T, and a final group (Table I, Site-C) of 25 volumes was acquired with different scanners (1.5 and 3T). The volumes from IXI, Site-A, and Site-B were resampled to an isotropic resolution of 1.00 mm³.
¹Dataset available at: https://brain-development.org/ixi-dataset/
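Such a resampling step could be done, for instance, with TorchIO [24]; a minimal sketch with hypothetical file names:

```python
import torchio as tio

# Resample a volume to 1.00 mm isotropic resolution
volume = tio.ScalarImage('subject_volume.nii.gz')
resampled = tio.Resample(1.0)(volume)
resampled.save('subject_volume_1mm.nii.gz')
```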
Fig. 1. Graphical illustration of all training steps, as explained in Section II.
The loss during training was calculated using the mean squared error (MSE) [33] and was optimised using the Adam optimiser [34] with a learning rate of 10⁻³ and a batch size of 100 for 2000 epochs. All images (during training, validation, and testing) were normalised, and resized or padded to a 2D matrix size of 256×256.
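A minimal sketch of this training configuration is given below. The adaptation of the first convolution to single-channel slices is an assumption, and `train_loader` is a hypothetical DataLoader yielding batches of 100 corrupted slices with their SSIM labels.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=1)                 # scalar output: the SSIM value
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                          # MSE loss [33]

for epoch in range(2000):
    for slices, ssim_labels in train_loader:    # slices: (100, 1, 256, 256)
        optimiser.zero_grad()
        predicted = model(slices).squeeze(1)    # (100,) predicted SSIM values
        loss = loss_fn(predicted, ssim_labels)
        loss.backward()
        optimiser.step()
```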
For testing, a total of 10,000 images were randomly drawn from the 50 volumes of the test dataset and processed as during the training stage: random orientation selection, contrast augmentation, and finally corruption.
In order to evaluate the performance of the trained models, the predicted SSIM values were first plotted against the ground-truth SSIM values, as shown in Figure 3; next, the residuals were calculated as $\mathrm{Residuals} = \mathrm{SSIM}_{predicted} - \mathrm{SSIM}_{ground\,truth}$ (Figure 4).
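In code, the residuals and the normal fit later applied to them (Section III) can be sketched as follows; `ssim_predicted` and `ssim_ground_truth` are hypothetical arrays holding the model outputs and the labels for the test images.

```python
import numpy as np
from scipy import stats

residuals = np.asarray(ssim_predicted) - np.asarray(ssim_ground_truth)
mu, sigma = stats.norm.fit(residuals)   # maximum-likelihood normal fit (SciPy [35])
print(f"mu = {mu:.4f}, sigma = {sigma:.4f}")
```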
TABLE I
DATA FOR TRAINING, VALIDATION AND TESTING.
Data Weighting Volumes Matrix Size Resolution (mm³)
m(M) x m(M) x m(M)† m(M) x m(M) x m(M)†
TRAINING
IXI T1,T2,PD 15,15,15 230(240)x230(240)x134(162) 1.00 isotropic
Site-A T1,T2,PD,FLAIR 20,20,20,20 168(168)x224(224)x143(144) 1.00 isotropic
Site-B T1,T2,FLAIR 20,20,20 156(156)x224(224)x100(100) 1.00 isotropic
Site-C T1 3 192(512)x256(512)x36(256) 0.45(1.00)x0.45(0.98)x0.98(4.40)
Site-C T2 11 192(640)x192(640)x32(160) 0.42(1.09)x0.42(1.09)x1.00(4.40)
Site-C FLAIR 1 320x320x34 0.72x0.72x4.40
VALIDATION
IXI T1,T2,PD 1,5,7 230(240)x230(240)x134(162) 1.00 isotropic
Site-A T1,T2,PD,FLAIR 4,4,4,4 168(168)x224(224)x143(144) 1.00 isotropic
Site-B T1,T2,FLAIR 6,6,4 156(156)x224(224)x100(100) 1.00 isotropic
Site-C T1 3 176(240)x240(256)x118(256) 1.00 isotropic
Site-C T2 1 240x320x80 0.80x0.80x2.00
Site-C PD 1 240x320x80 0.80x0.80x2.00
TESTING
IXI T1,T2,PD 2,4,4 230(240)x230(240)x134(162) 1.00 isotropic
Site-A T1,T2,PD,FLAIR 6,4,4,4 168(168)x224(224)x143(144) 1.00 isotropic
Site-B T1,T2,FLAIR 6,6,5 156(156)x224(224)x100(100) 1.00 isotropic
Site-C T1 2 288(320)x288(320)x35(46) 0.72(0.87)x0.72(0.87)x3.00(4.40)
Site-C T2 2 320(512)x320(512)x34(34) 0.44(0.72)x0.45(0.72)x4.40(4.40)
Site-C FLAIR 1 320x320x35 0.70x0.70x4.40
†: "m" indicates the minimum value while "M" the maximum.
Fig. 2. Samples of artificially corrupted images. Left column: original images; right column: corrupted images. (a): image corrupted using the TorchIO library; (b): image corrupted using the in-house algorithm.
The predicted SSIM value of an image can be considered equivalent to a measure of the distortion or corruption level of the image. However, when applying this approach to a real clinical case, it is challenging to compare this value with a subjective assessment. To get around this problem, the regression task was simplified into a classification task. To this end, three different experiments were performed with different numbers of classes: 3, 5 and 10. In each case, the SSIM range [0-1] was divided into equal sub-ranges. For instance, in the case of 3 classes there were three sub-ranges: class 1 [0.00-0.33], class 2 [0.34-0.66] and class 3 [0.67-1.00]. The same procedure was followed to create the 5 and 10 classes.
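A small sketch of this equal-width binning, with a hypothetical helper name:

```python
import numpy as np

def ssim_to_class(ssim, n_classes=3):
    """Map an SSIM value in [0, 1] to a 1-based class over equal sub-ranges."""
    return int(min(np.floor(ssim * n_classes), n_classes - 1)) + 1

ssim_to_class(0.25, n_classes=3)    # -> 1, i.e. the sub-range [0.00-0.33]
ssim_to_class(0.75, n_classes=10)   # -> 8, i.e. the sub-range [0.70-0.80]
```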
A second dataset, comprising randomly selected images from clinical acquisitions, was also used for testing the trained models. This dataset contained five subjects, each with a different number of scans, as shown in Table II. In this case, there were no ground-truth reference images, and for this reason the images were also subjectively evaluated by one expert using the following classification scheme: class 1 - images with good to high quality, which may have minor motion artefacts that do not alter structures and substructures of the brain (SSIM range between 0.85 and 1.00); class 2 - images with sufficient to good quality, which can have motion artefacts that hamper a correct delineation of the brain structures, substructures or lesions (SSIM range between 0.60 and 0.85); and class 3 - images with insufficient quality, for which a re-scan is required (SSIM range between 0.00 and 0.60). Additionally, this dataset contained contrasts not included in the training, such as diffusion-weighted images (DWI).
III. RESULTS
The results for the first part, the regression task, are presented in Figures 3 and 4. Figure 3 shows a scatter plot in which the predicted SSIM values are compared against the ground-truth values, together with the linear fit performed for each trained model. Finally, the distributions of the ground-truth and predicted SSIM values are also shown. Figure 3 presents a general comparison across all the trained models and their qualitative dispersion levels.
TABLE II
CLINICAL DATA
Data Weighting Volumes Matrix Size Resolution (mm³)
m(M) x m(M) x m(M)† m(M) x m(M) x m(M)†
Subj. 1 T1,T2,FLAIR 1,4,2 130(560)x256(560)x26(256) 0.42(1.00)x0.42(0.94)x0.93(4.40)
Subj. 2 T2 3 288(320)x288(320)x28(28) 0.76(0.81)x0.76(0.81)x5.50(5.50)
Subj. 3 T1,T2,FLAIR,DWI,(§) 1,2,1,4,1 256(640)x256(640)x32(150) 0.42(0.90)x0.42(0.90)x0.45(4.40)
Subj. 4 T2, FLAIR, DWI 1,2,6 144(512)x144(512)x20(34) 0.45(1.40)x0.45(1.40)x2.00(4.40)
Subj. 5 T2, FLAIR, DWI 3,1,4 256(640)x256(640)x28(42) 0.40(1.09)x0.40(1.09)x3.30(6.20)
†: "m" indicates the minimum value while "M" the maximum.
Fig. 3. Scatter plot for the regression task, also showing the linear fits for each group of data. Top: distribution of ground-truth SSIM values; right: distributions of predicted SSIM values for each group of data.
In this case, the term dispersion indicates how much the predicted SSIM values deviate from the identity line $\mathrm{SSIM}_{predicted} = \mathrm{SSIM}_{ground\,truth}$. In Figure 4, on the other hand, the results are shown separately for each model using scatter plots, together with the corresponding residual distributions as defined in Section II. For the residual distributions, a further normal distribution fit was carried out using the Python package SciPy [35]. The calculated mean and standard deviation values are shown in Figure 4. According to this statistical analysis, the model with the smallest standard deviation (σ = 0.0139) and the mean value closest to zero (µ = −0.0009) was ResNet-18 trained with contrast augmentation, while the model with the mean value farthest from zero and the largest standard deviation was ResNet-101 trained without contrast augmentation. A clear effect of contrast augmentation can be seen for both ResNet-18 and ResNet-101: a reduction of the standard deviation, which visually correlates with a lower dispersion level in the scatter plots.
The results for the classification task are shown in Figure 5 and Table III. Figure 5 shows the logarithmic confusion matrices obtained for the classification task. From the matrices, it can be noted that all the trained models performed well and in a similar way. In particular, none of the matrices presents non-zero elements far from the diagonal, only in the cells neighbouring it, as commonly expected from such a classification task. Table III is complementary to Figure 5: it shows the class-wise, macro-average and weighted-average precision, recall, and f1-score for all the trained models, as well as the accuracy. For all three scenarios (3, 5 and 10 classes, as presented in Section II), the model with the best results is once again ResNet-18 trained with contrast augmentation. This model always obtained the highest accuracy: 97, 95 and 89% for the 3-, 5-, and 10-class scenarios, respectively. Even though ResNet-18 with contrast augmentation performed better than the other models, no substantial differences can be discerned from the tabular data. Once again, however, an improvement in performance can be observed when contrast augmentation is applied.
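The class-wise metrics of Table III and the confusion matrices of Figure 5 can be produced with scikit-learn [36], which the paper uses for these averages; a sketch with hypothetical array names holding the binned ground-truth and predicted SSIM classes:

```python
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred, digits=2))   # precision/recall/f1,
                                                         # macro and weighted avg
cm = confusion_matrix(y_true, y_pred, normalize='true')  # row-normalised matrix
```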
The results regarding the clinical data samples are shown in Figure 6. In this case, the obtained SSIM predictions are shown per slice for each model, overlaid with the subjective scores and grouped by subject. As introduced in Section II, the subjective ratings for the clinical data samples were assigned to class 1, 2 or 3 after a careful visual evaluation. If the prediction obtained with a model falls within the class assigned by the subjective evaluation, there is agreement between the objective and subjective evaluations; when the objective prediction lies outside the class assigned by the expert, the two assessments disagree. The percentage of agreement between subjective and objective analysis is 76.6 ± 0.8% (mean ± standard deviation), with a minimum value of 75.5% achieved by ResNet-101 without contrast augmentation and a maximum of 77.7% by ResNet-101 with contrast augmentation.
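A sketch of this agreement computation under the class boundaries of Section II, with hypothetical names:

```python
import numpy as np

# SSIM ranges of the subjective classes defined in Section II
CLASS_RANGES = {1: (0.85, 1.00), 2: (0.60, 0.85), 3: (0.00, 0.60)}

def agreement_rate(predicted_ssim, expert_classes):
    """Percentage of slices whose predicted SSIM lies in the expert's range."""
    hits = [CLASS_RANGES[c][0] <= p <= CLASS_RANGES[c][1]
            for p, c in zip(predicted_ssim, expert_classes)]
    return 100.0 * np.mean(hits)
```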
IV. DISCUSSION
The performances of the trained models on the regression task were very similar. However, both ResNet-18 and ResNet-101 showed a distinct improvement when coupled with contrast augmentation. Looking at the residual distributions of the errors, for both models contrast augmentation brought the mean values closer to zero and decreased the standard deviations by factors of ≈1.50 and ≈1.44 for ResNet-18 and ResNet-101, respectively. The reduction of the standard deviations is also quite evident in the scatter plots, where the dispersion level is visibly lower when contrast augmentation is applied.
Considering the classification task, the first notable observation is that the accuracy decreases as the number of classes increases: 97, 95 and 89%.
Fig. 4. Scatter plots of predicted against ground-truth SSIM values and residual distributions for (a) ResNet-18 without contrast augmentation, (b) ResNet-18 with contrast augmentation, (c) ResNet-101 without contrast augmentation and (d) ResNet-101 with contrast augmentation.
This can be explained by the fact that, as the number of classes increases, so does the difficulty for each model to classify the image into the correct pre-defined range of SSIM values. The confusion matrices confirm this behaviour through the increase of the out-of-diagonal values: considering ResNet-18 without contrast augmentation, for the three-class task the maximum out-of-diagonal value is 0.04 (for class 2 and class 3), while for the ten-class task the maximum value is 0.50 (for class 1). This implies that ResNet-18 without contrast augmentation, when performing the ten-class classification task, misclassifies 50% of the tested class-1 images. When contrast augmentation is applied, there is an apparent reduction of wrongly classified class-1 images. Although this is the general trend observed in Figure 5, there are also contradictory results; for example, looking at class 1 of the five-class task, again comparing ResNet-18 without and with contrast augmentation, there is a net increase of erroneously classified class-1 images, from 9 to 21% of the tested images.
The final application on clinical data also provided satisfactory results, with a maximum agreement rate of 77.7% between the objective and subjective assessments. A direct comparison with the previous three-class classification task is not possible due to the different subjective scheme selected (Section II). Although there is a visible reduction in performance when the trained models are applied to clinical data, this can be justified by several factors. First of all, the clinical data sample included types of image data, such as diffusion acquisitions and derived diffusion maps, that were never seen by the models during training; secondly, the artificially created motion artefacts cannot cover the endless variety of motion artefacts that can appear in a truly motion-corrupted MR image. A possible improvement could be obtained by introducing new contrasts, resolutions and orientations into the training set; for example, oblique acquisitions were not considered in this work. In addition, the artificial corruption methods used for this work can be further improved, e.g., by including corruption algorithms based on motion log information recorded by a tracking device, as commonly used for prospective motion correction [37]–[39]. However, this would require the availability of raw MR data, and the computational time to corrupt the images in this way, considerably longer than for the current approaches, must also be taken into account. Another point to consider for the subjective assessment is the bias introduced by each expert while evaluating the image quality. In this work, the expert's perception of image quality is emulated with good accuracy, 76.6 ± 0.8%, which, however, cannot be considered a standard reference.
Fig. 5. Confusion matrices for the classification task. First row: 3 classes; second row: 5 classes; third row: 10 classes. The columns correspond to (a) ResNet-18 without contrast augmentation, (b) ResNet-18 with contrast augmentation, (c) ResNet-101 without contrast augmentation and (d) ResNet-101 with contrast augmentation, respectively.
Although the subjective assessment can be repeated with the help of several experts, there will always be differences between them, e.g., in years of experience or in sensitivity to the presence of motion artefacts in the assessed image. It is also noteworthy that the SSIM ranges defined for the three classes could be re-defined following a different scheme. In the scenario explored in this paper, the scheme was defined by making use of the artificially corrupted images and the ground-truth images; this allowed an exact calculation of the SSIM values, and it was simple to define ranges that visually agree with the scheme defined in Section II.
V. CONCLUSION
This research presents an SSIM-regression-based IQA technique using ResNet models, coupled with contrast augmentation to make them robust against changes in image contrast in clinical scenarios. The method managed to predict the SSIM values of artificially motion-corrupted images without the ground-truth (motion-free) images with high accuracy (residual SSIM as low as −0.0009 ± 0.0139). Moreover, the motion classes obtained from the predicted SSIMs were very close to the true ones, achieving a maximum accuracy of 89% for the ten-class scenario and of 97% when the number of classes was three (Table III). Considering the complexity of quantifying the image degradation level due to motion artefacts, and additionally the variability in contrast type, resolution, etc., the obtained results are very promising. Further evaluations, including multiple subjective evaluations, will be performed on clinical data to judge the method's clinical applicability and robustness against changes in real-world scenarios. In addition, further trainings will be carried out with a larger variety of images, including common clinical routine acquisitions such as diffusion-weighted and time-of-flight imaging. Furthermore, it would be beneficial to also include images acquired at lower magnetic field strengths (< 1.5 T). Considering the results obtained by the ResNet models in this work, it is reasonable to expect that future work can also target different anatomical regions, focusing, for instance, on abdominal or cardiac imaging. However, the reproduction of realistic-looking motion artefacts plays a key role in the performance of deep learning models trained as reference-less image quality assessment tools.
TABLE III
RESULTS FOR THE CLASSIFICATION TASK. THE CLASSIFICATION TASK HAS BEEN PERFORMED THREE TIMES, CONSIDERING 3, 5 AND 10 CLASSES, RESPECTIVELY. "PREC." IS THE ABBREVIATION OF PRECISION, WHILE "MACRO AVG" CORRESPONDS TO THE MACRO AVERAGE AND "WEIGHT. AVG" TO THE WEIGHTED AVERAGE, CALCULATED USING THE PYTHON PACKAGE SCIKIT-LEARN [36]. (a) IS FOR RESNET-18 WITHOUT CONTRAST AUGMENTATION, (b) FOR RESNET-18 WITH CONTRAST AUGMENTATION, (c) FOR RESNET-101 WITHOUT CONTRAST AUGMENTATION, AND (d) FOR RESNET-101 WITH CONTRAST AUGMENTATION.
(a) (b) (c) (d)
Class (SSIM) Prec. Recall f1-score Prec. Recall f1-score Prec. Recall f1-score Prec. Recall f1-score Support
1 [0.00 - 0.33] 0.94 0.97 0.95 0.93 0.97 0.95 0.93 0.98 0.96 0.97 0.89 0.93 117
2 [0.33 - 0.66] 0.95 0.96 0.95 0.97 0.96 0.96 0.94 0.97 0.95 0.98 0.94 0.96 4307
3 [0.66 - 1.00] 0.97 0.96 0.97 0.97 0.98 0.97 0.98 0.95 0.96 0.95 0.99 0.97 5576
accuracy 0.96 0.97 0.96 0.96 10000
macro avg 0.95 0.95 0.96 0.96 0.97 0.96 0.95 0.97 0.96 0.97 0.94 0.95 10000
weight. avg 0.96 0.96 0.96 0.97 0.97 0.97 0.96 0.96 0.96 0.96 0.96 0.96 10000
1 [0.00 - 0.20] 0.97 0.91 0.94 0.93 0.79 0.85 0.94 0.97 0.96 0.85 0.88 0.87 33
2 [0.20 - 0.40] 0.86 0.89 0.88 0.85 0.90 0.87 0.83 0.91 0.87 0.93 0.77 0.84 262
3 [0.40 - 0.60] 0.91 0.92 0.91 0.93 0.92 0.93 0.89 0.94 0.91 0.94 0.90 0.92 2320
4 [0.60 - 0.80] 0.94 0.95 0.94 0.95 0.96 0.96 0.94 0.94 0.94 0.94 0.96 0.95 5021
5 [0.80 - 1.00] 0.96 0.93 0.95 0.96 0.96 0.96 0.97 0.92 0.95 0.95 0.96 0.96 2364
accuracy 0.93 0.95 0.93 0.94 10000
macro avg 0.93 0.92 0.92 0.93 0.91 0.91 0.91 0.93 0.92 0.92 0.89 0.91 10000
weight. avg 0.93 0.93 0.93 0.95 0.95 0.95 0.93 0.93 0.93 0.94 0.94 0.94 10000
1 [0.00 - 0.10] 1.00 0.50 0.67 1.00 0.62 0.77 1.00 0.62 0.77 1.00 0.75 0.86 8
2 [0.10 - 0.20] 0.81 0.88 0.85 0.78 0.72 0.75 0.83 0.96 0.89 0.75 0.84 0.79 25
3 [0.20 - 0.30] 0.90 0.90 0.90 0.81 0.84 0.83 0.87 0.89 0.88 0.91 0.79 0.84 62
4 [0.30 - 0.40] 0.81 0.84 0.83 0.80 0.85 0.83 0.76 0.85 0.80 0.88 0.71 0.79 200
5 [0.40 - 0.50] 0.82 0.86 0.84 0.86 0.87 0.87 0.79 0.87 0.83 0.86 0.83 0.84 689
6 [0.50 - 0.60] 0.84 0.84 0.84 0.89 0.87 0.88 0.83 0.86 0.84 0.89 0.84 0.86 1631
7 [0.60 - 0.70] 0.86 0.88 0.87 0.89 0.89 0.89 0.85 0.87 0.86 0.88 0.88 0.88 2706
8 [0.70 - 0.80] 0.87 0.87 0.87 0.89 0.90 0.89 0.88 0.84 0.86 0.86 0.90 0.88 2315
9 [0.80 - 0.90] 0.86 0.88 0.87 0.89 0.92 0.90 0.89 0.85 0.87 0.87 0.91 0.89 1456
10 [0.90 - 1.00] 0.97 0.86 0.91 0.97 0.91 0.94 0.96 0.88 0.91 0.95 0.93 0.94 908
accuracy 0.87 0.89 0.86 0.88 10000
macro avg 0.88 0.83 0.84 0.88 0.84 0.85 0.86 0.85 0.85 0.88 0.84 0.86 10000
weight. avg 0.87 0.87 0.87 0.89 0.89 0.89 0.86 0.86 0.86 0.88 0.88 0.88 10000
REFERENCES
[1] M. Khosravy, N. Patel, N. Gupta, and I. K. Sethi, “Image quality
assessment: A review to full reference indexes,” Recent trends in
communication, computing, and electronics, pp. 279–288, 2019.
[2] L. S. Chow and R. Paramesran, “Review of medical image quality
assessment,” Biomedical signal processing and control, vol. 27, pp. 145–
154, 2016.
[3] B. Mortamet, M. A. Bernstein, C. R. Jack Jr, J. L. Gunter, C. Ward,
P. J. Britson, R. Meuli, J.-P. Thiran, and G. Krueger, “Automatic quality
assessment in structural brain magnetic resonance imaging,” Magnetic
Resonance in Medicine: An Official Journal of the International Society
for Magnetic Resonance in Medicine, vol. 62, no. 2, pp. 365–372, 2009.
[4] P. Bourel, D. Gibon, E. Coste, V. Daanen, and J. Rousseau, “Auto-
matic quality assessment protocol for mri equipment,” Medical physics,
vol. 26, no. 12, pp. 2693–2700, 1999.
[5] P. Jezzard, “The physical basis of spatial distortions in magnetic reso-
nance images,” 2009.
[6] J. J. Ma, U. Nakarmi, C. Y. S. Kin, C. M. Sandino, J. Y. Cheng, A. B.
Syed, P. Wei, J. M. Pauly, and S. S. Vasanawala, “Diagnostic image qual-
ity assessment and classification in medical imaging: Opportunities and
challenges,” in 2020 IEEE 17th International Symposium on Biomedical
Imaging (ISBI). IEEE, 2020, pp. 337–340.
[7] L. S. Chow and H. Rajagopal, “Modified-brisque as no reference
image quality assessment for structural mr images,” Magnetic resonance
imaging, vol. 43, pp. 74–87, 2017.
[8] S. J. Sujit, R. E. Gabr, I. Coronado, M. Robinson, S. Datta, and
P. A. Narayana, “Automated image quality evaluation of structural brain
magnetic resonance images using deep convolutional neural networks,”
in 2018 9th Cairo International Biomedical Engineering Conference
(CIBEC). IEEE, 2018, pp. 33–36.
[9] O. Esteban, D. Birman, M. Schaer, O. O. Koyejo, R. A. Poldrack, and
K. J. Gorgolewski, “Mriqc: Advancing the automatic prediction of image
quality in mri from unseen sites,” PloS one, vol. 12, no. 9, p. e0184661,
2017.
[10] T. Küstner, A. Liebgott, L. Mauch, P. Martirosian, F. Bamberg, K. Niko-
laou, B. Yang, F. Schick, and S. Gatidis, “Automated reference-free
detection of motion artifacts in magnetic resonance images,” Magnetic
Resonance Materials in Physics, Biology and Medicine, vol. 31, no. 2,
pp. 243–256, 2018.
[11] T. Küstner, S. Gatidis, A. Liebgott, M. Schwartz, L. Mauch, P. Mar-
tirosian, H. Schmidt, N. F. Schwenzer, K. Nikolaou, F. Bamberg et al.,
“A machine-learning framework for automatic reference-free quality
assessment in mri,” Magnetic Resonance Imaging, vol. 53, pp. 134–147,
2018.
[12] I. Fantini, L. Rittner, C. Yasuda, and R. Lotufo, “Automatic detection of
motion artifacts on mri using deep cnn,” in 2018 International Workshop
on Pattern Recognition in Neuroimaging (PRNI). IEEE, 2018, pp. 1–4.
[13] I. Fantini, C. Yasuda, M. Bento, L. Rittner, F. Cendes, and R. Lotufo,
“Automatic mr image quality evaluation using a deep cnn: A reference-
free method to rate motion artifacts in neuroimaging,” Computerized
Medical Imaging and Graphics, vol. 90, p. 101897, 2021.
[14] L. L. Backhausen, M. M. Herting, J. Buse, V. Roessner, M. N. Smolka,
and N. C. Vetter, “Quality control of structural mri images applied using
freesurfer—a hands-on workflow to rate motion artifacts,” Frontiers in
neuroscience, vol. 10, p. 558, 2016.
[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press,
2016.
[16] A. Géron, Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc., 2019.
[17] T. Amr, Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits: A practical guide to implementing supervised and unsupervised machine learning algorithms in Python. Packt Publishing Ltd, 2020.
Fig. 6. Evaluation on the clinical dataset. The curves represent the SSIM predictions obtained with the different trained models, while the coloured bars represent the subjective classification performed by the expert. When a curve lies within the coloured bars, there is agreement between the objective and subjective evaluations; otherwise there is disagreement. The blue dashed lines indicate the separation between the different subjects.
[18] W. Rawat and Z. Wang, “Deep convolutional neural networks for image
classification: A comprehensive review,” Neural computation, vol. 29,
no. 9, pp. 2352–2449, 2017.
[19] Y. Li, “Research and application of deep learning in image recognition,”
in 2022 IEEE 2nd International Conference on Power, Electronics and
Computer Applications (ICPECA). IEEE, 2022, pp. 994–999.
[20] V. E. Staartjes and J. M. Kernbach, “Foundations of machine learning-
based clinical prediction modeling: Part v—a practical approach to
regression problems,” in Machine Learning in Clinical Neuroscience.
Springer, 2022, pp. 43–50.
[21] P. Langley et al., “The changing science of machine learning,” Machine
learning, vol. 82, no. 3, pp. 275–279, 2011.
[22] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image
quality assessment: from error visibility to structural similarity,” IEEE
transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
[23] R. Shaw, C. Sudre, S. Ourselin, and M. J. Cardoso, “Mri k-space motion
artefact augmentation: model robustness and task-specific uncertainty,”
in International Conference on Medical Imaging with Deep Learning–
Full Paper Track, 2018.
[24] F. Pérez-García, R. Sparks, and S. Ourselin, "Torchio: a python li-
brary for efficient loading, preprocessing, augmentation and patch-based
sampling of medical images in deep learning,” Computer Methods and
Programs in Biomedicine, vol. 208, p. 106236, 2021.
[25] S. Chatterjee, A. Sciarra, M. Dünnwald, S. Oeltze-Jafra, A. Nürnberger,
and O. Speck, “Retrospective motion correction of mr images using
prior-assisted deep learning,” arXiv preprint arXiv:2011.14134, 2020.
[26] A. S. Atukorale and T. Downs, “Using labeled and unlabeled data for
training,” 2001.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2016, pp. 770–778.
[28] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf,
E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner,
L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style,
high-performance deep learning library,” in Advances in Neural
Information Processing Systems 32. Curran Associates, Inc., 2019,
pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[29] W. Kubinger, M. Vincze, and M. Ayromlou, “The role of gamma cor-
rection in colour image processing,” in 9th European Signal Processing
Conference (EUSIPCO 1998). IEEE, 1998, pp. 1–4.
[30] R. Jain, R. Kasturi, and B. Schunck, Machine Vision. McGraw-Hill International Editions, New York, 1995.
[31] G. J. Braun and M. D. Fairchild, “Image lightness rescaling using sig-
moidal contrast enhancement functions,” Journal of Electronic Imaging,
vol. 8, no. 4, pp. 380–393, 1999.
[32] S. M. Pizer, E. P. Amburn, J. D. Austin, R. Cromartie, A. Geselowitz,
T. Greer, B. ter Haar Romeny, J. B. Zimmerman, and K. Zuiderveld,
“Adaptive histogram equalization and its variations,” Computer vision,
graphics, and image processing, vol. 39, no. 3, pp. 355–368, 1987.
[33] D. M. Allen, “Mean square error of prediction as a criterion for selecting
variables,” Technometrics, vol. 13, no. 3, pp. 469–475, 1971.
[34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
[35] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy,
D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J.
van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J.
Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W.
Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henrik-
sen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro,
F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, “SciPy 1.0:
Fundamental Algorithms for Scientific Computing in Python,” Nature
Methods, vol. 17, pp. 261–272, 2020.
[36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander-
plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-
esnay, “Scikit-learn: Machine learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.
[37] M. Herbst, J. Maclaren, C. Lovell-Smith, R. Sostheim, K. Egger,
A. Harloff, J. Korvink, J. Hennig, and M. Zaitsev, “Reproduction
of motion artifacts for performance analysis of prospective motion
correction in mri,” Magnetic Resonance in Medicine, vol. 71, no. 1,
pp. 182–190, 2014.
[38] B. Zahneisen, B. Keating, A. Singh, M. Herbst, and T. Ernst, “Re-
verse retrospective motion correction,” Magnetic resonance in medicine,
vol. 75, no. 6, pp. 2341–2349, 2016.
[39] A. Sciarra, H. Mattern, R. Yakupov, S. Chatterjee, D. Stucht, S. Oeltze-
Jafra, F. Godenschweger, and O. Speck, “Quantitative evaluation of
prospective motion correction in healthy subjects at 7t mri,” Magnetic
resonance in medicine, vol. 87, no. 2, pp. 646–657, 2022.