CheXtransfer: Performance and Parameter Efficiency of ImageNet Models for Chest X-Ray Interpretation
Alexander Ke
alexke@cs.stanford.edu
Stanford University
USA
William Ellsworth
willells@cs.stanford.edu
Stanford University
USA
Oishi Banerjee
oishi.banerjee@cs.stanford.edu
Stanford University
USA
Andrew Y. Ng
ang@cs.stanford.edu
Stanford University
USA
Pranav Rajpurkar
pranavsr@cs.stanford.edu
Stanford University
USA
Authors contributed equally to this research.

Figure 1: Visual summary of our contributions. From left to right: scatterplot and best-fit line for 16 pretrained models showing no relationship between ImageNet and CheXpert performance; CheXpert performance varies across architecture families much more than within; average CheXpert performance improves with pretraining; models can maintain performance and improve parameter efficiency through truncation of final blocks. Error bars show one standard deviation.
ABSTRACT
Deep learning methods for chest X-ray interpretation typically rely on pretrained models developed for ImageNet. This paradigm assumes that better ImageNet architectures perform better on chest X-ray tasks and that ImageNet-pretrained weights provide a performance boost over random initialization. In this work, we compare the transfer performance and parameter efficiency of 16 popular convolutional architectures on a large chest X-ray dataset (CheXpert) to investigate these assumptions. First, we find no relationship between ImageNet performance and CheXpert performance for both models without pretraining and models with pretraining. Second, we find that, for models without pretraining, the choice of model family influences performance more than size within a family for medical imaging tasks. Third, we observe that ImageNet pretraining yields a statistically significant boost in performance across architectures, with a higher boost for smaller architectures. Fourth, we examine whether ImageNet architectures are unnecessarily large for CheXpert by truncating final blocks from pretrained models, and find that we can make models 3.25x more parameter-efficient on average without a statistically significant drop in performance. Our work contributes new experimental evidence about the relation of ImageNet to chest x-ray interpretation performance.
CCS CONCEPTS
• Applied computing → Health informatics; • Computing methodologies → Computer vision; Neural networks.

KEYWORDS
generalization, efficiency, pretraining, chest x-ray interpretation, ImageNet, truncation
1 INTRODUCTION
Deep learning models for chest X-ray interpretation have high potential for social impact by aiding clinicians in their workflow and increasing access to radiology expertise worldwide. Transfer learning using pretrained ImageNet [9] models has been the standard approach for developing models not only on chest X-rays [1, 20, 27] but also for many other medical imaging modalities [8, 10, 16, 17, 29]. This transfer assumes that better ImageNet architectures perform better and pretrained weights boost performance on their target
medical tasks. However, there has not been a systematic investigation of how ImageNet architectures and weights both relate to performance on downstream medical tasks.
In this work, we systematically investigate how ImageNet architectures and weights both relate to performance on chest X-ray tasks. Our primary contributions are:
(1) For models without pretraining and models with pretraining, we find no relationship between ImageNet performance and CheXpert performance (Spearman ρ = 0.08 and ρ = 0.06, respectively). This finding suggests that architecture improvements on ImageNet may not lead to improvements on medical imaging tasks.
(2) For models without pretraining, we find that within an architecture family, the largest and smallest models have small differences (ResNet 0.005, DenseNet 0.003, EfficientNet 0.004) in CheXpert AUC, but different model families have larger differences in AUC (>0.006). This finding suggests that the choice of model family influences performance more than size within a family for medical imaging tasks.
(3) We observe that ImageNet pretraining yields a statistically significant boost in performance (average boost of 0.016 AUC) across architectures, with a higher boost for smaller architectures (Spearman ρ = −0.72 with number of parameters). This finding supports the ImageNet pretraining paradigm for medical imaging tasks, especially for smaller models.
(4) We find that by truncating final blocks of pretrained models, we can make models 3.25x more parameter-efficient on average without a statistically significant drop in performance. This finding suggests model truncation may be a simple method to yield lighter pretrained models by preserving architecture design features while reducing model size.
Our study, to the best of our knowledge, contributes the first systematic investigation of the performance and efficiency of ImageNet architectures and weights for chest X-ray interpretation. Our investigation and findings may be further validated on other datasets and medical imaging tasks.
2 RELATED WORK
2.1 ImageNet Transfer
Kornblith et al. [15] examined the performance of 16 convolutional neural networks (CNNs) on 12 image classification datasets. They found that using these ImageNet-pretrained architectures either as feature extractors for logistic regression or fine-tuning them on the target dataset yielded a Spearman ρ = 0.99 and ρ = 0.97 between ImageNet accuracy and transfer accuracy, respectively. However, they showed ImageNet performance was less correlated with transfer accuracy for some fine-grained tasks, corroborating He et al. [11]. They found that without ImageNet pretraining, ImageNet accuracy and transfer accuracy had a weaker Spearman ρ = 0.59. We extend Kornblith et al. [15] to the medical setting by studying the relationship between ImageNet and CheXpert performance.
Raghu et al. [19] explored properties of transfer learning onto retinal fundus images and chest X-rays. They studied ResNet50 and InceptionV3 and showed pretraining offers little performance improvement. Architectures composed of just four to five sequential convolution and pooling layers achieved performance comparable to ResNet50 on these tasks with less than 40% of the parameters. In our work, we find pretraining does not boost performance for ResNet50, InceptionV3, InceptionV4, and MNASNet but does boost performance for the remaining 12 architectures. Thus, we were able to replicate Raghu et al. [19]'s results, but upon studying a broader set of newer and more popular models, we reached the opposite conclusion that ImageNet pretraining yields a statistically significant boost in performance.
2.2 Medical Task Architectures
Irvin et al. [13] compared the performance of ResNet152, DenseNet121, InceptionV4, and SEResNeXt101 on CheXpert, finding that DenseNet121 performed best. In a recent analysis, all but one of the top ten CheXpert competition models used DenseNets as part of their ensemble, even though they have been surpassed on ImageNet [21]. Few groups design their own networks from scratch, preferring to use established ResNet and DenseNet architectures for CheXpert [3]. This trend extends to retinal fundus and skin cancer tasks as well, where Inception architectures remain popular [8, 16, 17, 29]. The popularity of these older ImageNet architectures hints that there may be a disconnect between ImageNet performance and medical task performance for newer architectures generated through architecture search. We verify that these newer architectures generated through search (EfficientNet, MobileNet, MNASNet) underperform older architectures (DenseNet, ResNet) on CheXpert, suggesting that search has overfit to ImageNet and explaining the popularity of these older architectures in the medical imaging literature.
Bressem et al. [3] postulated that deep CNNs that can represent more complex relationships for ImageNet may not be necessary for CheXpert, which has greyscale inputs and fewer image classes. They studied ResNet, DenseNet, VGG, SqueezeNet, and AlexNet performance on CheXpert and found that ResNet152, DenseNet161, and ResNet50 performed best on CheXpert AUC. In terms of AUPRC, they showed that smaller architectures like AlexNet and VGG can perform similarly to deeper architectures on CheXpert. Models such as AlexNet, VGG, and SqueezeNet are no longer popular in the medical setting, so in our work, we systematically investigate the performance and efficiency of 16 more contemporary ImageNet architectures with and without pretraining. Additionally, we extend [3] by studying the effects of pretraining, characterizing the relationship between ImageNet and CheXpert performance, and drawing conclusions about architecture design.
2.3 Truncated Architectures
The more complex a convolutional architecture becomes, the more computational and memory resources are needed for its training and deployment. Model complexity thus may impede the deployment of CNNs to clinical settings with fewer resources. Therefore, efficiency, often reported in terms of the number of parameters in a model, the number of FLOPS in the forward pass, or the latency of the forward pass, has become increasingly important in model design. Low-rank factorization [7, 14], transferred/compact convolutional filters [6], knowledge distillation [12], and parameter pruning [25] have all been proposed to make CNNs more efficient. Layer-wise pruning is a type of parameter pruning that locates and removes layers that are not as useful to the target task [22].
Through feature diagnosis, a linear classifier is trained using the feature maps at intermediate layers to quantify how much a particular layer contributes to performance on the target task [5]. In this work, we propose model truncation as a simple method for layer-wise pruning where the final pretrained layers after a given point are pruned off, a classification layer is appended, and this whole model is finetuned on the target task.
3 METHODS
3.1 Training and Evaluation Procedure
We train chest X-ray classification models with different architectures with and without pretraining. The task of interest is to predict the probability of different pathologies from one or more chest X-rays. We use the CheXpert dataset consisting of 224,316 chest X-rays of 65,240 patients [13] labeled for the presence or absence of 14 radiological observations. We evaluate models using the average of their AUROC metrics (AUC) on the five CheXpert-defined competition tasks (Atelectasis, Cardiomegaly, Consolidation, Edema, Pleural Effusion) as well as the No Finding task to balance clinical importance and prevalence in the validation set.
We select 16 models pretrained on ImageNet from public checkpoints implemented in PyTorch 1.4.0: DenseNet (121, 169, 201) and ResNet (18, 34, 50, 101) from Paszke et al. [18], Inception (V3, V4) and MNASNet from Cadene [4], and EfficientNet (B0, B1, B2, B3) and MobileNet (V2, V3) from Wightman [28]. We finetune and evaluate these architectures with and without ImageNet pretraining.
For each model, we finetune all parameters on the CheXpert training set. If a model is pretrained, inputs are normalized using the mean and standard deviation learned from ImageNet. If a model is not pretrained, inputs are normalized with the mean and standard deviation learned from CheXpert. We use the Adam optimizer (β1 = 0.9, β2 = 0.999) with a learning rate of 1 × 10⁻⁴, a batch size of 16, and a cross-entropy loss function. We train on up to four Nvidia GTX 1080 GPUs with CUDA 10.1 and an Intel Xeon CPU E5-2609 running Ubuntu 16.04. For one run of an architecture, we train for three epochs and evaluate each model every 8192 gradient steps. We train each model and create a final ensemble model from the ten checkpoints that achieved the best average CheXpert AUC across the six tasks on the validation set. We report all our results on the CheXpert test set.
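The training procedure above maps onto a short PyTorch loop. The following is a minimal sketch, not the authors' code: it assumes a torchvision DenseNet121, a per-observation binary cross-entropy loss (one reading of "cross-entropy" for the multi-label CheXpert setup), and an externally supplied train_loader that yields normalized images with 14-dimensional label vectors.

```python
import torch
import torch.nn as nn
import torchvision

NUM_OBSERVATIONS = 14  # CheXpert radiological observations

def build_model(pretrained: bool = True) -> nn.Module:
    # DenseNet121 backbone; swap the ImageNet head for a 14-way output.
    model = torchvision.models.densenet121(pretrained=pretrained)
    model.classifier = nn.Linear(model.classifier.in_features, NUM_OBSERVATIONS)
    return model

def finetune(model: nn.Module, train_loader, device: str = "cuda",
             epochs: int = 3, steps_per_eval: int = 8192) -> None:
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy per observation
    model.train()
    step = 0
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device).float()
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            step += 1
            if step % steps_per_eval == 0:
                pass  # evaluate on the validation set and checkpoint here
```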
We use the nonparametric bootstrap to estimate 95% confidence intervals for each statistic. 1,000 replicates are drawn from the test set, and the statistic is calculated on each replicate. This procedure produces a distribution for each statistic, and we report the 2.5 and 97.5 percentiles as a confidence interval. Significance is assessed at the p = 0.05 level.
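As a concrete illustration of this bootstrap procedure, the sketch below assumes per-example test-set labels and predicted probabilities are available as arrays; the function and variable names are ours, not taken from the paper's code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_ci(y_true, y_prob, n_replicates=1000, seed=0):
    """y_true, y_prob: arrays of shape (n_examples, n_tasks)."""
    rng = np.random.default_rng(seed)
    n = y_true.shape[0]
    replicate_stats = []
    for _ in range(n_replicates):
        idx = rng.integers(0, n, size=n)  # resample the test set with replacement
        # In practice, replicates where a task has only one class present
        # would need to be handled (e.g., redrawn).
        aucs = [roc_auc_score(y_true[idx, t], y_prob[idx, t])
                for t in range(y_true.shape[1])]
        replicate_stats.append(np.mean(aucs))
    # 2.5th and 97.5th percentiles of the bootstrap distribution.
    lower, upper = np.percentile(replicate_stats, [2.5, 97.5])
    return lower, upper
```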
3.2 Truncated Architectures
We study truncated versions of DenseNet121, MNASNet, ResNet18, and EfficientNetB0. DenseNet121 and MNASNet were chosen because we found they have the greatest efficiency (by AUC per parameter) on CheXpert of the models we profile, ResNet18 was chosen because of its popularity as a compact model for medical tasks, and EfficientNetB0 was chosen because it is the smallest current-generation model of the 16 we study. DenseNet121 contains four dense blocks separated by transition blocks before the classification layer. By pruning the final dense block and associated transition block, the model only contains three dense blocks, yielding DenseNet121Minus1. Similarly, pruning two dense blocks and associated transition blocks yields DenseNet121Minus2, and pruning three dense blocks and associated transition blocks yields DenseNet121Minus3. For MNASNet, we remove up to four of the final MBConv blocks to produce MNASNetMinus1 through MNASNetMinus4. For ResNet18, we remove up to three of the final residual blocks with a similar method to produce ResNet18Minus1 through ResNet18Minus3. For EfficientNet, we remove up to two of the final MBConv6 blocks to produce EfficientNetB0Minus1 and EfficientNetB0Minus2.
After truncating a model, we append a classification block containing a global average pooling layer followed by a fully connected layer to yield outputs of the correct shape. We initialize the model with ImageNet pretrained weights, except for the randomly initialized classification block, and finetune using the same training procedure as for the 16 ImageNet models.
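As one concrete way to implement this truncation, the sketch below builds a hypothetical DenseNet121Minus1 from a pretrained torchvision backbone by dropping the last dense block, its associated transition block, and the final normalization layer, then appending a global-average-pooling classification head. The layer handling here reflects our assumptions about the torchvision module layout, not the authors' released code.

```python
import torch
import torch.nn as nn
import torchvision

def densenet121_minus1(num_classes: int = 14) -> nn.Module:
    # torchvision DenseNet121 features: conv0 ... denseblock3, transition3,
    # denseblock4, norm5. Drop transition3, denseblock4, and norm5 so only
    # three dense blocks remain.
    backbone = torchvision.models.densenet121(pretrained=True).features
    kept = [module for name, module in backbone.named_children()
            if name not in ("transition3", "denseblock4", "norm5")]
    truncated = nn.Sequential(*kept)
    # Infer the channel count of the truncated feature maps with a dummy pass.
    with torch.no_grad():
        out_channels = truncated(torch.zeros(1, 3, 224, 224)).shape[1]
    return nn.Sequential(
        truncated,
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),               # global average pooling
        nn.Flatten(),
        nn.Linear(out_channels, num_classes),  # randomly initialized head
    )
```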
3.3 Class Activation Maps
We compare the class activation maps (CAMs) among a truncated DenseNet121 family to visualize their higher-resolution CAMs. We generate CAMs using the Grad-CAM method [23], using a weighted combination of the model's final convolutional feature maps, with weights based on the positive partial derivatives with respect to the class score. This averaged map is scaled by the output probability so that more confident predictions appear brighter. Finally, the map is upsampled to the input image resolution and overlaid on the input image, highlighting the image regions that had the greatest influence on the model's decision.
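The sketch below shows one way to compute such a map with PyTorch hooks, following the description above (positive partial derivatives as weights, scaling by the predicted probability). It assumes a multi-label model with sigmoid outputs and is illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_conv_layer, class_idx):
    """image: (1, 3, H, W) tensor; returns an (H, W) heatmap."""
    store = {}
    handle = target_conv_layer.register_forward_hook(
        lambda module, inputs, output: store.update(feats=output))
    model.eval()
    logits = model(image)
    handle.remove()
    # Capture gradients flowing into the final convolutional feature maps.
    store["feats"].register_hook(lambda grad: store.update(grads=grad))
    prob = torch.sigmoid(logits[0, class_idx])
    model.zero_grad()
    logits[0, class_idx].backward()

    # Weight each feature map by its mean positive partial derivative.
    weights = store["grads"].clamp(min=0).mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * store["feats"]).sum(dim=1, keepdim=True))
    cam = cam / (cam.max() + 1e-8)  # normalize
    cam = cam * prob                # brighter when the prediction is confident
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return cam[0, 0].detach()
```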
4 EXPERIMENTS
4.1 ImageNet Transfer Performance
We investigate whether higher performance on natural image classification translates to higher performance on chest X-ray classification. We display the relationship between the CheXpert AUC, with and without ImageNet pretraining, and ImageNet top-1 accuracy in Figure 2.
When models are trained without pretraining, we find no monotonic relationship between ImageNet top-1 accuracy and average CheXpert AUC, with Spearman ρ = 0.082 at p = 0.762. Model performance without pretraining reflects how a given architecture performs on the target task, independent of any pretrained weights. When models are trained with pretraining, we again find no monotonic relationship between ImageNet top-1 accuracy and average CheXpert AUC, with Spearman ρ = 0.059 at p = 0.829.
Overall, we find no relationship between ImageNet and CheXpert performance, so models that succeed on ImageNet do not necessarily succeed on CheXpert. These relationships between ImageNet performance and CheXpert performance are much weaker than the relationships between ImageNet performance and performance on various natural image tasks reported by Kornblith et al. [15].
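For reference, rank correlations and p-values of this kind can be computed directly with SciPy; the values below are placeholders, not our measured results.

```python
# Sketch of the Spearman correlation computation; replace the placeholder
# lists with the measured ImageNet top-1 accuracies and average CheXpert
# AUCs (one entry per architecture).
from scipy.stats import spearmanr

imagenet_top1 = [0.70, 0.72, 0.74, 0.75, 0.77, 0.78, 0.79, 0.80]           # placeholders
chexpert_auc = [0.856, 0.861, 0.858, 0.863, 0.859, 0.862, 0.860, 0.864]    # placeholders

rho, p_value = spearmanr(imagenet_top1, chexpert_auc)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```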
Figure 2: Average CheXpert AUC vs. ImageNet Top-1 Accuracy. The left plot shows results obtained without pretraining, while the right plot shows results with pretraining. There is no monotonic relationship between ImageNet and CheXpert performance without pretraining (Spearman ρ = 0.08) or with pretraining (Spearman ρ = 0.06).

We compare the CheXpert performance within and across architecture families. Without pretraining, we find that ResNet101 performs only 0.005 AUC greater than ResNet18, which is well within the confidence interval of this metric (Figure 2). Similarly,
DenseNet201 performs 0.004 AUC greater than DenseNet121 and EfficientNetB3 performs 0.003 AUC greater than EfficientNetB0. With pretraining, we continue to find minor performance differences between the largest model and smallest model that we test in each family. We find AUC increases of 0.002 for ResNet, 0.004 for DenseNet, and -0.006 for EfficientNet. Thus, increasing complexity within a model family does not yield increases in CheXpert performance as meaningful as the corresponding increases in ImageNet performance.
Without pretraining, we find that the best model studied performs significantly better than the worst model studied. Among models trained without pretraining, we find that InceptionV3 performs best with 0.866 (0.851, 0.880) AUC, while MobileNetV2 performs worst with 0.814 (0.796, 0.832) AUC. Their difference in performance is 0.052 (0.043, 0.063) AUC. InceptionV3 is also the third largest architecture studied and MobileNetV2 the smallest. We find a significant difference in the CheXpert performance of these models. This difference again hints at the importance of architecture design.
4.2 CheXpert Performance and Efficiency
We examine whether larger architectures perform better than smaller architectures on chest X-ray interpretation, where architecture size is measured by the number of parameters. We display these relationships in Figure 3 and Table 1.
Without ImageNet pretraining, we find a positive monotonic relationship between the number of parameters of an architecture and CheXpert performance, with Spearman ρ = 0.791, significant at p = 2.62 × 10⁻⁴. With ImageNet pretraining, there is a weaker positive monotonic relationship between the number of parameters and average CheXpert AUC, with Spearman ρ = 0.565 at p = 0.023.
Although there exists a positive monotonic relationship between the number of parameters of an architecture and average CheXpert AUC, the Spearman ρ does not highlight the increase in parameters necessary to realize marginal increases in CheXpert AUC. For example, ResNet101 is 11.1x larger than EfficientNetB0, but yields an increase of only 0.005 in CheXpert AUC with pretraining.
Within a model family, increasing the number of parameters does not lead to meaningful gains in CheXpert AUC. We see this relationship in all families studied without pretraining (EfficientNet, DenseNet, and ResNet) in Figure 3. For example, DenseNet201 has an AUC 0.003 greater than DenseNet121, but is 2.6x larger. EfficientNetB3 has an AUC 0.004 greater than EfficientNetB0, but is 1.9x larger. Despite the positive relationship between model size and CheXpert performance across all models, bigger does not necessarily mean better within a model family.

Figure 3: Average CheXpert AUC vs. Model Size. The left plot shows results obtained without pretraining, while the right plot shows results with pretraining. The logarithm of the model size has a near linear relationship with CheXpert performance when we omit pretraining (Spearman ρ = 0.79). However, once we incorporate pretraining, the monotonic relationship is weaker (Spearman ρ = 0.56).

Model | CheXpert AUC | #Params (M)
DenseNet121 | 0.859 (0.846, 0.871) | 6.968
DenseNet169 | 0.860 (0.848, 0.873) | 12.508
DenseNet201 | 0.864 (0.850, 0.876) | 18.120
EfficientNetB0 | 0.859 (0.846, 0.871) | 4.025
EfficientNetB1 | 0.858 (0.844, 0.872) | 6.531
EfficientNetB2 | 0.866 (0.853, 0.880) | 7.721
EfficientNetB3 | 0.853 (0.837, 0.867) | 10.718
InceptionV3 | 0.862 (0.848, 0.876) | 27.161
InceptionV4 | 0.861 (0.846, 0.873) | 42.680
MNASNet | 0.858 (0.845, 0.871) | 5.290
MobileNetV2 | 0.854 (0.839, 0.869) | 2.242
MobileNetV3 | 0.859 (0.847, 0.872) | 4.220
ResNet101 | 0.863 (0.848, 0.876) | 44.549
ResNet18 | 0.862 (0.847, 0.875) | 11.690
ResNet34 | 0.863 (0.849, 0.875) | 21.798
ResNet50 | 0.859 (0.843, 0.871) | 25.557
Table 1: CheXpert AUC (with 95% Confidence Intervals) and Number of Parameters for 16 ImageNet-Pretrained Models.
Since within a model family there is a weaker relationship between model size and CheXpert performance than across all models, we find that CheXpert performance is influenced more by the macro architecture design than by its size. Models within a family have similar architecture design choices but different sizes, so they perform similarly on CheXpert. We observe large discrepancies in performance between architecture families. For example, DenseNet, ResNet, and Inception typically outperform EfficientNet and MobileNet architectures, regardless of their size. EfficientNet, MobileNet, and MNASNet were all generated through neural architecture search to some degree, a process that optimized for performance on ImageNet. Our findings suggest that this search could have overfit to the natural image objective to the detriment of chest X-ray tasks.
Figure 4: Pretraining Boost vs. Model Size. We define pretraining boost as the increase in the average CheXpert AUC achieved with pretraining vs. without pretraining. Most models benefit significantly from ImageNet pretraining. Smaller models tend to benefit more than larger models (Spearman ρ = −0.72).
4.3 ImageNet Pretraining Boost
We study the effects of ImageNet pretraining on CheXpert performance by defining the pretraining boost as the CheXpert AUC of a model initialized with ImageNet pretraining minus the CheXpert AUC of its counterpart without pretraining. The pretraining boosts of our architectures are reported in Figure 4.
We find that ImageNet pretraining provides a meaningful boost for most architectures (on average 0.015 AUC). We find a Spearman ρ = −0.718 at p = 0.002 between the number of parameters of a given model and the pretraining boost. Therefore, this boost tends to be larger for smaller architectures such as EfficientNetB0 (0.023), MobileNetV2 (0.040), and MobileNetV3 (0.033) and smaller for larger architectures such as InceptionV4 (−0.002) and ResNet101 (0.013). Further work is required to explain this relationship.
Within a model family, the pretraining boost also does not meaningfully increase as model size increases. For example, DenseNet201 has a pretraining boost only 0.002 AUC greater than DenseNet121 does. This finding supports our earlier conclusion that model families perform similarly on CheXpert regardless of their size.
4.4 Truncated Architectures
We truncate the final blocks of DenseNet121, MNASNet, ResNet18, and EfficientNetB0 with pretrained weights and study their CheXpert performance to understand whether ImageNet models are unnecessarily large for the chest X-ray task. We express efficiency gains in terms of Times-Smaller, or the number of parameters of the original architecture divided by the number of parameters of the truncated architecture: intuitively, how many times larger the original architecture is compared to the truncated architecture. The efficiency gains and AUC changes of model truncation on DenseNet121, MNASNet, ResNet18, and EfficientNetB0 are displayed in Table 2.
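Counting parameters makes the Times-Smaller metric concrete; the short sketch below assumes the original and truncated models are built elsewhere (for example, with the hypothetical densenet121_minus1 helper sketched in Section 3.2).

```python
# Times-Smaller: parameters of the original model divided by parameters of
# the truncated model.
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

def times_smaller(original: nn.Module, truncated: nn.Module) -> float:
    return count_parameters(original) / count_parameters(truncated)
```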
For all four model families, we find that truncating the final block leads to no significant decrease in CheXpert AUC but can save 1.4x to 4.2x the parameters. Notably, truncating the final block of ResNet18 yields a model that is not significantly different (difference -0.002 (-0.008, 0.004)) in CheXpert AUC, but is 4.2x smaller. Truncating the final two blocks of an EfficientNetB0 yields a model that is not significantly different (difference 0.004 (-0.003, 0.009)) in CheXpert AUC, but is 4.7x smaller. However, truncating the second block and beyond in each of MNASNet, DenseNet121, and ResNet18 yields models that have statistically significant drops in CheXpert performance.
Model | AUC Change | Times-Smaller
EfficientNetB0 | 0.00% | 1x
EfficientNetB0Minus1 | 0.15% | 1.4x
EfficientNetB0Minus2 | -0.45% | 4.7x
MNASNet | 0.00% | 1x
MNASNetMinus1 | -0.07% | 2.5x
MNASNetMinus2* | -2.30% | 11.2x
MNASNetMinus3* | -2.51% | 20.0x
MNASNetMinus4* | -6.40% | 112.9x
DenseNet121 | 0.00% | 1x
DenseNet121Minus1 | -0.04% | 1.6x
DenseNet121Minus2* | -1.33% | 5.3x
DenseNet121Minus3* | -4.73% | 20.0x
ResNet18 | 0.00% | 1x
ResNet18Minus1 | 0.24% | 4.2x
ResNet18Minus2* | -3.70% | 17.1x
ResNet18Minus3* | -8.33% | 73.8x
Table 2: Efficiency Trade-Off of Truncated Models. Pretrained models can be truncated without a significant decrease in CheXpert AUC. Truncated models with significantly different AUC from the base model are denoted with an asterisk.
Figure 5: Comparison of Class Activation Maps Among a Truncated Model Family. CAMs yielded by, from left to right, DenseNet121, DenseNet121Minus1, and DenseNet121Minus2, for frontal chest X-rays demonstrating Atelectasis (top) and Edema (bottom). Further-truncated models more effectively localize the Atelectasis, as well as tracing the hila and vessel branching for Edema.
Model truncation effectively compresses models that perform well on CheXpert, making them more parameter-efficient while still using pretrained weights to capture the pretraining boost. Parameter-efficient models lighten the computational and memory burdens of deployment to low-resource environments such as portable devices. In the clinical setting, the simplicity of our model truncation method encourages its adoption for model compression. This finding corroborates Raghu et al. [19] and Bressem et al. [3], which show simpler models can achieve performance comparable to more complex models on CheXpert. Our truncated models can use readily available pretrained weights, which may allow these models to capture the pretraining boost and speed up training. However, we do not study the performance of these truncated models without their pretrained weights.
As an additional benefit, architectures that truncate pooling layers will also produce higher-resolution class activation maps, as shown in Figure 5. The higher-resolution class activation maps (CAMs) may more effectively localize pathologies with little to no decrease in classification performance. In clinical settings, improved explainability through better CAMs may be useful for validating predictions and diagnosing mispredictions. As a result, clinicians may have more trust in models that provide these higher-resolution CAMs.
5 DISCUSSION
In this work, we study the performance and efficiency of ImageNet architectures for chest x-ray interpretation.
Is ImageNet performance correlated with CheXpert? No. We show no statistically significant relationship between ImageNet and CheXpert performance. This finding extends Kornblith et al. [15]—which found a significant correlation between ImageNet performance and transfer performance on typical image classification datasets—to the medical setting of chest x-ray interpretation. This difference could be attributed to unique aspects of the chest X-ray interpretation task and its data attributes. The chest X-ray interpretation task differs from natural image classification in that (1) disease classification may depend on abnormalities in a small number of pixels, (2) chest X-ray interpretation is a multi-task classification setup, and (3) there are far fewer classes than in many natural image classification datasets. Second, the data attributes for chest X-rays differ from natural image classification in that X-rays are greyscale and have similar spatial structures across images (always either anterior-posterior, posterior-anterior, or lateral).
Does model architecture matter? Yes. For models without pretraining, we find that the choice of architecture family may influence performance more than model size. Our findings extend Raghu et al. [19] beyond the effect of ImageNet weights, since we show that architectures that succeed on ImageNet do not necessarily succeed on medical imaging tasks. A notable finding of our work is that newer architectures generated through search on ImageNet (EfficientNet, MobileNet, MNASNet) underperform older architectures (DenseNet, ResNet) on CheXpert. This finding suggests that search may have overfit to ImageNet to the detriment of medical task performance, and ImageNet may not be an appropriate benchmark for selecting architectures for medical imaging tasks. Instead, medical imaging architectures could be benchmarked on CheXpert or other large medical datasets. Architectures derived from selection and search on CheXpert and other large medical datasets may be applicable to similar medical imaging modalities, including other x-ray studies or CT scans. Thus architecture search directly on CheXpert or other large medical datasets may allow us to unlock next-generation performance for medical imaging tasks.
Does ImageNet pretraining help? Yes. We find that ImageNet pretraining yields a statistically significant boost in performance for chest x-ray classification. Our findings are consistent with Raghu et al. [19], who find no pretraining boost on ResNet50 and InceptionV3, but we find pretraining does boost performance for 12 out of 16 architectures. Our findings extend He et al. [11]—who find models without pretraining had comparable performance to models pretrained on ImageNet for object detection and image segmentation of natural images—to the medical imaging setting. Future work may investigate the relationship between network architectures and the impact of self-supervised pretraining for chest x-ray interpretation as has recently been developed by Azizi et al. [2], Sowrirajan et al. [24], and Sriram et al. [26].
Can models be smaller? Yes. We find that by truncating final blocks of ImageNet-pretrained architectures, we can make models 3.25x more parameter-efficient on average without a statistically significant drop in performance. This method preserves the critical components of architecture design while cutting its size. This observation suggests model truncation may be a simple method to yield lighter models, using ImageNet pretrained weights to boost CheXpert performance. In the clinical setting, truncated models may provide value through improved parameter-efficiency and higher-resolution CAMs. This change may enable deployment to low-resource clinical environments and further develop model trust through improved explainability.
In closing, our work contributes to the understanding of the transfer performance and parameter efficiency of ImageNet models for chest X-ray interpretation. We hope that our new experimental evidence about the relation of ImageNet to medical task performance will shed light on potential future directions for progress.
REFERENCES
[1] Ioannis D Apostolopoulos and Tzani A Mpesiana. 2020. Covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks. Physical and Engineering Sciences in Medicine (2020), 1.
[2] Shekoofeh Azizi, Basil Mustafa, Fiona Ryan, Zachary Beaver, Jan Freyberg, Jonathan Deaton, Aaron Loh, Alan Karthikesalingam, Simon Kornblith, Ting Chen, Vivek Natarajan, and Mohammad Norouzi. 2021. Big Self-Supervised Models Advance Medical Image Classification. arXiv:2101.05224 [eess.IV]
[3] Keno K. Bressem, Lisa Adams, Christoph Erxleben, Bernd Hamm, Stefan Niehues, and Janis Vahldiek. 2020. Comparing Different Deep Learning Architectures for Classification of Chest Radiographs. arXiv:2002.08991 [cs.LG]
[4] Remi Cadene. 2018. pretrainedmodels 0.7.4. https://pypi.org/project/pretrainedmodels/
[5] S. Chen and Q. Zhao. 2019. Shallowing Deep Networks: Layer-Wise Pruning Based on Feature Representations. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 12 (2019), 3048–3056.
[6] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A Survey of Model Compression and Acceleration for Deep Neural Networks. CoRR abs/1710.09282 (2017). arXiv:1710.09282 http://arxiv.org/abs/1710.09282
[7] François Chollet. 2016. Xception: Deep Learning with Depthwise Separable Convolutions. CoRR abs/1610.02357 (2016). arXiv:1610.02357 http://arxiv.org/abs/1610.02357
[8] Jeffrey De Fauw, Joseph R. Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, Sam Blackwell, Harry Askham, Xavier Glorot, Brendan O'Donoghue, Daniel Visentin, George van den Driessche, Balaji Lakshminarayanan, Clemens Meyer, Faith Mackinder, Simon Bouton, Kareem Ayoub, Reena Chopra, Dominic King, Alan Karthikesalingam, Cían O. Hughes, Rosalind Raine, Julian Hughes, Dawn A. Sim, Catherine Egan, Adnan Tufail, Hugh Montgomery, Demis Hassabis, Geraint Rees, Trevor Back, Peng T. Khaw, Mustafa Suleyman, Julien Cornebise, Pearse A. Keane, and Olaf Ronneberger. 2018. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine 24, 9 (01 Sep 2018), 1342–1350. https://doi.org/10.1038/s41591-018-0107-6
[9] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 248–255.
[10] Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun. 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 7639 (2017), 115–118. https://doi.org/10.1038/nature21056
[11] Kaiming He, Ross B. Girshick, and Piotr Dollár. 2018. Rethinking ImageNet Pre-training. CoRR abs/1811.08883 (2018). arXiv:1811.08883 http://arxiv.org/abs/1811.08883
[12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [stat.ML]
[13] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn L. Ball, Katie S. Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. 2019. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. CoRR abs/1901.07031 (2019). arXiv:1901.07031 http://arxiv.org/abs/1901.07031
[14] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. 2014. Speeding up Convolutional Neural Networks with Low Rank Expansions. CoRR abs/1405.3866 (2014). arXiv:1405.3866 http://arxiv.org/abs/1405.3866
[15] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. 2018. Do Better ImageNet Models Transfer Better? CoRR abs/1805.08974 (2018). arXiv:1805.08974 http://arxiv.org/abs/1805.08974
[16] Feng Li, Zheng Liu, Hua Chen, Minshan Jiang, Xuedian Zhang, and Zhizheng Wu. 2019. Automatic Detection of Diabetic Retinopathy in Retinal Fundus Photographs Based on Deep Learning Algorithm. Translational Vision Science & Technology 8, 6 (11 2019), 4–4. https://doi.org/10.1167/tvst.8.6.4
[17] Akinori Mitani, Abigail Huang, Subhashini Venugopalan, Greg S. Corrado, Lily Peng, Dale R. Webster, Naama Hammel, Yun Liu, and Avinash V. Varadarajan. 2020. Detection of anaemia from retinal fundus images via deep learning. Nature Biomedical Engineering 4, 1 (01 Jan 2020), 18–27. https://doi.org/10.1038/s41551-019-0487-z
[18] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[19] Maithra Raghu, Chiyuan Zhang, Jon M. Kleinberg, and Samy Bengio. 2019. Transfusion: Understanding Transfer Learning with Applications to Medical Imaging. CoRR abs/1902.07208 (2019). arXiv:1902.07208 http://arxiv.org/abs/1902.07208
[20] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Yi Ding, Aarti Bagul, Curtis Langlotz, Katie S. Shpanskaya, Matthew P. Lungren, and Andrew Y. Ng. 2017. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. CoRR abs/1711.05225 (2017). arXiv:1711.05225 http://arxiv.org/abs/1711.05225
[21] P. Rajpurkar, Anirudh Joshi, Anuj Pareek, P. Chen, A. Kiani, Jeremy A. Irvin, A. Ng, and M. Lungren. 2020. CheXpedition: Investigating Generalization Challenges for Translation of Chest X-Ray Algorithms to the Clinical Setting. ArXiv abs/2002.11379 (2020).
[22] Youngmin Ro and Jin Young Choi. 2020. Layer-wise Pruning and Auto-tuning of Layer-wise Learning Rates in Fine-tuning of Deep Networks. arXiv:2002.06048 [cs.CV]
[23] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. 2016. Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization. CoRR abs/1610.02391 (2016). arXiv:1610.02391 http://arxiv.org/abs/1610.02391
[24] Hari Sowrirajan, Jingbo Yang, Andrew Y. Ng, and Pranav Rajpurkar. 2020. MoCo Pretraining Improves Representation and Transferability of Chest X-ray Models. arXiv:2010.05352 [cs.CV]
[25] Suraj Srinivas and R. Venkatesh Babu. 2015. Data-free parameter pruning for Deep Neural Networks. CoRR abs/1507.06149 (2015). arXiv:1507.06149 http://arxiv.org/abs/1507.06149
[26] Anuroop Sriram, Matthew Muckley, Koustuv Sinha, Farah Shamout, Joelle Pineau, Krzysztof J. Geras, Lea Azour, Yindalon Aphinyanaphongs, Nafissa Yakubova, and William Moore. 2021. COVID-19 Deterioration Prediction via Self-Supervised Representation Learning and Multi-Image Prediction. arXiv:2101.04909 [cs.CV]
[27] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. 2017. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. CoRR abs/1705.02315 (2017). arXiv:1705.02315 http://arxiv.org/abs/1705.02315
[28] Ross Wightman. 2020. timm 0.2.1. https://pypi.org/project/timm/
[29] Li Zhang, Mengya Yuan, Zhen An, Xiangmei Zhao, Hui Wu, Haibin Li, Ya Wang, Beibei Sun, Huijun Li, Shibin Ding, Xiang Zeng, Ling Chao, Pan Li, and Weidong Wu. 2020. Prediction of hypertension, hyperglycemia and dyslipidemia from retinal fundus photographs via deep learning: A cross-sectional study of chronic diseases in central China. PLOS ONE 15, 5 (05 2020), 1–11. https://doi.org/10.1371/journal.pone.0233166
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Existing fine-tuning methods use a single learning rate over all layers. In this paper, first, we discuss that trends of layer-wise weight variations by fine-tuning using a single learning rate do not match the well-known notion that lower-level layers extract general features and higher-level layers extract specific features. Based on our discussion, we propose an algorithm that improves fine-tuning performance and reduces network complexity through layer-wise pruning and auto-tuning of layer-wise learning rates. The proposed algorithm has verified the effectiveness by achieving state-of-the-art performance on the image retrieval benchmark datasets (CUB-200, Cars-196, Stanford online product, and Inshop). Code is available at https://github.com/youngminPIL/AutoLR.
Article
Full-text available
Retinal fundus photography provides a non-invasive approach for identifying early microcirculatory alterations of chronic diseases prior to the onset of overt clinical complications. Here, we developed neural network models to predict hypertension, hyperglycemia, dyslipidemia, and a range of risk factors from retinal fundus images obtained from a cross-sectional study of chronic diseases in rural areas of Xinxiang County, Henan, in central China. 1222 high-quality retinal images and over 50 measurements of anthropometry and biochemical parameters were generated from 625 subjects. The models in this study achieved an area under the ROC curve (AUC) of 0.880 in predicting hyperglycemia, of 0.766 in predicting hypertension, and of 0.703 in predicting dyslipidemia. In addition, these models can predict with AUC>0.7 several blood test erythrocyte parameters, including hematocrit (HCT), mean corpuscular hemoglobin concentration (MCHC), and a cluster of cardiovascular disease (CVD) risk factors. Taken together, deep learning approaches are feasible for predicting hypertension, dyslipidemia, diabetes, and risks of other chronic diseases.
Article
Full-text available
Owing to the invasiveness of diagnostic tests for anaemia and the costs associated with screening for it, the condition is often undetected. Here, we show that anaemia can be detected via machine-learning algorithms trained using retinal fundus images, study participant metadata (including race or ethnicity, age, sex and blood pressure) or the combination of both data types (images and study participant metadata). In a validation dataset of 11,388 study participants from the UK Biobank, the fundus-image-only, metadata-only and combined models predicted haemoglobin concentration (in g dl–1) with mean absolute error values of 0.73 (95% confidence interval: 0.72–0.74), 0.67 (0.66–0.68) and 0.63 (0.62–0.64), respectively, and with areas under the receiver operating characteristic curve (AUC) values of 0.74 (0.71–0.76), 0.87 (0.85–0.89) and 0.88 (0.86–0.89), respectively. For 539 study participants with self-reported diabetes, the combined model predicted haemoglobin concentration with a mean absolute error of 0.73 (0.68–0.78) and anaemia an AUC of 0.89 (0.85–0.93). Automated anaemia screening on the basis of fundus images could particularly aid patients with diabetes undergoing regular retinal imaging and for whom anaemia can increase morbidity and mortality risks. Machine-learning algorithms trained with retinal fundus images, with subject metadata or with both data types, predict haemoglobin concentration with mean absolute errors lower than 0.75 g dl–1 and anaemia with areas under the curve in the range of 0.74–0.89.
Article
Full-text available
Purpose: To achieve automatic diabetic retinopathy (DR) detection in retinal fundus photographs through the use of a deep transfer learning approach using the Inception-v3 network. Methods: A total of 19,233 eye fundus color numerical images were retrospectively obtained from 5278 adult patients presenting for DR screening. The 8816 images passed image-quality review and were graded as no apparent DR (1374 images), mild nonproliferative DR (NPDR) (2152 images), moderate NPDR (2370 images), severe NPDR (1984 images), and proliferative DR (PDR) (936 images) by eight retinal experts according to the International Clinical Diabetic Retinopathy severity scale. After image preprocessing, 7935 DR images were selected from the above categories as a training dataset, while the rest of the images were used as validation dataset. We introduced a 10-fold cross-validation strategy to assess and optimize our model. We also selected the publicly independent Messidor-2 dataset to test the performance of our model. For discrimination between no referral (no apparent DR and mild NPDR) and referral (moderate NPDR, severe NPDR, and PDR), we also computed prediction accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and κ value. Results: The proposed approach achieved a high classification accuracy of 93.49% (95% confidence interval [CI], 93.13%-93.85%), with a 96.93% sensitivity (95% CI, 96.35%-97.51%) and a 93.45% specificity (95% CI, 93.12%-93.79%), while the AUC was up to 0.9905 (95% CI, 0.9887-0.9923) on the independent test dataset. The κ value of our best model was 0.919, while the three experts had κ values of 0.906, 0.931, and 0.914, independently. Conclusions: This approach could automatically detect DR with excellent sensitivity, accuracy, and specificity and could aid in making a referral recommendation for further evaluation and treatment with high reliability. Translational relevance: This approach has great value in early DR screening using retinal fundus photographs.
Article
Full-text available
We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable. Our approach—Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (e.g.VGG), (2) CNNs used for structured outputs (e.g.captioning), (3) CNNs used in tasks with multi-modal inputs (e.g.visual question answering) or reinforcement learning, all without architectural changes or re-training. We combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are robust to adversarial perturbations, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias. For image captioning and VQA, our visualizations show that even non-attention based models learn to localize discriminative regions of input image. We devise a way to identify important neurons through Grad-CAM and combine it with neuron names (Bau et al. in Computer vision and pattern recognition, 2017) to provide textual explanations for model decisions. Finally, we design and conduct human studies to measure if Grad-CAM explanations help users establish appropriate trust in predictions from deep networks and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions. Our code is available at https://github.com/ramprs/grad-cam/, along with a demo on CloudCV (Agrawal et al., in: Mobile cloud visual media computing, pp 265–290. Springer, 2015) (http://gradcam.cloudcv.org) and a video at http://youtu.be/COjUB9Izk6E.
Article
The rapid spread of COVID-19 cases in recent months has strained hospital resources, making rapid and accurate triage of patients presenting to emergency departments a necessity. Machine learning techniques using clinical data such as chest X-rays have been used to predict which patients are most at risk of deterioration. We consider the task of predicting two types of patient deterioration based on chest X-rays: adverse event deterioration (i.e., transfer to the intensive care unit, intubation, or mortality) and increased oxygen requirements beyond 6 L per day. Due to the relative scarcity of COVID-19 patient data, existing solutions leverage supervised pretraining on related non-COVID images, but this is limited by the differences between the pretraining data and the target COVID-19 patient data. In this paper, we use self-supervised learning based on the momentum contrast (MoCo) method in the pretraining phase to learn more general image representations to use for downstream tasks. We present three results. The first is deterioration prediction from a single image, where our model achieves an area under receiver operating characteristic curve (AUC) of 0.742 for predicting an adverse event within 96 hours (compared to 0.703 with supervised pretraining) and an AUC of 0.765 for predicting oxygen requirements greater than 6 L a day at 24 hours (compared to 0.749 with supervised pretraining). We then propose a new transformer-based architecture that can process sequences of multiple images for prediction and show that this model can achieve an improved AUC of 0.786 for predicting an adverse event at 96 hours and an AUC of 0.848 for predicting mortalities at 96 hours. A small pilot clinical study suggested that the prediction accuracy of our model is comparable to that of experienced radiologists analyzing the same information.
Article
In this study, a dataset of X-Ray images from patients with common bacterial pneumonia, confirmed Covid-19 disease, and normal incidents was utilized for the automatic detection of the Coronavirus. The aim of the study is to evaluate the performance of state-of-the-art Convolutional Neural Network architectures proposed over recent years for medical image classification. Specifically, the procedure called transfer learning was adopted. With transfer learning, the detection of various abnormalities in small medical image datasets is an achievable target, often yielding remarkable results. The datasets utilized in this experiment are two. Firstly, a collection of 1427 X-Ray images including 224 images with confirmed Covid-19 disease, 700 images with confirmed common bacterial pneumonia, and 504 images of normal conditions. Secondly, a dataset including 224 images with confirmed Covid-19 disease, 714 images with confirmed bacterial and viral pneumonia, and 504 images of normal conditions. The data was collected from the available X-Ray images on public medical repositories. The results suggest that Deep Learning in X-Rays may extract significant biomarkers related to the Cpvid-19 disease, while the best accuracy, sensitivity, and specificity obtained is 96.78%, 98.66%, and 96.46% respectively. Since by now, all diagnostic tests show failure rates such as to raise concerns, the probability of incorporating X-rays into the diagnosis of the disease could be assessed by the medical community, based on the findings, while more research to evaluate the X-Ray approach from different aspects may be conducted.
Article
Large, labeled datasets have driven deep learning methods to achieve expert-level performance on a variety of medical imaging tasks. We present CheXpert, a large dataset that contains 224,316 chest radiographs of 65,240 patients. We design a labeler to automatically detect the presence of 14 observations in radiology reports, capturing uncertainties inherent in radiograph interpretation. We investigate different approaches to using the uncertainty labels for training convolutional neural networks that output the probability of these observations given the available frontal and lateral radiographs. On a validation set of 200 chest radiographic studies which were manually annotated by 3 board-certified radiologists, we find that different uncertainty approaches are useful for different pathologies. We then evaluate our best model on a test set composed of 500 chest radiographic studies annotated by a consensus of 5 board-certified radiologists, and compare the performance of our model to that of 3 additional radiologists in the detection of 5 selected pathologies. On Cardiomegaly, Edema, and Pleural Effusion, the model ROC and PR curves lie above all 3 radiologist operating points. We release the dataset to the public as a standard benchmark to evaluate performance of chest radiograph interpretation models.