CheXtransfer: Performance and Parameter Efficiency of ImageNet Models for Chest X-Ray Interpretation
Alexander Ke∗
alexke@cs.stanford.edu
Stanford University
USA
William Ellsworth∗
willells@cs.stanford.edu
Stanford University
USA
Oishi Banerjee∗
oishi.banerjee@cs.stanford.edu
Stanford University
USA
Andrew Y. Ng
ang@cs.stanford.edu
Stanford University
USA
Pranav Rajpurkar
pranavsr@cs.stanford.edu
Stanford University
USA
∗Authors contributed equally to this research.
Figure 1: Visual summary of our contributions. From left to right: scatterplot and best-fit line for 16 pretrained models showing no relationship between ImageNet and CheXpert performance; CheXpert performance varies across architecture families much more than within; average CheXpert performance improves with pretraining; models can maintain performance and improve parameter efficiency through truncation of final blocks. Error bars show one standard deviation.
ABSTRACT
Deep learning methods for chest X-ray interpretation typically rely on pretrained models developed for ImageNet. This paradigm assumes that better ImageNet architectures perform better on chest X-ray tasks and that ImageNet-pretrained weights provide a performance boost over random initialization. In this work, we compare the transfer performance and parameter efficiency of 16 popular convolutional architectures on a large chest X-ray dataset (CheXpert) to investigate these assumptions. First, we find no relationship between ImageNet performance and CheXpert performance for both models without pretraining and models with pretraining. Second, we find that, for models without pretraining, the choice of model family influences performance more than size within a family for medical imaging tasks. Third, we observe that ImageNet pretraining yields a statistically significant boost in performance across architectures, with a higher boost for smaller architectures. Fourth, we examine whether ImageNet architectures are unnecessarily large for CheXpert by truncating final blocks from pretrained models, and find that we can make models 3.25x more parameter-efficient on average without a statistically significant drop in performance. Our work contributes new experimental evidence about the relation of ImageNet to chest X-ray interpretation performance.
CCS CONCEPTS
• Applied computing → Health informatics; • Computing methodologies → Computer vision; Neural networks.
KEYWORDS
generalization, efficiency, pretraining, chest x-ray interpretation,
ImageNet, truncation
1 INTRODUCTION
Deep learning models for chest X-ray interpretation have high potential for social impact by aiding clinicians in their workflow and increasing access to radiology expertise worldwide. Transfer learning using pretrained ImageNet [9] models has been the standard approach for developing models not only on chest X-rays [1, 20, 27] but also for many other medical imaging modalities [8, 10, 16, 17, 29]. This transfer assumes that better ImageNet architectures perform better and pretrained weights boost performance on their target
medical tasks. However, there has not been a systematic investigation of how ImageNet architectures and weights both relate to performance on downstream medical tasks.
In this work, we systematically investigate how ImageNet architectures and weights both relate to performance on chest X-ray tasks. Our primary contributions are:
(1) For models without pretraining and models with pretraining, we find no relationship between ImageNet performance and CheXpert performance (Spearman ρ = 0.08 and ρ = 0.06, respectively). This finding suggests that architecture improvements on ImageNet may not lead to improvements on medical imaging tasks.
(2) For models without pretraining, we find that within an architecture family, the largest and smallest models have small differences in CheXpert AUC (ResNet 0.005, DenseNet 0.003, EfficientNet 0.004), but different model families have larger differences in AUC (> 0.006). This finding suggests that the choice of model family influences performance more than size within a family for medical imaging tasks.
(3) We observe that ImageNet pretraining yields a statistically significant boost in performance (average boost of 0.016 AUC) across architectures, with a higher boost for smaller architectures (Spearman ρ = −0.72 with number of parameters). This finding supports the ImageNet pretraining paradigm for medical imaging tasks, especially for smaller models.
(4) We find that by truncating final blocks of pretrained models, we can make models 3.25x more parameter-efficient on average without a statistically significant drop in performance. This finding suggests model truncation may be a simple method to yield lighter pretrained models by preserving architecture design features while reducing model size.
Our study, to the best of our knowledge, contributes the first systematic investigation of the performance and efficiency of ImageNet architectures and weights for chest X-ray interpretation. Our investigation and findings may be further validated on other datasets and medical imaging tasks.
2 RELATED WORK
2.1 ImageNet Transfer
Kornblith et al. [15] examined the performance of 16 convolutional neural networks (CNNs) on 12 image classification datasets. They found that using these ImageNet-pretrained architectures either as feature extractors for logistic regression or fine-tuning them on the target dataset yielded a Spearman ρ = 0.99 and ρ = 0.97 between ImageNet accuracy and transfer accuracy, respectively. However, they showed ImageNet performance was less correlated with transfer accuracy for some fine-grained tasks, corroborating He et al. [11]. They found that without ImageNet pretraining, ImageNet accuracy and transfer accuracy had a weaker Spearman ρ = 0.59. We extend Kornblith et al. [15] to the medical setting by studying the relationship between ImageNet and CheXpert performance.
Raghu et al. [19] explored properties of transfer learning onto retinal fundus images and chest X-rays. They studied ResNet50 and InceptionV3 and showed pretraining offers little performance improvement. Architectures composed of just four to five sequential convolution and pooling layers achieved comparable performance on these tasks as ResNet50 with less than 40% of the parameters. In our work, we find pretraining does not boost performance for ResNet50, InceptionV3, InceptionV4, and MNASNet but does boost performance for the remaining 12 architectures. Thus, we were able to replicate Raghu et al. [19]'s results, but upon studying a broader set of newer and more popular models, we reached the opposite conclusion that ImageNet pretraining yields a statistically significant boost in performance.
2.2 Medical Task Architectures
Irvin et al. [13] compared the performance of ResNet152, DenseNet121, InceptionV4, and SEResNeXt101 on CheXpert, finding that DenseNet121 performed best. In a recent analysis, all but one of the top ten CheXpert competition models used DenseNets as part of their ensemble, even though they have been surpassed on ImageNet [21]. Few groups design their own networks from scratch, preferring to use established ResNet and DenseNet architectures for CheXpert [3]. This trend extends to retinal fundus and skin cancer tasks as well, where Inception architectures remain popular [8, 16, 17, 29]. The popularity of these older ImageNet architectures hints that there may be a disconnect between ImageNet performance and medical task performance for newer architectures generated through architecture search. We verify that these newer architectures generated through search (EfficientNet, MobileNet, MNASNet) underperform older architectures (DenseNet, ResNet) on CheXpert, suggesting that search has overfit to ImageNet and explaining the popularity of these older architectures in the medical imaging literature.
Bressem et al. [3] postulated that deep CNNs that can represent more complex relationships for ImageNet may not be necessary for CheXpert, which has greyscale inputs and fewer image classes. They studied ResNet, DenseNet, VGG, SqueezeNet, and AlexNet performance on CheXpert and found that ResNet152, DenseNet161, and ResNet50 performed best on CheXpert AUC. In terms of AUPRC, they showed that smaller architectures like AlexNet and VGG can perform similarly to deeper architectures on CheXpert. Models such as AlexNet, VGG, and SqueezeNet are no longer popular in the medical setting, so in our work, we systematically investigate the performance and efficiency of 16 more contemporary ImageNet architectures with and without pretraining. Additionally, we extend [3] by studying the effects of pretraining, characterizing the relationship between ImageNet and CheXpert performance, and drawing conclusions about architecture design.
2.3 Truncated Architectures
The more complex a convolutional architecture becomes, the more computational and memory resources are needed for its training and deployment. Model complexity thus may impede the deployment of CNNs to clinical settings with fewer resources. Therefore, efficiency, often reported in terms of the number of parameters in a model, the number of FLOPS in the forward pass, or the latency of the forward pass, has become increasingly important in model design. Low-rank factorization [7, 14], transferred/compact convolutional filters [6], knowledge distillation [12], and parameter pruning [25] have all been proposed to make CNNs more efficient. Layer-wise pruning is a type of parameter pruning that locates and removes layers that are not as useful to the target task [22].
Through feature diagnosis, a linear classifier is trained using the feature maps at intermediate layers to quantify how much a particular layer contributes to performance on the target task [5]. In this work, we propose model truncation as a simple method for layer-wise pruning where the final pretrained layers after a given point are pruned off, a classification layer is appended, and this whole model is finetuned on the target task.
3 METHODS
3.1 Training and Evaluation Procedure
We train chest X-ray classification models with different architectures, with and without pretraining. The task of interest is to predict the probability of different pathologies from one or more chest X-rays. We use the CheXpert dataset consisting of 224,316 chest X-rays of 65,240 patients [13] labeled for the presence or absence of 14 radiological observations. We evaluate models using the average of their AUROC metrics (AUC) on the five CheXpert-defined competition tasks (Atelectasis, Cardiomegaly, Consolidation, Edema, Pleural Effusion) as well as the No Finding task to balance clinical importance and prevalence in the validation set.
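To make this evaluation metric concrete, the following is a minimal sketch of how the average AUC over these six tasks could be computed; it is not the authors' code, and the array layout (one column per task, ordered as listed) is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical task list matching the six evaluation tasks described above.
EVAL_TASKS = ["Atelectasis", "Cardiomegaly", "Consolidation",
              "Edema", "Pleural Effusion", "No Finding"]

def average_auc(y_true, y_prob, task_names=EVAL_TASKS):
    """Mean AUROC over the evaluation tasks.

    y_true: (n_samples, n_tasks) binary labels, columns ordered as task_names.
    y_prob: (n_samples, n_tasks) predicted probabilities.
    """
    aucs = [roc_auc_score(y_true[:, i], y_prob[:, i])
            for i in range(len(task_names))]
    return float(np.mean(aucs)), dict(zip(task_names, aucs))
```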
We select 16 models pretrained on ImageNet from public checkpoints implemented in PyTorch 1.4.0: DenseNet (121, 169, 201) and ResNet (18, 34, 50, 101) from Paszke et al. [18], Inception (V3, V4) and MNASNet from Cadene [4], and EfficientNet (B0, B1, B2, B3) and MobileNet (V2, V3) from Wightman [28]. We finetune and evaluate these architectures with and without ImageNet pretraining.
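As an illustration of this setup, the sketch below loads one such checkpoint (torchvision's DenseNet121, using the PyTorch 1.4-era `pretrained=` argument) and replaces the ImageNet classifier with a 14-way head; the other families would be loaded analogously from pretrainedmodels or timm, each with its own classifier attribute name.

```python
import torch.nn as nn
import torchvision.models as tvm

NUM_OBSERVATIONS = 14  # CheXpert radiological observations

def build_densenet121(pretrained=True):
    """Load a torchvision DenseNet121 checkpoint and replace the ImageNet
    classifier with a 14-way head (one logit per observation)."""
    model = tvm.densenet121(pretrained=pretrained)
    in_features = model.classifier.in_features
    model.classifier = nn.Linear(in_features, NUM_OBSERVATIONS)
    return model
```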
For each model, we finetune all parameters on the CheXpert training set. If a model is pretrained, inputs are normalized using the mean and standard deviation learned from ImageNet. If a model is not pretrained, inputs are normalized with the mean and standard deviation learned from CheXpert. We use the Adam optimizer (β1 = 0.9, β2 = 0.999) with a learning rate of 1 × 10⁻⁴, a batch size of 16, and a cross-entropy loss function. We train on up to four Nvidia GTX 1080 GPUs with CUDA 10.1 and an Intel Xeon E5-2609 CPU running Ubuntu 16.04. For one run of an architecture, we train for three epochs and evaluate each model every 8192 gradient steps. We train each model and create a final ensemble from the ten checkpoints that achieved the best average CheXpert AUC across the six tasks on the validation set. We report all our results on the CheXpert test set.
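A minimal sketch of this finetuning loop is given below, assuming a DataLoader yielding (images, labels) with labels shaped (batch, 14); the per-task binary cross-entropy (sigmoid) loss is our reading of the multi-label setup, not a verbatim reproduction of the authors' code.

```python
import torch
import torch.nn as nn

def finetune(model, train_loader, device="cuda", steps_per_eval=8192):
    """Finetune all parameters with the hyperparameters reported above.
    `train_loader` is assumed to yield (images, labels) with labels shaped
    (batch, 14) in {0, 1}; multi-label classification uses a sigmoid plus
    binary cross-entropy loss."""
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                                 betas=(0.9, 0.999))
    criterion = nn.BCEWithLogitsLoss()
    step = 0
    for epoch in range(3):                      # three epochs per run
        for images, labels in train_loader:     # batch size 16
            images, labels = images.to(device), labels.to(device).float()
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            step += 1
            if step % steps_per_eval == 0:
                pass  # evaluate and save a checkpoint on the validation set here
    return model
```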
We use the nonparametric bootstrap to estimate 95% confidence intervals for each statistic. 1,000 replicates are drawn from the test set, and the statistic is calculated on each replicate. This procedure produces a distribution for each statistic, and we report the 2.5 and 97.5 percentiles as a confidence interval. Significance is assessed at the p = 0.05 level.
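The bootstrap procedure can be sketched as follows, assuming the labels and predictions are NumPy arrays and that `statistic` is any metric function, such as the average AUC above.

```python
import numpy as np

def bootstrap_ci(statistic, y_true, y_prob, n_replicates=1000, seed=0):
    """Nonparametric bootstrap 95% CI, as a sketch of the procedure above.
    `statistic` is any callable taking (y_true, y_prob)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    values = []
    for _ in range(n_replicates):
        idx = rng.integers(0, n, size=n)   # resample the test set with replacement
        values.append(statistic(y_true[idx], y_prob[idx]))
    return np.percentile(values, 2.5), np.percentile(values, 97.5)
```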
3.2 Truncated Architectures
We study truncated versions of DenseNet121, MNASNet, ResNet18, and EfficientNetB0. DenseNet121 and MNASNet were chosen because we found they have the greatest efficiency (by AUC per parameter) on CheXpert of the models we profile, ResNet18 was chosen because of its popularity as a compact model for medical tasks, and EfficientNetB0 was chosen because it is the smallest current-generation model of the 16 we study. DenseNet121 contains four dense blocks separated by transition blocks before the classification layer. By pruning the final dense block and associated transition block, the model contains only three dense blocks, yielding DenseNet121Minus1. Similarly, pruning two dense blocks and associated transition blocks yields DenseNet121Minus2, and pruning three dense blocks and associated transition blocks yields DenseNet121Minus3. For MNASNet, we remove up to four of the final MBConv blocks to produce MNASNetMinus1 through MNASNetMinus4. For ResNet18, we remove up to three of the final residual blocks with a similar method to produce ResNet18Minus1 through ResNet18Minus3. For EfficientNet, we remove up to two of the final MBConv6 blocks to produce EfficientNetB0Minus1 and EfficientNetB0Minus2.
After truncating a model, we append a classification block containing a global average pooling layer followed by a fully connected layer to yield outputs of the correct shape. We initialize the model with ImageNet-pretrained weights, except the randomly initialized classification block, and finetune using the same training procedure as the 16 ImageNet models.
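The sketch below illustrates this truncation on torchvision's DenseNet121: dropping block/transition pairs from the end of `features` and appending a global average pool and fully connected layer. The layer ordering assumed here matches torchvision's implementation; analogous edits would be needed for the other families.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

def densenet121_minus(n_blocks_removed=1, num_classes=14, pretrained=True):
    """Sketch of model truncation on torchvision's DenseNet121: drop the last
    `n_blocks_removed` dense blocks (with their preceding transition blocks),
    then append a fresh global-average-pool + linear classification block."""
    backbone = tvm.densenet121(pretrained=pretrained).features
    layers = list(backbone.children())
    # torchvision order: [conv0, norm0, relu0, pool0,
    #                     denseblock1, transition1, ..., denseblock4, norm5]
    layers = layers[:-1]                                     # drop the final batch norm
    layers = layers[: len(layers) - 2 * n_blocks_removed]    # drop block + transition pairs
    features = nn.Sequential(*layers)

    with torch.no_grad():   # infer the channel count of the new final block
        out_channels = features(torch.zeros(1, 3, 224, 224)).shape[1]

    return nn.Sequential(
        features,
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(out_channels, num_classes),   # randomly initialized classifier
    )
```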
3.3 Class Activation Maps
We compare the class activation maps (CAMs) among a truncated DenseNet121 family to visualize their higher-resolution CAMs. We generate CAMs using the Grad-CAM method [23], using a weighted combination of the model's final convolutional feature maps, with weights based on the positive partial derivatives with respect to the class score. This averaged map is scaled by the outputted probability so more confident predictions appear brighter. Finally, the map is upsampled to the input image resolution and overlain onto the input image, highlighting image regions that had the greatest influence on a model's decision.
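A minimal Grad-CAM sketch following this description is shown below; it assumes a recent PyTorch (for `register_full_backward_hook`) and a single-image batch, and the positive-gradient weighting and probability scaling mirror the text above rather than any released implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Minimal Grad-CAM sketch. `image` is a (1, 3, H, W) tensor and
    `target_layer` is the final convolutional block (e.g. model.features
    for a torchvision DenseNet)."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    try:
        logits = model(image)
        prob = torch.sigmoid(logits)[0, class_idx]
        model.zero_grad()
        logits[0, class_idx].backward()
        # Weights from the positive partial derivatives w.r.t. the class score.
        weights = grads["a"].clamp(min=0).mean(dim=(2, 3), keepdim=True)
        cam = (weights * feats["a"]).sum(dim=1, keepdim=True)
        cam = F.relu(cam) * prob   # brighter for more confident predictions
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                            align_corners=False)
        return cam[0, 0].detach()
    finally:
        h1.remove()
        h2.remove()
```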
4 EXPERIMENTS
4.1 ImageNet Transfer Performance
We investigate whether higher performance on natural image classification translates to higher performance on chest X-ray classification. We display the relationship between the CheXpert AUC, with and without ImageNet pretraining, and ImageNet top-1 accuracy in Figure 2.
When models are trained without pretraining, we find no monotonic relationship between ImageNet top-1 accuracy and average CheXpert AUC, with Spearman ρ = 0.082 at p = 0.762. Model performance without pretraining describes how a given architecture performs on the target task, independent of any pretrained weights. When models are trained with pretraining, we again find no monotonic relationship between ImageNet top-1 accuracy and average CheXpert AUC, with Spearman ρ = 0.059 at p = 0.829.
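These rank correlations can be reproduced with SciPy given two parallel sequences over the 16 architectures; the function below is a trivial sketch, with the input values to be taken from Table 1 and published ImageNet results.

```python
from scipy.stats import spearmanr

def rank_correlation(imagenet_top1, chexpert_auc):
    """Spearman rank correlation between ImageNet top-1 accuracy and average
    CheXpert AUC across the architectures (two parallel sequences)."""
    rho, p_value = spearmanr(imagenet_top1, chexpert_auc)
    return rho, p_value
```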
Overall, we find no relationship between ImageNet and CheXpert performance, so models that succeed on ImageNet do not necessarily succeed on CheXpert. These relationships between ImageNet performance and CheXpert performance are much weaker than the relationships between ImageNet performance and performance on various natural image tasks reported by Kornblith et al. [15].
Figure 2: Average CheXpert AUC vs. ImageNet Top-1 Accuracy. The left plot shows results obtained without pretraining, while the right plot shows results with pretraining. There is no monotonic relationship between ImageNet and CheXpert performance without pretraining (Spearman ρ = 0.08) or with pretraining (Spearman ρ = 0.06).
We compare the CheXpert performance within and across architecture families. Without pretraining, we find that ResNet101 performs only 0.005 AUC greater than ResNet18, which is well
within the confidence interval of this metric (Figure 2). Similarly, DenseNet201 performs 0.004 AUC greater than DenseNet121 and EfficientNetB3 performs 0.003 AUC greater than EfficientNetB0. With pretraining, we continue to find minor performance differences between the largest model and smallest model that we test in each family. We find AUC increases of 0.002 for ResNet, 0.004 for DenseNet, and −0.006 for EfficientNet. Thus, increasing complexity within a model family does not yield increases in CheXpert performance as meaningful as the corresponding increases in ImageNet performance.
Without pretraining, we find that the best model studied performs significantly better than the worst model studied. Among models trained without pretraining, we find that InceptionV3 performs best with 0.866 (0.851, 0.880) AUC, while MobileNetV2 performs worst with 0.814 (0.796, 0.832) AUC. Their difference in performance is 0.052 (0.043, 0.063) AUC. InceptionV3 is also the third largest architecture studied and MobileNetV2 the smallest. We find a significant difference in the CheXpert performance of these models. This difference again hints at the importance of architecture design.
4.2 CheXpert Performance and Efficiency
We examine whether larger architectures perform better than smaller architectures on chest X-ray interpretation, where architecture size is measured by the number of parameters. We display these relationships in Figure 3 and Table 1.
Without ImageNet pretraining, we find a positive monotonic relationship between the number of parameters of an architecture and CheXpert performance, with Spearman ρ = 0.791, significant at p = 2.62 × 10⁻⁴. With ImageNet pretraining, there is a weaker positive monotonic relationship between the number of parameters and average CheXpert AUC, with Spearman ρ = 0.565 at p = 0.023.
Although there exists a positive monotonic relationship between the number of parameters of an architecture and average CheXpert AUC, the Spearman ρ does not highlight the increase in parameters necessary to realize marginal increases in CheXpert AUC. For example, ResNet101 is 11.1x larger than EfficientNetB0, but yields an increase of only 0.005 in CheXpert AUC with pretraining.
Within a model family, increasing the number of parameters does not lead to meaningful gains in CheXpert AUC. We see this relationship in all families studied without pretraining (EfficientNet, DenseNet, and ResNet) in Figure 3. For example, DenseNet201 has an AUC 0.003 greater than DenseNet121, but is 2.6x larger.
Figure 3: Average CheXpert AUC vs. Model Size. The left plot shows results obtained without pretraining, while the right plot shows results with pretraining. The logarithm of the model size has a near linear relationship with CheXpert performance when we omit pretraining (Spearman ρ = 0.79). However, once we incorporate pretraining, the monotonic relationship is weaker (Spearman ρ = 0.56).
Model CheXpert AUC #Params (M)
DenseNet121 0.859 (0.846, 0.871) 6.968
DenseNet169 0.860 (0.848, 0.873) 12.508
DenseNet201 0.864 (0.850, 0.876) 18.120
EfficientNetB0 0.859 (0.846, 0.871) 4.025
EfficientNetB1 0.858 (0.844, 0.872) 6.531
EfficientNetB2 0.866 (0.853, 0.880) 7.721
EfficientNetB3 0.853 (0.837, 0.867) 10.718
InceptionV3 0.862 (0.848, 0.876) 27.161
InceptionV4 0.861 (0.846, 0.873) 42.680
MNASNet 0.858 (0.845, 0.871) 5.290
MobileNetV2 0.854 (0.839, 0.869) 2.242
MobileNetV3 0.859 (0.847, 0.872) 4.220
ResNet101 0.863 (0.848, 0.876) 44.549
ResNet18 0.862 (0.847, 0.875) 11.690
ResNet34 0.863 (0.849, 0.875) 21.798
ResNet50 0.859 (0.843, 0.871) 25.557
Table 1: CheXpert AUC (with 95% Confidence Intervals) and Number of Parameters for 16 ImageNet-Pretrained Models.
EfficientNetB3 has an AUC 0.004 greater than EfficientNetB0, but is 1.9x larger. Despite the positive relationship between model size and CheXpert performance across all models, bigger does not necessarily mean better within a model family.
Since within a model family there is a weaker relationship between model size and CheXpert performance than across all models, we find that CheXpert performance is influenced more by the macro architecture design than by its size. Models within a family have similar architecture design choices but different sizes, so they perform similarly on CheXpert. We observe large discrepancies in performance between architecture families. For example, DenseNet, ResNet, and Inception typically outperform EfficientNet and MobileNet architectures, regardless of their size. EfficientNet, MobileNet, and MNASNet were all generated through neural architecture search to some degree, a process that optimized for performance on ImageNet. Our findings suggest that this search could have overfit to the natural image objective to the detriment of chest X-ray tasks.
Figure 4: Pretraining Boost vs. Model Size. We define the pretraining boost as the increase in the average CheXpert AUC achieved with pretraining vs. without pretraining. Most models benefit significantly from ImageNet pretraining. Smaller models tend to benefit more than larger models (Spearman ρ = −0.72).
4.3 ImageNet Pretraining Boost
We study the effects of ImageNet pretraining on CheXpert performance by defining the pretraining boost as the CheXpert AUC of a model initialized with ImageNet pretraining minus the CheXpert AUC of its counterpart without pretraining. The pretraining boosts of our architectures are reported in Figure 4.
We find that ImageNet pretraining provides a meaningful boost for most architectures (on average 0.015 AUC). We find a Spearman ρ = −0.718 at p = 0.002 between the number of parameters of a given model and the pretraining boost. Therefore, this boost tends to be larger for smaller architectures such as EfficientNetB0 (0.023), MobileNetV2 (0.040), and MobileNetV3 (0.033) and smaller for larger architectures such as InceptionV4 (−0.002) and ResNet101 (0.013). Further work is required to explain this relationship.
Within a model family, the pretraining boost also does not meaningfully increase as model size increases. For example, DenseNet201 has a pretraining boost only 0.002 AUC greater than DenseNet121 does. This finding supports our earlier conclusion that model families perform similarly on CheXpert regardless of their size.
4.4 Truncated Architectures
We truncate the final blocks of DenseNet121, MNASNet, ResNet18, and EfficientNetB0 with pretrained weights and study their CheXpert performance to understand whether ImageNet models are unnecessarily large for the chest X-ray task. We express efficiency gains in terms of Times-Smaller, or the number of parameters of the original architecture divided by the number of parameters of the truncated architecture: intuitively, how many times larger the original architecture is compared to the truncated architecture. The efficiency gains and AUC changes of model truncation on DenseNet121, MNASNet, ResNet18, and EfficientNetB0 are displayed in Table 2.
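Counting parameters and the Times-Smaller ratio are straightforward; the helper below is a sketch using PyTorch's parameter iterator and is not tied to any particular model family. With the earlier DenseNet sketch, for instance, one could call times_smaller(densenet121_minus(0), densenet121_minus(1)).

```python
def count_params(model):
    """Total number of trainable parameters (reported in millions in Table 1)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def times_smaller(original, truncated):
    """Times-Smaller ratio: parameters of the original model divided by
    parameters of the truncated model."""
    return count_params(original) / count_params(truncated)
```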
For all four model families, we find that truncating the final block leads to no significant decrease in CheXpert AUC but can save 1.4x to 4.2x the parameters. Notably, truncating the final block of ResNet18 yields a model that is not significantly different (difference −0.002 (−0.008, 0.004)) in CheXpert AUC, but is 4.2x smaller. Truncating the final two blocks of EfficientNetB0 yields a model that is not significantly different (difference 0.004 (−0.003, 0.009)) in CheXpert AUC, but is 4.7x smaller. However, truncating the second block and beyond in each of MNASNet, DenseNet121, and ResNet18 yields models that have statistically significant drops in CheXpert performance.
Model AUC Change Times-Smaller
EfficientNetB0 0.00% 1x
EfficientNetB0Minus1 0.15% 1.4x
EfficientNetB0Minus2 -0.45% 4.7x
MNASNet 0.00% 1x
MNASNetMinus1 -0.07% 2.5x
MNASNetMinus2* -2.30% 11.2x
MNASNetMinus3* -2.51% 20.0x
MNASNetMinus4* -6.40% 112.9x
DenseNet121 0.00% 1x
DenseNet121Minus1 -0.04% 1.6x
DenseNet121Minus2* -1.33% 5.3x
DenseNet121Minus3* -4.73% 20.0x
ResNet18 0.00% 1x
ResNet18Minus1 0.24% 4.2x
ResNet18Minus2* -3.70% 17.1x
ResNet18Minus3* -8.33% 73.8x
Table 2: Efficiency Trade-Off of Truncated Models. Pretrained models can be truncated without a significant decrease in CheXpert AUC. Truncated models with significantly different AUC from the base model are denoted with an asterisk.
Figure 5: Comparison of Class Activation Maps Among a Truncated Model Family. CAMs yielded by, from left to right, DenseNet121, DenseNet121Minus1, and DenseNet121Minus2, for frontal chest X-rays demonstrating Atelectasis (top) and Edema (bottom). Further-truncated models more effectively localize the Atelectasis, as well as tracing the hila and vessel branching for Edema.
Model truncation effectively compresses models performant on CheXpert, making them more parameter-efficient while still using pretrained weights to capture the pretraining boost. Parameter-efficient models lighten the computational and memory burdens for deployment to low-resource environments such as portable devices. In the clinical setting, the simplicity of our model truncation method encourages its adoption for model compression. This finding corroborates Raghu et al. [19] and Bressem et al. [3], which show simpler models can achieve performance comparable to more complex models on CheXpert. Our truncated models can use readily available pretrained weights, which may allow these models to capture the pretraining boost and speed up training. However, we do not study the performance of these truncated models without their pretrained weights.
As an additional benefit, architectures that truncate pooling layers will also produce higher-resolution class activation maps, as shown in Figure 5. The higher-resolution class activation maps (CAMs) may more effectively localize pathologies with little to no decrease in classification performance. In clinical settings, improved explainability through better CAMs may be useful for validating predictions and diagnosing mispredictions. As a result, clinicians may have more trust in models that provide these higher-resolution CAMs.
5 DISCUSSION
In this work, we study the performance and efficiency of ImageNet architectures for chest X-ray interpretation.
Is ImageNet performance correlated with CheXpert? No. We show no statistically significant relationship between ImageNet and CheXpert performance. This finding extends Kornblith et al. [15], which found a significant correlation between ImageNet performance and transfer performance on typical image classification datasets, to the medical setting of chest X-ray interpretation. This difference could be attributed to unique aspects of the chest X-ray interpretation task and its data attributes. The chest X-ray interpretation task differs from natural image classification in that (1) disease classification may depend on abnormalities in a small number of pixels, (2) chest X-ray interpretation is a multi-task classification setup, and (3) there are far fewer classes than in many natural image classification datasets. Second, the data attributes for chest X-rays differ from natural image classification in that X-rays are greyscale and have similar spatial structures across images (always either anterior-posterior, posterior-anterior, or lateral).
Does model architecture matter? Yes. For models without pretraining, we find that the choice of architecture family may influence performance more than model size. Our findings extend Raghu et al. [19] beyond the effect of ImageNet weights, since we show that architectures that succeed on ImageNet do not necessarily succeed on medical imaging tasks. A notable finding of our work is that newer architectures generated through search on ImageNet (EfficientNet, MobileNet, MNASNet) underperform older architectures (DenseNet, ResNet) on CheXpert. This finding suggests that search may have overfit to ImageNet to the detriment of medical task performance, and ImageNet may not be an appropriate benchmark for selecting architectures for medical imaging tasks. Instead, medical imaging architectures could be benchmarked on CheXpert or other large medical datasets. Architectures derived from selection and search on CheXpert and other large medical datasets may be applicable to similar medical imaging modalities, including other X-ray studies or CT scans. Thus architecture search directly on
CheXpert or other large medical datasets may allow us to unlock
next generation performance for medical imaging tasks.
Does ImageNet pretraining help? Yes. We find that ImageNet pretraining yields a statistically significant boost in performance for chest X-ray classification. Our findings are consistent with Raghu et al. [19], who find no pretraining boost on ResNet50 and InceptionV3, but we find pretraining does boost performance for 12 out of 16 architectures. Our findings extend He et al. [11], who find that models without pretraining had comparable performance to models pretrained on ImageNet for object detection and image segmentation of natural images, to the medical imaging setting. Future work may investigate the relationship between network architectures and the impact of self-supervised pretraining for chest X-ray interpretation, as has recently been developed by Azizi et al. [2], Sowrirajan et al. [24], and Sriram et al. [26].
Can models be smaller? Yes. We find that by truncating final blocks of ImageNet-pretrained architectures, we can make models 3.25x more parameter-efficient on average without a statistically significant drop in performance. This method preserves the critical components of architecture design while cutting model size. This observation suggests model truncation may be a simple method to yield lighter models, using ImageNet-pretrained weights to boost CheXpert performance. In the clinical setting, truncated models may provide value through improved parameter efficiency and higher-resolution CAMs. This change may enable deployment to low-resource clinical environments and further develop model trust through improved explainability.
In closing, our work contributes to the understanding of the transfer performance and parameter efficiency of ImageNet models for chest X-ray interpretation. We hope that our new experimental evidence about the relation of ImageNet to medical task performance will shed light on potential future directions for progress.
REFERENCES
[1] Ioannis D Apostolopoulos and Tzani A Mpesiana. 2020. Covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks. Physical and Engineering Sciences in Medicine (2020), 1.
[2] Shekoofeh Azizi, Basil Mustafa, Fiona Ryan, Zachary Beaver, Jan Freyberg, Jonathan Deaton, Aaron Loh, Alan Karthikesalingam, Simon Kornblith, Ting Chen, Vivek Natarajan, and Mohammad Norouzi. 2021. Big Self-Supervised Models Advance Medical Image Classification. arXiv:2101.05224 [eess.IV]
[3] Keno K. Bressem, Lisa Adams, Christoph Erxleben, Bernd Hamm, Stefan Niehues, and Janis Vahldiek. 2020. Comparing Different Deep Learning Architectures for Classification of Chest Radiographs. arXiv:2002.08991 [cs.LG]
[4] Remi Cadene. 2018. pretrainedmodels 0.7.4. https://pypi.org/project/pretrainedmodels/.
[5] S. Chen and Q. Zhao. 2019. Shallowing Deep Networks: Layer-Wise Pruning Based on Feature Representations. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 12 (2019), 3048–3056.
[6] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A Survey of Model Compression and Acceleration for Deep Neural Networks. CoRR abs/1710.09282 (2017). arXiv:1710.09282 http://arxiv.org/abs/1710.09282
[7] François Chollet. 2016. Xception: Deep Learning with Depthwise Separable Convolutions. CoRR abs/1610.02357 (2016). arXiv:1610.02357 http://arxiv.org/abs/1610.02357
[8] Jeffrey De Fauw, Joseph R. Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, Sam Blackwell, Harry Askham, Xavier Glorot, Brendan O'Donoghue, Daniel Visentin, George van den Driessche, Balaji Lakshminarayanan, Clemens Meyer, Faith Mackinder, Simon Bouton, Kareem Ayoub, Reena Chopra, Dominic King, Alan Karthikesalingam, Cían O. Hughes, Rosalind Raine, Julian Hughes, Dawn A. Sim, Catherine Egan, Adnan Tufail, Hugh Montgomery, Demis Hassabis, Geraint Rees, Trevor Back, Peng T. Khaw, Mustafa Suleyman, Julien Cornebise, Pearse A. Keane, and Olaf Ronneberger. 2018. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine 24, 9 (01 Sep 2018), 1342–1350. https://doi.org/10.1038/s41591-018-0107-6
[9] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 248–255.
[10] Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun. 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 7639 (2017), 115–118. https://doi.org/10.1038/nature21056
[11] Kaiming He, Ross B. Girshick, and Piotr Dollár. 2018. Rethinking ImageNet Pre-training. CoRR abs/1811.08883 (2018). arXiv:1811.08883 http://arxiv.org/abs/1811.08883
[12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [stat.ML]
[13] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn L. Ball, Katie S. Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. 2019. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. CoRR abs/1901.07031 (2019). arXiv:1901.07031 http://arxiv.org/abs/1901.07031
[14] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. 2014. Speeding up Convolutional Neural Networks with Low Rank Expansions. CoRR abs/1405.3866 (2014). arXiv:1405.3866 http://arxiv.org/abs/1405.3866
[15] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. 2018. Do Better ImageNet Models Transfer Better? CoRR abs/1805.08974 (2018). arXiv:1805.08974 http://arxiv.org/abs/1805.08974
[16] Feng Li, Zheng Liu, Hua Chen, Minshan Jiang, Xuedian Zhang, and Zhizheng Wu. 2019. Automatic Detection of Diabetic Retinopathy in Retinal Fundus Photographs Based on Deep Learning Algorithm. Translational Vision Science & Technology 8, 6 (11 2019), 4–4. https://doi.org/10.1167/tvst.8.6.4
[17] Akinori Mitani, Abigail Huang, Subhashini Venugopalan, Greg S. Corrado, Lily Peng, Dale R. Webster, Naama Hammel, Yun Liu, and Avinash V. Varadarajan. 2020. Detection of anaemia from retinal fundus images via deep learning. Nature Biomedical Engineering 4, 1 (01 Jan 2020), 18–27. https://doi.org/10.1038/s41551-019-0487-z
[18] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[19] Maithra Raghu, Chiyuan Zhang, Jon M. Kleinberg, and Samy Bengio. 2019. Transfusion: Understanding Transfer Learning with Applications to Medical Imaging. CoRR abs/1902.07208 (2019). arXiv:1902.07208 http://arxiv.org/abs/1902.07208
[20] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Yi Ding, Aarti Bagul, Curtis Langlotz, Katie S. Shpanskaya, Matthew P. Lungren, and Andrew Y. Ng. 2017. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. CoRR abs/1711.05225 (2017). arXiv:1711.05225 http://arxiv.org/abs/1711.05225
[21] P. Rajpurkar, Anirudh Joshi, Anuj Pareek, P. Chen, A. Kiani, Jeremy A. Irvin, A. Ng, and M. Lungren. 2020. CheXpedition: Investigating Generalization Challenges for Translation of Chest X-Ray Algorithms to the Clinical Setting. ArXiv abs/2002.11379 (2020).
[22] Youngmin Ro and Jin Young Choi. 2020. Layer-wise Pruning and Auto-tuning of Layer-wise Learning Rates in Fine-tuning of Deep Networks. arXiv:2002.06048 [cs.CV]
[23] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. 2016. Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization. CoRR abs/1610.02391 (2016). arXiv:1610.02391 http://arxiv.org/abs/1610.02391
[24] Hari Sowrirajan, Jingbo Yang, Andrew Y. Ng, and Pranav Rajpurkar. 2020. MoCo Pretraining Improves Representation and Transferability of Chest X-ray Models. arXiv:2010.05352 [cs.CV]
[25] Suraj Srinivas and R. Venkatesh Babu. 2015. Data-free parameter pruning for Deep Neural Networks. CoRR abs/1507.06149 (2015). arXiv:1507.06149 http://arxiv.org/abs/1507.06149
[26] Anuroop Sriram, Matthew Muckley, Koustuv Sinha, Farah Shamout, Joelle Pineau, Krzysztof J. Geras, Lea Azour, Yindalon Aphinyanaphongs, Nafissa Yakubova, and William Moore. 2021. COVID-19 Deterioration Prediction via Self-Supervised Representation Learning and Multi-Image Prediction. arXiv:2101.04909 [cs.CV]
[27] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. 2017. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. CoRR abs/1705.02315 (2017). arXiv:1705.02315 http://arxiv.org/abs/1705.02315
[28] Ross Wightman. 2020. timm 0.2.1. https://pypi.org/project/timm/.
[29] Li Zhang, Mengya Yuan, Zhen An, Xiangmei Zhao, Hui Wu, Haibin Li, Ya Wang, Beibei Sun, Huijun Li, Shibin Ding, Xiang Zeng, Ling Chao, Pan Li, and Weidong Wu. 2020. Prediction of hypertension, hyperglycemia and dyslipidemia from retinal fundus photographs via deep learning: A cross-sectional study of chronic diseases in central China. PLOS ONE 15, 5 (05 2020), 1–11. https://doi.org/10.1371/journal.pone.0233166