Pediatric Bone Age Assessment Using Deep Convolutional Neural Networks

Vladimir Iglovikov¹, Alexander Rakhlin², Alexandr Kalinin³, and Alexey Shvets⁴

¹ Lyft Inc., San Francisco, CA 94107, USA
iglovikov@gmail.com
² National Research University of Electronic Technology, Zelenograd, Moscow, Russia
rakhlin@gmx.net
³ University of Michigan, Ann Arbor, MI 48109, USA
akalinin@umich.edu
⁴ Massachusetts Institute of Technology, Cambridge, MA 02142, USA
shvets@mit.edu
Abstract. Skeletal bone age assessment is a common clinical practice to diagnose endocrine and metabolic disorders in child development. In this paper, we describe a fully automated deep learning approach to the problem of bone age assessment using data from the 2017 Pediatric Bone Age Challenge organized by the Radiological Society of North America. The dataset for this competition consists of 12.6k radiological images. Each radiograph in this dataset is an image of a left hand labeled with the bone age and the sex of the patient. Our approach utilizes several deep neural network architectures trained end-to-end. We use images of whole hands as well as specific parts of a hand for both training and inference. This approach allows us to measure the importance of specific hand bones for automated bone age analysis. We further evaluate the performance of the method in the context of skeletal development stages. Our approach outperforms other common methods for bone age assessment.
Keywords: Medical Imaging, Computer-aided diagnosis (CAD), Computer Vision, Image Recognition, Deep Learning
1 Introduction
During the development of an organism, the bones of the skeleton change in size and shape, and thus a difference between a child's assigned bone age and chronological age might indicate a growth problem. Clinicians use bone age assessment in order to estimate the maturity of a child's skeletal system. Bone age assessment methods usually start with taking a single X-ray image of the left hand from the wrist to fingertips, see Fig. 1. It is a safe and painless procedure that uses a small amount of radiation. The bones on the X-ray image are compared with radiographs in a standardized atlas of bone development. Such bone age atlases are based on large numbers of radiographs collected from children of the same sex and age.
Over the past decades, the bone age assessment procedure has been performed manually using either the Greulich and Pyle (GP) [6] or Tanner-Whitehouse (TW) [23] methods. The GP procedure determines the bone age by comparing the patient's radiograph with an atlas of representative ages. The TW technique is based on a scoring system that examines 20 specific bones. In both cases, the bone assessment procedure requires considerable time. Only recently have software solutions, such as BoneXpert [24], been developed and approved for clinical use in Europe. BoneXpert uses the Active Appearance Model (AAM) [4], a computer vision algorithm that reconstructs the contours of 15 bones of a hand. The system then determines the overall bone age according to their shape, texture, and intensity, based on the GP or TW techniques. The accuracy of this system is around . years (. months). However, it is sensitive to image quality and does not utilize the carpal bones, despite their importance for skeletal maturity assessment in infants and toddlers.
Fig. 1. Bones of a human hand and wrist (adapted from the Human Anatomy library [8]).

Fig. 2. Bone age distribution for females and males in the training dataset.

Recent advances in deep learning and its applications to computer vision have allowed many researchers to drastically improve the results obtained with image processing systems, particularly in medical image analysis [2]. Deep learning based approaches are gaining attention because in a number of cases they were shown to achieve and even surpass human-level performance, making end-to-end image processing automated and sufficiently fast. In the domain of medical imaging, convolutional neural networks (CNNs) have been successfully used for diabetic retinopathy screening [16], heart disease diagnostics [12], lung cancer detection [1], and other applications [2]. In the case of bone age assessment, the manually performed procedure requires around  minutes of a doctor's time per patient. When the same procedure is done using software based on classical computer vision methods, it takes - minutes, but still requires substantial supervision and expertise of a doctor. Deep learning based methods make it possible to avoid feature engineering by automatically learning the hierarchy of discriminative features directly from a set of labeled examples. Using a deep learning approach, processing of one image typically takes less than  sec, while the accuracy of these methods in many cases exceeds that of conventional methods. Deep neural network based solutions for bone age assessment from hand radiographs have been suggested before [13, 14, 20]. However, these studies did not numerically evaluate the performance of their models on different hand bones. In addition, we find that the performance of deep learning models for bone age assessment can be further improved with better preprocessing and by training networks on radiographs from scratch instead of fine-tuning from the natural image domain.
In this paper, we present a novel deep learning based method for bone age assessment. We validate the performance of this method using data from the 2017 Pediatric Bone Age Challenge organized by the Radiological Society of North America (RSNA) [18]. This data set is now freely available and can be accessed at [21]. In our approach, we first preprocess radiographs by segmenting the hand, normalizing contrast, detecting key points, and using them to register the segmented hand images. Then, we train several deep network architectures using different parts of the images to evaluate how various bones contribute to the models' performance across four major skeletal development stages. Finally, we compare predictions across the different models and evaluate the overall performance of our approach. We demonstrate that the suggested method is more robust and shows superior performance compared to other commonly used solutions.
2 Preprocessing
The goal of the first step in the preprocessing pipeline is to extract a region of interest (a hand mask) from the image and remove all extraneous objects. Images were collected at different hospitals, and simple background removal methods did not produce satisfactory results. Thus, there is a compelling need for a reliable hand segmentation technique. However, this type of algorithm typically requires a large manually labeled training set. To alleviate labeling costs, we employ a technique called positive mining. In general, positive mining is an iterative procedure in which manual labeling is combined with automatic processing. It allows us to quickly obtain accurate masks for all images in the training set.
Overall, our preprocessing method includes binary image segmentation as a first step, followed by the analysis of connected components for the post-processing of segmentation results. For the image segmentation, we use the U-Net deep network architecture originally proposed in [17]. U-Net is capable of learning from a relatively small training set, which makes it a good architecture to combine with positive mining. In general, the U-Net architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. The contracting path follows the typical architecture of a convolutional network, with convolution and pooling operations and progressively downsampled feature maps. Every step in the expansive path consists of an upsampling of the feature map followed by a convolution. Hence, the expansive branch increases the resolution of the output. In order to localize, upsampled features in the expansive path are combined with the high resolution features from the contracting path via skip-connections []. This architecture proved itself very useful for segmentation problems with limited amounts of data, e.g. see [9]. We also employ the batch normalization technique to improve convergence during training [10].
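To make this architecture concrete, below is a minimal U-Net-style network in PyTorch. The framework choice, channel widths, and depth are our assumptions for illustration; the paper does not specify its exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # two 3x3 convolutions, each followed by batch normalization and ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down1 = conv_block(1, 32)           # contracting path
        self.down2 = conv_block(32, 64)
        self.down3 = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)   # expansive path
        self.dec2 = conv_block(64 + 64, 64)      # upsampled features + skip features
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(32 + 32, 32)
        self.head = nn.Conv2d(32, 1, 1)          # per-pixel logit of the hand mask

    def forward(self, x):
        d1 = self.down1(x)
        d2 = self.down2(self.pool(d1))
        d3 = self.down3(self.pool(d2))
        u2 = self.dec2(torch.cat([self.up2(d3), d2], dim=1))  # skip-connection
        u1 = self.dec1(torch.cat([self.up1(u2), d1], dim=1))
        return self.head(u1)
```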
In our algorithms, we use the generalized loss function

$$L = H - \log J, \qquad (1)$$

where $H$ is the binary cross entropy defined as

$$H = -\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\bigr), \qquad (2)$$

where $y_i$ is the binary value of the corresponding pixel $i$ and $\hat{y}_i$ is the predicted probability for that pixel. In the second term of Eq. (1), $J$ is a differentiable generalization of the Jaccard index:

$$J = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i \hat{y}_i}{y_i + \hat{y}_i - y_i \hat{y}_i}. \qquad (3)$$

For more details see [9].
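For reference, here is a minimal sketch of this loss in PyTorch (an assumed framework). The `eps` stabilizer is our addition to guard against division by zero and log(0); it is not part of Eqs. (1)-(3).

```python
import torch
import torch.nn.functional as F

def bce_log_jaccard_loss(logits, targets, eps=1e-7):
    """Eq. (1): L = H - log J, for raw logits and float binary target masks."""
    probs = torch.sigmoid(logits)
    h = F.binary_cross_entropy_with_logits(logits, targets)        # Eq. (2)
    # Eq. (3): per-pixel soft Jaccard term, averaged over all n pixels
    j = ((probs * targets) / (probs + targets - probs * targets + eps)).mean()
    return h - torch.log(j + eps)
```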
In this work, we first manually label  masks using the online annotation service Supervisely [22], which takes approximately  min per image. These masks are used to train the U-Net model, which is then used to perform hand segmentation on the rest of the training set. Since each image is supposed to contain only one hand, for each prediction we remove all connected components except the largest one, which is kept and further subjected to a standard hole filling protocol. This procedure allows us to predict hand masks for the unlabeled training set; however, visual inspection reveals that the quality of mask predictions is inconsistent and requires additional improvements. Thus, we visually inspect all predicted masks and keep only those of acceptable quality, while discarding the rest. This manual process allows us to curate approximately - images per second. Expanding the initial training set with the additional good quality masks increases the number of labeled images for the segmentation procedure and improves segmentation results. To achieve an acceptable quality on the whole training set, we repeat this procedure  times. Finally, we manually label approximately  of the corner cases that U-Net is not able to capture well. The whole iterative procedure is schematically shown in Fig. 3.

Fig. 3. Iterative procedure of positive mining utilizing the U-Net architecture for image segmentation: (A) raw input data; (B) mask manually labeled with the online annotation tool Supervisely [22]; (C) new data; (D) raw prediction; (E) post-processed prediction; (F) raw image with mask plotted together for visual inspection.
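A sketch of the connected-component post-processing described above, using SciPy; the 0.5 binarization threshold is an assumption on our part.

```python
import numpy as np
from scipy import ndimage

def postprocess_mask(prob_map, threshold=0.5):
    """Keep only the largest connected component of a raw prediction, then fill holes."""
    mask = prob_map > threshold
    labels, n_components = ndimage.label(mask)
    if n_components == 0:
        return mask                                    # nothing was segmented
    sizes = ndimage.sum(mask, labels, range(1, n_components + 1))
    largest = labels == (int(np.argmax(sizes)) + 1)    # largest component only
    return ndimage.binary_fill_holes(largest)          # standard hole filling
```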
The original GP and TW methods focus on specific hand bones, including the phalanges, metacarpal and carpal bones, see Fig. 1. We therefore choose to train separate models on several specific regions in high resolution and subsequently evaluate their performance. To correctly locate these regions, it is necessary to transform all the images to the same size and position, i.e., to register them in one coordinate space. Hence, our model comprises two sub-models: image registration and bone age assessment of a specific region.
3 Key points detection
One of our goals is to evaluate the importance of specific regions of a hand for automated bone age assessment. This opens a remarkable opportunity of running a model on smaller image crops with higher resolution, which might result in reduced processing time and higher accuracy. To crop a specific region, we have to register the hand radiographs, or in other words, align them into a common coordinate space. To this end, we first detect the coordinates of several specific key points of a hand. Then, we calculate affine transformation parameters (zoom, rotation, translation, and mirror) to fit the image into the desired position (Fig. 4).

Three characteristic points on the image are chosen: the tip of the distal phalanx of the third finger, the tip of the distal phalanx of the thumb, and the center of the capitate. All images are re-scaled to the same resolution, 2080 × 1600 pixels, and padded with zeros when necessary. To create a training set for the key points model, we manually label  radiographs. Pixel coordinates of the key points serve as training targets for our regression model. The registration procedure is shown in Fig. 4.
The key points model is implemented as a deep convolutional neural network, inspired by the popular VGG family of models [19], with a regression output. The network architecture is schematically shown in Fig. 5. The VGG module consists of two convolutional layers with the Exponential Linear Unit (ELU) activation function [3], batch normalization [10], and max pooling. The input image is passed through a stack of three VGG blocks followed by three fully connected layers. The VGG blocks consist of , ,  convolution layers, respectively. For better generalization, dropout units are applied in between. The model is trained with the Mean Squared Error (MSE) loss function using the Adam optimizer [11]:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2. \qquad (4)$$
To reduce computational costs, we downscale the input images to  ×  pixels. At the same time, the target coordinates for the key points are re-scaled from $[0, 2079] \times [0, 1599]$ to the uniform square $[-1, 1] \times [-1, 1]$. At the inference stage, after the model detects the key points, we project their coordinates back to the original image size, i.e., 2080 × 1600 pixels. To improve the generalization of our model, we apply input augmentations: rotation, translation, and zoom. The model output consists of 6 coordinates, 2 for every key point.

Fig. 4. Image registration. (Left) Key points: the tip of the middle finger (the yellow dot), the center of the capitate (the red dot), the tip of the thumb (the blue dot). Registration positions for the tip of the middle finger and for the center of the capitate are marked with white dots. (Right) Registered image after the key points are found and the affine transformation and scaling are applied.

Fig. 5. VGG-style neural network architectures for the regression (top) and classification (bottom) tasks.
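A sketch of such a key points model in PyTorch. The filter counts, fully connected width, and dropout rate are illustrative assumptions; the output is six numbers, an (x, y) pair per key point scaled to [-1, 1].

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch):
    # two (conv -> ELU -> batch norm) layers followed by max pooling
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ELU(inplace=True), nn.BatchNorm2d(out_ch),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ELU(inplace=True), nn.BatchNorm2d(out_ch),
        nn.MaxPool2d(2),
    )

class KeyPointNet(nn.Module):
    def __init__(self):
        super().__init__()
        # three VGG blocks; 32/64/128 filters are assumed, not the paper's exact values
        self.features = nn.Sequential(vgg_block(1, 32), vgg_block(32, 64), vgg_block(64, 128))
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ELU(inplace=True), nn.Dropout(0.5),
            nn.Linear(512, 512), nn.ELU(inplace=True), nn.Dropout(0.5),
            nn.Linear(512, 6),   # (x, y) for each of the three key points
        )

    def forward(self, x):
        return self.regressor(self.features(x))

# training step with the MSE objective of Eq. (4):
# loss = torch.nn.functional.mse_loss(model(images), target_coords)
```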
At the next step, we calculate the affine transformations (zoom, rotation, translation) for all radiographs. Our goal is to preserve the proportions of an image and to fit it into a uniform position such that for every image: 1) the tip of the middle finger is aligned horizontally and positioned approximately  pixels below the top edge of the image; 2) the capitate (see Fig. 1) is aligned horizontally and positioned approximately  pixels above the bottom edge of the image. By convention, bone age assessment uses radiographs of the left hand, but sometimes the images in the dataset are mirrored. To detect such images and adjust them appropriately, the key point for the thumb is used. The results of the segmentation, normalization, and registration are shown in the fourth row of Fig. 6.
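The two registration anchors (finger tip and capitate) fully determine a similarity transform, i.e., an affine map restricted to zoom, rotation, and translation. Below is a NumPy/OpenCV sketch; the target positions and the `key_points` variable are placeholders, not the paper's exact values.

```python
import numpy as np
import cv2

def similarity_from_two_points(src, dst):
    """Solve for the 2x3 matrix [s*R | t] such that s*R @ src_i + t = dst_i."""
    v_src, v_dst = src[1] - src[0], dst[1] - dst[0]
    scale = np.linalg.norm(v_dst) / np.linalg.norm(v_src)
    angle = np.arctan2(v_dst[1], v_dst[0]) - np.arctan2(v_src[1], v_src[0])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    t = dst[0] - rot @ src[0]
    return np.hstack([rot, t[:, None]])

# key points predicted by the model, in (x, y) pixel coordinates (assumed variable)
finger_tip, thumb_tip, capitate = key_points
src = np.array([finger_tip, capitate], dtype=np.float64)
dst = np.array([[800.0, 100.0], [800.0, 1600.0]])      # assumed target positions
M = similarity_from_two_points(src, dst)
registered = cv2.warpAffine(image, M, (1600, 2080))    # dsize is width x height
# The transformed thumb key point can then be compared against the expected
# side of the hand to detect and flip mirrored radiographs.
```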
4 Bone age assessment model

Although CNNs are more commonly used in classification tasks, bone age assessment is a regression task by nature. In order to assess performance in both settings, we compare two types of CNNs: regression and classification. Both models share similar architectures and training protocols, and differ only in the two final layers.
Fig. 6. Preprocessing pipeline: (first row) original images; (second row) binary hand masks that are applied to the original images to remove background; (third row) masked and normalized images; (bottom row) registered images.
4.1 Regression model
Our first model is a VGG-style CNN [19] with a regression output. This network is a stack of six convolutional blocks with , , , , ,  filters followed by two fully connected layers of  neurons each and a single output (see Fig. 5). The input size varies depending on the considered region of an image (Fig. 7). For better generalization, we apply dropout layers before the fully connected layers. For regression targets, we scale the bone age to the range $[-1, 1]$. The network is trained by minimizing the Mean Absolute Error (MAE)

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|\hat{y}_i - y_i| \qquad (5)$$

with the Adam optimizer. We begin training with the learning rate $10^{-3}$ and then progressively lower it to $10^{-5}$. Due to the limited data set size, we use train-time augmentation with zoom, rotation, and shift to avoid overfitting.
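A minimal training sketch in PyTorch under the stated protocol; `model` and `train_loader` (which applies the zoom/rotation/shift augmentations) are assumed to exist, and the step schedule and epoch count are our choices rather than the paper's.

```python
import torch
import torch.nn.functional as F

AGE_MAX = 239.0                                    # bone age in months

def scale_age(months):
    return 2.0 * months / AGE_MAX - 1.0            # map [0, 239] to [-1, 1]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# lower the learning rate from 1e-3 toward 1e-5 over the course of training
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    for images, ages in train_loader:              # ages as a float tensor
        optimizer.zero_grad()
        pred = model(images).squeeze(1)
        loss = F.l1_loss(pred, scale_age(ages))    # MAE objective, Eq. (5)
        loss.backward()
        optimizer.step()
    scheduler.step()
```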
4.2 Classification model

The classification model (Fig. 5) is similar to the regression one, except for the two final layers. First, we assign each bone age a class. Since bone ages are expressed in months, we assume 240 classes overall. The second-to-last layer is a softmax layer with 240 outputs, which produces a vector of probabilities for the 240 classes. The probability of a class takes a real value in the range $[0, 1]$. In the final layer, the softmax output is multiplied by a vector of distinct bone ages uniformly distributed over the 240 integer values $[0, 1, \ldots, 238, 239]$. Thereby, the model outputs a single value that corresponds to the expectation of the bone age. We train this model using the same protocol as the regression model.
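A sketch of this final layer pair in PyTorch, with 240 classes covering bone ages of 0-239 months:

```python
import torch
import torch.nn as nn

class ExpectedAgeHead(nn.Module):
    def __init__(self, in_features, n_classes=240):
        super().__init__()
        self.logits = nn.Linear(in_features, n_classes)
        # fixed, non-trainable vector of distinct bone ages in months
        self.register_buffer("ages", torch.arange(n_classes, dtype=torch.float32))

    def forward(self, x):
        probs = torch.softmax(self.logits(x), dim=1)   # class probabilities in [0, 1]
        return (probs * self.ages).sum(dim=1)          # expectation of the bone age
```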
4.3 Region-specific modelling

In accordance with the targeted features of skeletal development stages described in [,,], we crop three specific regions from the registered radiographs (2080 × 1600 pixels), as shown in Fig. 7:

1. whole hand (2000 × 1500 pixels)
2. carpal bones (750 × 750 pixels)
3. metacarpals and proximal phalanges (600 × 1400 pixels)
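A sketch of the cropping step; the paper fixes only the crop sizes, so the offsets below are assumptions chosen to roughly center each region in the registered 2080 × 1600 frame.

```python
import numpy as np

def crop_regions(img: np.ndarray):
    """img: registered radiograph of shape (2080, 1600), rows x cols."""
    whole_hand = img[40:2040, 50:1550]       # (A) 2000 x 1500
    carpal     = img[1225:1975, 425:1175]    # (B) 750 x 750, around the capitate
    metacarpal = img[650:1250, 100:1500]     # (C) 600 x 1400 band across the hand
    return whole_hand, carpal, metacarpal
```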
4.4 Experiment setup
We split the labeled radiographs into two sets, preserving the sex ratio. The training set contains , images, and the validation set contains , images. We create several models with a breakdown by:

1. type (regression, classification)
2. sex (males, females, mixed)
3. region (A, B, C)

Given these conditions, we produce 18 basic models ($2 \times 3 \times 3$). Furthermore, we construct several meta-models as a linear average of the regional models and, finally, an average of the different models.
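The meta-models reduce to simple averaging of predictions; a one-line sketch assuming each entry of `predictions` is an array of per-image bone age estimates from one regional model:

```python
import numpy as np

predictions = {"A": preds_whole_hand, "B": preds_carpal, "C": preds_metacarpal}
ensemble = np.mean(np.stack(list(predictions.values()), axis=0), axis=0)
```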
5 Results
The performance of all models is evaluated on the validation data set, as presented in Fig. 8. The leftmost column represents the performance of a regression model for both sexes. For carpal bones (region B) this model has the lowest accuracy, with an MAE of . months. The region of metacarpals and proximal phalanges (region C) has higher accuracy, with an MAE equal to . months. The MAE of the whole image (region A) is . months. The linear ensemble of the three regional models outperforms all of the above models, with an MAE of . months (bottom row). This regional pattern, MAE(B) > MAE(C) > MAE(A) > MAE(ensemble), is further observed for other model types and patient cohorts, with few exceptions. Separate regression models for the male and female cohorts (second and third columns) demonstrate higher accuracy than those trained on a mixed population. The ensemble of regional models has an MAE equal to . months for males and . months for females (bottom row). For males, the top performing region is the whole hand (A), with an MAE equal to . months. In contrast, for the female cohort the region of metacarpals and proximal phalanges (C) has an MAE equal to . months, and this result is the most accurate across the three regions. Classification models (fourth and fifth columns) perform slightly better than regression networks. The ensemble of the regional models has an MAE equal to . months for males and . months for females (bottom row). For the male cohort, the whole hand region (A) has the highest accuracy, with an MAE equal to . months. For the female cohort, the result produced using the metacarpals and proximal phalanges region (C) is on par with that obtained using the whole hand, with an MAE equal to . months for both of them.

Fig. 7. A registered radiograph with three specific regions: (A) the whole hand; (B) carpal bones; (C) metacarpals and proximal phalanges.
In the last column, we analyze the ensemble of the classification and regression models. As shown, the MAEs of the regional models follow the overall pattern MAE(B) > MAE(C) > MAE(A). The ensemble of the regional models (bottom row) has the best accuracy, with an MAE equal to . months. This result clearly outperforms the state of the art, with an MAE equal to . months for the BoneXpert software [24] and .-. months for the recent applications of deep neural networks [].
Next, we evaluate our method in the context of the age distribution. Following [5, 14], we consider four major skeletal development stages: pre-puberty, early-and-mid puberty, late puberty, and post-puberty. The infant and toddler categories were excluded due to the scarcity of the data: the development data set contained only  radiographs for bone ages of less than  months, and only  of them were present in our validation subset. The model accuracies for the four skeletal development stages, for different regions and sexes, are depicted in Fig. 9. Note that the skeletal development stages are different for males and females. Based on these data, we report two important findings.

Fig. 8. Mean absolute errors on the validation data set for regression and classification models for different bones and sexes. Colors correspond to different regions. Table: regions are shown in rows, models in columns. There is a total of 15 individual models and 9 ensembles.

First, unlike Lee et al. [14], we do not observe better results when training on carpal bones compared to other areas. With two exceptions, the metacarpals and proximal phalanges provide better accuracy than the carpal bones do. These exceptions are pre-puberty for the male cohort and post-puberty for the female cohort, where the accuracy of the carpal bones is higher. However, the validation dataset sizes for these two skeletal development stages ( for males in pre-puberty and  for females in post-puberty) are too small to draw statistically significant conclusions. At the same time, we note that Gilsanz and Ratib [5] proposed carpal bones as the best predictor of skeletal maturity only in infants and toddlers. Thereafter, we find no sound evidence to support the suggestion that the carpal bones can be considered the best predictor in pre-puberty, see [14].
The second interesting finding is the influence of the dataset on the accuracy of a model. For both sexes, the accuracy peaks at late puberty, the most frequent age in the data set. This dependency is particularly evident in the male cohort, where the accuracy essentially mirrors the data distribution (see Fig. 2). This finding is very important, as it suggests a straightforward way for future improvement of the method.

Fig. 9. Mean absolute error in months as a function of the skeletal development stage for different sexes. Different colors on the plot correspond to different regions of a radiograph. For males and females, the development stages are labelled at the bottom of each plot.
6 Conclusion
In this study, we investigate the application of deep convolutional neural networks to the problem of automatic bone age assessment. An automatic bone age assessment system based on our approach can estimate skeletal maturity with an accuracy similar to that of an expert radiologist, and it surpasses existing automated models, e.g. see [13]. In addition, we numerically evaluate the contribution of different zones of a hand to bone age assessment. We find that bone age assessment could be done using only the carpal bones or only the metacarpals and proximal phalanges, with around a -% increase in error compared to the whole hand assessment. Therefore, we can establish the bone age using just a part of the radiograph of sufficiently high quality, lowering computational overheads.
Despite the challenging quality of the radiographs, our approach succeeds in image preprocessing, cleaning, and standardization. These transformations, in turn, greatly help in improving the robustness and performance of the deep learning models. Moreover, the accuracy of our approach can be improved even further. First, our solution could easily be combined with other, more complex network architectures, such as ResNet [7]. Another way to improve it is to substantially extend the training data set with additional examples. Furthermore, the bone ages that serve as labels for training may also be refined based on the work of independent experts. Implementing these simple steps could potentially lead to the development of a state-of-the-art bone age assessment software system. It would have the potential for deployment in the clinical environment, in order to help doctors make a final bone age assessment decision accurately and in real time, with just one click. Moreover, a cloud-based system could be deployed for this problem to process radiographs independently of their origin. This could potentially help thousands of doctors obtain a qualified evaluation even in hard-to-reach areas.
To conclude, in the final stage of the RSNA Pediatric Bone Age Assessment challenge, our solution was evaluated by the organizers using the test set. This data set consisted of  radiographs equally divided between the sexes. All labels were hidden from participants. Based on the organizers' report, our method achieved an MAE equal to . months, which is better than its performance on the data withheld from the training set. The explanation for such an improvement may lie in the more accurate labelling of the test set or the better quality of its radiographs: each radiograph in the test data set was evaluated and cross-checked by three experts independently, compared to the single expert who labelled the training radiographs. This again demonstrates the importance of input from domain experts.
Acknowledgments. The authors would like to thank the Open Data Science community [15] for many valuable discussions and educational help in the growing field of machine/deep learning.
References

1. 2017 Data Science Bowl, Predicting Lung Cancer: 2nd place solution write-up, Daniel Hammack and Julian de Wit. http://blog.kaggle.com/2017/06/29/2017-data-science-bowl-predicting-lung-cancer-2nd-place-solution-write-up-daniel-hammack-and-julian-de-wit/ (2017), online; accessed December 12, 2017
2. Ching, T., Himmelstein, D.S., Beaulieu-Jones, B.K., Kalinin, A.A., Do, B.T., Way, G.P., Ferrero, E., Agapow, P.M., Xie, W., Rosen, G.L., et al.: Opportunities and obstacles for deep learning in biology and medicine. bioRxiv p. 142760 (2017)
3. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)
4. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Training models of shape from sets of examples. In: BMVC92, pp. 9-18. Springer (1992)
5. Gilsanz, V., Ratib, O.: Hand bone age: a digital atlas of skeletal maturity. Springer Science & Business Media (2005)
6. Greulich, W.W., Pyle, S.I.: Radiographic atlas of skeletal development of the hand and wrist. The American Journal of the Medical Sciences 238(3), 393 (1959)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778 (2016)
8. Human Anatomy Library: Anatomy of the left hand. http://humananatomylibrary.com/anatomy-of-the-left-hand/anatomy-of-the-left-hand-3d-human-anatomy-model-human-anatomy-library/ (2016), online; accessed December 12, 2017
9. Iglovikov, V., Mushinskiy, S., Osin, V.: Satellite imagery feature detection using deep convolutional neural network: A Kaggle competition. arXiv preprint arXiv:1706.06169 (2017)
10. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448-456 (2015)
11. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
12. Korshunova, I.: Diagnosing heart diseases with deep neural networks. https://irakorshunova.github.io/2016/03/15/heart.html (2016), online; accessed December 12, 2017
13. Larson, D.B., Chen, M.C., Lungren, M.P., Halabi, S.S., Stence, N.V., Langlotz, C.P.: Performance of a deep-learning neural network model in assessing skeletal maturity on pediatric hand radiographs. Radiology p. 170236 (2017)
14. Lee, H., Tajmir, S., Lee, J., Zissen, M., Yeshiwas, B.A., Alkasab, T.K., Choy, G., Do, S.: Fully automated deep learning system for bone age assessment. Journal of Digital Imaging pp. 1-15 (2017)
15. Open Data Science (ODS). https://ods.ai, online; accessed December 12, 2017
16. Rakhlin, A.: Diabetic retinopathy detection through integration of deep learning classification framework. bioRxiv p. 225508 (2017)
17. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234-241. Springer (2015)
18. RSNA Pediatric Bone Age Challenge. http://rsnachallenges.cloudapp.net/competitions/4 (2017), online; accessed December 12, 2017
19. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
20. Spampinato, C., Palazzo, S., Giordano, D., Aldinucci, M., Leonardi, R.: Deep learning for automated skeletal bone age assessment in X-ray images. Medical Image Analysis 36, 41-51 (2017)
21. Stanford University Artificial Intelligence in Medicine & Imaging: Bone age images used in the 2017 RSNA bone age challenge competition. https://aimi.stanford.edu/available-labeled-medical-datasets (2017), online; accessed December 12, 2017
22. Supervisely. https://supervise.ly/, online; accessed December 12, 2017
23. Tanner, J., Whitehouse, R., Cameron, N., Marshall, W., Healy, M., Goldstein, H.: Assessment of skeletal maturity and prediction of adult height (TW2 method). Academic Press, London (1983)
24. Thodberg, H.H., Kreiborg, S., Juul, A., Pedersen, K.D.: The BoneXpert method for automated determination of skeletal maturity. IEEE Transactions on Medical Imaging 28(1), 52-66 (2009)