
EchoGNN: Explainable Ejection Fraction Estimation with Graph Neural Networks


Masoud Mokhtari1 [0000-0001-9471-5573], Teresa Tsang2, Purang Abolmaesumi1,*, and Renjie Liao1,*
1 Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada
{masoud, purang, rjliao}
2 Vancouver General Hospital, Vancouver, BC, Canada
Abstract. Ejection fraction (EF) is a key indicator of cardiac function, allowing identification of patients prone to heart dysfunctions such as heart failure. EF is estimated from cardiac ultrasound videos known as echocardiograms (echo) by manually tracing the left ventricle and estimating its volume on certain frames. These estimations exhibit high inter-observer variability due to the manual process and varying video quality. Such sources of inaccuracy and the need for rapid assessment necessitate reliable and explainable machine learning techniques. In this work, we introduce EchoGNN, a model based on graph neural networks (GNNs) to estimate EF from echo videos. Our model first infers a latent echo-graph from the frames of one or multiple echo cine series. It then estimates weights over nodes and edges of this graph, indicating the importance of individual frames that aid EF estimation. A GNN regressor uses this weighted graph to predict EF. We show, qualitatively and quantitatively, that the learned graph weights provide explainability through identification of critical frames for EF estimation, which can be used to determine when human intervention is required. On the EchoNet-Dynamic public EF dataset, EchoGNN achieves EF prediction performance that is on par with the state of the art and provides explainability, which is crucial given the high inter-observer variability inherent in this task. Our source code is publicly available at:
Keywords: Ultrasound · Ejection Fraction · Cardiac Imaging · Explainable Models · Graph Neural Networks · Deep Learning
* Co-corresponding authors
arXiv:2208.14003v1 [eess.IV] 30 Aug 2022
1 Introduction
Ejection fraction (EF) is a ratio indicating the fraction of blood in the left ventricle that the heart pumps out with each beat. This measurement is crucial in monitoring cardiovascular health and is a potential indicator of heart failure [9,17]. EF is computed from the stroke volume, which is the difference between the blood volumes in the Left Ventricle (LV) at the End-Diastolic (ED) and End-Systolic (ES) phases of the cardiac cycle, denoted by EDV and ESV, respectively [2]. These volumes are estimated from ultrasound videos of the heart, i.e. echocardiograms (echo), which involves detecting the frames corresponding to ES and ED and tracing the LV region. The manual process of detecting the correct frames and making proper traces is prone to human error. Therefore, the American Society of Echocardiography recommends performing EF estimation for up to 5 cardiac cycles and averaging the results [16]. However, this guideline is seldom followed in practice, and a single representative beat is selected for evaluation instead. This results in inter-observer variations from 7.6% to 13.9% in the EF ratio [18].
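As a concrete reading of the definition above, the following minimal sketch computes EF from a pair of volumes using the standard clinical relation EF = 100 × (EDV − ESV) / EDV, and averages it over several beats as the guideline suggests. The function names and the millilitre units are illustrative assumptions, not part of the original work.

```python
from typing import Iterable, Tuple


def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Return EF in percent: 100 * (EDV - ESV) / EDV."""
    if edv_ml <= 0:
        raise ValueError("EDV must be positive")
    if not 0 <= esv_ml <= edv_ml:
        raise ValueError("ESV must lie between 0 and EDV")
    stroke_volume = edv_ml - esv_ml  # blood ejected during one beat
    return 100.0 * stroke_volume / edv_ml


def mean_ejection_fraction(beats: Iterable[Tuple[float, float]]) -> float:
    """Average EF over several (EDV, ESV) pairs, one pair per cardiac cycle."""
    values = [ejection_fraction(edv, esv) for edv, esv in beats]
    if not values:
        raise ValueError("at least one beat is required")
    return sum(values) / len(values)
```

For example, an EDV of 120 ml and an ESV of 50 ml give a stroke volume of 70 ml and an EF of roughly 58%.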
Automatic EF estimation techniques aid professionals by adding another layer of verification. Additionally, with the emergence of Point-of-Care Ultrasound (POCUS) imaging devices, which are routinely used by less experienced echo users, automation of clinical measurements such as EF is further needed [1]. However, to be adopted broadly, such automation techniques must be explainable so that cases requiring human intervention can be detected. Different machine learning (ML) architectures have been proposed to perform automatic EF estimation [12,10,18,21,23], most of which lack reliable explainability mechanisms. Some of these models fail to provide the model's confidence in their predictions [18,21,23] or have low accuracy due to unrealistic data augmentation during training and over-reliance on ground truth labels [21].
In this work, we introduce EchoGNN, a novel deep learning model for explainable EF estimation. Our approach first infers a latent graph between frames of one or multiple echo cine series. It then estimates EF based on this latent graph via Graph Neural Networks (GNNs) [22], which are a class of deep learning models that efficiently process graph-structured data. To the best of our knowledge, our work is the first one that investigates GNNs in the context of ultrasound videos and EF estimation. Moreover, our work brings explainability through latent graph learning, inspiring further work in this domain. Our contributions are threefold:
- We introduce EchoGNN, a novel deep learning model for explainable EF estimation through GNN-based latent graph learning.
- We present a weakly-supervised training pipeline for EF estimation without direct reliance on ground truth ES/ED frame labels.
- Our model has far fewer parameters than prior work, significantly reducing computational and memory requirements.
2 Related Work
Most prior works use Convolutional Neural Networks (CNNs) in their EF estimation pipeline [12,10,18,21]. Ouyang et al. [18] use ResNet-based (2+1)D convolutions [24] to estimate and average EF for all possible 32-frame clips in an echo, while Kazemi Esfeh et al. [12] use a similar approach under the Bayesian Neural Networks (BNNs) setting. Recent work uses the encoder of ResNetAE [8] to reduce data dimensionality before using transformers [25] to jointly perform ES/ED frame detection and EF estimation [21]. While these methods show different levels of accuracy and success in predicting EF, they either lack explainability or significantly rely on accurate clinical labels, which are inherently noisy and subject to significant inter-observer variability. As an example, the transformer-based approach requires ES/ED frame index labels in addition to EF labels in its training pipeline [21]. Moreover, while Kazemi Esfeh et al. and Jafari et al. [12,10] report uncertainty on their predictions, they still lack explainable indicators as to why models fail or succeed for different cases. Our proposed framework based on GNNs aims to alleviate these shortcomings. It provides explainability by only relying on EF labels and not requiring ES/ED frame labels in a supervised manner. Lastly, as an added advantage, our model has significantly fewer parameters than prior work, which is highly desirable for deploying such models on mobile clinical devices.
3 Methodology
We consider the following supervised problem for EF estimation: assume for each patient $i \in [N]$ in dataset $D$, there is a ground truth EF ratio $y^i \in \mathbb{R}$, and there are $K$ echo videos $x^i_k \in \mathbb{R}^{T \times H \times W}$, where $k \in [K]$, $T$ is the number of frames, and $H$ and $W$ are the height and width of each frame. The goal of our model is to learn a function $f: \mathbb{R}^{K \times T \times H \times W} \to \mathbb{R}$ to estimate EF from echo videos. For notational simplicity and since our evaluation dataset only contains one video per patient, we assume that $K = 1$. However, it must be noted that our model is flexible in this regard and can handle multiple videos per patient.
3.1 EchoGNN Architecture
As shown in Fig. 1, EchoGNN is composed of three main components: Video
Encoder, Attention Encoder, and Graph Regressor. In the following subsections,
we discuss the details pertaining to each component.
Video Encoder. The original echo videos are high-dimensional and must be mapped into lower-dimensional embeddings to reduce memory footprint and remove redundant information.
The Video Encoder is used to learn a mapping $f_{ve}: \mathbb{R}^{T \times H \times W} \to \mathbb{R}^{T \times d}$ from input echo videos $x^i \in \mathbb{R}^{T \times H \times W}$ to $d$-dimensional embeddings $h^i_j \in \mathbb{R}^d$, where $j \in [T]$ is the frame number. The temporal dimension is preserved because the Attention Encoder requires embeddings for all frames to produce interpretable weights over them. We use a custom network consisting of 3D convolutions and residual connections to use both the spatial and temporal information in the video in generating the embeddings. This network's architecture is provided in the supp. material. Lastly, following [25], periodic positional encodings are added to the generated frame embeddings to encode the sequential nature of video data.
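The text only states that periodic positional encodings following [25] are added to the frame embeddings. A minimal sketch, assuming the standard sinusoidal formulation of [25] with base 10000 and an even embedding dimension (both assumptions, since the exact variant is not specified here), could look as follows. The function names are illustrative.

```python
import math
from typing import List


def sinusoidal_positional_encoding(num_frames: int, dim: int) -> List[List[float]]:
    """Build a (num_frames x dim) table of sine/cosine positional encodings."""
    if dim % 2 != 0:
        raise ValueError("dim must be even for paired sin/cos channels")
    table = []
    for pos in range(num_frames):
        row = []
        for i in range(0, dim, 2):
            # Channel pair (i, i + 1) shares one frequency.
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


def add_positional_encoding(embeddings: List[List[float]]) -> List[List[float]]:
    """Add the encoding for frame j to the j-th frame embedding."""
    pe = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[h + p for h, p in zip(h_row, p_row)]
            for h_row, p_row in zip(embeddings, pe)]
```

Because the encoding depends only on the frame index, two frames with identical content still receive distinct embeddings, which lets later components reason about temporal order.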
Fig. 1. EchoGNN has three main components. (1) Video Encoder: encodes video frames into vector embeddings while preserving the temporal dimension; (2) Attention Encoder: infers weights over the nodes (video frames) and edges (relationships among frames) of the echo-graph; (3) Graph Regressor: estimates EF using the inferred weighted graph; this figure shows an example where each patient has an apical two-chamber (AP2) and an apical four-chamber (AP4) echo video.
Attention Encoder. For each patient, we construct an echo-graph, which is a complete graph where each node corresponds to a frame in the echo video, and the edges show the non-Euclidean relationship between these frames. Formally, we denote the echo-graph with $G_{echo}(V, E)$ where $V$ is the set of nodes corresponding to echo frames such that $|V| = T$, and $E$ is the set of edges between the nodes to show the relationship between video frames such that if $v_1, v_2 \in V$ are connected, then $e_{v_1,v_2} \in E$. We use the frame embeddings from our Video Encoder as node features of $G_{echo}$. That is, $\{h^i_1, h^i_2, ..., h^i_T\}$ are the set of features for $\{v_1, v_2, ..., v_T\}$. These embeddings can be represented as a matrix $H^i \in \mathbb{R}^{T \times d}$ such that each row is the embedding for a frame in the echo video for patient $i$. Inspired by [14], we propose using GNNs to learn and assign weights to both edges and nodes of the echo-graph. The edge and node weights are learned to encode the importance of each frame (node weights) and the relationships among frames (edge weights) for the final EF estimation.
The Attention Encoder infers weights over edges and nodes of the echo-graph using message passing based GNNs [7]. A single message passing step is enough for each node to capture information from all other nodes due to the echo-graph being a complete graph. More specifically, the following operations are used to obtain weights over each edge $e_{v_k,v_s}$:
$$u_{k,s} = \mathrm{MLP}_1([h^i_k \,\|\, h^i_s]) \quad \text{(node} \to \text{edge)} \tag{1}$$
$$v_s = \mathrm{MLP}_2\Big(\sum_{k \neq s} u_{k,s}\Big) \quad \text{(edge} \to \text{node)} \tag{2}$$
$$z_{k,s} = \mathrm{MLP}_3([v_k \,\|\, v_s]) \quad \text{(node} \to \text{edge)} \tag{3}$$
$$a_{k,s} = \sigma(z_{k,s}), \tag{4}$$
where $\sigma$ is the Sigmoid function, $[\cdot \,\|\, \cdot]$ is the concatenation operator, and $a_{k,s} \in [0,1]$ is the inferred weight for the directed edge from $v_k$ to $v_s$. Similarly, weights for each node $w_s \in [0,1]$ are generated by inserting another edge $\to$ node operation after Eq. 3. All MLPs use two fully connected linear layers with ELU [4] activation and batch normalization.
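The data flow of Eqs. 1 to 4 can be sketched in plain Python as below. This is an illustration of the message passing pattern only: the three `mlp` arguments are hypothetical stand-ins passed in as callables, whereas the actual MLPs are learned two-layer networks with ELU and batch normalization. Leaving the diagonal at zero is also an assumption, made because the sum in Eq. 2 excludes $k = s$.

```python
import math
from typing import Callable, List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def vector_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def infer_edge_weights(
    h: List[Vector],
    mlp1: Callable[[Vector], Vector],
    mlp2: Callable[[Vector], Vector],
    mlp3: Callable[[Vector], float],
) -> List[List[float]]:
    """Return a T x T matrix whose entry [k][s] is the edge weight a_{k,s}."""
    num_frames = len(h)
    pairs = [(k, s) for k in range(num_frames)
             for s in range(num_frames) if k != s]
    # Eq. 1 (node -> edge): one message per ordered pair of frames.
    u = {(k, s): mlp1(h[k] + h[s]) for k, s in pairs}  # list "+" concatenates
    # Eq. 2 (edge -> node): aggregate all messages arriving at frame s.
    v = [mlp2(vector_sum([u[(k, s)] for k in range(num_frames) if k != s]))
         for s in range(num_frames)]
    # Eq. 3 and Eq. 4 (node -> edge, then Sigmoid).
    a = [[0.0] * num_frames for _ in range(num_frames)]
    for k, s in pairs:
        a[k][s] = sigmoid(mlp3(v[k] + v[s]))
    return a
```

With identity functions for the first two stand-ins and a mean for the third, the function returns a matrix with zeros on the diagonal and values strictly between 0 and 1 elsewhere.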
Regressor. Our Regressor network uses GNN layers with the learned weighted echo-graph to perform EF estimation. Specifically, for each patient, the output of the Attention Encoder can be represented as a weighted adjacency matrix $A \in [0,1]^{T \times T}$ and a node weight vector $w \in [0,1]^T$. The Regressor uses $A$ to generate embeddings over frames of the echo video:
$$H^l = g^l(A, H^{l-1}), \quad l = 1, ..., L \tag{5}$$
where $H^l \in \mathbb{R}^{T \times d_g}$ is the matrix of learned node embeddings at layer $l$, $H^0$ is the matrix of frame embeddings from the Video Encoder, and $g^l$ is composed of a Graph Convolutional Network (GCN) layer followed by batch normalization and ELU activation [15]. To represent the whole graph with a single vector embedding, the node embeddings are averaged using the frame weights $w$ generated by the Attention Encoder:
$$h^i_{graph} = \frac{\sum_{j=1}^{T} w_j H^l_j}{\sum_{j=1}^{T} w_j}$$
where $H^l_j \in \mathbb{R}^d$ is the $j$th row of $H^l$, and $w_j$ is the $j$th scalar weight in the frame weight vector. $h^i_{graph}$ is mapped into an EF estimate using an MLP with two fully connected linear layers, ELU activation and batch normalization.
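The weighted averaging that turns node embeddings into a single graph embedding can be written out directly. The sketch below covers only this pooling step (not the GCN layers), and the guard against an all-zero weight vector is an added assumption rather than something stated in the text.

```python
from typing import List


def weighted_graph_embedding(
    node_embeddings: List[List[float]], frame_weights: List[float]
) -> List[float]:
    """Weighted mean of the rows of H, using one scalar weight per frame."""
    if len(node_embeddings) != len(frame_weights):
        raise ValueError("need exactly one weight per frame")
    total = sum(frame_weights)
    if total <= 0:
        raise ValueError("frame weights must not sum to zero")
    dim = len(node_embeddings[0])
    return [
        sum(w * row[c] for w, row in zip(frame_weights, node_embeddings)) / total
        for c in range(dim)
    ]
```

A frame with weight 0 has no influence on the graph embedding, which is what makes the frame weights readable as importance scores.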
Learning Algorithm. The model is differentiable in an end-to-end manner. Therefore, we use gradient descent with Mean-Absolute-Error (MAE) between predicted EF estimates $\tilde{y}^i$ and ground truth EF values $y^i \in Y$ as the optimization objective, which is computed as $L = \frac{1}{N} \sum_{i=1}^{N} |\tilde{y}^i - y^i|$.
4 Experiments
4.1 Dataset
We use the EchoNet-Dynamic public EF dataset consisting of 10,030 AP4 echo videos obtained between 2016 and 2018 at Stanford University Hospital. Each echo frame has a dimension of 112 × 112, and the dataset provides ESV, EDV, contour tracings of LV, and EF ratios for each patient [18]. We use the provided splits in the dataset from mutually exclusive patients, including 7465 samples for training, 1288 samples for validation, and 1277 samples for testing. The data distribution in the training set is unbalanced with only 12.7% of samples having EF ratio below 40%. Clinically, however, such patients are most critical to be detected for timely intervention [3,11].
Frame Sampling: To stay within reasonable memory requirements, we use a fixed number of frames per echo denoted by $T_{fixed}$. During training, we uniformly sample an initial frame index $j$ in $[1, T^i_{total} - T_{fixed}]$, where $T^i_{total}$ is the total number of frames in echo video $i$. We then use $T_{fixed}$ samples starting from $j$. Following [18], we set $T_{fixed}$ to 64 and use zero padding in the temporal dimension when $T^i_{total} < T_{fixed}$. During test time, we extract multiple back-to-back clips with each clip containing $T_{fixed}$ frames and the first clip starting from index 0. We use zero padding in the temporal dimension if $T^i_{total} < T_{fixed}$ and overlap the last clip with the previous one if the last clip overshoots $T^i_{total}$. We again set $T_{fixed}$ to 64, independently estimate EF for each clip, and report the average prediction.
Data Augmentation: Occasionally, AP4 echo is zoomed in on the LV region for certain clinical studies [5,20]. To allow learning of this under-represented distribution, we augment our training set by using a fixed cropping window of 90 × 72 centered at the top of each frame and interpolating the result to achieve the original 112 × 112 dimension, which creates the desired zoom-in effect.
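A rough sketch of this zoom-in augmentation on a single frame is given below. Several details are assumptions: that 90 is the crop height and 72 the crop width, that "centered at the top" means horizontally centered and anchored at the first row, and that nearest-neighbour interpolation is used. The text does not specify these, so this is only one plausible reading.

```python
from typing import List, Sequence, TypeVar

Pixel = TypeVar("Pixel")


def zoom_in(frame: Sequence[Sequence[Pixel]],
            crop_h: int = 90, crop_w: int = 72) -> List[List[Pixel]]:
    """Crop a top-anchored, horizontally centered window and resize it back."""
    height, width = len(frame), len(frame[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop window is larger than the frame")
    left = (width - crop_w) // 2
    crop = [list(row[left:left + crop_w]) for row in frame[:crop_h]]
    # Nearest-neighbour resize: each output pixel copies the crop pixel
    # at the proportionally scaled position.
    return [
        [crop[(r * crop_h) // height][(c * crop_w) // width] for c in range(width)]
        for r in range(height)
    ]
```

In practice the same crop would be applied to every frame of a clip so that the zoom is consistent over time.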
4.2 Implementation
The Video Encoder uses custom convolution blocks with 16, 32, 64, 128, and 256 channels. The Attention Encoder uses a hidden dimension of 128 for MLP layers, and the Regressor uses a 3-layer GNN with 128, 64 and 32 hidden dimensions followed by an MLP with a hidden dimension of 16. We use the Adam optimizer [13] with a learning rate of 1e-4, a batch size of 80, and 2500 training epochs. Our framework is implemented using PyTorch [19] and PyG [6], and the training was performed on two Nvidia Titan V GPUs.
Pretraining: We use ES/ED index labels in a pretraining step to train the Video Encoder and the Attention Encoder to give higher weights to ES and ED frames.
Classification Loss: We bin the EF values into 4 ranges [0, 30], (30, 40], (40, 55], (55, 100] and use a cross-entropy loss encouraging the model to learn EF's clinical categories [3].
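The binning into four clinical ranges and the accompanying cross-entropy term can be illustrated as below. How the class logits are produced is not described in this section, so `logits` is simply assumed to be a list of four unnormalized scores from some classification head. The function names are illustrative.

```python
import math
from bisect import bisect_left
from typing import List

# Right-closed upper edges of the first three bins:
# [0, 30], (30, 40], (40, 55], (55, 100]
EF_BIN_EDGES = [30.0, 40.0, 55.0]


def ef_class(ef_percent: float) -> int:
    """Map an EF value in percent to its bin index 0..3."""
    if not 0.0 <= ef_percent <= 100.0:
        raise ValueError("EF must lie in [0, 100]")
    # bisect_left keeps a value equal to an edge in the lower bin.
    return bisect_left(EF_BIN_EDGES, ef_percent)


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of the target class under a softmax."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]
```

For instance, an EF of 40% falls in bin 1, the (30, 40] range, while 40.5% falls in bin 2.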
4.3 Results and Discussion
Explainability. The key advantage of EchoGNN over prior work is the explainability it provides through the learned weights on the echo-graph. As shown in Fig. 2, the learned weights can indicate when human intervention is required. We observe two different scenarios: (1) The model learns the periodic nature of echo videos and assigns larger weights to frames and edges that are in between ES and ED phases before performing EF estimation. This means that the location of ES and ED can be approximated using these weights as illustrated in Fig. 2. (2) The model cannot detect the location of ES and ED frames and distributes weights more evenly. We see that in these cases, we have either an atypical zoomed-in AP4 echo or an echo where the LV is not entirely visible and is cropped. In such cases, an expert can evaluate the video and determine if new videos must be obtained. More explainability examples are provided in the supp. material.
To quantitatively measure the explainability of EchoGNN, for the cases where the model learns the periodic nature of the data (1173 samples out of 1277), we use the average frame distance (aFD) as in [21], which is computed as $\mathrm{aFD} = \frac{1}{N} \sum_{i=1}^{N} |j_i - \tilde{j}_i|$ with $j_i$ and $\tilde{j}_i$ being the true and approximated indices, respectively, for sample $i$. As shown in Table 1, our model achieves better ED aFD and comparable ES aFD without using ground-truth ES/ED locations for training, whereas Reynaud et al. [21] uses such supervision. This shows the explainability power of EchoGNN. aFD computation details are provided in the supp. material.
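The aFD metric itself is a mean absolute difference between frame indices and can be computed as in the short sketch below. The function name is illustrative.

```python
from typing import Sequence


def average_frame_distance(true_idx: Sequence[int],
                           approx_idx: Sequence[int]) -> float:
    """Mean absolute difference between true and approximated frame indices."""
    if len(true_idx) != len(approx_idx):
        raise ValueError("index lists must have the same length")
    if not true_idx:
        raise ValueError("at least one sample is required")
    return sum(abs(t - a) for t, a in zip(true_idx, approx_idx)) / len(true_idx)
```

The metric would be computed separately for ES and ED indices, giving the two aFD values reported per model.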
Fig. 2. (Top) An example where the model has learned the periodic nature of the data,
and the learned weights allow identification of ES/ED locations. (Bottom) Another
example where the LV region is cropped (as shown by the arrow), and learned weights
are distributed more evenly indicating the need for expert intervention.
EF Estimation. To evaluate the error in predicted EF values, we use Mean-Absolute-Error (MAE). Additionally, as a measure of the amount of explained variance in the data, we report the model's $R^2$ score. Moreover, we report the F1 score for the task of indicating whether EF values are lower than 40%, which is a strong indicator of heart failure [11].
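For reference, the three metrics can be computed with the standard definitions sketched below. Treating "EF below 40%" as the positive class for F1 is an assumption about the evaluation setup, and the function names are illustrative.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are equal")
    return 1.0 - ss_res / ss_tot


def f1_low_ef(y_true: Sequence[float], y_pred: Sequence[float],
              threshold: float = 40.0) -> float:
    """F1 score for detecting EF values below the threshold."""
    pairs = [(t < threshold, p < threshold) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp + fp + fn == 0:
        return 0.0  # no positives in either the labels or the predictions
    return 2 * tp / (2 * tp + fp + fn)
```

Note that F1 here only depends on which side of the 40% threshold each prediction falls, so a model can have a low MAE and still a modest F1 if its errors cluster around the threshold.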
As shown in Table 1, our model significantly outperforms [21] without direct supervision of ES and ED frame locations during training. Our model has similar predictive performance to [12] with a much lower number of parameters and the added benefit of explainability through the learned latent graph structures. EchoNet (AF) [18] requires large amounts of RAM due to sampling all 32-frame clips in a video, which prevented us from training and evaluating the model. Because of this, we only report results from the paper and cannot produce additional metrics such as the F1 score, which is not originally reported; we mark this as N/A in Table 1. This model's weak performance compared to our model suggests that EchoNet (AF) is sensitive to frame locations in a clip. Lastly, our model has a significantly lower number of parameters, making it desirable for deployment on mobile clinical devices. Our model's EF scatter plot and confusion matrix are provided in the supp. material.
Table 1. Summary of quantitative results. Lower values are better for all metrics besides $R^2$ and F1. EchoNet (AF) averages predictions on all possible 32-frame clips in a sampled video. Transformer (R) and (M) are transformer-based models with different sampling techniques. The Bayesian model uses BNNs. We mark the models that cannot predict ES/ED locations as "-" in the aFD metric. EchoGNN is the only model that provides explainability and ES/ED location estimations without direct supervision.
Model                  R2    MAE   F1    ES aFD  ED aFD  Params (M)
EchoNet (AF) [18]      0.4   7.35  N/A   -       -       31.5
Transformer (R) [21]   0.48  6.76  0.70  2.86    7.88    346.8
Transformer (M) [21]   0.52  5.95  0.55  3.35    7.17    346.8
Bayesian [12]          0.75  4.46  0.77  -       -       31.5
EchoGNN (ours)         0.76  4.45  0.78  4.15    3.68    1.7
4.4 Ablation Study
In Table 2, we see that the classification loss improves the model's performance for under-represented samples, while pretraining and data augmentation reduce EF error and increase the model's ability to represent the variance in data.
Table 2. Ablation study results. Aug., Class., and Pretrain columns indicate if the model uses data augmentation, classification loss and pretraining, respectively. We see that the classification loss improves performance for under-represented groups, while pretraining and data augmentation reduce overall EF error.
Aug.  Class.  Pretrain  R2    MAE   F1<40%
✓     ✗       ✗         0.75  4.48  0.76
✓     ✓       ✗         0.74  4.59  0.77
✓     ✗       ✓         0.75  4.47  0.73
✗     ✓       ✓         0.75  4.47  0.77
✓     ✓       ✓         0.76  4.45  0.78
5 Limitations
While our model outperforms prior works for EF estimation and also provides
explainability, there are certain limitations that can be addressed in future work.
Firstly, while the explainability provided over frames and edges of the echo-graph allows identification of cases that need closer inspection, it does not allow finding regions of each frame that the model is uncertain about. We argue that an attention map over the pixels in each frame can further help with explainability.
Secondly, creating a complete graph for long videos leads to large memory cost.
While this is not an issue for echo, where videos are relatively short, alternative
graph construction methods should be considered for longer videos.
6 Conclusion
In this work, we introduce a deep learning model that provides the benefit of
explainability via GNN-based latent graph learning. While we showcased the
success of our framework for EF estimation, we argue that the same pipeline
could be used for other datasets and problems, introducing a new paradigm for
video processing and prediction tasks from clinical data and beyond.
Acknowledgements. This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canadian Institutes of Health Research (CIHR) and computational resources provided by Advanced Research Computing at the University of British Columbia.
References
1. Amaral, C., Ralston, D., Becker, T.: Prehospital point-of-care ultrasound: A transformative technology. SAGE Open Medicine 8, 205031212093270 (07 2020)
2. Bamira, D., Picard, M.: Imaging: Echocardiology - assessment of cardiac structure and function. In: Vasan, R.S., Sawyer, D.B. (eds.) Encyclopedia of Cardiovascular Research and Medicine, pp. 35–54. Elsevier, Oxford (2018)
3. Carroll, M.: Ejection fraction: Normal range, low range, and treatment (Nov 2021)
4. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (elus). arXiv: Learning (2016)
5. Ferraioli, D., Santoro, G., Bellino, M., Citro, R.: Ventricular septal defect complicating inferior acute myocardial infarction: A case of percutaneous closure. Journal of Cardiovascular Echography 29, 17 (01 2019)
6. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. In: ICLR Workshop on Representation Learning on Graphs and Manifolds (2019)
7. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. CoRR abs/1704.01212 (2017)
8. Hou, B.: ResNetAE (2019)
9. Huang, H., Nijjar, P., Misialek, J., Blaes, A., Derrico, N., Kazmirczak, F., Klem, I., Farzaneh-Far, A., Shenoy, C.: Accuracy of left ventricular ejection fraction by contemporary multiple gated acquisition scanning in patients with cancer: Comparison with cardiovascular magnetic resonance. Journal of Cardiovascular Magnetic Resonance 19 (12 2017)
10. Jafari, M.H., Woudenberg, N.V., Luong, C., Abolmaesumi, P., Tsang, T.: Deep bayesian image segmentation for a more robust ejection fraction estimation. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1264–1268 (2021)
11. Kalogeropoulos, A.P., Fonarow, G.C., Georgiopoulou, V., Burkman, G., Siwamogsatham, S., Patel, A., Li, S., Papadimitriou, L., Butler, J.: Characteristics and Outcomes of Adult Outpatients With Heart Failure and Improved or Recovered Ejection Fraction. JAMA Cardiology 1(5), 510–518 (08 2016)
12. Kazemi Esfeh, M.M., Luong, C., Behnami, D., Tsang, T., Abolmaesumi, P.: A deep bayesian video analysis framework: Towards a more robust estimation of ejection fraction. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. pp. 582–590. Springer International Publishing (2020)
13. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. International Conference on Learning Representations (12 2014)
14. Kipf, T., Fetaya, E., Wang, K.C., Welling, M., Zemel, R.: Neural relational inference for interacting systems (2018)
15. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
16. Lang, R.M., Badano, L.P., Mor-Avi, V., Afilalo, J., Armstrong, A., Ernande, L., Flachskampf, F.A., Foster, E., Goldstein, S.A., Kuznetsova, T., Lancellotti, P., Muraru, D., Picard, M.H., Rietzschel, E.R., Rudski, L., Spencer, K.T., Tsang, W., Voigt, J.U.: Recommendations for cardiac chamber quantification by echocardiography in adults: An update from the american society of echocardiography and the european association of cardiovascular imaging. Journal of the American Society of Echocardiography 28(1), 1–39.e14 (2015)
17. Loehr, L., Rosamond, W., Chang, P., Folsom, A., Chambless, L.: Heart failure incidence and survival (from the atherosclerosis risk in communities study). The American journal of cardiology 101, 1016–22 (04 2008)
18. Ouyang, D., He, B., Ghorbani, A., Yuan, N., Ebinger, J., Langlotz, C., Heidenreich, P., Harrington, R., Liang, D., Ashley, E., Zou, J.: Video-based ai for beat-to-beat assessment of cardiac function. Nature 580 (04 2020)
19. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019)
20. Patil, V., Patil, H.: Isolated non-compaction cardiomyopathy presented with ventricular tachycardia. Heart views : the official journal of the Gulf Heart Association 12, 74–8 (04 2011)
21. Reynaud, H., Vlontzos, A., Hou, B., Beqiri, A., Leeson, P., Kainz, B.: Ultrasound video transformers for cardiac ejection fraction estimation. In: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. pp. 495–505. Springer International Publishing (2021)
22. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE transactions on neural networks 20(1), 61–80 (2008)
23. Smistad, E., Østvik, A., Salte, I.M., Melichova, D., Nguyen, T.M., Haugaa, K., Brunvand, H., Edvardsen, T., Leclerc, S., Bernard, O., Grenne, B., Løvstakken, L.: Real-time automatic ejection fraction and foreshortening detection using deep learning. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 67(12), 2595–2604 (2020)
24. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. CoRR abs/1711.11248 (2017)
25. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. CoRR abs/1706.03762 (2017)
Supplementary Material
****** ******** et al.
*** ********** ** ******* *********, **, ******
Fig. 1. Video Encoder network architecture. We use modular blocks containing 3D
convolutions with residual connections to generate low-dimensional frame embeddings.
Fig. 2. (left) The confusion matrix for our best-performing model. The chosen EF categories indicate different levels of heart failure risk with patients having EF below 40% needing medical monitoring. (right) The scatter plot showing how close our model's EF estimates are to the ground truth. We see that the model struggles with EF values between 30% and 40%, and we argue that this is due to the high inter-observer variability in the ground truth labels, which is more prominent for samples that lie in pathological
Fig. 3. ED/ES frame approximation from learned echo-graph weights: we first use a threshold to change the sum of outgoing edge weights into binary format (alternatively, frame weights can be used). Please note that this threshold is selected based on aFD performance on the validation set. The consecutive 1-valued weights form a block together. The left-most and the right-most frames in each block are the approximated ED and ES locations, respectively. We reject samples where the size of the block is equal to 55, meaning that the model has not learned the periodic nature of data. Rejecting these samples, we achieve an average frame distance of 4.15 for ES and 3.68 for ED.
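The procedure described in this caption can be sketched as follows. The helper names are hypothetical, the strict comparison against the threshold is an assumption, and the rejection size of 55 is taken directly from the caption and exposed as a parameter.

```python
from typing import List, Optional, Sequence, Tuple


def outgoing_weight_mask(adjacency: Sequence[Sequence[float]],
                         threshold: float) -> List[int]:
    """Binarize the per-frame sum of outgoing edge weights (row sums)."""
    return [1 if sum(row) > threshold else 0 for row in adjacency]


def one_blocks(mask: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (first, last) frame indices of each run of consecutive ones."""
    blocks = []
    start = None
    for idx, bit in enumerate(mask):
        if bit and start is None:
            start = idx
        elif not bit and start is not None:
            blocks.append((start, idx - 1))
            start = None
    if start is not None:
        blocks.append((start, len(mask) - 1))
    return blocks


def approximate_ed_es(adjacency: Sequence[Sequence[float]], threshold: float,
                      reject_size: int = 55) -> Optional[List[Tuple[int, int]]]:
    """Return (ED, ES) index pairs per block, or None if the sample is rejected."""
    blocks = one_blocks(outgoing_weight_mask(adjacency, threshold))
    if any(last - first + 1 == reject_size for first, last in blocks):
        return None  # the weights do not reflect the periodic structure
    return blocks  # left-most frame ~ ED, right-most frame ~ ES
```

The returned index pairs can then be compared with the labeled ED and ES frames to obtain the average frame distance.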
Fig. 4. Examples of model’s explainability capability. (left) We can see examples where
the learned frame weights allow clear identification of ES/ED locations. (right) We see
examples where we have atypical zoomed-in AP4 echo or echo where the LV is not
entirely visible and is cropped, and therefore, the model distributes frame weights
more evenly, not clearly indicating the position of ED and ES.
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
Point-of-care ultrasound at the bedside has evolved into an essential component of emergency patient care. Current evidence supports its use across a wide spectrum of medical and traumatic diseases in a variety of settings. The prehospital use of ultrasound has evolved from a niche technology to impending widespread adoption across emergency medical services systems internationally. Recent technological advances and a growing evidence base support this trend. However, concerns regarding feasibility, education, and quality assurance must be addressed proactively. This topical review describes the history of prehospital ultrasound, initial training needs, ongoing skill maintenance, quality assurance and improvement requirements, available devices, and indications for prehospital ultrasound.
Full-text available
Accurate assessment of cardiac function is crucial for the diagnosis of cardiovascular disease¹, screening for cardiotoxicity² and decisions regarding the clinical management of patients with a critical illness³. However, human assessment of cardiac function focuses on a limited sampling of cardiac cycles and has considerable inter-observer variability despite years of training4,5. Here, to overcome this challenge, we present a video-based deep learning algorithm—EchoNet-Dynamic—that surpasses the performance of human experts in the critical tasks of segmenting the left ventricle, estimating ejection fraction and assessing cardiomyopathy. Trained on echocardiogram videos, our model accurately segments the left ventricle with a Dice similarity coefficient of 0.92, predicts ejection fraction with a mean absolute error of 4.1% and reliably classifies heart failure with reduced ejection fraction (area under the curve of 0.97). In an external dataset from another healthcare system, EchoNet-Dynamic predicts the ejection fraction with a mean absolute error of 6.0% and classifies heart failure with reduced ejection fraction with an area under the curve of 0.96. Prospective evaluation with repeated human measurements confirms that the model has variance that is comparable to or less than that of human experts. By leveraging information across multiple cardiac cycles, our model can rapidly identify subtle changes in ejection fraction, is more reproducible than human evaluation and lays the foundation for precise diagnosis of cardiovascular disease in real time. As a resource to promote further innovation, we also make publicly available a large dataset of 10,030 annotated echocardiogram videos.
Volume and ejection fraction (EF) measurements of the left ventricle (LV) in 2D echocardiography are associated with high uncertainty, due both to inter-observer variability in the manual measurement and to ultrasound acquisition errors such as apical foreshortening. In this work, a real-time and fully automated ejection fraction measurement and foreshortening detection method is proposed. The method uses several deep learning components, such as view classification, cardiac cycle timing, segmentation and landmark extraction, to measure the amount of foreshortening, LV volume and EF. A dataset of 500 patients from an outpatient clinic was used to train the deep neural networks, while a separate dataset of 100 patients from another clinic was used for evaluation, where LV volume and EF were measured by an expert using clinical protocols and software. A quantitative analysis using 3D ultrasound showed that EF is considerably affected by apical foreshortening, and that the proposed method can detect and quantify the amount of apical foreshortening. The bias and standard deviation of the automatic EF measurements were -3.6 ± 8.1%, while the mean absolute difference was 7.2%, all of which are within the inter-observer variability and comparable with related studies. The proposed real-time pipeline allows for a continuous acquisition and measurement workflow without user interaction, and has the potential to significantly reduce both the time spent on analysis and the measurement error due to foreshortening, while providing quantitative volume measurements in the everyday echo lab.
Ventricular septal defect (VSD) is one of the most serious mechanical complications of acute myocardial infarction (AMI). Although the incidence of post-AMI VSD in the reperfusion era has fallen from 1%-2% to 0.17%-0.31%, it is still a life-threatening condition with a poor prognosis. Surgical VSD closure is considered the best treatment approach, since conservative management carries an extremely high mortality rate. Over the last decade, percutaneous transcatheter closure has emerged as an alternative therapeutic strategy for patients with post-AMI VSD, with outcomes similar to cardiac surgery (30-day mortality 14%-66%). We present a case of inferior AMI complicated by posterobasal VSD and cardiogenic shock that was successfully treated with percutaneous closure. The role of echocardiography in diagnosis, management, and guidance of the percutaneous procedure is emphasized. © 2019 Journal of Cardiovascular Echography | Published by Wolters Kluwer - Medknow.
We introduce the "exponential linear unit" (ELU) which speeds up learning in deep neural networks and leads to higher classification accuracies. Like rectified linear units (ReLUs), leaky ReLUs (LReLUs) and parametrized ReLUs (PReLUs), ELUs alleviate the vanishing gradient problem via the identity for positive values. However, ELUs have improved learning characteristics compared to the units with other activation functions. In contrast to ReLUs, ELUs have negative values, which allow them to push mean unit activations closer to zero like batch normalization but with lower computational complexity. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect. While LReLUs and PReLUs have negative values, too, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value with smaller inputs and thereby decrease the forward propagated variation and information. Therefore, ELUs code the degree of presence of particular phenomena in the input, while they do not quantitatively model the degree of their absence. In experiments, ELUs lead not only to faster learning, but also to significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers. On CIFAR-100, ELU networks significantly outperform ReLU networks with batch normalization, while batch normalization does not improve ELU networks. ELU networks are among the top 10 reported CIFAR-10 results and yield the best published result on CIFAR-100, without resorting to multi-view evaluation or model averaging. On ImageNet, ELU networks considerably speed up learning compared to a ReLU network with the same architecture, obtaining less than 10% classification error for a single crop, single model network.
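As a minimal illustration of the activation described in this abstract, the sketch below uses the standard ELU definition: the identity for positive inputs and `alpha * (exp(x) - 1)` otherwise, so that outputs saturate toward `-alpha` for large negative inputs. The function name `elu` and the default `alpha = 1.0` are assumptions made for this example and are not taken from the listing above.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential linear unit for a single scalar input.

    Positive inputs pass through unchanged (identity); non-positive inputs
    map to alpha * (exp(x) - 1), which saturates toward -alpha.
    """
    if x > 0:
        return x
    # expm1 computes exp(x) - 1 accurately for small |x|.
    return alpha * math.expm1(x)


if __name__ == "__main__":
    for value in (-5.0, -1.0, 0.0, 2.0):
        print(value, elu(value))
```

The negative branch is what lets mean activations move toward zero, in contrast to ReLU, which outputs exactly zero for all non-positive inputs.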
Ejection Fraction (EF) is a widely used and critical index of cardiac health. EF measures the efficacy of the cyclic contraction of the ventricles and the outward pumping of blood through the arteries. Timely and robust evaluation of EF is essential, as reduced EF indicates dysfunction in blood delivery during ventricular systole and is associated with a number of cardiac and non-cardiac risk factors and mortality-related outcomes. Automated, reliable EF estimation in echocardiography (echo) has proven challenging due to low and variable image quality and limited amounts of data for training data-driven algorithms, which delay the integration of these technologies into the clinical workflow. In this paper, we introduce a Bayesian learning framework for automated EF assessment in echo videos. Our key contribution is to automatically estimate the epistemic uncertainty, i.e. the model uncertainty, in EF estimation. We anticipate that such information about uncertainty can be incorporated in clinical decision making. We use a ResNet18-based (2 + 1)D network as the baseline architecture for video analysis and compare it side by side with our probabilistic approach using public data from 10,031 echo exams. Our results clearly indicate the superior performance of the Bayesian model in the clinically critical lower EF population.
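For readers unfamiliar with the quantity being estimated, the sketch below computes EF from two left-ventricle volumes using the standard clinical definition, EF = (EDV - ESV) / EDV × 100, where EDV and ESV are the end-diastolic and end-systolic volumes. This is only the arithmetic behind the index, not the Bayesian method of the abstract above; the function name and the example volumes are hypothetical.

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Return ejection fraction in percent from ventricular volumes in mL.

    edv_ml: end-diastolic volume (ventricle at its fullest).
    esv_ml: end-systolic volume (ventricle after contraction).
    """
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    if not 0 <= esv_ml <= edv_ml:
        raise ValueError("end-systolic volume must lie in [0, EDV]")
    # Fraction of the end-diastolic volume ejected in one beat, as a percentage.
    return 100.0 * (edv_ml - esv_ml) / edv_ml


if __name__ == "__main__":
    # Hypothetical volumes chosen only to illustrate the calculation.
    print(ejection_fraction(edv_ml=100.0, esv_ml=40.0))
```

Because both volumes come from traced or segmented frames, an error in either one propagates directly into EF, which is one reason the manual estimate varies between observers.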
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.