EchoGNN: Explainable Ejection Fraction
Estimation with Graph Neural Networks
Masoud Mokhtari1[0000−0001−9471−5573], Teresa Tsang2, Purang Abolmaesumi1⋆, and Renjie Liao1⋆
1Electrical and Computer Engineering, University of British Columbia, Vancouver,
BC, Canada
{masoud, purang, rjliao}@ece.ubc.ca
2Vancouver General Hospital, Vancouver, BC, Canada
t.tsang@ubc.ca
Abstract. Ejection fraction (EF) is a key indicator of cardiac function,
allowing identification of patients prone to heart dysfunctions such as
heart failure. EF is estimated from cardiac ultrasound videos known as
echocardiograms (echo) by manually tracing the left ventricle and es-
timating its volume on certain frames. These estimations exhibit high
inter-observer variability due to the manual process and varying video
quality. Such sources of inaccuracy and the need for rapid assessment
necessitate reliable and explainable machine learning techniques. In this
work, we introduce EchoGNN, a model based on graph neural networks
(GNNs) to estimate EF from echo videos. Our model first infers a latent
echo-graph from the frames of one or multiple echo cine series. It then
estimates weights over nodes and edges of this graph, indicating the im-
portance of individual frames that aid EF estimation. A GNN regressor
uses this weighted graph to predict EF. We show, qualitatively and quan-
titatively, that the learned graph weights provide explainability through
identification of critical frames for EF estimation, which can be used to
determine when human intervention is required. On EchoNet-Dynamic
public EF dataset, EchoGNN achieves EF prediction performance that is
on par with state of the art and provides explainability, which is crucial
given the high inter-observer variability inherent in this task. Our source
code is publicly available at: https://github.com/MasoudMo/echognn.
Keywords: Ultrasound · Ejection Fraction · Cardiac Imaging · Explainable Models · Graph Neural Networks · Deep Learning
1 Introduction
Ejection fraction (EF) is a ratio indicating the volume of blood pumped by the
heart. This measurement is crucial in monitoring cardiovascular health and is a
potential indicator of heart failure [9,17]. EF is computed using the stroke vol-
ume, which is the blood volume difference in the Left Ventricle (LV) during the
⋆ Co-Corresponding Authors
End-Systolic (ES) and End-Diastolic (ED) phases of the cardiac cycle denoted by
ESV and EDV, respectively [2]. These volumes are estimated from ultrasound
videos of the heart, i.e. echocardiograms (echo), which involves detecting the
frames corresponding to ES and ED and tracing the LV region. The manual pro-
cess of detecting the correct frames and making proper traces is prone to human
error. Therefore, the American Society of Echocardiography recommends per-
forming EF estimation for up to 5 cardiac cycles and averaging the results [16].
However, this guideline is seldom followed in practice, and a single representative
beat is selected for evaluation instead. This results in inter-observer variations
from 7.6% to 13.9% in the EF ratio [18].
Automatic EF estimation techniques aid professionals by adding another
layer of verification. Additionally, with the emergence of Point-of-Care Ultra-
sound (POCUS) imaging devices, which are routinely used by less experienced
echo users, automation of clinical measurements such as EF is further needed [1].
However, to be adopted broadly, such automation techniques must be explain-
able to detect when human intervention is required. Different machine learn-
ing (ML) architectures have been proposed to perform automatic EF estima-
tion [12,10,18,21,23], most of which lack reliable explainability mechanisms.
Some of these models do not provide confidence estimates for their predictions [18,21,23], while others suffer from low accuracy due to unrealistic data augmentation during training and over-reliance on ground truth labels [21].
In this work, we introduce EchoGNN, a novel deep learning model for explain-
able EF estimation. Our approach first infers a latent graph between frames of
one or multiple echo cine series. It then estimates EF based on this latent graph
via Graph Neural Networks (GNNs) [22], which are a class of deep learning mod-
els that efficiently capture graph data. To the best of our knowledge, our work
is the first to investigate GNNs in the context of ultrasound videos and
EF estimation. Moreover, our work brings explainability through latent graph
learning, inspiring further work in this domain. Our contributions are threefold:
• We introduce EchoGNN, a novel deep learning model for explainable EF estimation through GNN-based latent graph learning.
• We present a weakly-supervised training pipeline for EF estimation without direct reliance on ground truth ES/ED frame labels.
• Our model has a much lower number of parameters compared to prior work, significantly reducing computational and memory requirements.
2 Related Work
Most prior works use Convolutional Neural Networks (CNNs) in their EF es-
timation pipeline [12,10,18,21]. Ouyang et al. [18] use ResNet-based (2+1)D convolutions [24] to estimate and average EF over all possible 32-frame clips in an echo, while Kazemi Esfeh et al. [12] use a similar approach under a Bayesian Neural Network (BNN) setting. Recent work uses the encoder of ResNetAE [8] to reduce data dimensionality before using transformers [25] to jointly perform
ES/ED frame detection and EF estimation [21]. While these methods show differ-
ent levels of accuracy and success in predicting EF, they either lack explainability
or significantly rely on accurate clinical labels, which are inherently noisy and
subject to significant inter-observer variability. As an example, the transformer-
based approach requires ES/ED frame index labels in addition to EF labels in its
training pipeline [21]. Lastly, while Kazemi Esfeh et al. and Jafari et al. [12,10]
report uncertainty on their predictions, they still lack explainable indicators as
to why models fail or succeed for different cases. Our proposed framework based
on GNNs aims to alleviate these shortcomings. It provides explainability by only
relying on EF labels and not requiring ES/ED frame labels in a supervised man-
ner. Lastly, as an added advantage, the number of parameters of our model is significantly smaller than that of prior work, which is highly desirable for deploying such
models on mobile clinical devices.
3 Methodology
We consider the following supervised problem for EF estimation: assume for each patient $i \in [N]$ in dataset $D$, there is a ground truth EF ratio $y_i \in \mathbb{R}$ and there are $K$ echo videos $x^i_k \in \mathbb{R}^{T \times W \times H}$, where $k \in [K]$, $T$ is the number of frames, and $H$ and $W$ are the height and width of each frame. The goal of our model is to learn a function $f: \mathbb{R}^{K \times T \times H \times W} \rightarrow \mathbb{R}$ to estimate EF from echo videos. For notational simplicity, and since our evaluation dataset only contains one video per patient, we assume that $K = 1$. However, it must be noted that our model is flexible in this regard and can handle multiple videos per patient.
3.1 EchoGNN Architecture
As shown in Fig. 1, EchoGNN is composed of three main components: Video
Encoder, Attention Encoder, and Graph Regressor. In the following subsections,
we discuss the details pertaining to each component.
Video Encoder The original echo videos are high-dimensional and must be
mapped into lower-dimensional embeddings to reduce memory footprint and
remove redundant information.
The Video Encoder is used to learn a mapping $f_{ve}: \mathbb{R}^{T \times H \times W} \rightarrow \mathbb{R}^{T \times d}$ from input echo videos $x^i \in \mathbb{R}^{T \times W \times H}$ to $d$-dimensional embeddings $h^i_j \in \mathbb{R}^d$, where $j \in [T]$ is the frame number. The temporal dimension is preserved because the Attention Encoder requires embeddings for all frames to produce interpretable weights over them. We use a custom network consisting of 3D convolutions and residual connections to exploit both the spatial and temporal information in the video when generating the embeddings. This network's architecture is provided in the supp. material. Lastly, following [25], periodic positional encodings are added to the generated frame embeddings to encode the sequential nature of video data.
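As a concrete illustration, the sketch below shows one way such an encoder could be written in PyTorch. It is a minimal sketch under our own assumptions: the block structure, channel counts, and final projection are placeholders rather than the exact architecture given in the supp. material, and the periodic positional encodings are assumed to be added to the returned embeddings afterwards.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock3D(nn.Module):
    """3D-convolutional block with a residual connection (illustrative only)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm3d(c_out), nn.ELU(),
            nn.Conv3d(c_out, c_out, kernel_size=3, padding=1),
            nn.BatchNorm3d(c_out))
        self.skip = nn.Conv3d(c_in, c_out, kernel_size=1)
        # Pool only the spatial dimensions so the temporal axis T is preserved.
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))

    def forward(self, x):
        return self.pool(F.elu(self.conv(x) + self.skip(x)))

class VideoEncoder(nn.Module):
    """Maps an echo clip of shape (B, 1, T, H, W) to frame embeddings (B, T, d)."""
    def __init__(self, channels=(16, 32, 64), d=128):
        super().__init__()
        blocks, c_prev = [], 1
        for c in channels:
            blocks.append(ResBlock3D(c_prev, c))
            c_prev = c
        self.blocks = nn.Sequential(*blocks)
        self.proj = nn.LazyLinear(d)  # flattens each frame's spatial features to d dimensions

    def forward(self, x):                          # x: (B, 1, T, H, W)
        z = self.blocks(x)                         # (B, C, T, H', W')
        z = z.permute(0, 2, 1, 3, 4).flatten(2)    # (B, T, C * H' * W')
        return self.proj(z)                        # (B, T, d)

frame_emb = VideoEncoder()(torch.randn(2, 1, 64, 112, 112))   # -> (2, 64, 128)
```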
Fig. 1. EchoGNN has three main components. (1) Video Encoder: encodes video
frames into vector embeddings while preserving the temporal dimension; (2) Atten-
tion Encoder: infers weights over the nodes (video frames) and edges (relationships
among frames) of the echo-graph; (3) Graph Regressor: estimates EF using the in-
ferred weighted graph; this figure shows an example where each patient has an apical
two-chamber (AP2) and an apical four-chamber (AP4) echo video.
Attention Encoder For each patient, we construct an echo-graph, which is a complete graph where each node corresponds to a frame in the echo video, and the edges capture the non-Euclidean relationships between these frames. Formally, we denote the echo-graph by $G_{echo}(V, E)$, where $V$ is the set of nodes corresponding to echo frames such that $|V| = T$, and $E$ is the set of edges between the nodes such that if $v_1, v_2 \in V$ are connected, then $e_{v_1, v_2} \in E$. We use the frame embeddings from our Video Encoder as node features of $G_{echo}$. That is, $\{h^i_1, h^i_2, \ldots, h^i_T\}$ is the set of features for $\{v_1, v_2, \ldots, v_T\}$. These embeddings can be represented as a matrix $H^i \in \mathbb{R}^{T \times d}$ such that each row is the embedding of a frame in the echo video of patient $i$.
Inspired by [14], we propose using GNNs to learn and assign weights to both edges and nodes of the echo-graph. These weights are learned to encode the importance of each frame (node weights) and of the relationships among frames (edge weights) for the final EF estimation.
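As a reference point, the complete echo-graph can be materialized as an edge index in the COO format used by PyTorch Geometric; the helper below is a minimal sketch (whether self-loops are kept is an implementation detail we do not know from the text, so they are dropped here).

```python
import torch

def build_echo_graph(num_frames: int) -> torch.Tensor:
    """Edge index of a complete directed graph over the frame nodes, shape (2, T*(T-1))."""
    src, dst = torch.meshgrid(torch.arange(num_frames), torch.arange(num_frames),
                              indexing="ij")
    mask = src != dst                                # drop self-loops (assumption)
    return torch.stack([src[mask], dst[mask]])

edge_index = build_echo_graph(64)                    # 64 sampled frames -> 4032 directed edges
```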
The Attention Encoder infers weights over the edges and nodes of the echo-graph using message-passing-based GNNs [7]. A single message passing step is enough for each node to capture information from all other nodes, since the echo-graph is a complete graph. More specifically, the following operations are used to obtain the weight for each edge $e_{v_k, v_s}$:
$$u_{k,s} = \mathrm{MLP}_1\big([h^i_k \,\Vert\, h^i_s]\big) \qquad (\text{node} \rightarrow \text{edge}) \qquad (1)$$
$$v_s = \mathrm{MLP}_2\Big(\sum_{k \neq s} u_{k,s}\Big) \qquad (\text{edge} \rightarrow \text{node}) \qquad (2)$$
$$z_{k,s} = \mathrm{MLP}_3\big([v_k \,\Vert\, v_s]\big) \qquad (\text{node} \rightarrow \text{edge}) \qquad (3)$$
$$a_{k,s} = \sigma(z_{k,s}), \qquad (4)$$
where $\sigma$ is the Sigmoid function, $[\cdot \,\Vert\, \cdot]$ is the concatenation operator, and $a_{k,s} \in [0,1]$ is the inferred weight for the directed edge from $v_k$ to $v_s$. Similarly, weights $w_s \in [0,1]$ for each node are generated by inserting another edge $\rightarrow$ node operation after Eq. 3. All MLPs use two fully connected linear layers with ELU [4] activation and batch normalization.
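The sketch below illustrates one possible PyTorch implementation of Eqs. (1)–(4). The exact form of the additional edge → node step that produces the frame weights $w_s$ is not specified in the text, so the `mlp_w` head and its scalar input are our assumptions.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_hid, d_out):
    # Two fully connected layers with ELU activation and batch normalization.
    return nn.Sequential(nn.Linear(d_in, d_hid), nn.BatchNorm1d(d_hid), nn.ELU(),
                         nn.Linear(d_hid, d_out))

class AttentionEncoder(nn.Module):
    """Sketch of Eqs. (1)-(4): infers edge weights a_{k,s} and frame weights w_s."""
    def __init__(self, d=128, d_hid=128):
        super().__init__()
        self.mlp1 = mlp(2 * d, d_hid, d_hid)   # node -> edge, Eq. (1)
        self.mlp2 = mlp(d_hid, d_hid, d_hid)   # edge -> node, Eq. (2)
        self.mlp3 = mlp(2 * d_hid, d_hid, 1)   # node -> edge, Eq. (3)
        self.mlp_w = mlp(1, d_hid, 1)          # assumed extra edge -> node step for w_s

    def forward(self, h, edge_index):
        # h: (T, d) frame embeddings; edge_index: (2, E) directed edges k -> s.
        k, s = edge_index
        T = h.size(0)
        u = self.mlp1(torch.cat([h[k], h[s]], dim=-1))                       # Eq. (1)
        v = self.mlp2(torch.zeros(T, u.size(-1), device=h.device)
                      .index_add(0, s, u))                                   # sum over k != s, Eq. (2)
        z = self.mlp3(torch.cat([v[k], v[s]], dim=-1))                       # Eq. (3)
        a = torch.sigmoid(z).squeeze(-1)                                     # edge weights, Eq. (4)
        z_agg = torch.zeros(T, 1, device=h.device).index_add(0, s, z)
        w = torch.sigmoid(self.mlp_w(z_agg)).squeeze(-1)                     # frame (node) weights
        return a, w
```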
Regressor Our Regressor network uses GNN layers on the learned weighted echo-graph to perform EF estimation. Specifically, for each patient, the output of the Attention Encoder can be represented as a weighted adjacency matrix $A \in [0,1]^{T \times T}$ and a node weight vector $w \in [0,1]^T$. The Regressor uses $A$ to generate embeddings over the frames of the echo video:
$$H^l = g^l(A, H^{l-1}), \qquad l = 1, \ldots, L, \qquad (5)$$
where $H^l \in \mathbb{R}^{T \times d_g}$ is the matrix of learned node embeddings at layer $l$, $H^0$ is the matrix of frame embeddings from the Video Encoder, and $g^l$ is composed of a Graph Convolutional Network (GCN) layer [15] followed by batch normalization and ELU activation. To represent the whole graph with a single vector embedding, the node embeddings are averaged using the frame weights $w$ generated by the Attention Encoder:
$$h^i_{\mathrm{graph}} = \frac{\sum_{j=1}^{T} w_j \, H^L_j}{\sum_{j=1}^{T} w_j}, \qquad (6)$$
where $H^L_j \in \mathbb{R}^{d_g}$ is the $j$th row of $H^L$, and $w_j$ is the $j$th scalar weight in the frame weight vector. $h^i_{\mathrm{graph}}$ is mapped to an EF estimate by an MLP with two fully connected linear layers, ELU activation and batch normalization.
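A minimal PyG-based sketch of the Regressor is shown below. The use of `GCNConv` with edge weights, the layer sizes, and the omission of batch normalization in the final head (in the paper it is applied over the patient batch) are illustrative choices rather than the exact released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GraphRegressor(nn.Module):
    """Sketch: GCN layers over the weighted echo-graph (Eq. 5), frame-weighted
    average pooling (Eq. 6), and an MLP head producing the EF estimate."""
    def __init__(self, d=128, dims=(128, 64, 32), d_head=16):
        super().__init__()
        self.convs, self.norms = nn.ModuleList(), nn.ModuleList()
        d_prev = d
        for d_g in dims:
            self.convs.append(GCNConv(d_prev, d_g))
            self.norms.append(nn.BatchNorm1d(d_g))
            d_prev = d_g
        self.head = nn.Sequential(nn.Linear(d_prev, d_head), nn.ELU(),
                                  nn.Linear(d_head, 1))

    def forward(self, h, edge_index, a, w):
        # h: (T, d) frame embeddings, a: (E,) edge weights, w: (T,) frame weights.
        for conv, norm in zip(self.convs, self.norms):
            h = F.elu(norm(conv(h, edge_index, edge_weight=a)))     # Eq. (5)
        h_graph = (w.unsqueeze(-1) * h).sum(dim=0) / w.sum()        # Eq. (6)
        return self.head(h_graph).squeeze(-1)                       # scalar EF estimate
```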
Learning Algorithm The model is differentiable in an end-to-end manner. Therefore, we use gradient descent with the Mean-Absolute-Error (MAE) between predicted EF estimates $\tilde{y}_i$ and ground truth EF values $y_i \in Y$ as the optimization objective, which is computed as $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} |\tilde{y}_i - y_i|$.
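A corresponding training step might look as follows; `model`, `videos`, and `ef_true` are hypothetical placeholders for the full EchoGNN pipeline, a batch of echo clips, and the corresponding ground truth EF values.

```python
import torch.nn.functional as F

def train_step(model, videos, ef_true, optimizer):
    """One gradient step with the MAE objective L = (1/N) * sum_i |y_tilde_i - y_i|."""
    optimizer.zero_grad()
    ef_pred = model(videos)                 # predicted EF ratios, shape (N,)
    loss = F.l1_loss(ef_pred, ef_true)      # mean absolute error over the batch
    loss.backward()
    optimizer.step()
    return loss.item()
```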
4 Experiments
4.1 Dataset
We use the EchoNet-Dynamic public EF dataset, consisting of 10,030 AP4 echo videos obtained between 2016 and 2018 at Stanford University Hospital. Each
echo frame has a dimension of 112 ×112, and the dataset provides ESV, EDV,
contour tracings of LV, and EF ratios for each patient [18]. We use the provided
splits in the dataset from mutually exclusive patients, including 7465 samples
for training, 1288 samples for validation, and 1277 samples for testing. The data
distribution in the training set is unbalanced, with only 12.7% of samples having an EF ratio below 40%. Clinically, however, such patients are the most critical to detect for timely intervention [3,11].
Frame Sampling: To stay within reasonable memory requirements, we use a fixed number of frames per echo, denoted by $T_{fixed}$. During training, we uniformly sample an initial frame index $j$ in $[1, T^i_{total} - T_{fixed}]$, where $T^i_{total}$ is the total number of frames in echo video $i$, and use the $T_{fixed}$ frames starting from $j$. Following [18], we set $T_{fixed}$ to 64 and use zero padding in the temporal dimension when $T^i_{total} < T_{fixed}$. At test time, we extract multiple back-to-back clips, each containing $T_{fixed}$ frames, with the first clip starting from index 0. We use zero padding in the temporal dimension if $T^i_{total} < T_{fixed}$ and overlap the last clip with the previous one if it overshoots $T^i_{total}$. We independently estimate EF for each clip and report the average prediction.
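The sampling scheme can be sketched as follows; the exact handling of boundary cases (in particular, where the overlapping final test-time clip is placed) is our interpretation of the description above.

```python
import numpy as np

T_FIXED = 64

def sample_training_clip(video: np.ndarray) -> np.ndarray:
    """video: (T_total, H, W). Returns a clip of exactly T_FIXED frames starting at a
    uniformly sampled index, zero-padded in time if the video is too short."""
    t_total = video.shape[0]
    if t_total <= T_FIXED:
        pad = np.zeros((T_FIXED - t_total, *video.shape[1:]), dtype=video.dtype)
        return np.concatenate([video, pad], axis=0)
    start = np.random.randint(0, t_total - T_FIXED + 1)
    return video[start:start + T_FIXED]

def extract_test_clips(video: np.ndarray) -> list:
    """Back-to-back T_FIXED-frame clips starting at index 0; a final overlapping clip
    is added if the last back-to-back clip would overshoot the end of the video."""
    t_total = video.shape[0]
    if t_total <= T_FIXED:
        return [sample_training_clip(video)]          # zero padding covers short videos
    starts = list(range(0, t_total - T_FIXED + 1, T_FIXED))
    if starts[-1] + T_FIXED < t_total:
        starts.append(t_total - T_FIXED)              # overlap with the previous clip
    return [video[s:s + T_FIXED] for s in starts]
```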
Data Augmentation: Occasionally, AP4 echo is zoomed in on the LV region for certain clinical studies [5,20]. To allow learning of this under-represented distribution, we augment our training set by using a fixed cropping window of 90 × 72 centered at the top of each frame and interpolating the result back to the original 112 × 112 dimension, which creates the desired zoom-in effect.
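A possible implementation of this zoom-in augmentation is sketched below; placing the 90 × 72 window over the top 90 rows and the horizontally centred 72 columns is our reading of "centered at the top of each frame".

```python
import torch
import torch.nn.functional as F

def zoom_in_augment(clip: torch.Tensor) -> torch.Tensor:
    """clip: float tensor of shape (T, 112, 112). Crops a fixed 90x72 window at the
    top of each frame and resizes it back to 112x112 (zoom-in effect)."""
    t, h, w = clip.shape
    left = (w - 72) // 2                            # horizontally centred
    cropped = clip[:, :90, left:left + 72]          # (T, 90, 72), anchored at the top
    resized = F.interpolate(cropped.unsqueeze(1), size=(h, w),
                            mode="bilinear", align_corners=False)
    return resized.squeeze(1)                       # (T, 112, 112)
```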
4.2 Implementation
The Video Encoder uses custom convolution blocks with 16, 32, 64, 128, and 256 channels. The Attention Encoder uses a hidden dimension of 128 for the MLP layers, and the Regressor uses a 3-layer GNN with 128, 64, and 32 hidden dimensions followed by an MLP with a hidden dimension of 16. We use the Adam optimizer [13] with a learning rate of 1e-4, a batch size of 80, and 2500 training epochs. Our framework is implemented using PyTorch [19] and PyG [6], and training was performed on two Nvidia Titan V GPUs. Pretraining: We use ES/ED index labels in a pretraining step to train the Video Encoder and the Attention Encoder to give higher weights to ES and ED frames. Classification Loss: We bin the EF values into the 4 ranges $[0, 30], (30, 40], (40, 55], (55, 100]$ and use a cross-entropy loss encouraging the model to learn EF's clinical categories [3].
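The auxiliary classification loss could be implemented as follows; the `class_logits` argument is a hypothetical (N, 4) output head added alongside the regression output, since the text does not specify how the class scores are produced.

```python
import torch
import torch.nn.functional as F

# Clinical EF bins from the text: [0, 30], (30, 40], (40, 55], (55, 100].
BIN_EDGES = torch.tensor([30.0, 40.0, 55.0])

def ef_class(ef: torch.Tensor) -> torch.Tensor:
    """Maps EF percentages to class indices 0..3 using the bin edges above."""
    return torch.bucketize(ef, BIN_EDGES, right=False)

def auxiliary_class_loss(class_logits: torch.Tensor, ef_true: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predicted class logits and the binned EF labels."""
    return F.cross_entropy(class_logits, ef_class(ef_true))
```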
4.3 Results and Discussion
Explainability The key advantage of EchoGNN over prior work is the explain-
ability it provides through the learned weights on the echo-graph. As shown in
Fig. 2, the learned weights can indicate when human intervention is required. We
observe two different scenarios: (1) the model learns the periodic nature of echo
videos and assigns larger weights to frames and edges that are in between ES and
ED phases before performing EF estimation. This means that the location of ES
and ED can be approximated using these weights as illustrated in Fig. 2. (2) The
model cannot detect the location of ES and ED frames and distributes weights
more evenly. We see that in these cases, we have either an atypical zoomed-in
AP4 echo or an echo where the LV is not entirely visible and is cropped. In such
cases, an expert can evaluate the video and determine if new videos must be
obtained. More explainability examples are provided in the supp. material.
To quantitatively measure the explainability of EchoGNN, for the cases where the model learns the periodic nature of the data (1173 samples out of 1277), we use the average frame distance (aFD) as in [21], computed as $\mathrm{aFD} = \frac{1}{N}\sum_{i=1}^{N} |j_i - \tilde{j}_i|$, where $j_i$ and $\tilde{j}_i$ are the true and approximated indices, respectively, for sample $i$. As shown in Table 1, our model achieves better ED aFD and comparable ES aFD without using ground-truth ES/ED locations for training, whereas Reynaud et al. [21] use such supervision. This shows the explainability power of EchoGNN. aFD computation details are provided in the supp. material.
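For completeness, the metric itself reduces to a mean absolute difference of frame indices over the non-rejected samples, as in the short sketch below.

```python
import numpy as np

def average_frame_distance(true_idx: np.ndarray, pred_idx: np.ndarray) -> float:
    """aFD = (1/N) * sum_i |j_i - j_tilde_i| over ground-truth and approximated indices."""
    return float(np.mean(np.abs(true_idx - pred_idx)))
```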
Fig. 2. (Top) An example where the model has learned the periodic nature of the data,
and the learned weights allow identification of ES/ED locations. (Bottom) Another
example where the LV region is cropped (as shown by the arrow), and learned weights
are distributed more evenly indicating the need for expert intervention.
EF Estimation To evaluate the error in predicted EF values, we use the Mean-Absolute-Error (MAE). Additionally, as a measure of the amount of explained variance in the data, we report the model's $R^2$ score. Moreover, we report the $F_1$ score for the task of indicating whether EF values are lower than 40%, which is a strong indicator of heart failure [11].
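These metrics can be computed with standard scikit-learn calls, as sketched below; the 40% threshold defines the binary label used for the $F_1$ score.

```python
import numpy as np
from sklearn.metrics import f1_score, mean_absolute_error, r2_score

def ef_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """R^2, MAE, and the F1 score for flagging EF below 40% (heart-failure indicator)."""
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "F1<40%": f1_score(y_true < 40.0, y_pred < 40.0),
    }
```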
As shown in Table 1, our model significantly outperforms [21] without direct
supervision of ES and ED frame locations during training. Our model has similar
predictive performance to [12] with a much lower number of parameters and the added benefit of explainability through the learned latent graph structures. EchoNet (AF) [18] requires large amounts of RAM because it samples all possible 32-frame clips in a video, which prevented us from training and evaluating the model ourselves. We therefore only report the results from the original paper and cannot produce additional metrics such as the $F_1$ score, which is not reported there; we mark this with N/A in Table 1. The weak performance of EchoNet (AF) compared to our model shows its sensitivity to frame locations in a clip. Lastly, our model has
a significantly lower number of parameters, making it desirable for deployment
on mobile clinical devices. Our model’s EF scatter plot and confusion matrix are
provided in the supp. material.
Table 1. Summary of quantitative results. Lower values are better for all metrics besides $R^2$ and $F_1$. EchoNet (AF) averages predictions on all possible 32-frame clips in a sampled video. Transformer (R) and (M) are transformer-based models with different sampling techniques. The Bayesian model uses BNNs. We mark the models that cannot predict ES/ED locations as "-" in the aFD metric. EchoGNN is the only model that provides explainability and ES/ED location estimations without direct supervision.

Model                  R^2    MAE   F1 <40%   ES aFD   ED aFD   #params (x10^6)
EchoNet (AF) [18]      0.4    7.35  N/A       -        -        31.5
Transformer (R) [21]   0.48   6.76  0.70      2.86     7.88     346.8
Transformer (M) [21]   0.52   5.95  0.55      3.35     7.17     346.8
Bayesian [12]          0.75   4.46  0.77      -        -        31.5
EchoGNN (ours)         0.76   4.45  0.78      4.15     3.68     1.7
4.4 Ablation Study
In Table 2, we see that the classification loss improves the model's performance for under-represented samples, while pretraining and data augmentation reduce the EF error and increase the model's ability to represent the variance in the data.
Table 2. Ablation study results. The Aug., Class., and Pretrain columns indicate if the model uses data augmentation, classification loss and pretraining, respectively. We see that the classification loss improves performance for under-represented groups, while pretraining and data augmentation reduce overall EF error.

Aug.   Class.  Pretrain   R^2    MAE   F1 <40%
✓      ✗       ✗          0.75   4.48  0.76
✓      ✓       ✗          0.74   4.59  0.77
✓      ✗       ✓          0.75   4.47  0.73
✗      ✓       ✓          0.75   4.47  0.77
✓      ✓       ✓          0.76   4.45  0.78
5 Limitations
While our model outperforms prior works for EF estimation and also provides
explainability, there are certain limitations that can be addressed in future work.
Firstly, while the explainability provided over the frames and edges of the echo-graph allows identification of cases that need closer inspection, it does not reveal the regions of each frame that the model is uncertain about. We argue that an attention map over the pixels in each frame could further help with explainability. Secondly, creating a complete graph for long videos leads to a large memory cost. While this is not an issue for echo, where videos are relatively short, alternative graph construction methods should be considered for longer videos.
6 Conclusion
In this work, we introduce a deep learning model that provides the benefit of
explainability via GNN-based latent graph learning. While we showcased the
success of our framework for EF estimation, we argue that the same pipeline
could be used for other datasets and problems, introducing a new paradigm for
video processing and prediction tasks from clinical data and beyond.
Acknowledgements. This research was supported in part by the Natural Sci-
ences and Engineering Research Council of Canada (NSERC), the Canadian
Institutes of Health Research (CIHR) and computational resources provided by
Advanced Research Computing at the University of British Columbia.
References
1. Amaral, C., Ralston, D., Becker, T.: Prehospital point-of-care ultrasound: A trans-
formative technology. SAGE Open Medicine 8, 205031212093270 (07 2020) 2
2. Bamira, D., Picard, M.: Imaging: Echocardiology—assessment of cardiac structure
and function. In: Vasan, R.S., Sawyer, D.B. (eds.) Encyclopedia of Cardiovascular
Research and Medicine, pp. 35–54. Elsevier, Oxford (2018) 2
3. Carroll, M.: Ejection fraction: Normal range, low range, and treatment (Nov 2021),
https://www.healthline.com/health/ejection-fraction 6
4. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network
learning by exponential linear units (elus). arXiv: Learning (2016) 5
5. Ferraioli, D., Santoro, G., Bellino, M., Citro, R.: Ventricular septal defect compli-
cating inferior acute myocardial infarction: A case of percutaneous closure. Journal
of Cardiovascular Echography 29, 17 (01 2019) 6
6. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric.
In: ICLR Workshop on Representation Learning on Graphs and Manifolds (2019)
6
7. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message
passing for quantum chemistry. CoRR abs/1704.01212 (2017) 4
8. Hou, B.: ResNetAE (2019), https://github.com/farrell236/ResNetAE 2
9. Huang, H., Nijjar, P., Misialek, J., Blaes, A., Derrico, N., Kazmirczak, F., Klem,
I., Farzaneh-Far, A., Shenoy, C.: Accuracy of left ventricular ejection fraction by
contemporary multiple gated acquisition scanning in patients with cancer: Compar-
ison with cardiovascular magnetic resonance. Journal of Cardiovascular Magnetic
Resonance 19 (12 2017) 1
10. Jafari, M.H., Woudenberg, N.V., Luong, C., Abolmaesumi, P., Tsang, T.: Deep
bayesian image segmentation for a more robust ejection fraction estimation. In:
2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1264–
1268 (2021) 2,3
11. Kalogeropoulos, A.P., Fonarow, G.C., Georgiopoulou, V., Burkman, G., Si-
wamogsatham, S., Patel, A., Li, S., Papadimitriou, L., Butler, J.: Characteristics
and Outcomes of Adult Outpatients With Heart Failure and Improved or Recov-
ered Ejection Fraction. JAMA Cardiology 1(5), 510–518 (08 2016) 6,7
12. Kazemi Esfeh, M.M., Luong, C., Behnami, D., Tsang, T., Abolmaesumi, P.: A
deep bayesian video analysis framework: Towards a more robust estimation of
ejection fraction. In: Medical Image Computing and Computer Assisted Interven-
tion – MICCAI 2020. pp. 582–590. Springer International Publishing (2020) 2,3,
8
13. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. International
Conference on Learning Representations (12 2014) 6
14. Kipf, T., Fetaya, E., Wang, K.C., Welling, M., Zemel, R.: Neural relational infer-
ence for interacting systems (2018) 4
15. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907 (2016) 5
16. Lang, R.M., Badano, L.P., Mor-Avi, V., Afilalo, J., Armstrong, A., Ernande, L.,
Flachskampf, F.A., Foster, E., Goldstein, S.A., Kuznetsova, T., Lancellotti, P.,
Muraru, D., Picard, M.H., Rietzschel, E.R., Rudski, L., Spencer, K.T., Tsang, W.,
Voigt, J.U.: Recommendations for cardiac chamber quantification by echocardiog-
raphy in adults: An update from the american society of echocardiography and the
european association of cardiovascular imaging. Journal of the American Society
of Echocardiography 28(1), 1–39.e14 (2015) 2
17. Loehr, L., Rosamond, W., Chang, P., Folsom, A., Chambless, L.: Heart failure
incidence and survival (from the atherosclerosis risk in communities study). The
American journal of cardiology 101, 1016–22 (04 2008) 1
18. Ouyang, D., He, B., Ghorbani, A., Yuan, N., Ebinger, J., Langlotz, C., Heidenreich,
P., Harrington, R., Liang, D., Ashley, E., Zou, J.: Video-based ai for beat-to-beat
assessment of cardiac function. Nature 580 (04 2020) 2,6,8
19. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T.,
Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z.,
Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.:
Pytorch: An imperative style, high-performance deep learning library. In: Wallach,
H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.)
Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran
Associates, Inc. (2019) 6
20. Patil, V., Patil, H.: Isolated non-compaction cardiomyopathy presented with ven-
tricular tachycardia. Heart views : the official journal of the Gulf Heart Association
12, 74–8 (04 2011) 6
21. Reynaud, H., Vlontzos, A., Hou, B., Beqiri, A., Leeson, P., Kainz, B.: Ultrasound
video transformers for cardiac ejection fraction estimation. In: de Bruijne, M.,
Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) Medical
Image Computing and Computer Assisted Intervention – MICCAI 2021. pp. 495–
505. Springer International Publishing (2021) 2,3,7,8
22. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph
neural network model. IEEE transactions on neural networks 20(1), 61–80 (2008)
2
23. Smistad, E., Østvik, A., Salte, I.M., Melichova, D., Nguyen, T.M., Haugaa, K.,
Brunvand, H., Edvardsen, T., Leclerc, S., Bernard, O., Grenne, B., Løvstakken,
L.: Real-time automatic ejection fraction and foreshortening detection using deep
learning. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control
67(12), 2595–2604 (2020) 2
24. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at
spatiotemporal convolutions for action recognition. CoRR abs/1711.11248 (2017)
2
25. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. CoRR abs/1706.03762 (2017) 2,3
Supplementary Material
Fig. 1. Video Encoder network architecture. We use modular blocks containing 3D
convolutions with residual connections to generate low-dimensional frame embeddings.
Fig. 2. (left) The confusion matrix for our best-performing model. The chosen EF categories indicate different levels of heart failure risk, with patients having EF below 40% needing medical monitoring. (right) The scatter plot showing how close our model's EF estimates are to the ground truth. We see that the model struggles with EF values between 30% and 40%, and we argue that this is due to the high inter-observer variability in the ground truth labels, which is more prominent for samples that lie on pathological boundaries.
Fig. 3. ED/ES frame approximation from learned echo-graph weights: we first use a threshold to convert the sum of outgoing edge weights into binary format (alternatively, frame weights can be used). Please note that this threshold is selected based on aFD performance on the validation set. Consecutive 1-valued weights form a block together. The left-most and right-most frames in each block are the approximated ED and ES locations, respectively. We reject samples where the size of the block is equal to 55, meaning that the model has not learned the periodic nature of the data. Rejecting these samples, we achieve an average frame distance of 4.15 for ES and 3.68 for ED.
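The block-finding procedure described in this caption can be sketched as follows; the threshold value itself is tuned on the validation set and is not reproduced here.

```python
import numpy as np

def approximate_ed_es(edge_weight_sums: np.ndarray, threshold: float):
    """edge_weight_sums: per-frame sums of outgoing edge weights. Frames above the
    threshold are binarized to 1; each block of consecutive 1s yields an approximate
    ED (left-most frame) and ES (right-most frame) location."""
    binary = (edge_weight_sums > threshold).astype(int)
    blocks, start = [], None
    for i, b in enumerate(binary):
        if b and start is None:
            start = i
        elif not b and start is not None:
            blocks.append((start, i - 1))            # (ED index, ES index)
            start = None
    if start is not None:
        blocks.append((start, len(binary) - 1))
    return blocks
```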
Fig. 4. Examples of the model's explainability capability. (left) Examples where the learned frame weights allow clear identification of ES/ED locations. (right) Examples with an atypical zoomed-in AP4 echo, or an echo where the LV is not entirely visible and is cropped; in these cases the model distributes frame weights more evenly, not clearly indicating the position of ED and ES.