
EchoGNN: Explainable Ejection Fraction Estimation with Graph Neural Networks

Masoud Mokhtari^1 [0000-0001-9471-5573], Teresa Tsang^2, Purang Abolmaesumi^1*, and Renjie Liao^1*

^1 Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada
{masoud, purang, rjliao}@ece.ubc.ca

^2 Vancouver General Hospital, Vancouver, BC, Canada
t.tsang@ubc.ca

* Co-Corresponding Authors

Abstract. Ejection fraction (EF) is a key indicator of cardiac function, allowing identification of patients prone to heart dysfunctions such as heart failure. EF is estimated from cardiac ultrasound videos known as echocardiograms (echo) by manually tracing the left ventricle and estimating its volume on certain frames. These estimations exhibit high inter-observer variability due to the manual process and varying video quality. Such sources of inaccuracy and the need for rapid assessment necessitate reliable and explainable machine learning techniques. In this work, we introduce EchoGNN, a model based on graph neural networks (GNNs) to estimate EF from echo videos. Our model first infers a latent echo-graph from the frames of one or multiple echo cine series. It then estimates weights over the nodes and edges of this graph, indicating the importance of individual frames for EF estimation. A GNN regressor uses this weighted graph to predict EF. We show, qualitatively and quantitatively, that the learned graph weights provide explainability through identification of critical frames for EF estimation, which can be used to determine when human intervention is required. On the EchoNet-Dynamic public EF dataset, EchoGNN achieves EF prediction performance on par with the state of the art while providing explainability, which is crucial given the high inter-observer variability inherent in this task. Our source code is publicly available at: https://github.com/MasoudMo/echognn.

Keywords: Ultrasound · Ejection Fraction · Cardiac Imaging · Explainable Models · Graph Neural Networks · Deep Learning

1 Introduction

Ejection fraction (EF) is a ratio indicating the volume of blood pumped by the heart. This measurement is crucial in monitoring cardiovascular health and is a potential indicator of heart failure [9,17]. EF is computed using the stroke volume, which is the blood volume difference in the Left Ventricle (LV) between the End-Systolic (ES) and End-Diastolic (ED) phases of the cardiac cycle, denoted by

ESV and EDV, respectively [2]. These volumes are estimated from ultrasound videos of the heart, i.e., echocardiograms (echo), which involves detecting the frames corresponding to ES and ED and tracing the LV region. The manual process of detecting the correct frames and making proper traces is prone to human error. Therefore, the American Society of Echocardiography recommends performing EF estimation for up to 5 cardiac cycles and averaging the results [16]. However, this guideline is seldom followed in practice, and a single representative beat is selected for evaluation instead. This results in inter-observer variations of 7.6% to 13.9% in the EF ratio [18].

Automatic EF estimation techniques aid professionals by adding another layer of verification. Additionally, with the emergence of Point-of-Care Ultrasound (POCUS) imaging devices, which are routinely used by less experienced echo users, automation of clinical measurements such as EF is further needed [1]. However, to be adopted broadly, such automation techniques must be explainable so that cases requiring human intervention can be detected. Different machine learning (ML) architectures have been proposed to perform automatic EF estimation [12,10,18,21,23], most of which lack reliable explainability mechanisms. Some of these models do not provide confidence estimates for their predictions [18,21,23] or have low accuracy due to unrealistic data augmentation during training and over-reliance on ground truth labels [21].

In this work, we introduce EchoGNN, a novel deep learning model for explainable EF estimation. Our approach first infers a latent graph between frames of one or multiple echo cine series. It then estimates EF based on this latent graph via Graph Neural Networks (GNNs) [22], a class of deep learning models that operate efficiently on graph-structured data. To the best of our knowledge, our work is the first to investigate GNNs in the context of ultrasound videos and EF estimation. Moreover, our work brings explainability through latent graph learning, inspiring further work in this domain. Our contributions are threefold:

• We introduce EchoGNN, a novel deep learning model for explainable EF estimation through GNN-based latent graph learning.

• We present a weakly supervised training pipeline for EF estimation without direct reliance on ground truth ES/ED frame labels.

• Our model has far fewer parameters than prior work, significantly reducing computational and memory requirements.

2 Related Work

Most prior works use Convolutional Neural Networks (CNNs) in their EF estimation pipeline [12,10,18,21]. Ouyang et al. [18] use ResNet-based (2+1)D convolutions [24] to estimate and average EF over all possible 32-frame clips in an echo, while Kazemi Esfeh et al. [12] use a similar approach in a Bayesian Neural Network (BNN) setting. Recent work uses the encoder of ResNetAE [8] to reduce data dimensionality before using transformers [25] to jointly perform ES/ED frame detection and EF estimation [21]. While these methods show different levels of accuracy and success in predicting EF, they either lack explainability or rely heavily on accurate clinical labels, which are inherently noisy and subject to significant inter-observer variability. As an example, the transformer-based approach requires ES/ED frame index labels in addition to EF labels in its training pipeline [21]. Moreover, while Kazemi Esfeh et al. [12] and Jafari et al. [10] report uncertainty on their predictions, they still lack explainable indicators as to why the models fail or succeed for different cases. Our proposed framework based on GNNs aims to alleviate these shortcomings. It provides explainability while relying only on EF labels, without requiring ES/ED frame labels in a supervised manner. Lastly, as an added advantage, the number of parameters of our model is significantly smaller than in prior work, which is highly desirable for deploying such models on mobile clinical devices.

3 Methodology

We consider the following supervised problem for EF estimation: assume that for each patient $i \in [N]$ in dataset $D$, there is a ground truth EF ratio $y_i \in \mathbb{R}$ and $K$ echo videos $x^i_k \in \mathbb{R}^{T \times W \times H}$, where $k \in [K]$, $T$ is the number of frames, and $H$ and $W$ are the height and width of each frame. The goal of our model is to learn a function $f: \mathbb{R}^{K \times T \times H \times W} \to \mathbb{R}$ to estimate EF from echo videos. For notational simplicity, and since our evaluation dataset only contains one video per patient, we assume that $K = 1$. However, it must be noted that our model is flexible in this regard and can handle multiple videos per patient.

3.1 EchoGNN Architecture

As shown in Fig. 1, EchoGNN is composed of three main components: the Video Encoder, the Attention Encoder, and the Graph Regressor. In the following subsections, we discuss the details pertaining to each component.

Video Encoder. The original echo videos are high-dimensional and must be mapped into lower-dimensional embeddings to reduce the memory footprint and remove redundant information. The Video Encoder learns a mapping $f_{ve}: \mathbb{R}^{T \times H \times W} \to \mathbb{R}^{T \times d}$ from an input echo video $x^i \in \mathbb{R}^{T \times W \times H}$ to $d$-dimensional frame embeddings $h^i_j \in \mathbb{R}^d$, where $j \in [T]$ is the frame number. The temporal dimension is preserved because the Attention Encoder requires embeddings for all frames to produce interpretable weights over them. We use a custom network consisting of 3D convolutions and residual connections to exploit both the spatial and temporal information in the video when generating the embeddings. This network's architecture is provided in the supp. material. Lastly, following [25], periodic positional encodings are added to the generated frame embeddings to encode the sequential nature of video data.
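To make this design concrete, below is a minimal PyTorch sketch of a 3D-convolutional Video Encoder that maps a (T, H, W) clip to (T, d) frame embeddings while preserving the temporal dimension and adding sinusoidal positional encodings. The block sizes, the global spatial pooling, and the omission of residual connections are simplifications for illustration; the exact architecture is the one given in the supp. material.

```python
# Sketch of a Video Encoder that keeps the temporal dimension: an echo clip
# of shape (T, H, W) is mapped to per-frame embeddings of shape (T, d).
import math
import torch
import torch.nn as nn


class VideoEncoder(nn.Module):
    def __init__(self, channels=(16, 32, 64), d=128, max_frames=64):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in channels:
            layers += [
                # Spatial stride 2 only, so the number of frames T is preserved.
                nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.BatchNorm3d(out_ch),
                nn.ELU(),
            ]
            in_ch = out_ch
        self.conv = nn.Sequential(*layers)
        self.proj = nn.Linear(channels[-1], d)
        # Sinusoidal positional encodings over the frame index.
        pos = torch.arange(max_frames).unsqueeze(1)
        div = torch.exp(torch.arange(0, d, 2) * (-math.log(10000.0) / d))
        pe = torch.zeros(max_frames, d)
        pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                      # x: (B, T, H, W)
        feat = self.conv(x.unsqueeze(1))       # (B, C, T, H', W')
        feat = feat.mean(dim=(-1, -2))         # global spatial pooling -> (B, C, T)
        h = self.proj(feat.transpose(1, 2))    # per-frame embeddings (B, T, d)
        return h + self.pe[: h.shape[1]]       # add positional encodings
```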

Fig. 1. EchoGNN has three main components. (1) Video Encoder: encodes video frames into vector embeddings while preserving the temporal dimension; (2) Attention Encoder: infers weights over the nodes (video frames) and edges (relationships among frames) of the echo-graph; (3) Graph Regressor: estimates EF using the inferred weighted graph. This figure shows an example where each patient has an apical two-chamber (AP2) and an apical four-chamber (AP4) echo video.

Attention Encoder. For each patient, we construct an echo-graph, which is a complete graph in which each node corresponds to a frame of the echo video and the edges represent the non-Euclidean relationships between these frames. Formally, we denote the echo-graph by $G_{echo}(V, E)$, where $V$ is the set of nodes corresponding to echo frames such that $|V| = T$, and $E$ is the set of edges between the nodes such that if $v_1, v_2 \in V$ are connected, then $e_{v_1,v_2} \in E$. We use the frame embeddings from our Video Encoder as the node features of $G_{echo}$; that is, $\{h^i_1, h^i_2, \ldots, h^i_T\}$ are the features of $\{v_1, v_2, \ldots, v_T\}$. These embeddings can be represented as a matrix $H^i \in \mathbb{R}^{T \times d}$ in which each row is the embedding of one frame of the echo video of patient $i$.
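As a small illustration, the echo-graph can be materialized as a PyG Data object whose node features are the rows of $H^i$ and whose edge index covers every ordered pair of frames. Whether self-loops are included is an implementation detail not specified here; this sketch omits them.

```python
# Sketch: building the complete echo-graph over T frames in PyG style.
# Node features are the per-frame embeddings (T x d) from the Video Encoder;
# every ordered pair of distinct frames receives a directed edge.
import torch
from torch_geometric.data import Data


def build_echo_graph(frame_embeddings: torch.Tensor) -> Data:
    T = frame_embeddings.shape[0]
    src, dst = torch.meshgrid(torch.arange(T), torch.arange(T), indexing="ij")
    mask = src != dst                                   # drop self-loops
    edge_index = torch.stack([src[mask], dst[mask]])    # shape (2, T*(T-1))
    return Data(x=frame_embeddings, edge_index=edge_index)


# Example: a 64-frame clip with 128-dimensional frame embeddings.
graph = build_echo_graph(torch.randn(64, 128))
```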

Inspired by [14], we propose using GNNs to learn and assign weights to both the edges and the nodes of the echo-graph. The edge and node weights are learned to encode the importance of each frame (node weights) and of the relationships among frames (edge weights) for the final EF estimation.

The Attention Encoder infers weights over the edges and nodes of the echo-graph using message-passing GNNs [7]. A single message-passing step is enough for each node to capture information from all other nodes because the echo-graph is a complete graph. More specifically, the following operations are used to obtain the weight of each edge $e_{v_k,v_s}$:

$$u_{k,s} = \mathrm{MLP}_1([h^i_k \,\|\, h^i_s]) \qquad (\text{node} \rightarrow \text{edge}) \quad (1)$$
$$v_s = \mathrm{MLP}_2\Big(\textstyle\sum_{k \neq s} u_{k,s}\Big) \qquad (\text{edge} \rightarrow \text{node}) \quad (2)$$
$$z_{k,s} = \mathrm{MLP}_3([v_k \,\|\, v_s]) \qquad (\text{node} \rightarrow \text{edge}) \quad (3)$$
$$a_{k,s} = \sigma(z_{k,s}), \quad (4)$$

where $\sigma$ is the sigmoid function, $[\cdot \,\|\, \cdot]$ is the concatenation operator, and $a_{k,s} \in [0,1]$ is the inferred weight for the directed edge from $v_k$ to $v_s$. Similarly, weights $w_s \in [0,1]$ for each node are generated by inserting another edge $\rightarrow$ node operation after Eq. 3. All MLPs use two fully connected linear layers with ELU [4] activation and batch normalization.
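A dense PyTorch sketch of Eqs. (1)-(4) follows. The hidden sizes, the ordering of normalization and activation inside each MLP, and the exact point at which the scalar node weights are read out (here, an extra edge-to-node MLP over the summed edge scores) are assumptions rather than the reference implementation.

```python
# Sketch of the Attention Encoder message passing (Eqs. 1-4). Operates densely
# on the complete echo-graph: H is the (T, d) frame-embedding matrix, and the
# output is a (T, T) edge-weight matrix A plus a (T,) node-weight vector w.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=128):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ELU(), nn.BatchNorm1d(hidden),
        nn.Linear(hidden, out_dim),
    )


class AttentionEncoder(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.mlp1 = mlp(2 * d, d)   # node -> edge   (Eq. 1)
        self.mlp2 = mlp(d, d)       # edge -> node   (Eq. 2)
        self.mlp3 = mlp(2 * d, 1)   # node -> edge   (Eq. 3)
        self.mlp4 = mlp(1, 1)       # extra edge -> node step for the node weights

    def forward(self, H):                                      # H: (T, d)
        T, d = H.shape
        h_k = H.unsqueeze(1).expand(T, T, d)                   # h_k for edge (k, s)
        h_s = H.unsqueeze(0).expand(T, T, d)
        u = self.mlp1(torch.cat([h_k, h_s], -1).reshape(T * T, -1)).reshape(T, T, -1)
        no_self = 1.0 - torch.eye(T, device=H.device)          # mask out k == s
        v = self.mlp2((u * no_self.unsqueeze(-1)).sum(dim=0))  # Eq. 2: sum over k != s
        v_k = v.unsqueeze(1).expand(T, T, -1)
        v_s = v.unsqueeze(0).expand(T, T, -1)
        z = self.mlp3(torch.cat([v_k, v_s], -1).reshape(T * T, -1)).reshape(T, T)
        A = torch.sigmoid(z)                                   # Eq. 4: edge weights a_{k,s}
        w = torch.sigmoid(self.mlp4((z * no_self).sum(0, keepdim=True).t()))
        return A, w.squeeze(-1)                                # (T, T) edge, (T,) node weights
```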

Regressor. Our Regressor network applies GNN layers to the learned weighted echo-graph to perform EF estimation. Specifically, for each patient, the output of the Attention Encoder can be represented as a weighted adjacency matrix $A \in [0,1]^{T \times T}$ and a node weight vector $w \in [0,1]^{T}$. The Regressor uses $A$ to generate embeddings over the frames of the echo video:

$$H^{l} = g^{l}(A, H^{l-1}), \qquad l = 1, \ldots, L, \quad (5)$$

where $H^{l} \in \mathbb{R}^{T \times d_g}$ is the matrix of learned node embeddings at layer $l$, $H^{0}$ is the matrix of frame embeddings from the Video Encoder, and $g^{l}$ is composed of a Graph Convolutional Network (GCN) layer [15] followed by batch normalization and ELU activation. To represent the whole graph with a single vector embedding, the node embeddings are averaged using the frame weights $w$ generated by the Attention Encoder:

$$h^{i}_{\mathrm{graph}} = \frac{\sum_{j=1}^{T} w_j \, H^{L}_j}{\sum_{j=1}^{T} w_j}, \quad (6)$$

where $H^{L}_j \in \mathbb{R}^{d_g}$ is the $j$th row of $H^{L}$, and $w_j$ is the $j$th scalar weight in the frame weight vector. $h^{i}_{\mathrm{graph}}$ is mapped to an EF estimate using an MLP with two fully connected linear layers, ELU activation, and batch normalization.
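The following sketch illustrates Eqs. (5)-(6) using PyG's DenseGCNConv on the dense weight matrix $A$, followed by the node-weighted mean readout and an MLP head. The layer widths follow Sec. 4.2, but the head and normalization details are illustrative assumptions rather than the exact implementation.

```python
# Sketch of the Regressor (Eqs. 5-6): dense GCN layers over the weighted
# adjacency A, then a node-weighted mean readout and an MLP head.
import torch
import torch.nn as nn
from torch_geometric.nn import DenseGCNConv


class Regressor(nn.Module):
    def __init__(self, d=128, dims=(128, 64, 32)):
        super().__init__()
        in_dims = (d,) + dims[:-1]
        self.gcns = nn.ModuleList(DenseGCNConv(i, o) for i, o in zip(in_dims, dims))
        self.norms = nn.ModuleList(nn.BatchNorm1d(o) for o in dims)
        self.act = nn.ELU()
        self.head = nn.Sequential(nn.Linear(dims[-1], 16), nn.ELU(), nn.Linear(16, 1))

    def forward(self, H, A, w):            # H: (T, d), A: (T, T), w: (T,)
        x = H.unsqueeze(0)                 # DenseGCNConv expects a batch dimension
        for gcn, norm in zip(self.gcns, self.norms):
            x = self.act(norm(gcn(x, A.unsqueeze(0)).squeeze(0))).unsqueeze(0)
        x = x.squeeze(0)                   # final node embeddings (T, d_L)
        h_graph = (w.unsqueeze(-1) * x).sum(0) / w.sum()   # weighted mean (Eq. 6)
        return self.head(h_graph).squeeze(-1)              # scalar EF estimate
```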

Learning Algorithm. The model is differentiable end-to-end. We therefore train it with gradient descent, using the Mean Absolute Error (MAE) between the predicted EF estimates $\tilde{y}_i$ and the ground truth EF values $y_i \in Y$ as the optimization objective: $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} |\tilde{y}_i - y_i|$.

4 Experiments

4.1 Dataset

We use the EchoNet-Dynamic public EF dataset, consisting of 10,030 AP4 echo videos obtained between 2016 and 2018 at Stanford University Hospital. Each echo frame has a dimension of 112 × 112, and the dataset provides ESV, EDV, LV contour tracings, and EF ratios for each patient [18]. We use the dataset's provided splits over mutually exclusive patients, with 7465 samples for training, 1288 samples for validation, and 1277 samples for testing. The data distribution in the training set is unbalanced, with only 12.7% of samples having an EF ratio below 40%. Clinically, however, such patients are the most critical to detect for timely intervention [3,11].

Frame Sampling: To stay within reasonable memory requirements, we use a fixed number of frames per echo, denoted by $T_{\mathrm{fixed}}$. During training, we uniformly sample an initial frame index $j$ in $[1, T^{i}_{\mathrm{total}} - T_{\mathrm{fixed}}]$, where $T^{i}_{\mathrm{total}}$ is the total number of frames in echo video $i$, and then use the $T_{\mathrm{fixed}}$ frames starting from $j$. Following [18], we set $T_{\mathrm{fixed}}$ to 64 and use zero padding in the temporal dimension when $T^{i}_{\mathrm{total}} < T_{\mathrm{fixed}}$. During test time, we extract multiple back-to-back clips, each containing $T_{\mathrm{fixed}}$ frames, with the first clip starting from index 0. We use zero padding in the temporal dimension if $T^{i}_{\mathrm{total}} < T_{\mathrm{fixed}}$ and overlap the last clip with the previous one if it overshoots $T^{i}_{\mathrm{total}}$. We independently estimate EF for each clip and report the average prediction.
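The sampling scheme can be summarized by the sketch below, which draws one random $T_{\mathrm{fixed}}$-frame clip during training and tiles the video with back-to-back (possibly overlapping or zero-padded) clips at test time; function names are illustrative, not the released implementation.

```python
# Sketch of the frame-sampling scheme for a video tensor of shape (T_total, H, W).
import torch

T_FIXED = 64


def sample_train_clip(video: torch.Tensor) -> torch.Tensor:
    t_total = video.shape[0]
    if t_total <= T_FIXED:                         # zero-pad short videos in time
        pad = torch.zeros(T_FIXED - t_total, *video.shape[1:])
        return torch.cat([video, pad], dim=0)
    start = torch.randint(0, t_total - T_FIXED + 1, (1,)).item()
    return video[start:start + T_FIXED]            # one random T_FIXED-frame clip


def sample_test_clips(video: torch.Tensor) -> list:
    t_total = video.shape[0]
    if t_total <= T_FIXED:
        return [sample_train_clip(video)]          # single zero-padded clip
    starts = list(range(0, t_total, T_FIXED))
    # Overlap the last clip with the previous one if it overshoots T_total.
    starts[-1] = min(starts[-1], t_total - T_FIXED)
    return [video[s:s + T_FIXED] for s in starts]  # EF is averaged over these clips
```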

Data Augmentation: Occasionally, the AP4 echo is zoomed in on the LV region for certain clinical studies [5,20]. To allow learning of this under-represented distribution, we augment our training set by using a fixed 90 × 72 cropping window centered at the top of each frame and interpolating the result back to the original 112 × 112 dimension, which creates the desired zoom-in effect.
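A minimal sketch of this augmentation is shown below, assuming the 90 × 72 window is 90 pixels high and 72 pixels wide and horizontally centered; the exact anchoring convention is an assumption.

```python
# Sketch of the zoom-in augmentation: crop a fixed 90 x 72 window anchored at
# the top of each 112 x 112 frame, then resize back to 112 x 112.
import torch
import torch.nn.functional as F


def zoom_in(clip: torch.Tensor) -> torch.Tensor:    # clip: (T, 112, 112)
    h, w = 90, 72
    left = (clip.shape[-1] - w) // 2                 # horizontally centred window
    crop = clip[:, :h, left:left + w]                # anchored at the top row
    crop = crop.unsqueeze(1)                         # (T, 1, 90, 72) for interpolation
    out = F.interpolate(crop, size=(112, 112), mode="bilinear", align_corners=False)
    return out.squeeze(1)                            # zoomed-in clip, (T, 112, 112)
```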

4.2 Implementation

The Video Encoder uses custom convolution blocks with 16, 32, 64, 128, and 256 channels. The Attention Encoder uses a hidden dimension of 128 for its MLP layers, and the Regressor uses a 3-layer GNN with 128, 64, and 32 hidden dimensions followed by an MLP with a hidden dimension of 16. We use the Adam optimizer [13] with a learning rate of 1e-4, a batch size of 80, and 2500 training epochs. Our framework is implemented using PyTorch [19] and PyG [6], and training was performed on two Nvidia Titan V GPUs. Pretraining: We use ES/ED index labels in a pretraining step to train the Video Encoder and the Attention Encoder to give higher weights to ES and ED frames. Classification Loss: We bin the EF values into 4 ranges, $[0, 30]$, $(30, 40]$, $(40, 55]$, and $(55, 100]$, and use a cross-entropy loss encouraging the model to learn EF's clinical categories [3].
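A small sketch of the binning and auxiliary loss is given below; the per-bin logits (and the weighting between the MAE and cross-entropy terms) come from a hypothetical auxiliary head, since the exact way the class scores are produced is not detailed here.

```python
# Sketch of the auxiliary classification loss: bin EF into the four clinical
# ranges [0, 30], (30, 40], (40, 55], (55, 100] and apply cross-entropy on
# (hypothetical) per-bin logits produced by the model.
import torch
import torch.nn.functional as F

EF_BIN_EDGES = torch.tensor([30.0, 40.0, 55.0])


def ef_to_bin(ef: torch.Tensor) -> torch.Tensor:
    """Map EF percentages of shape (B,) to class indices 0..3."""
    return torch.bucketize(ef, EF_BIN_EDGES)        # right-closed bins by default


def classification_loss(bin_logits: torch.Tensor, ef: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(bin_logits, ef_to_bin(ef))


# Example combined objective (the 0.1 weighting is an assumption):
# loss = (ef_pred - ef_true).abs().mean() + 0.1 * classification_loss(logits, ef_true)
```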

4.3 Results and Discussion

Explainability. The key advantage of EchoGNN over prior work is the explainability it provides through the learned weights on the echo-graph. As shown in Fig. 2, the learned weights can indicate when human intervention is required. We observe two different scenarios: (1) the model learns the periodic nature of echo videos and assigns larger weights to frames and edges that lie between the ES and ED phases before performing EF estimation. This means that the locations of ES and ED can be approximated from these weights, as illustrated in Fig. 2. (2) The model cannot detect the locations of the ES and ED frames and distributes the weights more evenly. In these cases, we have either an atypical zoomed-in AP4 echo or an echo where the LV is not entirely visible and is cropped. In such cases, an expert can evaluate the video and determine whether new videos must be obtained. More explainability examples are provided in the supp. material.

To quantitatively measure the explainability of EchoGNN, for the cases where the model learns the periodic nature of the data (1173 samples out of 1277), we use the average frame distance (aFD) as in [21], computed as $\mathrm{aFD} = \frac{1}{N}\sum_{i=1}^{N} |j_i - \tilde{j}_i|$, where $j_i$ and $\tilde{j}_i$ are the true and approximated indices, respectively, for sample $i$. As shown in Table 1, our model achieves better ED aFD and comparable ES aFD without using ground-truth ES/ED locations for training, whereas Reynaud et al. [21] use such supervision. This demonstrates the explainability power of EchoGNN. aFD computation details are provided in the supp. material.
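For completeness, a minimal implementation of the aFD metric reads as follows.

```python
# Sketch of the average frame distance (aFD) metric used to evaluate the
# ES/ED locations recovered from the learned weights, following [21].
from typing import Sequence


def afd(true_idx: Sequence[int], pred_idx: Sequence[int]) -> float:
    """Mean absolute difference between true and approximated frame indices."""
    assert len(true_idx) == len(pred_idx)
    return sum(abs(j - j_hat) for j, j_hat in zip(true_idx, pred_idx)) / len(true_idx)


# Example: afd([12, 40, 71], [14, 38, 70]) == (2 + 2 + 1) / 3
```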

Fig. 2. (Top) An example where the model has learned the periodic nature of the data, and the learned weights allow identification of the ES/ED locations. (Bottom) An example where the LV region is cropped (as shown by the arrow), and the learned weights are distributed more evenly, indicating the need for expert intervention.

EF Estimation. To evaluate the error in predicted EF values, we use the Mean Absolute Error (MAE). Additionally, as a measure of the amount of explained variance in the data, we report the model's $R^2$ score. Moreover, we report the $F_1$ score for the task of indicating whether EF values are lower than 40%, which is a strong indicator of heart failure [11].

As shown in Table 1, our model significantly outperforms [21] without direct supervision of ES and ED frame locations during training. Our model has predictive performance similar to [12] with a much lower number of parameters and the added benefit of explainability through the learned latent graph structures. EchoNet (AF) [18] requires large amounts of RAM due to sampling all 32-frame clips in a video, which prevented us from training and evaluating the model ourselves. We therefore only report results from the paper and cannot produce additional metrics such as the $F_1$ score, which is not originally reported; this is shown as N/A in Table 1. This model's weak performance compared to ours shows the sensitivity of EchoNet (AF) to frame locations in a clip. Lastly, our model has a significantly lower number of parameters, making it desirable for deployment on mobile clinical devices. Our model's EF scatter plot and confusion matrix are provided in the supp. material.
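The reported metrics can be reproduced from predicted and ground-truth EF values with a few scikit-learn calls, as in the sketch below.

```python
# Sketch of the reported evaluation metrics: MAE and R^2 on the predicted EF
# values, plus the F1 score for flagging EF below 40%.
import numpy as np
from sklearn.metrics import f1_score, mean_absolute_error, r2_score


def evaluate_ef(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
        "F1_below_40": f1_score(y_true < 40.0, y_pred < 40.0),
    }
```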

Table 1. Summary of quantitative results. Lower values are better for all metrics besides $R^2$ and $F_1$. EchoNet (AF) averages predictions over all possible 32-frame clips in a sampled video. Transformer (R) and (M) are transformer-based models with different sampling techniques. The Bayesian model uses BNNs. We mark the models that cannot predict ES/ED locations with "-" in the aFD metric. EchoGNN is the only model that provides explainability and ES/ED location estimations without direct supervision.

Model                | R2   | MAE  | F1 <40% | ES aFD | ED aFD | #params (x10^6)
EchoNet (AF) [18]    | 0.4  | 7.35 | N/A     | -      | -      | 31.5
Transformer (R) [21] | 0.48 | 6.76 | 0.70    | 2.86   | 7.88   | 346.8
Transformer (M) [21] | 0.52 | 5.95 | 0.55    | 3.35   | 7.17   | 346.8
Bayesian [12]        | 0.75 | 4.46 | 0.77    | -      | -      | 31.5
EchoGNN (ours)       | 0.76 | 4.45 | 0.78    | 4.15   | 3.68   | 1.7

4.4 Ablation Study

In Table 2, we see that the classification loss improves the model's performance for under-represented samples, while pretraining and data augmentation reduce the EF error and increase the model's ability to represent the variance in the data.

Table 2. Ablation study results. The Aug., Class., and Pretrain columns indicate whether the model uses data augmentation, the classification loss, and pretraining, respectively. We see that the classification loss improves performance for under-represented groups, while pretraining and data augmentation reduce the overall EF error.

Aug. | Class. | Pretrain | R2   | MAE  | F1 <40%
 ✓   |   ✗    |    ✗     | 0.75 | 4.48 | 0.76
 ✓   |   ✓    |    ✗     | 0.74 | 4.59 | 0.77
 ✓   |   ✗    |    ✓     | 0.75 | 4.47 | 0.73
 ✗   |   ✓    |    ✓     | 0.75 | 4.47 | 0.77
 ✓   |   ✓    |    ✓     | 0.76 | 4.45 | 0.78


5 Limitations

While our model outperforms prior works for EF estimation and also provides explainability, there are certain limitations that can be addressed in future work. Firstly, while the explainability provided over the frames and edges of the echo-graph allows identification of cases that need closer inspection, it does not allow finding the regions within each frame that the model is uncertain about. We argue that an attention map over the pixels of each frame could further help with explainability. Secondly, creating a complete graph for long videos leads to a large memory cost. While this is not an issue for echo, where videos are relatively short, alternative graph construction methods should be considered for longer videos.

6 Conclusion

In this work, we introduced a deep learning model that provides the benefit of explainability via GNN-based latent graph learning. While we showcased the success of our framework for EF estimation, we argue that the same pipeline could be used for other datasets and problems, introducing a new paradigm for video processing and prediction tasks from clinical data and beyond.

Acknowledgements. This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canadian Institutes of Health Research (CIHR), and computational resources provided by Advanced Research Computing at the University of British Columbia.

References

1. Amaral, C., Ralston, D., Becker, T.: Prehospital point-of-care ultrasound: A transformative technology. SAGE Open Medicine 8 (2020)
2. Bamira, D., Picard, M.: Imaging: Echocardiology—assessment of cardiac structure and function. In: Vasan, R.S., Sawyer, D.B. (eds.) Encyclopedia of Cardiovascular Research and Medicine, pp. 35–54. Elsevier, Oxford (2018)
3. Carroll, M.: Ejection fraction: Normal range, low range, and treatment (Nov 2021), https://www.healthline.com/health/ejection-fraction
4. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv: Learning (2016)
5. Ferraioli, D., Santoro, G., Bellino, M., Citro, R.: Ventricular septal defect complicating inferior acute myocardial infarction: A case of percutaneous closure. Journal of Cardiovascular Echography 29, 17 (2019)
6. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. In: ICLR Workshop on Representation Learning on Graphs and Manifolds (2019)
7. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. CoRR abs/1704.01212 (2017)
8. Hou, B.: ResNetAE (2019), https://github.com/farrell236/ResNetAE
9. Huang, H., Nijjar, P., Misialek, J., Blaes, A., Derrico, N., Kazmirczak, F., Klem, I., Farzaneh-Far, A., Shenoy, C.: Accuracy of left ventricular ejection fraction by contemporary multiple gated acquisition scanning in patients with cancer: Comparison with cardiovascular magnetic resonance. Journal of Cardiovascular Magnetic Resonance 19 (2017)
10. Jafari, M.H., Woudenberg, N.V., Luong, C., Abolmaesumi, P., Tsang, T.: Deep Bayesian image segmentation for a more robust ejection fraction estimation. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 1264–1268 (2021)
11. Kalogeropoulos, A.P., Fonarow, G.C., Georgiopoulou, V., Burkman, G., Siwamogsatham, S., Patel, A., Li, S., Papadimitriou, L., Butler, J.: Characteristics and outcomes of adult outpatients with heart failure and improved or recovered ejection fraction. JAMA Cardiology 1(5), 510–518 (2016)
12. Kazemi Esfeh, M.M., Luong, C., Behnami, D., Tsang, T., Abolmaesumi, P.: A deep Bayesian video analysis framework: Towards a more robust estimation of ejection fraction. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, pp. 582–590. Springer International Publishing (2020)
13. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. International Conference on Learning Representations (2014)
14. Kipf, T., Fetaya, E., Wang, K.C., Welling, M., Zemel, R.: Neural relational inference for interacting systems (2018)
15. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
16. Lang, R.M., Badano, L.P., Mor-Avi, V., Afilalo, J., Armstrong, A., Ernande, L., Flachskampf, F.A., Foster, E., Goldstein, S.A., Kuznetsova, T., Lancellotti, P., Muraru, D., Picard, M.H., Rietzschel, E.R., Rudski, L., Spencer, K.T., Tsang, W., Voigt, J.U.: Recommendations for cardiac chamber quantification by echocardiography in adults: An update from the American Society of Echocardiography and the European Association of Cardiovascular Imaging. Journal of the American Society of Echocardiography 28(1), 1–39.e14 (2015)
17. Loehr, L., Rosamond, W., Chang, P., Folsom, A., Chambless, L.: Heart failure incidence and survival (from the Atherosclerosis Risk in Communities study). The American Journal of Cardiology 101, 1016–1022 (2008)
18. Ouyang, D., He, B., Ghorbani, A., Yuan, N., Ebinger, J., Langlotz, C., Heidenreich, P., Harrington, R., Liang, D., Ashley, E., Zou, J.: Video-based AI for beat-to-beat assessment of cardiac function. Nature 580 (2020)
19. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019)
20. Patil, V., Patil, H.: Isolated non-compaction cardiomyopathy presented with ventricular tachycardia. Heart Views 12, 74–78 (2011)
21. Reynaud, H., Vlontzos, A., Hou, B., Beqiri, A., Leeson, P., Kainz, B.: Ultrasound video transformers for cardiac ejection fraction estimation. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, pp. 495–505. Springer International Publishing (2021)
22. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Transactions on Neural Networks 20(1), 61–80 (2008)
23. Smistad, E., Østvik, A., Salte, I.M., Melichova, D., Nguyen, T.M., Haugaa, K., Brunvand, H., Edvardsen, T., Leclerc, S., Bernard, O., Grenne, B., Løvstakken, L.: Real-time automatic ejection fraction and foreshortening detection using deep learning. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 67(12), 2595–2604 (2020)
24. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. CoRR abs/1711.11248 (2017)
25. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. CoRR abs/1706.03762 (2017)

Supplementary Material


Fig. 1. Video Encoder network architecture. We use modular blocks containing 3D

convolutions with residual connections to generate low-dimensional frame embeddings.

Fig. 2. (Left) The confusion matrix for our best-performing model. The chosen EF categories indicate different levels of heart failure risk, with patients having EF below 40% needing medical monitoring. (Right) The scatter plot showing how close our model's EF estimates are to the ground truth. We see that the model struggles with EF values between 30% and 40%, and we argue that this is due to the high inter-observer variability in the ground truth labels, which is more prominent for samples that lie on pathological boundaries.


Fig. 3. ED/ES frame approximation from the learned echo-graph weights: we first use a threshold to turn the sum of outgoing edge weights into binary form (alternatively, the frame weights can be used). Please note that this threshold is selected based on aFD performance on the validation set. Consecutive 1-valued weights form a block; the left-most and right-most frames of each block are the approximated ED and ES locations, respectively. We reject samples where the size of the block is equal to 55, meaning that the model has not learned the periodic nature of the data. Rejecting these samples, we achieve an average frame distance of 4.15 for ES and 3.68 for ED.
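A minimal sketch of this ED/ES read-out procedure is given below; the threshold value and the exact rejection rule are assumptions to be tuned on the validation set, as described above.

```python
# Sketch of the ES/ED approximation: threshold the per-frame sum of outgoing
# edge weights, find blocks of consecutive above-threshold frames, and read ED
# from the left edge and ES from the right edge of each block.
import numpy as np


def approximate_ed_es(A: np.ndarray, threshold: float, reject_len: int):
    """A: (T, T) learned edge weights. Returns a list of (ed_idx, es_idx) pairs."""
    binary = (A.sum(axis=1) > threshold).astype(int)   # per-frame outgoing-weight sum
    pairs, start = [], None
    for t, flag in enumerate(binary):
        if flag and start is None:
            start = t                                   # block opens -> candidate ED
        elif not flag and start is not None:
            pairs.append((start, t - 1))                # block closes -> candidate ES
            start = None
    if start is not None:
        pairs.append((start, len(binary) - 1))
    # Reject degenerate blocks (e.g. spanning most of the clip), which indicate
    # that the model has not picked up the periodic structure of the echo.
    return [(ed, es) for ed, es in pairs if es - ed + 1 < reject_len]
```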

Fig. 4. Examples of the model's explainability capability. (Left) Examples where the learned frame weights allow clear identification of the ES/ED locations. (Right) Examples with an atypical zoomed-in AP4 echo or an echo where the LV is not entirely visible and is cropped; the model therefore distributes the frame weights more evenly, not clearly indicating the positions of ED and ES.