All content in this area was uploaded by Kobby Panford-Quainoo on Oct 03, 2020

Bilateral Trade Modeling with Graph Neural Networks

Kobby Panford-Quainoo∗

African Masters in Machine Intelligence

African Institute for Mathematical Sciences

Kigali, Rwanda

kpanford-quainoo@aimsammi.org

Avishek Joey Bose

Department of Computer Science

McGill University and Mila

Montreal, Canada

joey.bose@mail.mcgill.ca

Michaël Defferrard

Institute for Electrical Engineering

École Polytechnique Fédérale de Lausanne

Lausanne, Switzerland

michael.defferrard@epfl.ch

Abstract

Bilateral trade agreements confer preferred trading status between participating countries, enabling increased trade and potential economic growth. Predictions of such trade flows often serve as important economic indicators for economists and policy makers, with significant ramifications for the economic policies adopted by the respective countries. However, traditional approaches to predicting potential trade partners rely on gravity methods, which are cumbersome to define due to the exponentially growing number of constants that need to be considered. In this work, we present a framework for directly predicting bilateral trade partners from observed trade records using graph representation learning. Furthermore, we show as a downstream task that modeling bilateral trade as a graph allows for the classification of countries into various income levels. Empirically, we observe accuracies of up to 98% for predicting trading partners and 68% on income level classification.

1 Introduction

International trade involves the exchange of goods, capital, and services between countries, and where two countries are concerned, it is referred to as bilateral trade. Often, the deficits and surpluses created via bilateral trade represent important economic development indicators, which drive the adoption of specific domestic economic policies, e.g., relaxation of restrictions and trade barriers, in either country. Consequently, economists have employed various models to understand trade patterns and the factors that account for the observed trade activities between countries. For instance, the Ricardian model introduced the idea of comparative advantage of nations, whereby a country exports more of the goods it can produce at a lower cost [1]. In a similar vein, the "factor of abundance" argues that the trading behavior of a country is influenced by what it produces in abundance [2].

The most popular method with practical benefits is known as the Gravity Model of trade, which is motivated by Newton's law of gravitation. The gravity model relates the bilateral trade flows between two countries using the respective gross domestic product (GDP) of each country while taking into account the geographical distance. Intuitively, trade flow is high when participating countries have high GDPs and are geographically close to each other [3]. While the gravity model is an effective empirical measure for bilateral trade flow, it lacks a theoretical justification [4] and suffers from practical limitations. In particular, model performance is dictated by handcrafted features such as cultural differences and political terms, which require significant domain knowledge.

Gravity models show that countries with high GDPs will have a high trade flow and will more likely trade with each other compared to countries with low GDPs [3]. Conversely, trade flow is smaller between countries that are far apart, and such countries are less likely to trade. Trade flow, therefore, serves as the basis for predicting potential trade partners for countries. This is an important task in economics because it allows policymakers to relax restrictions and trade barriers to foster the partnership between the countries and consequently expand their economic capacities. One difficulty with using the gravity model is that many dummy variables, such as cultural differences and political terms, must be handcrafted and factored into the equation. This makes it very expensive to capture the information that actually affects trade patterns.

∗ Code can be found at https://github.com/panford/BiTrade-Graphs.

Submitted to the African Institute for Mathematical Sciences for a Master's degree in Machine Intelligence (AIMS 2019), Rwanda.

Present work. In this paper, we take a data-driven approach to modeling bilateral trade. We first observe that trade flows can naturally be interpreted as a graph wherein countries are nodes and edges represent countries undertaking bilateral trade. We leverage recent advances in graph representation learning and predict trade links between countries, crucially without first estimating trade flow heuristics. We further analyze the graphical structure of trade relationships between countries and use it to power a supervised learning approach to predict the income levels of countries using graph neural networks (GNNs). Empirically, we observe 98% and 68% accuracies in predicting bilateral trade links and income levels, respectively. Our work is motivated by the difficulty in estimating the trade flow, and we intend to tackle trade partner prediction and income level classification from a graph perspective. Our main contributions are:

• We show that international trade data can naturally be modelled as a graph.

• We directly predict trade links between countries without first having to estimate the trade flow values between them.

• We show that the trade relationship between countries could be a major ingredient when predicting their income levels.

2 Background

2.1 Country Classiﬁcation

The World Bank defines four income groups: high, upper middle, lower middle and low income. The division into these groups is based on a measure of total annual income, the gross national income (GNI) per capita. The GNI of a country gives an idea of its economic strengths and weaknesses and, in general, of the standard of living of the average citizen. A country is assigned to an income group if its GNI per capita falls within the corresponding threshold range, defined in Table 1 [5]. This classification by income level can be used to measure progress over time or to analyse data for countries falling into the same income group.

Income group    GNI per capita threshold (USD)
low             1,005 and below
lower middle    1,006 - 3,955
upper middle    3,956 - 12,235
high            12,236 and above

Table 1: Country GNI and income group according to World Bank, 2017.

2.2 Bilateral Trade Flows

Inspired by Newton's law of universal gravitation, the gravity model provides a theoretical approach to representing the numerical trade strength between any two countries. The gravity model is used to compute a trade flow value: the strength of trade flow between any two countries increases with their respective net income or GDP and decreases with increasing distance [3, 6]. It is expressed as

$$F_{ij} = M \, \frac{\mathrm{GDP}_i \cdot \mathrm{GDP}_j}{D_{ij}},$$

where $F_{ij}$ is the trade flow between countries $i$ and $j$, $\mathrm{GDP}_i$ is the GDP of country $i$, $M$ is a proportionality constant and $D_{ij}$ is the geographical distance between countries $i$ and $j$. A more convenient way to deal with this equation is to express it in logarithmic form and introduce coefficients and placeholder variables to account for other unanticipated factors which are not exactly deterministic. The gravity equation may then be expressed in the form

$$\ln F_{ij} = c_0 + c_1 \ln \mathrm{GDP}_i + c_2 \ln \mathrm{GDP}_j + c_3 \ln D_{ij} + c_4 d + c_5 P_{ij} + \epsilon_{ij},$$

where $c_k$ are constants, $d$ is a placeholder variable, $P_{ij}$ is a political influence term and $\epsilon_{ij}$ is an error correction term.
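As a small illustration of the two forms of the gravity equation, the following sketch evaluates both on made-up numbers. The function names, the coefficient values and the inputs are assumptions chosen for illustration and do not come from the paper's data. With $c_0 = \ln M$, $c_1 = c_2 = 1$, $c_3 = -1$ and the remaining terms set to zero, the log-linear form should agree with the logarithm of the basic equation.

```python
import math


def gravity_flow(gdp_i, gdp_j, distance, m=1.0):
    """Basic gravity equation: F_ij = M * GDP_i * GDP_j / D_ij."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return m * gdp_i * gdp_j / distance


def log_gravity_flow(gdp_i, gdp_j, distance, coeffs, d=0.0, p_ij=0.0, eps=0.0):
    """Log-linear form: c0 + c1 ln GDP_i + c2 ln GDP_j + c3 ln D_ij + c4 d + c5 P_ij + eps."""
    c0, c1, c2, c3, c4, c5 = coeffs
    return (c0 + c1 * math.log(gdp_i) + c2 * math.log(gdp_j)
            + c3 * math.log(distance) + c4 * d + c5 * p_ij + eps)


# Hypothetical inputs (GDPs in USD, distance in km); not taken from the paper.
flow = gravity_flow(gdp_i=2.0e12, gdp_j=5.0e11, distance=1.0e3)

# Coefficients that reproduce the basic equation with M = 1 (so c0 = ln 1 = 0).
log_flow = log_gravity_flow(2.0e12, 5.0e11, 1.0e3,
                            coeffs=(0.0, 1.0, 1.0, -1.0, 0.0, 0.0))
```

In practice the coefficients $c_k$ would be estimated from observed flows rather than fixed by hand; the fixed values here only demonstrate that the two forms are consistent.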

2.3 Graph Neural Networks

Given a graph $G = (V, X, A, E)$, individual entities are referred to as nodes $V$ with some characteristic node features $X \in \mathbb{R}^{|V| \times D}$, where $|V|$ and $D$ are the number of nodes and features respectively. An edge $a_{ij}$ is said to exist between nodes $i$ and $j$ if they are connected. The edges can be collected into a dense square adjacency matrix $A \in \mathbb{R}^{|V| \times |V|}$, which is symmetric if the graph is undirected and in general not symmetric if it is directed. Edges in a directed graph have arrows going from one node $i$ to another node $j$ to show that node $i$ is connected to node $j$. The reverse need not be true, so in general $a_{ij} \neq a_{ji}$. On the other hand, edges in an undirected graph have no direction, i.e. $a_{ij} = a_{ji}$. Edge weights $E$ are a numerical indication of the strength of the relationship between the connected nodes.

Graph Neural Networks (GNNs) are a family of approaches that aim to generalize neural networks, developed for Euclidean data, to graphs [7]. They can tackle tasks such as node [8, 9], graph [10, 11, 12] and edge classification [13], link prediction [14] and node clustering [15] or community detection [16].

Node, graph and edge classification problems involve discriminating between classes of nodes, graphs and edges and providing labels to unlabeled ones at test time. Link prediction is the task of predicting missing links between two nodes in the graph. Clustering is an unsupervised learning technique that leverages similarities in features to put data points into inherent groups rather than predefined target labels. Node clustering and community detection² therefore seek to detect groups of nodes, referred to as clusters or communities, in the absence of target labels.

Depending on the task at hand, several Graph Neural Network techniques have been proposed and many improvements have also been developed. These are based on specific applications and other properties of the graph such as being directed (e.g. followers on Twitter) or heterogeneous (e.g. paper and author nodes in a citation network) and edges having weights or features (e.g. net import and export trade value between two countries) [17].

2.3.1 ChebNet

Some of the early techniques proposed in graph representation learning sought to learn the local neighbourhood structure of nodes in graphs by using filters that share the structure, and aim to reproduce the success, of those originally used on images. Bruna et al. [18] introduced the spectral network, a convolutional network based on spectral filtering. Spectral graph theory defines the convolution operation on graphs as

$$g_\theta(L) * x = g_\theta(U \Lambda U^\top)\, x. \tag{1}$$

Here, $g_\theta$ is the spectral filter defined over $x$, and $U$ and $\Lambda$ are the eigenvectors and eigenvalues of the Laplacian $L$ respectively. Defferrard et al. [19] use a local approximation to avoid the expensive computation of the Laplacian eigenvectors ($U$) by defining filters as polynomials of the Laplacian:

$$g_\theta(\Lambda) * x = \sum_{k=0}^{K-1} \theta_k \Lambda^k x. \tag{2}$$

² Node clustering and community detection are terms used by the machine learning on graphs and data mining communities respectively.

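They approximate the filter with the Chebyshev polynomials $T_k(\tilde{L})$ in equation (4), evaluated at the rescaled Laplacian $\tilde{L}$ of equation (3):

$$\tilde{L} = 2L/\lambda_{\max} - I_n. \tag{3}$$

$$g_\theta(L) * x = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L})\, x. \tag{4}$$

Here, $\lambda_{\max}$ is the largest eigenvalue of $L$, $\Lambda^k$ in equation (2) is the $k$-th power of the diagonal eigenvalue matrix, so that $L^k = U \Lambda^k U^\top$, and $\theta_k$ is the Chebyshev coefficient of order $k$. This network is referred to as ChebNet.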
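The filter in equation (4) can be evaluated without any eigendecomposition by using the standard Chebyshev recurrence $T_0(\tilde{L})x = x$, $T_1(\tilde{L})x = \tilde{L}x$ and $T_k(\tilde{L})x = 2\tilde{L}\,T_{k-1}(\tilde{L})x - T_{k-2}(\tilde{L})x$. The following is a minimal pure-Python sketch under the assumption that $\lambda_{\max}$ is already known; the helper names, the two-node graph and the coefficients are hypothetical and are not taken from the authors' code.

```python
def matvec(M, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def rescaled_laplacian(L, lambda_max):
    """Equation (3): L~ = 2 L / lambda_max - I."""
    n = len(L)
    return [[2.0 * L[i][j] / lambda_max - (1.0 if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def chebyshev_filter(L_tilde, x, theta):
    """Equation (4): sum_k theta_k T_k(L~) x, computed with the recurrence."""
    t_prev = list(x)               # T_0(L~) x
    t_curr = matvec(L_tilde, x)    # T_1(L~) x
    out = [theta[0] * v for v in t_prev]
    if len(theta) > 1:
        out = [o + theta[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(theta)):
        t_next = [2.0 * a - b for a, b in zip(matvec(L_tilde, t_curr), t_prev)]
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out


# Toy example: two connected nodes, combinatorial Laplacian L = D - A,
# whose eigenvalues are 0 and 2 (so lambda_max = 2).
L = [[1.0, -1.0], [-1.0, 1.0]]
L_tilde = rescaled_laplacian(L, lambda_max=2.0)
y = chebyshev_filter(L_tilde, [1.0, 0.0], theta=[0.5, 0.25, 0.1])
```

Each additional order costs one more matrix-vector product, which is why the polynomial filter is cheap on sparse graphs.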

2.3.2 Graph Convolutional Network (GCN)

GCN [8] is a GNN variant that approximates ChebNet to first order by keeping only the $k = 0$ and $k = 1$ terms of equation (4) and setting $\lambda_{\max} = 2$. The convolution filter on the graph reduces to

$$g_\theta * x \approx \theta \left( I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \right) x, \tag{5}$$

where $A$ is the adjacency matrix and $D$ is the diagonal degree matrix, whose entry $D_{ii}$ is the number of neighbours of node $i$. In equation (5), $A$ and $D$ can be re-normalized by adding the identity matrix $I_N$, giving $\tilde{A} = A + I_N$ and the corresponding degree matrix $\tilde{D}$, so that the features and states of the node itself are captured. This yields the aggregation function $F = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X$ and the node update $H = F\Theta$.

By stacking a sufficient number of GCN layers, hierarchical information can be extracted from the graph-structured data by propagating local messages across layers. The message propagation rule between hidden layers $l$ and $l+1$ is given by equation (6):

$$h^{(l+1)} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} h^{(l)} W^{(l)} \right) \tag{6}$$

$$z = \mathrm{softmax}\left( h^{(l+1)} W^{(1)} \right) \tag{7}$$

where $\sigma$ is the activation function, $\tilde{D}$ is the degree matrix of the graph with self-loops, $\tilde{A}$ is the adjacency matrix with self-loops and $W$ is the weight matrix. $z$ in equation (7) is the output of the final softmax layer used in a semi-supervised classification task.
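To make the propagation rule in equation (6) concrete, here is a minimal sketch of a single GCN layer on dense nested lists, assuming ReLU as the activation $\sigma$. The helper names and the toy graph are illustrative assumptions; the experiments in this paper use PyTorch Geometric rather than code like this.

```python
import math


def normalized_adjacency(A):
    """Return D~^{-1/2} A~ D~^{-1/2}, where A~ = A + I adds self-loops."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in A_tilde]  # diagonal of D~
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(P, Q):
    """Dense matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]


def relu(M):
    return [[max(0.0, v) for v in row] for row in M]


def gcn_layer(A, H, W, activation=relu):
    """One propagation step: activation(D~^{-1/2} A~ D~^{-1/2} H W)."""
    return activation(matmul(matmul(normalized_adjacency(A), H), W))


# Toy graph: two connected nodes with one feature each (hypothetical values).
A = [[0.0, 1.0], [1.0, 0.0]]
H0 = [[1.0], [3.0]]
W0 = [[2.0]]
H1 = gcn_layer(A, H0, W0)
```

With self-loops, each node in this toy graph averages its own feature with its neighbour's before the weight matrix is applied, so both nodes end up with the same hidden state.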

2.3.3 Graph Attention Network (GAT)

Graph Attention Networks (GAT) [20] use self-attention to compute attention coefficients that weight the nodes in each node's neighbourhood by importance. It is an extension of attention, as proposed by Bahdanau et al. [21], that allows each neighbouring node to be attended to. The following equations summarize the attention mechanism used by GAT.

$$e_{ij} = a(W h_i, W h_j), \tag{8}$$

where $e_{ij}$ is the attention score between node $i$ and node $j$, computed from the shared weight $W$ and the hidden states of $i$ and $j$. This provides a new set of feature information about node $i$ and its neighboring nodes, learned together with a trainable parameter $a$ which in effect aligns the attention scores $e_{ij}$ with the output features. Using softmax, the attention scores are normalised into coefficients $\alpha_{ij}$, resulting in equation (10).

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}), \tag{9}$$

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(\mathrm{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_k]\right)\right)}, \tag{10}$$

where $\|$ denotes concatenation and $\mathcal{N}(i)$ is the neighbourhood of node $i$.

Veličković et al. [20] also define multi-head attention on graphs by concatenating $K$ independent attention mechanisms, as in equation (12). Each of these attention mechanisms implements equation (11).

$$h_i^{t+1} = \sigma\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j \right) \tag{11}$$

$$h_i^{t+1} = \Big\Vert_{k=1}^{K} \sigma\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{k} W^{k} h_j \right) \tag{12}$$

In the case where the output of the last hidden layer of the network is used for prediction with a sigmoid or softmax activation, the features from the $K$ attention heads are averaged, as shown in equation (13).

$$h_i^{t+1} = \sigma\left( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{k} W^{k} h_j \right) \tag{13}$$
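The normalisation in equation (10) and the weighted sum inside equation (11) can be sketched for a single attention head as follows. This is an illustrative pure-Python version, assuming the transformed features $W h_i$ are already computed and using a LeakyReLU slope of 0.2; the function names, the slope value and the toy inputs are assumptions rather than details given in this paper.

```python
import math


def leaky_relu(v, slope=0.2):
    """LeakyReLU with an assumed negative slope of 0.2."""
    return v if v > 0 else slope * v


def attention_coefficients(Wh, neighbours, a):
    """Equation (10) for one head.

    Wh         -- list of transformed feature vectors W h_i
    neighbours -- dict mapping node i to a non-empty list of neighbours j
    a          -- attention vector whose length is twice the feature size
    """
    alpha = {}
    for i, nbrs in neighbours.items():
        # Score each neighbour on the concatenation [W h_i || W h_j].
        scores = [leaky_relu(sum(w * v for w, v in zip(a, Wh[i] + Wh[j])))
                  for j in nbrs]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(nbrs, exps)}
    return alpha


def aggregate(Wh, alpha, i):
    """Weighted sum inside equation (11), before the activation is applied."""
    dim = len(Wh[i])
    return [sum(alpha[i][j] * Wh[j][d] for j in alpha[i]) for d in range(dim)]


# Hypothetical toy input: three nodes with identical transformed features,
# so node 0 should attend equally to its two neighbours.
Wh = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
alpha = attention_coefficients(Wh, {0: [1, 2]}, a=[0.3, -0.1, 0.2, 0.4])
h0 = aggregate(Wh, alpha, 0)
```

Multi-head attention as in equation (12) would repeat this with $K$ separate pairs of $W^k$ and $a^k$ and concatenate the activated results.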

2.3.4 Attention-Based Graph Neural Network (AGNN)

While the attention computed for neighboring nodes does not change going from one layer to another, AGNN [22] uses a layer-wise parameter $\alpha$ that learns to weight the neighboring nodes of node $i$ according to their contribution to the label of the target node. This is done by storing a single $\alpha^{(t)}$ for each layer $t \in \{1, 2, \ldots, l\}$, where $l$ is the number of layers. Propagation of hidden state information across layers is guided by the rule in equation (14).

$$h^{(t+1)} = F^{(t)} h^{(t)} \tag{14}$$

$$F_i^{(t)} = \mathrm{softmax}\left( \left[ \alpha^{(t)} \cos\left(h_i^{(t)}, h_j^{(t)}\right) \right]_{j \in \mathcal{N}(i) \cup \{i\}} \right) \tag{15}$$

$F_i^{(t)}$ is the propagation vector computed at layer $t$ as a summary, by relevance, of each neighboring node $j$ to node $i$.
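Equations (14) and (15) can be read as one step of similarity-weighted averaging over each node's neighbourhood, including the node itself. The sketch below is a minimal illustration under the assumption that all hidden states are non-zero vectors, so that the cosine is defined; the function names and the toy inputs are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def agnn_propagate(H, neighbours, alpha_t):
    """One AGNN step: row i of F is a softmax over j in N(i) and i itself
    of alpha_t * cos(h_i, h_j); the function returns F H."""
    H_next = []
    for i, h_i in enumerate(H):
        idx = sorted(set(neighbours[i]) | {i})
        scores = [alpha_t * cosine(h_i, H[j]) for j in idx]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        H_next.append([sum(w * H[j][d] for w, j in zip(weights, idx))
                       for d in range(len(h_i))])
    return H_next


# Toy graph with two connected nodes and orthogonal hidden states.
H = [[1.0, 0.0], [0.0, 1.0]]
neighbours = {0: [1], 1: [0]}
uniform = agnn_propagate(H, neighbours, alpha_t=0.0)  # equal weights
sharp = agnn_propagate(H, neighbours, alpha_t=5.0)    # favours similar nodes
```

Setting $\alpha^{(t)} = 0$ reduces the step to a plain average over the neighbourhood, while larger values concentrate the weight on nodes whose hidden states point in a similar direction.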

2.3.5 Graph Auto-Encoder (GAE)

The models discussed so far are (semi-)supervised learning methods on graph data, where each example has a label showing the class it belongs to. We now discuss the Graph Auto-Encoder (GAE) [23], an extension of auto-encoders [24, 25] for unsupervised learning on graphs. GAE primarily learns a latent representation of the data without looking at the labels and the edge directions. GAE is composed of an encoder (inference model) that learns a code by minimizing a reconstruction loss, and a decoder (generative model) whose output is a reconstruction of the original representation from the code. Kipf and Welling [23] proposed a GCN encoder function

$$f(z \mid X, A) = \mathrm{GCN}(X, A) \tag{16}$$

and a decoder function

$$p(A \mid z) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij} \mid z_i, z_j), \tag{17}$$

where $p(A_{ij} \mid z_i, z_j)$ can be

$$p(A_{ij} = 1 \mid z_i, z_j) = \sigma(z_i^\top z_j), \tag{18}$$

which is specialised in predicting new edges between nodes in the newly constructed adjacency matrix.
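The inner-product decoder of equation (18) is simple enough to sketch directly. The example below assumes the latent codes $z_i$ have already been produced by some encoder; the function names, the 0.5 decision threshold and the toy codes are illustrative assumptions.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode(Z):
    """Equation (18): edge probability sigma(z_i . z_j) for every node pair."""
    n = len(Z)
    return [[sigmoid(sum(a * b for a, b in zip(Z[i], Z[j])))
             for j in range(n)] for i in range(n)]


def predict_edges(Z, threshold=0.5):
    """Return undirected pairs (i, j), i < j, whose probability exceeds the threshold."""
    probs = decode(Z)
    n = len(Z)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if probs[i][j] > threshold]


# Hypothetical latent codes: nodes 0 and 1 point the same way, node 2 opposes them.
Z = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]
edges = predict_edges(Z)
```

Because the decoder depends only on inner products, the reconstructed adjacency matrix is symmetric, which matches the remark that GAE ignores edge directions.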

2.3.6 Variational Graph Auto-Encoder (VGAE)

The inference model of the Variational Graph Auto-Encoder (VGAE) learns a latent code which follows a controlled distribution with an encoder function $q(z \mid X, A)$. The encoder is generally made up of a simple two-layer GCN that outputs the parameters of a normal distribution over the latent code:

$$q(z \mid X, A) = \prod_{i=1}^{N} q(z_i \mid X, A), \quad \text{where } q(z_i \mid X, A) = \mathcal{N}\left(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2)\right). \tag{19}$$

The generative model then reconstructs a new adjacency matrix with a decoder function $p(A \mid z)$ (same as shown for GAEs). Training is done by maximizing the variational lower bound:

$$\mathcal{L} = \mathbb{E}_{q(Z \mid X, A)}[\log p(A \mid Z)] - \mathrm{KL}[q(Z \mid X, A) \,\|\, p(Z)]. \tag{20}$$
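The two VGAE-specific ingredients, sampling from $q(z_i \mid X, A)$ and the KL term of equation (20), can be sketched for a single node as below. The sketch assumes a standard normal prior $p(Z) = \mathcal{N}(0, I)$ and an encoder that outputs the mean and the log-variance; neither assumption is stated in the text above, and the function names are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterised sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(sigma^2)) || N(0, I)] in closed form for one node."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
z = sample_latent(mu=[0.5, -1.0], log_var=[0.0, 0.0], rng=rng)

kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # q equals the prior
kl_shifted = kl_to_standard_normal([1.0], [0.0])         # mean shifted by one
```

The KL term vanishes when the encoder output matches the prior and grows as the mean moves away from zero or the variance away from one, which is what keeps the latent code close to the controlled distribution.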

Table 2: Summary of data features and representation.

Feature        Notation  Number  Representation
nodes          V         111     countries
node features  X         38      population etc.⁴
edges          A         476     1 if countries i and j have traded and 0 otherwise
edge weights   E         476     net trade value (USD)
node labels    Y         4       income group

3 Data and Tasks

Here, we show our proposed approach to representing countries in the graph-structured bilateral trade data for income-level classification and trade partner prediction. We first describe the basic components of a typical graph and relate each to a component in our data.

Our primary goals are to (1) classify countries into their respective income levels: high, upper middle, lower middle and low, and (2) predict potential trade partners. These are essentially multi-class node classification and link prediction tasks. We trained Graph Neural Network models and baseline models to perform these tasks. We evaluate node classification by classification accuracy on test examples, and link prediction by area under the curve (AUC) and average precision (AP). All experiments were performed on data specifically collected for this study, and we do not include results on the standard benchmark datasets traditionally used to test the efficiency of novel approaches. In the next subsection, we describe how we collected the data, the representation approach we took and the GNN models adopted for the downstream tasks.

3.1 Data Collection and Representation

The data used for this study are taken from two different sources: the United Nations Comtrade Database and Kaggle. The United Nations Comtrade Database³ is an international trade database containing the reporter-partner trade statistics collated for about 170 countries over a certain period of time. These trade statistics include (1) imports, exports, re-exports, and re-imports, (2) commodities exchanged between the trade partner and reporter, and (3) trade value in US dollars.

The data retrieved from Kaggle, on the other hand, contain the profiles of specific countries, which include geographical, financial, geological, and other information. To ensure consistency, we used data for a particular year and accumulated the trade and profile information for a total of 111 countries, along with their income groups, which are used as target labels.

Each country considered is a node in our graph and has 38 node features, which are simply the collected profile information. We then used the net trade balance between the countries to construct an adjacency matrix such that there is an edge between two countries if their trade balance is not zero and no edge otherwise. An entry of 1 is assigned if there is an edge and 0 otherwise. The trade values are in US dollars and constitute the edge weight matrix. A summary of features and the representation is provided in Table 2.
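The construction described above can be sketched as follows. The country codes and trade values are made up for illustration, the function name is hypothetical, and the sketch assumes an undirected graph in which the same net trade value is stored in both directions, which is one possible reading of Table 2.

```python
def trade_graph(countries, net_trade):
    """Build the adjacency matrix A and edge weight matrix E.

    countries -- list of country identifiers (one node per country)
    net_trade -- dict mapping a pair (a, b) to the net trade balance in USD
    """
    index = {c: k for k, c in enumerate(countries)}
    n = len(countries)
    A = [[0] * n for _ in range(n)]
    E = [[0.0] * n for _ in range(n)]
    for (a, b), value in net_trade.items():
        if value != 0:  # an edge exists only when the balance is non-zero
            i, j = index[a], index[b]
            A[i][j] = A[j][i] = 1
            E[i][j] = E[j][i] = value
    return A, E


# Hypothetical records; these are not values from the Comtrade data.
countries = ["GHA", "RWA", "CAN"]
net_trade = {("GHA", "RWA"): 1.2e6, ("RWA", "CAN"): 0.0}
A, E = trade_graph(countries, net_trade)
num_edges = sum(sum(row) for row in A) // 2
```

A pair with a zero balance, such as the second record here, contributes no edge, so it becomes a candidate link for the prediction task.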

4 Experiments

We evaluate our approach to modeling bilateral trade using GNNs in two settings: link prediction of withheld edges between trading countries and income classification of countries (node classification) as a separate downstream task. We train all GNN-based models using PyTorch Geometric [26], with 80% of the data used for training and 20% for testing. For optimization of model parameters, we use the Adam optimizer [27], while hyperparameters are tuned using Bayesian optimization [28, 29].

³ https://comtrade.un.org/data/

⁴ See appendix for full list of features.

Figure 1: (a) The graph of countries (numbered for country reference) and links from trade data. (b) The graph of countries with node color indicating their income level.

Figure 2: (1) The input graph has partially labelled nodes and edges of different widths corresponding to the values of trade flow. The node classifier infers the labels of the remaining nodes. (2) The link predictor predicts missing edges, illustrated with dotted lines.

5 Setup and Baselines

In addition to GNNs, we test a multi-layer perceptron (MLP) and a logistic regression model as baseline models for node classification. Both the MLP and the logistic regression model consume node features as input, but critically do not have access to any underlying graph structure in the form of an adjacency matrix. We hypothesize that effective learning requires the utilization of local neighborhood information available via the adjacency matrix; for example, the adoption of specific domestic trade policies in one country can influence similar policies in its neighbors.

5.1 Hyperparameter tuning

In our experiments, we used Bayesian optimization [28, 29]⁵ to tune the hyperparameters. We defined the bounds on the parameter space as shown in Table 3 and obtained the best results with a learning rate of 0.001 and a weight decay of 0.005 for all GNN models. The baseline linear model performed better at a different hyperparameter setting of 0.4601 for the learning rate and 0.06976 for the weight decay.

5.2 Node Classiﬁcation

Table 3: Hyperparameters and predefined optimization bounds.

Hyperparameter  Lower bound  Upper bound
learning rate   0.0001       1.0
weight decay    0.005        1.0

⁵ Code available at https://github.com/fmfn/BayesianOptimization.

We used the GCN, ChebNet, GAT and AGNN models described in Section 2 to perform multi-class node classification on the input graph. We split the data into train and test sets by randomly masking out 20% of the total nodes and edge indices from the adjacency matrix and using the remaining 80% for training. The learned model is then used to predict labels for the masked-out set of nodes. We then report the classification accuracy on the test set as a measure of model performance.
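The node part of this split and the reported metric can be sketched with boolean masks, in the spirit of the train and test masks commonly used for node classification. The function names, the fixed seed and the rounding of the test set size are assumptions made for illustration; the authors' exact masking procedure may differ.

```python
import random


def split_nodes(num_nodes, test_fraction=0.2, seed=0):
    """Randomly mask out a fraction of the nodes for testing."""
    rng = random.Random(seed)
    order = list(range(num_nodes))
    rng.shuffle(order)
    num_test = round(test_fraction * num_nodes)
    test_idx = set(order[:num_test])
    train_mask = [k not in test_idx for k in range(num_nodes)]
    test_mask = [k in test_idx for k in range(num_nodes)]
    return train_mask, test_mask


def accuracy(predictions, labels, mask):
    """Fraction of correct predictions among the nodes selected by the mask."""
    picked = [(p, y) for p, y, m in zip(predictions, labels, mask) if m]
    return sum(p == y for p, y in picked) / len(picked)


# 111 countries, as in Table 2.
train_mask, test_mask = split_nodes(111)
```

Only the labels of nodes under the training mask are used when fitting the model, and `accuracy` is then evaluated with the test mask.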

5.3 Link Prediction

In the link prediction task, we used the graph autoencoder (GAE) and variational graph autoencoder (VGAE) [23] to learn a latent representation of the input graph. The input graph has few observed edges and hence a sparse adjacency matrix. This input graph is fed into the link prediction model, which then reconstructs a new adjacency matrix representing a new neighborhood structure for each node in the graph. We evaluate the reconstruction accuracy using the area under the curve (AUC) and average precision (AP) metrics. In Table 5, we report high AUC and AP scores for both GAE and VGAE, which indicates that these models can reliably reconstruct the adjacency matrix from the learned latent representation.
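For readers unfamiliar with the two metrics, the sketch below computes them from scores assigned to held-out true edges (positives) and sampled non-edges (negatives). It is a plain-Python illustration with made-up scores and hypothetical function names; it is not the evaluation code used for Table 5.

```python
def roc_auc(scores_pos, scores_neg):
    """Probability that a random positive outranks a random negative
    (ties count as one half)."""
    wins = sum((p > q) + 0.5 * (p == q)
               for p in scores_pos for q in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))


def average_precision(scores_pos, scores_neg):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted([(s, 1) for s in scores_pos] + [(s, 0) for s in scores_neg],
                    key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / len(scores_pos)


# Hypothetical decoder outputs for two true edges and two non-edges.
pos = [0.9, 0.6]
neg = [0.7, 0.1]
auc = roc_auc(pos, neg)           # 3 of 4 pairs are ordered correctly
ap = average_precision(pos, neg)  # (1/1 + 2/3) / 2
```

Both metrics depend only on the ranking of the scores, so they do not require choosing a decision threshold for the reconstructed adjacency matrix.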

5.4 Results and Evaluation

Figure 3: Accuracy on test sets.

On the node classification task, Table 4 summarizes the mean results over 100 runs. The GCN has the best accuracy (0.6812). ChebNet (0.6436) is competitive with the linear baseline model (0.6491).

Figure 4: AUCs and APs on adjacency matrix reconstruction by (a) GAE and (b) VGAE.

Table 4: Results for multi-class node classification task.

               GCN     ChebNet  GAT     AGNN    Linear  Logistic Regression
Test accuracy  0.6812  0.6436   0.6158  0.6003  0.6491  0.5758

Table 5: Results for link prediction task.

                   GAE     VGAE
AUC                0.9840  0.9888
Average precision  0.9835  0.9896


6 Discussion and Conclusion

In this paper, we approach modeling bilateral trade and related downstream tasks (prediction of potential trade partners among countries and income level classification) as a problem in graph representation learning. We first leverage historical mutual trade relationships to construct a graph where nodes are countries and edges represent active trade between any two given countries, and then apply graph neural networks. Our approach points to a new direction for machine learning, particularly graph representation learning, and its application in the field of Economics. Empirically, we confirm that our approach does well for the intended tasks and can potentially aid future trade analysis. While we considered modeling trade as a static graph, an exciting future direction is to model the time evolution of bilateral trade as a dynamic graph. This will encourage analyses of (1) how countries evolve from one income class to the other with time and (2) how trade activities between countries change over time. In the latter case, a temporal prediction of an edge between any two countries will be an indication of how likely they are to partner in trade in the future.

References

[1] David Ricardo. On the Principles of Political Economy and Taxation, chapter 1. Batoche Books, 52 Eby Street South, Kitchener, Ontario, N2G 3L1, Canada, 3 edition, 1817.

[2] Patrick Steiner. Determinants of bilateral trade flows, 2015.

[3] Alan Deardorff. Determinants of bilateral trade: Does gravity work in a neoclassical world? The Regionalization of the World Economy, pages 7–32, 1998.

[4] James E Anderson. The gravity model. Annual Review of Economics, 2011.

[5] World Bank Data Team. New country classifications by income level, 2017. Last accessed 31 December 2019.

[6] Thomas Chaney. The gravity equation in international trade: An explanation. Working Paper 19285, National Bureau of Economic Research, August 2013.

[7] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, Jan 2009.

[8] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[9] Smriti Bhagat, Graham Cormode, and S. Muthukrishnan. Node classification in social networks. CoRR, abs/1101.3291, 2011.

[10] Edouard Pineau and Nathan de Lara. Graph classification with recurrent variational neural networks. CoRR, abs/1902.02721, 2019.

[11] Jia Li, Yu Rong, Hong Cheng, Helen Meng, Wen-bing Huang, and Junzhou Huang. Semi-supervised graph classification: A hierarchical graph perspective. CoRR, abs/1904.05003, 2019.

[12] Antoine Jean-Pierre Tixier, Giannis Nikolentzos, Polykarpos Meladianos, and Michalis Vazirgiannis. Classifying graphs as images with convolutional neural networks. CoRR, abs/1708.02218, 2017.

[13] C. Aggarwal, G. He, and P. Zhao. Edge classification in networks. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 1038–1049, May 2016.

[14] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. CoRR, abs/1802.09691, 2018.

[15] Ramnath Balasubramanyan, Frank Lin, and William W. Cohen. Node clustering in graphs: An empirical study. 2010.

[16] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.

[17] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. CoRR, abs/1812.08434, 2018.

[18] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2013.

[19] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. CoRR, abs/1606.09375, 2016.

[20] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. International Conference on Learning Representations, 2018. Accepted as poster.

[21] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[22] Kiran K. Thekumparampil, Sewoong Oh, Chong Wang, and Li-Jia Li. Attention-based graph neural network for semi-supervised learning, 2018.

[23] Thomas N Kipf and Max Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning, 2016.

[24] Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59:291–294, 1988.

[25] Geoffrey E Hinton and Richard S. Zemel. Autoencoders, minimum description length and Helmholtz free energy. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 3–10. Morgan-Kaufmann, 1994.

[26] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

[27] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[28] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, NIPS'12, pages 2951–2959, Red Hook, NY, USA, 2012. Curran Associates Inc.

[29] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.