Bilateral Trade Modeling
with Graph Neural Networks
Kobby Panford-Quainoo
African Masters in Machine Intelligence
African Institute for Mathematical Sciences
Kigali, Rwanda
kpanford-quainoo@aimsammi.org
Avishek Joey Bose
Department of Computer Science
McGill University and Mila
Montreal, Canada
joey.bose@mail.mcgill.ca
Michaël Defferrard
Institute for Electrical Engineering
École Polytechnique Fédérale de Lausanne
Lausanne, Switzerland
michael.defferrard@epfl.ch
Abstract
Bilateral trade agreements confer preferred trading status between participating countries, enabling increased trade and potential economic growth. Predictions of such trade flows often serve as important economic indicators for economists and policy makers, with impactful ramifications for the economic policies adopted by the respective countries. However, the traditional approach to predicting potential trade partners relies on gravity methods, which are cumbersome to define due to the exponentially growing number of constants that need to be considered. In this work, we present a framework for directly predicting bilateral trade partners from observed trade records using graph representation learning. Furthermore, we show as a downstream task that modeling bilateral trade as a graph allows for the classification of countries into various income levels. Empirically, we observe accuracies of up to 98% for predicting trading partners and 68% on income level classification.
1 Introduction
International trade involves the exchange of goods, capital, and services between countries; where only two countries are concerned, it is referred to as bilateral trade. Often, the deficits and surpluses created via bilateral trade represent important economic development indicators, which drive the adoption of specific domestic economic policies, i.e., the relaxation of restrictions and trade barriers, in either country. Consequently, economists have employed various models to understand trade patterns and the factors that account for the observed trade activities between countries. For instance, the Ricardian model introduced the idea of the comparative advantage of nations, whereby a country exports more of the goods it can produce at a lower cost [1]. In a similar vein, the "factor of abundance" theory argues that the trading behavior of a country is influenced by what it can confidently produce in abundance [2].
The most popular method with practical benefits is the Gravity Model of trade, which is motivated by Newton's law of gravitation. The gravity model relates the bilateral trade flow between two countries to their respective gross domestic products (GDPs) while taking into account the geographical distance between them. Intuitively, trade flow is high when the participating countries have high GDPs and are geographically close to each other [3]. While the gravity model is an effective empirical measure of bilateral trade flow, it lacks a theoretical justification [4] and suffers from practical limitations. In particular, model performance is dictated by handcrafted features, such as cultural differences and political terms, that require significant domain knowledge.

Code can be found at https://github.com/panford/BiTrade-Graphs.
Submitted to the African Institute for Mathematical Sciences for a Master's degree in Machine Intelligence (AIMS 2019), Rwanda.
Gravity models show that countries with high GDPs will have a high trade flow and are more likely to trade with each other compared to countries with low GDPs [3]. Conversely, trade flow is smaller when countries are far apart, making them less likely to trade. Trade flow therefore serves as the basis for predicting potential trade partners. This is an important task in economics because it allows policymakers to relax restrictions and trade barriers to foster partnerships between countries and consequently expand their economic capacities. One difficulty with using the gravity model is that many other dummy variables, such as cultural differences and political terms, must be handcrafted and factored into the equation. This makes capturing the information that actually affects trade patterns very expensive.
Present work. In this paper, we take a data-driven approach to modeling bilateral trade. We first observe that trade flows can naturally be interpreted as a graph wherein countries are nodes and edges represent countries undertaking bilateral trade. We leverage recent advances in graph representation learning to predict trade links between countries, crucially without first estimating trade flow heuristics. We further analyze the graphical structure of trade relationships between countries and use it to power a supervised learning approach that predicts the income levels of countries using graph neural networks (GNNs). Empirically, we observe 98% and 68% accuracies in predicting bilateral trade links and income levels, respectively. Our work is motivated by the difficulty of estimating trade flow, and we tackle trade partner prediction and income level classification from a graph perspective. Our main contributions are:
- To show that international trade data can naturally be modelled as a graph.
- To directly predict trade links between countries without first having to estimate the trade flow values between them.
- To show that the trade relationships between countries can be a major ingredient when predicting their income levels.
2 Background
2.1 Country Classification
The World Bank defines four income groups: high, upper-middle, lower-middle, and low income. The division into these income groups is based on the gross national income (GNI) per capita. The GNI of a country gives an idea of its economic strengths and weaknesses and, in general, the standard of living of the average citizen. A country is assigned to an income group when its GNI per capita falls within the corresponding threshold, defined in Table 1 [5]. This classification by income levels can be used to measure progress over time or to analyse data for countries falling into the same income group.
    income group    GNI per capita (USD)
    low             1,005 and below
    lower middle    1,006 – 3,955
    upper middle    3,956 – 12,235
    high            12,236 and above

Table 1: Country GNI thresholds and income groups according to the World Bank, 2017.
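As a concrete illustration, assignment to an income group amounts to a simple threshold lookup over these GNI values. The sketch below is a minimal example of that mapping; the function name and the example values are ours, not from the paper.

```python
def income_group(gni_per_capita: float) -> str:
    """Map GNI per capita (USD) to a World Bank income group (2017 thresholds)."""
    if gni_per_capita <= 1005:
        return "low"
    if gni_per_capita <= 3955:
        return "lower middle"
    if gni_per_capita <= 12235:
        return "upper middle"
    return "high"

print(income_group(780))   # low
print(income_group(9000))  # upper middle
```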
2.2 Bilateral Trade Flows
Inspired by Newton's law of universal gravitation, the gravity model provides a theoretical approach to representing the numerical trade strength between any two countries. The gravity model computes a trade flow value whose strength increases with the respective net incomes or GDPs of the two countries and decreases with the distance between them [3, 6]. It is expressed as

$$F_{ij} = M \, \frac{GDP_i \cdot GDP_j}{D_{ij}},$$

where $F_{ij}$ is the trade flow between countries $i$ and $j$, $GDP_i$ is the GDP of country $i$, $M$ is a proportionality constant, and $D_{ij}$ is the geographical distance between countries $i$ and $j$. A more convenient way to deal with this equation is to express it in log form and introduce coefficients and placeholder variables to account for other unanticipated factors which are not exactly deterministic. The gravity equation may then be expressed as

$$\ln F_{ij} = c_0 + c_1 \ln GDP_i + c_2 \ln GDP_j + c_3 \ln D_{ij} + c_4 d + c_5 P_{ij} + \epsilon_{ij},$$

where the $c_k$ are constants, $d$ is a placeholder (dummy) variable, $P_{ij}$ is a political influence term, and $\epsilon_{ij}$ is an error correction term.
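As an illustration, the log-linear form can be estimated by ordinary least squares over country-pair records. The sketch below is a minimal example under assumed data: the `trade` DataFrame, its column names, and the toy values are hypothetical stand-ins, and the dummy and political terms are omitted.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical country-pair records; a real study would use Comtrade flows.
trade = pd.DataFrame({
    "flow":  [1.2e9, 3.4e8, 5.6e7, 7.8e8],     # bilateral trade flow F_ij (USD)
    "gdp_i": [2.1e12, 2.1e12, 4.0e11, 9.5e11],
    "gdp_j": [1.6e12, 3.9e11, 2.2e11, 1.1e12],
    "dist":  [5500.0, 980.0, 7200.0, 1300.0],  # distance D_ij (km)
})

# Regress ln F_ij on ln GDP_i, ln GDP_j and ln D_ij.
X = np.log(trade[["gdp_i", "gdp_j", "dist"]])
y = np.log(trade["flow"])
ols = LinearRegression().fit(X, y)
print(ols.intercept_, ols.coef_)  # c0, then (c1, c2, c3)
```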
2.3 Graph Neural Networks
Given a graph $G = (V, X, A, E)$, individual entities are referred to as nodes $V$ with characteristic node features $X \in \mathbb{R}^{|V| \times D}$, where $|V|$ and $D$ are the number of nodes and features respectively. An edge $a_{ij}$ is said to exist between nodes $i$ and $j$ if they are connected. The edges can be composed into a dense square adjacency matrix $A \in \mathbb{R}^{|V| \times |V|}$, which may or may not be symmetric depending on whether the graph is directed. Edges in a directed graph have arrows going from one node $i$ to another node $j$ to show that node $i$ is connected to node $j$; the reverse is not necessarily true, and in general $a_{ij} \neq a_{ji}$. On the other hand, edges in an undirected graph have no direction, i.e. $a_{ij} = a_{ji}$. The edge weights $E$ are a numerical indication of the strength of the relationship between connected nodes.
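As a minimal illustration of this notation (toy edges, not the trade graph), the snippet below builds a directed adjacency matrix where $a_{ij} \neq a_{ji}$ and symmetrizes it into an undirected one.

```python
import numpy as np

n = 3
A = np.zeros((n, n))
A[0, 1] = 1  # directed edge 0 -> 1, so a_01 = 1 while a_10 = 0
A[2, 0] = 1  # directed edge 2 -> 0

A_undirected = ((A + A.T) > 0).astype(float)  # undirected: a_ij == a_ji
print(np.array_equal(A_undirected, A_undirected.T))  # True
```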
Graph Neural Networks (GNNs) are a family of approaches that aim to generalize neural networks, developed for Euclidean data, to graphs [7]. They can tackle tasks such as node [8, 9], graph [10, 11, 12] and edge classification [13], link prediction [14], and node clustering [15] or community detection [16].
Node, graph, and edge classification problems involve discriminating between classes of nodes, graphs, and edges, and providing labels to unlabeled ones at test time. Link prediction is the task of predicting missing links between two nodes in the graph. Clustering is an unsupervised learning technique that leverages similarities in features to put data points into inherent groups rather than predefined target labels. Node clustering and community detection² therefore seek to detect groups of nodes, referred to as clusters or communities, in the absence of target labels.
Depending on the task at hand, several graph neural network techniques have been proposed, and many improvements have been developed. These are based on specific applications and other properties of the graph, such as being directed (e.g., followers on Twitter) or heterogeneous (e.g., paper and author nodes in a citation network), or having edge weights or features (e.g., the net import and export trade value between two countries) [17].
2.3.1 ChebNet
Some of the early techniques proposed in graph representation learning sought to learn the local neighbourhood structure of nodes by using filters that share structural resemblances with, and build on the successes of, those originally used on images. Bruna et al. [18] introduced the spectral network, a convolutional network based on spectral filtering. Spectral graph theory defines the convolution operation on graphs as

$$g_\theta(L)\,x = g_\theta(U \Lambda U^\top)\,x. \qquad (1)$$

Here, $g_\theta$ is the spectral filter applied to the signal $x$, and $\Lambda$ and $U$ are the eigenvalues and eigenvectors of the Laplacian $L$ respectively. Defferrard et al. [19] avoid the expensive eigendecomposition of the Laplacian by defining filters as polynomials of its eigenvalues:

$$g_\theta(\Lambda)\,x = \sum_{k=0}^{K-1} \theta_k \Lambda^k x. \qquad (2)$$
² Node clustering and community detection are the terms used by the machine learning on graphs and data mining communities, respectively.
They compute an approximation using the Chebyshev polynomials $T_k(\tilde{L})$ in (4), evaluated at the scaled Laplacian $\tilde{L}$ defined in equation (3):

$$\tilde{L} = 2L/\lambda_{max} - I_n, \qquad (3)$$

$$g_\theta(L)\,x = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L})\,x. \qquad (4)$$

Here, $T_k(\tilde{L})$ is the Chebyshev polynomial of order $k$ evaluated at the scaled Laplacian, and $\theta_k$ is the corresponding Chebyshev coefficient. This network is referred to as ChebNet.
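PyTorch Geometric, which the paper trains with, provides this filter as ChebConv. A minimal two-layer ChebNet for node classification might look as follows; the layer sizes and the choice of K are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import ChebConv

class ChebNet(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes, K=3):
        super().__init__()
        # K is the order of the Chebyshev polynomial in equation (4).
        self.conv1 = ChebConv(in_dim, hidden_dim, K=K)
        self.conv2 = ChebConv(hidden_dim, num_classes, K=K)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return F.log_softmax(self.conv2(x, edge_index), dim=1)
```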
2.3.2 Graph Convolutional Network (GCN)
GCN [8] is a GNN variant that approximates ChebNet to the first order by setting $K = 1$ and $\lambda_{max} = 2$. The convolution filter on the graph then reduces to

$$g_\theta \star x \approx \theta \left( I_N + D^{-1/2} A D^{-1/2} \right) x, \qquad (5)$$

where $A$ is the adjacency matrix and $D$ is the degree matrix, whose diagonal counts the number of neighbours of each node. In equation (5), $A$ and $D$ can be re-normalized by adding the identity matrix $I_N$, so that the features and state of the node itself are also captured; this yields the aggregation function $F = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X$ and the node update $H = F\Theta$.

By stacking a sufficient number of GCN layers, hierarchical information can be extracted from the graph-structured data by propagating local messages across layers. The message propagation rule across hidden layers $l$ and $l+1$ is given by (6):

$$h^{(l+1)} = \sigma \left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} h^{(l)} W^{(l)} \right), \qquad (6)$$

$$z = \mathrm{softmax} \left( h^{(l+1)} W^{(l+1)} \right), \qquad (7)$$

where $\sigma$ is the activation function, $\tilde{D}$ is the degree matrix with self-loops, $\tilde{A}$ is the adjacency matrix with self-loops, and $W$ is a weight matrix. $z$ in equation (7) is the final softmax layer used in a semi-supervised classification task.
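A two-layer GCN implementing the propagation rule (6) with a softmax output as in (7) is available in PyTorch Geometric as GCNConv. A minimal sketch, with illustrative hidden size:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        # GCNConv applies the normalized propagation D~^{-1/2} A~ D~^{-1/2} X W.
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, training=self.training)
        return F.log_softmax(self.conv2(x, edge_index), dim=1)
```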
2.3.3 Graph Attention Network (GAT)
Graph Attention Networks (GAT) [20] use self-attention to compute attention coefficients that weight the nodes in each node's neighbourhood by importance. It is an extension of the attention mechanism proposed by Bahdanau et al. [21] that allows each neighbouring node to be attended to. The following equations summarize the attention mechanism used by GAT:

$$e_{ij} = a(W h_i, W h_j), \qquad (8)$$

where $e_{ij}$ is the attention score between node $i$ and node $j$, computed from the shared weight matrix $W$ and the hidden states of $i$ and $j$. This provides a new set of feature information about node $i$ and its neighbouring nodes to be learned, together with a trainable parameter $a$ which in effect aligns the attention scores $e_{ij}$ with the output features. Using a softmax, the attention scores are normalised into the coefficients $\alpha_{ij}$ given in (10):

$$\alpha_{ij} = \mathrm{softmax}(e_{ij}), \qquad (9)$$

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_k]\right)\right)}. \qquad (10)$$
Veličković et al. [20] also define multi-head attention on graphs by concatenating $K$ independent attention mechanisms (12), each of which implements (11):

$$h_i^{(t+1)} = \sigma \left( \sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j \right), \qquad (11)$$

$$h_i^{(t+1)} = \big\Vert_{k=1}^{K} \, \sigma \left( \sum_{j \in \mathcal{N}_i} \alpha_{ij}^k W^k h_j \right). \qquad (12)$$

When the output of the last hidden layer of the network is used for prediction with a sigmoid or softmax activation, the features from the $K$ attention heads are instead averaged, as in equation (13):

$$h_i^{(t+1)} = \sigma \left( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^k W^k h_j \right). \qquad (13)$$
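A matching sketch with PyTorch Geometric's GATConv, concatenating K heads in the hidden layer as in (12); the layer sizes and head count are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes, heads=8):
        super().__init__()
        # Hidden layer: K = heads attention mechanisms, concatenated as in (12).
        self.conv1 = GATConv(in_dim, hidden_dim, heads=heads, concat=True)
        # Output layer: a single head; with heads > 1 and concat=False,
        # PyG would instead average the heads as in equation (13).
        self.conv2 = GATConv(hidden_dim * heads, num_classes, heads=1, concat=False)

    def forward(self, x, edge_index):
        x = F.elu(self.conv1(x, edge_index))
        return F.log_softmax(self.conv2(x, edge_index), dim=1)
```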
2.3.4 Attention-Based Graph Neural Network (AGNN)
While the attention computed for neighbouring nodes does not change from one layer to another, AGNN [22] uses a layer-wise parameter $\alpha^{(t)}$ that learns to weight the neighbours of node $i$ according to their contribution to the label of the target node. This is done by storing a single $\alpha^{(t)}$ for each layer $t \in \{1, 2, \ldots, l\}$, where $l$ is the number of layers. Propagation of hidden-state information across layers is guided by the rule in equation (14):

$$h^{(t+1)} = F^{(t)} h^{(t)}, \qquad (14)$$

$$F_i^{(t)} = \mathrm{softmax}\left( \left[ \alpha^{(t)} \cos\left(h_i^{(t)}, h_j^{(t)}\right) \right]_{j \in \mathcal{N}(i) \cup \{i\}} \right), \qquad (15)$$

where $F_i^{(t)}$ is the propagation vector computed at layer $t$ as a relevance-weighted summary of each neighbouring node $j$ of node $i$.
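PyTorch Geometric exposes this propagation as AGNNConv, which holds a single learnable scalar per layer. A minimal sketch following the usual linear-propagate-linear arrangement; layer sizes are illustrative.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import AGNNConv

class AGNN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.lin1 = torch.nn.Linear(in_dim, hidden_dim)
        # Each AGNNConv stores one trainable scalar, the alpha^(t) of (15).
        self.prop1 = AGNNConv(requires_grad=True)
        self.prop2 = AGNNConv(requires_grad=True)
        self.lin2 = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.lin1(x))
        x = self.prop1(x, edge_index)
        x = self.prop2(x, edge_index)
        return F.log_softmax(self.lin2(x), dim=1)
```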
2.3.5 Graph Auto-Encoder (GAE)
The models discussed so far are (semi-)supervised learning methods on graph data, where each example has a label showing the class it belongs to. We now discuss the Graph Auto-Encoder (GAE) [23], an extension of auto-encoders [24, 25] for unsupervised learning on graphs. GAE learns a latent representation of the data without looking at labels or edge directions. It is composed of an encoder (inference model) that learns a code by minimizing a reconstruction loss, and a decoder (generative model) whose output is a reconstruction of the original representation from the code. Kipf and Welling [23] proposed a GCN encoder function

$$f(z \mid X, A) = \mathrm{GCN}(X, A) \qquad (16)$$

and a decoder function

$$p(A \mid z) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij} \mid z_i, z_j), \qquad (17)$$

where $p(A_{ij} \mid z_i, z_j)$ can be

$$p(A_{ij} = 1 \mid z_i, z_j) = \sigma(z_i^\top z_j). \qquad (18)$$

The decoder is thus specialised in predicting new edges between nodes in the reconstructed adjacency matrix.
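PyTorch Geometric wraps this model as GAE, whose default decoder is exactly the inner product of equation (18). A minimal sketch of the encoder; the latent size is illustrative, while the 38 input features follow Table 2.

```python
import torch
from torch_geometric.nn import GAE, GCNConv

class GCNEncoder(torch.nn.Module):
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, 2 * latent_dim)
        self.conv2 = GCNConv(2 * latent_dim, latent_dim)

    def forward(self, x, edge_index):
        return self.conv2(torch.relu(self.conv1(x, edge_index)), edge_index)

# GAE's default decoder is the inner product sigma(z_i^T z_j) of equation (18).
model = GAE(GCNEncoder(in_dim=38, latent_dim=16))
```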
2.3.6 Variational Graph Auto-Encoder (VGAE)
The inference model of the Variational Graph Auto-Encoder (VGAE) learns a latent code that follows a controlled distribution, using an encoder function $q(z \mid X, A)$. The encoder is generally a simple two-layer GCN with shared first-layer weights, producing normally distributed latent variables:

$$q(Z \mid X, A) = \prod_{i=1}^{N} q(z_i \mid X, A), \quad \text{where} \quad q(z_i \mid X, A) = \mathcal{N}\left(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2)\right). \qquad (19)$$

The generative model then reconstructs a new adjacency matrix with a decoder function $p(A \mid z)$ (the same as for the GAE). Training is done by maximizing the variational lower bound

$$\mathcal{L} = \mathbb{E}_{q(Z \mid X, A)}\left[\log p(A \mid Z)\right] - \mathrm{KL}\left[q(Z \mid X, A) \,\|\, p(Z)\right]. \qquad (20)$$
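The variational variant is available as VGAE; the encoder now outputs the mean and log standard deviation of (19), and kl_loss supplies the KL term of (20). A minimal sketch under the same illustrative sizes:

```python
import torch
from torch_geometric.nn import VGAE, GCNConv

class VariationalGCNEncoder(torch.nn.Module):
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, 2 * latent_dim)
        self.conv_mu = GCNConv(2 * latent_dim, latent_dim)      # mu_i in (19)
        self.conv_logstd = GCNConv(2 * latent_dim, latent_dim)  # log sigma_i

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv_mu(h, edge_index), self.conv_logstd(h, edge_index)

model = VGAE(VariationalGCNEncoder(in_dim=38, latent_dim=16))
# One training step maximizes the lower bound (20), i.e. minimizes
#   model.recon_loss(z, pos_edge_index) + (1 / num_nodes) * model.kl_loss()
```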
Table 2: Summary of data features and representation.

    Feature        Notation  Number  Representation
    nodes          V         111     countries
    node features  X         38      population, etc.⁴
    edges          A         476     1 if countries i and j have traded, 0 otherwise
    edge weights   E         476     net trade value (USD)
    node labels    Y         4       income group
3 Data and Tasks
Here, we present our approach to representing countries in the graph-structured bilateral trade data for income-level classification and trade partner prediction. We first describe the basic components of a typical graph and relate each to a component of our data.

Our primary goals are to (1) classify countries into their respective income levels: high, upper middle, lower middle, and low, and (2) predict potential trade partners. These are essentially multi-class node classification and link prediction tasks. We trained graph neural network models and baseline models to perform these tasks. We base our evaluation of performance on classification accuracy on test examples for node classification, and on the area under the curve (AUC) and average precision (AP) for link prediction. All experiments were performed on data specifically collected for this study; we do not include results on the standard benchmark datasets traditionally used to test the efficiency of novel approaches. In the next subsection, we describe how we collected the data, the representation approach we took, and the GNN models adopted for the downstream tasks.
3.1 Data Collection and Representation
The data used for this study are taken from two different sources: the United Nations Comtrade Database³ and Kaggle. The United Nations Comtrade Database is an international trade database containing reporter-partner trade statistics collated for about 170 countries over a given period of time. These trade statistics include (1) imports, exports, re-exports, and re-imports, (2) commodities exchanged between the trade partner and reporter, and (3) trade value in US dollars.

The data retrieved from Kaggle, on the other hand, contain the profiles of specific countries, including geographical, financial, geological, and other information. To ensure consistency, we used data for a particular year and accumulated the trade and profile information for a total of 111 countries, along with their income groups, which are used as target labels.
Each country considered is a node in our graph and carries 38 node features, which are simply the collected profile information. We then used the net trade balance between countries to construct an adjacency matrix such that there is an edge between two countries if their trade balance is nonzero, and no edge otherwise. An entry of 1 is assigned if there is an edge and 0 otherwise. The trade values are in US dollars and constitute the edge weight matrix. A summary of the features and their representation is provided in Table 2.
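Under these conventions, assembling the graph for PyTorch Geometric reduces to a few lines. The sketch below is a hypothetical assembly; the pre-processed inputs (features, pairs, values, labels) stand in for the cleaned Comtrade and Kaggle data.

```python
import torch
from torch_geometric.data import Data

def build_trade_graph(features, pairs, values, labels):
    """Hypothetical assembly of the trade graph of Table 2.

    features: [111, 38] float tensor of country profile features.
    pairs:    list of (i, j) index pairs with a nonzero trade balance.
    values:   net trade value in USD for each pair (edge weights E).
    labels:   [111] long tensor of income-group indices in {0, 1, 2, 3}.
    """
    edge_index = torch.tensor(pairs, dtype=torch.long).t().contiguous()
    edge_weight = torch.tensor(values, dtype=torch.float)
    return Data(x=features, edge_index=edge_index,
                edge_attr=edge_weight, y=labels)
```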
4 Experiments
We evaluate our approach to modeling bilateral trade with GNNs in two settings: link prediction of withheld edges between trading countries, and income classification of countries (node classification) as a separate downstream task. We train all GNN-based models using PyTorch Geometric [26], using 80% of the data for training and 20% for testing. For optimization of model parameters, we use the Adam optimizer [27], while hyperparameters are tuned using Bayesian optimization [28, 29].
³ https://comtrade.un.org/data/
⁴ See the appendix for the full list of features.
Figure 1: (a) The graph of countries (numbered for country reference) and links from trade data. (b) The graph of countries with node colour indicating income level.
Figure 2: (1) The input graph has partially labelled nodes and edges of different widths corresponding to trade flow values; the node classifier infers the labels of the remaining nodes. (2) The link predictor predicts missing edges, illustrated with dotted lines.
5 Setup and Baselines
In addition to the GNNs, we test a multi-layer perceptron (MLP) and a logistic regression model as baselines for node classification. Both the MLP and the logistic regression model consume node features as input but, critically, do not have access to the underlying graph structure in the form of an adjacency matrix. We hypothesize that effective learning requires utilizing the local neighborhood information available via the adjacency matrix; i.e., the adoption of specific domestic trade policies in one country can influence similar policies in its neighbors.
5.1 Hyperparameter tuning
In our experiments, we used Bayesian optimization [28, 29]⁵ to tune the hyperparameters. We defined the bounds on the parameter space as shown in Table 3 and obtained the best results with a learning rate of 0.001 and a weight decay of 0.005 for all GNN models. The baseline linear model performed better at a slightly different hyperparameter setting: a learning rate of 0.4601 and a weight decay of 0.06976.

Table 3: Hyperparameters and predefined optimization bounds.

    hyperparameter  lower bound  upper bound
    learning rate   0.0001       1.0
    weight decay    0.005        1.0

⁵ Code available at https://github.com/fmfn/BayesianOptimization.
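With the library cited in footnote 5, tuning reduces to maximizing a black-box objective over the bounds of Table 3. The sketch below assumes a hypothetical wrapper `train_and_evaluate` that trains a model with the given hyperparameters and returns test accuracy.

```python
from bayes_opt import BayesianOptimization

def objective(learning_rate, weight_decay):
    # train_and_evaluate is a hypothetical wrapper around model training.
    return train_and_evaluate(lr=learning_rate, wd=weight_decay)

optimizer = BayesianOptimization(
    f=objective,
    pbounds={"learning_rate": (0.0001, 1.0), "weight_decay": (0.005, 1.0)},
)
optimizer.maximize(init_points=5, n_iter=20)
print(optimizer.max)  # best observed hyperparameters and score
```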
5.2 Node Classification
We used the GCN, ChebNet, GAT, and AGNN models described in Section 2 to perform multi-class node classification on the input graph. We split the data into train and test sets by randomly masking out 20% of the nodes and edge indices from the adjacency matrix and using the remaining 80% for training. The learned model is then used to predict labels for the masked-out set of nodes. We report the classification accuracy on the test set as the measure of model performance.
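A minimal sketch of this protocol is given below, assuming `model` is any of the GNNs from Section 2 and `data` the graph of Table 2; the mask construction follows the 80/20 split, and the optimizer uses the tuned learning rate and weight decay from Section 5.1.

```python
import torch
import torch.nn.functional as F

def train_and_test(model, data, epochs=200):
    # Randomly mask out 20% of the nodes for testing; train on the rest.
    n = data.num_nodes
    perm = torch.randperm(n)
    train_mask = torch.zeros(n, dtype=torch.bool)
    train_mask[perm[: int(0.8 * n)]] = True
    test_mask = ~train_mask

    opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.005)
    for _ in range(epochs):
        model.train()
        opt.zero_grad()
        out = model(data.x, data.edge_index)
        F.nll_loss(out[train_mask], data.y[train_mask]).backward()
        opt.step()

    model.eval()
    pred = model(data.x, data.edge_index).argmax(dim=1)
    return (pred[test_mask] == data.y[test_mask]).float().mean().item()
```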
5.3 Link Prediction
For the link prediction task, we used the graph auto-encoder (GAE) and variational graph auto-encoder (VGAE) [23] to learn a latent representation of the input graph. The input graph has few observed edges and hence a sparse adjacency matrix. This input graph is fed into the link prediction model, which then reconstructs a new adjacency matrix representing a new neighborhood structure for each node. We evaluate the reconstruction using the area under the curve (AUC) and average precision (AP) metrics.
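A sketch of this protocol with PyTorch Geometric follows; `model` is the GAE or VGAE from Section 2 and `data` the trade graph, and train_test_split_edges withholds a fraction of edges and samples negative (non-)edges for evaluation. The 20% test ratio mirrors the split above; treat this as an assumed setup rather than the authors' exact pipeline.

```python
import torch
from torch_geometric.utils import train_test_split_edges

data = train_test_split_edges(data, val_ratio=0.0, test_ratio=0.2)
opt = torch.optim.Adam(model.parameters(), lr=0.001)

for _ in range(200):
    model.train()
    opt.zero_grad()
    z = model.encode(data.x, data.train_pos_edge_index)
    model.recon_loss(z, data.train_pos_edge_index).backward()
    opt.step()

model.eval()
with torch.no_grad():
    z = model.encode(data.x, data.train_pos_edge_index)
    auc, ap = model.test(z, data.test_pos_edge_index, data.test_neg_edge_index)
print(f"AUC: {auc:.4f}, AP: {ap:.4f}")
```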
5.4 Results and Evaluation
Figure 3: Accuracy on test sets.
On the node classification task, we summarize the mean results over 100 runs in Table 4. The GCN achieves the best accuracy. ChebNet also compares competitively with the linear baseline model.
Figure 4: AUC and AP scores on adjacency matrix reconstruction by (a) GAE and (b) VGAE.
Table 4: Results for the multi-class node classification task.

                   GCN     ChebNet  GAT     AGNN    Linear  Logistic Regression
    Test Accuracy  0.6812  0.6436   0.6158  0.6003  0.6491  0.5758
Table 5: Results for the link prediction task.

                       GAE     VGAE
    AUC                0.9840  0.9888
    Average Precision  0.9835  0.9896
In Table 5, we report high AUC and AP scores for both GAE and VGAE. This indicates that these models can reliably reconstruct the adjacency matrix from the learned latent representation.
6 Discussion and Conclusion
In this paper, we approach the modeling of bilateral trade and its related downstream tasks, prediction of potential trade partners among countries and income level classification, as a problem in graph representation learning. We leverage historical mutual trade relationships to first construct a graph, where nodes are countries and edges represent active trade between any two countries, before applying graph neural networks. Our approach naturally points to a new direction for machine learning, particularly graph representation learning, and its application in the field of economics. Empirically, we confirm that our approach performs well on the intended tasks and can potentially aid future trade analysis. While we considered modeling trade as a static graph, an exciting future direction is to model the time evolution of bilateral trade as a dynamic graph. This would enable analyses of (1) how countries evolve from one income class to another over time and (2) how trade activities between countries change over time. In the latter case, a temporal prediction of an edge between any two countries would indicate how likely they are to partner in trade in the future.
References
[1] David Ricardo. On the Principles of Political Economy and Taxation, chapter 1. Batoche Books, Kitchener, Ontario, Canada, 3rd edition, 1817.
[2] Patrick Steiner. Determinants of bilateral trade flows, 2015.
[3] Alan Deardorff. Determinants of bilateral trade: Does gravity work in a neoclassical world? The Regionalization of the World Economy, pages 7–32, 1998.
[4] James E. Anderson. The gravity model. Annual Review of Economics, 2011.
[5] World Bank Data Team. New country classifications by income level, 2017. Last accessed 31 December 2019.
[6] Thomas Chaney. The gravity equation in international trade: An explanation. Working Paper 19285, National Bureau of Economic Research, August 2013.
[7] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, January 2009.
[8] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[9] Smriti Bhagat, Graham Cormode, and S. Muthukrishnan. Node classification in social networks. CoRR, abs/1101.3291, 2011.
[10] Edouard Pineau and Nathan de Lara. Graph classification with recurrent variational neural networks. CoRR, abs/1902.02721, 2019.
[11] Jia Li, Yu Rong, Hong Cheng, Helen Meng, Wen-bing Huang, and Junzhou Huang. Semi-supervised graph classification: A hierarchical graph perspective. CoRR, abs/1904.05003, 2019.
[12] Antoine Jean-Pierre Tixier, Giannis Nikolentzos, Polykarpos Meladianos, and Michalis Vazirgiannis. Classifying graphs as images with convolutional neural networks. CoRR, abs/1708.02218, 2017.
[13] C. Aggarwal, G. He, and P. Zhao. Edge classification in networks. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 1038–1049, May 2016.
[14] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. CoRR, abs/1802.09691, 2018.
[15] Ramnath Balasubramanyan, Frank Lin, and William W. Cohen. Node clustering in graphs: An empirical study. 2010.
[16] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.
[17] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. CoRR, abs/1812.08434, 2018.
[18] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2013.
[19] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. CoRR, abs/1606.09375, 2016.
[20] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
[21] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[22] Kiran K. Thekumparampil, Sewoong Oh, Chong Wang, and Li-Jia Li. Attention-based graph neural network for semi-supervised learning, 2018.
[23] Thomas N. Kipf and Max Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning, 2016.
[24] Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59:291–294, 1988.
[25] Geoffrey E. Hinton and Richard S. Zemel. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems 6, pages 3–10. Morgan-Kaufmann, 1994.
[26] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[27] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[28] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25 (NIPS'12), pages 2951–2959. Curran Associates Inc., 2012.
[29] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.