Deep Attributed Network Representation Learning via Attribute
Enhanced Neighborhood
Cong Li a,∗, Min Shi a, Bo Qu b and Xiang Li c,∗
aAdaptive Networks and Control Lab, Department of Electronic Engineering, School of Information Science and Technology, Fudan
University, Shanghai, 200433, China
bPeng Cheng Laboratory, Shenzhen, 518000, China
c Institute of Complex Networks and Intelligent Systems, Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, 201210, China
ARTICLE INFO
Keywords:
Attributed network representation learning
Graph autoencoders
Network structure
Link prediction
Node classification
Social homophily
ABSTRACT
Attributed network representation learning aims at learning node embeddings by integrating network
structure and attribute information. It is a challenge to fully capture the microscopic structure and the
attribute semantics simultaneously, where the microscopic structure includes the one-step, two-step
and multi-step relations, indicating the first-order, second-order and high-order proximity of nodes,
respectively. In this paper, we propose a deep attributed network representation learning via attribute
enhanced neighborhood (DANRL-ANE) model to improve the robustness and effectiveness of node
representations. The DANRL-ANE model adopts the idea of the autoencoder, and expands the decoder component into three branches to capture different orders of proximity. We linearly combine the adjacency matrix with the attribute similarity matrix as the input of DANRL-ANE, where the attribute similarity matrix is calculated by the cosine similarity between node attributes, based on social homophily. Moreover, the sigmoid cross-entropy loss function is extended to capture the neighborhood character, so that the first-order proximity is well preserved. We compare our model with the state-of-
the-art models, especially, the latest graph autoencoders (GAEs) method, ARGA, and demonstrate the
contribution of each module on real-world datasets and two network analysis tasks, i.e., link prediction
and node classification. The DANRL-ANE model performs well on various networks, even on sparse
networks or networks with isolated nodes when the attribute information is sufficient.
1. Introduction
Networks are generally utilized to explore and model
complex systems, such as online social networks and citation
networks, where an entity is represented as a node and the
interaction between two entities is represented as an edge.
Network analysis derives many machine learning applica-
tions, such as the online advertisement targeting [1] and
anomaly detection [2]. Identifying effective features of nodes
(or edges) in a network is essential. However, traditional methods tend to manually mine domain-specific features based on expert experience, which not only requires high labor and time costs, but also limits the scalability of models across different prediction tasks [3]. Network repre-
sentation learning (NRL) [4] as an alternative of automatic
feature mining has been proved to be beneficial to various
network analysis tasks, such as the node classification [5][6],
link prediction [7], clustering [8] and visualization [9].
Early NRL methods are mostly based on the matrix
factorization [10]. To reduce the computational complexity
on large-scale networks, inspired by word modeling, Perozzi
et al. propose DeepWalk [4]. The shallow NRL models
cannot well capture the high non-linearity that is universal in networks, which would lead to sub-optimal network
representation [11]. Then, deep models, such as structural
deep network embedding (SDNE) [11], have emerged.
∗Corresponding authors. E-mail address: cong_li@fudan.edu.cn
(Cong Li) and lix2021@tongji.edu.cn (Xiang Li)
Considering only the network structure is not enough
to learn the informative and accurate node representations,
especially when the structure is sparse. Social science theo-
ries like the homophily [12][13] and social influence theory
[14] suggest that there is a strong correlation between the
structure and the attributes. Therefore, many studies focus
on attributed NRL [15][16][17], which mainly learn the
consistent node representations from the network structure
and node attributes. However, the models are susceptible to
the sparsity of heterogeneous information sources, i.e., the
structure and the attributes. Afterwards, the deep coupling
paradigm is introduced to enhance the robustness of the
representations. Attributed network representation learning
(ANRL) [18] is one of the representative methods, but
might be affected by the characteristics of the local network
structure. In addition, the previous attributed NRL methods
rarely preserve all the microscopic structural information,
i.e., the first-order, second-order, and high-order proximity
[19], together, where the proximities indicate the one-step,
two-step, and multi-step relationship between two nodes.
Nevertheless, taking full advantage of the microscopic struc-
ture tends to be essential for learning the network represen-
tation [20].
Utilizing the microscopic structural information as well
as the node attributes, we propose a novel deep coupling
attributed NRL model, namely, the deep attributed network
representation learning via attribute enhanced neighborhood
(DANRL-ANE) model. The proposed model consists of
three coupled modules, i.e., the self-built first-order proxim-
ity preserved, attribute enhanced neighborhood autoencoder
and community-aware skip-gram module, which preserve
the first-order, second-order, and high-order proximity, re-
spectively. The three modules share connections to an en-
coder. Especially, we model the attributes based on the social
homophily, and incorporate the attribute semantics into the
adjacency matrix to enhance the direct neighborhood of each
node.
The main contributions of the work are as follows:
(i) We propose a deep three-part coupling model, DANRL-
ANE, which learns the node representations by jointly
mining the microscopic structure and node attributes. The
attributes are preprocessed to be the input of the DANRL-
ANE model together with the adjacency matrix, which could
help obtain the second-order proximity.
(ii) We construct the self-built first-order proximity pre-
served module, which extends the sigmoid cross-entropy
loss function for capturing the local pairwise relationship
between node pairs on undirected and unweighted networks.
(iii) The proposed model could be well applied to the ma-
chine learning tasks that benefit from the pairwise properties
between nodes, i.e., the link prediction and node classifica-
tion, and is not susceptible to the sparsity of the structure,
the sparsity of the attributes, or the local network structure.
Moreover, the model could deal with the networks with
isolated nodes when we obtain sufficient node attributes.
The rest of the paper is organized as follows. We discuss
the related work in Section 2, and introduce the preliminaries
involved in the paper in Section 3. We give the detailed
description of the proposed model in Section 4. Then, we
compare the model with state-of-the-art models, and show
the experimental settings and results in Section 5. Finally,
we conclude the paper in Section 6.
2. Related Work
The network representation was first proposed as part of dimensionality reduction technologies [21][22][23] in the early 2000s. However, the early methods suffer from both computational and statistical performance drawbacks [3].
advancements in language modeling to large-scale networks.
A large number of related algorithms have been proposed.
DeepWalk [4] uses the uniform sampling to collect the node
sequences [20]. Node2vec [3] extends the DeepWalk model,
which captures the diversity of connectivity patterns in a net-
work. The skip-gram-based methods capture the high-order
proximity [24]. LINE [25] utilizes different designed objec-
tive functions to preserve the first-order and second-order
proximity. However, all of the methods cannot preserve the
different 𝑘(𝑘≥3)-step relationships in distinct subspaces
[20]. Therefore, Cao et al. propose GraRep [20], which con-
catenates all the local 𝑘(𝑘≥3)-step representations as the
representations of nodes. The mentioned methods all utilize
the network structure only to learn network representation.
Besides the structure, nodes in the real world are usually
affiliated with various attributes. Researchers begin to focus
on mining the network features from attributed networks,
such as GAT2VEC [26] and SANE [27]. To further capture the high non-linearity, algorithms such as DANE [15],
ASNE [16] and MDNE [17], have been recently designed
based on the deep learning technologies. The algorithms all
model the network structure, encode the attribute informa-
tion, and then depend on the strong correlation between the
structure and the attributes to obtain the consistent network
embedding. For instance, DANE [15] employs the autoen-
coder to preserve the high-order structural proximity and
attribute semantics, the joint probability to capture the first-
order proximity from the structure and attributes, and the
likelihood estimation to learn the node embeddings. The
methods might be susceptible to the sparsity of either the
structure or the attributes. For learning the robust represen-
tations, Zhang et al. [18] propose a deep coupling model
ANRL, which preserves the second-order and high-order
proximity from the structure. On the basis of the encoder
part, ANRL constructs a neighbor enhancement autoencoder
module, and designs an attribute-aware skip-gram module.
Notably, the neighbor enhancement autoencoder belongs to
the graph autoencoders (GAEs) that encode node structural
information and node attribute information at the same time,
and adversarially regularized graph autoencoder (ARGA)
[28] is the latest research method in the direction. Neverthe-
less, the design of the neighbor enhancement autoencoder
makes ANRL limited by the choice of datasets.
In summary, the study of attributed NRL still has open
problems in at least two aspects: (1) Since the network struc-
ture and node attributes are two heterogeneous information
sources, we need to consider how to preserve their character-
istics in a vector space; (2) The first-order, second-order and
high-order proximity define different neighborhood relations
among directly or indirectly connected nodes. The local
closeness proximities reflect the entire microscopic structure
features of original networks, yet how to design a proper
model to capture the proximities is a challenge. Here, we
propose the DANRL-ANE model under the paradigm of
deep coupling, in which three coupled modules are de-
signed to capture the different order proximity. Especially,
the attribute information is mined as the supplement of the
adjacency matrix.
3. Preliminaries
In this section, we first introduce two types of node
attributes, notations and definitions, which are used in this
work. Then, the schematic of attributed network representa-
tion learning is given.
3.1. Node Attributes
The node attributes refer to the auxiliary information
used to describe a node besides the network structure. For
instance, in social networks, personal information such as
age, gender and hobbies can be used as attributes. Regardless
[Figure 1 here: panel (a) Input: an attributed information network, showing structure proximity and attribute proximity (e.g., nodes from the same grade in the same school, nodes attending the same online book club, nodes majoring in computer science); panel (b) Output: representation.]
Figure 1: An illustration of attributed network representation learning. A social network with node attribute information is the input, and the vector representations of nodes are the output. Input: the numbered nodes denote the users, the edges between nodes represent the social relations between users, and nodes of the same color represent users with similar attribute information. Output: the vectors preserve the network structural information and attribute semantics.
of the semantics, the attributes could be categorized into two
types: the discrete attributes and continuous attributes [16].
∙Discrete attributes. The typical example of the discrete
attributes is the categorical attributes, which can be trans-
formed into the binary vectors via one-hot encoding.
∙Continuous attributes. The continuous attributes nat-
urally exist in social networks. They could be artificially
generated from the transformation of the categorical vari-
ables. The continuous attributes could be represented as the
real-valued vectors after being preprocessed. For example,
in the document modeling, after obtaining the bag-of-words
representation of a document, it is common to transform it
to a real-valued vector via TF-IDF to reduce the noises [16].
The proposed model DANRL-ANE is suitable for the
networks with discrete attributes or continuous attributes.
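As a small, hedged illustration of this preprocessing (using scikit-learn utilities; the attribute values and word counts below are hypothetical), discrete attributes can be one-hot encoded and bag-of-words counts turned into TF-IDF vectors before assembling the node attribute matrix X:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfTransformer

# Discrete (categorical) attributes, e.g. gender and hobby, become binary vectors.
discrete = np.array([["female", "reading"], ["male", "sports"], ["male", "reading"]])
X_discrete = OneHotEncoder().fit_transform(discrete).toarray()

# Continuous attributes derived from text: bag-of-words counts -> real-valued TF-IDF vectors.
word_counts = np.array([[3, 0, 1], [0, 2, 2], [1, 1, 0]])
X_continuous = TfidfTransformer().fit_transform(word_counts).toarray()

# Node attribute matrix X, one row per node.
X = np.hstack([X_discrete, X_continuous])
```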
3.2. Notations
Let $G = (V, E, A, X)$ be an attributed information network, where $V = \{v_1, \ldots, v_n\}$ denotes a set of nodes, $E \subset (V \times V)$ denotes a set of edges among nodes, $A$ is the adjacency matrix and $X$ is the node attribute matrix. In the adjacency matrix $A$, if there is an edge between nodes $v_i$ and $v_j$, then $a_{ij} > 0$; in particular, if the network is unweighted, $a_{ij} = 1$; otherwise, $a_{ij} = 0$. If the network is undirected, $a_{ij} = a_{ji}$. In the node attribute matrix $X$, the element $x_{ik}$ indicates the attribute information of node $v_i$ on attribute $k$, which can be a discrete or continuous attribute.
In this work, we focus on the undirected and unweighted
networks.
3.3. The Closeness Proximity
We here introduce the definition of the first-order, second-
order and high-order proximity involved in DANRL-ANE
model.
Definition 1. First-order proximity. The first-order proxim-
ity describes the pairwise proximity between nodes [11]. For
each node pair (𝑣𝑖, 𝑣𝑗), if there is an edge between them,
the first-order proximity between nodes 𝑣𝑖and 𝑣𝑗is 𝑎𝑖𝑗;
otherwise, the first-order proximity between nodes 𝑣𝑖and 𝑣𝑗
is 0.
Definition 2. Second-order proximity. The second-order
proximity between a pair of nodes describes the proximity
of the neighborhood structure of the node pair [11]. Let
𝐴𝑖= [𝑎𝑖1, 𝑎𝑖2, ..., 𝑎𝑖𝑛]denote the first-order proximity
between node 𝑣𝑖and all other nodes, then the second-
order proximity between nodes 𝑣𝑖and 𝑣𝑗is decided by the
similarity measure, such as cosine similarity, between 𝐴𝑖and
𝐴𝑗. The second-order proximity captures the 2-step relation
between node pairs, which could be measured by the 2-step
transition probability from node 𝑣𝑖to node 𝑣𝑗, equivalently
[19].
Definition 3. High-order proximity. Compared with the
second-order proximity, the high-order proximity captures
the more global structure, which explores the 𝑘-step (𝑘≥3)
relation between node pairs [19]. The high-order proximity
could be measured by the 𝑘-step(𝑘≥3) transition probabil-
ity from node 𝑣𝑖to node 𝑣𝑗.
3.4. Attributed network representation learning
The goal of attributed network representation learning is, given an attributed information network $G = (V, E, A, X)$, to learn a mapping function that embeds the whole network into a new low-dimensional vector space, namely, $f: G \rightarrow Y \in \mathbb{R}^{n \times d}$, where $d$ denotes the embedding dimension. Then, each node can
be represented by a low-dimensional and dense vector. The
vectors store the relationship information between each node
and the others, and record the attribute semantics. Taking
the node representations as the input is beneficial for the
subsequent machine-learning-based network analysis tasks.
A schematic of attributed network representation learning
is shown in Fig. 1. The nodes close to each other in the
original network and/or nodes with the similar attributes
are also close to each other in the new vector space. For
instance, nodes {𝑣1, 𝑣8, 𝑣9, 𝑣11, 𝑣12 }share the similar node
attributes, but there is no path between node 𝑣1and any
of nodes {𝑣8, 𝑣9, 𝑣11, 𝑣12 }. Hence, nodes {𝑣8, 𝑣9, 𝑣11, 𝑣12 }
have closer representations, and node 𝑣1is far away from the
above nodes in the representation space. Similarly, nodes 𝑣5
and 𝑣10 have the similar node attributes but a long distance
in topology, thus their representations are not close. Notably,
nodes 𝑣2and 𝑣7are directly connected. However, since
nodes 𝑣2and 𝑣7do not have the similar node attributes, they
are far away from each other in the embedded space. Differ-
ently, nodes {𝑣3, 𝑣4, 𝑣6}have the similar node attributes and
are interconnected with each other, so they are embedded
closely.
4. The DANRL-ANE Model
4.1. Overview
The proposed DANRL-ANE model is a deep three-part
coupling model, which consists of the self-built first-order
proximity preserved module, the attribute enhanced neigh-
borhood autoencoder module and the community-aware
skip-gram module. Fig. 2 shows the framework of the DANRL-ANE model. The input of the encoder is the reconstructed
adjacency matrix, which is obtained by integrating the node
attributes and adjacency matrix. The self-built first-order
proximity preserved module captures the direct relations
between nodes, the attribute enhanced neighborhood autoen-
coder module reconstructs the target neighbors of nodes to
learn the relations between the neighborhoods of two nodes,
and the community-aware skip-gram module is trained
on the linear node sequences to preserve the high-order
relations. By training the three modules iteratively until the
model converges, the final node representations are obtained,
namely, the representation output of the autoencoder.
4.2. Preprocessing
4.2.1. Attribute Similarity Matrix $X^{(S)} \in \mathbb{R}^{n \times n}$
We propose to construct an attribute similarity matrix $X^{(S)}$ to capture the similarity between nodes at the attribute level. Each element $x^{(S)}_{ij}$ in the attribute similarity matrix $X^{(S)}$ indicates the similarity between nodes $v_i$ and $v_j$, which could be calculated with a similarity measure. Previous work [29] has shown that cosine similarity is a good measure in both continuous and binary vector spaces. Hence, we utilize the cosine similarity to calculate the attribute similarity
$$x^{(S)}_{ij} = \mathrm{CosineSimilarity}(X_i, X_j) = \frac{X_i X_j^{\top}}{\|X_i\|\,\|X_j\|}, \quad (1)$$
where $X_i$ is the $i$-th row of the node attribute matrix $X$.
4.2.2. Reconstructed Adjacency Matrix $R \in \mathbb{R}^{n \times n}$
Different from the attribute similarity matrix $X^{(S)}$, the adjacency matrix $A$ describes the similarity between nodes at the structure level. By setting the hyperparameters $\eta$ and $\psi$, we linearly combine the adjacency matrix $A$ and the attribute similarity matrix $X^{(S)}$ to build the reconstructed adjacency matrix $R$:
$$R = \eta A + \psi X^{(S)}. \quad (2)$$
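To make the preprocessing concrete, the following is a minimal NumPy sketch of Equations (1) and (2); it assumes dense arrays, and the guard against all-zero attribute rows is our own addition:

```python
import numpy as np

def attribute_similarity_matrix(X):
    """Cosine similarity between all pairs of attribute rows (Eq. 1)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # guard rows with empty attributes
    X_normed = X / norms
    return X_normed @ X_normed.T       # X^{(S)}, shape (n, n)

def reconstructed_adjacency(A, X, eta=1.0, psi=1.0):
    """Linear combination of structure and attribute similarity (Eq. 2)."""
    return eta * A + psi * attribute_similarity_matrix(X)
```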
4.3. Model Design
4.3.1. A Joint Optimization Framework of the
DANRL-ANE Model
The entire model consists of three coupled modules,
which share the same encoder component. The encoder aims
at mapping the input data into the representation space by
one or multiple layers of non-linear functions. In DANRL-
ANE model, the input-output relationship of each layer of
Figure 2: The architecture of the DANRL-ANE model. DANRL-ANE is a three-part coupling model: the left is the attribute enhanced neighborhood autoencoder module, the middle is the community-aware skip-gram module and the right is the self-built first-order proximity preserved module, where the input of the model is the aggregation of node neighborhood information and node attributes.
the encoder is defined as
$$y_i^{(1)} = \delta\big(R_i W^{(1)} + b^{(1)}\big), \qquad y_i^{(k)} = \delta\big(y_i^{(k-1)} W^{(k)} + b^{(k)}\big), \; k \in \{2, \ldots, K\}, \quad (3)$$
where $R_i$ is the $i$-th row of the input matrix $R$, and records the reconstructed neighbor relationship of node $v_i$. The symbol $\delta(\cdot)$ denotes the non-linear activation functions, and we choose the suitable one based on their performance on different tasks and datasets. The model parameters $W^{(k)}$ and $b^{(k)}$ represent the transformation matrix and bias vector of the $k$-th layer, respectively, and $K$ is the number of layers of the encoder.
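As an illustration only (not the authors' released implementation), a minimal PyTorch-style sketch of the encoder in Equation (3), assuming a tower architecture such as those listed later in Table 3 and a tanh activation; the class and parameter names are our own:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a row R_i of the reconstructed adjacency matrix to an embedding y_i^(K) (Eq. 3)."""
    def __init__(self, layer_sizes, activation=torch.tanh):
        # layer_sizes, e.g. [n, 1000, 500, 128]: input dimension followed by hidden/embedding sizes
        super().__init__()
        self.linears = nn.ModuleList(
            nn.Linear(d_in, d_out) for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])
        )
        self.activation = activation

    def forward(self, r):
        y = r
        for linear in self.linears:
            y = self.activation(linear(y))   # y^(k) = delta(y^(k-1) W^(k) + b^(k))
        return y                              # y^(K), the node representation
```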
The joint optimization objective of the DANRL-ANE model is
$$L = L^{NS}_{sg} + \alpha L^{M}_{ae} + \beta L_{FoP} + \gamma L_{reg}, \quad (4)$$
with
$$L_{reg} = \frac{1}{2} \sum_{k=1}^{K} \Big( \big\|W^{(k)}\big\|_F^2 + \big\|\hat{W}^{(k)}\big\|_F^2 \Big). \quad (5)$$
The functions $L_{FoP}$, $L^{M}_{ae}$ and $L^{NS}_{sg}$ are the loss functions of the self-built first-order proximity preserved module, the attribute enhanced neighborhood autoencoder module and the community-aware skip-gram module, respectively. The hyperparameters $\alpha$ and $\beta$ are employed to balance the effect of the modules. The hyperparameter $\gamma$ is the coefficient of the regularizer $L_{reg}$ that is used to prevent overfitting, where $\|\cdot\|_F$ is the Frobenius norm. The number of layers in the
encoder and decoder is $K$. The matrices $W^{(k)}$ and $\hat{W}^{(k)}$ represent the weight matrices of the encoder and decoder in the $k$-th layer, respectively. The above hyperparameters are independent of each other.
We optimize Equation (4) with stochastic gradient descent, which is a common way to optimize deep models [17]. Although stochastic gradient descent over randomly initialized weights rarely reaches the global optimum due to the existence of many local optima [30], the time complexity of algorithms adopting this strategy is linear in the number of nodes/edges, which makes them scalable to large-scale networks [19].
Next, we give the detailed description of each module.
4.3.2. Self-built First-order Proximity Preserved
Module
The first-order proximity can reveal the similarity be-
tween nodes. We here propose a self-built first-order proxim-
ity preserved module. Inspired by the LINE [25] and DANE [15] models, we first define the joint probability $w_{ij}(v_i, v_j)$ with the sigmoid function $\sigma(x) = \frac{1}{1+\exp(-x)}$, which models the first-order proximity between nodes $v_i$ and $v_j$. Let $x = y_i^{(K)} y_j^{(K)\top}$, then we obtain
$$w_{ij}(v_i, v_j) = \frac{1}{1 + \exp\big(-y_i^{(K)} y_j^{(K)\top}\big)}, \quad (6)$$
where $y_i^{(K)}$ and $y_j^{(K)}$ denote the representations of nodes $v_i$ and $v_j$, respectively.
To preserve the first-order proximity, we note that on the undirected and unweighted networks that we focus on, the first-order proximity describes the existence ($a_{ij} = 1$) or non-existence ($a_{ij} = 0$) of an edge between a node pair, which is equivalent to a binary classification problem. The sigmoid cross-entropy loss function is a typical objective for binary classification, defined as
$$L_{SCE} = -\big[t \log p_s(z) + (1 - t)\log(1 - p_s(z))\big], \quad (7)$$
where $t$ is the label of a sample, which is either 1 or 0. If the sample belongs to the positive class, $t = 1$; otherwise $t = 0$. The symbol $p_s(z)$ indicates the probability that sample $z$ belongs to the positive class based on the sigmoid function in Equation (6), where the subscript $s$ indicates the sigmoid function.
We here set $t = a_{ij} = 1$ and $p_s(z) = w_{ij}(v_i, v_j)$, since only the existing edges contribute to the network representation. Hence, the objective of the self-built first-order proximity preserved module is obtained by accumulating the losses $L_{SCE}$ over all existing edges:
$$L_{FoP} = \sum_{a_{ij} = 1} \big(-\log w_{ij}(v_i, v_j)\big), \quad (8)$$
where $a_{ij}$ is the element of the adjacency matrix $A$, and $w_{ij}(v_i, v_j)$ is given in Equation (6).
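For illustration, a small sketch (assumed PyTorch tensors; function and argument names are our own, not the authors' code) of the loss in Equation (8) applied to the embeddings of node pairs connected by an edge:

```python
import torch

def first_order_loss(y, edge_index):
    """L_FoP of Eq. (8): -sum of log sigmoid(y_i y_j^T) over existing edges.

    y          : (n, d) tensor of node embeddings y^(K)
    edge_index : (m, 2) long tensor, each row an existing edge (i, j) with a_ij = 1
    """
    y_i = y[edge_index[:, 0]]
    y_j = y[edge_index[:, 1]]
    scores = (y_i * y_j).sum(dim=1)                   # inner products y_i y_j^T
    return -torch.nn.functional.logsigmoid(scores).sum()
```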
4.3.3. Attribute Enhanced Neighborhood Autoencoder
Module
The deep autoencoder model is widely used to mine the
proximity between the neighborhood structure of node pairs
[11], since it could capture the data manifolds, and preserve
the similarity between samples [31]. The autoencoder con-
sists of two parts, the encoder and the decoder, where the decoder is the inverse process of the encoder, i.e., the input is reconstructed from the representation [32].
The purpose of the autoencoder is to minimize the recon-
struction error between the input and the reconstructed data,
so that the abstract representation of the mid-layer output
can capture the manifold structure in the input data. To be
specific, the objective of the autoencoder is
$$L_{ae} = \sum_{i=1}^{n} \big\| \hat{R}_i - R_i \big\|_2^2, \quad (9)$$
where $n$ is the number of nodes in the network, and $\hat{R}_i$ represents the reconstructed output of the input data $R_i$.
The autoencoder tends to preserve the information of zero elements when the reconstructed adjacency matrix $R$ is sparse. In order to preserve the non-zero elements, we employ the Hadamard product as a penalty factor and extend the loss function of the autoencoder, inspired by SDNE [11]. The modified objective is
$$L^{M}_{ae} = \sum_{i=1}^{n} \big\| (\hat{R}_i - R_i) \odot b_i \big\|_2^2, \quad (10)$$
where $\odot$ indicates the Hadamard product and $b_i = \{b_{ij}\}_{j=1}^{n}$. Here, $b_{ij} = 1$ if $R_{ij} = 0$, else $b_{ij} = \chi > 1$, where $\chi$ is a tunable parameter.
The attribute enhanced neighborhood autoencoder mod-
ule takes each row vector of the reconstructed adjacency ma-
trix 𝑅as the sample input, where the row vector denotes the
neighbor structural information with the attribute semantics
of the corresponding node. In other words, the autoencoder
could preserve the second-order proximity.
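A minimal sketch (illustrative PyTorch code with our own naming, not the released implementation) of the penalized reconstruction loss in Equation (10):

```python
import torch

def masked_reconstruction_loss(R, R_hat, chi=5.0):
    """L^M_ae of Eq. (10): squared error weighted by the Hadamard penalty b.

    R, R_hat : (batch, n) tensors, rows of the reconstructed adjacency matrix
               and their autoencoder reconstructions
    chi      : penalty (> 1) placed on the non-zero entries of R
    """
    b = torch.where(R == 0, torch.ones_like(R), torch.full_like(R, chi))
    return (((R_hat - R) * b) ** 2).sum()
```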
4.3.4. Community-aware skip-gram module
Inspired by DeepWalk [4], we design the community-
aware skip-gram module for capturing the high-order prox-
imity in this work. To reduce the time complexity, we adopt the node sequence sampling procedure performed by node2vec [3], where the return parameter $p_n = 1$ and the in-out parameter $q_n = 1$, and use negative sampling to approximate the loss function
$$L_{sg} = -\sum_{i=1}^{n} \sum_{c \in C} \sum_{-\tau \le j \le \tau, j \ne 0} \log p(v_{i+j} \mid R_i) = -\sum_{i=1}^{n} \sum_{c \in C} \sum_{-\tau \le j \le \tau, j \ne 0} \log \frac{\exp\big(h'^{\top}_{i+j}\, y_i^{(K)\top}\big)}{\sum_{f=1}^{n} \exp\big(h'^{\top}_{f}\, y_i^{(K)\top}\big)}, \quad (11)$$
where $n$ is the number of nodes in the network, $c \in C$ denotes a node sequence sampled by the random walk, and $\tau$
is the size of the window. The input data $R_i$ occupies the $i$-th row of the input matrix $R$. The node $v_{i+j}$ is the context node of the current node $v_i$ within the window $\tau$ in the generated random sequences. The node representation $y_i^{(K)}$ is the output of the sample input $R_i$ through the $K$-layer encoder. The matrix $H'$ is the transition matrix between the representation output layer of the autoencoder and the output layer of the skip-gram, and $h'_i$ is the $i$-th column of the transition matrix $H'$.
Then, we obtain
$$L^{NS}_{sg} = -\sum_{i=1}^{n} \sum_{c \in C} \sum_{-\tau \le j \le \tau, j \ne 0} \Big\{ \log \sigma\big(h'^{\top}_{i+j}\, y_i^{(K)\top}\big) + \sum_{s=1}^{neg} \mathbb{E}_{v_n \sim P_n(v)} \big[ \log \sigma\big(-h'^{\top}_{s}\, y_i^{(K)\top}\big) \big] \Big\}, \quad (12)$$
where $\sigma(x) = \frac{1}{1+\exp(-x)}$ is the sigmoid function and $neg$ denotes the number of sampled negative samples. The sampling distribution $P_n(v) \propto d_v^{3/4}$ is set as suggested in [33], where $d_v$ represents the degree of node $v_n$.
By minimizing Equation (12), nodes that co-occur in the sampled sequences obtain similar embedding vectors.
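As an illustrative sketch only (assumed PyTorch tensors; the context and negative sampling pipeline is simplified, and storing one row of H' per node is our convenience choice), the negative-sampling objective of Equation (12) for one batch of (center, context) pairs:

```python
import torch
import torch.nn.functional as F

def skipgram_ns_loss(y_center, H_prime, context_idx, negative_idx):
    """L^NS_sg of Eq. (12) for a mini-batch.

    y_center     : (b, d) embeddings y^(K) of the center nodes
    H_prime      : (n, d) transition matrix H' (one row h'_v per node)
    context_idx  : (b,) indices of the observed context nodes
    negative_idx : (b, neg) indices of nodes drawn from P_n(v) proportional to d_v^{3/4}
    """
    pos_scores = (H_prime[context_idx] * y_center).sum(dim=1)                    # h'_{i+j} . y_i
    neg_scores = torch.bmm(H_prime[negative_idx], y_center.unsqueeze(2)).squeeze(2)
    return -(F.logsigmoid(pos_scores).sum() + F.logsigmoid(-neg_scores).sum())
```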
Algorithm 1 describes the learning process of the entire model, and all model parameters are denoted as Θ.
5. Performance Analysis
In this section, we analyze the performance of the whole
DANRL-ANE, the aggregation input of DANRL-ANE and
each module in DANRL-ANE by three types of experiments:
a) the DANRL-ANE model vs. state-of-the-art models; b)
the attribute enhanced neighborhood autoencoder vs. ARGA
[28]; c) comparison among modules and DANRL-ANE.
Two tasks, i.e., the link prediction and node classification,
as well as the public datasets are employed. Meanwhile, we perform a time complexity analysis to assess the efficiency of the DANRL-ANE model.
5.1. Experimental Settings
5.1.1. Datasets
In the experiments, we select five datasets that have been
made publicly available to facilitate the evaluation of NRL
algorithms across different tasks [18][27][34], which belong
to two network types, i.e., the citation networks and social
networks. The dataset statistics are summarized in Table 1.
To further illustrate the application of the DANRL-ANE
model to various networks, we analyze the basic topologi-
cal properties of employed datasets, including the density,
average degree, average clustering coefficient and average
distance of the networks, as shown in Table 2.
∙Citation networks: Citeseer [18], PubMed [18] and
Cora [27].
The node indicates the publication, and the edge indi-
cates the citing or cited relation between publications. By
using the bag-of-words model to deal with a publication, and
Algorithm 1 Framework of DANRL-ANE Model
Input: An attributed information network 𝐺=
(𝑉 , 𝐸 , 𝐴, 𝑋), preprocessing hyperparameters 𝜂and 𝜓,
Hadamard product penalty parameter 𝜒, walks per
node 𝑟, walk length 𝑙, window size 𝜏, return 𝑝, in-out 𝑞,
negative samples 𝑛𝑒𝑔, trade-off parameters 𝛼and 𝛽,
regularizer coefficient 𝛾, embedding dimension 𝑑
Output: node vector representations 𝑌∈ℝ𝑛×𝑑
1: Use cosine similarity measurement method on attribute
matrix to achieve attribute similarity matrix 𝑋(𝑆)
2: Obtain the reconstructed adjacency matrix 𝑅 by linearly
combining the adjacency matrix 𝐴with attribute simi-
larity matrix 𝑋(𝑆)by 𝜂and 𝜓
3: Adopt random walk procedure of node2vec model with
𝑝and 𝑞both set as 1, and start 𝑟times of random walks
with length 𝑙at each node
4: Randomly initialize all parameters Θ
5: while not converged do
6: Sample a mini-batch of nodes with its context
7: Compute the gradient ∇L_FoP based on Equation (8)
8: Update first-order proximity preserved module pa-
rameters
9: Compute the gradient ∇L^M_ae based on Equation (10) and the gradient ∇L_reg based on Equation (5)
10: Update autoencoder module parameters
11: Compute the gradient ∇L^NS_sg based on Equation (12)
12: Update skip-gram module parameters
13: end while
14: Obtain representations 𝑌=𝑌(𝐾)based on Equation (3)
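Putting the pieces together, a hedged sketch of one iteration of Algorithm 1 (steps 6-12). It assumes the helper functions sketched in the previous subsections (Encoder, first_order_loss, masked_reconstruction_loss, skipgram_ns_loss), a decoder mirroring the encoder and the matrix H_prime defined elsewhere, and that the γL_reg term of Equation (5) is handled through the optimizer's weight decay:

```python
import torch

def train_step(encoder, decoder, H_prime, optimizer, R_rows, edges,
               context_idx, negative_idx, alpha=1.0, beta=1.0, chi=5.0):
    """One iteration of Algorithm 1, combining the three module losses (Eq. 4).

    R_rows       : (b, n) mini-batch of rows of the reconstructed adjacency matrix R
    edges        : (m, 2) existing edges among the batch nodes (a_ij = 1)
    context_idx  : (b,) context-node indices from the sampled walks
    negative_idx : (b, neg) negative samples drawn from P_n(v)
    """
    y = encoder(R_rows)                                                       # Eq. (3)
    loss = (skipgram_ns_loss(y, H_prime, context_idx, negative_idx)           # Eq. (12)
            + alpha * masked_reconstruction_loss(R_rows, decoder(y), chi)     # Eq. (10)
            + beta * first_order_loss(y, edges))                              # Eq. (8)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```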
Table 1
Dataset statistics
Datasets #Nodes #Edges #Attributes #Labels
Citeseer 3,312 4,714 3,703 6
PubMed 19,717 44,338 500 3
Cora 2,708 5,429 1,433 7
Facebook 4,039 88,234 1,283 -
Flickr 7,575 239,738 12,047 9
stemming and removing the stop-words, a vocabulary of the
remaining unique words is used as the node attributes.
In Citeseer, publications are classified into six classes,
i.e., Agents, AI, DB, IR, ML and HCI; in PubMed, publi-
cations are classified into three classes, i.e., Diabetes Mel-
litus Experimental, Diabetes Mellitus Type 1 and Diabetes
Mellitus Type 2; and in Cora, publications are classified into
seven classes, i.e., Case Based, Genetic Algorithms, Neural
Networks, Probabilistic Methods, Reinforcement Learning,
Rule Learning and Theory. The group categories are re-
garded as the labels of nodes.
∙Social networks: Facebook [18] and Flickr [34].
(1) Facebook: It is one of the most famous online social
networks. In the dataset, the node denotes the user, and the
Table 2
Basic network topology properties for datasets
Properties Density Average degree Average Clustering coefficient Average distance
Citeseer 0.0009 2.81 0.14 unconnected
PubMed 0.0002 4.50 0.06 6.34
Cora 0.0015 3.90 0.24 unconnected
Facebook 0.0108 43.69 0.61 3.69
Flickr 0.0084 63.30 0.33 2.41
edge represents the friendship relation between two users.
Furthermore, the personal profile is treated as the attribute
information used to describe the user. Note that there are no
labels in the dataset, so we cannot employ Facebook for the
node classification.
(2) Flickr: It is an image hosting and sharing website.
Similarly, the node and the edge represent the user and the
following or followed relation between users, respectively.
The users can specify a list of tags that reflect their interests,
which are processed into the attributes. The photos are
organized under the pre-specified categories, so the labels
refer to the photo interest groups that the users join in.
5.1.2. Baselines
To evaluate our model, we here choose eight state-of-
the-art models as experimental comparison methods, which
can be divided into the following groups:
∙Structure-only: This set of baseline models aims to capture only the structural information, including: (i) the skip-
gram based models which focus on preserving different or-
der proximity of the structure, such as DeepWalk, node2vec,
LINE and GraRep, (ii) the autoencoder based model, such as
SDNE. Particularly,
(1) DeepWalk: The truncated random walk is employed
to capture the high-order proximity.
(2) node2vec: The biased random walk is designed to
explore the high-order structural information.
(3) LINE: The objective is to preserve the first-order
and second-order proximity. Specifically, LINE models the
direct and indirect neighbor relationship between node pairs
through joint probability and conditional probability, respec-
tively.
(4) GraRep: All the local 𝑘(𝑘≥3)-step relational infor-
mation between node pairs are considered and concatenated
as the final representations of nodes.
(5) SDNE: Laplacian Eigenmaps and the deep model
autoencoder are employed to preserve the first-order and
second-order proximity, respectively.
∙Structure & Attribute: The models preserve the struc-
tural and attribute information based on the deep learning
model autoencoder, which can be further classified: (i) the
consistent learning based model, such as DANE, (ii) the deep
coupling framework based model, such as ANRL, (iii) the
GCN [35] based model, such as ARGA.
(1) DANE: The joint probability and autoencoder are
used to mine the corresponding first-order and high-order
proximity from the network structure, and to capture the
corresponding first-order proximity and attribute semantics
from the node attributes. Then, the likelihood estimation is
used to learn the consistent network embedding from the
structure and the attributes.
(2) ANRL: ANRL is a deep coupling model. The neigh-
bor enhancement autoencoder module encodes the attribute
semantics, and captures the second-order proximity. The
attribute-aware skip-gram module is designed to preserve
the high-order proximity. Furthermore, a large number of
experiments in [18] have proved that in the ANRL variants,
the performance of ANRL-WAN is superior. Hence, in the
paper, we choose the ANRL-WAN as the benchmark.
(3) ARGA: ARGA employs the training scheme of the
generative adversarial networks (GANs) [36] as a promotion
strategy for graph autoencoders (GAEs), and aims to recon-
struct the adjacency matrix. In the ARGA model, the encoder leverages GCN to encode node structural and attribute information, while the decoder is a simple inner product.
5.1.3. Parameter settings
∙Baseline parameters: For all baselines, we use the
open-source code provided by the original author, and tune
the parameters to make each model achieve the optimal per-
formance on the different datasets and experimental tasks.
We set the final embedding dimension 𝑑as 128. For LINE,
we concatenate the representations of the first-order and
second-order proximity as the final embeddings. We set the
walks per node 𝑟as 10, walk length 𝑙as 80, window size 𝜏
as 10, negative samples 𝑛𝑒𝑔as 10.
∙Hyperparameters of DANRL-ANE: For the hyperparameters in the DANRL-ANE model, we first tune the preprocessing hyperparameters 𝜂 and 𝜓, performing a grid search from 0.5 to 2.5 with a step of 0.5. If the model performance does not reach a local optimum, we adjust the step size according to the trend of the experimental results: when the performance is still growing, the step size remains 0.5; otherwise a smaller step size is used.
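For instance, a minimal sketch of the initial grid search over η and ψ; the scoring function evaluate_model is a hypothetical placeholder for training DANRL-ANE and measuring AUC or F1 on a validation split:

```python
import numpy as np

def grid_search_eta_psi(evaluate_model, low=0.5, high=2.5, step=0.5):
    """Coarse grid search over the preprocessing hyperparameters eta and psi."""
    best_score, best_pair = -np.inf, None
    for eta in np.arange(low, high + 1e-9, step):
        for psi in np.arange(low, high + 1e-9, step):
            score = evaluate_model(eta=eta, psi=psi)   # e.g. validation AUC
            if score > best_score:
                best_score, best_pair = score, (eta, psi)
    return best_pair, best_score
```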
Once we get the local optimum for the preprocessing hyperparameters 𝜂 and 𝜓, we then utilize alternating optimization, which optimizes one variable while keeping the remaining variables fixed, to adjust the loss function hyperparameters 𝜒, 𝛼, 𝛽 and 𝛾 until all of them converge. Notably, all of the loss function hyperparameters 𝜒, 𝛼, 𝛽 and 𝛾 first take 1 and 5 as the initial value and the initial
Table 3
Detailed architecture information for datasets
Datasets | #Hidden layers in encoder | #Neurons in each layer of the encoder
Citeseer | 3 | 3312:1000:500:128
PubMed | 3 | 19717:1000:500:128
Cora | 3 (Link Prediction); 2 (Node Classification) | 2708:1000:500:128 (Link Prediction); 2708:256:128 (Node Classification)
Facebook | 3 (Link Prediction) | 4039:1000:500:128 (Link Prediction)
Flickr | 2 (Link Prediction); 2 (Node Classification) | 7575:256:128 (Link Prediction); 7575:500:128 (Node Classification)
step size, respectively, and then adopt a similar step size
adjustment rule as above. The procedure works on each
dataset and different hyperparameters are obtained for differ-
ent datasets. In addition, after comparing the performance of ReLU, LeakyReLU, softsign, tanh and sigmoid in the DANRL-ANE model, we use the tanh function, $\tanh x = \frac{\sinh x}{\cosh x} = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$, as the non-linear activation function of the autoencoder.
∙Architecture of DANRL-ANE: Recent work [16] shows that a tower structure, in which each successive higher hidden layer halves the layer size, is effective. Firstly, we follow this principle to construct the neural networks. We find that increasing the number of neural network layers does not necessarily make the model perform better. The training time grows rapidly with the number of neurons, and increasing the number of layers greatly increases the training time. Moreover, a model with too many layers is particularly prone to overfitting, that is, its generalization ability becomes poor.
Therefore, we further test and compare the performance of different model designs by enumerating pruned architectures. For example, for the Citeseer dataset, we start from 128, namely, the dimension of the node representation, and double the number of neurons in successive hidden layers up to 1024, since the performance of the model decreases when the number of neurons in a hidden layer reaches 2048. Then, we compare the corresponding experimental results after removing one, two or three of the hidden layers 1024-512-256, respectively. Next, we fine-tune the model architecture by slightly reducing the number of neurons in each layer to achieve better performance. Note that the structure of the encoder and decoder is symmetric. We here only show the number of hidden layers and the number of neurons in each layer of the encoder in Table 3.
There is no layer between the representation layer of the
autoencoder and the output layer of the skip-gram.
5.2. DANRL-ANE vs. state-of-the-art models
To verify the excellence and applicability of the whole
model, we first compare the proposed DANRL-ANE model
with state-of-the-art models.
5.2.1. Link Prediction
Link prediction is a widely used task to evaluate the
performance of network embedding [19], which refers to the
task of predicting either missing interactions or links that
may appear in future in an evolving network [21][37]. We
illustrate the link prediction in Fig. 3, where the input is an
attribute information network with missing edges or possible
future connections. We deduce the missing edges or edges
that may exist in future by learning node representations and
vector similarity measures.
In the experiment, as done in [3], we hold out 50%
existing edges as positive instances, and ensure that the
remaining network is connected. Besides, we randomly gen-
erate the same number of nonexistent edges from the original
network, which are as negative instances. The positive and
negative instances, together, constitute the test set. Further-
more, we use the residual network to train the embedding
models, which is to obtain the representation of each node.
Then, these representations are utilized as the feature inputs
to predict the unobserved edges. Inspired by [18], in the link
prediction experiment, we rank both the positive and nega-
tive instances according to the cosine similarity function. To
judge the ranking quality, we employ the AUC [38] index,
which is widely used in information retrieval (IR) commu-
nity to evaluate a ranking list [16]. A higher score indicates
that the network representation is more informative. The link
prediction task is carried out on all datasets. The AUC value
for each model on the Citeseer, PubMed, Cora, Facebook
and Flickr dataset is summarized in Table 4, and the best
result is in bold. According to the observations, we give the
following analysis.
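As a hedged illustration of this evaluation protocol (assuming scikit-learn and node embeddings already learned on the residual network; the function name is our own), the AUC over held-out edges scored by cosine similarity can be computed as follows:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def link_prediction_auc(Y, positive_edges, negative_edges):
    """Score held-out edges by cosine similarity of node embeddings and compute AUC.

    Y              : (n, d) array of node representations
    positive_edges : list of (i, j) pairs removed from the network (label 1)
    negative_edges : list of (i, j) non-existent pairs (label 0)
    """
    Yn = Y / np.clip(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12, None)
    def score(edges):
        return np.array([Yn[i] @ Yn[j] for i, j in edges])
    labels = np.concatenate([np.ones(len(positive_edges)), np.zeros(len(negative_edges))])
    scores = np.concatenate([score(positive_edges), score(negative_edges)])
    return roc_auc_score(labels, scores)
```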
∙Structure vs. Structure: On most datasets, node2vec
achieves better or similar performance than DeepWalk, sug-
gesting that the exploration of more flexible neighborhood
facilitates the learning of node representations with higher
accuracy. Compared with the models that only preserve par-
tial microscopic structural information, such as DeepWalk,
[Figure 3 here: structure information and/or attribute information → algorithm → node representations → similarity measurement function → missing edges / edges appearing in the future.]
Figure 3: Schematic diagram of the link prediction task, where the same color implies that nodes have similar node attributes.
Table 4
Link prediction results in DANRL-ANE vs. state-of-the-art models
Datasets Citeseer PubMed Cora Facebook Flickr
Evaluation AUC AUC AUC AUC AUC
DeepWalk 0.6020 0.7925 0.7209 0.9461 0.7247
node2vec 0.5485 0.7977 0.7244 0.9552 0.7341
LINE 0.5309 0.6213 0.6047 0.5073 0.5262
GraRep 0.6008 0.8123 0.7210 0.8697 0.8899
SDNE 0.6093 0.7562 0.6326 0.8689 0.9023
DANE 0.6579 0.9140 0.7286 0.8780 0.6142
ANRL-WAN 0.9666 0.8035 0.9181 0.7698 0.7800
DANRL-ANE 0.9573 0.9439 0.9323 0.9577 0.9371
⋆We use bold to highlight the best performance.
node2vec and LINE, the superior performance exhibited by
GraRep in most of the time suggests that preserving the first-
order, second-order, and higher-order proximity is necessary
for the link prediction task. Furthermore, we compare the
deep model SDNE and shallow model LINE, which both
aim at capturing the first-order and second-order proximity.
The result shows that SDNE consistently achieves better
performance than LINE, especially on social networks. A
comparison of the performance gap on citation networks
and social networks, respectively, reveals that the larger
the network average degree is, the smaller the performance
gap is, which suggests that the autoencoder can capture the
second-order proximity that is beneficial to link prediction.
∙Structure vs. Structure & Attribute: We find that
the models considering both the structure and attributes tend
to perform better than those considering only the structure.
Furthermore, the proposed DANRL-ANE model has always
achieved better performance than GraRep, even on datasets
with sparse node attributes. The above phenomenon shows
that incorporating the attributes into NRL in a reasonable
way is beneficial to obtain the excellent link prediction
results.
∙Structure & Attribute vs. Structure & Attribute:
The comparison among DANRL-ANE, DANE and ANRL-
WAN shows that DANRL-ANE has the best experimental
results on most datasets, which further proves the impor-
tance of capturing the first-order, second-order, high-order
proximity and the attribute semantics together. Meanwhile,
it demonstrates that DANRL-ANE can learn the robust and
efficient network representation.
5.2.2. Node Classification
Node classification aims to predict the categories of
nodes by any known information of the network, which is an-
other common downstream task to evaluate the performance
of network embedding [19][21][37]. In Fig. 4, we utilize the
algorithms to obtain the node representation, which could
preserve the network information and/or attribute informa-
tion. Then, we use the classifiers to infer the unknown node
labels based on the learned node embeddings and existing
node labels.
In the experiment, we first learn the vector representation
of each node through different models. Then, following the
popular practices [18], we randomly sample 30% nodes from
the labeled nodes as the training set, and treat the rest as the
[Figure 4 here: structure information and/or attribute information → algorithm → node representations and some labels → classifier → predicted labels for the unlabeled nodes.]
Figure 4: Schematic diagram of the node classification task, where the same color implies that nodes have similar node attributes and the number inside a circle marks the node label.
Table 5
Node classification results in DANRL-ANE vs. state-of-the-art models
Datasets Citeseer PubMed Cora Flickr
Evaluation Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1
DeepWalk 0.5665 0.5212 0.8109 0.7978 0.7900 0.7782 0.4940 0.4835
node2vec 0.6002 0.5465 0.8104 0.7968 0.8058 0.7942 0.5155 0.5062
LINE 0.5605 0.5256 0.8049 0.7926 0.7884 0.7767 0.5613 0.5576
GraRep 0.4775 0.4352 0.7416 0.7248 0.7636 0.7496 0.5692 0.5606
SDNE 0.4161 0.3632 0.4258 0.2900 0.5813 0.5201 0.6043 0.5991
DANE 0.6870 0.6433 0.8063 0.7940 0.8110 0.7944 0.7721 0.7701
ANRL-WAN 0.7246 0.6764 0.8595 0.8584 0.8161 0.8030 0.6701 0.6584
DANRL-ANE 0.7275 0.6787 0.8745 0.8728 0.8419 0.8306 0.9059 0.9046
⋆We use bold to highlight the best performance.
test set. Here, SVM [39] is employed as the classifier. For binary classification, the F1 score is used as the evaluation criterion; for multi-class and multi-label classification, Micro-F1 and Macro-F1 are adopted as the evaluation criteria [17][19]. Notably, the above classification process is repeated
10 times and we report the average results. Because we don’t
have the ground truth of node labels in Facebook, the node
classification task is only performed on four datasets, i.e.,
Citeseer, PubMed, Cora and Flickr. Furthermore, Table 5
shows the performance of each network embedding method
on different datasets, in which the optimal result is in bold.
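A brief sketch of this protocol (assuming scikit-learn and a linear SVM; the multi-label case would additionally need a one-vs-rest wrapper, and the function name is our own):

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def node_classification_scores(Y, labels, train_ratio=0.3, seed=0):
    """Train an SVM on 30% of the labeled nodes and report Micro-/Macro-F1 on the rest."""
    Y_train, Y_test, l_train, l_test = train_test_split(
        Y, labels, train_size=train_ratio, random_state=seed, stratify=labels)
    clf = LinearSVC().fit(Y_train, l_train)
    pred = clf.predict(Y_test)
    return f1_score(l_test, pred, average="micro"), f1_score(l_test, pred, average="macro")
```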
We analyze the results as follows.
∙Structure vs. Structure: Node2vec shows similar or
superior results to DeepWalk on different datasets. However,
SDNE performs worse than LINE on citation networks,
which implies that the average degree of a network could be a key factor affecting the performance of SDNE: the smaller the network average degree is, the worse the performance is. The observation indicates that modeling
the directly connected relationship between two nodes with
joint probability is beneficial to capture the accurate first-
order proximity, which explains why the joint probability is
used in the proposed DANRL-ANE model. GraRep, which
considers all the microscopic structural information, has
worse performance than DeepWalk, node2vec and LINE,
on citation networks. Simply concatenating different order
information is not always suitable for any tasks, which
emphasizes that a careful design is critical.
∙Structure vs. Structure & Attribute: Table 5shows
that the methods incorporating the node attributes into NRL
have better experimental results than those only focusing
on the structure, which demonstrates that the integration of structural information and attribute semantics is advantageous for learning informative node vectors. DANE performs
worse than DeepWalk on PubMed, which is probably caused
by the sparse attributes.
∙Structure & Attribute vs. Structure & Attribute:
The optimal result of DANRL-ANE shows that our method
is not susceptible to the sparsity of either network structure
or node attributes, and could learn the robust and efficient
network representation, which explains the necessity of pre-
serving the first-order, second-order and high-order proxim-
ity. In a word, in the node classification task, the proposed
Table 6
Link prediction results in Attribute enhanced neighborhood autoencoder vs. ARGA
Datasets Citeseer PubMed Cora
Evaluation AUC AUC AUC
ARGA 0.8953 0.7544 0.8767
DANRL-ANE (autoencoder) 0.9514 0.9095 0.8661
⋆We use bold to highlight the best performance.
Table 7
Node classification results in Attribute enhanced neighborhood autoencoder vs. ARGA
Datasets Citeseer PubMed Cora
Evaluation Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1
ARGA 0.5915 0.5124 0.5185 0.4433 0.7090 0.6687
DANRL-ANE (autoencoder) 0.7237 0.6683 0.8729 0.8716 0.8397 0.8276
⋆We use bold to highlight the best performance.
DANRL-ANE model is applicable to various networks, even
on sparse networks or networks with isolated nodes, if we
can obtain sufficient attribute information.
5.3. Attribute enhanced neighborhood
autoencoder vs. ARGA
The above experimental results show that our modeling
of attribute information tends to improve the performance
of the entire DANRL-ANE model. We here choose ARGA
as the benchmark, to explain and analyze our attribute en-
hanced neighborhood autoencoder module, because ARGA is the latest GAE-based research in which node attributes are incorporated in the encoder [40][41].
ARGA learns more robust node representations than
the pioneer work GAE [42]. To obtain a more objective
performance comparison, Pan et al. [28] study ARGA and
use the same datasets in [42], i.e., Citeseer, PubMed and
Cora. We conduct similar experiments. Tables 6 and 7 give the results on the link prediction and node classification tasks, respectively. Here, we use bold to highlight the optimal result.
We find that the performance of our model on almost
all datasets and tasks is ahead of ARGA, which reveals that
the DANRL-ANE model captures the precise neighborhood
semantic information.
5.4. Comparison between modules and
DANRL-ANE
Furthermore, the DANRL-ANE model is optimized by
training three modules, self-built first-order proximity pre-
served, attribute enhanced neighborhood autoencoder and
community-aware skip-gram, iteratively. It is necessary to
further analyze the contribution of each module to the model
performance and discuss the advantages of DANRL-ANE
as a whole model. Here, we conduct experimental comparisons via link prediction and node classification on the Citeseer, PubMed and Cora datasets as well.
in Tables 8 and 9, where the best is in bold. We draw the
following conclusions and analysis.
(1) Regardless of the tasks, compared with the other
modules, attribute enhanced neighborhood autoencoder ba-
sically performs best, which illustrates that our reasonable
modeling for second-order proximity preservation plays a
decisive role.
(2) In the node classification task, the proposed DANRL-
ANE and attribute enhanced neighborhood autoencoder
module have similar performance. However, considering the
diversity of tasks, DANRL-ANE obtains more outstanding
results, which shows the robustness of the proposed model.
5.5. Time Complexity
Time complexity is an important factor to evaluate the
effectiveness of algorithms on large-scale networks [43].
Hence, we further provide the time complexity of the proposed DANRL-ANE and all the state-of-the-art baselines in Table 10, where $d$ denotes the dimension of node representations, $|V|$ represents the number of nodes, $|E|$ is the number of edges, $I$ indicates the number of iterations and $m$ means the number of node attribute values.
pared with most algorithms that only consider the network
structure, our algorithm has higher time complexity, since
we explore the preservation of attribute information and/or
more topological properties. Among the methods taking both the network structure and node attributes into consideration, DANRL-ANE has a higher time complexity only than ANRL-WAN. However, DANRL-ANE has consistently shown better performance than ANRL-WAN, especially on the Flickr dataset. Notably, our method outperforms DANE in both performance and time
complexity.
6. Conclusion
To integrate the microscopic structural and attribute in-
formation for learning the robust and effective node embed-
dings from various networks, we propose a deep coupling
model DANRL-ANE, where three newly designed modules
Table 8
Link prediction results in Comparison between modules and DANRL-ANE
Datasets Citeseer PubMed Cora
Evaluation AUC AUC AUC
DANRL-ANE (self-built first-order) 0.8311 0.6408 0.5611
DANRL-ANE (autoencoder) 0.9514 0.9095 0.8661
DANRL-ANE (skip-gram) 0.9347 0.8689 0.9208
DANRL-ANE 0.9573 0.9439 0.9323
⋆We use bold to highlight the best performance.
Table 9
Node classification results in Comparison between modules and DANRL-ANE
Datasets Citeseer PubMed Cora
Evaluation Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1
DANRL-ANE (self-built first-order) 0.6652 0.5818 0.8241 0.8248 0.3072 0.0671
DANRL-ANE (autoencoder) 0.7237 0.6683 0.8729 0.8716 0.8397 0.8276
DANRL-ANE (skip-gram) 0.6729 0.6298 0.7980 0.8009 0.8058 0.7856
DANRL-ANE 0.7275 0.6787 0.8745 0.8728 0.8419 0.8306
⋆We use bold to highlight the best performance.
Table 10
Time complexity analysis
Algorithm    Time complexity
DeepWalk    O(d|V| log|V|)
node2vec    O(d|V|)
LINE    O(d|E|)
GraRep    O(|V||E| + d|V|^2)
SDNE    O(dI|V|^2)
DANE    O(|V||E| + dI|V|(|V| + m))
ANRL-WAN    O(|V|m + dI|V|(m + 1))
DANRL-ANE    O(|V|^2 + dI|V|(|V| + 1))
are used to preserve the first-order, second-order and high-
order proximity from the structure, respectively. In particu-
lar, the node attributes are incorporated into the adjacency
matrix based on the social homophily, as the input of our
model, so that the structure and attribute information are
explored simultaneously. The extensive experiments on the
tasks of link prediction and node classification show that
the DANRL-ANE model achieves superior performance compared with other representation learning models. The
work demonstrates that integrating more sources of informa-
tion in a principled manner is conducive to learning higher
quality network representation. However, our method still
has some shortcomings. For example, (1) our method is only applicable to undirected and unweighted networks; (2) although our model can achieve good performance, tuning its large number of hyperparameters inevitably consumes time, and it is difficult to obtain the optimal results under time constraints. Therefore, there are
still some directions for us to further explore. We could
extend the work to directed and weighted networks, even
temporal networks and higher-order networks, and design
general and concise models to fit practical scenarios.
CRediT authorship contribution statement
Cong Li: Conceptualization, Methodology, Formal anal-
ysis, Writing-original draft, Writing-review & editing, Fund-
ing acquisition. Min Shi: Data curation, Methodology,
Software, Formal analysis, Writing-original draft. Bo Qu:
Methodology, Formal analysis, Writing-review & editing,
Funding acquisition. Xiang Li: Writing - review & editing,
Supervision, Project administration, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing
financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Acknowledgment
This work is supported by National Natural Science
Foundation of China (Grant No.71731004, No.62002184,
No.62173095, No.61425019), Natural Science Foundation
of Shanghai 21ZR1404700, and the Major Key Project of
PCL (Grant No.PCL2022A03, PCL2021A02, PCL2021A09).
References
[1] X. Wang, L. Q. Nie, X. M. Song, D. X. Zhang, and T.-S. Chua. Uni-
fying virtual and physical worlds: Learning toward local and global
consistency. ACM Transactions on Information Systems (TOIS),
36(1):1–26, 2017.
[2] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A
survey. ACM computing surveys (CSUR), 41(3):1–58, 2009.
[3] A. Grover and J. Leskovec. node2vec: Scalable feature learning for
networks. In Proceedings of the 22nd ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 855–864,
2016.
[4] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning
of social representations. In Proceedings of the 20th ACM SIGKDD
international conference on Knowledge discovery and data mining,
pages 701–710, 2014.
[5] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-
Rad. Collective classification in network data. AI magazine, 29(3):93–
93, 2008.
P. Kazienko and T. Kajdanowicz. Label-dependent node classification
in the network. Neurocomputing, 75(1):199–209, 2012.
[7] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for
social networks. Journal of the American society for information
science and technology, 58(7):1019–1031, 2007.
[8] H. Narayanan, M. Belkin, and P. Niyogi. On the relation between
low density separation, spectral clustering and graph cuts. In NIPS,
volume 19, pages 1025–1032, 2006.
[9] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE.
Journal of machine learning research, 9(11), 2008.
[10] S. C. Yan, D. Xu, B. Y. Zhang, H.-J. Zhang, Q. Yang, and S. Lin.
Graph embedding and extensions: A general framework for dimen-
sionality reduction. IEEE transactions on pattern analysis and
machine intelligence, 29(1):40–51, 2006.
[11] D. X. Wang, P. Cui, and W. W. Zhu. Structural deep network
embedding. In Proceedings of the 22nd ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 1225–
1234, 2016.
[12] P. V. Marsden and N. E. Friedkin. Network studies of social influence.
Sociological Methods & Research, 22(1):127–151, 1993.
[13] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a
feather: Homophily in social networks. Annual review of sociology,
27(1):415–444, 2001.
[14] P. V. Marsden. Homogeneity in confiding relations. Social networks,
10(1):57–76, 1988.
[15] H. C. Gao and H. Huang. Deep attributed network embedding. In
Twenty-Seventh International Joint Conference on Artificial Intelli-
gence (IJCAI), 2018.
[16] L. Z. Liao, X. N. He, H. W. Zhang, and T.-S. Chua. Attributed social
network embedding. IEEE Transactions on Knowledge and Data
Engineering, 30(12):2257–2270, 2018.
[17] C. H. Zheng, L. Pan, and P. Wu. Multimodal deep network embedding
with integrated structure and attribute information. IEEE transactions
on neural networks and learning systems, 31(5):1437–1449, 2019.
[18] Z. Zhang, H. X. Yang, J. J. Bu, S. Zhou, P. G. Yu, J. W. Zhang, M. Ester,
and C. Wang. Anrl: Attributed network representation learning via
deep neural networks. In IJCAI, volume 18, pages 3155–3161, 2018.
[19] D. K. Zhang, J. Yin, X. Q. Zhu, and C. Q. Zhang. Network
representation learning: A survey. IEEE transactions on Big Data,
6(1):3–28, 2018.
[20] S. S. Cao, W. Lu, and Q. K. Xu. Grarep: Learning graph repre-
sentations with global structural information. In Proceedings of the
24th ACM international on conference on information and knowledge
management, pages 891–900, 2015.
[21] P. Goyal and E. Ferrara. Graph embedding techniques, applications,
and performance: A survey. Knowledge-Based Systems, 151:78–94,
2018.
[22] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[23] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[24] D. Nguyen and F. D Malliaros. Biasedwalk: Biased sampling for
representation learning on graphs. In 2018 IEEE International
Conference on Big Data (Big Data), pages 4045–4053, 2018.
[25] J. Tang, M. Qu, M. Z. Wang, M. Zhang, J. Yan, and Q. Z. Mei. Line:
Large-scale information network embedding. In Proceedings of the
24th international conference on world wide web, pages 1067–1077,
2015.
[26] N. Sheikh, Z. Kefato, and A. Montresor. gat2vec: representation
learning for attributed graphs. Computing, 101(3):187–209, 2019.
[27] W. Y. Liu, Z. N. Liu, F. C. Yu, P.-Y. Chen, T. Suzumura, and
G. M. Hu. A scalable attribute-aware network embedding system.
Neurocomputing, 339:279–291, 2019.
[28] S. R. Pan, R. Q. Hu, G. D. Long, J. Jiang, L. Yao, and C. Q. Zhang.
Adversarially regularized graph autoencoder for graph embedding. In
Proceedings of the 27th International Joint Conference on Artificial
Intelligence, pages 2609–2615, 2018.
[29] A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on
web-page clustering. In Workshop on artificial intelligence for web
search (AAAI 2000), volume 58, page 64, 2000.
[30] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[31] R. Salakhutdinov and G. Hinton. Semantic hashing. International
Journal of Approximate Reasoning, 50(7):969–978, 2009.
[32] W. B. Liu, Z. D. Wang, X. H. Liu, N. Y. Zeng, Y. R. Liu, and F. E.
Alsaadi. A survey of deep neural network architectures and their
applications. Neurocomputing, 234:11–26, 2017.
[33] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, volume 2, pages 3111–3119, 2013.
[34] X. Huang, J. D. Li, and X. Hu. Label informed attributed network em-
bedding. In Proceedings of the Tenth ACM International Conference
on Web Search and Data Mining, pages 731–739, 2017.
[35] T. N. Kipf and M. Welling. Semi-supervised classification with graph
convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[36] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets.
Advances in neural information processing systems, 27, 2014.
[37] W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning
on graphs: Methods and applications. IEEE Data(base) Engineering
Bulletin, 2017.
[38] T. Fawcett. An introduction to ROC analysis. Pattern recognition
letters, 27(8):861–874, 2006.
[39] N. Y. Zeng, H. Qiu, Z. D. Wang, W. B. Liu, H. Zhang, and Y. R. Li. A new switching-delayed-PSO-based optimized SVM algorithm for diagnosis of Alzheimer's disease. Neurocomputing, 320:195–202, 2018.
[40] Z. W. Zhang, P. Cui, and W. W. Zhu. Deep learning on graphs: A
survey. IEEE Transactions on Knowledge and Data Engineering,
34(1):249–270, 2022.
[41] Z. H. Wu, S. Pan, F. W. Chen, G. D. Long, C. Q. Zhang, and P. S. Yu. A
comprehensive survey on graph neural networks. IEEE Transactions
on Neural Networks and Learning Systems, 32(1):4–24, 2021.
[42] T. N. Kipf and M. Welling. Variational graph auto-encoders. In
Workshop on Bayesian Deep Learning, 2016.
[43] F. X. Chen, Y. C. Wang, B. Wang, and C.-C. J. Kuo. Graph repre-
sentation learning: a survey. APSIPA Transactions on Signal and
Information Processing, 9, 2020.
Cong Li (M’15) received the
PhD degree in intelligent systems
from Delft University of Technology
(TUDelft), Delft, The Netherlands,
in 2014. She is currently an asso-
ciate professor with the Department of Electronic Engineering, Fudan University, where she works on complex network theory and its applications. Her
research focuses on analysis and modeling of complex net-
works, including network properties, dynamic processes,
network of networks, etc. Her work on these subjects includes 2 (co-)authored research monographs, 1 book chapter, and more than 30 international journal and conference papers.
Min Shi received the BS de-
gree in electronic information en-
gineering from Nantong University,
Jiangsu, China, in 2014. She is cur-
rently pursuing the M.S. degree with
the Department of Information Sci-
ence and Engineering, Fudan Univer-
sity, Shanghai, China. Her current re-
search interests include data mining
and network representation learning.
Bo Qu received the BS and MS degrees in information security and computer science from Shanghai Jiaotong University, China, in 2009 and 2012, respectively. He received the PhD degree in intelligent systems from Delft University of Technology, The Netherlands, in 2017. Before joining Peng Cheng Laboratory as a research associate, he was with Tencent Technology as a researcher in applied network security. His research focuses on network security through network analysis, network representation learning, etc.
Xiang Li (M’05–SM’08) re-
ceived the BS and PhD degrees in
control theory and control engineer-
ing from Nankai University, China,
in 1997 and 2002, respectively. He
was with the City University of Hong
Kong, Int. University Bremen, Shang-
hai Jiao Tong University, and Fudan
University as post-doc research fel-
low, Humboldt research fellow, an as-
sociate professor, professor/distinguished professor in 2002-
2004, 2005-2006, 2004-2007 and 2008-2021, respectively.
Currently, he is the founding director of the Institute of
Complex Networks and Intelligent Systems, Shanghai Re-
search Institute for Intelligent Autonomous Systems, Tongji
University, Shanghai, China. He served/serves as an associate editor of the IEEE Transactions on Circuits and Systems-I: Regular Papers (2010-2015), Research, the Journal of Complex Networks, and the IEEE Circuits and Systems Society Newsletter, and as an Associate Editor (2018-2021) and the Area Editor (since 2022) of the IEEE Transactions on Network Science and Engineering. His main research interests cover
network science and systems control in both theory and
applications. He has (co-)authored 6 research monographs,
7 book chapters, and more than 150 peer-refereed journal
publications and 100+ indexed conference papers. He re-
ceived the IEEE Guillemin-Cauer Best Transactions Paper
Award from the IEEE Circuits and Systems Society in
2005, Shanghai Natural Science Award (1st class) in 2008,
Shanghai Science and Technology Young Talents Award in
2010, National Science Foundation for Distinguished Young
Scholar of China in 2014, National Natural Science Award
of China (2nd class) in 2015, Ten Thousand Talent Program
of China in 2017, TCCT CHEN Han-Fu Award of Chinese
Automation Association in 2019, the Excellent Editor Award
of the IEEE Trans. Network Science and Engineering in
2021, among other awards and honors.