Article

Role Discovery in Networks

Authors:
  • Ryan A. Rossi (Adobe Research)
  • Nesreen K. Ahmed

Abstract

Roles represent node-level connectivity patterns such as star-center, star-edge nodes, near-cliques, or nodes that act as bridges to different regions of the graph. Intuitively, two nodes belong to the same role if they are structurally similar. Roles have mainly been of interest to sociologists, but more recently they have become increasingly useful in other domains. Traditionally, the notion of roles was defined based on graph equivalences such as structural, regular, and stochastic equivalences. We briefly revisit these notions and instead propose a more general formulation of roles based on the similarity of a feature representation (in contrast to the graph representation). This leads us to propose a taxonomy of two general classes of techniques for discovering roles: (i) graph-based roles and (ii) feature-based roles. This survey focuses primarily on feature-based roles. In particular, we introduce a flexible framework for discovering roles using the notion of structural similarity on a feature-based representation. The framework consists of two fundamental components: (1) role feature construction and (2) role assignment using the learned feature representation. We discuss the relevant decisions for discovering feature-based roles and highlight the advantages and disadvantages of the many techniques that can be used for this purpose. Finally, we discuss potential applications, future directions, and challenges.
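As a minimal illustrative sketch of the two-component framework described in the abstract, the toy code below constructs simple structural features per node (degree and mean neighbour degree; an illustrative choice, not the survey's exact feature set) and then assigns roles by clustering the feature vectors with a naive k-means:

```python
# Sketch: (1) role feature construction, (2) role assignment by clustering.
import random
from collections import defaultdict

def structural_features(edges):
    # build adjacency, then one feature vector per node:
    # (degree, mean neighbour degree)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    feats = {}
    for n, nbrs in adj.items():
        deg = len(nbrs)
        feats[n] = (deg, sum(len(adj[m]) for m in nbrs) / deg)
    return feats

def assign_roles(feats, k=2, iters=20, seed=0):
    # naive k-means over the feature vectors
    random.seed(seed)
    pts = list(feats.items())
    centers = [v for _, v in random.sample(pts, k)]

    def nearest(v):
        return min(range(k),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centers[i])))

    for _ in range(iters):
        groups = defaultdict(list)
        for _, v in pts:
            groups[nearest(v)].append(v)
        for i, vs in groups.items():
            centers[i] = tuple(sum(col) / len(vs) for col in zip(*vs))
    return {n: nearest(v) for n, v in pts}

# star graph: hub 0 connected to leaves 1-4
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
roles = assign_roles(structural_features(edges))
```

On the star graph in this toy example, the hub and the leaves end up in different roles because their feature vectors differ.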


... Numerous approaches have been proposed in the literature for the task of structural role discovery on graphs, where nodes on a graph are divided into classes of structurally similar nodes called 'roles' [1]. Early approaches in this area relied on graph sub-structures known as graphlets or motifs [2]. ...
... These methods are designed to learn higher-order structural relationships than those that can be discovered by small graphlets. However, in many cases, these alternative approaches come at the cost of interpretability. When applied to graphs that are too large to be visualised reasonably, it is often difficult to understand the substantive meaning of a given set of structural roles. ...
... Role discovery is the task of grouping nodes which share similar structural patterns in a graph into distinct classes [1]. Many modern approaches to role discovery rely on graph embedding, where nodes are transformed into low-dimensional vector representations [9,10,3,4]. ...
Preprint
Full-text available
Role discovery is the task of dividing the set of nodes on a graph into classes of structurally similar roles. Modern strategies for role discovery typically rely on graph embedding techniques, which are capable of recognising complex graph structures when reducing nodes to dense vector representations. However, when working with large, real-world networks, it is difficult to interpret or validate a set of roles identified according to these methods. In this work, motivated by advancements in the field of explainable artificial intelligence (XAI), we propose Surrogate Explanation for Role Discovery (SERD), a new framework for interpreting role assignments on large graphs using small subgraph structures known as graphlets. We demonstrate our framework on a small synthetic graph with prescribed structure, before applying it to a larger real-world network. In the second case, a large, multidisciplinary citation network, we successfully identify a number of important citation patterns or structures which reflect interdisciplinary research.
... Depending on the granularity of partitioning, multiple roles can be observed within a single time slice (Rossi and Ahmed 2014). For example, in a human-relationship network in which people are nodes, and the connections between those people are links, a node that plays the role of a team leader in a company during weekdays forms a local structure with a tree topology (reflecting the structure of the company). ...
... Although previous research has acknowledged the multiple-role phenomenon, few specific approaches to the estimation of multiple roles have been presented (Rossi and Ahmed 2014;Liu et al. 2021). In earlier work (Liu et al. 2021), we proposed a method to predict multiple roles with an adversarial learning approach by treating the multiple-role discovery task as a multi-label classification problem. ...
... Methods aimed at role discovery obtain a vector representation of nodes from the network structure, cluster nodes by specific similarity criteria, and assign role labels to each cluster. In the task of role assignment, the conventional methods can be divided into three types: hard single-role discovery (assigning each node to exactly one role), soft single-role discovery (assigning each node a distribution over roles), and multiple-role discovery (assigning each node one or more roles) (Rossi and Ahmed 2014). A large number of works have attempted hard single-role discovery. ...
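The three assignment types mentioned in the excerpt can be contrasted in a small sketch. The centroids, the example node, and the membership threshold below are invented for illustration:

```python
# Hard, soft, and multiple role assignment derived from one soft membership.
import math

def memberships(vec, centroids):
    # soft assignment: distance-based similarity, normalised to a distribution
    sims = [math.exp(-sum((a - b) ** 2 for a, b in zip(vec, c)))
            for c in centroids]
    total = sum(sims)
    return [s / total for s in sims]

centroids = [(4.0, 1.0), (1.0, 4.0)]   # hypothetical role centroids
node = (3.5, 1.5)                      # hypothetical node feature vector

soft = memberships(node, centroids)                  # distribution over roles
hard = max(range(len(soft)), key=soft.__getitem__)   # single most likely role
multi = [i for i, p in enumerate(soft) if p > 0.2]   # all roles above a threshold
```

Here the node sits close to the first centroid, so the hard assignment, the dominant soft membership, and the thresholded multiple-role set all agree on role 0.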
Article
Full-text available
In complex networks, the role of a node is based on the aggregation of structural features and functions. However, in real networks, it has been observed that a single node can have multiple roles. Here, the roles of a node can be defined in a case-by-case manner, depending on the graph data mining task. Consequently, a significant obstacle to achieving multiple-role discovery in real networks is finding the best way to select datasets for pre-labeling. To meet this challenge, this study proposes a flexible framework that extends a single-role discovery method by using domain adversarial learning to discover multiple roles for nodes. Furthermore, we propose a method to assign sub-networks, derived through community extraction methods, to a source network and a validation network as training datasets. Experiments to evaluate accuracy conducted on real networks demonstrate that the proposed method can achieve higher accuracy and more stable results.
... To get closer to the objective of our study, let us also mention that in the recent times role discovery (especially topological feature-based [13]) has become a popular topic, most notably in the domain of non-attributed social network analysis [13][14][15][16][17][18][19]. In the network context, roles refer to clusters, or classes, of nodes, where the nodes from the same cluster are structurally similar to each other in some way. ...
... The main idea behind role discovery is to group nodes by their connectivity patterns, where each group represents some topological role such as hub, bridge, near-clique, etc. Topological roles indicate which functions nodes serve in the network [13]. Initially, role discovery was of interest in sociology, where it was used to study the interactions between social actors and assign roles to them, but the networks in these studies were very small [31,32]. ...
Preprint
Full-text available
In this paper, we propose a framework for solving the novel problem of role discovery in a public transportation network (PTN). We model a PTN as a weighted node-attributed network whose nodes are public transport stations (stops) grouped with respect to their geospatial position, node attributes store information about the social infrastructure around the stations (stops), and weighted links integrate information about the travelling distance and the number of hops in the transportation routes between the stations (stops). Our framework discovers meaningful node roles in terms of both topological and infrastructural features of a PTN and is capable of extracting useful insights about the overall PTN’s efficiency. We apply the framework to the newly collected open data of St Petersburg, Russia, and point out some transportation and infrastructural weaknesses that should be taken into consideration by the city administration to improve the PTN in the future.
... Role-mining or role-discovery is a new concept compared to community detection for network data. Recently, Rossi and Ahmed (2015) provided a descriptive and well-organized survey of the existing role discovery techniques in the literature. ...
... Our simulation methodology is based on the concept of role-mining (Rossi and Ahmed, 2015), which is a clustering approach that defines meaningful roles (clusters) of the nodes in a network based on the graph properties of the nodes. Our simulation approach is based on the hypothesis that nodes in a network are divided into several roles such as server, router, personal machine, university machine, and so on, where each role usually communicates with some other roles at a certain frequency. ...
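The hypothesis described in the excerpt, that nodes fall into roles and each role communicates with other roles at certain frequencies, can be sketched as a toy edge generator. The role names and frequency values below are invented, not taken from the cited work:

```python
# Toy generator: edges are sampled according to a role-to-role
# communication frequency matrix (role names and values are made up).
import random

random.seed(1)
roles = {"server": [0], "client": [1, 2, 3, 4]}
# freq[a][b]: probability that a node of role a contacts a node of role b
freq = {"server": {"server": 0.0, "client": 0.9},
        "client": {"server": 0.8, "client": 0.1}}

def sample_edges(n_trials=1000):
    edges = []
    names = list(roles)
    for _ in range(n_trials):
        ra, rb = random.choice(names), random.choice(names)
        if random.random() < freq[ra][rb]:
            u, v = random.choice(roles[ra]), random.choice(roles[rb])
            if u != v:
                edges.append((u, v))
    return edges

edges = sample_edges()
# with these frequencies, most traffic involves the server node
server_edges = sum(1 for u, v in edges if u == 0 or v == 0)
```

Because server-client contact frequencies dominate, the vast majority of sampled edges touch the server node, mirroring the hypothesis that each role communicates with certain other roles at characteristic rates.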
Preprint
Full-text available
Detection of malicious networks (botnets) is becoming a major concern as they pose a serious threat to network security. However, botnet detection methods often perform very poorly on real-life datasets because the methods are not developed based on a real-life botnet dataset. A crucial reason for this is the scarcity of large-scale, real-life botnet datasets: due to security and privacy concerns, organizations do not publish their real-life botnet datasets. Realizing the need for a real-life large-scale botnet dataset, in this paper we develop a simulation methodology to simulate a large-scale botnet dataset from a real-life botnet dataset. This simulation methodology is based on Markov chain and role-mining approaches. Besides simulating the degree distribution, our simulation methodology also simulates triangles (community structures). We propose a novel scalable algorithm using parallel computing that generates large-scale botnet graphs from a small-size input dataset. To evaluate the performance of our simulation methodology, we compare our simulated graph with the original graph and with the graph simulated by the preferential attachment (PA) algorithm based on the distributions of triangles, indegrees, and outdegrees. Results demonstrate that the distributions of the simulated graph generated by our methodology are very similar to the distributions of the original graph, with minor real-life random variations. Results also demonstrate that our simulation algorithm substantially outperforms the PA algorithm in simulating the distributions of triangles and botnet subgraphs. To further demonstrate the accuracy of the botnet simulation, we provide a separate comparison between the botnet subgraphs of the simulated and original graphs that shows the similarity of our simulated botnet subgraphs to the original botnet subgraph.
A comparison of our simulated scaled-up graph with the original graph demonstrates that our methodology preserves the triangle distribution and the botnet subgraphs of the original graph, whereas the PA algorithm fails to preserve both in the scaled-up graph.
... The survey by Rossi and Ahmed [35] puts forward an application-based approach to node role extraction that evaluates the node roles by how well they can be utilized in a downstream machine learning task. However, this perspective is task-specific and more applicable to node embeddings based on roles rather than the actual extraction of roles. ...
... In this section, we diverge a little from the optimization-based perspective of the paper up to this point and showcase the effectiveness of the information content of the extracted roles in a few-shot learning downstream task. This links our approach to the application-based role evaluation approach of [35]. We employ an embedding based on the minimizer of the long-term cost function (eq. ...
Preprint
Full-text available
Similar to community detection, partitioning the nodes of a network according to their structural roles aims to identify fundamental building blocks of a network. The found partitions can be used, e.g., to simplify descriptions of the network connectivity, to derive reduced-order models for dynamical processes unfolding on networks, or as ingredients for various graph mining tasks. In this work, we offer a fresh look on the problem of role extraction and its differences to community detection, and present a definition of node roles related to graph-isomorphism tests, the Weisfeiler-Leman algorithm, and equitable partitions. We study two associated optimization problems (cost functions) grounded in ideas from graph isomorphism testing, and present theoretical guarantees associated with the solutions of these problems. Finally, we validate our approach via a novel "role-infused partition benchmark", a network model from which we can sample networks in which nodes are endowed with different roles in a stochastic way.
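The connection to the Weisfeiler-Leman algorithm mentioned in the abstract can be illustrated with a minimal 1-WL colour refinement pass: nodes that end up with the same stable colour form an equitable partition and hence candidate roles. This is a generic sketch of colour refinement, not the paper's optimization method:

```python
# Minimal 1-WL colour refinement: repeatedly relabel each node by the
# multiset of its neighbours' colours until the colouring stabilises.
def wl_colors(adj, rounds=10):
    colors = {n: 0 for n in adj}   # start with a uniform colour
    for _ in range(rounds):
        sig = {n: (colors[n], tuple(sorted(colors[m] for m in adj[n])))
               for n in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new = {n: relabel[sig[n]] for n in adj}
        if new == colors:          # stable colouring reached
            break
        colors = new
    return colors

# star graph: centre 0, leaves 1-3
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
colors = wl_colors(adj)
```

On the star graph, refinement separates the centre from the structurally interchangeable leaves, which is exactly the two-role equitable partition one would expect.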
... Since the emergence of two landmark network models, the small-world model by Watts and Strogatz (WS model) [1] and the scale-free network model by Barabasi and Albert (BA model), research on network models has been increasing rapidly [2]. Network models used in data research include online social networks [3], mail networks [4], biological networks [5], annotated networks [6], and the online dating market [7]. Recently, complex networks have been extensively evaluated and applied in several applications by physicists [8][9][10][11][12][13][14][15]. ...
... [45]; (c) physicians trust network [46]; (d) student interpersonal network [47]. ...
Article
Full-text available
Networks are prevalent in real life, and the study of network evolution models is very important for understanding the nature and laws of real networks. The distribution of the initial degree of nodes in existing classical models is constant or uniform. The model we propose exhibits a binomial distribution, which is consistent with real network data. The theoretical analysis shows that the proposed model is scale-free at different probability values and that its clustering coefficients are adjustable; the Barabasi-Albert model is the special case p = 0 of our model. In addition, the analytical results for the clustering coefficients can be estimated using mean-field theory. The mean clustering coefficients calculated from the simulated data and the analytical results tend to be stable. The model also exhibits small-world characteristics and has good reproducibility for short distances of real networks. Our model combines three network characteristics, scale-freeness, high clustering coefficients, and small-world characteristics, which is a significant improvement over traditional models with only one or two of these characteristics. The theoretical analysis procedure can be used as a theoretical reference for various network models to study the estimation of clustering coefficients. The existence of stable equilibrium points in the model addresses the controversy over whether the scale-free property is universal, and this explanation provides a new way of thinking to understand the problem.
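The clustering coefficients discussed in the abstract can be computed directly from the standard definition C_i = 2 L_i / (k_i (k_i - 1)), where L_i counts links among node i's neighbours; the small graph below is a toy example, unrelated to the proposed model:

```python
# Local clustering coefficient from its definition:
# C_i = 2 * (links among neighbours of i) / (k_i * (k_i - 1))
def local_clustering(adj, n):
    nbrs = adj[n]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i, u in enumerate(nbrs)
                for v in nbrs[i + 1:] if v in adj[u])
    return 2 * links / (k * (k - 1))

# triangle 0-1-2 plus a pendant node 3 attached to 0
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
c0 = local_clustering(adj, 0)   # 1 link among 3 neighbours -> 1/3
c1 = local_clustering(adj, 1)   # neighbours 0 and 2 are linked -> 1.0
```

Averaging such local values over all nodes gives the mean clustering coefficient that the abstract compares against its mean-field estimates.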
... This proximity-based network representation learning is also known as community-based representation learning (Rossi et al. 2019). Another way to measure node similarity is to consider the structural roles of nodes in the network (Rossi and Ahmed 2015). Referring to nodes with similar structural properties, structural roles define the set of nodes that are structurally more similar to nodes within a community than to nodes outside the community. This approach aims to embed nodes with structurally similar neighborhoods together, while allowing nodes to be farther apart in the network. ...
... The main mechanisms of structural similarity-based representation learning methods utilize the initial set of structural features of the nodes to produce feature-based roles Rossi and Ahmed (2015). ...
Article
Full-text available
Analysis of large-scale networks generally requires mapping high-dimensional network data to a low-dimensional space. We thus need to represent nodes and connections accurately and effectively, and representation learning is a promising method for this. In this paper, we investigate a novel social similarity-based method for learning network representations. We first introduce neighborhood structural features for representing node identities based on higher-order structural parameters. The node representations are then learned by a random walk approach based on these structural features. Our proposed truss2vec is able to maintain both structural similarity of nodes and domain similarity. Extensive experiments have shown that our model outperforms state-of-the-art solutions.
... Numerous approaches have been proposed in the literature for the task of structural role discovery on graphs, where nodes on a graph are divided into classes of structurally equivalent nodes [21]. Early approaches in this area relied on graph sub-structures known as graphlets or motifs [12]. ...
... Motifs and graphlets represent a powerful tool for expressing graph structure, and have been employed in graph learning tasks such as node classification and anomaly detection [7]. Figure 1 illustrates a subset of graphlets with 2, 3, 4 and 5 nodes, and includes each of the distinct orbits on these graphlets, as enumerated by [18]. Role discovery is the task of grouping nodes which share similar structural patterns in a graph into distinct classes [21]. Many modern approaches to role discovery rely on graph embedding, where nodes are transformed into low-dimensional vector representations [3,10,20,8]. ...
Preprint
Full-text available
Role discovery is the task of dividing the set of nodes on a graph into classes of structurally similar roles. Modern strategies for role discovery typically rely on graph embedding techniques, which are capable of recognising complex local structures. However, when working with large, real-world networks, it is difficult to interpret or validate a set of roles identified according to these methods. In this work, motivated by advancements in the field of explainable artificial intelligence (XAI), we propose a new framework for interpreting role assignments on large graphs using small subgraph structures known as graphlets. We demonstrate our methods on a large, multidisciplinary citation network, where we successfully identify a number of important citation patterns which reflect interdisciplinary research.
... Definition 5 (Roles). Roles define sets of nodes that are more structurally similar to nodes inside the set than outside [RA15b]. The terms role and position are used synonymously. ...
... As a result, the latent feature vectors or embeddings given as output from an embedding method can be thought of as either community-based or role-based representations [HERPF10][GL16]. In this light, recent embedding methods can be seen as approaches for modeling communities or (feature-based) roles [RA15b]. We refer the interested readers to our paper [RJK+19] that was accepted at TKDD 2020 for more details. ...
Thesis
Graphs are ubiquitous as they naturally capture interactions between entities, such as user interaction in online social media, paper citations in bibliographic networks, and user-product preferences in sales networks. Recently, graph representation learning has gained significant popularity in both academia and industry thanks to its state-of-the-art performance in a variety of downstream machine learning (ML) tasks, such as friend recommendations and anomaly detection. Specifically, node representation learning (embedding) aims to find a dense vector of rich latent features per entity that can be used in ML tasks. However, these dense representations with fixed dimensions come with computational and storage challenges for real-world graphs with many millions or billions of nodes, and the "black-box" nature of the latent features impedes interpretability. On the other hand, graph summarization aims to find a concise and interpretable representation of the original graph that describes its key information, but it is often lossy and trades off space and performance in ML tasks. In this thesis, we bridge the two lines of research, node embedding and graph summarization: we introduce scalable methods for generating summaries of latent or non-latent (original) node features that achieve the state-of-the-art performance on ML tasks while requiring significantly reduced storage and supporting interpretability. Specifically, we introduce a new problem, latent network summarization, which summarizes the graph structural features in static networks as latent node embeddings for storage and query efficiency, and extend this idea to incorporate temporal proximity in temporal summaries of continuous-time dynamic networks. 
We also perform an extensive systematic study of temporal summaries and show that they capture the graph structure and temporal dependency at least as well as recently-proposed dynamic embedding approaches, while having significantly less complexity (i.e., no transitional or latent variables). Unlike methods that are based on complex models as "black boxes", our temporal summaries are easy-to-understand, which motivates their usage for practitioners in predictive applications. Finally, we summarize the non-latent graph features by modeling feature importance as the high-level knowledge through traditional and deep learning models that can be used for graph analysis and transfer learning. Throughout the thesis, we demonstrate the effectiveness, scalability and space efficiency of our methods on industrial applications such as entity linkage, user stitching, professional role inference, and temporal link prediction, and present insights that can inform further methodological development and applications.
... The network density (ND) represents the proportion of possible connections that actually exist in a network. The closer the network density is to 1, the more connections the members have on average [62]. Assuming that there are N members in the network, the maximum number of connections in the network is theoretically N(N − 1). ...
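The density computation described in the excerpt is straightforward to state in code; the member count and edge list below are toy values, not the survey data:

```python
# Network density for a directed network of N members:
# density = existing connections / N*(N-1) possible ordered pairs.
def network_density(n_members, edges):
    possible = n_members * (n_members - 1)
    return len(set(edges)) / possible

# 4 members, 3 directed connections -> 3 / 12
nd = network_density(4, [(1, 2), (2, 3), (3, 1)])
```

A density near 1 would mean almost every ordered pair of members is connected; here the toy network reaches only a quarter of its theoretical maximum.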
Article
Full-text available
In China, private-owned cooperatives are becoming increasingly involved in agricultural production. In order to find the key characteristics of smallholders’ social networks after the appearance of cooperatives and to better organize different farmland operators, this study completed a field survey of 114 smallholders who adopted the farmland trusteeship service of a private-owned cooperative in China and applied social network analysis to reveal the following results. (1) Compared to the theoretical ideal value, smallholders’ social networks showed low network density, low efficiency, and little relevancy. (2) In the social networks of mechanical-sharing, neighbor, kinship, and labor-sharing relationships, some isolated nodes existed, but no isolated nodes were found in the synthetic network. (3) The mechanical-sharing relationship among smallholders was stronger than the other relationships. (4) Machinery owners, farmers whose plots are at the geometric center, and experienced older farmers showed higher centralities in the network, but village cadres did not. (5) The centralities and QAP correlation coefficients among different networks inside the cooperative were lower than those inside a single village. As a result, this paper confirms that the ability of cooperatives to organize farmers’ social networks is not ideal. Farmers entrusting their farmland to a cross-village cooperative does not help them form a larger social network than their villages. In the future, the answer to the question of “who will farm the land” will still lie with the professional farmers and highly autonomous cooperatives.
... Over the past decades, many methods have been proposed to determine the appropriate number of roles (or clusters). Among them, Akaike's information criterion (AIC) and Minimum Description Length (MDL) are two reliable approaches (Cook et al., 2007; Grünwald and Grunwald, 2007; Rossi and Ahmed, 2014). ...
Article
Service bottlenecks are a key barrier to building a resilient public transport system. In this paper, we propose a new approach to automatically extract the role of a station in dynamical public transport flow networks based on the emerging role discovery method in network science. The term "role" in this study refers to the distinctive position or function that a station plays within the public transport flow network. Using smart card data from Nanjing public bike-sharing agencies, we first construct dynamical public transport flow networks with notions of dynamical graphs and edges. We then develop a dynamical algorithm to recursively compute the structural flow characteristics of nodes in passenger flow networks. Non-negative Matrix Factorization is conducted to extract the role memberships from the derived structural feature matrix and interpret each role in terms of measurements with practical values. The network hubs and potential service bottlenecks are then identified based on their operating characteristics and dynamics. Furthermore, the day-to-day and within-day role dynamics of public transport stations over time are unveiled. The results contribute to a better understanding of the interplay between stations in the network, and the identification of roles provides insight for public transport agencies to improve service resilience.
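The role-extraction step described above, factorizing a structural feature matrix with Non-negative Matrix Factorization, can be sketched with plain multiplicative updates. The feature matrix below is a toy example, not smart card data:

```python
# Sketch: factor a (nodes x features) matrix V into W (role memberships)
# and H (role definitions) via NMF multiplicative updates.
import numpy as np

def nmf(V, k=2, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# two clear structural patterns: "hub-like" rows and "leaf-like" rows
V = np.array([[5.0, 0.1],
              [5.0, 0.2],
              [0.1, 4.0],
              [0.2, 5.0]])
W, H = nmf(V)
roles = W.argmax(axis=1)   # dominant role per node
```

Each row of W is a soft role membership vector; taking the argmax per row recovers the two groups of structurally similar nodes that were built into V.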
... Moreover, the changing role of a person in such networks was also recognized [49]. If the interconnectivity of social members is taken into consideration, a member can play different roles in one or more communities [50][51][52]. Role detection is usually done with blockmodelling, which identifies nodes with structural equivalence [53,54] and could also be used to detect structural changes in a network [50]. These methods are very convenient when dynamic online communities [50] or social rating platforms [55] are investigated. ...
Article
Full-text available
Central actors or opinion leaders are in the right structural position to spread relevant information or convince others about adopting an innovation or behaviour change. Who is a central actor or opinion leader can be conceptualised in various ways. Widely accepted centrality measures do not take into account that those in central positions in the social network may change over time. A longitudinal comparison of the set and importance of opinion leaders is problematic with these measures and therefore needs a novel approach. In this study, we investigate ways to compare the stability of the set of central actors over time. Using longitudinal survey data from primary schools (where the members of the social networks do not change much over time) on advice-seeking and friendship networks, we find that who occupies the central positions is relatively unstable, however we define centrality. We propose the application of combined indices in order to achieve more efficient targeting results. Our results suggest that because opinion leaders may change over time, researchers should be careful about relying on simple centrality indices from cross-sectional data to gain and interpret information (for example, in the design of prevention programs, network-based interventions or infection control) and must rely on more diverse structural information instead.
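One simple way to quantify the stability of the set of central actors over time, in the spirit of the abstract above, is to compare the top-ranked nodes by degree at two observation waves. The two toy edge lists below are invented, not the survey data:

```python
# Overlap of top-k degree-central actors across two observation waves.
from collections import Counter

def top_k(edges, k=2):
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {n for n, _ in deg.most_common(k)}

wave1 = [(1, 2), (1, 3), (1, 4), (2, 3)]   # actor 1 is most central
wave2 = [(2, 1), (2, 4), (2, 5), (3, 4)]   # actor 2 takes over
stability = len(top_k(wave1) & top_k(wave2)) / 2   # fraction of shared top-2
```

A value well below 1 indicates that the set of central actors changed between waves, which is the kind of instability the study reports for its school networks.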
... HIN2Vec [3] is a metapath-based heterogeneous embedding method that can combine several fixed-length metapaths to guide random walks and perform multiple prediction tasks to learn node embeddings effectively [8,15,22]. Low-dimensional feature representation has a wide range of utilities in various network analysis tasks such as link prediction [4,8], node classification, recommendation systems [23,24], relational mining [25,26], and role discovery [27]. ...
Article
Full-text available
Heterogeneous Information Networks (HINs) consist of multiple categories of nodes and edges and encompass rich semantic information. Representing HINs in a low-dimensional feature space is challenging due to its complex structure and rich semantics. In this paper, we focus on link prediction and node classification by learning efficient low-dimensional feature representations of HINs. Metapath-guided walkers have been extensively studied in the literature for learning feature representations. However, the metapath walker does not control the length of random walks, resulting in weak structural and semantic information embeddings. In this work, we present an influence propagation controlled metapath-guided random walk model (called IPCMetapath2Vec) for representation learning in HINs. The model works in three phases: first, we perform node transition to generate a metapath-guided random walk, which is conditioned on two factors: (i) type mapping of the next node according to the metapath, and (ii) compute influence propagation score for each node and detect potential influencers on the walk by a threshold based filter. Next, we provide the collected random walks as input to the skip-gram model to learn each node’s feature representation. Lastly, we employ an attention mechanism that aggregates the learned feature representations of each node from various semantic metapath-guided walks, preserving the importance of different semantics. We use these network representation features to address link prediction and multi-label node classification tasks. Experimental results on two public HIN datasets, namely DBLP and IMDB, show that our model outperforms the state-of-the-art representation learning models such as DeepWalk, Node2vec, Metapath2Vec, and HIN2Vec by 4.5% to 17.2% in terms of micro-F1 score for multi-label node classification and 4% to 14.50% in terms of AUC-ROC score for link prediction.
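The metapath-guided walk underlying such models can be sketched as follows; the toy heterogeneous network, node types, and metapath are invented, and this is not the IPCMetapath2Vec model itself:

```python
# Metapath-guided random walk: the next node is restricted to the
# node type required by the metapath pattern.
import random

random.seed(7)
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "v1": "V"}
adj = {"a1": ["p1", "p2"], "a2": ["p2"],
       "p1": ["a1", "v1"], "p2": ["a1", "a2", "v1"],
       "v1": ["p1", "p2"]}

def metapath_walk(start, metapath, length):
    walk = [start]
    for i in range(1, length):
        want = metapath[i % len(metapath)]
        candidates = [n for n in adj[walk[-1]] if node_type[n] == want]
        if not candidates:   # walk cannot continue along the metapath
            break
        walk.append(random.choice(candidates))
    return walk

walk = metapath_walk("a1", ["A", "P"], 6)   # follow the A-P-A-P... pattern
```

Such type-constrained walks are what a skip-gram model would then consume as "sentences"; here every visited node matches the alternating A-P pattern.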
... It is widely used in various downstream tasks such as role classification. The concept of roles first appeared in sociology [1] for mining potential social relationships, and it has gradually been applied to complex networks, for example in the analysis of traffic network congestion [2]. By measuring structural similarity [3], role-similar nodes can be found in distant locations, even in different subgraphs of disconnected networks. ...
Article
Full-text available
Role-based network embedding aims to embed role-similar nodes into a similar embedding space, which is widely used in graph mining tasks such as role classification and detection. Roles are sets of nodes in graph networks with similar structural patterns and functions. However, the role-similar nodes may be far away or even disconnected from each other. Meanwhile, the neighborhood node features and noise also affect the result of the role-based network embedding, which are also challenges of current network embedding work. In this paper, we propose a Role-based network Embedding via Quantum walk with weighted Features fusion (REQF), which simultaneously considers the influence of global and local role information, node features, and noise. Firstly, we capture the global role information of nodes via quantum walk based on its superposition property which emphasizes the local role information via biased quantum walk. Secondly, we utilize the quantum walk weighted characteristic function to extract and fuse features of nodes and their neighborhood by different distributions which contain role information implicitly. Finally, we leverage the Variational Auto-Encoder (VAE) to reduce the effect of noise. We conduct extensive experiments on seven real-world datasets, and the results show that REQF is more effective at capturing role information in the network, which outperforms the best baseline by up to 14.6% in role classification, and 23% in role detection on average.
... The role that nodes play within the network depends more on the structure of the network around them than on the distance between them. (See [40] for a survey on roles.) The next four algorithms aim to create embeddings that capture structural properties of the network. ...
Article
Full-text available
Users on social networks such as Twitter interact with each other without much knowledge of the real identity behind the accounts they interact with. This anonymity has created a perfect environment for bot accounts to influence the network by mimicking real-user behaviour. Although not all bot accounts have malicious intent, identifying bot accounts in general is an important and difficult task. In the literature there are three distinct types of feature sets one could use for building machine learning models for classifying bot accounts: user profile metadata, natural language (NLP) features extracted from user tweets, and features extracted from the underlying social network. Profile metadata and NLP features are typically explored in detail in the bot-detection literature. At the same time, less attention has been given to the predictive power of features that can be extracted from the underlying network structure. To fill this gap, we explore and compare two classes of embedding algorithms that can take advantage of the information that network structure provides. The first class consists of classical embedding techniques, which focus on learning proximity information. The second class consists of structural embedding algorithms, which capture the local structure of a node's neighbourhood. We show that features created using structural embeddings have higher predictive power when it comes to bot detection. This supports the hypothesis that the local social network formed around bot accounts on Twitter contains valuable information that can be used to identify bot accounts.
... As a result, two nodes of different degrees can be regularly equivalent if they have the same types of neighbours. Similar definitions are used, for example, in [28]. A notion of structural similarity similar to that proposed in [16] is considered in [33]. ...
Preprint
Full-text available
An embedding is a mapping from a set of nodes of a network into a real vector space. Embeddings can have various aims, such as capturing the underlying graph topology and structure, node-to-node relationships, or other relevant information about the graph, its subgraphs or the nodes themselves. A practical challenge with using embeddings is that there are many available variants to choose from. Selecting a small set of the most promising embeddings from the long list of possible options for a given task is challenging and often requires domain expertise. Embeddings can be categorized into two main types: classical embeddings and structural embeddings. Classical embeddings focus on learning both local and global proximity of nodes, while structural embeddings learn information specifically about the local structure of a node's neighbourhood. For classical node embeddings there exists a framework which helps data scientists to identify (in an unsupervised way) a few embeddings that are worth further investigation. Unfortunately, no such framework exists for structural embeddings. In this paper we propose a framework for unsupervised ranking of structural graph embeddings. The proposed framework, apart from assigning an aggregate quality score to a structural embedding, additionally gives a data scientist insights into the properties of this embedding. It reports which predefined node features the embedding learns, how well it learns them, and which dimensions in the embedded space represent those features. Using this information the user gets a level of explainability for an otherwise complex black-box embedding algorithm.
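A minimal sketch of the kind of probe such a framework can apply: correlate each embedding dimension with a predefined node feature (degree, here) to find where, and how well, the feature is represented. The toy embedding and feature values below are invented for illustration, not taken from the paper:

```python
import math

# Toy data: per-node degree (a predefined structural feature) and a
# 2-dimensional "embedding". Dimension 0 is constructed to track degree,
# dimension 1 is unrelated; all values are illustrative.
degree    = [1, 2, 3, 4, 5, 6]
embedding = [(0.9, 0.3), (2.1, -0.5), (2.9, 0.1),
             (4.2, 0.4), (5.1, -0.2), (5.8, 0.0)]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Score each dimension by |correlation| with the feature; the best
# dimension indicates where the embedding stores degree information.
scores = [abs(pearson([e[d] for e in embedding], degree)) for d in range(2)]
best_dim = max(range(2), key=lambda d: scores[d])
assert best_dim == 0 and scores[0] > 0.99
```

A real framework would repeat this over many predefined structural features and aggregate the per-feature scores into one quality score per embedding.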
... The learned encoding will introduce problems for the downstream decoder due to the presence of symmetries in the graphs that make decoding challenging. In particular, when nodes have the same structural role within a graph, their feature representations after passing through an MPNN will be identical; such nodes are known as regularly equivalent [38,46]. Such cases arise when a graph contains a bipartite subgraph, formed by groups of regularly equivalent nodes connected by edges, and the bipartite graph is neither complete nor without edges. ...
Preprint
Full-text available
In this work, we address the problem of modeling distributions of graphs. We introduce the Vector-Quantized Graph Auto-Encoder (VQ-GAE), a permutation-equivariant discrete auto-encoder designed to model the distribution of graphs. By exploiting the permutation-equivariance of graph neural networks (GNNs), our autoencoder circumvents the problem of the ordering of the graph representation. We leverage the capability of GNNs to capture local structures of graphs while employing vector quantization to prevent the mapping of discrete objects to a continuous latent space. Furthermore, the use of autoregressive models enables us to capture the global structure of graphs via the latent representation. We evaluate our model on standard datasets used for graph generation and observe that it achieves excellent performance on some of the most salient evaluation metrics compared to the state of the art.
... Over the past decades, many methods have been proposed for determining the appropriate number of roles (or clusters). Among them, Akaike's information criterion (AIC) and the Minimum Description Length (MDL) principle are two reliable approaches (Cook et al., 2007; Grünwald and Grunwald, 2007; Rossi and Ahmed, 2014). ...
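As a sketch of how an information criterion trades goodness-of-fit against the number of roles, the toy code below scores candidate cluster counts with an AIC-style penalty over illustrative 2-D feature vectors. The brute-force clustering and the exact penalty form are simplifying assumptions for illustration, not the procedure of any cited work:

```python
import math
from itertools import product

# Toy 2-D "role feature" vectors: two well-separated groups of four nodes
# (values are illustrative).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1)]

def best_sse(points, k):
    """Exact minimum within-cluster sum of squared errors over all
    assignments (brute force; fine for this toy size)."""
    best = math.inf
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:      # require k non-empty clusters
            continue
        sse = 0.0
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            cx = sum(p[0] for p in members) / len(members)
            cy = sum(p[1] for p in members) / len(members)
            sse += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)
        best = min(best, sse)
    return best

def aic(sse, k, n, d=2):
    # Gaussian log-likelihood up to constants, plus 2 * (centroid parameters).
    return n * math.log(sse / n) + 2 * k * d

n = len(points)
scores = {k: aic(best_sse(points, k), k, n) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)
assert best_k == 2   # the criterion recovers the two planted groups
```

MDL-based selection follows the same pattern, with the penalty replaced by the cost of encoding the model.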
... The role that nodes play within the network depends more on the structure of the network around them than on the distance between them. (See [29] for a survey on roles.) The next four algorithms aim to create embeddings that capture structural properties of the network. ...
Preprint
Full-text available
Users on social networks such as Twitter interact with and are influenced by each other without much knowledge of the identity behind each user. This anonymity has created a perfect environment for bot and hostile accounts to influence the network by mimicking real-user behaviour. To combat this, research into designing algorithms and datasets for identifying bot users has gained significant attention. In this work, we highlight various techniques for classifying bots, focusing on the use of node and structural embedding algorithms. We show that embeddings can be used as unsupervised techniques for building features with predictive power for identifying bots. By comparing features extracted from embeddings to other techniques such as NLP, user profile and node features, we demonstrate that embeddings can be used as a unique source of predictive information. Finally, we study the stability of features extracted using embeddings for tasks such as bot classification by artificially introducing noise into the network. The degradation of classification accuracy is comparable to that of models trained on carefully designed node features, hinting at the stability of embeddings.
... However, we can cluster nodes according to criteria that are different from the density of their connections. This is particularly true for directed networks, where nodes can be clustered together based on the similarity of connectivity patterns that do not require them to share connections (Malliaros and Vazirgiannis [2013], Rossi and Ahmed [2014]). This notion of clusters is akin to the detection of roles. ...
Preprint
Mesoscale structures are an integral part of the abstraction and analysis of complex systems. They reveal a node's function in the network, and facilitate our understanding of the network dynamics. For example, they can represent communities in social or citation networks, roles in corporate interactions, or core-periphery structures in transportation networks. We usually detect mesoscale structures under the assumption of independence of interactions. Still, in many cases, the interactions invalidate this assumption by occurring in a specific order. Such patterns emerge in pathway data; to capture them, we have to model the dependencies between interactions using higher-order network models. However, the detection of mesoscale structures in higher-order networks is still under-researched. In this work, we derive a Bayesian approach that simultaneously models the optimal partitioning of nodes in groups and the optimal higher-order network dynamics between the groups. In synthetic data we demonstrate that our method can recover both standard proximity-based communities and role-based groupings of nodes. In synthetic and real world data we show that it can compete with baseline techniques, while additionally providing interpretable abstractions of network dynamics.
... In more recent studies, considering weak ties was helpful in different sociological contexts, such as the influence of indirect contacts on decision making [29], or the dismantling of organized crime [15]. In a more graph-theoretical approach, researchers explored different methods of clustering people by their roles, relying on the fact that structurally equivalent nodes fulfill the same role in society [19]. ...
Article
Full-text available
Understanding how a disease spreads in a population is a first step to preparing for future epidemics, and machine learning models are a useful tool for analyzing the spreading process of infectious diseases. For effective predictions of these spreading processes, node embeddings are used to encode networks based on the similarity between nodes into feature vectors, i.e., higher-dimensional representations of human contacts. In this work, we evaluated the impact of homophily and structural equivalence on node2vec embeddings for disease spread prediction by testing them on real-world temporal human contact networks. Our results show that structural equivalence is a useful indicator of the infection status of a person. Embeddings that are balanced towards the preservation of structural equivalence performed better than those that focus on the preservation of homophily, with an average improvement of 0.1042 in the F1-score (95% CI 0.051 to 0.157). This indicates that structurally equivalent nodes behave similarly during an epidemic (e.g., expected time of disease onset). This observation could greatly improve predictions of future epidemics where only partial information about contacts is known, thereby helping determine the risk of infection for different groups in the population.
... Existing GNNs have mostly focused on learning a single node embedding (or representation) [23,32], even though a node often exhibits polysemous behavior in different contexts [3]. For instance, an individual may have many different personas; e.g., a user may be a researcher, father, coach, and activist [14,22]. These personas may be fundamentally different or even impossible for other individuals. ...
Preprint
Graph Neural Networks (GNNs) have become increasingly important in recent years due to their state-of-the-art performance on many important downstream applications. Existing GNNs have mostly focused on learning a single node representation, even though a node often exhibits polysemous behavior in different contexts. In this work, we develop a persona-based graph neural network framework called PersonaSAGE that learns multiple persona-based embeddings for each node in the graph. Such disentangled representations are more interpretable and useful than a single embedding. Furthermore, PersonaSAGE learns the appropriate set of persona embeddings for each node in the graph, and every node can have a different number of assigned persona embeddings. The framework is flexible, and its general design makes the learned embeddings widely applicable across domains. We utilize publicly available benchmark datasets to evaluate our approach against a variety of baselines. The experiments demonstrate the effectiveness of PersonaSAGE for a variety of important tasks, including link prediction, where we achieve an average gain of 15% while remaining competitive for node classification. Finally, we also demonstrate the utility of PersonaSAGE with a case study on personalized recommendation of different entity types in a data management platform.
... GRAPHVIS provides a diverse collection of visual interactive graph partitioning methods, for example community detection, role discovery (Rossi and Ahmed 2015b), and graph coloring. All graph partitioning methods are designed to be efficient, taking at most linear time in the number of edges to compute. ...
Article
We present a web-based network visual analytics platform called GraphVis that combines interactive visualizations with analytic techniques to reveal important patterns and insights for sense making, reasoning, and decision-making. The platform is designed with simplicity in mind and allows users to visualize and explore networks in seconds with a simple drag-and-drop of a graph file into the web browser. GraphVis is fast and flexible, web-based, requires no installation, while supporting a wide range of graph formats as well as state-of-the-art visualization and analytic techniques. In particular, the multi-level network analysis engine of GraphVis gives rise to a variety of new possibilities for exploring, analyzing, and understanding complex networks interactively in real-time. Finally, we also highlight other key aspects including filtering, querying, ranking, manipulating, exporting, partitioning (community/role discovery), as well as tools for dynamic network analysis and visualization, interactive graph generators (including two new block model approaches), and a variety of multi-level network analysis and statistical techniques.
... One way to overcome these limitations is the paradigm of role discovery [38] that identifies nodes with structurally similar neighborhoods. In contrast to the notion of communities defined by network proximity, structural roles characterize nodes by their local connectivity and subgraph patterns independent of their location in the network [41]; thus, two nodes with similar roles may lie in different parts of the graph. ...
Article
Full-text available
We present InfoMotif, a new semi-supervised, motif-regularized, learning framework over graphs. We overcome two key limitations of message passing in popular graph neural networks (GNNs): localization (a k-layer GNN cannot utilize features outside the k-hop neighborhood of the labeled training nodes) and over-smoothed (structurally indistinguishable) representations. We formulate attributed structural roles of nodes based on their occurrence in different network motifs, independent of network proximity. Network motifs are higher-order structures indicating connectivity patterns between nodes and are crucial to the organization of complex networks. Two nodes share attributed structural roles if they participate in topologically similar motif instances over covarying sets of attributes. InfoMotif achieves architecture-agnostic regularization of arbitrary GNNs through novel self-supervised learning objectives based on mutual information maximization. Our training curriculum dynamically prioritizes multiple motifs in the learning process without relying on distributional assumptions in the underlying graph or the learning task. We integrate three state-of-the-art GNNs in our framework, to show notable performance gains (3–10% accuracy) across nine diverse real-world datasets spanning homogeneous and heterogeneous networks. Notably, we see stronger gains for nodes with sparse training labels and diverse attributes in local neighborhood structures.
... different from random graphs. Such high-density nodes may act as hubs and play an important structural role in their respective network topologies (see also Henderson et al., 2012; Rossi & Ahmed, 2014). These findings have important practical implications for future research. First, many classical frequentist tests require a Gaussian distribution and thus are ill suited for analyzing heavy-tailed distributions. ...
Article
Full-text available
Elucidating the neural basis of social behavior is a long‐standing challenge in neuroscience. Such endeavors are driven by attempts to extend the isolated perspective on the human brain by considering interacting persons' brain activities, but a theoretical and computational framework for this purpose is still in its infancy. Here, we posit a comprehensive framework based on bipartite graphs for interbrain networks and address whether they provide meaningful insights into the neural underpinnings of social interactions. First, we show that the nodal density of such graphs exhibits nonrandom properties. While the current hyperscanning analyses mostly rely on global metrics, we encode the regions' roles via matrix decomposition to obtain an interpretable network representation yielding both global and local insights. With Bayesian modeling, we reveal how synchrony patterns seeded in specific brain regions contribute to global effects. Beyond inferential inquiries, we demonstrate that graph representations can be used to predict individual social characteristics, outperforming functional connectivity estimators for this purpose. In the future, this may provide a means of characterizing individual variations in social behavior or identifying biomarkers for social interaction and disorders. To elucidate the neural mechanisms of social interactions, we introduce an inference and prediction framework for interbrain networks.
... We aim to partition each of our networks based on a relational equivalence of nodes (a perspective with a rich history in the social-networks literature (Lorrain and White, 1971; Rossi and Ahmed, 2015)), rather than on high internal traffic within sets of nodes (Munoz-Mendez et al., 2018). The analysis of time-aggregated data can shed light on "community" membership in the latter sense (Austwick et al., 2013) through a partition of a network into contiguous spatial clusters (Munoz-Mendez et al., 2018). ...
Article
Full-text available
In urban systems, there is an interdependency between neighborhood roles and transportation patterns between neighborhoods. In this paper, we classify docking stations in bicycle-sharing networks to gain insight into the human mobility patterns of three major cities in the United States. We propose novel time-dependent stochastic block models, with degree-heterogeneous blocks and either mixed or discrete block membership, which classify nodes based on their time-dependent activity patterns. We apply these models to (1) detect the roles of bicycle-sharing stations and (2) describe the traffic within and between blocks of stations over the course of a day. Our models successfully uncover work blocks, home blocks, and other blocks; they also reveal activity patterns that are specific to each city. Our work gives insights for the design and maintenance of bicycle-sharing systems, and it contributes new methodology for community detection in temporal and multilayer networks with heterogeneous degrees.
... Currently, there is no universally accepted framework for comparing graphs in terms of the distributions of their local structural properties, often referred to as node roles (Rossi & Ahmed, 2015). In particular, evaluation of graph generative models, which have potential uses in data anonymization ...
Preprint
Full-text available
We argue that when comparing two graphs, the distribution of node structural features is more informative than global graph statistics which are often used in practice, especially to evaluate graph generative models. Thus, we present GraphDCA - a framework for evaluating similarity between graphs based on the alignment of their respective node representation sets. The sets are compared using a recently proposed method for comparing representation spaces, called Delaunay Component Analysis (DCA), which we extend to graph data. To evaluate our framework, we generate a benchmark dataset of graphs exhibiting different structural patterns and show, using three node structure feature extractors, that GraphDCA recognizes graphs with both similar and dissimilar local structure. We then apply our framework to evaluate three publicly available real-world graph datasets and demonstrate, using gradual edge perturbations, that GraphDCA satisfyingly captures gradually decreasing similarity, unlike global statistics. Finally, we use GraphDCA to evaluate two state-of-the-art graph generative models, NetGAN and CELL, and conclude that further improvements are needed for these models to adequately reproduce local structural features.
... In recent years, structural role-based NE has attracted increasing attention, as it can be of great help in learning the function and behavior of nodes [9]. In essence, these methods usually rely on or imitate methods of structural feature extraction [10], [11] and subgraph isomorphism test [12], [13]. ...
Preprint
Full-text available
Capturing structural similarity has been a hot topic in the field of network embedding recently due to its great help in understanding node functions and behaviors. However, existing works have paid much attention to learning structures on homogeneous networks, while the related study of heterogeneous networks is still a void. In this paper, we take the first step towards representation learning on heterostructures, which is very challenging due to their highly diverse combinations of node types and underlying structures. To effectively distinguish diverse heterostructures, we first propose a theoretically guaranteed technique called heterogeneous anonymous walk (HAW) and its variant coarse HAW (CHAW). Then, we devise the heterogeneous anonymous walk embedding (HAWE) and its variant coarse HAWE in a data-driven manner to circumvent using an extremely large number of possible walks, and train embeddings by predicting the walks occurring in the neighborhood of each node. Finally, we design and apply extensive and illustrative experiments on synthetic and real-world networks to build a benchmark on heterostructure learning and evaluate the effectiveness of our methods. The results demonstrate that our methods achieve outstanding performance compared with both homogeneous and heterogeneous classic methods, and can be applied to large-scale networks.
Chapter
In the field of node representation learning, the task of interpreting latent dimensions has become a prominent, well-studied research topic. The contribution of this work focuses on appraising the interpretability of another rarely-exploited feature of node embeddings, increasingly utilised in recommendation and consumption diversity studies: inter-node embedded distances. Introducing a new method to measure how understandable the distances between nodes are, our work assesses how well the proximity weights derived from a network before embedding relate to the node closeness measurements after embedding. Testing several classical node embedding models, our findings reach a conclusion familiar to practitioners albeit rarely cited in the literature: the matrix factorisation model SVD is the most interpretable through first-, second- and even higher-order proximities.
Article
Link prediction is a significant research problem in network science and has widespread applications. To date, much effort has focused on predicting the links generated by pairwise interactions, but little is known about the predictability of links created by higher-order interaction patterns. In this study, we investigated a new framework for predicting links of different orders in social interaction networks based on edge orbit degrees (EODs) characterized by three-node and four-node graphlets. First, we defined a new problem of different-order link prediction to examine the predictability of links generated by different-order interaction patterns. Second, we quantified EODs for different-order link prediction and examined the performance of different-order predictors. The experiments on real-world networks show that higher-order links are easier to predict than lower-order (pairwise) links. We also found that the closed three-node EOD has strong predictive power and can accurately predict both lower-order and higher-order links. Finally, we proposed a new method fusing multiple EODs (MEOD) to predict different-order links, and experiments indicate that MEOD outperforms state-of-the-art methods. Our findings not only effectively improve link prediction performance across orders, but also contribute to a better understanding of the organizational principles of higher-order structures.
Article
Full-text available
Role discovery is the task of dividing the set of nodes on a graph into classes of structurally similar roles. Modern strategies for role discovery typically rely on graph embedding techniques, which are capable of recognising complex graph structures when reducing nodes to dense vector representations. However, when working with large, real-world networks, it is difficult to interpret or validate a set of roles identified according to these methods. In this work, motivated by advancements in the field of explainable artificial intelligence, we propose surrogate explanation for role discovery, a new framework for interpreting role assignments on large graphs using small subgraph structures known as graphlets. We demonstrate our framework on a small synthetic graph with prescribed structure, before applying it to a larger real-world network. In the second case, a large, multidisciplinary citation network, we successfully identify a number of important citation patterns or structures which reflect interdisciplinary research.
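The intuition that small graphlets yield interpretable role signatures can be illustrated with per-node counts of 3-node graphlets. The toy star-plus-triangle graph below is an assumption made for illustration, not an example from the paper:

```python
from itertools import combinations

# Toy undirected graph: a star (hub "h") plus a separate triangle
# (t1, t2, t3). Edges are illustrative.
edges = [("h", "a"), ("h", "b"), ("h", "c"),
         ("t1", "t2"), ("t2", "t3"), ("t1", "t3")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def graphlet_counts(node):
    """Per-node counts of two 3-node graphlets: open wedges centred on
    the node, and triangles containing it."""
    wedges = triangles = 0
    for x, y in combinations(sorted(adj[node]), 2):
        if y in adj[x]:
            triangles += 1
        else:
            wedges += 1
    return {"wedge_center": wedges, "triangle": triangles}

# The star hub centres many wedges but closes no triangles; a triangle
# node does the opposite -- an interpretable role signature.
assert graphlet_counts("h") == {"wedge_center": 3, "triangle": 0}
assert graphlet_counts("t1") == {"wedge_center": 0, "triangle": 1}
```

A surrogate explanation would relate such counts to the roles an embedding-based method assigns, making the assignments auditable.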
Chapter
One of the interesting tasks in social network analysis is detecting the roles of network nodes in their interactions. The first problem is discovering such roles, and the second is detecting the discovered roles in the network. Role detection, i.e., assigning a role to a node, is a classification task. Our paper addresses the second problem and uses three roles (classes) for classification. These roles are based only on the structural properties of the neighborhood of a given node and use the previously published non-symmetric relationship between pairs of nodes for their definition. This paper presents transductive learning experiments using graph neural networks (GNN) to show that excellent results can be obtained even with a relatively small sample size for training the network. Keywords: Complex network; Graph neural network; Non-symmetric dependency; Node prominency
Article
Node role explainability in complex networks is very difficult yet crucial in different application domains such as social science, neuroscience, or computer science. Many efforts have been made to quantify hubs, revealing particular nodes in a network using a given structural property. Yet, in several applications, when multiple instances of networks are available and several structural properties appear to be relevant, the identification of node roles remains largely unexplored. Inspired by the node automorphic equivalence relation, we define an equivalence relation on graph nodes associated with any collection of nodal statistics (i.e., any functions on the node set). This allows us to define new global graph measures: the power coefficient and the orthogonality score, which evaluate the parsimony and heterogeneity of a given collection of nodal statistics. In addition, we introduce a new method based on structural patterns to compare graphs that have the same vertex set. This method assigns a value to a node to determine its role distinctiveness in a graph family. Extensive numerical results of our method are presented on both generative graph models and real data concerning human brain functional connectivity. The differences in nodal statistics are shown to depend on the underlying graph structure. Comparisons between generative models and real networks combining two different nodal statistics reveal the complexity of human brain functional connectivity, with differences at both global and nodal levels. Using a group of 200 healthy controls' connectivity networks, our method computes high correspondence scores among the whole population to detect homotopy, and finally quantifies differences between comatose patients and healthy controls.
Article
Earth observation technology has improved the detection of land cover changes. However, current pixel-based change detection methods cannot adequately describe the evolutionary process and spatiotemporal association of geographic entities. Therefore, we developed a method for analyzing the processes and patterns of land cover evolution based on spatiotemporal graphs. First, a spatiotemporal graph was generated from a time series of land cover maps according to the spatial and temporal relationships between land cover objects, as defined by spatial adjacency and temporal transition, respectively. Subsequently, structural characteristics, such as the spatial roles, adjacency type, temporal transitions and evolution trajectories, were derived from the spatiotemporal graph to describe and analyze the evolution of land cover. Finally, this method was applied to analyze land cover evolution in Fujian Province, China, from 2001 to 2019. The proposed method not only completely preserves the spatial adjacency and temporal transition details among land cover objects in a spatiotemporally unified graph framework but also extracts evolution-related spatiotemporal structural characteristics. This study provides a reliable scientific basis for analyzing the consistency of long-term land cover dynamics and has practical value for other geographic applications.
Chapter
Role discovery is the task of dividing the set of nodes on a graph into classes of structurally similar roles. Modern strategies for role discovery typically rely on graph embedding techniques, which are capable of recognising complex local structures. However, when working with large, real-world networks, it is difficult to interpret or validate a set of roles identified according to these methods. In this work, motivated by advancements in the field of explainable artificial intelligence (XAI), we propose a new framework for interpreting role assignments on large graphs using small subgraph structures known as graphlets. We demonstrate our methods on a large, multidisciplinary citation network, where we successfully identify a number of important citation patterns which reflect interdisciplinary research. Keywords: Role discovery; Node embedding; Citation network; Explainable artificial intelligence
Chapter
Signed networks are widely observed in and constructed from the real world, and are distinguished by the rich information carried in the signs of their edges. Several embedding methods have been proposed for signed networks. Current methods mainly focus on proximity similarity and the fulfillment of social psychological theories; however, no signed network embedding method has focused on structural similarity. Therefore, in this research, we propose a novel notion of degree in signed networks, a distance function to measure the similarity between two such complex degrees, and a node-embedding method based on structural similarity. Experiments on five network topologies, an inverted karate club network, and three real networks demonstrate that our proposed method embeds nodes with similar structural features close together, and show its superiority on a link sign prediction task from embeddings compared with state-of-the-art methods.
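A minimal sketch of the idea of a sign-aware degree and a distance between such degrees. The (positive, negative) count pair and Euclidean distance below are deliberate simplifications of the paper's notion, applied to an invented toy signed graph:

```python
import math

# Toy signed graph: edge -> sign (+1 or -1); values are illustrative.
signed_edges = {("a", "b"): +1, ("a", "c"): +1, ("a", "d"): -1,
                ("e", "f"): +1, ("e", "g"): +1, ("e", "h"): -1,
                ("b", "c"): -1}

def signed_degree(node):
    """Represent a node's degree as a (positive-degree, negative-degree)
    pair -- one simple way to make 'degree' sign-aware."""
    pos = neg = 0
    for (u, v), s in signed_edges.items():
        if node in (u, v):
            pos, neg = pos + (s > 0), neg + (s < 0)
    return (pos, neg)

def role_distance(u, v):
    """Euclidean distance between signed-degree pairs: nodes with the same
    mix of positive and negative ties are structurally close, wherever
    they sit in the network."""
    (pu, nu), (pv, nv) = signed_degree(u), signed_degree(v)
    return math.hypot(pu - pv, nu - nv)

# "a" and "e" sit in different components yet share the signature
# (2 positive, 1 negative), so their structural distance is zero.
assert signed_degree("a") == (2, 1) and signed_degree("e") == (2, 1)
assert role_distance("a", "e") == 0.0
```

An embedding built on such a distance places role-similar nodes together even when they are disconnected, which is exactly the property proximity-based signed embeddings lack.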
Chapter
Online social networks have recently become widely accepted platforms for sharing information and connecting people from diverse areas. They provide a common venue for frequent human interactions, resulting in a significant increase in information about individual users, their interactions, and their relationships. These users can be classified into different classes based on the similarities and differences in their characteristics and their local and global positions in the network. The node classification problem has gained recognition due to its real-time applications in recommendation systems, epidemiological diffusion, the sociological dynamics of communities, and anomaly detection. Diverse attempts have been made to perform informative node classification. Furthermore, deep learning based approaches for node classification in online social networks have provided state-of-the-art results with better insights and high accuracy. In this chapter, we provide a rigorous literature review of deep learning based methods designed for node classification, and conclude the chapter with open research directions that address the gaps in current work and the demands of next-generation online social systems. Keywords: Deep learning; Node classification; Community detection; Role identification; Graph partitioning; Online social networks
Article
Full-text available
This study tackles the problem of extracting node roles in uncertain graphs based on network motifs. Uncertain graphs are useful for modeling information diffusion phenomena because the presence or absence of edges is determined stochastically. In such an uncertain graph, a node's role also changes stochastically with the presence or absence of edges, so approximate computation over a huge number of sampled graphs is common; however, the computational load is very large, even for a small graph. We propose a method to extract uncertain node roles with high accuracy and high speed by ensembling a large number of sampled graphs and efficiently searching for all other roles a node can transition to. This method provides highly accurate results compared to simple sampling-and-ensembling methods that do not consider transitions to other roles. In our evaluation experiments, we use real-world graphs artificially assigned uniform and non-uniform edge existence probabilities. The results show that the proposed method outperforms an existing method previously reported by the authors, on which the proposed method is based, as well as another current method built on a state-of-the-art algorithm, in terms of efficiency and accuracy.
Article
While most network embedding techniques model the proximity between nodes in a network, recently there has been significant interest in structural embeddings that are based on node equivalences, a notion rooted in sociology: equivalences or positions are collections of nodes that have similar roles, i.e., similar functions, ties, or interactions with nodes in other positions, irrespective of their distance or reachability in the network. Unlike the proximity-based methods that are rigorously evaluated in the literature, the evaluation of structural embeddings is less mature. It relies on small synthetic or real networks with labels that are not perfectly defined, and its connection to sociological equivalences has hitherto been vague and tenuous. With new node embedding methods being developed at a breakneck pace, proper evaluation and systematic characterization of existing approaches will be essential to progress. To fill this gap, we set out to understand what types of equivalences structural embeddings capture. We are the first to contribute rigorous intrinsic and extrinsic evaluation methodology for structural embeddings, along with carefully designed, diverse datasets of varying sizes. We observe a number of evaluation variables that can lead to different results (e.g., choice of similarity measure, classifier, and label definitions). We find that degree distributions within nodes' local neighborhoods can lead to simple yet effective baselines in their own right and can guide the future development of structural embedding. We hope that our findings can influence the design of further node embedding methods and also pave the way for more comprehensive and fair evaluation of structural embedding methods.
Article
Social networks have a plethora of applications, and their analysis has been gaining much interest from the research community. The high dimensionality of social network data poses a significant obstacle to analysis, leading to the curse of dimensionality. The growth of representation learning across research fields has facilitated network representation learning (also called network embedding), which helps address this issue. Structural representation learning aims to learn low-dimensional vector representations of high-dimensional network data that maximally preserve network structural information. This representation can then serve as a backbone for various network-based applications. First, we investigate the techniques used in network representation learning and similarity indices. We then categorize representative algorithms into three types based on the network structural level used in their learning process. We also introduce algorithms for representation learning of edges, subgraphs, and whole networks. Finally, we introduce the evaluation metrics and applications of network representation learning, along with promising future research directions.
Preprint
Full-text available
Modern online platforms offer users an opportunity to participate in a variety of content-creation, social networking, and shopping activities. With the rapid proliferation of such online services, learning data-driven user behavior models is indispensable to enable personalized user experiences. Recently, representation learning has emerged as an effective strategy for user modeling, powered by neural networks trained over large volumes of interaction data. Despite their enormous potential, we encounter the unique challenge of data sparsity for a vast majority of entities, e.g., sparsity in ground-truth labels for entities and in entity-level interactions (cold-start users, items in the long-tail, and ephemeral groups). In this dissertation, we develop generalizable neural representation learning frameworks for user behavior modeling designed to address different sparsity challenges across applications. Our problem settings span transductive and inductive learning scenarios, where transductive learning models entities seen during training and inductive learning targets entities that are only observed during inference. We leverage different facets of information reflecting user behavior (e.g., interconnectivity in social networks, temporal and attributed interaction information) to enable personalized inference at scale. Our proposed models are complementary to concurrent advances in neural architectural choices and are adaptive to the rapid addition of new applications in online platforms.
Article
Full-text available
Many machine learning applications that involve relational databases incorporate first-order logic and probability. Markov Logic Networks (MLNs) are a prominent statistical relational model that consist of weighted first order clauses. Many of the current state-of-the-art algorithms for learning MLNs have focused on relatively small datasets with few descriptive attributes, where predicates are mostly binary and the main task is usually prediction of links between entities. This paper addresses what is in a sense a complementary problem: learning the structure of an MLN that models the distribution of discrete descriptive attributes on medium to large datasets, given the links between entities in a relational database. Descriptive attributes are usually nonbinary and can be very informative, but they increase the search space of possible candidate clauses. We present an efficient new algorithm for learning a directed relational model (parametrized Bayes net), which produces an MLN structure via a standard moralization procedure for converting directed models to undirected models. Learning MLN structure in this way is 200-1000 times faster and scores substantially higher in predictive accuracy than benchmark algorithms on three relational databases.
Article
Full-text available
Recommender systems, which fall under web content mining, have become extremely important because user-generated information is increasingly free-form and unstructured, which makes it difficult to mine important information from data sources. To satisfy the information requirements of Web users and to improve the user experience in many Web applications, recommendation systems have been studied in academia and widely deployed in industry. This paper presents a system in which data sources are modeled as various types of web graphs using the DRec algorithm. These web graphs can serve various recommendation systems. The framework is built upon heat diffusion, which creates a web-graph diffusion model; a query suggestion algorithm is then applied to test queries and generate recommendations. The work is extended with personalized query recommendation and a comparative analysis of the algorithm, with results reported in terms of accuracy. This system can be applied to most web graphs for query suggestion, image recommendation, and social as well as personalized recommendation.
Chapter
Full-text available
Imagine that you are attending a cocktail party, the surrounding is full of chatting and noise, and somebody is talking about you. In this case, your ears are particularly sensitive to this speaker. This is the cocktail-party problem, which can be solved by blind source separation (BSS).
Article
Full-text available
We propose a scalable approach for making inference about latent spaces of large networks. With a succinct representation of networks as a bag of triangular motifs, a parsimonious statistical model, and an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices and hundreds of latent roles on a single machine in a matter of hours, a setting that is out of reach for many existing methods. When compared to the state-of-the-art probabilistic approaches, our method is several orders of magnitude faster, with competitive or improved accuracy for latent space recovery and link prediction.
Article
Full-text available
In this paper we discuss a multilinear generalization of the best rank-R approximation problem for matrices, namely, the approximation of a given higher-order tensor, in an optimal least-squares sense, by a tensor that has prespecified column rank value, row rank value, etc. For matrices, the solution is conceptually obtained by truncation of the singular value decomposition (SVD); however, this approach does not have a straightforward multilinear counterpart. We discuss higher-order generalizations of the power method and the orthogonal iteration method.
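For matrices, the truncation mentioned above is the best rank-R approximation in the least-squares sense (the Eckart-Young theorem). A minimal numpy sketch of this matrix case, with illustrative sizes and a hypothetical target rank `R`:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))  # arbitrary example matrix

# Full SVD: A = U @ diag(s) @ Vt, singular values sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

R = 3  # hypothetical target rank
# Best rank-R approximation: truncate the SVD to the top R factors.
A_R = U[:, :R] @ np.diag(s[:R]) @ Vt[:R, :]

# The Frobenius error equals the norm of the discarded singular values.
err = np.linalg.norm(A - A_R, "fro")
expected = np.sqrt(np.sum(s[R:] ** 2))
```

As the abstract notes, this conceptual recipe has no straightforward multilinear counterpart: truncating a higher-order decomposition does not in general give the optimal lower-rank tensor.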
Article
Full-text available
Learning the right graph representation from noisy, multisource data has garnered significant interest in recent years. A central tenet of this problem is relational learning. Here the objective is to incorporate the partial information each data source gives us in a way that captures the true underlying relationships. To address this challenge, we present a general, boosting-inspired framework for combining weak evidence of entity associations into a robust similarity metric. We explore the extent to which different quality measurements yield graph representations that are suitable for community detection. We then present empirical results on both synthetic and real datasets demonstrating the utility of this framework. Our framework leads to suitable global graph representations from quality measurements local to each edge. Finally, we discuss future extensions and theoretical considerations of learning useful graph representations from weak feedback in general application settings.
Article
Full-text available
Procedures for establishing a partition of a network in terms of structural equivalence can be divided into direct and indirect approaches. For the former, a new criterion function is proposed that reflects directly structural equivalence concerns. This criterion function can then be (locally) optimized to create a partition. For indirect approaches, measures of dissimilarity must be compatible with the definition of structural equivalence.
Article
Full-text available
Relational data representations have become an increasingly important topic due to the recent proliferation of network datasets (e.g., social, biological, information networks) and a corresponding increase in the application of statistical relational learning (SRL) algorithms to these domains. In this article, we examine and categorize techniques for transforming graph-based relational data to improve SRL algorithms. In particular, appropriate transformations of the nodes, links, and/or features of the data can dramatically affect the capabilities and results of SRL algorithms. We introduce an intuitive taxonomy for data representation transformations in relational domains that incorporates link transformation and node transformation as symmetric representation tasks. More specifically, the transformation tasks for both nodes and links include (i) predicting their existence, (ii) predicting their label or type, (iii) estimating their weight or importance, and (iv) systematically constructing their relevant features. We motivate our taxonomy through detailed examples and use it to survey competing approaches for each of these tasks. We also discuss general conditions for transforming links, nodes, and features. Finally, we highlight challenges that remain to be addressed.
Article
Full-text available
Given a large time-evolving graph, how can we model and characterize the temporal behaviors of individual nodes (and network states)? How can we model the behavioral transition patterns of nodes? We propose a temporal behavior model that captures the "roles" of nodes in the graph and how they evolve over time. The proposed dynamic behavioral mixed-membership model (DBMM) is scalable, fully automatic (no user-defined parameters), non-parametric/data-driven (no specific functional form or parameterization), interpretable (identifies explainable patterns), and flexible (applicable to dynamic and streaming networks). Moreover, the interpretable behavioral roles are generalizable and computationally efficient. We applied our model for (a) identifying patterns and trends of nodes and network states based on the temporal behavior, (b) predicting future structural changes, and (c) detecting unusual temporal behavior transitions. The experiments demonstrate the scalability, flexibility, and effectiveness of our model for identifying interesting patterns, detecting unusual structural transitions, and predicting the future structural changes of the network and individual nodes.
Article
Article
Article
It is shown that a strongly consistent estimation procedure for the order of an autoregression can be based on the law of the iterated logarithm for the partial autocorrelations. As compared to other strongly consistent procedures this procedure will underestimate the order to a lesser degree.
Article
Stochastic variational inference finds good posterior approximations of probabilistic models with very large data sets. It optimizes the variational objective with stochastic optimization, following noisy estimates of the natural gradient. Operationally, stochastic inference iteratively subsamples from the data, analyzes the subsample, and updates parameters with a decreasing learning rate. However, the algorithm is sensitive to that rate, which usually requires hand-tuning to each application. We solve this problem by developing an adaptive learning rate for stochastic variational inference. Our method requires no tuning and is easily implemented with computations already made in the algorithm. We demonstrate our approach with latent Dirichlet allocation applied to three large text corpora. Inference with the adaptive learning rate converges faster and to a better approximation than the best settings of hand-tuned rates.
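The decreasing learning rate the abstract refers to is typically a Robbins-Monro schedule. A toy sketch of such a schedule tracking a mean from noisy samples (this is the fixed-schedule setting the paper improves on, not its adaptive method; `tau` and `kappa` are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=20000)  # noisy stream

# Robbins-Monro schedule rho_t = (t + tau)^(-kappa); kappa in (0.5, 1]
# gives sum(rho) = inf and sum(rho^2) < inf, the classic conditions.
tau, kappa = 1.0, 0.7
estimate = 0.0
for t, x in enumerate(data, start=1):
    rho = (t + tau) ** (-kappa)
    estimate = (1.0 - rho) * estimate + rho * x  # noisy stochastic update
```

The sensitivity the paper addresses shows up here as well: too large a `kappa` forgets slowly-arriving information, too small a `kappa` keeps the estimate noisy.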
Article
Consider data consisting of pairwise measurements, such as presence or absence of links between pairs of objects. These data arise, for instance, in the analysis of protein interactions and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing pairwise measurements with probabilistic models requires special assumptions, since the usual independence or exchangeability assumptions no longer hold. Here we introduce a class of latent variable models for pairwise measurements: mixed membership stochastic blockmodels. These models combine global parameters that instantiate dense patches of connectivity (blockmodel) with local parameters that instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodels with applications to social networks and protein interaction networks.
Chapter
Let A be a real m×n matrix with m ≥ n. It is well known (cf. [4]) that $$A = U \Sigma V^T \qquad (1)$$ where $$U^T U = V^T V = V V^T = I_n \quad \text{and} \quad \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n).$$ The matrix U consists of n orthonormalized eigenvectors associated with the n largest eigenvalues of $AA^T$, and the matrix V consists of the orthonormalized eigenvectors of $A^T A$. The diagonal elements of $\Sigma$ are the non-negative square roots of the eigenvalues of $A^T A$; they are called singular values. We shall assume that $$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0.$$ Thus if rank(A) = r, then $\sigma_{r+1} = \sigma_{r+2} = \cdots = \sigma_n = 0$. The decomposition (1) is called the singular value decomposition (SVD).
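The stated properties of the decomposition can be checked numerically; a small sketch using numpy's SVD routine (the matrix size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))  # m >= n, matching the setting above

# Thin SVD: U (5x3) has orthonormal columns, Vt (3x3) is orthogonal,
# sigma holds the singular values in non-increasing order.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

recon = U @ np.diag(sigma) @ Vt  # reassemble A = U Sigma V^T
```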
Article
Distance or similarity measures are essential to solve many pattern recognition problems such as classification, clustering, and retrieval problems. Various distance/similarity measures that are applicable to compare two probability density functions, pdf in short, are reviewed and categorized in both syntactic and semantic relationships. A correlation coefficient and a hierarchical clustering technique are adopted to reveal similarities among numerous distance/similarity measures.
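As a minimal illustration of such measures, three common distances between two discrete pdfs (the vectors `p` and `q` are made-up examples):

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])  # hypothetical discrete pdf
q = np.array([0.2, 0.3, 0.5])  # hypothetical discrete pdf

# Three measures from different families of the survey's taxonomy.
kl = float(np.sum(p * np.log(p / q)))                 # Kullback-Leibler (asymmetric)
tv = float(0.5 * np.sum(np.abs(p - q)))               # total variation
hellinger = float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))
```

Note the asymmetry of KL divergence versus the symmetry of the other two, one of the syntactic properties along which such measures are categorized.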
Article
The existence of an actor as a set of asymmetric relations to and from every actor in a network of relations is specified as the position of the actor in the network. Conditions of strong versus weak structural equivalence of actor positions in a network are defined. Network structure is characterized in terms of structurally nonequivalent, jointly occupied, network positions located in the observed network. The social distances of actors from network positions are specified as unobserved variables in structural equation models in order to extend the analysis of networks into the etiology and consequences of network structure.
Article
The self-organized map, an architecture suggested for artificial neural networks, is explained by presenting simulation experiments and practical applications. The self-organizing map has the property of effectively creating spatially organized internal representations of various features of input signals and their abstractions. One result of this is that the self-organization process can discover semantic relationships in sentences. Brain maps, semantic maps, and early work on competitive learning are reviewed. The self-organizing map algorithm (an algorithm which orders responses spatially) is reviewed, focusing on best-matching cell selection and adaptation of the weight vectors. Suggestions for applying the self-organizing map algorithm, demonstrations of the ordering process, and an example of hierarchical clustering of data are presented. Fine-tuning the map by learning vector quantization is addressed. The use of self-organized maps in practical speech recognition and a simulation experiment on semantic mapping are discussed.
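A minimal 1-D self-organizing map sketch of the update rule described above, with best-matching unit selection, a decaying learning rate, and a shrinking neighborhood (all hyperparameters are illustrative, not Kohonen's settings):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.uniform(0.0, 1.0, size=(500, 2))  # toy 2-D inputs

# A chain of units whose weight vectors adapt toward the inputs,
# with neighbors of the winning unit dragged along.
n_units = 10
weights = rng.uniform(0.0, 1.0, size=(n_units, 2))

for t, x in enumerate(data):
    lr = 0.5 * (1.0 - t / len(data))                 # decaying learning rate
    radius = max(1.0, 3.0 * (1.0 - t / len(data)))   # shrinking neighborhood
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))  # best-matching unit
    for i in range(n_units):
        h = np.exp(-((i - winner) ** 2) / (2.0 * radius ** 2))  # neighborhood kernel
        weights[i] += lr * h * (x - weights[i])      # pull unit toward the input

# Average quantization error after training.
qerr = np.mean([np.min(np.linalg.norm(weights - x, axis=1)) for x in data])
```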
Article
In a regression problem, typically there are p explanatory variables possibly related to a response variable, and we wish to select a subset of the p explanatory variables to fit a model between these variables and the response. A bootstrap variable/model selection procedure is to select the subset of variables by minimizing bootstrap estimates of the prediction error, where the bootstrap estimates are constructed based on a data set of size n. Although the bootstrap estimates have good properties, this bootstrap selection procedure is inconsistent in the sense that the probability of selecting the optimal subset of variables does not converge to 1 as n → ∞. This inconsistency can be rectified by modifying the sampling method used in drawing bootstrap observations. For bootstrapping pairs (response, explanatory variable), it is found that instead of drawing n bootstrap observations (a customary bootstrap sampling plan), much less bootstrap observations should be sampled: The bootstrap selection procedure becomes consistent if we draw m bootstrap observations with m → ∞ and m/n → 0. For bootstrapping residuals, we modify the bootstrap sampling procedure by increasing the variability among the bootstrap observations. The consistency of the modified bootstrap selection procedures is established in various situations, including linear models, nonlinear models, generalized linear models, and autoregressive time series. The choice of the bootstrap sample size m and some computational issues are also discussed. Some empirical results are presented.
Article
In many text mining applications, side-information is available along with the text documents. Such side-information may be of different kinds, such as document provenance information, the links in the document, user-access behavior from web logs, or other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the mining process, because it can either improve the quality of the representation for the mining process, or can add noise to the process. Therefore, we need a principled way to perform the mining process, so as to maximize the advantages from using this side information. In this paper, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We then show how to extend the approach to the classification problem. We present experimental results on a number of real data sets in order to illustrate the advantages of using such an approach.
Conference Paper
Role discovery in graphs is an emerging area that allows analysis of complex graphs in an intuitive way. In contrast to community discovery, which finds groups of highly connected nodes, role discovery finds groups of nodes that share similar topological structure in the graph, and hence a common role (or function) such as being a broker or a periphery node. However, existing work so far is completely unsupervised, which is undesirable for a number of reasons. We provide an alternating least squares framework that allows convex constraints to be placed on the role discovery problem, which can provide useful supervision. In particular we explore supervision to enforce i) sparsity, ii) diversity, and iii) alternativeness in the roles. We illustrate the usefulness of this supervision on various data sets and applications.
Conference Paper
Graphlet frequency distribution (GFD) is an analysis tool for understanding the variance of local structure in a graph. Many recent works use GFD for comparing and characterizing real-life networks. However, the main bottleneck for graph analysis using GFD is the excessive computation cost of obtaining the frequency of each graphlet in a large network. To overcome this, we propose a simple yet powerful algorithm, called GRAFT, that obtains the approximate graphlet frequency for all graphlets that have up to 5 vertices. Compared to an exact counting algorithm, our algorithm achieves a speedup factor between 10 and 100 for a negligible counting error, which is, on average, less than 5%. For example, exact graphlet counting for ca-AstroPh takes approximately 3 days, whereas GRAFT runs for 45 minutes to perform the same task with a counting accuracy of 95.6%.
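GRAFT's sampling scheme is more involved, but the core idea of trading exactness for speed by sampling can be illustrated on the simplest nontrivial graphlet, the triangle, via edge sampling (the graph and sample size here are made up):

```python
import random
import itertools

random.seed(4)

# Toy graph as adjacency sets: a 5-clique (nodes 0-4) plus a pendant node 5.
adj = {i: set() for i in range(6)}
for u, v in itertools.combinations(range(5), 2):
    adj[u].add(v); adj[v].add(u)
adj[4].add(5); adj[5].add(4)
edges = [(u, v) for u in adj for v in adj[u] if u < v]

# Exact triangle count, for reference.
exact = sum(1 for u, v, w in itertools.combinations(adj, 3)
            if v in adj[u] and w in adj[u] and w in adj[v])

# Edge-sampling estimate: pick random edges, count triangles through each,
# and rescale; each triangle is reachable from its 3 edges.
samples = 2000
hits = 0
for _ in range(samples):
    u, v = random.choice(edges)
    hits += len(adj[u] & adj[v])   # common neighbors close a triangle
approx = hits * len(edges) / (samples * 3)
```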
Article
We discuss the interpretation of Cp-plots and show how they can be calibrated in several ways. We comment on the practice of using the display as a basis for formal selection of a subset-regression model, and extend the range of application of the device to encompass arbitrary linear estimates of the regression coefficients, for example Ridge estimates.
Conference Paper
Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle web-scale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches to compute matrix factorization. There has been a recent flurry of activity to parallelize these algorithms. However, due to the cubic time complexity in the target rank, ALS is not scalable to large-scale datasets. On the other hand, SGD conducts efficient updates but usually suffers from slow convergence that is sensitive to the parameters. Coordinate descent, a classical optimization approach, has been used for many other large-scale problems, but its application to matrix factorization for recommender systems has not been explored thoroughly. In this paper, we show that coordinate descent based methods have a more efficient update rule compared to ALS, and are faster and have more stable convergence than SGD. We study different update sequences and propose the CCD++ algorithm, which updates rank-one factors one by one. In addition, CCD++ can be easily parallelized on both multi-core and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, on a synthetic dataset with 2 billion ratings, CCD++ is 4 times faster than both SGD and ALS using a distributed system with 20 machines.
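The rank-one update order that CCD++ uses can be sketched on a toy fully observed matrix; this is a simplified illustration (no missing entries, no regularization, dense numpy updates), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic exactly rank-3 "ratings" matrix as a toy stand-in.
A = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15))

rank, sweeps = 3, 50
U = rng.standard_normal((20, rank)) * 0.1
V = rng.standard_normal((15, rank)) * 0.1

# CCD-style sweeps: update one rank-one factor (u_k, v_k) at a time
# against the residual left by all the other factors.
for _ in range(sweeps):
    for k in range(rank):
        R = A - U @ V.T + np.outer(U[:, k], V[:, k])  # residual excluding factor k
        U[:, k] = R @ V[:, k] / (V[:, k] @ V[:, k] + 1e-12)
        V[:, k] = R.T @ U[:, k] / (U[:, k] @ U[:, k] + 1e-12)

err = np.linalg.norm(A - U @ V.T) / np.linalg.norm(A)  # relative fit error
```

Each rank-one update is a closed-form least-squares step, which is the cheap per-coordinate update the paper contrasts with ALS's cubic-in-rank solves.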
Article
We study the problem of privacy-preservation in social networks. We consider the distributed setting in which the network data is split between several data holders. The goal is to arrive at an anonymized view of the unified network without revealing to any of the data holders information about links between nodes that are controlled by other data holders. To that end, we start with the centralized setting and offer two variants of an anonymization algorithm which is based on sequential clustering (Sq). Our algorithms significantly outperform the SaNGreeA algorithm due to Campan and Truta which is the leading algorithm for achieving anonymity in networks by means of clustering. We then devise secure distributed versions of our algorithms. To the best of our knowledge, this is the first study of privacy preservation in distributed social networks. We conclude by outlining future research proposals in that direction.
Article
Nonnegative Matrix Factorization (NMF), a relatively novel paradigm for dimensionality reduction, has been in the ascendant since its inception. It incorporates the nonnegativity constraint and thus obtains a parts-based representation while correspondingly enhancing the interpretability of the problem. This survey paper mainly focuses on the theoretical research into NMF over the last 5 years, where the principles, basic models, properties, and algorithms of NMF along with its various modifications, extensions, and generalizations are summarized systematically. The existing NMF algorithms are divided into four categories: Basic NMF (BNMF), Constrained NMF (CNMF), Structured NMF (SNMF), and Generalized NMF (GNMF), upon which the design principles, characteristics, problems, relationships, and evolution of these algorithms are presented and analyzed comprehensively. Some related work not on NMF that NMF should learn from or has connections with is covered too. Moreover, some open issues that remain to be solved are discussed. Several relevant application areas of NMF are also briefly described. This survey aims to construct an integrated, state-of-the-art framework for the NMF concept, from which follow-up research may benefit.
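A basic instance from the BNMF category is the Lee-Seung multiplicative update for the Frobenius objective; a minimal sketch with illustrative sizes (not drawn from the survey itself):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.random((12, 8))  # nonnegative data matrix (toy example)

r, iters = 4, 200
W = rng.random((12, r)) + 0.1  # positive initialization
H = rng.random((r, 8)) + 0.1

err0 = np.linalg.norm(X - W @ H)  # error before training

# Multiplicative updates for min ||X - WH||_F^2 subject to W, H >= 0;
# elementwise ratios keep both factors nonnegative at every step.
eps = 1e-12
for _ in range(iters):
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(X - W @ H)
```

The nonnegativity of `W` and `H` is what yields the parts-based, interpretable representation the survey emphasizes.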
Article
Machine learning systems automatically learn programs from data. This is often a very attractive alternative to manually constructing them, and in the last decade the use of machine learning has spread rapidly throughout computer science and beyond. Machine learning is used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications. A recent report from the McKinsey Global Institute asserts that machine learning (a.k.a. data mining or predictive analytics) will be the driver of the next big wave of innovation [15]. Several fine textbooks are available to interested practitioners and researchers (for example, Mitchell [16] and Witten et al. [24]). However, much of the "folk knowledge" that is needed to successfully develop machine learning applications is not readily available in them. As a result, many machine learning projects take much longer than necessary or wind up producing less-than-ideal results. Yet much of this folk knowledge is fairly easy to communicate. This is the purpose of this article.
Article
We suggest partial logarithmic binning as the method of choice for uncovering the nature of many distributions encountered in information science (IS). Logarithmic binning retrieves information and trends “not visible” in noisy power law tails. We also argue that obtaining the exponent from logarithmically binned data using a simple least squares method is in some cases warranted in addition to methods such as maximum likelihood. We also show why often-used cumulative distributions can make it difficult to distinguish noise from genuine features and to obtain an accurate power law exponent of the underlying distribution. The treatment is nontechnical, aimed at IS researchers with little or no background in mathematics.
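A small sketch of logarithmic binning itself: bin edges equally spaced in log space, with counts normalized by bin width so the heavy tail is smoothed rather than noisy (the Pareto sample is an arbitrary stand-in for IS data):

```python
import numpy as np

rng = np.random.default_rng(7)
# Heavy-tailed sample, the setting where linear bins give noisy tails.
x = 1.0 + rng.pareto(2.0, size=5000)

# Logarithmic binning: edges equally spaced in log space (slightly padded
# so the extreme values fall inside), counts normalized by bin width so
# the histogram estimates a density.
edges = np.logspace(np.log10(x.min() * 0.999), np.log10(x.max() * 1.001), num=20)
counts, _ = np.histogram(x, bins=edges)
density = counts / (np.diff(edges) * len(x))
```

On a log-log plot of `density` against the bin centers, a power law appears as a straight line well into the tail, which is the effect the article exploits.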
Article
Given a graph, how can we find a small group of ‘gateways’, that is, a small subset of nodes that are crucial in connecting the source to the target? For instance, given a social network, who is the best person to introduce you to, say, Chris Ferguson, the poker champion? Or, given a network of people and skills, who is the best person to help you learn about, say, wavelets? We formally formulate this problem in two scenarios: Pair-Gateway and Group-Gateway. For each scenario, we show that it is submodular and thus can be solved near-optimally. We further give fast, scalable algorithms to find such gateways. Extensive experimental evaluations on real data sets demonstrate the effectiveness and efficiency of the proposed methods.
Article
Given a large time-evolving network, how can we model and characterize the temporal behaviors of individual nodes (and network states)? How can we model the behavioral transition patterns of nodes? We propose a temporal behavior model that captures the 'roles' of nodes in the graph and how they evolve over time. The proposed dynamic behavioral mixed-membership model (DBMM) is scalable, fully automatic (no user-defined parameters), non-parametric/data-driven (no specific functional form or parameterization), interpretable (identifies explainable patterns), and flexible (applicable to dynamic and streaming networks). Moreover, the interpretable behavioral roles are generalizable and computationally efficient, and the model natively supports attributes. We applied our model for (a) identifying patterns and trends of nodes and network states based on the temporal behavior, (b) predicting future structural changes, and (c) detecting unusual temporal behavior transitions. We use eight large real-world datasets from different time-evolving settings (dynamic and streaming). In particular, we model the evolving mixed-memberships and the corresponding behavioral transitions of Twitter, Facebook, IP-Traces, Email (University), Internet AS, Enron, Reality, and IMDB. The experiments demonstrate the scalability, flexibility, and effectiveness of our model for identifying interesting patterns, detecting unusual structural transitions, and predicting the future structural changes of the network and individual nodes.
Article
We provide a systematic analysis of nonnegative matrix factorization (NMF) relating to data clustering. We generalize the usual X = FG^T decomposition to the symmetric W = HH^T and W = HSH^T decompositions. We show that (1) W = HH^T is equivalent to Kernel K-means clustering and Laplacian-based spectral clustering, and (2) X = FG^T is equivalent to simultaneous clustering of the rows and columns of a bipartite graph. We emphasize the importance of orthogonality in NMF and the soft clustering nature of NMF. These results are verified with experiments on face images and newsgroups.
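The clustering reading of X = FG^T can be illustrated with standard Lee–Seung multiplicative updates (a generic NMF algorithm, not the symmetric variants analyzed in the paper): after factoring, the argmax over each row of G gives a soft-to-hard column-cluster label.

```python
import numpy as np

# Sketch: NMF via multiplicative updates, X ≈ F @ G.T with F, G >= 0.
# Cluster labels for columns of X come from the argmax over rows of G.

def nmf(X, k, iters=200, eps=1e-9):
    rng = np.random.default_rng(0)
    F = rng.random((X.shape[0], k))
    G = rng.random((X.shape[1], k))
    for _ in range(iters):
        F *= (X @ G) / (F @ (G.T @ G) + eps)    # update left factor
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)  # update right factor
    return F, G

# Two obvious column clusters: columns {0, 1} vs columns {2, 3}.
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
F, G = nmf(X, 2)
labels = G.argmax(axis=1)   # hard clustering of the columns
```

The rows of G act as soft membership scores, which is exactly the "soft clustering nature of NMF" the abstract refers to; taking the argmax hardens them.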
Article
Given a network, intuitively two nodes belong to the same role if they have similar structural behavior. Roles should be automatically determined from the data, and could be, for example, "clique-members," "periphery-nodes," etc. Roles enable numerous novel and useful network-mining tasks, such as sense-making, searching for similar nodes, and node classification. This paper addresses the question: Given a graph, how can we automatically discover roles for nodes? We propose RolX (Role eXtraction), a scalable (linear in the number of edges), unsupervised learning approach for automatically extracting structural roles from general network data. We demonstrate the effectiveness of RolX on several network-mining tasks: from exploratory data analysis to network transfer learning. Moreover, we compare network role discovery with network community discovery. We highlight fundamental differences between the two (e.g., roles generalize across disconnected networks, communities do not); and show that the two approaches are complementary in nature.
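A RolX-style pipeline can be sketched in two stages: build structural features per node, then factor the node-by-feature matrix and assign each node its dominant role. The feature set (degree plus mean neighbor degree as one 'recursive' aggregate) and the plain multiplicative NMF below are illustrative simplifications, not RolX's exact recursive feature construction:

```python
import numpy as np

# Toy graph: a star with a short tail hanging off one of its edges.
A = np.array([[0, 1, 1, 1, 0],    # node 0: star center
              [1, 0, 0, 0, 0],    # nodes 1, 2: star edges
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 1],    # node 3: bridge toward the periphery
              [0, 0, 0, 1, 0]],   # node 4: periphery
             dtype=float)

deg = A.sum(axis=1)
nbr = (A @ deg) / np.maximum(deg, 1)   # mean neighbor degree ('recursive' step)
V = np.column_stack([deg, nbr])        # node-by-feature matrix

rng = np.random.default_rng(1)
W, H = rng.random((5, 2)), rng.random((2, 2))
for _ in range(300):                    # multiplicative NMF updates, V ≈ W @ H
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
roles = W.argmax(axis=1)               # hard role label per node
```

Rows of W are the mixed-membership role scores; nodes 1 and 2 have identical features, so they land in the same role regardless of the graph being tiny.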
Article
A method is described for choosing the number of components to retain in a principal component analysis when the aim is dimensionality reduction. The correspondence between principal component analysis and the singular value decomposition of the data matrix is used. The method is based on successively predicting each element in the data matrix after deleting the corresponding row and column of the matrix, and makes use of recently published algorithms for updating a singular value decomposition. These are very fast, which renders the proposed technique a practicable one for routine data analysis.
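The element-prediction idea above can be sketched as follows. This is a single-pass simplification with assumed details (random held-out cells, mean imputation, and one full SVD per rank k) rather than the paper's element-by-element scheme with fast SVD updating:

```python
import numpy as np

# Sketch: choose the number of components k by squared prediction error
# at held-out matrix cells. Held-out cells are mean-imputed before the SVD.

def heldout_error(X, k, mask):
    Xf = X.copy()
    col_mean = (X * ~mask).sum(axis=0) / (~mask).sum(axis=0)
    rows, cols = np.where(mask)
    Xf[rows, cols] = col_mean[cols]               # impute held-out cells
    U, s, Vt = np.linalg.svd(Xf, full_matrices=False)
    Xk = (U[:, :k] * s[:k]) @ Vt[:k]              # rank-k reconstruction
    return float(((X[mask] - Xk[mask]) ** 2).sum())

rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 6))   # rank-2 signal
X = latent + 0.01 * rng.normal(size=latent.shape)             # small noise
mask = rng.random(X.shape) < 0.1                              # hold out ~10%
errs = [heldout_error(X, k, mask) for k in range(1, 6)]
best_k = 1 + int(np.argmin(errs))
```

On this rank-2-plus-noise matrix, error drops sharply once both signal components are kept and rises again as extra components start reproducing the imputed (wrong) values, which is why the held-out criterion avoids the monotone-fit trap of in-sample reconstruction error.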
Article
Many networked systems, including physical, biological, social, and technological networks, appear to contain "communities" -- groups of nodes within which connections are dense, but between which they are sparser. The ability to find such communities in an automated fashion could be of considerable use. Communities in a web graph for instance might correspond to sets of web sites dealing with related topics, while communities in a biochemical network or an electronic circuit might correspond to functional units of some kind. We present a number of new methods for community discovery, including methods based on "betweenness" measures and methods based on modularity optimization. We also give examples of applications of these methods to both computer-generated and real-world network data, and show how our techniques can be used to shed light on the sometimes dauntingly complex structure of networked systems.
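The modularity score that the optimization-based methods maximize can be computed directly: Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j), where m is the edge count and k_i the degrees. A sketch on a toy graph, showing that a dense-within / sparse-between partition scores higher than an arbitrary one:

```python
import numpy as np

# Sketch: Newman-Girvan modularity Q of a candidate partition.
def modularity(A, labels):
    m = A.sum() / 2.0                               # number of edges
    k = A.sum(axis=1)                               # node degrees
    same = labels[:, None] == labels[None, :]       # same-community mask
    return float(((A - np.outer(k, k) / (2 * m)) * same).sum() / (2 * m))

# Toy graph: two triangles joined by a single bridge edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

triangles = np.array([0, 0, 0, 1, 1, 1])   # the 'natural' communities
arbitrary = np.array([0, 1, 0, 1, 0, 1])   # a split ignoring structure
# modularity(A, triangles) = 5/14 ≈ 0.357 > modularity(A, arbitrary)
```

Community-discovery algorithms search over partitions for high Q; contrast this with role discovery above, where nodes in the two triangles would share roles despite sitting in different communities.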