Article

Role Discovery in Networks

Authors:
  • Ryan A. Rossi (Adobe Research)
  • Nesreen K. Ahmed

Abstract

Roles represent node-level connectivity patterns such as star-center, star-edge nodes, near-cliques, or nodes that act as bridges to different regions of the graph. Intuitively, two nodes belong to the same role if they are structurally similar. Roles have mainly been of interest to sociologists, but more recently they have become increasingly useful in other domains. Traditionally, the notion of roles was defined based on graph equivalences such as structural, regular, and stochastic equivalences. We briefly revisit these notions and instead propose a more general formulation of roles based on the similarity of a feature representation (in contrast to the graph representation). This leads us to propose a taxonomy of two general classes of techniques for discovering roles: (i) graph-based roles and (ii) feature-based roles. This survey focuses primarily on feature-based roles. In particular, we introduce a flexible framework for discovering roles using the notion of structural similarity on a feature-based representation. The framework consists of two fundamental components: (1) role feature construction and (2) role assignment using the learned feature representation. We discuss the relevant decisions for discovering feature-based roles and highlight the advantages and disadvantages of the many techniques that can be used for this purpose. Finally, we discuss potential applications, future directions, and challenges.
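As a minimal illustrative sketch of the two-component framework described in the abstract, the toy code below constructs simple structural features per node (degree and mean neighbour degree; an illustrative choice, not the survey's exact feature set) and then assigns roles by clustering the feature vectors with a naive k-means:

```python
# Sketch: (1) role feature construction, (2) role assignment by clustering.
import random
from collections import defaultdict

def structural_features(edges):
    # build adjacency, then one feature vector per node:
    # (degree, mean neighbour degree)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    feats = {}
    for n, nbrs in adj.items():
        deg = len(nbrs)
        feats[n] = (deg, sum(len(adj[m]) for m in nbrs) / deg)
    return feats

def assign_roles(feats, k=2, iters=20, seed=0):
    # naive k-means over the feature vectors
    random.seed(seed)
    pts = list(feats.items())
    centers = [v for _, v in random.sample(pts, k)]

    def nearest(v):
        return min(range(k),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centers[i])))

    for _ in range(iters):
        groups = defaultdict(list)
        for _, v in pts:
            groups[nearest(v)].append(v)
        for i, vs in groups.items():
            centers[i] = tuple(sum(col) / len(vs) for col in zip(*vs))
    return {n: nearest(v) for n, v in pts}

# star graph: hub 0 connected to leaves 1-4
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
roles = assign_roles(structural_features(edges))
```

On the star graph in this toy example, the hub and the leaves end up in different roles because their feature vectors differ.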


... Numerous approaches have been proposed in the literature for the task of structural role discovery on graphs, where nodes on a graph are divided into classes of structurally similar nodes called 'roles' [1]. Early approaches in this area relied on graph sub-structures known as graphlets or motifs [2]. ...
... These methods are designed to learn higher-order structural relationships than those that can be discovered by small graphlets. However, in many cases, these alternative approaches come at the cost of interpretability. When applied to graphs that are too large to be visualised reasonably, it is often difficult to understand the substantive meaning of a given set of structural roles. ...
... Role discovery is the task of grouping nodes which share similar structural patterns in a graph into distinct classes [1]. Many modern approaches to role discovery rely on graph embedding, where nodes are transformed into low-dimensional vector representations [9,10,3,4]. ...
Preprint
Full-text available
Role discovery is the task of dividing the set of nodes on a graph into classes of structurally similar roles. Modern strategies for role discovery typically rely on graph embedding techniques, which are capable of recognising complex graph structures when reducing nodes to dense vector representations. However, when working with large, real-world networks, it is difficult to interpret or validate a set of roles identified according to these methods. In this work, motivated by advancements in the field of explainable artificial intelligence (XAI), we propose Surrogate Explanation for Role Discovery (SERD), a new framework for interpreting role assignments on large graphs using small subgraph structures known as graphlets. We demonstrate our framework on a small synthetic graph with prescribed structure, before applying it to a larger real-world network. In the second case, a large, multidisciplinary citation network, we successfully identify a number of important citation patterns or structures which reflect interdisciplinary research.
... Depending on the granularity of partitioning, multiple roles can be observed within a single time slice (Rossi and Ahmed 2014). For example, in a human-relationship network in which people are nodes, and the connections between those people are links, a node that plays the role of a team leader in a company during weekdays forms a local structure with a tree topology (reflecting the structure of the company). ...
... Although previous research has acknowledged the multiple-role phenomenon, few specific approaches to the estimation of multiple roles have been presented (Rossi and Ahmed 2014;Liu et al. 2021). In earlier work (Liu et al. 2021), we proposed a method to predict multiple roles with an adversarial learning approach by treating the multiple-role discovery task as a multi-label classification problem. ...
... Methods aimed at role discovery obtain a vector representation of nodes from the network structure, cluster nodes by specific similarity criteria, and assign role labels to each cluster. In the task of role assignment, the conventional methods can be divided into three types: hard single-role discovery (assigning each node to exactly one role), soft single-role discovery (assigning each node a distribution over roles), and multiple-role discovery (assigning each node one or more roles) (Rossi and Ahmed 2014). A large number of works have attempted hard single-role discovery. ...
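The three assignment types mentioned in the excerpt can be contrasted in a small sketch. The centroids, the example node, and the membership threshold below are invented for illustration:

```python
# Hard, soft, and multiple role assignment derived from one soft membership.
import math

def memberships(vec, centroids):
    # soft assignment: distance-based similarity, normalised to a distribution
    sims = [math.exp(-sum((a - b) ** 2 for a, b in zip(vec, c)))
            for c in centroids]
    total = sum(sims)
    return [s / total for s in sims]

centroids = [(4.0, 1.0), (1.0, 4.0)]   # hypothetical role centroids
node = (3.5, 1.5)                      # hypothetical node feature vector

soft = memberships(node, centroids)                  # distribution over roles
hard = max(range(len(soft)), key=soft.__getitem__)   # single most likely role
multi = [i for i, p in enumerate(soft) if p > 0.2]   # all roles above a threshold
```

Here the node sits close to the first centroid, so the hard assignment, the dominant soft membership, and the thresholded multiple-role set all agree on role 0.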
Article
Full-text available
In complex networks, the role of a node is based on the aggregation of structural features and functions. However, in real networks, it has been observed that a single node can have multiple roles. Here, the roles of a node can be defined in a case-by-case manner, depending on the graph data mining task. Consequently, a significant obstacle to achieving multiple-role discovery in real networks is finding the best way to select datasets for pre-labeling. To meet this challenge, this study proposes a flexible framework that extends a single-role discovery method by using domain adversarial learning to discover multiple roles for nodes. Furthermore, we propose a method to assign sub-networks, derived through community extraction methods, to a source network and a validation network as training datasets. Experiments to evaluate accuracy conducted on real networks demonstrate that the proposed method can achieve higher accuracy and more stable results.
... To get closer to the objective of our study, let us also mention that in the recent times role discovery (especially topological feature-based [13]) has become a popular topic, most notably in the domain of non-attributed social network analysis [13][14][15][16][17][18][19]. In the network context, roles refer to clusters, or classes, of nodes, where the nodes from the same cluster are structurally similar to each other in some way. ...
... The main idea behind role discovery is to group nodes by their connectivity patterns, where each group represents some topological role such as hub, bridge, near-clique, etc. Topological roles indicate which functions nodes serve in the network [13]. Initially, role discovery was of interest in sociology, where it was used to study the interactions between social actors and assign roles to them, but the networks in these studies were very small [31,32]. ...
Preprint
Full-text available
In this paper, we propose a framework for solving the novel problem of role discovery in a public transportation network (PTN). We model a PTN as a weighted node-attributed network whose nodes are public transport stations (stops) grouped with respect to their geospatial position, node attributes store information about the social infrastructure around the stations (stops), and weighted links integrate information about the travelling distance and the number of hops in the transportation routes between the stations (stops). Our framework discovers meaningful node roles in terms of both topological and infrastructural features of a PTN and is capable of extracting useful insights about the overall PTN’s efficiency. We apply the framework to the newly collected open data of St Petersburg, Russia, and point out some transportation and infrastructural weaknesses that should be taken into consideration by the city administration to improve the PTN in the future.
... Role-mining or role-discovery is a new concept compared to community detection for network data. Recently, Rossi and Ahmed (2015) provided a descriptive and well-organized survey of the existing role discovery techniques in the literature. ...
... Our simulation methodology is based on the concept of role-mining (Rossi and Ahmed, 2015), which is a clustering approach that defines meaningful roles (clusters) of the nodes in a network based on the graph properties of the nodes. Our simulation approach is based on the hypothesis that nodes in a network are divided into several roles such as server, router, personal machine, university machine, and so on, where each role usually communicates with some other roles at a certain frequency. ...
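The hypothesis described in the excerpt, that nodes fall into roles and each role communicates with other roles at certain frequencies, can be sketched as a toy edge generator. The role names and frequency values below are invented, not taken from the cited work:

```python
# Toy generator: edges are sampled according to a role-to-role
# communication frequency matrix (role names and values are made up).
import random

random.seed(1)
roles = {"server": [0], "client": [1, 2, 3, 4]}
# freq[a][b]: probability that a node of role a contacts a node of role b
freq = {"server": {"server": 0.0, "client": 0.9},
        "client": {"server": 0.8, "client": 0.1}}

def sample_edges(n_trials=1000):
    edges = []
    names = list(roles)
    for _ in range(n_trials):
        ra, rb = random.choice(names), random.choice(names)
        if random.random() < freq[ra][rb]:
            u, v = random.choice(roles[ra]), random.choice(roles[rb])
            if u != v:
                edges.append((u, v))
    return edges

edges = sample_edges()
# with these frequencies, most traffic involves the server node
server_edges = sum(1 for u, v in edges if u == 0 or v == 0)
```

Because server-client contact frequencies dominate, the vast majority of sampled edges touch the server node, mirroring the hypothesis that each role communicates with certain other roles at characteristic rates.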
Preprint
Full-text available
Detection of malicious networks (botnets) is becoming a major concern as they pose a serious threat to network security. However, botnet detection methods often perform very poorly on real-life datasets because the methods are not developed based on a real-life botnet dataset. A crucial reason for this is the scarcity of large-scale, real-life botnet datasets: due to security and privacy concerns, organizations do not publish their real-life botnet datasets. Realizing the need for a real-life large-scale botnet dataset, in this paper we develop a simulation methodology to simulate a large-scale botnet dataset from a real-life botnet dataset. This simulation methodology is based on Markov chain and role-mining approaches. Besides simulating the degree distribution, our simulation methodology also simulates triangles (community structures). We propose a novel scalable algorithm using parallel computing that generates large-scale botnet graphs from a small-size input dataset. To evaluate the performance of our simulation methodology, we compare our simulated graph with the original graph and with the graph simulated by the preferential attachment (PA) algorithm based on the distributions of triangles, indegrees, and outdegrees. Results demonstrate that the distributions of the simulated graph generated by our methodology are very similar to the distributions of the original graph, with minor real-life random variations. Results also demonstrate that our simulation algorithm substantially outperforms the PA algorithm in simulating the distributions of triangles and botnet subgraphs. To further demonstrate the accuracy of the botnet simulation, we provide a separate comparison between the botnet subgraphs of the simulated and original graphs that shows the similarity of our simulated botnet subgraphs to the original botnet subgraph.
A comparison of our simulated scaled-up graph with the original graph demonstrates that our methodology preserves the triangle distribution and the botnet subgraphs of the original graph, whereas the PA algorithm fails to preserve both in the scaled-up graph.
... The survey by Rossi and Ahmed [35] puts forward an application-based approach to node role extraction that evaluates the node roles by how well they can be utilized in a downstream machine learning task. However, this perspective is task-specific and more applicable to node embeddings based on roles rather than the actual extraction of roles. ...
... In this section, we diverge a little from the optimization-based perspective of the paper up to this point and showcase the effectiveness of the information content of the extracted roles in a few-shot learning downstream task. This links our approach to the application-based role evaluation approach of [35]. We employ an embedding based on the minimizer of the long-term cost function (eq. ...
Preprint
Full-text available
Similar to community detection, partitioning the nodes of a network according to their structural roles aims to identify fundamental building blocks of a network. The found partitions can be used, e.g., to simplify descriptions of the network connectivity, to derive reduced-order models for dynamical processes unfolding on networks, or as ingredients for various graph mining tasks. In this work, we offer a fresh look on the problem of role extraction and its differences to community detection, and present a definition of node roles related to graph-isomorphism tests, the Weisfeiler-Leman algorithm, and equitable partitions. We study two associated optimization problems (cost functions) grounded in ideas from graph isomorphism testing, and present theoretical guarantees associated with the solutions of these problems. Finally, we validate our approach via a novel "role-infused partition benchmark", a network model from which we can sample networks in which nodes are endowed with different roles in a stochastic way.
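The connection to the Weisfeiler-Leman algorithm mentioned in the abstract can be illustrated with a minimal 1-WL colour refinement pass: nodes that end up with the same stable colour form an equitable partition and hence candidate roles. This is a generic sketch of colour refinement, not the paper's optimization method:

```python
# Minimal 1-WL colour refinement: repeatedly relabel each node by the
# multiset of its neighbours' colours until the colouring stabilises.
def wl_colors(adj, rounds=10):
    colors = {n: 0 for n in adj}   # start with a uniform colour
    for _ in range(rounds):
        sig = {n: (colors[n], tuple(sorted(colors[m] for m in adj[n])))
               for n in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new = {n: relabel[sig[n]] for n in adj}
        if new == colors:          # stable colouring reached
            break
        colors = new
    return colors

# star graph: centre 0, leaves 1-3
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
colors = wl_colors(adj)
```

On the star graph, refinement separates the centre from the structurally interchangeable leaves, which is exactly the two-role equitable partition one would expect.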
... Since the emergence of two landmark network models, the small-world model by Watts and Strogatz (WS model) [1] and the scale-free network model by Barabasi and Albert (BA model), research on network models has been increasing rapidly [2]. Network models used in data research include online social networks [3], mail networks [4], biological networks [5], annotated networks [6], and the online dating market [7]. Recently, complex networks have been extensively evaluated and applied in several applications by physicists [8][9][10][11][12][13][14][15]. ...
... [45]; (c) physicians trust network [46]; (d) student interpersonal network [47]. ...
Article
Full-text available
Networks are prevalent in real life, and the study of network evolution models is very important for understanding the nature and laws of real networks. The distribution of the initial degree of nodes in existing classical models is constant or uniform. The model we propose exhibits a binomial distribution, which is consistent with real network data. The theoretical analysis shows that the proposed model is scale-free at different probability values and that its clustering coefficients are adjustable; the Barabasi-Albert model is the special case p = 0 of our model. In addition, the analytical results for the clustering coefficients can be estimated using mean-field theory. The mean clustering coefficients calculated from the simulated data and the analytical results tend to be stable. The model also exhibits small-world characteristics and has good reproducibility for short distances of real networks. Our model combines three network characteristics, scale-freeness, high clustering coefficients, and small-world characteristics, which is a significant improvement over traditional models with only one or two of these characteristics. The theoretical analysis procedure can be used as a theoretical reference for various network models to study the estimation of clustering coefficients. The existence of stable equilibrium points in the model addresses the controversy over whether the scale-free property is universal, and this explanation provides a new way of thinking to understand the problem.
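The clustering coefficients discussed in the abstract can be computed directly from the standard definition C_i = 2 L_i / (k_i (k_i - 1)), where L_i counts links among node i's neighbours; the small graph below is a toy example, unrelated to the proposed model:

```python
# Local clustering coefficient from its definition:
# C_i = 2 * (links among neighbours of i) / (k_i * (k_i - 1))
def local_clustering(adj, n):
    nbrs = adj[n]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i, u in enumerate(nbrs)
                for v in nbrs[i + 1:] if v in adj[u])
    return 2 * links / (k * (k - 1))

# triangle 0-1-2 plus a pendant node 3 attached to 0
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
c0 = local_clustering(adj, 0)   # 1 link among 3 neighbours -> 1/3
c1 = local_clustering(adj, 1)   # neighbours 0 and 2 are linked -> 1.0
```

Averaging such local values over all nodes gives the mean clustering coefficient that the abstract compares against its mean-field estimates.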
... This proximity-based network representation learning is also known as community-based representation learning (Rossi et al. 2019). Another way to measure node similarity is to consider the structural roles of nodes in the network (Rossi and Ahmed 2015). Referring to nodes with similar structural properties, structural roles define the set of nodes that are structurally more similar to nodes within a community than to nodes outside the community. This approach aims to embed nodes with structurally similar neighborhoods together, while allowing nodes to be farther apart in the network. ...
... The main mechanisms of structural similarity-based representation learning methods utilize the initial set of structural features of the nodes to produce feature-based roles Rossi and Ahmed (2015). ...
Article
Full-text available
Analysis of large-scale networks generally requires mapping high-dimensional network data to a low-dimensional space. We thus need to represent nodes and connections accurately and effectively, and representation learning is a promising method for this. In this paper, we investigate a novel social similarity-based method for learning network representations. We first introduce neighborhood structural features for representing node identities based on higher-order structural parameters. The node representations are then learned by a random walk approach based on these structural features. Our proposed truss2vec is able to maintain both structural similarity of nodes and domain similarity. Extensive experiments have shown that our model outperforms state-of-the-art solutions.
... Numerous approaches have been proposed in the literature for the task of structural role discovery on graphs, where nodes on a graph are divided into classes of structurally equivalent nodes [21]. Early approaches in this area relied on graph sub-structures known as graphlets or motifs [12]. ...
... Motifs and graphlets represent a powerful tool for expressing graph structure, and have been employed in graph learning tasks such as node classification and anomaly detection [7]. Figure 1 illustrates a subset of graphlets with 2, 3, 4 and 5 nodes, and includes each of the distinct orbits on these graphlets, as enumerated by [18]. Role discovery is the task of grouping nodes which share similar structural patterns in a graph into distinct classes [21]. Many modern approaches to role discovery rely on graph embedding, where nodes are transformed into low-dimensional vector representations [3,10,20,8]. ...
Preprint
Full-text available
Role discovery is the task of dividing the set of nodes on a graph into classes of structurally similar roles. Modern strategies for role discovery typically rely on graph embedding techniques, which are capable of recognising complex local structures. However, when working with large, real-world networks, it is difficult to interpret or validate a set of roles identified according to these methods. In this work, motivated by advancements in the field of explainable artificial intelligence (XAI), we propose a new framework for interpreting role assignments on large graphs using small subgraph structures known as graphlets. We demonstrate our methods on a large, multidisciplinary citation network, where we successfully identify a number of important citation patterns which reflect interdisciplinary research.
... Definition 5 (Roles). Roles define sets of nodes that are more structurally similar to nodes inside the set than outside [RA15b]. The terms role and position are used synonymously. ...
... As a result, the latent feature vectors or embeddings given as output from an embedding method can be thought of as either community-based or role-based representations [HERPF10][GL16]. In this light, recent embedding methods can be seen as approaches for modeling communities or (feature-based) roles [RA15b]. We refer the interested readers to our paper [RJK+19] that was accepted at TKDD 2020 for more details. ...
Thesis
Graphs are ubiquitous as they naturally capture interactions between entities, such as user interaction in online social media, paper citations in bibliographic networks, and user-product preferences in sales networks. Recently, graph representation learning has gained significant popularity in both academia and industry thanks to its state-of-the-art performance in a variety of downstream machine learning (ML) tasks, such as friend recommendations and anomaly detection. Specifically, node representation learning (embedding) aims to find a dense vector of rich latent features per entity that can be used in ML tasks. However, these dense representations with fixed dimensions come with computational and storage challenges for real-world graphs with many millions or billions of nodes, and the "black-box" nature of the latent features impedes interpretability. On the other hand, graph summarization aims to find a concise and interpretable representation of the original graph that describes its key information, but it is often lossy and trades off space and performance in ML tasks. In this thesis, we bridge the two lines of research, node embedding and graph summarization: we introduce scalable methods for generating summaries of latent or non-latent (original) node features that achieve the state-of-the-art performance on ML tasks while requiring significantly reduced storage and supporting interpretability. Specifically, we introduce a new problem, latent network summarization, which summarizes the graph structural features in static networks as latent node embeddings for storage and query efficiency, and extend this idea to incorporate temporal proximity in temporal summaries of continuous-time dynamic networks. 
We also perform an extensive systematic study of temporal summaries and show that they capture the graph structure and temporal dependency at least as well as recently-proposed dynamic embedding approaches, while having significantly less complexity (i.e., no transitional or latent variables). Unlike methods that are based on complex models as "black boxes", our temporal summaries are easy-to-understand, which motivates their usage for practitioners in predictive applications. Finally, we summarize the non-latent graph features by modeling feature importance as the high-level knowledge through traditional and deep learning models that can be used for graph analysis and transfer learning. Throughout the thesis, we demonstrate the effectiveness, scalability and space efficiency of our methods on industrial applications such as entity linkage, user stitching, professional role inference, and temporal link prediction, and present insights that can inform further methodological development and applications.
... The network density (ND) represents the proportion of possible connections that actually exist in a network. The closer the network density is to 1, the more connections the members have on average [62]. Assuming that there are N members in the network, the maximum number of connections in the network is theoretically N(N − 1). ...
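The density computation described in the excerpt is straightforward to state in code; the member count and edge list below are toy values, not the survey data:

```python
# Network density for a directed network of N members:
# density = existing connections / N*(N-1) possible ordered pairs.
def network_density(n_members, edges):
    possible = n_members * (n_members - 1)
    return len(set(edges)) / possible

# 4 members, 3 directed connections -> 3 / 12
nd = network_density(4, [(1, 2), (2, 3), (3, 1)])
```

A density near 1 would mean almost every ordered pair of members is connected; here the toy network reaches only a quarter of its theoretical maximum.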
Article
Full-text available
In China, private-owned cooperatives are becoming increasingly involved in agricultural production. In order to find the key characteristics of smallholders’ social networks after the appearance of cooperatives and to better organize different farmland operators, this study completed a field survey of 114 smallholders who adopted the farmland trusteeship service of a private-owned cooperative in China and applied social network analysis to reveal the following results. (1) Compared to the theoretical ideal value, smallholders’ social networks showed low network density, low efficiency, and little relevancy. (2) In the social networks of mechanical-sharing, neighbor, kinship, and labor-sharing relationships, some isolated nodes existed, but no isolated nodes were found in the synthetic network. (3) The mechanical-sharing relationship among smallholders was stronger than the other relationships. (4) Machinery owners, farmers whose plots are at the geometric center, and experienced older farmers showed higher centralities in the network, but village cadres did not. (5) The centralities and QAP correlation coefficients among different networks inside the cooperative were lower than those inside a single village. As a result, this paper confirms that the ability of cooperatives to organize farmers’ social networks is not ideal. Farmers entrusting their farmland to a cross-village cooperative does not help them form a larger social network than their villages. In the future, the answer to the question of “who will farm the land” will still lie with the professional farmers and highly autonomous cooperatives.
... Over the past decades, many methods have been proposed to determine the appropriate number of roles (or clusters). Among them, Akaike's information criterion (AIC) and Minimum Description Length (MDL) are two reliable approaches (Cook et al., 2007; Grünwald and Grunwald, 2007; Rossi and Ahmed, 2014). ...
Article
Service bottlenecks are a key barrier to building a resilient public transport system. In this paper, we propose a new approach to automatically extract the role of a station in dynamical public transport flow networks based on the emerging role discovery method in network science. The term "role" in this study refers to the distinctive position or function that a station plays within the public transport flow network. Using smart card data from Nanjing public bike-sharing agencies, we first construct dynamical public transport flow networks with notions of dynamical graphs and edges. We then develop a dynamical algorithm to recursively compute the structural flow characteristics of nodes in passenger flow networks. Non-negative Matrix Factorization is conducted to extract the role memberships from the derived structural feature matrix and interpret each role in terms of measurements with practical values. The network hubs and potential service bottlenecks are then identified based on their operating characteristics and dynamics. Furthermore, the day-to-day and within-day role dynamics of public transport stations over time are unveiled. The results contribute to a better understanding of the interplay between stations in the network, and the identification of roles provides insight for public transport agencies to improve service resilience.
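The role-extraction step described above, factorizing a structural feature matrix with Non-negative Matrix Factorization, can be sketched with plain multiplicative updates. The feature matrix below is a toy example, not smart card data:

```python
# Sketch: factor a (nodes x features) matrix V into W (role memberships)
# and H (role definitions) via NMF multiplicative updates.
import numpy as np

def nmf(V, k=2, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# two clear structural patterns: "hub-like" rows and "leaf-like" rows
V = np.array([[5.0, 0.1],
              [5.0, 0.2],
              [0.1, 4.0],
              [0.2, 5.0]])
W, H = nmf(V)
roles = W.argmax(axis=1)   # dominant role per node
```

Each row of W is a soft role membership vector; taking the argmax per row recovers the two groups of structurally similar nodes that were built into V.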
... Moreover, the changing role of a person in such networks was also recognized [49]. If the interconnectivity of social members is taken into consideration, a member can play different roles in one or more communities [50][51][52]. Role detection is usually done with blockmodelling, which identifies nodes with structural equivalence [53,54] and could also be used to detect structural changes in a network [50]. These methods are very convenient when dynamic online communities [50] or social rating platforms [55] are investigated. ...
Article
Full-text available
Central actors or opinion leaders are in the right structural position to spread relevant information or convince others about adopting an innovation or behaviour change. Who is a central actor or opinion leader can be conceptualised in various ways. Widely accepted centrality measures do not take into account that those in central positions in the social network may change over time. A longitudinal comparison of the set and importance of opinion leaders is problematic with these measures and therefore needs a novel approach. In this study, we investigate ways to compare the stability of the set of central actors over time. Using longitudinal survey data from primary schools (where the members of the social networks do not change much over time) on advice-seeking and friendship networks, we find that who occupies the central positions is relatively unstable, however we define centrality. We propose the application of combined indices in order to achieve more efficient targeting results. Our results suggest that because opinion leaders may change over time, researchers should be careful about relying on simple centrality indices from cross-sectional data to gain and interpret information (for example, in the design of prevention programs, network-based interventions or infection control) and must rely on more diverse structural information instead.
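One simple way to quantify the stability of the set of central actors over time, in the spirit of the abstract above, is to compare the top-ranked nodes by degree at two observation waves. The two toy edge lists below are invented, not the survey data:

```python
# Overlap of top-k degree-central actors across two observation waves.
from collections import Counter

def top_k(edges, k=2):
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {n for n, _ in deg.most_common(k)}

wave1 = [(1, 2), (1, 3), (1, 4), (2, 3)]   # actor 1 is most central
wave2 = [(2, 1), (2, 4), (2, 5), (3, 4)]   # actor 2 takes over
stability = len(top_k(wave1) & top_k(wave2)) / 2   # fraction of shared top-2
```

A value well below 1 indicates that the set of central actors changed between waves, which is the kind of instability the study reports for its school networks.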
... HIN2Vec [3] is a metapath-based heterogeneous embedding method that can combine several fixed-length metapaths to guide random walks and perform multiple prediction tasks to learn node embeddings effectively [8,15,22]. Low-dimensional feature representation has a wide range of utilities in various network analysis tasks such as link prediction [4,8], node classification, recommendation systems [23,24], relational mining [25,26], and role discovery [27]. ...
Article
Full-text available
Heterogeneous Information Networks (HINs) consist of multiple categories of nodes and edges and encompass rich semantic information. Representing HINs in a low-dimensional feature space is challenging due to its complex structure and rich semantics. In this paper, we focus on link prediction and node classification by learning efficient low-dimensional feature representations of HINs. Metapath-guided walkers have been extensively studied in the literature for learning feature representations. However, the metapath walker does not control the length of random walks, resulting in weak structural and semantic information embeddings. In this work, we present an influence propagation controlled metapath-guided random walk model (called IPCMetapath2Vec) for representation learning in HINs. The model works in three phases: first, we perform node transition to generate a metapath-guided random walk, which is conditioned on two factors: (i) type mapping of the next node according to the metapath, and (ii) compute influence propagation score for each node and detect potential influencers on the walk by a threshold based filter. Next, we provide the collected random walks as input to the skip-gram model to learn each node’s feature representation. Lastly, we employ an attention mechanism that aggregates the learned feature representations of each node from various semantic metapath-guided walks, preserving the importance of different semantics. We use these network representation features to address link prediction and multi-label node classification tasks. Experimental results on two public HIN datasets, namely DBLP and IMDB, show that our model outperforms the state-of-the-art representation learning models such as DeepWalk, Node2vec, Metapath2Vec, and HIN2Vec by 4.5% to 17.2% in terms of micro-F1 score for multi-label node classification and 4% to 14.50% in terms of AUC-ROC score for link prediction.
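The metapath-guided walk underlying such models can be sketched as follows; the toy heterogeneous network, node types, and metapath are invented, and this is not the IPCMetapath2Vec model itself:

```python
# Metapath-guided random walk: the next node is restricted to the
# node type required by the metapath pattern.
import random

random.seed(7)
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "v1": "V"}
adj = {"a1": ["p1", "p2"], "a2": ["p2"],
       "p1": ["a1", "v1"], "p2": ["a1", "a2", "v1"],
       "v1": ["p1", "p2"]}

def metapath_walk(start, metapath, length):
    walk = [start]
    for i in range(1, length):
        want = metapath[i % len(metapath)]
        candidates = [n for n in adj[walk[-1]] if node_type[n] == want]
        if not candidates:   # walk cannot continue along the metapath
            break
        walk.append(random.choice(candidates))
    return walk

walk = metapath_walk("a1", ["A", "P"], 6)   # follow the A-P-A-P... pattern
```

Such type-constrained walks are what a skip-gram model would then consume as "sentences"; here every visited node matches the alternating A-P pattern.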
... It is widely used in various downstream tasks such as role classification. The concept of roles first appeared in sociology [1] for mining potential social relationships, and it has gradually been applied to complex networks, for example in the analysis of traffic network congestion [2]. By measuring structural similarity [3], role-similar nodes can be found in distant locations, even in different subgraphs of disconnected networks. ...
Article
Full-text available
Role-based network embedding aims to embed role-similar nodes into a similar embedding space, which is widely used in graph mining tasks such as role classification and detection. Roles are sets of nodes in graph networks with similar structural patterns and functions. However, the role-similar nodes may be far away or even disconnected from each other. Meanwhile, the neighborhood node features and noise also affect the result of the role-based network embedding, which are also challenges of current network embedding work. In this paper, we propose a Role-based network Embedding via Quantum walk with weighted Features fusion (REQF), which simultaneously considers the influence of global and local role information, node features, and noise. Firstly, we capture the global role information of nodes via quantum walk based on its superposition property which emphasizes the local role information via biased quantum walk. Secondly, we utilize the quantum walk weighted characteristic function to extract and fuse features of nodes and their neighborhood by different distributions which contain role information implicitly. Finally, we leverage the Variational Auto-Encoder (VAE) to reduce the effect of noise. We conduct extensive experiments on seven real-world datasets, and the results show that REQF is more effective at capturing role information in the network, which outperforms the best baseline by up to 14.6% in role classification, and 23% in role detection on average.
... The role that nodes play within the network depends more on the structure of the network around them than on the distance between them. (See [40] for a survey on roles.) The next four algorithms aim to create embeddings that capture structural properties of the network. ...
Article
Full-text available
Users on social networks such as Twitter interact with each other without much knowledge of the real identity behind the accounts they interact with. This anonymity has created a perfect environment for bot accounts to influence the network by mimicking real-user behaviour. Although not all bot accounts have malicious intent, identifying bot accounts in general is an important and difficult task. In the literature there are three distinct types of feature sets one could use for building machine learning models for classifying bot accounts: user profile metadata, natural language (NLP) features extracted from user tweets, and features extracted from the underlying social network. Profile metadata and NLP features are typically explored in detail in the bot-detection literature. At the same time, less attention has been given to the predictive power of features that can be extracted from the underlying network structure. To fill this gap, we explore and compare two classes of embedding algorithms that can take advantage of the information that network structure provides. The first class consists of classical embedding techniques, which focus on learning proximity information. The second class consists of structural embedding algorithms, which capture the local structure of a node's neighbourhood. We show that features created using structural embeddings have higher predictive power when it comes to bot detection. This supports the hypothesis that the local social network formed around bot accounts on Twitter contains valuable information that can be used to identify bot accounts.
... As a result, two nodes of different degrees can be regularly equivalent if they have the same types of neighbours. Similar definitions are used, for example, in [28]. A notion of structural similarity similar to that proposed in [16] is considered in [33]. ...
Preprint
Full-text available
An embedding is a mapping from a set of nodes of a network into a real vector space. Embeddings can have various aims, such as capturing the underlying graph topology and structure, node-to-node relationships, or other relevant information about the graph, its subgraphs or the nodes themselves. A practical challenge with using embeddings is that there are many available variants to choose from. Selecting a small set of the most promising embeddings from the long list of possible options for a given task is challenging and often requires domain expertise. Embeddings can be categorized into two main types: classical embeddings and structural embeddings. Classical embeddings focus on learning both local and global proximity of nodes, while structural embeddings learn information specifically about the local structure of a node's neighbourhood. For classical node embeddings there exists a framework which helps data scientists to identify (in an unsupervised way) a few embeddings that are worth further investigation. Unfortunately, no such framework exists for structural embeddings. In this paper we propose a framework for unsupervised ranking of structural graph embeddings. The proposed framework, apart from assigning an aggregate quality score to a structural embedding, additionally gives a data scientist insights into the properties of this embedding. It reports which predefined node features the embedding learns, how well it learns them, and which dimensions in the embedded space represent those features. Using this information the user gets a level of explainability for an otherwise complex black-box embedding algorithm.
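A minimal sketch of the kind of probe such a framework can apply: correlate each embedding dimension with a predefined node feature (degree, here) to find where, and how well, the feature is represented. The toy embedding and feature values below are invented for illustration, not taken from the paper:

```python
import math

# Toy data: per-node degree (a predefined structural feature) and a
# 2-dimensional "embedding". Dimension 0 is constructed to track degree,
# dimension 1 is unrelated; all values are illustrative.
degree    = [1, 2, 3, 4, 5, 6]
embedding = [(0.9, 0.3), (2.1, -0.5), (2.9, 0.1),
             (4.2, 0.4), (5.1, -0.2), (5.8, 0.0)]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Score each dimension by |correlation| with the feature; the best
# dimension indicates where the embedding stores degree information.
scores = [abs(pearson([e[d] for e in embedding], degree)) for d in range(2)]
best_dim = max(range(2), key=lambda d: scores[d])
assert best_dim == 0 and scores[0] > 0.99
```

A real framework would repeat this over many predefined structural features and aggregate the per-feature scores into one quality score per embedding.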
... The learned encoding will introduce problems for the downstream decoder due to the presence of symmetries in the graphs that make decoding challenging. In particular, when nodes have the same structural role within a graph, their feature representations after passing through an MPNN will be identical; such nodes are known as regularly equivalent [38,46]. Such cases arise when a graph contains a bipartite subgraph, formed by groups of regularly equivalent nodes connected by edges, and the bipartite graph is neither complete nor without edges. ...
Preprint
Full-text available
In this work, we address the problem of modeling distributions of graphs. We introduce the Vector-Quantized Graph Auto-Encoder (VQ-GAE), a permutation-equivariant discrete auto-encoder designed to model the distribution of graphs. By exploiting the permutation-equivariance of graph neural networks (GNNs), our autoencoder circumvents the problem of the ordering of the graph representation. We leverage the capability of GNNs to capture local structures of graphs while employing vector quantization to prevent the mapping of discrete objects to a continuous latent space. Furthermore, the use of autoregressive models enables us to capture the global structure of graphs via the latent representation. We evaluate our model on standard datasets used for graph generation and observe that it achieves excellent performance on some of the most salient evaluation metrics compared to the state of the art.
... Over the past decades, many methods have been proposed for determining the appropriate number of roles (or clusters). Among them, Akaike's information criterion (AIC) and the Minimum Description Length (MDL) principle are two reliable approaches (Cook et al., 2007; Grünwald and Grunwald, 2007; Rossi and Ahmed, 2014). ...
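As a sketch of how an information criterion trades goodness-of-fit against the number of roles, the toy code below scores candidate cluster counts with an AIC-style penalty over illustrative 2-D feature vectors. The brute-force clustering and the exact penalty form are simplifying assumptions for illustration, not the procedure of any cited work:

```python
import math
from itertools import product

# Toy 2-D "role feature" vectors: two well-separated groups of four nodes
# (values are illustrative).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1)]

def best_sse(points, k):
    """Exact minimum within-cluster sum of squared errors over all
    assignments (brute force; fine for this toy size)."""
    best = math.inf
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:      # require k non-empty clusters
            continue
        sse = 0.0
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            cx = sum(p[0] for p in members) / len(members)
            cy = sum(p[1] for p in members) / len(members)
            sse += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)
        best = min(best, sse)
    return best

def aic(sse, k, n, d=2):
    # Gaussian log-likelihood up to constants, plus 2 * (centroid parameters).
    return n * math.log(sse / n) + 2 * k * d

n = len(points)
scores = {k: aic(best_sse(points, k), k, n) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)
assert best_k == 2   # the criterion recovers the two planted groups
```

MDL-based selection follows the same pattern, with the penalty replaced by the cost of encoding the model.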
... The role that nodes play within the network depends more on the structure of the network around them than on the distance between them. (See [29] for a survey on roles.) The next four algorithms aim to create embeddings that capture structural properties of the network. ...
Preprint
Full-text available
Users on social networks such as Twitter interact with and are influenced by each other without much knowledge of the identity behind each user. This anonymity has created a perfect environment for bot and hostile accounts to influence the network by mimicking real-user behaviour. To combat this, research into designing algorithms and datasets for identifying bot users has gained significant attention. In this work, we highlight various techniques for classifying bots, focusing on the use of node and structural embedding algorithms. We show that embeddings can be used as unsupervised techniques for building features with predictive power for identifying bots. By comparing features extracted from embeddings to other techniques such as NLP, user profile and node features, we demonstrate that embeddings can be used as a unique source of predictive information. Finally, we study the stability of features extracted using embeddings for tasks such as bot classification by artificially introducing noise into the network. The degradation of classification accuracy is comparable to that of models trained on carefully designed node features, hinting at the stability of embeddings.
... However, we can cluster nodes according to criteria that are different from the density of their connections. This is particularly true for directed networks, where nodes can be clustered together based on the similarity of connectivity patterns that do not require them to share connections (Malliaros and Vazirgiannis [2013], Rossi and Ahmed [2014]). This notion of clusters is akin to the detection of roles. ...
Preprint
Mesoscale structures are an integral part of the abstraction and analysis of complex systems. They reveal a node's function in the network, and facilitate our understanding of the network dynamics. For example, they can represent communities in social or citation networks, roles in corporate interactions, or core-periphery structures in transportation networks. We usually detect mesoscale structures under the assumption of independence of interactions. Still, in many cases, the interactions invalidate this assumption by occurring in a specific order. Such patterns emerge in pathway data; to capture them, we have to model the dependencies between interactions using higher-order network models. However, the detection of mesoscale structures in higher-order networks is still under-researched. In this work, we derive a Bayesian approach that simultaneously models the optimal partitioning of nodes in groups and the optimal higher-order network dynamics between the groups. In synthetic data we demonstrate that our method can recover both standard proximity-based communities and role-based groupings of nodes. In synthetic and real world data we show that it can compete with baseline techniques, while additionally providing interpretable abstractions of network dynamics.
... In more recent studies, considering weak ties was helpful in different sociological contexts, such as the influence of indirect contacts on decision making [29], or the dismantling of organized crime [15]. In a more graph-theoretical approach, researchers explored different methods of clustering people by their roles, relying on the fact that structurally equivalent nodes fulfill the same role in society [19]. ...
Article
Full-text available
Understanding how a disease spreads in a population is a first step to preparing for future epidemics, and machine learning models are a useful tool for analyzing the spreading process of infectious diseases. For effective predictions of these spreading processes, node embeddings are used to encode networks based on the similarity between nodes into feature vectors, i.e., higher-dimensional representations of human contacts. In this work, we evaluated the impact of homophily and structural equivalence on node2vec embeddings for disease spread prediction by testing them on real-world temporal human contact networks. Our results show that structural equivalence is a useful indicator of the infection status of a person. Embeddings that are balanced towards the preservation of structural equivalence performed better than those that focus on the preservation of homophily, with an average improvement of 0.1042 in the F1-score (95% CI 0.051 to 0.157). This indicates that structurally equivalent nodes behave similarly during an epidemic (e.g., expected time of disease onset). This observation could greatly improve predictions of future epidemics where only partial information about contacts is known, thereby helping determine the risk of infection for different groups in the population.
... Existing GNNs have mostly focused on learning a single node embedding (or representation) [23,32], even though a node often exhibits polysemous behavior in different contexts [3]. For instance, an individual may have many different personas; e.g., a user may be a researcher, father, coach, and activist [14,22]. These personas may be fundamentally different or even impossible for other individuals. ...
Preprint
Graph Neural Networks (GNNs) have become increasingly important in recent years due to their state-of-the-art performance on many important downstream applications. Existing GNNs have mostly focused on learning a single node representation, even though a node often exhibits polysemous behavior in different contexts. In this work, we develop a persona-based graph neural network framework called PersonaSAGE that learns multiple persona-based embeddings for each node in the graph. Such disentangled representations are more interpretable and useful than a single embedding. Furthermore, PersonaSAGE learns the appropriate set of persona embeddings for each node in the graph, and every node can have a different number of assigned persona embeddings. The framework is flexible, and its general design makes the learned embeddings widely applicable across domains. We utilize publicly available benchmark datasets to evaluate our approach against a variety of baselines. The experiments demonstrate the effectiveness of PersonaSAGE for a variety of important tasks, including link prediction, where we achieve an average gain of 15% while remaining competitive for node classification. Finally, we also demonstrate the utility of PersonaSAGE with a case study on personalized recommendation of different entity types in a data management platform.
... GRAPHVIS provides a diverse collection of visual interactive graph partitioning methods, for example community detection, role discovery (Rossi and Ahmed 2015b), and graph coloring. All graph partitioning methods are designed to be efficient, taking at most linear time in the number of edges to compute. ...
Article
We present a web-based network visual analytics platform called GraphVis that combines interactive visualizations with analytic techniques to reveal important patterns and insights for sense making, reasoning, and decision-making. The platform is designed with simplicity in mind and allows users to visualize and explore networks in seconds with a simple drag-and-drop of a graph file into the web browser. GraphVis is fast and flexible, web-based, requires no installation, while supporting a wide range of graph formats as well as state-of-the-art visualization and analytic techniques. In particular, the multi-level network analysis engine of GraphVis gives rise to a variety of new possibilities for exploring, analyzing, and understanding complex networks interactively in real-time. Finally, we also highlight other key aspects including filtering, querying, ranking, manipulating, exporting, partitioning (community/role discovery), as well as tools for dynamic network analysis and visualization, interactive graph generators (including two new block model approaches), and a variety of multi-level network analysis and statistical techniques.
... One way to overcome these limitations is the paradigm of role discovery [38] that identifies nodes with structurally similar neighborhoods. In contrast to the notion of communities defined by network proximity, structural roles characterize nodes by their local connectivity and subgraph patterns independent of their location in the network [41]; thus, two nodes with similar roles may lie in different parts of the graph. ...
Article
Full-text available
We present InfoMotif, a new semi-supervised, motif-regularized, learning framework over graphs. We overcome two key limitations of message passing in popular graph neural networks (GNNs): localization (a k-layer GNN cannot utilize features outside the k-hop neighborhood of the labeled training nodes) and over-smoothed (structurally indistinguishable) representations. We formulate attributed structural roles of nodes based on their occurrence in different network motifs, independent of network proximity. Network motifs are higher-order structures indicating connectivity patterns between nodes and are crucial to the organization of complex networks. Two nodes share attributed structural roles if they participate in topologically similar motif instances over covarying sets of attributes. InfoMotif achieves architecture-agnostic regularization of arbitrary GNNs through novel self-supervised learning objectives based on mutual information maximization. Our training curriculum dynamically prioritizes multiple motifs in the learning process without relying on distributional assumptions in the underlying graph or the learning task. We integrate three state-of-the-art GNNs in our framework, to show notable performance gains (3–10% accuracy) across nine diverse real-world datasets spanning homogeneous and heterogeneous networks. Notably, we see stronger gains for nodes with sparse training labels and diverse attributes in local neighborhood structures.
... different from random graphs. Such high-density nodes may act as hubs and play an important structural role in their respective network topologies (see also Henderson et al., 2012; Rossi & Ahmed, 2014). These findings have important practical implications for future research. First, many classical frequentist tests require a Gaussian distribution and thus are ill suited for analyzing heavy-tailed distributions. ...
Article
Full-text available
Elucidating the neural basis of social behavior is a long‐standing challenge in neuroscience. Such endeavors are driven by attempts to extend the isolated perspective on the human brain by considering interacting persons' brain activities, but a theoretical and computational framework for this purpose is still in its infancy. Here, we posit a comprehensive framework based on bipartite graphs for interbrain networks and address whether they provide meaningful insights into the neural underpinnings of social interactions. First, we show that the nodal density of such graphs exhibits nonrandom properties. While the current hyperscanning analyses mostly rely on global metrics, we encode the regions' roles via matrix decomposition to obtain an interpretable network representation yielding both global and local insights. With Bayesian modeling, we reveal how synchrony patterns seeded in specific brain regions contribute to global effects. Beyond inferential inquiries, we demonstrate that graph representations can be used to predict individual social characteristics, outperforming functional connectivity estimators for this purpose. In the future, this may provide a means of characterizing individual variations in social behavior or identifying biomarkers for social interaction and disorders. To elucidate the neural mechanisms of social interactions, we introduce an inference and prediction framework for interbrain networks.
... We aim to partition each of our networks based on a relational equivalence of nodes (a perspective with a rich history in the social-networks literature (Lorrain and White, 1971; Rossi and Ahmed, 2015)), rather than on high internal traffic within sets of nodes (Munoz-Mendez et al., 2018). The analysis of time-aggregated data can shed light on "community" membership in the latter sense (Austwick et al., 2013) through a partition of a network into contiguous spatial clusters (Munoz-Mendez et al., 2018). ...
Article
Full-text available
In urban systems, there is an interdependency between neighborhood roles and transportation patterns between neighborhoods. In this paper, we classify docking stations in bicycle-sharing networks to gain insight into the human mobility patterns of three major cities in the United States. We propose novel time-dependent stochastic block models, with degree-heterogeneous blocks and either mixed or discrete block membership, which classify nodes based on their time-dependent activity patterns. We apply these models to (1) detect the roles of bicycle-sharing stations and (2) describe the traffic within and between blocks of stations over the course of a day. Our models successfully uncover work blocks, home blocks, and other blocks; they also reveal activity patterns that are specific to each city. Our work gives insights for the design and maintenance of bicycle-sharing systems, and it contributes new methodology for community detection in temporal and multilayer networks with heterogeneous degrees.
... Currently, there is no universally accepted framework for comparing graphs in terms of the distributions of their local structural properties, often referred to as node roles (Rossi & Ahmed, 2015). In particular, evaluation of graph generative models, which have potential uses in data anonymization ...
Preprint
Full-text available
We argue that when comparing two graphs, the distribution of node structural features is more informative than global graph statistics which are often used in practice, especially to evaluate graph generative models. Thus, we present GraphDCA - a framework for evaluating similarity between graphs based on the alignment of their respective node representation sets. The sets are compared using a recently proposed method for comparing representation spaces, called Delaunay Component Analysis (DCA), which we extend to graph data. To evaluate our framework, we generate a benchmark dataset of graphs exhibiting different structural patterns and show, using three node structure feature extractors, that GraphDCA recognizes graphs with both similar and dissimilar local structure. We then apply our framework to evaluate three publicly available real-world graph datasets and demonstrate, using gradual edge perturbations, that GraphDCA satisfyingly captures gradually decreasing similarity, unlike global statistics. Finally, we use GraphDCA to evaluate two state-of-the-art graph generative models, NetGAN and CELL, and conclude that further improvements are needed for these models to adequately reproduce local structural features.
... In recent years, structural role-based NE has attracted increasing attention, as it can be of great help in learning the function and behavior of nodes [9]. In essence, these methods usually rely on or imitate methods of structural feature extraction [10], [11] and subgraph isomorphism test [12], [13]. ...
Preprint
Full-text available
Capturing structural similarity has been a hot topic in the field of network embedding recently due to its great help in understanding node functions and behaviors. However, existing works have paid much attention to learning structures on homogeneous networks, while the related study of heterogeneous networks is still a void. In this paper, we take the first step towards representation learning on heterostructures, which is very challenging due to their highly diverse combinations of node types and underlying structures. To effectively distinguish diverse heterostructures, we first propose a theoretically guaranteed technique called heterogeneous anonymous walk (HAW) and its variant coarse HAW (CHAW). Then, we devise the heterogeneous anonymous walk embedding (HAWE) and its variant coarse HAWE in a data-driven manner to circumvent using an extremely large number of possible walks, and train embeddings by predicting the walks occurring in the neighborhood of each node. Finally, we design and apply extensive and illustrative experiments on synthetic and real-world networks to build a benchmark on heterostructure learning and evaluate the effectiveness of our methods. The results demonstrate that our methods achieve outstanding performance compared with both homogeneous and heterogeneous classic methods, and can be applied to large-scale networks.
Chapter
In the field of node representation learning, the task of interpreting latent dimensions has become a prominent, well-studied research topic. The contribution of this work focuses on appraising the interpretability of another rarely-exploited feature of node embeddings, increasingly utilised in recommendation and consumption diversity studies: inter-node embedded distances. Introducing a new method to measure how understandable the distances between nodes are, our work assesses how well the proximity weights derived from a network before embedding relate to the node closeness measurements after embedding. Testing several classical node embedding models, our findings reach a conclusion familiar to practitioners albeit rarely cited in the literature: the matrix factorisation model SVD is the most interpretable through first-, second- and even higher-order proximities.
Article
Link prediction is a significant research problem in network science and has widespread applications. To date, much effort has focused on predicting the links generated by pairwise interactions, but little is known about the predictability of links created by higher-order interaction patterns. In this study, we investigated a new framework for predicting links of different orders in social interaction networks based on edge orbit degrees (EODs) characterized by three-node and four-node graphlets. First, we defined a new problem of different-order link prediction to examine the predictability of links generated by different-order interaction patterns. Second, we quantified EODs for different-order link prediction and examined the performance of different-order predictors. The experiments on real-world networks show that higher-order links are easier to predict than lower-order (pairwise) links. We also found that the closed three-node EOD has strong predictive power and can accurately predict both lower-order and higher-order links. Finally, we proposed a new method fusing multiple EODs (MEOD) to predict different-order links, and experiments indicate that MEOD outperforms state-of-the-art methods. Our findings not only effectively improve link prediction performance across orders, but also contribute to a better understanding of the organizational principles of higher-order structures.
Article
Full-text available
Role discovery is the task of dividing the set of nodes on a graph into classes of structurally similar roles. Modern strategies for role discovery typically rely on graph embedding techniques, which are capable of recognising complex graph structures when reducing nodes to dense vector representations. However, when working with large, real-world networks, it is difficult to interpret or validate a set of roles identified according to these methods. In this work, motivated by advancements in the field of explainable artificial intelligence, we propose surrogate explanation for role discovery, a new framework for interpreting role assignments on large graphs using small subgraph structures known as graphlets. We demonstrate our framework on a small synthetic graph with prescribed structure, before applying it to a larger real-world network. In the second case, a large, multidisciplinary citation network, we successfully identify a number of important citation patterns or structures which reflect interdisciplinary research.
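The intuition that small graphlets yield interpretable role signatures can be illustrated with per-node counts of 3-node graphlets. The toy star-plus-triangle graph below is an assumption made for illustration, not an example from the paper:

```python
from itertools import combinations

# Toy undirected graph: a star (hub "h") plus a separate triangle
# (t1, t2, t3). Edges are illustrative.
edges = [("h", "a"), ("h", "b"), ("h", "c"),
         ("t1", "t2"), ("t2", "t3"), ("t1", "t3")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def graphlet_counts(node):
    """Per-node counts of two 3-node graphlets: open wedges centred on
    the node, and triangles containing it."""
    wedges = triangles = 0
    for x, y in combinations(sorted(adj[node]), 2):
        if y in adj[x]:
            triangles += 1
        else:
            wedges += 1
    return {"wedge_center": wedges, "triangle": triangles}

# The star hub centres many wedges but closes no triangles; a triangle
# node does the opposite -- an interpretable role signature.
assert graphlet_counts("h") == {"wedge_center": 3, "triangle": 0}
assert graphlet_counts("t1") == {"wedge_center": 0, "triangle": 1}
```

A surrogate explanation would relate such counts to the roles an embedding-based method assigns, making the assignments auditable.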
Chapter
One of the interesting tasks in social network analysis is detecting the roles of network nodes in their interactions. The first problem is discovering such roles, and the second is detecting the discovered roles in the network. Role detection, i.e., assigning a role to a node, is a classification task. Our paper addresses the second problem and uses three roles (classes) for classification. These roles are based only on the structural properties of the neighborhood of a given node and use the previously published non-symmetric relationship between pairs of nodes for their definition. This paper presents transductive learning experiments using graph neural networks (GNN) to show that excellent results can be obtained even with a relatively small sample size for training the network. Keywords: Complex network; Graph neural network; Non-symmetric dependency; Node prominency
Article
Node role explainability in complex networks is very difficult yet crucial in different application domains such as social science, neuroscience, or computer science. Many efforts have been made to quantify hubs, revealing particular nodes in a network using a given structural property. Yet, in several applications, when multiple instances of networks are available and several structural properties appear to be relevant, the identification of node roles remains largely unexplored. Inspired by the node automorphic equivalence relation, we define an equivalence relation on graph nodes associated with any collection of nodal statistics (i.e., any functions on the node set). This allows us to define new global graph measures: the power coefficient and the orthogonality score, which evaluate the parsimony and heterogeneity of a given collection of nodal statistics. In addition, we introduce a new method based on structural patterns to compare graphs that have the same vertex set. This method assigns a value to a node to determine its role distinctiveness in a graph family. Extensive numerical results of our method are presented on both generative graph models and real data concerning human brain functional connectivity. The differences in nodal statistics are shown to depend on the underlying graph structure. Comparisons between generative models and real networks combining two different nodal statistics reveal the complexity of human brain functional connectivity, with differences at both global and nodal levels. Using a group of 200 healthy controls' connectivity networks, our method computes high correspondence scores among the whole population to detect homotopy, and finally quantifies differences between comatose patients and healthy controls.
Article
Earth observation technology has improved the detection of land cover changes. However, current pixel-based change detection methods cannot adequately describe the evolutionary process and spatiotemporal association of geographic entities. Therefore, we developed a method for analyzing the processes and patterns of land cover evolution based on spatiotemporal graphs. First, a spatiotemporal graph was generated from a time series of land cover maps according to the spatial and temporal relationships between land cover objects, as defined by spatial adjacency and temporal transition, respectively. Subsequently, structural characteristics, such as the spatial roles, adjacency type, temporal transitions and evolution trajectories, were derived from the spatiotemporal graph to describe and analyze the evolution of land cover. Finally, this method was applied to analyze land cover evolution in Fujian Province, China, from 2001 to 2019. The proposed method not only completely preserves the spatial adjacency and temporal transition details among land cover objects in a spatiotemporally unified graph framework but also extracts evolution-related spatiotemporal structural characteristics. This study provides a reliable scientific basis for analyzing the consistency of long-term land cover dynamics and has practical value for other geographic applications.
Chapter
Role discovery is the task of dividing the set of nodes on a graph into classes of structurally similar roles. Modern strategies for role discovery typically rely on graph embedding techniques, which are capable of recognising complex local structures. However, when working with large, real-world networks, it is difficult to interpret or validate a set of roles identified according to these methods. In this work, motivated by advancements in the field of explainable artificial intelligence (XAI), we propose a new framework for interpreting role assignments on large graphs using small subgraph structures known as graphlets. We demonstrate our methods on a large, multidisciplinary citation network, where we successfully identify a number of important citation patterns which reflect interdisciplinary research. Keywords: Role discovery; Node embedding; Citation network; Explainable artificial intelligence
Chapter
Signed networks are widely observed in and constructed from the real world, and are distinguished by the rich information carried in the signs of their edges. Several embedding methods have been proposed for signed networks. Current methods mainly focus on proximity similarity and the fulfillment of social psychological theories; however, no signed network embedding method has focused on structural similarity. Therefore, in this research, we propose a novel notion of degree in signed networks, a distance function to measure the similarity between two such complex degrees, and a node-embedding method based on structural similarity. Experiments on five network topologies, an inverted karate club network, and three real networks demonstrate that our proposed method embeds nodes with similar structural features close together, and show its superiority on a link sign prediction task from embeddings compared with state-of-the-art methods.
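A minimal sketch of the idea of a sign-aware degree and a distance between such degrees. The (positive, negative) count pair and Euclidean distance below are deliberate simplifications of the paper's notion, applied to an invented toy signed graph:

```python
import math

# Toy signed graph: edge -> sign (+1 or -1); values are illustrative.
signed_edges = {("a", "b"): +1, ("a", "c"): +1, ("a", "d"): -1,
                ("e", "f"): +1, ("e", "g"): +1, ("e", "h"): -1,
                ("b", "c"): -1}

def signed_degree(node):
    """Represent a node's degree as a (positive-degree, negative-degree)
    pair -- one simple way to make 'degree' sign-aware."""
    pos = neg = 0
    for (u, v), s in signed_edges.items():
        if node in (u, v):
            pos, neg = pos + (s > 0), neg + (s < 0)
    return (pos, neg)

def role_distance(u, v):
    """Euclidean distance between signed-degree pairs: nodes with the same
    mix of positive and negative ties are structurally close, wherever
    they sit in the network."""
    (pu, nu), (pv, nv) = signed_degree(u), signed_degree(v)
    return math.hypot(pu - pv, nu - nv)

# "a" and "e" sit in different components yet share the signature
# (2 positive, 1 negative), so their structural distance is zero.
assert signed_degree("a") == (2, 1) and signed_degree("e") == (2, 1)
assert role_distance("a", "e") == 0.0
```

An embedding built on such a distance places role-similar nodes together even when they are disconnected, which is exactly the property proximity-based signed embeddings lack.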
Chapter
Online social networks have recently become widely accepted platforms for sharing information and connecting people from diverse areas. They provide a common venue for frequent human interactions, resulting in a significant increase in information about individual users, their interactions, and their relationships. These users can be classified into different classes based on the similarities and differences in their characteristics and their local and global positions in the network. The node classification problem has gained recognition due to its real-time applications in recommendation systems, epidemiological diffusion, the sociological dynamics of communities, and anomaly detection. Diverse attempts have been made to perform informative node classification. Furthermore, deep learning based approaches for node classification in online social networks have provided state-of-the-art results with better insights and high accuracy. In this chapter, we provide a rigorous literature review of deep learning based methods designed for node classification, and conclude the chapter with open research directions that address the gaps in current work and the demands of next-generation online social systems. Keywords: Deep learning; Node classification; Community detection; Role identification; Graph partitioning; Online social networks
Article
Full-text available
This study tackles the problem of extracting node roles in uncertain graphs based on network motifs. Uncertain graphs are useful for modeling information diffusion phenomena because the presence or absence of edges is determined stochastically. In such an uncertain graph, a node's role also changes stochastically with the presence or absence of edges, so approximate computation over a huge number of sampled graphs is common; however, the computational load is very large, even for a small graph. We propose a method to extract uncertain node roles with high accuracy and high speed by ensembling a large number of sampled graphs and efficiently searching for all other roles a node can transition to. This method provides highly accurate results compared to simple sampling-and-ensembling methods that do not consider transitions to other roles. In our evaluation experiments, we use real-world graphs artificially assigned uniform and non-uniform edge existence probabilities. The results show that the proposed method outperforms an existing method previously reported by the authors, on which the proposed method is based, as well as another current method built on a state-of-the-art algorithm, in terms of efficiency and accuracy.
Article
While most network embedding techniques model the proximity between nodes in a network, recently there has been significant interest in structural embeddings that are based on node equivalences, a notion rooted in sociology: equivalences or positions are collections of nodes that have similar roles, i.e., similar functions, ties, or interactions with nodes in other positions, irrespective of their distance or reachability in the network. Unlike the proximity-based methods that are rigorously evaluated in the literature, the evaluation of structural embeddings is less mature. It relies on small synthetic or real networks with labels that are not perfectly defined, and its connection to sociological equivalences has hitherto been vague and tenuous. With new node embedding methods being developed at a breakneck pace, proper evaluation and systematic characterization of existing approaches will be essential to progress. To fill this gap, we set out to understand what types of equivalences structural embeddings capture. We are the first to contribute rigorous intrinsic and extrinsic evaluation methodology for structural embeddings, along with carefully designed, diverse datasets of varying sizes. We observe a number of evaluation variables that can lead to different results (e.g., choice of similarity measure, classifier, and label definitions). We find that degree distributions within nodes' local neighborhoods can lead to simple yet effective baselines in their own right and can guide the future development of structural embedding. We hope that our findings can influence the design of further node embedding methods and also pave the way for more comprehensive and fair evaluation of structural embedding methods.
Article
Social networks have a plethora of applications, and their analysis has been gaining much interest from the research community. The high dimensionality of social network data poses a significant obstacle to analysis, leading to the curse of dimensionality. The growth of representation learning across research fields has facilitated network representation learning (also called network embedding), which helps address this issue. Structural representation learning aims to learn low-dimensional vector representations of high-dimensional network data that maximally preserve network structural information. This representation can then serve as a backbone for various network-based applications. First, we investigate the techniques used in network representation learning and similarity indices. We then categorize representative algorithms into three types based on the network structural level used in their learning process. We also introduce algorithms for representation learning of edges, subgraphs, and whole networks. Finally, we introduce the evaluation metrics and applications of network representation learning, along with promising future research directions.
Preprint
Full-text available
Modern online platforms offer users an opportunity to participate in a variety of content-creation, social networking, and shopping activities. With the rapid proliferation of such online services, learning data-driven user behavior models is indispensable to enable personalized user experiences. Recently, representation learning has emerged as an effective strategy for user modeling, powered by neural networks trained over large volumes of interaction data. Despite their enormous potential, we encounter the unique challenge of data sparsity for a vast majority of entities, e.g., sparsity in ground-truth labels for entities and in entity-level interactions (cold-start users, items in the long-tail, and ephemeral groups). In this dissertation, we develop generalizable neural representation learning frameworks for user behavior modeling designed to address different sparsity challenges across applications. Our problem settings span transductive and inductive learning scenarios, where transductive learning models entities seen during training and inductive learning targets entities that are only observed during inference. We leverage different facets of information reflecting user behavior (e.g., interconnectivity in social networks, temporal and attributed interaction information) to enable personalized inference at scale. Our proposed models are complementary to concurrent advances in neural architectural choices and are adaptive to the rapid addition of new applications in online platforms.
Article
Full-text available
Many machine learning applications that involve relational databases incorporate first-order logic and probability. Markov Logic Networks (MLNs) are a prominent statistical relational model that consist of weighted first order clauses. Many of the current state-of-the-art algorithms for learning MLNs have focused on relatively small datasets with few descriptive attributes, where predicates are mostly binary and the main task is usually prediction of links between entities. This paper addresses what is in a sense a complementary problem: learning the structure of an MLN that models the distribution of discrete descriptive attributes on medium to large datasets, given the links between entities in a relational database. Descriptive attributes are usually nonbinary and can be very informative, but they increase the search space of possible candidate clauses. We present an efficient new algorithm for learning a directed relational model (parametrized Bayes net), which produces an MLN structure via a standard moralization procedure for converting directed models to undirected models. Learning MLN structure in this way is 200-1000 times faster and scores substantially higher in predictive accuracy than benchmark algorithms on three relational databases.
Article
Full-text available
Recommender systems, which fall under web content mining, have become extremely important because user-generated information is increasingly free-form and unstructured, which makes it difficult to mine important information from data sources. To satisfy the information requirements of Web users and to improve the user experience in many Web applications, recommendation systems have been studied in academia and widely deployed in industry. This paper presents a system in which data sources are modeled as various types of web graphs using the DRec algorithm. These web graphs can serve various recommendation systems. The framework is built upon heat diffusion, which creates a web-graph diffusion model; a query suggestion algorithm is then applied to test queries and generate recommendations. The work is extended with personalized query recommendation and a comparative analysis of the algorithm, with results reported in terms of accuracy. This system can be applied to most web graphs for query suggestion, image recommendation, and social as well as personalized recommendation.
Chapter
Full-text available
Imagine that you are attending a cocktail party, the surrounding is full of chatting and noise, and somebody is talking about you. In this case, your ears are particularly sensitive to this speaker. This is the cocktail-party problem, which can be solved by blind source separation (BSS).
Article
Full-text available
We propose a scalable approach for making inference about latent spaces of large networks. With a succinct representation of networks as a bag of triangular motifs, a parsimonious statistical model, and an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices and hundreds of latent roles on a single machine in a matter of hours, a setting that is out of reach for many existing methods. When compared to the state-of-the-art probabilistic approaches, our method is several orders of magnitude faster, with competitive or improved accuracy for latent space recovery and link prediction.
Article
Full-text available
In this paper we discuss a multilinear generalization of the best rank-R approximation problem for matrices, namely, the approximation of a given higher-order tensor, in an optimal least-squares sense, by a tensor that has prespecified column rank value, row rank value, etc. For matrices, the solution is conceptually obtained by truncation of the singular value decomposition (SVD); however, this approach does not have a straightforward multilinear counterpart. We discuss higher-order generalizations of the power method and the orthogonal iteration method.
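For matrices, the truncation mentioned above is the best rank-R approximation in the least-squares sense (the Eckart-Young theorem). A minimal numpy sketch of this matrix case, with illustrative sizes and a hypothetical target rank `R`:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))  # arbitrary example matrix

# Full SVD: A = U @ diag(s) @ Vt, singular values sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

R = 3  # hypothetical target rank
# Best rank-R approximation: truncate the SVD to the top R factors.
A_R = U[:, :R] @ np.diag(s[:R]) @ Vt[:R, :]

# The Frobenius error equals the norm of the discarded singular values.
err = np.linalg.norm(A - A_R, "fro")
expected = np.sqrt(np.sum(s[R:] ** 2))
```

As the abstract notes, this conceptual recipe has no straightforward multilinear counterpart: truncating a higher-order decomposition does not in general give the optimal lower-rank tensor.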
Article
Full-text available
Learning the right graph representation from noisy, multisource data has garnered significant interest in recent years. A central tenet of this problem is relational learning. Here the objective is to incorporate the partial information each data source gives us in a way that captures the true underlying relationships. To address this challenge, we present a general, boosting-inspired framework for combining weak evidence of entity associations into a robust similarity metric. We explore the extent to which different quality measurements yield graph representations that are suitable for community detection. We then present empirical results on both synthetic and real datasets demonstrating the utility of this framework. Our framework leads to suitable global graph representations from quality measurements local to each edge. Finally, we discuss future extensions and theoretical considerations of learning useful graph representations from weak feedback in general application settings.
Article
Full-text available
Procedures for establishing a partition of a network in terms of structural equivalence can be divided into direct and indirect approaches. For the former, a new criterion function is proposed that reflects directly structural equivalence concerns. This criterion function can then be (locally) optimized to create a partition. For indirect approaches, measures of dissimilarity must be compatible with the definition of structural equivalence.
Article
Full-text available
Relational data representations have become an increasingly important topic due to the recent proliferation of network datasets (e.g., social, biological, information networks) and a corresponding increase in the application of statistical relational learning (SRL) algorithms to these domains. In this article, we examine and categorize techniques for transforming graph-based relational data to improve SRL algorithms. In particular, appropriate transformations of the nodes, links, and/or features of the data can dramatically affect the capabilities and results of SRL algorithms. We introduce an intuitive taxonomy for data representation transformations in relational domains that incorporates link transformation and node transformation as symmetric representation tasks. More specifically, the transformation tasks for both nodes and links include (i) predicting their existence, (ii) predicting their label or type, (iii) estimating their weight or importance, and (iv) systematically constructing their relevant features. We motivate our taxonomy through detailed examples and use it to survey competing approaches for each of these tasks. We also discuss general conditions for transforming links, nodes, and features. Finally, we highlight challenges that remain to be addressed.
Article
Full-text available
Given a large time-evolving graph, how can we model and characterize the temporal behaviors of individual nodes (and network states)? How can we model the behavioral transition patterns of nodes? We propose a temporal behavior model that captures the "roles" of nodes in the graph and how they evolve over time. The proposed dynamic behavioral mixed-membership model (DBMM) is scalable, fully automatic (no user-defined parameters), non-parametric/data-driven (no specific functional form or parameterization), interpretable (identifies explainable patterns), and flexible (applicable to dynamic and streaming networks). Moreover, the interpretable behavioral roles are generalizable and computationally efficient. We applied our model for (a) identifying patterns and trends of nodes and network states based on the temporal behavior, (b) predicting future structural changes, and (c) detecting unusual temporal behavior transitions. The experiments demonstrate the scalability, flexibility, and effectiveness of our model for identifying interesting patterns, detecting unusual structural transitions, and predicting the future structural changes of the network and individual nodes.
Article
Article
Article
It is shown that a strongly consistent estimation procedure for the order of an autoregression can be based on the law of the iterated logarithm for the partial autocorrelations. As compared to other strongly consistent procedures this procedure will underestimate the order to a lesser degree.
Article
Stochastic variational inference finds good posterior approximations of probabilistic models with very large data sets. It optimizes the variational objective with stochastic optimization, following noisy estimates of the natural gradient. Operationally, stochastic inference iteratively subsamples from the data, analyzes the subsample, and updates parameters with a decreasing learning rate. However, the algorithm is sensitive to that rate, which usually requires hand-tuning to each application. We solve this problem by developing an adaptive learning rate for stochastic variational inference. Our method requires no tuning and is easily implemented with computations already made in the algorithm. We demonstrate our approach with latent Dirichlet allocation applied to three large text corpora. Inference with the adaptive learning rate converges faster and to a better approximation than the best settings of hand-tuned rates.
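The decreasing learning rate the abstract refers to is typically a Robbins-Monro schedule. A toy sketch of such a schedule tracking a mean from noisy samples (this is the fixed-schedule setting the paper improves on, not its adaptive method; `tau` and `kappa` are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=20000)  # noisy stream

# Robbins-Monro schedule rho_t = (t + tau)^(-kappa); kappa in (0.5, 1]
# gives sum(rho) = inf and sum(rho^2) < inf, the classic conditions.
tau, kappa = 1.0, 0.7
estimate = 0.0
for t, x in enumerate(data, start=1):
    rho = (t + tau) ** (-kappa)
    estimate = (1.0 - rho) * estimate + rho * x  # noisy stochastic update
```

The sensitivity the paper addresses shows up here as well: too large a `kappa` forgets slowly-arriving information, too small a `kappa` keeps the estimate noisy.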
Article
Consider data consisting of pairwise measurements, such as presence or absence of links between pairs of objects. These data arise, for instance, in the analysis of protein interactions and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing pairwise measurements with probabilistic models requires special assumptions, since the usual independence or exchangeability assumptions no longer hold. Here we introduce a class of latent variable models for pairwise measurements: mixed membership stochastic blockmodels. These models combine global parameters that instantiate dense patches of connectivity (blockmodel) with local parameters that instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodels with applications to social networks and protein interaction networks.
Chapter
Let A be a real m×n matrix with m ≥ n. It is well known (cf. [4]) that $$A = U \Sigma V^T \qquad (1)$$ where $$U^T U = V^T V = V V^T = I_n \quad \text{and} \quad \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n).$$ The matrix U consists of n orthonormalized eigenvectors associated with the n largest eigenvalues of $AA^T$, and the matrix V consists of the orthonormalized eigenvectors of $A^T A$. The diagonal elements of $\Sigma$ are the non-negative square roots of the eigenvalues of $A^T A$; they are called singular values. We shall assume that $$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0.$$ Thus if rank(A) = r, then $\sigma_{r+1} = \sigma_{r+2} = \cdots = \sigma_n = 0$. The decomposition (1) is called the singular value decomposition (SVD).
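The stated properties of the decomposition can be checked numerically; a small sketch using numpy's SVD routine (the matrix size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))  # m >= n, matching the setting above

# Thin SVD: U (5x3) has orthonormal columns, Vt (3x3) is orthogonal,
# sigma holds the singular values in non-increasing order.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

recon = U @ np.diag(sigma) @ Vt  # reassemble A = U Sigma V^T
```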
Article
Distance or similarity measures are essential to solve many pattern recognition problems such as classification, clustering, and retrieval problems. Various distance/similarity measures that are applicable to compare two probability density functions, pdf in short, are reviewed and categorized in both syntactic and semantic relationships. A correlation coefficient and a hierarchical clustering technique are adopted to reveal similarities among numerous distance/similarity measures.
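As a minimal illustration of such measures, three common distances between two discrete pdfs (the vectors `p` and `q` are made-up examples):

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])  # hypothetical discrete pdf
q = np.array([0.2, 0.3, 0.5])  # hypothetical discrete pdf

# Three measures from different families of the survey's taxonomy.
kl = float(np.sum(p * np.log(p / q)))                 # Kullback-Leibler (asymmetric)
tv = float(0.5 * np.sum(np.abs(p - q)))               # total variation
hellinger = float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))
```

Note the asymmetry of KL divergence versus the symmetry of the other two, one of the syntactic properties along which such measures are categorized.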
Article
The existence of an actor as a set of asymmetric relations to and from every actor in a network of relations is specified as the position of the actor in the network. Conditions of strong versus weak structural equivalence of actor positions in a network are defined. Network structure is characterized in terms of structurally nonequivalent, jointly occupied, network positions located in the observed network. The social distances of actors from network positions are specified as unobserved variables in structural equation models in order to extend the analysis of networks into the etiology and consequences of network structure.
Article
The self-organized map, an architecture suggested for artificial neural networks, is explained by presenting simulation experiments and practical applications. The self-organizing map has the property of effectively creating spatially organized internal representations of various features of input signals and their abstractions. One result of this is that the self-organization process can discover semantic relationships in sentences. Brain maps, semantic maps, and early work on competitive learning are reviewed. The self-organizing map algorithm (an algorithm which orders responses spatially) is reviewed, focusing on best-matching cell selection and adaptation of the weight vectors. Suggestions for applying the self-organizing map algorithm, demonstrations of the ordering process, and an example of hierarchical clustering of data are presented. Fine-tuning the map by learning vector quantization is addressed. The use of self-organized maps in practical speech recognition and a simulation experiment on semantic mapping are discussed.
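A minimal 1-D self-organizing map sketch of the update rule described above, with best-matching unit selection, a decaying learning rate, and a shrinking neighborhood (all hyperparameters are illustrative, not Kohonen's settings):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.uniform(0.0, 1.0, size=(500, 2))  # toy 2-D inputs

# A chain of units whose weight vectors adapt toward the inputs,
# with neighbors of the winning unit dragged along.
n_units = 10
weights = rng.uniform(0.0, 1.0, size=(n_units, 2))

for t, x in enumerate(data):
    lr = 0.5 * (1.0 - t / len(data))                 # decaying learning rate
    radius = max(1.0, 3.0 * (1.0 - t / len(data)))   # shrinking neighborhood
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))  # best-matching unit
    for i in range(n_units):
        h = np.exp(-((i - winner) ** 2) / (2.0 * radius ** 2))  # neighborhood kernel
        weights[i] += lr * h * (x - weights[i])      # pull unit toward the input

# Average quantization error after training.
qerr = np.mean([np.min(np.linalg.norm(weights - x, axis=1)) for x in data])
```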
Article
In a regression problem, typically there are p explanatory variables possibly related to a response variable, and we wish to select a subset of the p explanatory variables to fit a model between these variables and the response. A bootstrap variable/model selection procedure is to select the subset of variables by minimizing bootstrap estimates of the prediction error, where the bootstrap estimates are constructed based on a data set of size n. Although the bootstrap estimates have good properties, this bootstrap selection procedure is inconsistent in the sense that the probability of selecting the optimal subset of variables does not converge to 1 as n → ∞. This inconsistency can be rectified by modifying the sampling method used in drawing bootstrap observations. For bootstrapping pairs (response, explanatory variable), it is found that instead of drawing n bootstrap observations (a customary bootstrap sampling plan), much less bootstrap observations should be sampled: The bootstrap selection procedure becomes consistent if we draw m bootstrap observations with m → ∞ and m/n → 0. For bootstrapping residuals, we modify the bootstrap sampling procedure by increasing the variability among the bootstrap observations. The consistency of the modified bootstrap selection procedures is established in various situations, including linear models, nonlinear models, generalized linear models, and autoregressive time series. The choice of the bootstrap sample size m and some computational issues are also discussed. Some empirical results are presented.
Article
In many text mining applications, side-information is available along with the text documents. Such side-information may be of different kinds, such as document provenance information, the links in the document, user-access behavior from web logs, or other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the mining process, because it can either improve the quality of the representation for the mining process, or can add noise to the process. Therefore, we need a principled way to perform the mining process, so as to maximize the advantages from using this side information. In this paper, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We then show how to extend the approach to the classification problem. We present experimental results on a number of real data sets in order to illustrate the advantages of using such an approach.
Conference Paper
Role discovery in graphs is an emerging area that allows analysis of complex graphs in an intuitive way. In contrast to community discovery, which finds groups of highly connected nodes, role discovery finds groups of nodes that share similar topological structure in the graph, and hence a common role (or function) such as being a broker or a periphery node. However, existing work so far is completely unsupervised, which is undesirable for a number of reasons. We provide an alternating least squares framework that allows convex constraints to be placed on the role discovery problem, which can provide useful supervision. In particular we explore supervision to enforce i) sparsity, ii) diversity, and iii) alternativeness in the roles. We illustrate the usefulness of this supervision on various data sets and applications.
Conference Paper
Graphlet frequency distribution (GFD) is an analysis tool for understanding the variance of local structure in a graph. Many recent works use GFD for comparing and characterizing real-life networks. However, the main bottleneck for graph analysis using GFD is the excessive computation cost of obtaining the frequency of each graphlet in a large network. To overcome this, we propose a simple yet powerful algorithm, called GRAFT, that obtains the approximate graphlet frequency for all graphlets that have up to 5 vertices. Compared to an exact counting algorithm, our algorithm achieves a speedup factor between 10 and 100 for a negligible counting error, which is, on average, less than 5%. For example, exact graphlet counting for ca-AstroPh takes approximately 3 days, whereas GRAFT runs for 45 minutes to perform the same task with a counting accuracy of 95.6%.
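GRAFT's sampling scheme is more involved, but the core idea of trading exactness for speed by sampling can be illustrated on the simplest nontrivial graphlet, the triangle, via edge sampling (the graph and sample size here are made up):

```python
import random
import itertools

random.seed(4)

# Toy graph as adjacency sets: a 5-clique (nodes 0-4) plus a pendant node 5.
adj = {i: set() for i in range(6)}
for u, v in itertools.combinations(range(5), 2):
    adj[u].add(v); adj[v].add(u)
adj[4].add(5); adj[5].add(4)
edges = [(u, v) for u in adj for v in adj[u] if u < v]

# Exact triangle count, for reference.
exact = sum(1 for u, v, w in itertools.combinations(adj, 3)
            if v in adj[u] and w in adj[u] and w in adj[v])

# Edge-sampling estimate: pick random edges, count triangles through each,
# and rescale; each triangle is reachable from its 3 edges.
samples = 2000
hits = 0
for _ in range(samples):
    u, v = random.choice(edges)
    hits += len(adj[u] & adj[v])   # common neighbors close a triangle
approx = hits * len(edges) / (samples * 3)
```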
Article
We discuss the interpretation of Cp-plots and show how they can be calibrated in several ways. We comment on the practice of using the display as a basis for formal selection of a subset-regression model, and extend the range of application of the device to encompass arbitrary linear estimates of the regression coefficients, for example Ridge estimates.
Conference Paper
Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle web-scale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches to compute matrix factorization. There has been a recent flurry of activity to parallelize these algorithms. However, due to the cubic time complexity in the target rank, ALS is not scalable to large-scale datasets. On the other hand, SGD conducts efficient updates but usually suffers from slow convergence that is sensitive to the parameters. Coordinate descent, a classical optimization approach, has been used for many other large-scale problems, but its application to matrix factorization for recommender systems has not been explored thoroughly. In this paper, we show that coordinate descent based methods have a more efficient update rule compared to ALS, and are faster and have more stable convergence than SGD. We study different update sequences and propose the CCD++ algorithm, which updates rank-one factors one by one. In addition, CCD++ can be easily parallelized on both multi-core and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, on a synthetic dataset with 2 billion ratings, CCD++ is 4 times faster than both SGD and ALS using a distributed system with 20 machines.
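The rank-one update order that CCD++ uses can be sketched on a toy fully observed matrix; this is a simplified illustration (no missing entries, no regularization, dense numpy updates), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic exactly rank-3 "ratings" matrix as a toy stand-in.
A = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15))

rank, sweeps = 3, 50
U = rng.standard_normal((20, rank)) * 0.1
V = rng.standard_normal((15, rank)) * 0.1

# CCD-style sweeps: update one rank-one factor (u_k, v_k) at a time
# against the residual left by all the other factors.
for _ in range(sweeps):
    for k in range(rank):
        R = A - U @ V.T + np.outer(U[:, k], V[:, k])  # residual excluding factor k
        U[:, k] = R @ V[:, k] / (V[:, k] @ V[:, k] + 1e-12)
        V[:, k] = R.T @ U[:, k] / (U[:, k] @ U[:, k] + 1e-12)

err = np.linalg.norm(A - U @ V.T) / np.linalg.norm(A)  # relative fit error
```

Each rank-one update is a closed-form least-squares step, which is the cheap per-coordinate update the paper contrasts with ALS's cubic-in-rank solves.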
Article
We study the problem of privacy-preservation in social networks. We consider the distributed setting in which the network data is split between several data holders. The goal is to arrive at an anonymized view of the unified network without revealing to any of the data holders information about links between nodes that are controlled by other data holders. To that end, we start with the centralized setting and offer two variants of an anonymization algorithm which is based on sequential clustering (Sq). Our algorithms significantly outperform the SaNGreeA algorithm due to Campan and Truta which is the leading algorithm for achieving anonymity in networks by means of clustering. We then devise secure distributed versions of our algorithms. To the best of our knowledge, this is the first study of privacy preservation in distributed social networks. We conclude by outlining future research proposals in that direction.
Article
Nonnegative Matrix Factorization (NMF), a relatively novel paradigm for dimensionality reduction, has been in the ascendant since its inception. It incorporates the nonnegativity constraint and thus obtains a parts-based representation while correspondingly enhancing the interpretability of the problem. This survey paper mainly focuses on the theoretical research into NMF over the last 5 years, where the principles, basic models, properties, and algorithms of NMF along with its various modifications, extensions, and generalizations are summarized systematically. The existing NMF algorithms are divided into four categories: Basic NMF (BNMF), Constrained NMF (CNMF), Structured NMF (SNMF), and Generalized NMF (GNMF), upon which the design principles, characteristics, problems, relationships, and evolution of these algorithms are presented and analyzed comprehensively. Some related work not on NMF that NMF should learn from or has connections with is covered too. Moreover, some open issues that remain to be solved are discussed. Several relevant application areas of NMF are also briefly described. This survey aims to construct an integrated, state-of-the-art framework for the NMF concept, from which follow-up research may benefit.
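A basic instance from the BNMF category is the Lee-Seung multiplicative update for the Frobenius objective; a minimal sketch with illustrative sizes (not drawn from the survey itself):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.random((12, 8))  # nonnegative data matrix (toy example)

r, iters = 4, 200
W = rng.random((12, r)) + 0.1  # positive initialization
H = rng.random((r, 8)) + 0.1

err0 = np.linalg.norm(X - W @ H)  # error before training

# Multiplicative updates for min ||X - WH||_F^2 subject to W, H >= 0;
# elementwise ratios keep both factors nonnegative at every step.
eps = 1e-12
for _ in range(iters):
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(X - W @ H)
```

The nonnegativity of `W` and `H` is what yields the parts-based, interpretable representation the survey emphasizes.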
Article
Machine learning systems automatically learn programs from data. This is often a very attractive alternative to manually constructing them, and in the last decade the use of machine learning has spread rapidly throughout computer science and beyond. Machine learning is used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications. A recent report from the McKinsey Global Institute asserts that machine learning (a.k.a. data mining or predictive analytics) will be the driver of the next big wave of innovation [15]. Several fine textbooks are available to interested practitioners and researchers (for example, Mitchell [16] and Witten et al. [24]). However, much of the "folk knowledge" that is needed to successfully develop machine learning applications is not readily available in them. As a result, many machine learning projects take much longer than necessary or wind up producing less-than-ideal results. Yet much of this folk knowledge is fairly easy to communicate. This is the purpose of this article.
Article
We suggest partial logarithmic binning as the method of choice for uncovering the nature of many distributions encountered in information science (IS). Logarithmic binning retrieves information and trends “not visible” in noisy power law tails. We also argue that obtaining the exponent from logarithmically binned data using a simple least squares method is in some cases warranted in addition to methods such as maximum likelihood. We also show why often-used cumulative distributions can make it difficult to distinguish noise from genuine features and to obtain an accurate power law exponent of the underlying distribution. The treatment is nontechnical, aimed at IS researchers with little or no background in mathematics.
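A small sketch of logarithmic binning itself: bin edges equally spaced in log space, with counts normalized by bin width so the heavy tail is smoothed rather than noisy (the Pareto sample is an arbitrary stand-in for IS data):

```python
import numpy as np

rng = np.random.default_rng(7)
# Heavy-tailed sample, the setting where linear bins give noisy tails.
x = 1.0 + rng.pareto(2.0, size=5000)

# Logarithmic binning: edges equally spaced in log space (slightly padded
# so the extreme values fall inside), counts normalized by bin width so
# the histogram estimates a density.
edges = np.logspace(np.log10(x.min() * 0.999), np.log10(x.max() * 1.001), num=20)
counts, _ = np.histogram(x, bins=edges)
density = counts / (np.diff(edges) * len(x))
```

On a log-log plot of `density` against the bin centers, a power law appears as a straight line well into the tail, which is the effect the article exploits.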
Article
Given a graph, how can we find a small group of ‘gateways’, that is, a small subset of nodes that are crucial in connecting the source to the target? For instance, given a social network, who is the best person to introduce you to, say, Chris Ferguson, the poker champion? Or, given a network of people and skills, who is the best person to help you learn about, say, wavelets? We formally formulate this problem in two scenarios: Pair-Gateway and Group-Gateway. For each scenario, we show that it is submodular and thus can be solved near-optimally. We further give fast, scalable algorithms to find such gateways. Extensive experimental evaluations on real data sets demonstrate the effectiveness and efficiency of the proposed methods.
Article
Given a large time-evolving network, how can we model and characterize the temporal behaviors of individual nodes (and network states)? How can we model the behavioral transition patterns of nodes? We propose a temporal behavior model that captures the 'roles' of nodes in the graph and how they evolve over time. The proposed dynamic behavioral mixed-membership model (DBMM) is scalable, fully automatic (no user-defined parameters), non-parametric/data-driven (no specific functional form or parameterization), interpretable (identifies explainable patterns), and flexible (applicable to dynamic and streaming networks). Moreover, the interpretable behavioral roles are generalizable and computationally efficient, and the model natively supports attributes. We applied our model for (a) identifying patterns and trends of nodes and network states based on the temporal behavior, (b) predicting future structural changes, and (c) detecting unusual temporal behavior transitions. We use eight large real-world datasets from different time-evolving settings (dynamic and streaming). In particular, we model the evolving mixed-memberships and the corresponding behavioral transitions of Twitter, Facebook, IP-Traces, Email (University), Internet AS, Enron, Reality, and IMDB. The experiments demonstrate the scalability, flexibility, and effectiveness of our model for identifying interesting patterns, detecting unusual structural transitions, and predicting the future structural changes of the network and individual nodes.
Article
We provide a systematic analysis of nonnegative matrix factorization (NMF) relating to data clustering. We generalize the usual X = FG^T decomposition to the symmetric W = HH^T and W = HSH^T decompositions. We show that (1) W = HH^T is equivalent to Kernel K-means clustering and Laplacian-based spectral clustering, and (2) X = FG^T is equivalent to simultaneous clustering of the rows and columns of a bipartite graph. We emphasize the importance of orthogonality in NMF and the soft clustering nature of NMF. These results are verified with experiments on face images and newsgroups.
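The clustering reading of X = FG^T can be illustrated with standard Lee–Seung multiplicative updates (a generic NMF algorithm, not the symmetric variants analyzed in the paper): after factoring, the argmax over each row of G gives a soft-to-hard column-cluster label.

```python
import numpy as np

# Sketch: NMF via multiplicative updates, X ≈ F @ G.T with F, G >= 0.
# Cluster labels for columns of X come from the argmax over rows of G.

def nmf(X, k, iters=200, eps=1e-9):
    rng = np.random.default_rng(0)
    F = rng.random((X.shape[0], k))
    G = rng.random((X.shape[1], k))
    for _ in range(iters):
        F *= (X @ G) / (F @ (G.T @ G) + eps)    # update left factor
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)  # update right factor
    return F, G

# Two obvious column clusters: columns {0, 1} vs columns {2, 3}.
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
F, G = nmf(X, 2)
labels = G.argmax(axis=1)   # hard clustering of the columns
```

The rows of G act as soft membership scores, which is exactly the "soft clustering nature of NMF" the abstract refers to; taking the argmax hardens them.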
Article
Given a network, intuitively two nodes belong to the same role if they have similar structural behavior. Roles should be automatically determined from the data, and could be, for example, "clique-members," "periphery-nodes," etc. Roles enable numerous novel and useful network-mining tasks, such as sense-making, searching for similar nodes, and node classification. This paper addresses the question: Given a graph, how can we automatically discover roles for nodes? We propose RolX (Role eXtraction), a scalable (linear in the number of edges), unsupervised learning approach for automatically extracting structural roles from general network data. We demonstrate the effectiveness of RolX on several network-mining tasks: from exploratory data analysis to network transfer learning. Moreover, we compare network role discovery with network community discovery. We highlight fundamental differences between the two (e.g., roles generalize across disconnected networks, communities do not); and show that the two approaches are complementary in nature.
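A RolX-style pipeline can be sketched in two stages: build structural features per node, then factor the node-by-feature matrix and assign each node its dominant role. The feature set (degree plus mean neighbor degree as one 'recursive' aggregate) and the plain multiplicative NMF below are illustrative simplifications, not RolX's exact recursive feature construction:

```python
import numpy as np

# Toy graph: a star with a short tail hanging off one of its edges.
A = np.array([[0, 1, 1, 1, 0],    # node 0: star center
              [1, 0, 0, 0, 0],    # nodes 1, 2: star edges
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 1],    # node 3: bridge toward the periphery
              [0, 0, 0, 1, 0]],   # node 4: periphery
             dtype=float)

deg = A.sum(axis=1)
nbr = (A @ deg) / np.maximum(deg, 1)   # mean neighbor degree ('recursive' step)
V = np.column_stack([deg, nbr])        # node-by-feature matrix

rng = np.random.default_rng(1)
W, H = rng.random((5, 2)), rng.random((2, 2))
for _ in range(300):                    # multiplicative NMF updates, V ≈ W @ H
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
roles = W.argmax(axis=1)               # hard role label per node
```

Rows of W are the mixed-membership role scores; nodes 1 and 2 have identical features, so they land in the same role regardless of the graph being tiny.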
Article
A method is described for choosing the number of components to retain in a principal component analysis when the aim is dimensionality reduction. The correspondence between principal component analysis and the singular value decomposition of the data matrix is used. The method is based on successively predicting each element in the data matrix after deleting the corresponding row and column of the matrix, and makes use of recently published algorithms for updating a singular value decomposition. These are very fast, which renders the proposed technique a practicable one for routine data analysis.
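The element-prediction idea above can be sketched as follows. This is a single-pass simplification with assumed details (random held-out cells, mean imputation, and one full SVD per rank k) rather than the paper's element-by-element scheme with fast SVD updating:

```python
import numpy as np

# Sketch: choose the number of components k by squared prediction error
# at held-out matrix cells. Held-out cells are mean-imputed before the SVD.

def heldout_error(X, k, mask):
    Xf = X.copy()
    col_mean = (X * ~mask).sum(axis=0) / (~mask).sum(axis=0)
    rows, cols = np.where(mask)
    Xf[rows, cols] = col_mean[cols]               # impute held-out cells
    U, s, Vt = np.linalg.svd(Xf, full_matrices=False)
    Xk = (U[:, :k] * s[:k]) @ Vt[:k]              # rank-k reconstruction
    return float(((X[mask] - Xk[mask]) ** 2).sum())

rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 6))   # rank-2 signal
X = latent + 0.01 * rng.normal(size=latent.shape)             # small noise
mask = rng.random(X.shape) < 0.1                              # hold out ~10%
errs = [heldout_error(X, k, mask) for k in range(1, 6)]
best_k = 1 + int(np.argmin(errs))
```

On this rank-2-plus-noise matrix, error drops sharply once both signal components are kept and rises again as extra components start reproducing the imputed (wrong) values, which is why the held-out criterion avoids the monotone-fit trap of in-sample reconstruction error.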
Article
Many networked systems, including physical, biological, social, and technological networks, appear to contain "communities" -- groups of nodes within which connections are dense, but between which they are sparser. The ability to find such communities in an automated fashion could be of considerable use. Communities in a web graph for instance might correspond to sets of web sites dealing with related topics, while communities in a biochemical network or an electronic circuit might correspond to functional units of some kind. We present a number of new methods for community discovery, including methods based on "betweenness" measures and methods based on modularity optimization. We also give examples of applications of these methods to both computer-generated and real-world network data, and show how our techniques can be used to shed light on the sometimes dauntingly complex structure of networked systems.
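The modularity score that the optimization-based methods maximize can be computed directly: Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j), where m is the edge count and k_i the degrees. A sketch on a toy graph, showing that a dense-within / sparse-between partition scores higher than an arbitrary one:

```python
import numpy as np

# Sketch: Newman-Girvan modularity Q of a candidate partition.
def modularity(A, labels):
    m = A.sum() / 2.0                               # number of edges
    k = A.sum(axis=1)                               # node degrees
    same = labels[:, None] == labels[None, :]       # same-community mask
    return float(((A - np.outer(k, k) / (2 * m)) * same).sum() / (2 * m))

# Toy graph: two triangles joined by a single bridge edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

triangles = np.array([0, 0, 0, 1, 1, 1])   # the 'natural' communities
arbitrary = np.array([0, 1, 0, 1, 0, 1])   # a split ignoring structure
# modularity(A, triangles) = 5/14 ≈ 0.357 > modularity(A, arbitrary)
```

Community-discovery algorithms search over partitions for high Q; contrast this with role discovery above, where nodes in the two triangles would share roles despite sitting in different communities.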