Conference Paper

node2vec: Scalable Feature Learning for Networks

Authors: Aditya Grover, Jure Leskovec

Abstract

Prediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks. Here we propose node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks. In node2vec, we learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. We define a flexible notion of a node's network neighborhood and design a biased random walk procedure, which efficiently explores diverse neighborhoods. Our algorithm generalizes prior work which is based on rigid notions of network neighborhoods, and we argue that the added flexibility in exploring neighborhoods is the key to learning richer representations. We demonstrate the efficacy of node2vec over existing state-of-the-art techniques on multi-label classification and link prediction in several real-world networks from diverse domains. Taken together, our work represents a new way for efficiently learning state-of-the-art task-independent representations in complex networks.
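To make the biased walk concrete, here is a minimal Python sketch (an illustration, not the reference implementation): it assumes an unweighted graph stored as an adjacency dict and reweights each step by the paper's return parameter p and in-out parameter q, based on the walk's previous node.

```python
# Minimal sketch of node2vec's second-order biased random walk.
# Assumptions: unweighted graph as a dict of neighbor sets; toy graph below.
import random

def biased_walk(adj, start, walk_length, p=1.0, q=1.0):
    """Sample one node2vec walk; the next step is biased by the previous node."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = list(adj[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(random.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:            # distance 0 from prev: return step, weight 1/p
                weights.append(1.0 / p)
            elif nxt in adj[prev]:     # distance 1 from prev: BFS-like step, weight 1
                weights.append(1.0)
            else:                      # distance 2 from prev: DFS-like step, weight 1/q
                weights.append(1.0 / q)
        walk.append(random.choices(neighbors, weights=weights)[0])
    return walk

# Toy graph: a triangle with a tail.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
walks = [biased_walk(graph, n, walk_length=10, p=0.25, q=4.0)
         for n in graph for _ in range(5)]
print(walks[0])
```

Small p makes returning to the previous node likely (local, BFS-like exploration), while small q pushes the walk outward (DFS-like exploration); the resulting walks are then fed to a skip-gram model.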


... Our proposed solution is based on an attention-based and graph-based analysis of interaction datasets. In our graph-based analysis and development, we use the unsupervised vertex embedding approach Node2vec [20] to generate embedding attributes associated with each chromosome position. Afterward, we train a graph attention neural network (GAT) [21] with residual connections to use these unsupervised embedding attributes in predicting the three-dimensional coordinates of each chromosome position. ...
... To tackle this, we use the Node2vec [20] algorithm to produce node representations. Node2vec is an adaptable and scalable method for node representation that captures both local and global graph structure by simulating biased random walks on the graph. ...
Preprint
Full-text available
Hi-C is an experimental technique to measure the genome-wide topological dynamics and three-dimensional (3D) shape of chromosomes indirectly, by counting the number of interactions between distinct sets of loci. One can estimate the 3D shape of a chromosome from these indirect interaction datasets. Here, we introduce GAT-HiC, a graph attention and residual network-based method to predict three-dimensional chromosome structure from Hi-C interactions. GAT-HiC is distinct from existing 3D chromosome shape prediction approaches in that it can generalize to data different from the training data. Thus, we can train GAT-HiC on one type of Hi-C interaction matrix and infer on a completely dissimilar interaction matrix. GAT-HiC combines the unsupervised vertex embedding method Node2vec with an attention-based graph neural network to predict each genomic locus's three-dimensional coordinates from the Hi-C interaction matrix. We test the performance of our method across multiple Hi-C interaction datasets, where a trained model can be generalized across distinct cell populations, distinct restriction enzymes, and distinct Hi-C resolutions over human and mouse. GAT-HiC reconstructs accurately in all these scenarios. Our method outperforms existing approaches in terms of the accuracy of three-dimensional chromosome shape inference over interaction datasets.
... This has led to the development of graph representation learning, which aims to extract the underlying information from the graph and project it into vectors for prediction [2], [3]. Early works, such as DeepWalk [4], LINE [5], and node2vec [6], generate representations for individual nodes while preserving neighbor relationships in the embedding distribution. However, these methods ignore node features and fail to capture high-order structural information in the graph. ...
... Graph representation learning [2], [3] aims to extract implicit information from structured data and transform it into vectors, facilitating subsequent reasoning tasks. Initially, Deepwalk [4], Line [5], and Node2vec [6] propose preserving the neighborhood relationships within the distribution of representations. Inspired by word embedding technology, they generate random walk sequences on the graph and utilize the cooccurrence of nodes to produce representations. ...
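The shared second stage these excerpts describe (treating walks as sentences and learning from node co-occurrence) can be sketched with gensim's skip-gram implementation; the toy walks and hyperparameters below are illustrative assumptions, not DeepWalk's or node2vec's published settings.

```python
# Sketch: train a skip-gram model on random walks to get node embeddings.
# Assumes the gensim library (>= 4.0); nodes are encoded as string tokens.
from gensim.models import Word2Vec

walks = [["0", "1", "2", "3", "2", "1"],
         ["2", "3", "2", "0", "1", "0"]]   # toy walks; real ones come from a sampler
model = Word2Vec(
    sentences=walks,
    vector_size=16,   # embedding dimension d
    window=5,         # context size over each walk
    sg=1,             # skip-gram (not CBOW)
    negative=5,       # negative sampling
    min_count=1,
    seed=42,
)
print(model.wv["2"].shape)  # (16,) embedding of node "2"
```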
Preprint
Graphs effectively characterize relational data, driving graph representation learning methods that uncover underlying predictive information. As state-of-the-art approaches, Graph Neural Networks (GNNs) enable end-to-end learning for diverse tasks. Recent disentangled graph representation learning enhances interpretability by decoupling independent factors in graph data. However, existing methods often implicitly and coarsely characterize graph structures, limiting structural pattern analysis within the graph. This paper proposes the Graph Optimal Matching Kernel Convolutional Network (GOMKCN) to address this limitation. We view graphs as node-centric subgraphs, where each subgraph acts as a structural factor encoding position-specific information. This transforms graph prediction into structural pattern recognition. Inspired by CNNs, GOMKCN introduces the Graph Optimal Matching Kernel (GOMK) as a convolutional operator, computing similarities between subgraphs and learnable graph filters. Mathematically, GOMK maps subgraphs and filters into a Hilbert space, representing graphs as point sets. Disentangled representations emerge from projecting subgraphs onto task-optimized filters, which adaptively capture relevant structural patterns via gradient descent. Crucially, GOMK incorporates local correspondences in similarity measurement, resolving the trade-off between differentiability and accuracy in graph kernels. Experiments validate that GOMKCN achieves superior accuracy and interpretability in graph pattern mining and prediction. The framework advances the theoretical foundation for disentangled graph representation learning.
... After establishing the ingredient complement networks, we represented their nodes with embeddings using Node2Vec [64], which maps the nodes in a network to a feature space while preserving the initial structure of the network. We show the details of finetuning the parameters in Appendix B. ...
... The numbers of raw terms in the online recipes, cleaned and processed ingredients, and ingredients with flavor compounds are shown in Table A1. The appropriate value of this parameter varies across different scenarios [64]; thus, we undertook a series of experiments to evaluate various candidate values. Specifically, during the generation of the node embeddings for the ingredients in the complement networks of recipes in the Xiachufang, Allrecipes, and Kochbar collections, values from 10 to 25 were examined. ...
Article
Full-text available
Navigating cross-cultural food choices is complex, influenced by cultural nuances and various factors, with flavor playing a crucial role. Understanding cultural flavor preferences helps individuals make informed food choices in cross-cultural contexts. We examined flavor differences across China, the US, and Germany, as well as consistent flavor preference patterns, using online recipes from prominent recipe portals. Distinct from applying traditional food pairing theory, we directly mapped ingredients to their individual flavor compounds using an authorized database. This allowed us to analyze cultural flavor preferences at the molecular level and conduct machine learning experiments on 25,000 recipes from each culture to reveal flavor-based distinctions. The classifier, trained on these flavor compounds, achieved 77% accuracy in discriminating recipes by country in a three-class classification task, where random choice would yield 33.3% accuracy. Additionally, using user interaction data on appreciation metrics from each recipe portal (e.g., recipe ratings), we selected the top 10% and bottom 10% of recipes as proxies for appreciated and less appreciated recipes, respectively. Models trained within each portal discriminated between the two groups, reaching a maximum accuracy of 66%, while random selection would result in a baseline accuracy of 50%. We also explored cross-cultural preferences by applying classifiers trained on one culture to recipes from other cultures. While the cross-cultural performance was modest (specifically, a max accuracy of 54% was obtained when predicting food preferences of the US users with models trained on the Chinese data), the results indicate potential shared flavor patterns, especially between Chinese and US recipes, which show similarities, while German preferences differ. Exploratory analyses further validated these findings: we constructed ingredient networks based on co-occurrence relationships to label recipes as savory or sweet, and clustered the flavor profiles of compounds as sweet or non-sweet. These analyses showed opposing trends in sweet vs. non-sweet/savory appreciation between US and German users, supporting the machine learning results. Although our findings are likely to be influenced by biases in online data sources and the limitations of data-driven methods, they may still highlight meaningful cultural differences and shared flavor preferences. These insights offer potential for developing food recommender systems that cater to cross-cultural contexts.
... To analyze this chain transaction knowledge, we propose a method involving graph representation learning and graph neural networks (GNNs). Earlier network embedding approaches utilize biased random walks to preserve vertices' local neighborhoods (Grover & Leskovec, 2016; Perozzi et al., 2014; Tang et al., 2015). Another line of work (Patro et al., 2012; Peng et al., 2020; Sun et al., 2019; Veličković et al., 2019) maximizes the mutual information between global and local embeddings by training vertex encoders. ...
... In this case, we have compared the performance of embedding the transaction network via our motif-based approach combined with GAT. Comparison is made with respect to 4 other well-known GNNs in the literature: Node2Vec (Grover & Leskovec, 2016), GraphWave (Donnat et al., 2018), Graph Isomorphism Network (GIN) (Xu et al., 2018), and WatchYourStep (Abu-El-Haija et al., 2018). Among these GNNs, Node2Vec incorporates multiple distinct descriptions for vertex neighborhoods in networks via biased random walk simulation. ...
Article
Full-text available
The decentralized and transparent nature of cryptocurrencies has lately increased investors' interest in them. Forecasting a cryptocurrency's price accurately is crucial for a good investment strategy, and such a forecast requires one to consider the currency's unique attributes as well as its high volatility. Even though many existing studies have focused on analyzing cryptocurrency transaction graph topology, studies on the analysis of the transaction graph's impact on prices are quite limited. In this paper, we explore the forecasting ability of blockchain transaction graph-based attributes on Bitcoin's and Ethereum's future prices via deep learning methods. More specifically, we propose the motif convolution module (MCM), a motif-based graph representation learning approach that strongly accounts for local structural knowledge in node- and edge-attributed transaction graphs encoding substantial structural knowledge. Our proposed MCM constructs a motif dictionary without supervision and employs a new motif convolution operation to extract each vertex's local structural context. Afterwards, we learn high-level vertex embeddings from this structural context via a multilayer perceptron and a graph neural network. Overall, we extract temporally evolving low-dimensional representations of the attributed transaction graphs, and use these embeddings together with historical prices within a self-attention-based LSTM to predict future prices accurately. Our proposed approach outperforms all considered baselines in terms of both price and price-direction prediction, showing the promise of efficient integration of transaction data into cryptocurrency price prediction.
... For the model of semantic content, we embedded the titles and abstracts of published papers into a 128-dimensional 'semantic space' using word embedding models 47 . We also embedded articles in a 128-dimensional 'intellectual space' as a function of their position within the set of cited previous research with network embedding models 48 . ...
... The articles were considered to be written by the same author if and only if the first name, the last name and the organization of the authors matched between the articles. After building the networks, we ran the Node2vec algorithm 48 (algorithm code available on GitHub at https://github.com/eliorc/node2vec) on the networks to embed the articles into each vector space. Network embedding models have revolutionized network prediction and description, just as text ...
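For reference, a hedged usage sketch of the cited package (https://github.com/eliorc/node2vec, installable via `pip install node2vec`): the keyword names follow that package's README at the time of writing and may differ across versions, and the karate-club graph stands in for the citation networks described above.

```python
# Sketch: embed a graph with the eliorc/node2vec package.
# Assumes networkx and the node2vec package are installed.
import networkx as nx
from node2vec import Node2Vec

G = nx.karate_club_graph()                        # stand-in for the article network
node2vec = Node2Vec(G, dimensions=128, walk_length=30,
                    num_walks=200, p=1, q=1, workers=2)
model = node2vec.fit(window=10, min_count=1)      # returns a gensim Word2Vec model
article_vector = model.wv[str(0)]                 # 128-d embedding of node 0
print(article_vector.shape)
```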
Article
Full-text available
Scientific research is often characterized by schools of thought. We investigate whether these divisions are associated with differences in researchers’ cognitive traits such as tolerance for ambiguity. These differences may guide researchers to prefer different problems, tackle identical problems in different ways, and even reach different conclusions when studying the same problems in the same way. We surveyed 7,973 researchers in psychological sciences and investigated links between what they research, their stances on open questions in the field, and their cognitive traits and dispositions. Our results show that researchers’ stances on scientific questions are associated with what they research and with their cognitive traits. Further, these associations are detectable in their publication histories. These findings support the idea that divisions in scientific fields reflect differences in the researchers themselves, hinting that some divisions may be more difficult to bridge than suggested by a traditional view of data-driven scientific consensus.
... A better solution to tackle the complexity issue is to utilize a random walk strategy within a machine learning framework. For example, DeepWalk [21] and node2vec [22] utilize random walks and a skip-gram with negative sampling model to learn a more efficient network representation. However, random walk-based NRL algorithms can only capture linear information from network data due to their usage of linear node sequences [23]. ...
... Thus, LINE models and optimizes the cooccurrence probability and node conditional probability to learn node representations that preserve first-order and second-order proximities. • Node2vec [22]: Node2vec improves DeepWalk by modifying the sampling method of random walk, which can strike a balance between local and global ...
Article
Full-text available
In recent years, network representation learning (NRL) has attracted increasing attention due to its efficiency and effectiveness in analyzing network structural data. NRL aims to learn low-dimensional representations of nodes while preserving their structural information, and preserving multiscale structural information of nodes is important for NRL. Deep learning-based algorithms are popular owing to their good performance in learning network representations, but as black boxes they lack sufficient interpretability. In this study, we propose a novel algorithm called Multiscale structural information-based Laplacian generative adversarial Network Representation Learning (MLNRL). This algorithm consists of two components: 1) a multiscale structural information preserving component, where a shifted positive pointwise mutual information (SPPMI) matrix is calculated to store multiscale structural information; 2) a Laplacian generative adversarial learning component, where the ideas of the Laplacian pyramid and generative adversarial networks are leveraged to generate robust and meaningful representations. We apply our model to three downstream tasks on real-world datasets for evaluation, and the results show that our model outperforms the baselines in almost all cases. We then perform an ablation study and verify the necessity of both components. We also investigate hyperparameter sensitivity to demonstrate the robustness of MLNRL.
... The node2vec algorithm proposed by Grover and Leskovec improves upon DeepWalk by using a biased random-walk strategy [39]. In the multi-modal embedding field, Fukui et al. [40] proposed a multi-modal compact bilinear pooling method for fusing image and text features to obtain multi-modal embedding representations. ...
Article
Full-text available
The characteristics of protein pockets can better capture the interaction information between proteins and small molecules, thereby improving the performance of drug-target interaction (DTI) prediction tasks. However, pocket data typically need to be predicted using software such as AlphaFold, which would entail a massive workload for datasets ranging from tens of thousands to hundreds of thousands of samples. Moreover, feature representation networks for 3D pocket data are computationally intensive. To address this, we propose simulating 3D pocket data using sequence data through feature fusion of two different objects based on structure cross-attention (CASD). Additionally, precise feature representation is a prerequisite for accurately identifying pocket information. We introduce a method that leverages the output of the last layer of a pre-trained model as an embedding layer for training a new model from scratch. This approach not only incorporates prior knowledge from the pre-trained model but also expands model capacity, enabling more accurate feature representation. Furthermore, we enhance the multimodal representation of small molecule compounds using feature fusion based on structure cross-attention for the same object (CASS), further improving feature representation capabilities. Our cross-attention mechanisms operate at the token-level or node-level, allowing fine-grained capture of interactions between amino acids and atoms. This enables the identification of the contribution score of each atom or amino acid to the task, making our model interpretable for drug-target prediction. Experimental validation demonstrates that our model achieves state-of-the-art predictive performance.
... , G C }, sharing the same set of n nodes but differing in the number of edges in each network. We capture the structural information of the networks by converting them into text-like sequences using random walks, similar to node2vec [40]. The walks are encoded through an embedding matrix ξ ∈ R^(n×d), where n is the size of the vocabulary (total nodes across all networks) and d is the desired embedding dimension. ...
Preprint
Full-text available
The inference of gene regulatory networks (GRNs) is a foundational stride towards deciphering the fundamentals of complex biological systems. Inferring a possible regulatory link between two genes can be formulated as a link prediction problem. Inference of GRNs via gene coexpression profiling data may not always reflect true biological interactions, due to its susceptibility to noise and to misrepresenting true biological regulatory relationships. Most GRN inference methods face several challenges in the network reconstruction phase. Therefore, it is important to encode gene expression values and to leverage the prior knowledge gained from available inferred network structures and the positional information of the input network nodes towards a better and more confident GRN reconstruction. In this paper, we explore the integration of multiple inferred networks to enhance the inference of gene regulatory networks. Primarily, we employ autoencoder embeddings to capture gene expression patterns directly from raw data, preserving intricate biological signals. Then, we embed the prior knowledge from GRN structures, transforming them into a text-like representation using random walks, which are then encoded with a masked language model, BERT, to generate global embeddings for each gene across all networks. Additionally, we embed the positional encodings of the input gene networks to better identify the position of each unique gene within the graph. These embeddings are integrated into a graph transformer-based model, termed GT-GRN, for GRN inference. The GT-GRN model effectively utilizes the topological structure of the ground truth network while incorporating the enriched encoded information. Experimental results demonstrate that GT-GRN significantly outperforms existing GRN inference methods, achieving superior accuracy and highlighting the robustness of our approach.
... Node2Vec [26] learns a task-independent embedding for the nodes. The ultimate goal of this method is to find a matrix Z ∈ R^(d×|V|), where column i is the representation of node N_i. ...
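As a minimal illustration of this convention (not the surveyed paper's code), the sketch below stores embeddings column-wise in a numpy matrix Z and reads out node i's representation; all sizes are toy values.

```python
# Sketch of the Z ∈ R^(d×|V|) convention: column i holds node i's embedding.
import numpy as np

d, num_nodes = 8, 5
rng = np.random.default_rng(0)
Z = rng.normal(size=(d, num_nodes))   # stand-in for learned embeddings

def node_representation(Z, i):
    return Z[:, i]                    # column i = d-dimensional vector for node i

print(node_representation(Z, 3).shape)  # (8,)
```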
Preprint
Drug discovery requires a tremendous amount of time and cost. Computational drug-target interaction prediction, a significant part of this process, can reduce these requirements by narrowing the search space for wet lab experiments. In this survey, we provide comprehensive details of graph machine learning-based methods in predicting drug-target interaction, as they have shown promising results in this field. These details include the overall framework, main contribution, datasets, and their source codes. The selected papers were mainly published from 2020 to 2024. Prior to discussing papers, we briefly introduce the datasets commonly used with these methods and measurements to assess their performance. Finally, future challenges and some crucial areas that need to be explored are discussed.
... Service representations constructed from graph data can be studied more generally under the field of network embedding methods, where features are extracted from the network using random walks, neighborhood aggregation, or network measures such as average path length and node in- or out-degree. Traditional network embedding models try to leverage different aspects of the topological information present in a network to establish relations or measures of similarity among their nodes; examples include DeepWalk [13], LINE [14] and node2vec [15]. While these approaches contributed to data mining tasks on network data such as link prediction, node classification and clustering, they failed to take advantage of the rich information that is often available as node attributes. ...
Article
Full-text available
In this paper, we propose an approach to learn optimal service representations that can be used in downstream data mining tasks by combining attributes and multiple relations from a heterogeneous information network and using meta-paths to model service relations. We construct a heterogeneous information network that connects services, mashups, and their attributes and derive latent relational knowledge from it for representation learning. We address two major challenges related to using such a heterogeneous network in representation learning: (1) how to effectively combine attribute information and network topology, and (2) how to optimize the weighting scheme among those inputs to enhance learning from relevant features and minimize distraction from noisy ones. Our approach can take advantage of partially or fully annotated data, and still works in a fully unsupervised setting. We conduct a comprehensive experimental study on a real-world dataset, where we use clustering as a downstream task. In our experiments, our model performs better than all competing approaches.
... At present, there are many models based on knowledge graph embedding representation, such as translation models, semantic matching models, and neural network models. In this study, we primarily employ the Node2vec model (Grover and Leskovec 2016) to embed the representations of the developed PMinrKG into a low-dimensional vector space. The primary advantage of Node2vec lies in its flexible random walk strategy, which integrates the characteristics of depth-first search (DFS) and breadth-first search (BFS). ...
Article
Full-text available
Currently, geological reports and maps of mineral resources contain a wealth of earth science knowledge and expert experience. A key challenge in mineral resource exploration and prediction is the standardization of complex mineral deposit data into structured, analyzable formats, the extraction of relevant knowledge from these data, and their effective application in mineral deposit research. This paper presents an intelligent mining prediction framework based on multimodal data for the construction of a polymetallic mineral resource knowledge graph (PMinrKG). Firstly, using mineral geological survey reports and geological maps as data sources, entity relationship extraction is performed using the current mainstream Universal Information Extraction framework (UIE) and ArcGIS Pro software, and the results are aligned and fused to form PMinrKG. Secondly, we systematically organized the service applications of KGs along four dimensions: analysis of the elements of mineralization, semantic understanding based on the KG, KG-based intelligent Q&A analysis, and mineral resource relation prediction based on KG embedding. Experimental results indicate that the mineral resources knowledge graph, as a semantic network, can provide valuable insights through in-depth exploration and analysis. By extracting multidimensional information, such as mineral types and associated strata, it offers critical reference value for effectively delineating deep mineral resource exploration areas.
... These embeddings capture key structural and semantic information of the graphs, making them suitable for downstream machine learning tasks. Representative methods can be categorized into node [26]- [28] and whole graph embedding [29]- [32]. Node embedding methods focus on mapping individual nodes within a graph to vector representations while preserving local structural information, such as neighborhood connectivity or node attributes. ...
... To calculate the distance d_ij in the DIV* formula, we constructed three disciplinary citation networks corresponding to the three levels of topics in the Citation Topics classification system. Then, the Node2Vec (Grover & Leskovec, 2016) module was used to generate a 64-dimensional disciplinary vector, Vector_i, for each research area. ...
Article
Full-text available
Purpose Interdisciplinary research has become a critical approach to addressing complex societal, economic, technological, and environmental challenges, driving innovation and integrating scientific knowledge. While interdisciplinarity indicators are widely used to evaluate research performance, the impact of classification granularity on these assessments remains underexplored. Design/methodology/approach This study investigates how different levels of classification granularity (macro, meso, and micro) affect the evaluation of interdisciplinarity in research institutes. Using a dataset of 262 institutes from four major German non-university organizations (FHG, HGF, MPG, WGL) from 2018 to 2022, we examine inconsistencies in interdisciplinarity across levels, analyze ranking changes, and explore the influence of institutional fields and research focus (applied vs. basic). Findings Our findings reveal significant inconsistencies in interdisciplinarity across classification levels, with rankings varying substantially. Notably, the Fraunhofer Society (FHG), which performs well at the macro level, experiences significant ranking declines at the meso and micro levels. Normalizing interdisciplinarity by research field confirmed that these declines persist. The research focus of institutes, whether applied, basic, or mixed, does not significantly explain the observed ranking dynamics. Research limitations This study has only considered the publication-based dimension of institutional interdisciplinarity and has not explored other aspects. Practical implications The findings provide insights for policymakers, research managers, and scholars to better interpret interdisciplinarity metrics and support interdisciplinary research effectively. Originality/value This study underscores the critical role of classification granularity in interdisciplinarity assessment and emphasizes the need for standardized approaches to ensure robust and fair evaluations.
... Therefore, in traffic flow prediction tasks, understanding the road network's topology requires not only considering the adjacency relationships between spatial units but also incorporating the relationships between non-adjacent spatial units based on functional similarity. To address the limitations of the spatial position embedding model in terms of interpretability and modeling the relationships between non-adjacent spatial units, we introduce the classic random walk network representation learning methods (Perozzi et al. 2014; Grover and Leskovec 2016). The main process of these methods includes constructing transition probabilities, executing random walks, and generating network embeddings, as sketched below. ...
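A minimal sketch of those three steps, under simplifying assumptions: an unweighted toy adjacency matrix and unbiased (DeepWalk-style) walks; the cited node2vec variant would additionally bias the transitions.

```python
# Sketch: transition probabilities -> random walks -> (input for) embeddings.
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # toy road-segment adjacency
P = A / A.sum(axis=1, keepdims=True)        # step 1: row-normalized transition matrix

def sample_walk(P, start, length):
    """Step 2: sample one unbiased walk from the transition matrix."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(int(rng.choice(len(P), p=P[walk[-1]])))
    return walk

walks = [sample_walk(P, s, 12) for s in range(len(P)) for _ in range(10)]
# Step 3 would feed `walks` to a skip-gram trainer to obtain the spatial embeddings.
print(walks[0])
```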
Article
Full-text available
This paper focuses on a mid-to-long-term traffic flow prediction method that integrates spatial–temporal graph neural networks with the Transformer. It systematically improves upon existing methods, which struggle to effectively capture the spatial–temporal evolution patterns of traffic flow in complex traffic scenarios. The current research faces three key issues: (1) The temporal position embedding model significantly lacks effectiveness in modeling short-term evolution features of traffic flow. (2) The spatial position embedding lacks a verifiable mechanism, making it difficult to confirm its validity in representing actual road network topologies. (3) The spatial–temporal graph has a single spatial structure, limiting the ability of spatial–temporal graph neural networks to capture multistage evolution patterns of traffic flow. To address these issues, we propose a traffic flow prediction framework based on a dynamic adaptive spatial–temporal graph Transformer. The framework first improves the temporal position embedding model's expressiveness by introducing attributes closely related to short-term traffic flow changes. Second, a random walk-based spatial embedding method is designed, where the transition probability matrix and node vector space mapping have a verifiable mathematical relationship, ensuring theoretical interpretability in the spatial position modeling process. Finally, a dynamic adaptive spatial–temporal graph neural network model is proposed. This model learns time-varying spatial structures in a data-driven manner based on an adaptive mechanism and combines temporal self-attention with dynamic adaptive graph attention networks to collaboratively capture multiscale spatial–temporal dependencies. Comparative experiments with six baseline methods on the real-world PEMS08 dataset demonstrate that the proposed framework exhibits significant performance advantages in traffic flow prediction.
... More specifically, these models compute a latent vector for each node during training, then define the probability of an edge as a function of the dot product between its endpoints' latent vectors. A closely related approach relies on graph embedding algorithms such as node2vec [16], which also compute a latent vector (also called embedding) for each node. Linear models can then take these vectors as input in order to predict normal edges and/or detect anomalous ones [44,50,8,35]. ...
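A tiny sketch of the dot-product decoder these models share, assuming node embeddings are already available (e.g., from node2vec); the sigmoid link and the anomaly-flag comment are illustrative, not the paper's exact scoring rule.

```python
# Sketch: score an edge (u, v) as sigmoid(z_u . z_v) from latent vectors.
import numpy as np

def edge_probability(Z, u, v):
    """Z: (num_nodes, d) embedding matrix; returns P(edge between u and v)."""
    return 1.0 / (1.0 + np.exp(-Z[u] @ Z[v]))

rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 16))              # stand-in embeddings for 10 nodes
print(edge_probability(Z, 2, 7))           # unusually low scores can flag anomalous edges
```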
Preprint
Full-text available
Foundation models have recently emerged as a new paradigm in machine learning (ML). These models are pre-trained on large and diverse datasets and can subsequently be applied to various downstream tasks with little or no retraining. This allows people without advanced ML expertise to build ML applications, accelerating innovation across many fields. However, the adoption of foundation models in cybersecurity is hindered by their inability to efficiently process data such as network traffic captures or binary executables. The recent introduction of graph foundation models (GFMs) could make a significant difference, as graphs are well-suited to representing these types of data. We study the usability of GFMs in cybersecurity through the lens of one specific use case, namely lateral movement detection. Using a pre-trained GFM, we build a detector that reaches state-of-the-art performance without requiring any training on domain-specific data. This case study thus provides compelling evidence of the potential of GFMs for cybersecurity.
... Even if the attacker does not fully control the entire system within the attack graph, they can still obtain partial information that is meaningful in estimating various elements guiding their decisions. State-of-the-art techniques such as those provided by [25,46,54] enable learning network features from partial knowledge, making this learning process feasible. Thus, the attacker launches a targeted attack to compromise a specific subset of nodes, selecting an optimal attack path to maximize expected rewards. ...
... After the skip-gram model is trained, it predicts the context nodes based on its input, and the output is the embedding of the nodes. Node2vec [10] is a technique for learning node embeddings in a graph. In fact, it is an extension of the word2vec model, which is a method for word embedding in textual data. ...
... Another way is to use embedding layers, which can represent sparse vectors as dense real-valued numerical vectors that can be used more efficiently by ANNs [9]. This approach has been used in graphs, where the node2vec embedding has been proposed to learn a continuous representation of nodes, aiding the decision to add new links to improve connectivity, i.e., planning network upgrades [10], [11]. ...
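A hedged sketch of the embedding-layer idea described here, assuming PyTorch; the feature names, sizes, and mean-pooling are illustrative, not the cited architecture.

```python
# Sketch: map sparse integer link IDs to dense trainable vectors for an ANN.
import torch
import torch.nn as nn

num_links, link_dim = 50, 8          # 50 links in the topology, 8-d embeddings
link_embedding = nn.Embedding(num_links, link_dim)

link_ids = torch.tensor([3, 17, 42])            # links traversed by one lightpath
dense = link_embedding(link_ids)                # shape: (3, 8), trainable vectors
path_feature = dense.mean(dim=0)                # pooled link-level feature for the path
print(path_feature.shape)                       # torch.Size([8])
```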
Article
Full-text available
Machine learning (ML) is emerging as a promising tool for estimating the Quality of Transmission (QoT) in optical networks, especially for unestablished lightpaths where traditional methods are limited. However, inaccuracies in ML-based QoT predictions, typically expressed in terms of generalized signal-to-noise ratio (GSNR), can significantly affect network operation. Overestimation may lead to retransmissions due to overly aggressive modulation format choices, while underestimation results in underutilized spectral resources. To address this, we propose a novel ML architecture that incorporates an embedding layer for link-level features alongside path- and service-level inputs. Using data generated from an accurate analytical model, we show that our approach reduces prediction error by up to 34% compared to standard architectures. Simulated deployment scenarios further demonstrate operational benefits, with a 15.9% decrease in incorrect and a 34.8% reduction in overly conservative modulation format selections.
... • Traditional Unsupervised Clustering Methods: AE [25], node2vec [18], struc2vec [42], and LINE [51]. ...
Preprint
Graph Neural Networks (GNNs) struggle to balance heterophily and homophily in representation learning, a challenge further amplified in self-supervised settings. We propose H³GNNs, an end-to-end self-supervised learning framework that harmonizes both structural properties through two key innovations: (i) Joint structural node encoding. We embed nodes into a unified space combining linear and non-linear feature projections with K-hop structural representations via a Weighted Graph Convolution Network (WGCN). A cross-attention mechanism enhances awareness of, and adaptability to, heterophily and homophily. (ii) Self-supervised learning using teacher-student predictive architectures with node-difficulty-driven dynamic masking strategies. We use a teacher-student model: the student sees the masked input graph and predicts node features inferred by the teacher, which sees the full input graph, in the joint encoding space. To increase learning difficulty, we introduce two novel masking strategies based on node predictive difficulty. Experiments on seven benchmarks (four heterophily datasets and three homophily datasets) confirm the effectiveness and efficiency of H³GNNs across diverse graph types. H³GNNs achieves overall state-of-the-art performance on the four heterophily datasets, while retaining on-par performance with previous state-of-the-art methods on the three homophily datasets.
Article
(1) Background: Circular RNAs (circRNAs) are covalently closed single-stranded molecules that play crucial roles in gene regulation, while microRNAs (miRNAs), specifically mature microRNAs, are naturally occurring small molecules of non-coding RNA with 17-25-nucleotide sizes. Understanding circRNA–miRNA interactions (CMIs) can reveal new approaches for diagnosing and treating complex human diseases. (2) Methods: In this paper, we propose a novel approach for predicting CMIs based on a graph attention network (GAT). We utilized DNABERT to extract molecular features of the circRNA and miRNA sequences and role-based graph embeddings generated by Role2Vec to extract the CMI features. The GAT’s ability to learn complex node dependencies in biological networks provided enhanced performance over the existing methods and the traditional deep neural network models. (3) Results: Our simulation studies showed that our GAT model achieved accuracies of 0.8762 and 0.8837 on the CMI-9905 and CMI-9589, respectively. These accuracies were the highest among the other existing CMI prediction methods. Our GAT method also achieved the highest performance as measured by the precision, recall, F1-score, area under the receiver operating characteristic (AUROC) curve, and area under the precision–recall curve (AUPR). (4) Conclusions: These results reflect the GAT’s ability to capture the intricate relationships between circRNAs and miRNAs, thus offering an efficient computational approach for prioritizing potential interactions for experimental validation.
Article
This article introduces the analytical approach of practice mapping, using vector embeddings of network actions and interactions to map commonalities and disjunctures in the practices of social media users, as a framework for methodological advancement beyond the limitations of conventional network analysis and visualization. In particular, the methodological framework we outline here has the potential to incorporate multiple distinct modes of interaction into a single practice map; can be further enriched with account-level attributes such as information gleaned from textual analysis, profile information, available demographic details, and other features; and can be applied even to a cross-platform analysis of communicative patterns and practices. The article presents practice mapping as an analytical framework and outlines its key methodological considerations. Given its prominence in past social media research, we draw on examples and data from the platform formerly known as Twitter to enable experienced scholars to translate their approaches to a practice mapping paradigm more easily, but point out how data from other platforms may be used in equivalent ways in practice mapping studies. We illustrate the utility of the approach by applying it to a dataset where the application of conventional network analysis and visualization approaches has produced few meaningful insights.
Preprint
Full-text available
We investigate tasks that can be accomplished with unlabelled graphs, where nodes do not have persistent or semantically meaningful labels. New techniques to visualize these graphs have been proposed, but more understanding of unlabelled graph tasks is required before they can be adequately evaluated. Some tasks apply to both labelled and unlabelled graphs, but many do not translate between these contexts. We propose a taxonomy of unlabelled graph abstract tasks, organized according to the Scope of the data at play, the Action intended by the user, and the Target data under consideration. We show the descriptive power of this task abstraction by connecting to concrete examples from previous frameworks, and connect these abstractions to real-world problems. To showcase the evaluative power of the taxonomy, we perform a preliminary assessment of 6 visualizations for each task. For each combination of task and visual encoding, we consider the effort required from viewers, the likelihood of task success, and how both factors vary between small-scale and large-scale graphs.
Article
Full-text available
Similarity-based analysis is a common and intuitive tool for exploring large data sets. For instance, grouping data items by their level of similarity, regarding one or several chosen aspects, can reveal patterns and relations from the intrinsic structure of the data and thus provide important insights in the sense-making process. Existing analytical methods (such as clustering and dimensionality reduction) tend to target questions such as “Which objects are similar?”; but since they are not necessarily well-suited to answer questions such as “How does the result change if we change the similarity criteria?” or “How are the items linked together by the similarity relations?” they do not unlock the full potential of similarity-based analysis—and here we see a gap to fill. In this paper, we propose that the concept of similarity could be regarded as both: (1) a relation between items, and (2) a property in its own, with a specific distribution over the data set. Based on this approach, we developed an embedding-based computational pipeline together with a prototype visual analytics tool which allows the user to perform similarity-based exploration of a large set of scientific publications. To demonstrate the potential of our method, we present two different use cases, and we also discuss the strengths and limitations of our approach.
Article
Location-based services and applications can provide large-scale vehicle trajectory data. However, these data are often sparse due to human factors and faulty positioning devices, making it challenging to use them in research tasks that require precision. This affects the efficiency and optimization of sustainable transportation systems. Therefore, this paper proposed a trajectory recovery model based on road network constraints and graph contrastive learning (RNCGCL). Vehicles must drive on the road and their driving processes are affected by the surrounding road network structure. Based on the motivations, bidirectional long short-term memory neural networks and an attention mechanism were used to obtain the spatiotemporal features of trajectory. Graph contrastive learning was applied to extract the local feature representation of road networks. A multi-task module was introduced to guarantee the recovered points strictly projected onto the road. Experiments showed that RNCGCL outperformed other benchmarks. It improved the F1-score by 2.81% and decreased the error by 8.62%, indicating higher accuracy and lower regression errors. Furthermore, this paper validated the effectiveness of the proposed method by case studies and downstream task performance. This study provides a robust solution for trajectory data recovery, contributing to the overall efficiency and sustainability of transportation.
Article
The transition to electric vehicles is a critical step toward achieving carbon neutrality and environmental sustainability. This shift relies on advancements across multiple technological domains, driving the need for strategic technology intelligence to anticipate emerging technology convergence opportunities. To address this challenge, this study aimed to provide an analytical framework for identifying technology convergence opportunities using node2vec graph embedding. A dual-level prediction framework that combines similarity-based scoring and machine learning-based classification was proposed to systematically identify potential new technology linkages between previously unrelated technology areas. The patent co-classification network was used to generate graph embeddings, which were then processed to calculate edge similarity among unconnected nodes and to train the classifier model. A case study in the EV market demonstrated that the framework can reliably predict future patterns across disparate technology domains. Consequently, advancements in battery protection, thermal management, and composite materials emerged as relevant for future technology development. These insights not only deepen our understanding of future innovation trends but also provide actionable guidance for optimizing R&D investments and shaping policy strategies in the evolving electric vehicle market. The findings contribute to a systematic approach to forecasting technology convergence, supporting innovation-driven growth in the evolving EV sector.
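A minimal sketch of the similarity-based scoring step this abstract describes, under simple assumptions: node2vec embeddings of technology classes are already computed, and unconnected pairs are ranked by cosine similarity. Names and toy sizes are illustrative.

```python
# Sketch: rank unconnected node pairs by cosine similarity of their embeddings.
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 32))                 # embeddings of 6 technology classes
connected = {(0, 1), (1, 2), (2, 3)}         # existing co-classification edges

candidates = [(i, j) for i in range(6) for j in range(i + 1, 6)
              if (i, j) not in connected]
scores = sorted(((cosine(Z[i], Z[j]), i, j) for i, j in candidates),
                reverse=True)
print(scores[:3])   # top-3 candidate convergence opportunities
```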
Conference Paper
Full-text available
In this paper, we present GraRep, a novel model for learning vertex representations of weighted graphs. This model learns low-dimensional vectors to represent vertices appearing in a graph and, unlike existing work, integrates global structural information of the graph into the learning process. We also formally analyze the connections between our work and several previous research efforts, including the DeepWalk model of Perozzi et al. as well as the skip-gram model with negative sampling of Mikolov et al. We conduct experiments on a language network, a social network as well as a citation network and show that our learned global representations can be effectively used as features in tasks such as clustering, classification and visualization. Empirical results demonstrate that our representation significantly outperforms other state-of-the-art methods in such tasks.
Article
Full-text available
Matrix factorization (MF) and Autoencoder (AE) are among the most successful approaches of unsupervised learning. While MF based models have been extensively exploited in the graph modeling and link prediction literature, the AE family has not gained much attention. In this paper we investigate both MF and AE's application to the link prediction problem in sparse graphs. We show the connection between AE and MF from the perspective of multiview learning, and further propose MF+AE: a model training MF and AE jointly with shared parameters. We apply dropout to training both the MF and AE parts, and show that it can significantly prevent overfitting by acting as an adaptive regularization. We conduct experiments on six real world sparse graph datasets, and show that MF+AE consistently outperforms the competing methods, especially on datasets that demonstrate strong non-cohesive structures.
Article
Full-text available
Recently deep learning has been successfully adopted in many applications such as speech recognition and image classification. In this work, we explore the possibility of employing deep learning in graph clustering. We propose a simple method, which first learns a nonlinear embedding of the original graph by a stacked autoencoder, and then runs the k-means algorithm on the embedding to obtain the clustering result. We show that this simple method has a solid theoretical foundation, due to the similarity between autoencoders and spectral clustering in terms of what they actually optimize. Then, we demonstrate that the proposed method is more efficient and flexible than spectral clustering. First, the computational complexity of the autoencoder is much lower than that of spectral clustering: the former can be linear in the number of nodes in a sparse graph while the latter is super-quadratic due to eigenvalue decomposition. Second, when an additional sparsity constraint is imposed, we can simply employ the sparse autoencoder developed in the deep learning literature; however, it is not straightforward to implement a sparse spectral method. The experimental results on various graph datasets show that the proposed method significantly outperforms conventional spectral clustering, which clearly indicates the effectiveness of deep learning in graph clustering.
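A compact sketch of the two-stage recipe this abstract describes: learn a nonlinear embedding with an autoencoder, then run k-means on it. It assumes PyTorch and scikit-learn and uses a random stand-in for the graph's normalized similarity matrix; it is an illustration, not the authors' implementation.

```python
# Sketch: autoencoder embedding of a graph similarity matrix + k-means clustering.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

n = 60                                        # number of nodes
S = torch.rand(n, n)
S = (S + S.T) / 2                             # stand-in for a normalized similarity matrix

encoder = nn.Sequential(nn.Linear(n, 32), nn.ReLU(), nn.Linear(32, 8))
decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, n))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for _ in range(200):                          # reconstruct each node's similarity row
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(S)), S)
    loss.backward()
    opt.step()

labels = KMeans(n_clusters=4, n_init=10).fit_predict(encoder(S).detach().numpy())
print(labels[:10])
```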
Chapter
Full-text available
Time-varying problems usually have complex underlying structures represented as dynamic networks where entities and relationships appear and disappear over time. The problem of efficiently performing dynamic link inference is extremely challenging due to the dynamic nature of massive evolving networks, especially when there exist sparse connectivities and nonlinear transitional patterns. In this paper, we propose a novel deep learning framework, i.e., the Conditional Temporal Restricted Boltzmann Machine (ctRBM), which predicts links based on individual transition variance as well as influence introduced by local neighbors. The proposed model is robust to noise and has the exponential capability to capture nonlinear variance. We tackle the computational challenges by developing an efficient algorithm for learning and inference of the proposed model. To improve the efficiency of the approach, we give a faster approximate implementation based on a proposed Neighbor Influence Clustering algorithm. Extensive experiments on simulated as well as real-world dynamic networks show that the proposed method outperforms existing algorithms in link inference on dynamic networks.
Conference Paper
Full-text available
Linked data consist of both node attributes, e.g., preferences, posts and degrees, and links which describe the connections between nodes. They have been widely used to represent various network systems, such as social networks and biological networks. Knowledge discovery on linked data is of great importance to many real applications. One of the major challenges of learning linked data is how to effectively and efficiently extract useful information from both node attributes and links. Current studies on this topic either use selected topological statistics to represent network structures, or linearly map node attributes and network structures to a shared latent feature space. However, while approaches based on statistics may miss critical patterns in network structure, approaches based on linear mappings may not be sufficient to capture the non-linear characteristics of nodes and links. To handle this challenge, we propose, to our knowledge, the first deep learning method to learn from linked data. A restricted Boltzmann machine model named LRBM is developed for representation learning on linked data. In LRBM, we aim to extract the latent feature representation of each node from both node attributes and network structures, non-linearly map each pair of nodes to the links, and use hidden units to control the mapping. The details of how to adapt LRBM for link prediction and node classification on linked data have also been presented. In the experiments, we test the performance of LRBM as well as other baselines on link prediction and node classification. Overall, the extensive experimental evaluations confirm the effectiveness of the proposed LRBM model in mining linked data.
Article
Full-text available
Multi-label classification methods are increasingly required by modern applications, such as protein function classification, music categorization, and semantic scene classification. This article introduces the task of multi-label classification, organizes the sparse related literature into a structured presentation, and presents comparative experimental results for certain multi-label classification methods. It also contributes the definition of concepts for the quantification of the multi-label nature of a data set.
Article
Full-text available
This paper studies the problem of embedding very large information networks into low-dimensional vector spaces, which is useful in many tasks such as visualization, node classification, and link prediction. Most existing graph embedding methods do not scale for real world information networks which usually contain millions of nodes. In this paper, we propose a novel network embedding method called the "LINE," which is suitable for arbitrary types of information networks: undirected, directed, and/or weighted. The method optimizes a carefully designed objective function that preserves both the local and global network structures. An edge-sampling algorithm is proposed that addresses the limitation of the classical stochastic gradient descent and improves both the effectiveness and the efficiency of the inference. Empirical experiments prove the effectiveness of the LINE on a variety of real-world information networks, including language networks, social networks, and citation networks. The algorithm is very efficient, which is able to learn the embedding of a network with millions of vertices and billions of edges in a few hours on a typical single machine. The source code of the LINE is available online.
Article
Full-text available
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
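The negative-sampling objective this abstract introduces can be written out numerically. A small sketch with random stand-in vectors: for a (word, context) pair, the objective is log σ(v_c · v_w) plus Σ log σ(−v_n · v_w) over k sampled negatives, and training minimizes its negation.

```python
# Sketch: the skip-gram negative-sampling (SGNS) loss for one training pair.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_w, v_c, negatives):
    """Negative of the SGNS objective for one (word, context) pair."""
    pos = np.log(sigmoid(v_c @ v_w))                          # true context term
    neg = sum(np.log(sigmoid(-v_n @ v_w)) for v_n in negatives)  # k negative samples
    return -(pos + neg)

rng = np.random.default_rng(0)
v_w, v_c = rng.normal(size=16), rng.normal(size=16)
negatives = rng.normal(size=(5, 16))          # k = 5 negative samples
print(sgns_loss(v_w, v_c, negatives))
```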
Article
Full-text available
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.
Article
Full-text available
Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.
Article
Full-text available
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Conference Paper
Full-text available
We address the problem of within-network classification in sparsely labeled networks. Recent work has demonstrated success with statistical relational learning (SRL) and semi-supervised learning (SSL) on such problems. However, both approaches rely on the availability of labeled nodes to infer the values of missing labels. When few labels are available, the performance of these approaches can degrade. In addition, many such approaches are sensitive to the specific set of nodes labeled. So, although average performance may be acceptable, the performance on a specific task may not. We explore a complementary approach to within-network classification, based on the use of label-independent (LI) features - i.e., features not influenced by the values of class labels. While previous work has made some use of LI features, the effects of these features on classification performance have not been extensively studied. Here, we present an empirical study in order to better understand these effects. Through experiments on several real-world data sets, we show that the use of LI features produces classifiers that are less sensitive to specific label assignments and can lead to performance improvements of over 40% for both SRL and SSL based classifiers. We also examine the relative utility of individual LI features and show that, in many cases, it is a combination of a few diverse network-structural characteristics that is most informative.
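To make the notion concrete, here is a small sketch of label-independent features computed purely from graph topology with networkx; the specific feature choices are illustrative, not the paper's exact set:

```python
import networkx as nx

def label_independent_features(G):
    # All three features depend only on structure, never on class labels.
    deg = dict(G.degree())
    clust = nx.clustering(G)
    btw = nx.betweenness_centrality(G)
    return {v: [deg[v], clust[v], btw[v]] for v in G}

G = nx.karate_club_graph()
features = label_independent_features(G)
print(features[0])  # structural feature vector for node 0
```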
Conference Paper
Full-text available
Given a graph, how can we extract good features for the nodes? For example, given two large graphs from the same domain, how can we use information in one to do classification in the other (i.e., perform across-network classification or transfer learning on graphs)? Also, if one of the graphs is anonymized, how can we use information in one to de-anonymize the other? The key step in all such graph mining tasks is to find effective node features. We propose ReFeX (Recursive Feature eXtraction), a novel algorithm, that recursively combines local (node-based) features with neighborhood (egonet-based) features; and outputs regional features -- capturing "behavioral" information. We demonstrate how these powerful regional features can be used in within-network and across-network classification and de-anonymization tasks -- without relying on homophily, or the availability of class labels. The contributions of our work are as follows: (a) ReFeX is scalable and (b) it is effective, capturing regional ("behavioral") information in large graphs. We report experiments on real graphs from various domains with over 1M edges, where ReFeX outperforms its competitors on typical graph mining tasks like network classification and de-anonymization.
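A toy rendering of the recursive idea, assuming degree as the only local feature and mean/sum as the neighborhood aggregates; the pruning of redundant features that the full algorithm performs is omitted:

```python
import networkx as nx
import numpy as np

def refex_like(G, levels=2):
    nodes = list(G.nodes())
    idx = {v: i for i, v in enumerate(nodes)}
    X = np.array([[G.degree(v)] for v in nodes], dtype=float)  # local feature
    for _ in range(levels):
        agg = []
        for v in nodes:
            nbrs = [idx[u] for u in G.neighbors(v)]
            block = X[nbrs] if nbrs else np.zeros((1, X.shape[1]))
            # Append neighborhood aggregates of the current features.
            agg.append(np.concatenate([block.mean(0), block.sum(0)]))
        X = np.hstack([X, np.array(agg)])
    return nodes, X

nodes, X = refex_like(nx.karate_club_graph())
```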
Conference Paper
Full-text available
Targeting interest to match a user with services (e.g. news, products, games, advertisements) and predicting friendship to build connections among users are two fundamental tasks for social network systems. In this paper, we show that the information contained in interest networks (i.e. user-service interactions) and friendship networks (i.e. user-user connections) is highly correlated and mutually helpful. We propose a framework that exploits homophily to establish an integrated network linking a user to interested services and connecting different users with common interests, upon which both friendship and interests could be efficiently propagated. The proposed friendship-interest propagation (FIP) framework devises a factor-based random walk model to explain friendship connections, and simultaneously it uses a coupled latent factor model to uncover interest interactions. We discuss the flexibility of the framework in the choices of loss objectives and regularization penalties and benchmark different variants on the Yahoo! Pulse social networking system. Experiments demonstrate that by coupling friendship with interest, FIP achieves much higher performance on both interest targeting and friendship prediction than systems using only one source of information.
Article
Full-text available
Social media has reshaped the way in which people interact with each other. The rapid development of participatory web and social networking sites like YouTube, Twitter, and Facebook, also brings about many data mining opportunities and novel challenges. In particular, we focus on classification tasks with user interaction information in a social network. Networks in social media are heterogeneous, consisting of various relations. Since the relation-type information may not be available in social media, most existing approaches treat these inhomogeneous connections homogeneously, leading to an unsatisfactory classification performance. In order to handle the network heterogeneity, we propose the concept of social dimension to represent actors’ latent affiliations, and develop a classification framework based on that. The proposed framework, SocioDim, first extracts social dimensions based on the network structure to accurately capture prominent interaction patterns between actors, then learns a discriminative classifier to select relevant social dimensions. SocioDim, by differentiating different types of network connections, outperforms existing representative methods of classification in social media, and offers a simple yet effective approach to integrating two types of seemingly orthogonal information: the network of actors and their attributes.
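A compact sketch of the two-stage pipeline, assuming modularity-matrix eigenvectors as the extracted social dimensions and logistic regression as the discriminative classifier; both choices are illustrative stand-ins:

```python
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

G = nx.karate_club_graph()
B = np.asarray(nx.modularity_matrix(G))   # modularity matrix
w, V = np.linalg.eigh(B)
dims = V[:, np.argsort(w)[::-1][:4]]      # top-4 "social dimensions"
y = [G.nodes[v]["club"] == "Mr. Hi" for v in G.nodes()]
clf = LogisticRegression(max_iter=1000).fit(dims, y)
```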
Article
Full-text available
A large family of algorithms - supervised or unsupervised; stemming from statistics or geometry theory - has been designed to provide different solutions to the problem of dimensionality reduction. Despite the different motivations of these algorithms, we present in this paper a general formulation known as graph embedding to unify them within a common framework. In graph embedding, each algorithm can be considered as the direct graph embedding or its linear/kernel/tensor extension of a specific intrinsic graph that describes certain desired statistical or geometric properties of a data set, with constraints from scale normalization or a penalty graph that characterizes a statistical or geometric property that should be avoided. Furthermore, the graph embedding framework can be used as a general platform for developing new dimensionality reduction algorithms. By utilizing this framework as a tool, we propose a new supervised dimensionality reduction algorithm called marginal Fisher analysis in which the intrinsic graph characterizes the intraclass compactness and connects each data point with its neighboring points of the same class, while the penalty graph connects the marginal points and characterizes the interclass separability. We show that MFA effectively overcomes the limitations of the traditional linear discriminant analysis algorithm due to data distribution assumptions and available projection directions. Real face recognition experiments show the superiority of our proposed MFA in comparison to LDA, also for corresponding kernel and tensor extensions.
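The unifying objective can be written compactly; the form below follows the common presentation of this framework and should be treated as a paraphrase, not a quote:

```latex
\[
  y^{*} \;=\; \arg\min_{\,y^{\top} B y = d}\;
  \sum_{i \neq j} \lVert y_i - y_j \rVert^{2} \, W_{ij}
  \;=\; \arg\min_{\,y^{\top} B y = d}\; y^{\top} L \, y ,
\]
```

where W is the similarity matrix of the intrinsic graph, L = D - W is its Laplacian with degree matrix D, and B encodes the penalty graph or the scale-normalization constraint. Different choices of W and B recover PCA, LDA, LLE, Laplacian eigenmaps, and the proposed MFA.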
Article
Full-text available
Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.
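A schematic of the lock-free pattern on a sparse logistic-regression problem: worker threads update a shared weight vector with no synchronization, so writes may race. Note that in CPython the GIL serializes bytecode, so this only mimics the memory-access pattern; the real speedups require native threads. The data and step size are placeholders:

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 100
X = (rng.random((n, d)) < 0.05) * rng.normal(size=(n, d))  # sparse-ish data
y = (rng.random(n) < 0.5) * 2.0 - 1.0                      # labels in {-1, +1}
w = np.zeros(d)                                            # shared, unsynchronized

def worker(rows, lr=0.1):
    for i in rows:
        xi = X[i]
        nz = np.nonzero(xi)[0]            # sparse update touches few coordinates
        margin = y[i] * xi[nz] @ w[nz]
        grad = -y[i] * xi[nz] / (1.0 + np.exp(margin))  # logistic-loss gradient
        w[nz] -= lr * grad                # racy write, no lock

threads = [threading.Thread(target=worker, args=(range(k, n, 4),)) for k in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```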
Article
Full-text available
Well-annotated gene sets representing the universe of the biological processes are critical for meaningful and insightful interpretation of large-scale genomic data. The Molecular Signatures Database (MSigDB) is one of the most widely used repositories of such sets. We report the availability of a new version of the database, MSigDB 3.0, with over 6700 gene sets, a complete revision of the collection of canonical pathways and experimental signatures from publications, enhanced annotations and upgrades to the web site. MSigDB is freely available for non-commercial use at http://www.broadinstitute.org/msigdb.
Article
Full-text available
The goal of the Biological General Repository for Interaction Datasets (BioGRID) (http://www.thebiogrid.org) is to archive and freely disseminate collections of genetic and protein interactions from major model organisms. BioGRID currently houses over 335,000 interactions curated from high-throughput datasets and individual focused studies found in the primary literature, as derived from some 23,000 publications. Complete coverage of the entire literature for both the budding yeast Saccharomyces cerevisiae and the fission yeast Schizosaccharomyces pombe has been achieved, resulting in the curation of over 246,000 interactions, and efforts to expand curation across multiple species are underway. Through collaborations with the Gene Ontology (GO) Consortium and the Linking Animal Models to Human Disease Initiative (LAMHDI), we are focusing our curation efforts across model organisms on particular areas of biology to enable insights into conserved networks and pathways that are relevant to human health. The BioGRID 3.0 web interface contains new search and display features that enable rapid queries across multiple data types and sources. A dedicated Interaction Management System (IMS) is used to track all curation and to prioritize publications across multiple curation projects. BioGRID data are incorporated in several model organism databases and other biological databases. The entire BioGRID interaction collection may be downloaded in multiple file formats, including PSI MI XML, and source code for BioGRID is freely available without any restrictions. This work is supported by NIH NCRR grant R01 RR024031 to MT and KD, and by grants from the CIHR and BBSRC to MT.
Article
Full-text available
The modern science of networks has brought significant advances to our understanding of complex systems. One of the most relevant features of graphs representing real systems is community structure, or clustering, i.e., the organization of vertices in clusters, with many edges joining vertices of the same cluster and comparatively few edges joining vertices of different clusters. Such clusters, or communities, can be considered as fairly independent compartments of a graph, playing a role similar to that of, e.g., the tissues or the organs in the human body. Detecting communities is of great importance in sociology, biology and computer science, disciplines where systems are often represented as graphs. This problem is very hard and not yet satisfactorily solved, despite the huge effort of a large interdisciplinary community of scientists working on it over the past few years. We will attempt a thorough exposition of the topic, from the definition of the main elements of the problem, to the presentation of most methods developed, with a special focus on techniques designed by statistical physicists, from the discussion of crucial issues like the significance of clustering and how methods should be tested and compared against each other, to the description of applications to real networks.
Article
Full-text available
Determining protein function is one of the most challenging problems of the post-genomic era. The availability of entire genome sequences and of high-throughput capabilities to determine gene coexpression patterns has shifted the research focus from the study of single proteins or small complexes to that of the entire proteome. In this context, the search for reliable methods for assigning protein function is of primary importance. There are various approaches available for deducing the function of proteins of unknown function using information derived from sequence similarity or clustering patterns of co-regulated genes, phylogenetic profiles, protein-protein interactions (refs. 5-8 and Samanta, M.P. and Liang, S., unpublished data), and protein complexes. Here we propose the assignment of proteins to functional classes on the basis of their network of physical interactions as determined by minimizing the number of protein interactions among different functional categories. Function assignment is proteome-wide and is determined by the global connectivity pattern of the protein network. The approach results in multiple functional assignments, a consequence of the existence of multiple equivalent solutions. We apply the method to analyze the yeast Saccharomyces cerevisiae protein-protein interaction network. The robustness of the approach is tested in a system containing a high percentage of unclassified proteins and also in cases of deletion and insertion of specific protein interactions.
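The global objective is easy to caricature with a greedy local search: repeatedly give each unannotated protein the category held by the majority of its neighbors, which is exactly the per-node choice that minimizes its edges crossing category boundaries. This sketch is illustrative only; the paper minimizes the global count rather than proceeding node by node:

```python
import networkx as nx
from collections import Counter

def assign_by_min_conflicts(G, known, categories, sweeps=10):
    # known: dict node -> category for already-annotated proteins.
    labels = {v: known.get(v, categories[0]) for v in G}
    unknown = [v for v in G if v not in known]
    for _ in range(sweeps):
        changed = False
        for v in unknown:
            # Majority neighbor category = fewest cross-category edges at v.
            counts = Counter(labels[u] for u in G.neighbors(v))
            if counts:
                best = counts.most_common(1)[0][0]
                if best != labels[v]:
                    labels[v] = best
                    changed = True
        if not changed:
            break
    return labels
```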
Article
Full-text available
The Biological General Repository for Interaction Datasets (BioGRID) database (http://www.thebiogrid.org) was developed to house and distribute collections of protein and genetic interactions from major model organism species. BioGRID currently contains over 198 000 interactions from six different species, as derived from both high-throughput studies and conventional focused studies. Through comprehensive curation efforts, BioGRID now includes a virtually complete set of interactions reported to date in the primary literature for both the budding yeast Saccharomyces cerevisiae and the fission yeast Schizosaccharomyces pombe. A number of new features have been added to the BioGRID including an improved user interface to display interactions based on different attributes, a mirror site and a dedicated interaction management system to coordinate curation across different locations. The BioGRID provides interaction data with monthly updates to Saccharomyces Genome Database, Flybase and Entrez Gene. Source code for the BioGRID and the linked Osprey network visualization system is now freely available without restriction.
Article
Full-text available
We present a new part-of-speech tagger that demonstrates the following ideas: (i) explicit use of both preceding and following tag contexts via a dependency network representation, (ii) broad use of lexical features, including jointly conditioning on multiple consecutive words, (iii) effective use of priors in conditional loglinear models, and (iv) fine-grained modeling of unknown word features. Using these ideas together, the resulting tagger gives a 97.24% accuracy on the Penn Treebank WSJ, an error reduction of 4.4% on the best previous single automatically learned tagging result.
Article
Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. The human brain confronts the same problem in everyday perception, extracting from its high-dimensional sensory inputs—30,000 auditory nerve fibers or 10^6 optic nerve fibers—a manageably small number of perceptually relevant features. Here we describe an approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set. Unlike classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), our approach is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations, such as human handwriting or images of a face under different viewing conditions. In contrast to previous algorithms for nonlinear dimensionality reduction, ours efficiently computes a globally optimal solution, and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.
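scikit-learn ships an independent implementation of this approach (Isomap); a minimal usage sketch on a synthetic manifold:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)
Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(Z.shape)  # (1000, 2): nonlinear 2-D coordinates
```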
Article
Graph-structured data appears frequently in domains including chemistry, natural language semantics, social networks, and knowledge bases. In this work, we study feature learning techniques for graph-structured inputs. Our starting point is previous work on Graph Neural Networks (Scarselli et al., 2009), which we modify to use gated recurrent units and modern optimization techniques and then extend to output sequences. The result is a flexible and broadly useful class of neural network models that has favorable inductive biases relative to purely sequence-based models (e.g., LSTMs) when the problem is graph-structured. We demonstrate the capabilities on some simple AI (bAbI) and graph algorithm learning tasks. We then show it achieves state-of-the-art performance on a problem from program verification, in which subgraphs need to be matched to abstract data structures.
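One propagation step of the gated update, sketched in PyTorch; the sizes, the toy adjacency, and the single edge type are illustrative assumptions:

```python
import torch
import torch.nn as nn

n_nodes, hidden = 5, 8
A = torch.zeros(n_nodes, n_nodes)
A[0, 1] = A[1, 2] = A[2, 3] = A[3, 4] = 1.0   # toy directed edges
h = torch.randn(n_nodes, hidden)              # node hidden states
msg_lin = nn.Linear(hidden, hidden, bias=False)
gru = nn.GRUCell(hidden, hidden)

m = A @ msg_lin(h)   # each node sums transformed neighbor states
h = gru(m, h)        # gated recurrent update, applied per node
```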
Article
Networks provide a powerful way to study complex systems of interacting objects. Detecting network communities-groups of objects that often correspond to functional modules-is crucial to understanding social, technological, and biological systems. Revealing communities allows for analysis of system properties that are invisible when considering only individual objects or the entire system, such as the identification of module boundaries and relationships or the classification of objects according to their functional roles. However, in networks where objects can belong to multiple modules at once, the decomposition of a network into overlapping communities remains a challenge. Here we present a new paradigm for uncovering the modular structure of complex networks, based on a decomposition of a network into any combination of overlapping, nonoverlapping, and hierarchically organized communities. We demonstrate on a diverse set of networks coming from a wide range of domains that our approach leads to more accurate communities and improved identification of community boundaries. We also unify two fundamental organizing principles of complex networks: the modularity of communities and the commonly observed core-periphery structure. We show that dense network cores form as an intersection of many overlapping communities. We discover that communities in social, information, and food web networks have a single central dominant core while communities in protein-protein interaction (PPI) as well as product copurchasing networks have small overlaps and form many local cores.
Article
We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk's latent representations on several multi-label network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines which are allowed a global view of the network, especially in the presence of missing information. DeepWalk's representations can provide F1 scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk's representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially parallelizable. These qualities make it suitable for a broad class of real world applications such as network classification, and anomaly detection.
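The pipeline is short enough to sketch end to end: generate truncated random walks, then train a skip-gram model on them as if they were sentences. Here gensim stands in for the paper's own training code, and the walk counts and lengths are placeholders:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, num_walks=10, walk_len=20, seed=0):
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for v in G.nodes():
            walk = [v]
            while len(walk) < walk_len:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(u) for u in walk])  # walks as "sentences"
    return walks

G = nx.karate_club_graph()
model = Word2Vec(random_walks(G), vector_size=32, window=5, sg=1, min_count=1)
emb = model.wv[str(0)]  # latent representation of node 0
```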
Article
Given a network, intuitively two nodes belong to the same role if they have similar structural behavior. Roles should be automatically determined from the data, and could be, for example, "clique-members," "periphery-nodes," etc. Roles enable numerous novel and useful network-mining tasks, such as sense-making, searching for similar nodes, and node classification. This paper addresses the question: Given a graph, how can we automatically discover roles for nodes? We propose RolX (Role eXtraction), a scalable (linear in the number of edges), unsupervised learning approach for automatically extracting structural roles from general network data. We demonstrate the effectiveness of RolX on several network-mining tasks: from exploratory data analysis to network transfer learning. Moreover, we compare network role discovery with network community discovery. We highlight fundamental differences between the two (e.g., roles generalize across disconnected networks, communities do not); and show that the two approaches are complementary in nature.
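The extraction step can be sketched as non-negative matrix factorization of a node-by-feature matrix, whose rows then give soft role memberships; the hand-picked features below are a simple stand-in for the recursive features the method actually builds on:

```python
import networkx as nx
import numpy as np
from sklearn.decomposition import NMF

G = nx.karate_club_graph()
feats = np.array([[G.degree(v),
                   nx.clustering(G, v),
                   sum(G.degree(u) for u in G.neighbors(v))]
                  for v in G.nodes()], dtype=float)  # all non-negative
roles = NMF(n_components=2, init="nndsvd", random_state=0).fit_transform(feats)
# roles[i] ~ soft membership of node i in each structural role
```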
Article
The Internet has become a rich and large repository of information about us as individuals. Anything from the links and text on a user’s homepage to the mailing lists the user subscribes to is a reflection of the social interactions the user has in the real world. In this paper we devise techniques and tools to mine this information in order to extract social networks and the exogenous factors underlying the networks’ structure. In an analysis of two data sets, from Stanford University and the Massachusetts Institute of Technology (MIT), we show that some factors are better indicators of social connections than others, and that these indicators vary between user populations. Our techniques provide potential applications in automatically inferring real world connections and discovering, labeling, and characterizing communities.
Conference Paper
Predicting the occurrence of links is a fundamental problem in networks. In the link prediction problem we are given a snapshot of a network and would like to infer which interactions among existing members are likely to occur in the near future or which existing interactions we are missing. Although this problem has been extensively studied, the challenge of how to effectively combine the information from the network structure with rich node and edge attribute data remains largely open. We develop an algorithm based on Supervised Random Walks that naturally combines the information from the network structure with node and edge level attributes. We achieve this by using these attributes to guide a random walk on the graph. We formulate a supervised learning task where the goal is to learn a function that assigns strengths to edges in the network such that a random walker is more likely to visit the nodes to which new links will be created in the future. We develop an efficient training algorithm to directly learn the edge strength estimation function. Our experiments on the Facebook social graph and large collaboration networks show that our approach outperforms state-of-the-art unsupervised approaches as well as approaches that are based on feature extraction.
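A sketch of the forward computation this describes, under assumed notation: edge strengths a_uv = sigmoid(w·x_uv) bias a random walk with restart, and the resulting visit probabilities rank candidate link targets. The training loop that fits w against observed new links is omitted:

```python
import numpy as np

def biased_visit_probs(adj, feats, w, s, alpha=0.15, iters=100):
    # adj: (n,n) 0/1 adjacency; feats: (n,n,k) edge features; s: seed node.
    strength = adj / (1.0 + np.exp(-feats @ w))   # a_uv = sigmoid(w . x_uv)
    row = strength.sum(1, keepdims=True)
    P = np.divide(strength, row, out=np.zeros_like(strength), where=row > 0)
    p = np.full(len(adj), 1.0 / len(adj))
    e = np.zeros(len(adj)); e[s] = 1.0
    for _ in range(iters):
        p = (1 - alpha) * p @ P + alpha * e       # walk with restart at s
    return p                                      # (dangling rows leak mass;
                                                  #  acceptable for a sketch)
```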
Article
Nowadays, multi-label classification methods are increasingly required by modern applications, such as protein function classification, music categorization and semantic scene classification. This paper introduces the task of multi-label classification, organizes the sparse related literature into a structured presentation and reports comparative experimental results for several multi-label classification methods. It also contributes the definition of concepts for the quantification of the multi-label nature of a data set.
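The simplest reduction such surveys cover, binary relevance, trains one independent binary classifier per label; a minimal scikit-learn version, with synthetic placeholder data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.random.default_rng(0).normal(size=(100, 5))
Y = (X[:, :3] > 0).astype(int)   # 3 labels; a sample may carry several
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(X[:2]))        # rows are binary label vectors
```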
Article
Many areas of science depend on exploratory data analysis and visualization. The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. Here, we introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text.
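As with Isomap above, scikit-learn provides an independent implementation; a minimal usage sketch:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)
Z = LocallyLinearEmbedding(n_neighbors=12, n_components=2).fit_transform(X)
```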
Article
Network models are widely used to represent relational information among interacting units. In studies of social networks, recent emphasis has been placed on random graph models where the nodes usually represent individual social actors and the edges represent the presence of a specified relation between actors. We develop a class of models where the probability of a relation between actors depends on the positions of individuals in an unobserved "social space." Inference for the social space is developed within a maximum likelihood and Bayesian framework, and Markov chain Monte Carlo procedures are proposed for making inference on latent positions and the effects of observed covariates. We present analyses of three standard datasets from the social networks literature, and compare the method to an alternative stochastic blockmodeling approach. In addition to improving upon model fit, our method provides a visual and interpretable model-based spatial representation of social relationships, and improves upon existing methods by allowing the statistical uncertainty in the social space to be quantified and graphically represented.
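The core model is compact; in the distance parameterization usually given for this family (stated from memory, so treat the exact form as a paraphrase):

```latex
\[
  \log\frac{\Pr(y_{ij}=1 \mid z_i, z_j, x_{ij})}
           {\Pr(y_{ij}=0 \mid z_i, z_j, x_{ij})}
  \;=\; \alpha + \beta^{\top} x_{ij} - \lVert z_i - z_j \rVert ,
\]
```

where z_i is actor i's latent position and x_{ij} collects observed pair covariates: edges become more likely the closer two actors sit in the latent space.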
Article
Given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future? We formalize this question as the link prediction problem, and develop approaches to link prediction based on measures for analyzing the "proximity" of nodes in a network. Experiments on large co-authorship networks suggest that information about future interactions can be extracted from network topology alone, and that fairly subtle measures for detecting node proximity can outperform more direct measures.
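Two of the topology-only proximity scores this line of work studies are one-liners in networkx; the node pairs below are arbitrary examples:

```python
import networkx as nx

G = nx.karate_club_graph()
pairs = [(0, 33), (5, 6)]
for u, v, score in nx.jaccard_coefficient(G, pairs):
    print("jaccard", u, v, round(score, 3))
for u, v, score in nx.adamic_adar_index(G, pairs):
    print("adamic-adar", u, v, round(score, 3))
```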
Article
Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on a manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for constructing a representation for data sampled from a low dimensional manifold embedded in a higher dimensional space. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality preserving properties and a natural connection to clustering. Several applications are considered.
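A bare-bones version of the pipeline this describes: build a k-nearest-neighbor graph, form the Laplacian L = D - W, and embed with the eigenvectors of its smallest nonzero eigenvalues. Plain 0/1 weights stand in for the heat-kernel weights discussed in the paper, and the dense eigensolver is only suitable at this toy scale:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

X = np.random.default_rng(0).normal(size=(300, 10))
W = kneighbors_graph(X, n_neighbors=8, mode="connectivity")
W = 0.5 * (W + W.T)             # symmetrize the kNN graph
L = laplacian(W).toarray()      # L = D - W
vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
Z = vecs[:, 1:3]                # skip the constant eigenvector (eigenvalue 0)
```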
Large text compression benchmark
  • M Mahoney
M. Mahoney. Large text compression benchmark. www.mattmahoney.net/dc/textdata, 2011.
GraRep: Learning Graph Representations with global structural information
  • S Cao
  • W Lu
  • Q Xu
S. Cao, W. Lu, and Q. Xu. GraRep: Learning Graph Representations with global structural information. In CIKM, 2015.
Social computing data repository at ASU
  • R Zafarani
  • H Liu
R. Zafarani and H. Liu. Social computing data repository at ASU, 2009.
SNAP Datasets: Stanford large network dataset collection
  • J Leskovec
  • A Krevl
J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
The link-prediction problem for social networks
  • D Liben-Nowell
  • J Kleinberg
D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 2007.