Article

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Authors: Sergey Brin, Lawrence Page

Abstract

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype, with a full text and hyperlink database of at least 24 million pages, is available at http://google.stanford.edu/.

To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms, and they answer tens of millions of queries every day. Despite the importance of large-scale search engines on the Web, very little academic research has been done on them. Furthermore, due to rapid advances in technology and Web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date.

Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses the question of how to build a practical large-scale system which can exploit the additional information present in hypertext. We also look at the problem of how to effectively deal with uncontrolled hypertext collections, where anyone can publish anything they want.

Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google


... The applications of Google matrix algorithms to the fibrosis PPI based on the MetaCore network are described in [16] using the fibrosis responses obtained in [9]. We note that the Google matrix algorithms [17][18][19] find a variety of useful applications in modern complex networks [20] including World Wide Web, Wikipedia, world trade etc. ...
... The universal mathematical methods to analyze such networks are generic and based on the concept of Markov chains and Google matrix [17][18][19]. The validity of these methods has been confirmed for various directed networks from various fields of science. ...
... The Google matrix of the global MetaCore PPI network G is constructed with specific rules briefly described in the next Section 2.3 and in detail in [17][18][19]. The matrix G is obtained from a matrix of Markov chain transition elements S ij that give weights of transitions between nodes. ...
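For context, the construction these excerpts refer to follows the standard Google matrix recipe (a general sketch, not the MetaCore-specific details): $G_{ij} = \alpha S_{ij} + (1-\alpha)/N$, where $S$ is the stochastic transition matrix obtained by normalizing the adjacency matrix (columns corresponding to dangling nodes replaced by $1/N$), $N$ is the number of nodes, and $\alpha \approx 0.85$ is the damping factor.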
Article
Full-text available
Myocardial fibrosis is a major pathologic disorder associated with a multitude of cardiovascular diseases (CVD). The pathogenesis is complex and encompasses multiple molecular pathways. Integration of fibrosis-associated genes into the global MetaCore network of protein-protein interactions (PPI) offers opportunities to identify PPI with functional and therapeutic significance. Here, we report the generation of a fibrosis-focused PPI network and identification of fibroblast-specific arbitrators driving reparative and reactive myocardial fibrosis. In TGF-β-mediated fibroblast activation, the developed network analysis predicts new regulatory mechanisms for fibrosis-associated genes. We introduce an efficient Erdös barrage approach to suppress activation of a number of fibrosis-associated nodes in order to reverse fibrotic cascades. In the network model, each protein node is characterized by an Ising up or down spin, corresponding to an activated or repairing state, acting on other nodes that are initially in a neutral state. An asynchronous Monte Carlo process describes fibrosis progression determined by a dominant action of linked proteins. Our results suggest that the constructed Ising Network Fibrosis Interaction model offers network insights into fibrosis mechanisms and can complement future experimental efforts to counteract cardiac fibrosis.
... The PR method, a variant of Eigenvector centrality [40], was originally developed for the Google search engine [32]. It evaluates a webpage's importance based on its links to other pages or relevance to a searched topic. ...
... The PageRank algorithm was developed by Google founders Larry Page and Sergey Brin in the late 1990s [32]. It is a link analysis algorithm used to measure the importance or relevance of web pages. ...
Article
Full-text available
Finding the most influential nodes in complex networks is a significant challenge with applications in various fields, including social networks, biology, and transportation systems. Many existing methods rely on different structural properties but often overlook complementary features. This paper highlights the complementary nature of K-Shell and PageRank and proposes a novel linear metric that combines them. Through extensive comparisons of 19 real-world and several artificial networks, the proposed method demonstrates superior accuracy, resolution, and computational efficiency. Evaluations against 11 state-of-the-art methods, including IDME, HGSM, and DNC, underscore the superiority of the proposed approach. Notably, the average accuracy has increased by 33.3% compared to PageRank and 23.1% compared to K-Shell, emphasizing the importance of integrating these two features.
... PageRank [28], which assigns a high rank to web pages that are pointed to by popular pages, has inspired many expert-ranking techniques in this domain. An adaptation of the PageRank algorithm for Twitter is TwitterRank [29]. ...
... This test is carried out to assess the capability of the proposed technique in truly representing the ratings given by the individual nodes. The results are compared against the graph-based baseline that utilizes the PageRank algorithm [28], referred to as BL1, and a reputation-based technique for ranking workers in an enterprise [19], referred to as BL2 in the text. The reputation structure adopted by this technique is a normal distribution (NDR) [47]. ...
Article
Full-text available
The emergence of online enterprises spread across continents has given rise to the need for managing the tacit knowledge and expertise of employees. Scenarios that include the intention of the employer to find tacit expertise and knowledge of an employee that is not documented or self-disclosed have been addressed in this article. In today's world, management of tacit knowledge has become important for organizations. Recent studies have also proposed hosting a tacit knowledge management module in the cloud for large global enterprises. There are many conceptual frameworks for the acquisition and processing of tacit knowledge. This article proposes a reputation-based approach utilizing social interactions of employees to identify experts based on their tacit knowledge. The existing reputation-based approaches towards expertise ranking in enterprises utilize PageRank, Normal distribution, and the Hidden Markov model for expertise ranking. These models suffer from negative referral, collusion, reputation inflation, and dynamism. However, the authors have proposed a Bayesian-based approach utilizing the beta probability distribution reputation model for employee ranking in enterprises that can be hosted as a cloud service for the employees of the enterprise. The experimental results reveal improved performance compared to previous techniques in terms of mean average error (MAE) for the three data sets. The proposed scheme is able to differentiate categories of interactions in a dynamic context. The results reveal that the technique is independent of the rating pattern and density of data.
... In parallel to advances in graph representation via matrices, centrality metrics have proved to be insightful in the study of graphs. Chief among them is the success of the PageRank centrality criterion revealing the significance of certain webpages (Brin & Page, 1998) and playing a role in the formation of what is now one of the largest companies worldwide. But also an even older metric, the k-core centrality (Seidman, 1983;Malliaros et al., 2020), as well as the degree centrality, closeness centrality, and betweenness centrality, have proven to be impactful in revealing key structural properties of graphs (Freeman, 1977;Zhang & Luo, 2017). ...
... corresponds to the PageRank score (Brin & Page, 1998). The PageRank score quantifies the likelihood of a random walk visiting a particular node, serving as a fundamental metric for evaluating node significance in various networks. ...
Preprint
Full-text available
Graph Shift Operators (GSOs), such as the adjacency and graph Laplacian matrices, play a fundamental role in graph theory and graph representation learning. Traditional GSOs are typically constructed by normalizing the adjacency matrix by the degree matrix, a local centrality metric. In this work, we instead propose and study Centrality GSOs (CGSOs), which normalize adjacency matrices by global centrality metrics such as the PageRank, k-core or count of fixed length walks. We study spectral properties of the CGSOs, allowing us to get an understanding of their action on graph signals. We confirm this understanding by defining and running the spectral clustering algorithm based on different CGSOs on several synthetic and real-world datasets. We furthermore outline how our CGSO can act as the message passing operator in any Graph Neural Network and in particular demonstrate strong performance of a variant of the Graph Convolutional Network and Graph Attention Network using our CGSOs on several real-world benchmark datasets.
... In this equation, PR(i) represents the PageRank score of user i, n is the total number of users, N * (i) is the set of users connected to user i in the refined social graph, and |N * (j)| is the number of connections of user j. The factor d, known as the damping factor, typically set to 0.85, represents the probability of a random user following the connections in the social graph versus randomly jumping to any user [13]. The term (1 − d)∕n ensures that every user has a minimum base PageRank, preventing nodes with no incoming edges from having a zero score. ...
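The equation referenced in this excerpt is not reproduced on this page; with the symbols defined above, the standard PageRank recurrence it describes would read $PR(i) = \frac{1-d}{n} + d \sum_{j \in N^*(i)} \frac{PR(j)}{|N^*(j)|}$.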
Article
Full-text available
Precisely recommending relevant items to users is a challenging task because the user's rating can be influenced by various features. Therefore, social recommender systems have recently been introduced to leverage both the user-item interaction graph and the user-user social relation graph for more accurate rating predictions. Moreover, as graph neural networks (GNN) have demonstrated superior performance in graph representation learning, several algorithms have been developed to incorporate GNN into social recommender systems. However, when the sizes of the social graph and user-item graph are very large, the computational demands of existing GNN-based social recommender systems for aggregating user and item nodes become the primary bottleneck. In this paper, we develop a novel lightweight GNN-based social recommender system (called LiteGSR) that effectively reduces the computational overhead associated with aggregation operations while maintaining accuracy. To achieve this, we propose a new approach for refining the social graph by utilizing PageRank-based centrality scores of users and adapting representative virtual users in the user-item graph. Experimental results demonstrate that our new social recommender system outperforms existing state-of-the-art recommender systems in both accuracy and training time.
... In network science, centrality measures are used to identify the most important or central nodes in a network. The key centrality measures used here are degree centrality and betweenness centrality (as developed by Freeman, [1977]), and PageRank centrality (as developed by Brin & Page, [1998]). We do not use Freeman's (1977) closeness centrality because closeness centrality provides inaccurate results if a network is unconnected, and since the JLT is a journal in the Humanities, which is a field dominated by single-authored papers (Wang & Barabási, 2021), we expected the network to be too fragmented to use closeness centrality. ...
Article
Full-text available
This study provides a quantitative overview of the Journal for Language Teaching from 2001 to 2023. More specifically, the current study applies network science to study both the co-authorship network and to identify topics. In addition, the journal's focus on multilingualism is investigated. The results indicate a notable growth in collaborative research in the journal, shown by the increasing average number of authors per paper. The analysis of the co-authorship network reveals a moderately connected network, with a significant group of authors forming the giant component. Important authors are also recognised based on centrality measures, highlighting their crucial roles in fostering connections within the network. Collaboration primarily happens within universities, but when it extends across institutions, inland universities tend to collaborate more frequently than those on the coast or between coastal and inland universities. Furthermore, the analysis of research topics identified eight distinct themes prevalent in the Journal for Language Teaching, encompassing various areas in language education. It is also shown that both in the language of papers and in their language focus, the journal foregrounds English throughout this period, and papers tend to be more often in English and focus on English in recent years. Keywords: academic publishing, authorship patterns, co-authorship networks, language education, language teaching research, publication trends, research trends, scholarly communication
... PageRank is a classic graph mining algorithm (Brin and Page 1998) for weighting and ranking the nodes of a graph. It takes graph G as input and provides a weighted rank p ∈ R n to each node based on the random walk distribution on the input graph. ...
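As a concrete illustration of the random-walk ranking these excerpts describe, the sketch below implements PageRank by power iteration in Python (a minimal reading of the algorithm, not the original authors' implementation; the dict-based graph format and the uniform handling of dangling nodes are assumptions made for the example):

def pagerank(graph, d=0.85, iters=100):
    """Power iteration for PageRank over a graph given as {node: [out-links]}."""
    nodes = list(graph)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}                # start from the uniform distribution
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in nodes}     # teleportation term (1 - d) / n
        for v in nodes:
            out = graph[v]
            if not out:                             # dangling node: spread its rank uniformly
                for u in nodes:
                    new[u] += d * pr[v] / n
            else:
                for u in out:                       # follow an out-link with probability d
                    new[u] += d * pr[v] / len(out)
        pr = new
    return pr

# Toy example: "c" is linked from both "a" and "b", so it receives the highest score.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))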
Article
Graph neural networks are powerful graph representation learners in which node representations are highly influenced by features of neighboring nodes. Prior work on individual fairness in graphs has focused only on node features rather than structural issues. However, from the perspective of fairness in high-stakes applications, structural fairness is also important, and the learned representations may be systematically and undesirably biased against unprivileged individuals due to a lack of structural awareness in the learning process. In this work, we propose a pre-processing bias mitigation approach for individual fairness that gives importance to local and global structural features. We mitigate the local structure discrepancy of the graph embedding via a locally fair PageRank method. We address the global structure disproportion between pairs of nodes by introducing truncated singular value decomposition-based pairwise node similarities. Empirically, the proposed pre-processed fair structural features have superior performance in individual fairness metrics compared to the state-of-the-art methods while maintaining prediction performance.
... This is indicative of organizational maintenance because it identifies CSOs which others are likely to rely upon for access to political information. Finally, PageRank is used to evaluate success in an internal competition for attention (Brin and Page 1998). It measures the extent to which users are connected to users who are connected to many other users in directed networks. ...
Thesis
Full-text available
What explains the politicization of EU trade agreement negotiations? Resonance with the contestation of others! In this thesis, I present a generic theory of politicization developed in relation to trade policy, but which is broadly applicable across policy fields. I argue that politicization is caused by actors who are not affected by policy decisions reacting to contestation by those who are. My theory is that actors form coalitions by advocating for resolutions to the problems of others strategically, to influence the closure of their own field of interaction, or the control of enforcement resources within it. I call this strategic behaviour 'resonance' and construct a conditional logic through which different actors are motivated to take such political action. I build on Bartolini's (2018) distinction of political action as being motivated by the will to achieve the behavioural compliance of others by affecting conditions of closure or control. From this theoretical starting point, I derive a set of ideal-typical situations in which resonating with counterparts from another field will resolve dilemmas for different types of political actors. I test the theory using a combination of statistical analysis, network science and quantitative text analysis to determine if actors 'resonate' as the theory anticipates. The thesis demonstrates the plausibility of the theory using cases spanning EU trade agreement negotiations over the past 20 years. It shows that even though politicization often begins because of stakeholder contestation, further contestation and the eventual institutionalization of politicized issues are the result of strategic behaviour by actors who resonate with stakeholders to resolve dilemmas of their own.
... Eigenvector centrality is a key parameter for assessing the importance and impact of a vertex within a graph (or a node in a network). Noticeably, Google's PageRank algorithm shares similarities with eigenvector centrality [46]. Denoted as EC_i for node i, this centrality measure initially assumes a value of one and is calculated based on the following formula [47]. ...
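The formula cited as [47] is not shown in this excerpt; the standard eigenvector-centrality recurrence it describes would be $EC_i = \frac{1}{\lambda} \sum_{j} A_{ij} EC_j$, where $A$ is the adjacency matrix and $\lambda$ its largest eigenvalue, with every $EC_i$ initialized to one as stated.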
Article
Full-text available
The role of clustering in unsupervised fault diagnosis is significant, but different clustering techniques can yield varied results and cause inevitable uncertainty. Ensemble clustering methods have been introduced to tackle this challenge. This study presents a novel integrated technique in the field of fault diagnosis using spectral ensemble clustering. A new dimensionality reduction technique is proposed to intelligently identify faults, even in ambiguous scenarios, by exploiting the informative segment of the underlying bipartite graph. This is achieved by identifying and extracting the most informative sections of the bipartite graph based on the eigenvector centrality measure of nodes within the graph. The proposed method is applied to experimental current-voltage (I-V) curve data collected from a real photovoltaic (PV) platform. The obtained results remarkably improved the accuracy of aging fault detection to more than 83.50%, outperforming the existing state-of-the-art approaches. We also separately analyzed the ensemble clustering part of our FDD method, which showed superior performance compared to similar methods when evaluated on commonly used datasets such as handwritten-digit datasets. This proves that the proposed approach inherently holds promise for application in various real-world scenarios characterized by ambiguity and complexity.
... Societal biases such as homophily and in-group favouritism [104] shape the social networks in a specific way as shown in figure 2. Now let us consider a ranking algorithm that harvests this structural information to rank the most influential people in the field. The most prominent and widely used algorithm is PageRank [105]. Such ranking algorithms are commonly used in various flavours of professional social networking applications such as LinkedIn, Google Scholar and ResearchGate. ...
Article
Full-text available
In this article, we identify challenges in the complex interaction between artificial intelligence (AI) systems and society. We argue that AI systems need to be studied in their socio-political context to be able to better appreciate a diverse set of potential outcomes that emerge from long-term feedback between technological development, inequalities and collective decision-making processes. This means that assessing the risks from the deployment of any specific technology presents unique challenges. We propose that risk assessments concerning AI systems should incorporate a complex systems perspective, with adequate models that can represent short- and long-term effects and feedback, along with an emphasis on increasing public engagement and participation in the process. This article is part of the theme issue ‘Co-creating the future: participatory cities and digital governance’.
... In [KW23] it is demonstrated that every stage of SimRank [JW02] can be expressed by a P LA(σ)-formula (defined in Definition 4.14 below) that uses only admissible aggregation functions. One can also show (which is simpler) that every stage of PageRank [BP98] can be expressed by a P LA(σ)-formula with only admissible aggregation functions. ...
Article
We consider logics with truth values in the unit interval [0,1]. Such logics are used to define queries and to define probability distributions. In this context the notion of almost sure equivalence of formulas is generalized to the notion of asymptotic equivalence. We prove two new results about the asymptotic equivalence of formulas where each result has a convergence law as a corollary. These results as well as several older results can be formulated as results about the relative asymptotic expressivity of inference frameworks. An inference framework F\mathbf{F} is a class of pairs (P,L)(\mathbb{P}, L), where P=(Pn:n=1,2,3,)\mathbb{P} = (\mathbb{P}_n : n = 1, 2, 3, \ldots), Pn\mathbb{P}_n are probability distributions on the set Wn\mathbf{W}_n of all σ\sigma-structures with domain {1,,n}\{1, \ldots, n\} (where σ\sigma is a first-order signature) and L is a logic with truth values in the unit interval [0,1][0, 1]. An inference framework F\mathbf{F}' is asymptotically at least as expressive as an inference framework F\mathbf{F} if for every (P,L)F(\mathbb{P}, L) \in \mathbf{F} there is (P,L)F(\mathbb{P}', L') \in \mathbf{F}' such that P\mathbb{P} is asymptotically total variation equivalent to P\mathbb{P}' and for every φ(xˉ)L\varphi(\bar{x}) \in L there is φ(xˉ)L\varphi'(\bar{x}) \in L' such that φ(xˉ)\varphi'(\bar{x}) is asymptotically equivalent to φ(xˉ)\varphi(\bar{x}) with respect to P\mathbb{P}. This relation is a preorder. If, in addition, F\mathbf{F} is at least as expressive as F\mathbf{F}' then we say that F\mathbf{F} and F\mathbf{F}' are asymptotically equally expressive. Our third contribution is to systematize the new results of this paper and several previous results in order to get a preorder on a number of inference systems that are of relevance in the context of machine learning and artificial intelligence.
... Hidalgo et al. (2007); Hidalgo and Hausmann (2009);Spelta et al. (2023); Pagnottoni and Spelta (2024)). To shed light on higher order properties of the global FDI network, we make use of the PageRank centrality (Brin & Page, 1998), which has been applied in several economics contexts (Bonaccorsi et al., 2019;Yun et al., 2019;Rovira Kaltwasser & Spelta, 2019). The idea behind PageRank is that the importance of a node increases with the importance of its neighbours, of its neighbours' neighbours, and so on. ...
Article
Full-text available
Understanding Foreign Direct Investments (FDI) networks is crucial for both economic and societal reasons. While FDI can boost growth and development, it can also harm the environment if not properly managed. This paper employs network theory to analyze the evolution of the FDI network. We delve into the topological properties of the FDI network, considering country-to-country aggregate relationships and industry sub-networks. We observe contrasting patterns in node importance, such as an increasing centralization in the aggregate network and a more uniform distribution in industry sub-networks. Furthermore, we examine the presence of core-periphery structure, revealing an emergent core in the aggregate FDI network and a decreasing core-periphery structure in industry sub-networks over time. Our analysis also uncovers preferential attachment regimes at the aggregate level, influenced by the cumulative advantage of early entrants. At the industry level, instead, the economic catch-up by latecomers means that new links are established irrespective of when firms entered the market. Finally, to reproduce our empirical findings, we introduce a network-building algorithm grounded on a fitness function, capturing the observed heterogeneity in industry sub-networks.
... For example, sentiment analysis is frequently used in marketing to assess consumer attitudes, whereas predictive modeling is often applied to forecast disease outbreaks in public health. These approaches underscore the value of datadriven decision-making in various fields, such as marketing, public health, social policy, and urban planning, which increasingly rely on precise analytics to effectively respond to evolving challenges (Brin & Page, 1998;Callahan, 2014). ...
Article
Full-text available
The rise of data analytics has transformed our understanding of human and social behavior by utilizing data from digital interactions, social platforms, and various other sources. This study explored the value of analytics techniques-sentiment analysis, network analysis, and predictive modeling-in capturing individual and collective behaviors. Such insights enable decision making in fields such as marketing, public health, social policy, and urban planning. However, challenges such as data bias, ethical considerations, and complexity of human behavior underscore the need for advanced methods and human oversight. To address these complexities, the proposed framework integrates multimodal sentiment analysis, context-aware network models, and adaptive predictive modeling. This comprehensive approach supports nuanced analysis that aids in real-time decision-making and promotes fair and transparent use of analytics in human and social contexts.
... The search space has been largely dominated by a few major companies that have pioneered both keyword-based and semantic search technologies. Google, for example, is renowned for its use of PageRank, a link-analysis ranking algorithm that laid the foundation for early web search [1]. Over the years, Google has incorporated semantic search techniques, such as the Knowledge Graph and BERT (Bidirectional Encoder Representations from Transformers), which enhance its ability to understand user intent and provide more contextually relevant results [4]. ...
Article
The advent of semantic search alongside the traditional robustness of keyword search systems has established an intriguing dichotomy within the landscape of information retrieval. While keyword-based methods excel in terms of specificity, semantic models offer contextual richness and intent awareness. However, each paradigm faces challenges in harnessing the strengths of the other. This paper investigates a hybrid search model that integrates both keyword and semantic search techniques, aiming to optimize relevance, interpretability, and user experience in complex information-seeking scenarios. We present an innovative hybrid architecture, conduct a comprehensive evaluation of its performance, and discuss its implications for information retrieval in specialized domains such as healthcare and education. Index Terms—Hybrid Search, Keyword Search, Semantic Search, Information Retrieval, Natural Language Processing, Search Architectures
... DC, NCC, NBC) which are only based on local neighbourhood information, the PageRank algorithm initiates by assigning the initial PageRank values to each node in the network, reflecting their initial importance or centrality within the network. Subsequently, the PageRank algorithm iteratively refines these values until they converge to stable values [61]. The PageRank value is calculated as shown in Eq. (7): ...
... Calculating the metrics: centrality. It is calculated in accordance with the PageRank algorithm [19,20], designated P(Node_i) and based on the importance of a vertex: the more important the vertex is, the more possible paths lead from all vertices of the graph to it. The attenuation coefficient lies in [0, 1] and is usually taken to be 0.85, and a_{Node_i,Node_j} is an element of the adjacency matrix of the graph simulating Sys, with a_{Node_i,Node_j} = 1 if (Node_j, Node_i) ∈ Edges and a_{Node_i,Node_j} = 0 otherwise; criticality by connectivity. ...
Article
Full-text available
The paper proposes a technique for protecting reconfigurable networks that implements topology rebuilding, which combines immunization and network gaming methods, as a solution for maintaining cyber resilience. Immunization presumes an adaptive set of protective reconfigurations intended to ensure the functioning of a network. It is a protective reconfiguration aimed at preserving or increasing the functional quality of the system. Network nodes and edges are adaptively reorganized to counteract an invasion. This is a functional component of cyber resilience. It can be implemented as a global strategy, using knowledge of the whole network structure, or a local strategy that only works with a certain part of a network. A formal description of global and local immune strategies based on hierarchical and peer-to-peer network topologies is presented. A network game is a kind of well-defined game model in which each situation generates a specific network, and the payoff function is calculated based on the constructed networks. A network game is proposed for analyzing a network topology. This model allows nodes that require disconnection or replacement during a cyber attack to be identified quickly, and shows which network sectors might be affected by an attack. The gaming method keeps the network topology resistant to unnecessary connections. This is a structural component of cyber resilience. The basic network game method has been improved by using the criterion of maximum possible path length to reduce the number of reconfigurations. Network optimization works together with immunization to preserve the structural integrity of the network. In an experimental study, the proposed method demonstrated its effectiveness in maintaining system quality within given functional limits and reducing the cost of system protective restructuring.
... The paradigm shift in web search pioneered by Google, in the late 1990s, was to move from keyword search to predominantly link-based relevance estimation (Brin and Page 1998). Thus, it is useful to use Google's famed PageRank algorithm as typifying the paradigm. ...
Article
Web search engines arguably form the most popular data-driven systems in contemporary society. They wield a considerable power by functioning as gatekeepers of the Web. Since the late 1990s, search engines have been dominated by the paradigm of link-based web search. In this paper, we critically analyse the Political Economy of the paradigm of link-based web search, drawing upon insights and methodologies from Critical Political Economy. We illustrate how link-based web search has led to phenomena that favour capital through long-term structural changes on the Web, and how it has led to accentuating unpaid digital labour and ecologically unsustainable practices, among several others. We show how contemporary observations on the degrading quality of link-based web search can be traced back to the internal contradictions with the paradigm, and how such socio-technical phenomena may lead to an eventual disutility of the link-based model. Our contribution is on enhancing the understanding of the Political Economy of link-based web search, and laying bare the phenomena at work, towards catalysing the search for alternative models of content organisation and search on the Web.
... In the 1960s, a model based on similar principles was introduced and promoted by Élő, eventually becoming the default rating system used by the World Chess Federation (FIDE) [8]. More recently, the advent of the World Wide Web and search engines highlighted the significance of network topology in ranking systems, exemplified by the success of PageRank, the original algorithm behind Google, introduced by Brin and Page at the end of the 1990s [9]. The study of rating and ranking systems remains an active area of research [1,[10][11][12][13], as collective decision-making plays a vital role in modern civilization, particularly in the information age. ...
Preprint
Full-text available
The inference of rankings plays a central role in the theory of social choice, which seeks to establish preferences from collectively generated data, such as pairwise comparisons. Examples include political elections, ranking athletes based on competition results, ordering web pages in search engines using hyperlink networks, and generating recommendations in online stores based on user behavior. Various methods have been developed to infer rankings from incomplete or conflicting data. One such method, HodgeRank, introduced by Jiang et al.~\cite{jiang2011statistical}, utilizes Hodge decomposition of cochains in Higher Order Networks to disentangle gradient and cyclical components contributing to rating scores, enabling a parsimonious inference of ratings and rankings for lists of items. This paper presents a systematic study of HodgeRank's performance under the influence of quenched disorder and across networks with complex topologies generated by four different network models. The results reveal a transition from a regime of perfect retrieval of true rankings to one of imperfect retrieval as the strength of the quenched disorder increases. A range of observables are analyzed, and their scaling behavior with respect to the network model parameters is characterized. This work advances the understanding of social choice theory and the inference of ratings and rankings within complex network structures.
... One such method that has been adapted to the task of evaluating scientific productivity is the PageRank (PR) method. The best-known application of this method is assessing the importance of web pages in social networks and on the Internet [10,11]. The importance of pages on the Internet can be compared with the innovativeness of a scientific publication and the interest of the scientific community in promoting it. ...
Article
Full-text available
The object of this study is the processes related to the assessment of the closeness of publication ties among scientists and taking into account their productivity related to scientific activity. This is necessary to increase the efficiency of management of research projects. To this end, the PR, TWPR, TWPR-CI methods for calculating scientific productivity estimates of scientists were described. In particular, the TWPR-CI method gives preference to those scientists whose works were more intensively published and cited during the last period of time, which is important for the formation of the composition of the executors of scientific research projects. The method for calculating the closeness of publication ties among scientists or average asymmetric tie strength was also described. The verification of dependence between the evaluation of the closeness of publication ties among scientists and their scientific productivity was carried out based on the analysis of the citation network of scientific publications and the network of scientific cooperation. The networks are built on the basis of the open access Citation Network Dataset (ver. 14). The dataset contains information on more than 5 million scientific publications and more than 36 million citations to them. The correlation analysis revealed the presence of a weak inverse relationship between these estimates. However, the weakness of the connection allows us to state that for this case there is no established correlation between the assessment of scientific productivity and the assessment of the closeness of publication ties. That is, the hypothesis that weak connections between scientists make it possible to increase the productivity and innovativeness of their publications was not confirmed. The results allow for a systematic approach to the process of evaluation and planning of the results of research projects, as well as the formation of the composition of their executors.
Article
Full-text available
Pandemics like COVID-19 have a huge impact on human society and the global economy. Vaccines are effective in the fight against these pandemics but often in limited supplies, particularly in the early stages. Thus, it is imperative to distribute such crucial public goods efficiently. Identifying and vaccinating key spreaders (i.e., influential nodes) is an effective approach to break down the virus transmission network, thereby inhibiting the spread of the virus. Previous methods for identifying influential nodes in networks lack consistency in terms of effectiveness and precision. Their applicability also depends on the unique characteristics of each network. Furthermore, most of them rank nodes by their individual influence in the network without considering mutual effects among them. However, in many practical settings like vaccine distribution, the challenge is how to select a group of influential nodes. This task is more complex due to the interactions and collective influence of these nodes together. This paper introduces a new framework integrating Graph Neural Network (GNN) and Deep Reinforcement Learning (DRL) for vaccination distribution. This approach combines network structural learning with strategic decision-making. It aims to efficiently disrupt the network structure and stop disease spread through targeting and removing influential nodes. This method is particularly effective in complex environments, where traditional strategies might not be efficient or scalable. Its effectiveness is tested across various network types including both synthetic and real-world datasets, demonstrating a potential for real-world applications in fields like epidemiology and cybersecurity. This interdisciplinary approach shows the capabilities of deep learning in understanding and manipulating complex network systems.
Article
Performance and citation impact of scientific journals are measured by traditional metrics such as impact factor, article influence score, journal citation indicator, and others. While the impact factor is based on the total number of citations and does not reflect the quality of journals cited, the article influence score considers the past importance of the citing journals. This paper aims to analyze the possibility of measuring the performance of journals by data envelopment analysis (DEA) models and proposes a new DEA-based citation performance metric for ranking a set of journals. We applied traditional radial and slack-based measure DEA models with weight restrictions, where the outputs of the models are the citation counts from Q1 to Q4 categories, and other journals. This basic model is extended by considering the impact factor of the journals from the previous year as one of the inputs of the model. The results of the study are illustrated on a set of 80 journals from the Web of Science category Operations Research and Management Science (ORMS). The dataset for the study was obtained from the Journal Citation Reports in the period from 2017 until 2022. The relative efficiency scores and the ranking of journals obtained by the models are compared with traditional metrics, the Academic Journal Guide 2021 classification, and the results of the study (Chen et al., Journal of Informetrics, 15(3), 2021) that applies DEA models for the classification of ORMS journals in the same year as our study.
Article
Full-text available
This study analyzes the influence of Higher Education Institutions (IES) on the Brazilian Federal Education Network, considering faculty members as intermediaries. Concepts from complex networks were applied, using faculty members, the IES where they were trained, and the teaching units where they work as vertices. Networks for each Brazilian state were modeled, and the degree centrality and PageRank properties were evaluated. The results indicate that public universities, especially federal ones, are the main trainers of faculty for the federal network. The growing relevance of the Federal Institutes in teacher training also stands out. The study concludes that there is a standardized pattern in the distribution of faculty across the federal network.
Article
Telegram is a widely used instant messaging app that has gained popularity due to its high level of privacy protection. Telegram has standout social network features like channels, which are virtual rooms where only administrators can post and broadcast messages to all subscribers. However, these same features have also led to the emergence of problematic activities and a significant number of fake accounts. To address these issues, Telegram has introduced verified and scam marks for channels, but only a small number of official channels are currently marked as verified, and only a few fakes as scams. In this research, we conduct a large-scale analysis of Telegram by collecting data from 120,979 different public channels and over 247 million messages. We identify and analyze two types of channels: Clones and fakes. Clones are channels that publish identical content from another channel in order to gain subscribers and promote services. Fakes, on the other hand, are channels that impersonate celebrities or well-known services by posting their own messages. To automatically detect fake channels, we propose a machine learning model that achieves an F1-score of 85.45%. By applying this model to our dataset, we find the main targets of fakes are political figures, well-known people such as actors or singers, and services.
Article
The current research adopted a network‐analytic approach to summarize the changes within the field of personality psychology over a 30‐year span from 1990 to 2019. Bibliographic data from 25,238 articles were used to construct three separate co‐authorship networks respectively representing the patterns of collaboration within each decade of personality research (i.e., 1990–1999, 2000–2009, and 2010–2019). The network properties of each co‐authorship graph suggested that personality researchers have become more interconnected and collaborative with each successive decade. An examination of the semantic content of these articles suggested that the synthesis of clinical and normative trait models, along with the integration of traditional person and situation perspectives, may be driving the increased connectivity and collaboration between researchers. We hope that this novel application of network‐analytic and machine‐learning principles can serve as proof of concept for future efforts to summarize a scientific literature.
Article
This paper considers the use of machine learning for diagnosis of diseases that is based on the analysis of a complete gene expression profile. This distinguishes our study from other approaches that require a preliminary step of finding a limited number of relevant genes (tens or hundreds of genes). We conducted experiments with complete genetic expression profiles (20 531 genes) that we obtained after processing transcriptomes of 801 patients with known oncologic diagnoses (oncology of the lung, kidneys, breast, prostate, and colon). Using the indextron (instant learning index system) for a new purpose, i.e., for complete expression profile processing, provided diagnostic accuracy that is 99.75% in agreement with the results of histological verification.
Article
Full-text available
In recent years, the PageRank algorithm has garnered significant attention due to its crucial role in search engine technologies and its applications across various scientific fields. It is well-known that the power method is a classical method for computing PageRank. However, there is a pressing demand for alternative approaches that can address its limitations and enhance its efficiency. Specifically, the power method converges very slowly when the damping factor is close to 1. To address this challenge, this paper introduces a modified multi-step splitting iteration approach for accelerating PageRank computations. Furthermore, we present two variants for computing PageRank, which are variants of the modified multi-step splitting iteration approach, specifically utilizing the thick restarted Arnoldi and adaptively accelerated Arnoldi methods. We provide detailed discussions on the construction and theoretical convergence results of these two approaches. Extensive experiments using large test matrices demonstrate the significant performance improvements achieved by our proposed algorithms.
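As background for the convergence claim above (a standard property of the power method rather than a result of this paper): for the Google matrix $G = \alpha S + (1-\alpha) v e^{T}$, the modulus of the second-largest eigenvalue is bounded by the damping factor $\alpha$, so the power-iteration error decays roughly like $\alpha^{k}$ after $k$ steps; halving the error takes about 4-5 iterations at $\alpha = 0.85$ but roughly 69 at $\alpha = 0.99$.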
Article
Full-text available
Identifying influential nodes is crucial in network science for controlling diseases, sharing information, and viral marketing. Current methods for finding vital spreaders have problems with accuracy, resolution, or time complexity. To address these limitations, this paper presents a hybrid approach called the Bubble Method (BM). First, the BM assumes a bubble with a radius of two surrounding each node. Then, it extracts various attributes from inside and near the surface of the bubble. These attributes are the k-shell index, k-shell diversity, and the distances of nodes within the bubble from the central node. We compared our method to 12 recent ones, including the Hybrid Global Structure model (HGSM) and Generalized Degree Decomposition (GDD), using the Susceptible–Infectious–Recovered (SIR) model to test its effectiveness. The results show the BM outperforms other methods in terms of accuracy, correctness, and resolution. Its low computational complexity renders it highly suitable for analyzing large-scale networks.
Article
In the digital age, the expansion of cyberspace has resulted in increasing complexity, making clear cyberspace visualization crucial for effective analysis and decision‐making. Current cyberspace visualizations are overly complex and fail to accurately reflect node importance. To address the challenge of complex cyberspace visualization, this study introduces the integrated centrality metric (ICM) for constructing a metaphorical map that accurately reflects node importance. The ICM, a novel node centrality measure, demonstrates superior accuracy in identifying key nodes compared to degree centrality (DC), k‐shell centrality (KC), and PageRank values. Through community partitioning and point‐cluster feature generalization, we extract a network's hierarchical structure to intuitively represent its community and backbone topology, and we construct a metaphorical map that offers a clear visualization of cyberspace. Experiments were conducted on four original networks and their extracted backbone networks to identify core nodes. The Jaccard coefficient was calculated considering the results of the three aforementioned centrality measures, ICM, and the SIR model. The results indicate that ICM achieved the best performance in both the original networks and all extracted backbone networks. This demonstrates that ICM can more precisely evaluate node importance, thereby facilitating the construction of metaphorical maps. Moreover, the proposed metaphorical map is more convenient than traditional topological maps for quickly comprehending the complex characteristics of networks.
Article
Using computational Social Network Analysis (SNA), this longitudinal study investigates the development of the interaction network and its influence on the second language (L2) gains of a complete cohort of 41 U.S. sojourners enrolled in a 3‐month intensive study‐abroad Arabic program in Jordan. Unlike extant research, our study focuses on students’ interactions with alma mater classmates, reconstructing their complete network, tracing the impact of individual students’ positions in the social graph using centrality metrics, and incorporating a developmental perspective with three measurement points. Objective proficiency gains were influenced by predeparture proficiency (negatively), multilingualism, perceived integration of the peer learner group (negatively), and the number of fellow learners speaking to the student. Analyses reveal relatively stable same‐gender cliques, but with changes in the patterns and strength of interaction. We also discuss interesting divergent trajectories of centrality metrics, L2 use, and progress; predictors of self‐perceived progress across skills; and the interplay of context and gender.
Article
In the era of big data, social network services continuously modify social connections, leading to dynamic and evolving graph data structures. These evolving graphs, vital for representing social relationships, pose significant memory challenges as they grow over time. To address this, storage-class-memory (SCM) emerges as a cost-effective solution alongside DRAM. However, contemporary graph evolution processes often scatter neighboring vertices across multiple pages, causing weak graph spatial locality and high-TLB misses during traversals. This article introduces SCM-Based graph-evolving aware data arranger (GEAR), a joint management middleware optimizing data arrangement on SCMs to enhance graph traversal efficiency. SCM-based GEAR comprises multilevel page allocation, locality-aware data placement, and dual-granularity wear leveling techniques. Multilevel page allocation prevents scattering of neighbor vertices relying on managing each page in a finer-granularity, while locality-aware data placement reserves space for future updates, maintaining strong graph spatial locality. The dual-granularity wear leveler evenly distributes updates across SCM pages with considering graph traversing characteristics. Evaluation results demonstrate SCM-based GEAR’s superiority, achieving 23% to 70% reduction in traversal time compared to state-of-the-art frameworks.
Article
Full-text available
NSDLib, short for Network Source Detection Library, is an advanced package designed to detect the sources of propagation in networks. It is easy to integrate and offers a range of algorithms for source detection, including evaluating node importance, identifying outbreaks, and reconstructing propagation graphs. This library serves as a comprehensive repository, promoting collaboration among researchers and developers worldwide to combat disinformation warfare. By enabling the implementation and comparison of new techniques, NSDLib aims to enhance the understanding and mitigation of misinformation and improve propagation analysis. This paper provides an overview of NSDLib's capabilities, emphasizing its role in bridging the gap between theoretical research and practical application.
Article
Modelling information from complex systems such as human social interactions or word co-occurrences in our languages can help to understand how these systems are organized and function. Such systems can be modelled by networks, and network theory provides a useful set of methods to analyze them. Among these methods, graph embedding is a powerful tool to summarize the interactions and topology of a network in a vectorized feature space. When used as input to machine learning algorithms, embedding vectors help with common graph problems such as link prediction, graph matching, etc. In Natural Language Processing (NLP), such a vectorization process is also employed. Word embedding has the goal of representing the sense of words, extracting it from large text corpora. Despite differences in the structure of the information given as input to embedding algorithms, many graph embedding approaches are adapted from and inspired by methods in NLP. Limits of these methods are observed in both domains. Most of these methods require long and resource-greedy training. Another downside of most methods is that they are black boxes, which makes it rather complex to understand how the information is structured. An interpretable model allows understanding how the vector space is structured without the need for external information, and can thus be audited more easily. With both these limitations in mind, we propose a novel framework to efficiently embed network vertices in an interpretable vector space. Our Lower Dimension Bipartite Framework (LDBGF) leverages the bipartite projection of a network using cliques to reduce dimensionality. Along with LDBGF, we introduce two implementations of this framework that rely on communities instead of cliques: SINr-NR and SINr-MF. We show that SINr-MF can perform well on classical graphs and SINr-NR can produce high-quality graph and word embeddings that are interpretable and stable across runs.
Chapter
This chapter covers link prediction through its principles, methods, and applications. It forecasts potential future connections and identifies currently unknown links across both temporal and spatial dimensions. Link prediction has become a prominent research area, expanding its techniques by integrating various models. This chapter presents three key models, illustrating their interconnections. Furthermore, advancements in neural networks and deep learning have led to the creation of graph-based models that combine network structures and topologies. Link prediction is widely applied in fields like social network recommendations (e.g., Weibo, QQ, and Twitter) and in predicting node types in known networks, such as detecting spam emails or forecasting criminal behavior. Despite its broad applications, link prediction remains an active research topic in social networks.
Chapter
Existing risk control techniques have primarily been developed from the perspectives of de-anonymizing address clustering and illicit account classification. However, these techniques cannot be used to ascertain the potential risks for all accounts and are limited by specific heuristic strategies or insufficient label information. These constraints motivate us to seek an effective rating method for quantifying the spread of risk in a transaction network. To the best of our knowledge, we are the first to address the problem of account risk rating on Ethereum by proposing a novel model called RiskProp, which includes a de-anonymous score to measure transaction anonymity and a network propagation mechanism to formulate the relationships between accounts and transactions. We demonstrate the effectiveness of RiskProp in overcoming the limitations of existing models by conducting experiments on real-world datasets from Ethereum. Through case studies on the detected high-risk accounts, we demonstrate that the risk assessment by RiskProp can be used to provide warnings for investors and protect them from possible financial losses, and the superior performance of risk score-based account classification experiments further verifies the effectiveness of our rating method.
Article
Full-text available
In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more “important” pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. We define several importance metrics, ordering schemes, and performance evaluation measures for this problem. We also experimentally evaluate the ordering schemes on the Stanford University Web. Our results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.
Article
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
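The iterative hub/authority computation sketched in this abstract (HITS) can be illustrated in a few lines of Python; this is a simplified sketch of the algorithm's core loop, with the edge-list input format and the fixed iteration count chosen for the example:

import numpy as np

def hits(edges, n, iters=50):
    """Compute hub and authority scores for n nodes from a directed edge list."""
    A = np.zeros((n, n))
    for src, dst in edges:
        A[src, dst] = 1.0                  # adjacency matrix: link src -> dst
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(iters):
        auths = A.T @ hubs                 # authoritative pages are pointed to by good hubs
        hubs = A @ auths                   # good hubs point to authoritative pages
        auths /= np.linalg.norm(auths)     # normalize to keep the scores bounded
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Toy example: node 2 is pointed to by nodes 0 and 1, so it gets the top authority score.
print(hits([(0, 2), (1, 2), (0, 1)], n=3))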
Article
Web information retrieval tools typically make use of only the text on pages, ignoring valuable information implicitly contained in links. At the other extreme, viewing the Web as a traditional hypertext system would also be a mistake, because heterogeneity, cross-domain links, and the dynamic nature of the Web mean that many assumptions of typical hypertext systems do not apply. The novelty of the Web leads to new problems in information access, and it is necessary to make use of the new kinds of information available, such as multiple independent categorization, naming, and indexing of pages. This paper discusses the varieties of link information (not just hyperlinks) on the Web, how the Web differs from conventional hypertext, and how the links can be exploited to build useful applications. Specific applications presented as part of the ParaSite system find individuals' homepages, new locations of moved pages, and unindexed information.
Conference Paper
Finding the right information in the World Wide Web is becoming a fundamental problem, since the amount of global information that the WWW contains is growing at an incredible rate. In this paper, we present a novel method to extract from a web object its "hyper" informative content, in contrast with current search engines, which only deal with the "textual" informative content. This method is not only valuable per se, but it is shown to be able to considerably increase the precision of current search engines. Moreover, it integrates smoothly with existing search engine technology since it can be implemented on top of every search engine, acting as a post-processor, thus automatically transforming a search engine into its corresponding "hyper" version. We also show how, interestingly, the hyper information can be usefully employed to face the search engine persuasion problem.
Article
One of the enabling technologies of the World Wide Web, along with browsers, domain name servers, and hypertext markup language, is the search engine. Although the Web contains over 100 million pages of information, those millions of pages are useless if you cannot find the pages you need. All major Web search engines operate the same way: a gathering program explores the hyperlinked documents of the Web, foraging for Web pages to index. These pages are stockpiled by storing them in some kind of database or repository. Finally, a retrieval program takes a user query and creates a list of links to Web documents matching the words, phrases, or concepts in the query. Although the retrieval program itself is correctly called a search engine, by popular usage the term now means a database combined with a retrieval program. For example, the Lycos search engine comprises the Lycos Catalog of the Internet and the Pursuit retrieval program. This paper describes the Lycos system for collecting, storing, and retrieving information about pages on the Web. After outlining the history and precursors of the Lycos system, the paper discusses some of the design choices made in building this Web indexer and touches briefly on the economic issues involved in working with very large retrieval systems.