Article

Authoritative sources in a hyperlinked environment

Authors:
Jon M. Kleinberg

Abstract

The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
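To make the hub/authority relationship in the abstract concrete, here is a minimal sketch of the iterative computation it describes; the input format `links`, the iteration count, and the normalization are illustrative choices, not details taken from the paper.

```python
import math

# Minimal sketch of the hub/authority iteration described in the abstract.
# `links` maps each page to the pages it points to (illustrative input format).
def hits(links, iterations=50):
    nodes = set(links) | {v for targets in links.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: a page is authoritative if good hubs point to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: a page is a good hub if it points to good authorities.
        hub = {n: sum(auth[dst] for dst in links.get(n, [])) for n in nodes}
        # Normalize so the scores converge rather than grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub

# Tiny example: two hubs pointing at two candidate authorities.
authorities, hubs = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
```

Under this mutual reinforcement, the normalized authority and hub vectors approach the principal eigenvectors of A^T A and A A^T, which is the eigenvector connection mentioned in the abstract.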


... It is observed that the link structure formed by citations is analogous to that of the web, where the links are hyperlinks between web pages. This motivated us to apply widely used network analysis methods for web link analysis, such as HITS [4] and PageRank [5], to our dataset. ...
... Note also that "power law"-similarity enrichment adds 3,871 nodes to the citation network for "power law". Ranking results are reported in Table 3 for the articles that appear in the top ten for all methodologies, namely in-degree, HITS [4], and PageRank [5]. As shown in the table, we are able to identify the most prominent articles in the "power law"-related topic. ...
... For convenience, the citation contexts are underlined and the terms are in bold. For example, "[4]" in article a1 denotes a citation to article a4 with terms τ1 and τ2. ...
Preprint
It is hard to detect important articles in a specific context. Information retrieval techniques based on full-text search can be inaccurate for identifying main topics, and they are not able to provide an indication of the importance of an article. Generating a citation network is a good way to find the most popular articles, but this approach is not context aware. The text around a citation mark is generally a good summary of the referred article, so citation context analysis presents an opportunity to use the wisdom of the crowd for detecting important articles in a context-sensitive way. In this work, we analyze citation contexts to rank articles properly for a given topic. The proposed model uses citation contexts to create a directed and weighted citation network based on the target topic. We create a directed and weighted edge between two articles if the citation context contains terms related to the target topic. Then we apply common ranking algorithms to find important articles in this newly created network. We show that this method successfully detects a good subset of the most prominent articles on a given topic. The biggest contribution of this approach is that we are able to identify important articles for a given search term even though these articles do not contain this search term. This technique can be used on other linked documents, including web pages, legal documents, and patents.
... The sheer size of the datasets also allows system-level analysis of research production and consumption [20], migration of authors [21,22], and changes in production in several regions of the world as a function of time [5,6], just to name a few examples. At the same time, those analyses have spurred an intense research activity aimed at defining metrics able to capture the importance/ranking of authors, institutions, or even entire countries [23,24,14,15,17,25,26,27,28,29]. Whereas such large datasets are extremely useful in understanding scholarly networks and in charting the creation of knowledge, they also point out the limits of our conceptual and modeling frameworks [30] and call for a deeper understanding of the dynamics ruling the diffusion and fruition of knowledge across social and geographical space. ...
... However, these common indicators might fail to account for the actual importance and prestige associated with each publication. In order to overcome these limitations, many different measures have been proposed [25,26,27,28]. Here we introduce the scientific production ranking algorithm (SPR), an iterative algorithm based on the notion of diffusing scientific credits. ...
... Here we introduce the scientific production ranking algorithm (SPR), an iterative algorithm based on the notion of diffusing scientific credits. It is analogous to PageRank [33], CiteRank [26], HITS [25], SARA [29], and other ranking metrics. In the algorithm, each node receives a credit that is redistributed to its neighbours at the next iteration, until the process converges to a stationary distribution of credit over all nodes (see Methods section for the formal definition). ...
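The exact SPR update is deferred to the cited Methods section; purely as an illustration of the credit-diffusion idea described in this excerpt (and not the authors' definition), a generic iteration might look like this:

```python
# Generic credit-diffusion sketch: credit is repeatedly redistributed to
# neighbours until the distribution stops changing. Input format is assumed.
def diffuse_credit(out_links, iterations=100, tol=1e-9):
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    credit = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = dict.fromkeys(nodes, 0.0)
        for src in nodes:
            targets = out_links.get(src, [])
            if not targets:                  # nodes with no neighbours keep their credit
                nxt[src] += credit[src]
                continue
            share = credit[src] / len(targets)
            for dst in targets:              # redistribute credit to neighbours
                nxt[dst] += share
        if max(abs(nxt[n] - credit[n]) for n in nodes) < tol:
            return nxt                       # converged to a stationary distribution
        credit = nxt
    return credit
```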
Preprint
We analyze the entire publication database of the American Physical Society, generating longitudinal (50-year) citation networks geolocalized at the level of single urban areas. We define a knowledge diffusion proxy and scientific production ranking algorithms to capture the spatio-temporal dynamics of Physics knowledge worldwide. By using the knowledge diffusion proxy we identify the key cities in the production and consumption of knowledge in Physics as a function of time. The results from the scientific production ranking algorithm allow us to characterize the top cities for scholarly research in Physics. Although we focus on a single dataset concerning a specific field, the methodology presented here opens the path to comparative studies of the dynamics of knowledge across disciplines and research areas.
... Many research works have proposed methods to identify influential users in online social networks. For example, there are studies [12,29,30] that adapted the HITS algorithm, originally proposed to extract information by analyzing link structure in the World Wide Web [20], to identify influential users in OSNs. Gayo-Avello [12] applied HITS to Twitter users' following-follower relationships to identify and differentiate influential users from spammers; the work considers an authority user to be someone who is followed by many hub users, and a hub user to be one who follows many authority users. ...
... Silva et al. [31] proposed ProfileRank, a PageRank [27]-inspired model, to find and recommend influential users based on Twitter users' retweet activities. There are also works that extended the HITS algorithm [20] to find influential users in OSNs. Romero et al. [29] proposed the influence-passivity (I-P) algorithm to measure Twitter users' influence and passivity from their retweet activities. ...
Preprint
Finding influential users in online social networks is an important problem with many useful applications. HITS and other link analysis methods, in particular, have often been used to identify hub and authority users in web graphs and online social networks. These works, however, have not considered the topical aspect of links in their analysis. A straightforward approach to overcome this limitation is to first apply topic models to learn the user topics before applying the HITS algorithm. In this paper, we instead propose a novel topic model known as the Hub and Authority Topic (HAT) model to combine the two processes so as to jointly learn the hub, authority, and topical interests. We evaluate HAT against several existing state-of-the-art methods in two aspects: (i) modeling of topics, and (ii) link recommendation. We conduct experiments on two real-world datasets from Twitter and Instagram. Our experimental results show that HAT is comparable to state-of-the-art topic models in learning topics and that it outperforms the state-of-the-art in the link recommendation task.
... Text quality prediction primarily includes webpage quality prediction and answer quality prediction. For the webpage quality prediction task, there are two mainstream methods: link analysis [52][53][54] and feature-based methods. The latter is closer to our research. ...
... applied the HITS [52] algorithm to a user-answer graph in an online forum to calculate user authority [59]. Dom et al. and Campbell et al. discovered that in the community expert identification task, a link-based algorithm had better performance than a content-based algorithm [60,61]. ...
Preprint
Currently, a growing number of health consumers are asking health-related questions online, at any time and from anywhere, which effectively lowers the cost of health care. The most common approach is using online health expert question-answering (HQA) services, as health consumers are more willing to trust answers from professional physicians. However, these answers can be of varying quality depending on circumstance. In addition, as the available HQA services grow, how to predict the answer quality of HQA services via machine learning becomes increasingly important and challenging. In an HQA service, answers are normally short texts, which are severely affected by the data sparsity problem. Furthermore, HQA services lack community features such as best answer and user votes. Therefore, the wisdom of the crowd is not available to rate answer quality. To address these problems, in this paper, the prediction of HQA answer quality is defined as a classification task. First, based on the characteristics of HQA services and feedback from medical experts, a standard for HQA service answer quality evaluation is defined. Next, based on the characteristics of HQA services, several novel non-textual features are proposed, including surface linguistic features and social features. Finally, a deep belief network (DBN)-based HQA answer quality prediction framework is proposed to predict the quality of answers by learning the high-level hidden semantic representation from the physicians' answers. Our results prove that the proposed framework overcomes the problem of overly sparse textual features in short text answers and effectively identifies high-quality answers.
... Depending on the setting, assigning each vertex a ranking score can be used for many tasks, including the estimation of vertex importance (popularity prediction), the inference of vertices similar to a target vertex (similarity search), and edge suggestion for connecting a target vertex (link prediction and recommendation). Existing work on graph ranking has largely focused on unipartite graphs, including PageRank [2], HITS [3], and many of their variants [4], [5], [6], [7]. Although several works [8], [9], [10] have considered ranking on bipartite graphs, they have either focused on a 1. ...
... In the context of web graph ranking, PageRank [2] and HITS [3] are the most prominent methods. PageRank estimates the importance score of vertices as the stationary distribution of a random walk process: starting from a vertex, the surfer randomly jumps to a neighbor vertex according to the edge weight. ...
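As a concrete illustration of the weighted random-walk view of PageRank described in this excerpt, here is a power-iteration sketch; the damping factor d = 0.85, the input format, and the dangling-node handling are conventional assumptions rather than details from the cited text.

```python
# Illustrative PageRank power iteration on a weighted graph: the surfer follows
# an out-edge with probability proportional to its weight and teleports
# uniformly with probability 1 - d.
def pagerank(weighted_out, d=0.85, iterations=100, tol=1e-10):
    # weighted_out: node -> {neighbor: edge_weight} (assumed input format)
    nodes = set(weighted_out) | {v for nbrs in weighted_out.values() for v in nbrs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        nxt = dict.fromkeys(nodes, (1.0 - d) / n)     # teleportation mass
        for u in nodes:
            nbrs = weighted_out.get(u, {})
            total = sum(nbrs.values())
            if total == 0:                            # dangling node: spread uniformly
                for v in nodes:
                    nxt[v] += d * rank[u] / n
            else:
                for v, w in nbrs.items():             # follow edges proportionally to weight
                    nxt[v] += d * rank[u] * w / total
        if max(abs(nxt[u] - rank[u]) for u in nodes) < tol:
            return nxt
        rank = nxt
    return rank
```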
Preprint
The bipartite graph is a ubiquitous data structure that can model the relationship between two entity types: for instance, users and items, queries and webpages. In this paper, we study the problem of ranking vertices of a bipartite graph, based on the graph's link structure as well as prior information about vertices (which we term a query vector). We present a new solution, BiRank, which iteratively assigns scores to vertices and finally converges to a unique stationary ranking. In contrast to the traditional random walk-based methods, BiRank iterates towards optimizing a regularization function, which smooths the graph under the guidance of the query vector. Importantly, we establish how BiRank relates to the Bayesian methodology, enabling the future extension in a probabilistic way. To show the rationale and extendability of the ranking methodology, we further extend it to rank for the more generic n-partite graphs. BiRank's generic modeling of both the graph structure and vertex features enables it to model various ranking hypotheses flexibly. To illustrate its functionality, we apply the BiRank and TriRank (ranking for tripartite graphs) algorithms to two real-world applications: a general ranking scenario that predicts the future popularity of items, and a personalized ranking scenario that recommends items of interest to users. Extensive experiments on both synthetic and real-world datasets demonstrate BiRank's soundness (fast convergence), efficiency (linear in the number of graph edges) and effectiveness (achieving state-of-the-art in the two real-world tasks).
... How can we measure effective node-to-node proximities for graph mining applications such as ranking and link prediction? Measuring relevance (i.e., proximity or similarity) scores between nodes is a fundamental tool for many graph mining applications [1,3,2,14,5,12]. Among various relevance measures, Random Walk with Restart (RWR) [6,11,10] provides useful node-to-node relevance scores by considering global network structure [7] and intricate edge relationships [26]. ...
... Relevance measures in graphs. There are various relevance measures in graphs based on link analysis and random walk, e.g., PageRank [18], HITS [14], Random Walk Graph Kernel [13], and RWR (or Personalized PageRank) [6]. Among these measures, RWR has received much attention from the data mining community since it provides a personalized ranking w.r.t. a node, and it has been applied to many graph mining applications such as community detection [4], link prediction [5,15], ranking [28], and graph matching [27]. ...
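For reference, the RWR (personalized PageRank) score mentioned here can be sketched as a simple iteration in which the walker restarts at a seed node with probability c; the matrix convention and the value of c below are assumptions for illustration.

```python
import numpy as np

def rwr(adj, seed, c=0.15, iterations=100, tol=1e-10):
    # adj[i, j] = weight of the edge from node j to node i (assumed convention);
    # c is the restart probability, a conventional choice rather than a fixed value.
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0               # avoid division by zero on dangling columns
    P = adj / col_sums                          # column-stochastic transition matrix
    n = adj.shape[0]
    q = np.zeros(n)
    q[seed] = 1.0                               # restart (personalization) vector
    r = np.full(n, 1.0 / n)
    for _ in range(iterations):
        r_next = (1 - c) * (P @ r) + c * q      # walk with prob. 1 - c, restart with prob. c
        if np.abs(r_next - r).max() < tol:
            return r_next
        r = r_next
    return r
```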
Preprint
Given a real-world graph, how can we measure relevance scores for ranking and link prediction? Random walk with restart (RWR) provides an excellent measure for this and has been applied to various applications such as friend recommendation, community detection, anomaly detection, etc. However, RWR suffers from two problems: 1) using the same restart probability for all the nodes limits the expressiveness of random walk, and 2) the restart probability needs to be manually chosen for each application without theoretical justification. We have two main contributions in this paper. First, we propose Random Walk with Extended Restart (RWER), a random walk based measure which improves the expressiveness of random walks by using a distinct restart probability for each node. The improved expressiveness leads to superior accuracy for ranking and link prediction. Second, we propose SuRe (Supervised Restart for RWER), an algorithm for learning the restart probabilities of RWER from a given graph. SuRe eliminates the need to heuristically and manually select the restart parameter for RWER. Extensive experiments show that our proposed method provides the best performance for ranking and link prediction tasks, improving the MAP (Mean Average Precision) by up to 15.8% on the best competitor.
... The centrality measure proposed by Kleinberg (1999) is utilized in this paper as the last step of our structural analysis. Since our networks are strongly ...
Article
Full-text available
This paper examines Türkiye’s forward and backward inter-industry technology diffusion networks between 2009 and 2014. Using input–output tables, it explores how each industry is both a home and a source of knowledge spillovers. The paper demonstrates the heterogeneity of knowledge diffusion networks by interacting industrial structure and R&D intensity. Hub and authority scores are computed for centrality analysis, showing that industries that stand out as consumers with higher R&D intensity are central authorities, while suppliers with higher R&D intensities are central as they supply goods and services to their customers in Türkiye’s economy.
... These metrics reveal topological structural information about the global semiconductor trade network from diverse angles, offering valuable insights into its evolution. To illustrate the role of nodes within the global semiconductor trade network, measurement frameworks (Table 3) are developed based on factors such as nodes' direct influence [28], intermediary influence [29,30], and comprehensive influence [31]. These metrics illuminate the roles of different regions within the trade network. ...
Article
Full-text available
Amidst the global restructuring of the semiconductor supply chain, this paper constructs a global semiconductor trade network (2007, 2012, 2017, 2021) encompassing three segments (raw materials, equipment, and finished components), based on the CEPII database. After initially exploring trade flows among different regions, the paper conducts an in-depth analysis of the network’s overall structure and the significance of its nodes. Furthermore, the evolution of the trade network’s community structure is discussed and its robustness and dynamics over recent years are assessed through computer program simulation. The findings are as follows: First, semiconductor trade flows are concentrated primarily among a few regions in Asia, US, and EU. Second, the network has grown in size and exhibits significant “small-world” characteristics in all segments, deviating from the typical "sparsity" seen in large-scale networks. Third, Japan, the US, and a few European regions wield significant influence in semiconductor materials and equipment trade, while Asian economies such as Chinese mainland, Chinese Taiwan, and Korea dominate semiconductor components trade. Fourth, the raw materials trade network has diversified in recent years, while the trade networks for equipment and finished components remain in a state of continuous “polarization.” Fifth, the semiconductor trade network demonstrates robustness against random attacks but collapses quickly under targeted attacks. Among the three segments, the trade network of finished components, being larger in scale, exhibits greater resilience against both random and targeted attacks. This paper not only enhances the construction of the global semiconductor trade network but also introduces a dynamic perspective, offering deeper insights into its structure and robustness. The insights gained from this analysis provide valuable guidance for policymakers and companies, especially amidst rapid technological change and geopolitical tensions.
... λ is the largest eigenvalue, and φ_{iα} is the eigenvector component of node i in layer α. Authority centrality is based on eigenvector centrality, and the "authority" index is selected to measure importance (Kleinberg 1999) so that both overall connectedness and individual extremes can be taken into account. Table 1 shows the specific indexes. ...
Article
Full-text available
As the international environment changes, frequent geopolitical crises continue to hinder the healthy development of global stock markets. To analyze in-depth the risk contagion path between the international stock market and geopolitics under the impact of extreme events, this paper explores the tail risk interactive contagion mechanism and dynamic effects of the double-layer network between the international stock market and geopolitics from the perspective of complex networks. Empirical research finds that geopolitical conflicts exacerbate risk contagion among international stock markets, and there are significant differences in risk contagion between developed and emerging economies. The analysis of the complex interaction effect in the double-layer network of the international stock market and geopolitics shows that the intralayer risk spillover effect of geopolitics in the short term is significantly higher than that of the stock price volatility network layer. Finally, the study on the dynamic changes of the double-layer network connectedness between the international stock market and geopolitics found that the shock of extreme events, such as military conflict and public health security, is an important factor in triggering the cross-contagion of risks. The research conclusions provide new ideas for preventing the cross-contagion of geopolitical risks in the stock markets of various countries under the evolution of the global political and economic pattern.
... Research by Brin and Page (1998) established the foundational principles of algorithmic search, while subsequent work by Kleinberg (2000) introduced the concept of authority in web information retrieval. These developments created the framework for modern search technology, emphasizing the importance of link structure and relevance signals in information retrieval. ...
Thesis
Full-text available
The integration of generative artificial intelligence into search technology marks a pivotal moment in the evolution of digital information systems. This research examines this transformation through multiple lenses, analyzing how AI-powered search capabilities are fundamentally reshaping the way humans discover and interact with information online. The study is particularly timely as it coincides with Google's integration of Gemini into search results and the rising prominence of ChatGPT as a primary information discovery tool, representing the first significant challenge to Google's long-standing dominance in the search domain. At its core, this research investigates how the emergence of AI-mediated search is disrupting traditional digital business models and forcing a comprehensive reconceptualization of online publishing strategies. The study employs a mixed-methods approach, combining quantitative analysis of traffic patterns and user behavior with qualitative insights from industry experts and case studies. This methodology allows for a nuanced understanding of both the macro-level industry transformation and the micro-level adaptations being implemented by individual organizations. The research's scope encompasses three interconnected dimensions: the technological evolution of search interfaces, the economic implications for digital content creators and platforms, and businesses' strategic responses to this changing landscape. By examining these dimensions simultaneously, the study aims to develop a comprehensive framework for understanding and navigating the emerging AI-mediated information ecosystem. The analysis pays particular attention to the changing dynamics between content creators, platforms, and users. Traditional SEO-driven content strategies are being challenged as AI systems increasingly mediate between content and consumers, necessitating new approaches to content creation and distribution. This shift raises important questions about the future viability of current digital publishing models and the potential emergence of new value-creation opportunities in an AI-dominated search landscape. Through rigorous empirical analysis and theoretical development, this research seeks to contribute both to academic understanding of digital platform evolution and to practical knowledge for industry stakeholders. The findings will offer insights into how organizations can adapt their strategies to thrive in an environment where AI systems play an increasingly central role in information discovery and distribution. The implications of this research extend beyond the immediate concerns of search technology and digital publishing. They touch upon broader questions about the future of information access, the evolution of digital business models, and the changing relationship between human users and AI systems in information discovery and consumption. Understanding these dynamics is crucial for academics, practitioners, and policymakers as they navigate the opportunities and challenges presented by this technological transformation. The ultimate goal of this research is to provide a foundational framework for understanding how AI is reshaping the search landscape and to offer actionable insights for organizations adapting to this new reality. This includes developing theoretical models for understanding AI-driven disruption in digital platforms and providing practical guidelines for content creators and publishers navigating this evolving ecosystem.
... In response to the constraints of traditional models, machine learning techniques have gained significant popularity for stock price prediction, with models such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN) [4,5]. These models, particularly SVM and ANN, have proven effective at learning from historical data and capturing detailed patterns in time series. ...
Article
Full-text available
The application of deep learning to stock price prediction has gained increasing importance because of its capacity to capture complex patterns in financial data. This study uses Long Short-Term Memory (LSTM) networks to forecast stock prices using both daily trading data and high-frequency minute-level data. For the daily trading data, 42 stock datasets were analyzed, with 37 used for training and 5 reserved for testing. High-frequency data, characterized by minute-by-minute changes, required the development of separate models for each stock. Each model incorporated varying configurations of LSTM units and technical indicators to enhance predictive accuracy. The experimental results demonstrated that the inclusion of technical indicators led to a reduction in prediction error, as measured by Root Mean Squared Error (RMSE). This highlights the potential of LSTM networks, especially when enhanced with technical indicators, to improve stock price forecasting accuracy across various timeframes.
... Step 2: Decide whether these pages are sufficiently professional to be relevant to the target topic. Kleinberg [20] proposed using the hyperlink-induced topic search (HITS) algorithm to rank these pages according to their authority. ...
Article
Full-text available
This paper proposes using a web crawler to organize website content as a dialogue tree in some domains. We build an intelligent customer service agent based on this dialogue tree for general usage. The encoder-decoder architecture Seq2Seq is used to understand natural language and is then modified as a bi-directional LSTM to increase accuracy in polysemy cases. The attention mechanism is added to the decoder to address the problem of accuracy decreasing as sentences grow in length. We conducted four experiments. The first is an ablation experiment demonstrating that the Seq2Seq + bi-directional LSTM + attention mechanism is superior to LSTM, Seq2Seq, and Seq2Seq + attention mechanism in natural language processing. Using an open-source Chinese corpus for testing, the accuracy was 82.1%, 63.4%, 69.2%, and 76.1%, respectively. The second experiment uses knowledge of the target domain to ask questions. Five thousand data records from Taiwan Water Supply Company were used as the target training data, and a thousand questions that differed from the training data but related to water were used for testing. The accuracy of RasaNLU and this study were 86.4% and 87.1%, respectively. The third experiment uses knowledge from non-target domains to ask questions and compares answers from RasaNLU with the proposed neural network model. Five thousand questions were extracted as the training data, including chat databases from eight public sources such as Weibo, Tieba, Douban, and other well-known social networking sites in mainland China, as well as PTT in Taiwan. Then, 1,000 questions from the same corpus that differed from the training data were extracted for testing. The accuracy of this study was 83.2%, which is far better than RasaNLU. This confirms that the proposed model is more accurate in the general field. The last experiment compares this study with voice assistants like Xiao Ai, Google Assistant, Siri, and Samsung Bixby. Although this study cannot answer vague questions accurately, it is more accurate in the trained application fields.
... It has been suggested that two major factors affect feature importance, namely feature relevance and feature frequency [Zhang et al. 2010]. We applied the Hyperlink-induced topic search (HITS) algorithm [Kleinberg 1999] to compute feature relevance for ranking, and then removed low-relevance features. Second, the occurrence frequency of all the candidate features was calculated and those with a very low occurrence frequency were removed. ...
Article
In this study, we investigate whether consumers with different cultures concentrate on different product features in online consumer product reviews and show different opinions toward individual product features of the same products. To this end, we extract product features and their associated opinions (i.e., feature-opinion pairs) from online consumer reviews of the same products available at Amazon websites for U.S. and Chinese consumers. The analysis of 4,754 reviews shows that American consumers tend to focus more on usability features of products and have more negative opinions on the same product features in their online reviews than Chinese consumers. Chinese consumers, on the other hand, comment more on aesthetics of products in their reviews. These findings provide some valuable guidance for sellers and manufacturers to better customize their products and improve marketing strategies for consumers with different cultural backgrounds.
... Pei et al. [21] presented a direct method to search for influential spreaders by following the real spreading dynamics in a wide range of networks. Other methods such as HITS [22] and TwitterRank [23] are also useful and effective. Recently, a local-based method, ClusterRank [24], has also shown good performance in some cases. ...
Preprint
Full-text available
Identifying a set of influential spreaders in complex networks plays a crucial role in effective information spreading. A simple strategy is to choose the top-r ranked nodes as spreaders according to an influence ranking method such as PageRank, ClusterRank, or k-shell decomposition. Besides, some heuristic methods such as hill-climbing, SPIN, degree discount, and independent-set-based methods have also been proposed. However, these approaches suffer from the possibility that some spreaders are so close together that their spheres of influence overlap, or they are time consuming. In this report, we present a simple yet effective iterative method named VoteRank to identify a set of decentralized spreaders with the best spreading ability. In this approach, all nodes vote for a spreader in each turn, and the voting ability of the neighbors of the elected spreader is decreased in the subsequent turn. Experimental results on four real networks show that under the Susceptible-Infected-Recovered (SIR) model, VoteRank outperforms the traditional benchmark methods in both spreading speed and final affected scale. What's more, VoteRank is also superior to other group-spreader identification methods in computational time.
... Many ranking methods have been proposed in homogeneous networks. For example, PageRank [117] evaluates the importance of objects through a random walk process, and HITS [118] ranks objects using the authority and hub scores. ...
Preprint
Most real systems consist of a large number of interacting, multi-typed components, while most contemporary research models them as homogeneous networks, without distinguishing the different types of objects and links in the networks. Recently, more and more researchers have begun to consider these interconnected, multi-typed data as heterogeneous information networks and to develop structural analysis approaches that leverage the rich semantic meaning of the structural types of objects and links in the networks. Compared to the widely studied homogeneous network, the heterogeneous information network contains richer structure and semantic information, which provides plenty of opportunities as well as many challenges for data mining. In this paper, we provide a survey of heterogeneous information network analysis. We introduce the basic concepts of heterogeneous information network analysis, examine its developments on different data mining tasks, discuss some advanced topics, and point out some future research directions.
... Therefore, it is necessary to assign importance ranks to each patent. To address this issue, link evaluation methods such as PageRank [12] or HITS [13] have been used to calculate the importance of patents. Lukach et al. [14] proposed computing importance from the PageRank score of patents. ...
Preprint
Full-text available
Scoring patent documents is very useful for technology management. However, conventional methods are based on static models and, thus, do not reflect the growth potential of the technology cluster of the patent. Even if the cluster of a patent has no hope of growing, we would still recognize the patent as important if its PageRank or other ranking score is high. Therefore, there is a need for citation network clustering and the prediction of future citations. In our research, clustering of patent citation networks with the Stochastic Block Model (SBM) was performed with the aim of enabling corporate managers and investors to evaluate the scale and life cycle of a technology. As a result, we confirmed that the nested SBM is appropriate for graph clustering of patent citation networks. Also, a high MAPE value was obtained and the direction accuracy achieved a value greater than 50% when predicting growth potential for each cluster using LSTM.
... A graph embedding method for Wikipedia using a similarity inspired by the HITS algorithm [7] was presented by Sajadi et al. [16]. The output of this approach for each Wikipedia concept is a fixed-length list of similar Wikipedia pages and their similarities, which represent the dimension names of the corresponding Wikipedia concepts. ...
Preprint
Using deep learning for different machine learning tasks such as image classification and word embedding has recently gained much attention. Its appealing performance, reported across specific Natural Language Processing (NLP) tasks in comparison with other approaches, is the reason for its popularity. Word embedding is the task of mapping words or phrases to a low-dimensional numerical vector. In this paper, we use deep learning to embed Wikipedia Concepts and Entities. The English version of Wikipedia contains more than five million pages, which suggests its capability to cover many English Entities, Phrases, and Concepts. Each Wikipedia page is considered a concept. Some concepts correspond to entities, such as a person's name, an organization, or a place. Contrary to word embedding, Wikipedia Concept Embedding is not ambiguous, so there are different vectors for concepts with a similar surface form but different mentions. We proposed several approaches and evaluated their performance based on Concept Analogy and Concept Similarity tasks. The results show that the proposed approaches have performance comparable to, and in some cases even higher than, the state-of-the-art methods.
... The origin of PageRank is rooted in the intent to rank web pages based on their link topology [37]. Although there are alternative link-topology-based algorithms such as HITS [23] and SALSA [26] (which combines PageRank and HITS), PageRank enjoys brand recognition due to its early integration into and association with the Google search engine [15]. ...
Preprint
Full-text available
Patterns often appear in a variety of large, real-world networks, and interesting physical phenomena are often explained by network topology, as in the case of the bow-tie structure of the World Wide Web or the small-world phenomenon in social networks. The discovery and modelling of such regular patterns has wide application, from disease propagation to financial markets. In this work we describe a newly discovered, regularly occurring striation pattern found in the PageRank ordering of adjacency matrices that encode real-world networks. We demonstrate that these striations are the result of well-known graph generation processes resulting in regularities that are manifest in the typical neighborhood distribution. The spectral view explored in this paper encodes a tremendous amount of information about the explicit and implicit topology of a given network, so we also discuss the interesting network properties, outliers, and anomalies that a viewer can determine from a brief look at the re-ordered matrix.
... We use PageRank score [49], HITS authority and hub values [50], in-degree and out-degree scores as features of users. A mixture of Gaussians model is proposed to explain the features generation process. ...
Preprint
In networks, multiple contagions, such as information and purchasing behaviors, may interact with each other as they spread simultaneously. However, most of the existing information diffusion models are built on the assumption that each individual contagion spreads independently, regardless of their interactions. Gaining insight into such interactions is crucial to understanding contagion adoption behaviors, and thus to making better predictions. In this paper, we study contagion adoption behavior under a set of interactions, specifically, the interactions among users, contagions' contents, and sentiments, which are learned from social network structures and texts. We then develop an effective and efficient interaction-aware diffusion (IAD) framework, incorporating these interactions into a unified model. We also present a generative process to distinguish user roles, a co-training method to determine contagions' categories, and a new topic model to obtain topic-specific sentiments. Evaluation on a large-scale Weibo dataset demonstrates that our proposal can learn how different users, contagion categories, and sentiments interact with each other efficiently. With these interactions, we can make more accurate predictions than the state-of-the-art baselines. Moreover, we can better understand how the interactions influence the propagation process and thus suggest useful directions for information promotion or suppression in viral marketing.
... Current notions of graph similarity such as graph isomorphism and edit distance (cf. [10]), descriptive statistics of graph structure measures such as degree distribution or diameter, or iterative approaches which assess the similarity of the neighborhood of nodes (e.g., [19,21,31]) rely purely on graph theoretical properties. ...
Preprint
While visual comparison of directed acyclic graphs (DAGs) is commonly encountered in various disciplines (e.g., finance, biology), knowledge about humans' perception of graph similarity is currently quite limited. By graph similarity perception we mean how humans perceive commonalities and differences in graphs and thereby come to a similarity judgment. As a step toward filling this gap, the study reported in this paper strives to identify factors which influence the similarity perception of DAGs. In particular, we conducted a card-sorting study employing a qualitative and quantitative analysis approach to identify 1) groups of DAGs that are perceived as similar by the participants and 2) the reasons behind their choice of groups. Our results suggest that similarity is mainly influenced by the number of levels, the number of nodes on a level, and the overall shape of the graph.
... Graph-based ranking algorithms are based on the following idea: first, a graph from a document is created that has as nodes the candidate keyphrases, and then edges are added between related candidate keyphrases. The final goal is the ranking of the nodes using a graph-based ranking method, such as PageRank (Brin & Page, 1998), Positional Function (Herings et al., 2005), and HITS (Kleinberg, 1999). TextRank (Mihalcea & Tarau, 2004) builds an undirected and unweighted graph with candidate lexical units as nodes for a specific text and adds connections (edges) between those nodes that co-occur within a window of N words. ...
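As a small illustration of the graph construction step described above (candidate words as nodes, edges between candidates that co-occur within a window of N words), here is a sketch; the tokenization and candidate filtering are placeholders, and any PageRank-style iteration could then be run on the resulting graph.

```python
from collections import defaultdict

# Build an undirected, unweighted co-occurrence graph over candidate words,
# linking candidates that appear within `window` tokens of each other.
def build_cooccurrence_graph(tokens, candidates, window=4):
    graph = defaultdict(set)
    for i, w in enumerate(tokens):
        if w not in candidates:
            continue
        for j in range(i + 1, min(i + window, len(tokens))):
            v = tokens[j]
            if v in candidates and v != w:
                graph[w].add(v)          # undirected edge between co-occurring candidates
                graph[v].add(w)
    return graph

tokens = "graph based ranking methods build a graph from the document".split()
graph = build_cooccurrence_graph(tokens, candidates={"graph", "ranking", "methods", "document"})
```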
Preprint
Automated keyphrase extraction is a fundamental textual information processing task concerned with the selection of representative phrases from a document that summarize its content. This work presents a novel unsupervised method for keyphrase extraction, whose main innovation is the use of local word embeddings (in particular GloVe vectors), i.e., embeddings trained from the single document under consideration. We argue that such local representations of words and keyphrases are able to accurately capture their semantics in the context of the document they are part of, and can therefore help in improving keyphrase extraction quality. Empirical results offer evidence that local representations indeed lead to better keyphrase extraction results compared both to embeddings trained on very large third-party corpora or on larger corpora consisting of several documents from the same scientific field, and to other state-of-the-art unsupervised keyphrase extraction methods.
... It models a random walk occurring on the network, and notably includes the possibility for the walker to teleport anywhere in the network at any step. The Hub and Authority Scores [12] are two complementary measures also based on random walks. ...
Preprint
While online communities have become increasingly important over the years, the moderation of user-generated content is still performed mostly manually. Automating this task is an important step in reducing the financial cost associated with moderation, but the majority of automated approaches strictly based on message content are highly vulnerable to intentional obfuscation. In this paper, we discuss methods for extracting conversational networks based on raw multi-participant chat logs, and we study the contribution of graph features to a classification system that aims to determine if a given message is abusive. The conversational graph-based system yields unexpectedly high performance, with results comparable to those previously obtained with a content-based approach.
... This type of spam appears as numerous links from a large number of web pages to a few target web pages. Studies on link spam have been receiving attention due to the limitations of PageRank [31] and HITS [24]. Owing to distinctive link characteristics, many spam detection approaches based on the structure of the web link graph have been introduced [18,40,25,3,41,8]. ...
Preprint
Full-text available
In the last decade we have witnessed the explosive growth of online social networking services (SNSs) such as Facebook, Twitter, RenRen and LinkedIn. While SNSs provide diverse benefits, for example, fostering interpersonal relationships, community formation, and news propagation, they have also attracted uninvited nuisances. Spammers abuse SNSs as vehicles to spread spam rapidly and widely. Spam, i.e., unsolicited or inappropriate messages, significantly impairs the credibility and reliability of services. Therefore, detecting spammers has become an urgent and critical issue in SNSs. This paper deals with follow spam in Twitter, in which, instead of spreading annoying messages to the public, a spammer follows (subscribes to) legitimate users in the hope of being followed back. Based on the assumption that the online relationships of spammers are different from those of legitimate users, we propose classification schemes that detect follow spammers. In particular, we focus on cascaded social relations and devise two schemes, TSP-Filtering and SS-Filtering, which utilize the Triad Significance Profile (TSP) and Social Status (SS), respectively, in a two-hop subnetwork centered at each user. We also propose an ensemble technique, Cascaded-Filtering, that combines both TSP and SS properties. Our experiments on real Twitter datasets demonstrate that the proposed three approaches are very practical. The proposed schemes are scalable because, instead of analyzing the whole network, they inspect user-centered two-hop social networks. Our performance study shows that the proposed methods yield significantly better performance than the prior scheme in terms of true positives and false positives.
... Other research work combined anchor texts with additional information. Some researchers aggregated anchor texts from all pages that link to a page in the same domain [19], or the same website [14,16] as the target page. Regarding time, historical trends of anchor texts have also been investigated for estimating anchor text importance. ...
Preprint
Recent advances in preservation technologies have led to an increasing number of Web archive systems and collections. These collections are valuable for exploring the past of the Web, but their value can only be uncovered with effective access and exploration mechanisms. Ideal search and ranking methods must be robust to the high redundancy and the temporal noise of contents, as well as scalable to the huge amount of data archived. Despite several attempts at Web archive search, facilitating access to Web archives still remains a challenging problem. In this work, we conduct a first analysis of different ranking strategies that exploit evidence from metadata instead of the full content of documents. We perform a first study to compare the usefulness of non-content evidence for Web archive search, where the evidence is mined from the metadata of file headers, links and URL strings only. Based on these findings, we propose a simple yet surprisingly effective learning model that combines multiple evidence to distinguish "good" from "bad" search results. We conduct empirical experiments quantitatively as well as qualitatively to confirm the validity of our proposed method, as a first step towards better ranking in Web archives taking metadata into account.
... From the discussion above, inverse query frequency is more informative than user frequency in a click graph. Some early works like PageRank [4] and [13] tried to identify the global properties of documents with link analysis on the hyperlink graph. This motivates us to consider the inverse query frequency as a global property of URL, and to develop the global consistency model for query representation on a click graph, in which the global nature of URLs plays a central role and user frequency should be incorporated in tune with the global nature of the URL. ...
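To illustrate the idea of treating inverse query frequency as a global property of a URL on the click graph, the following sketch combines a local click count with a TF-IDF-style inverse query frequency; this is a generic illustration of the idea, not the exact entropy-biased or global-consistency formula of the cited works.

```python
import math
from collections import defaultdict

# Weight each (query, URL) edge of a click graph by local click frequency
# times an inverse query frequency that reflects the URL's global property.
def weight_click_graph(clicks):
    # clicks: list of (query, url, click_count) triples (assumed input format)
    queries_per_url = defaultdict(set)
    all_queries = set()
    for q, u, _ in clicks:
        queries_per_url[u].add(q)
        all_queries.add(q)
    weights = {}
    for q, u, count in clicks:
        iqf = math.log(len(all_queries) / len(queries_per_url[u]))   # global property of the URL
        weights[(q, u)] = count * iqf                                # local frequency x global weight
    return weights
```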
Preprint
Extensive research has been conducted on query log analysis. A query log is generally represented as a bipartite graph on a query set and a URL set. Most of the traditional methods used the raw click frequency to weigh the link between a query and a URL on the click graph. In order to address the disadvantages of raw click frequency, researchers proposed the entropy-biased model, which incorporates raw click frequency with inverse query frequency of the URL as the weighting scheme for query representation. In this paper, we observe that the inverse query frequency can be considered a global property of the URL on the click graph, which is more informative than raw click frequency, which can be considered a local property of the URL. Based on this insight, we develop the global consistency model for query representation, which utilizes the click frequency and the inverse query frequency of a URL in a consistent manner. Furthermore, we propose a new scheme called inverse URL frequency as an effective way to capture the global property of a URL. Experiments have been conducted on the AOL search engine log data. The result shows that our global consistency model achieved better performance than the current models.
... Fourth, Kleinberg has pointed out that in a hyperlinked web environment, a "good" authority represents a page that is linked to by many hubs [19]. Similarly, academic authority can be designated by being highly cited by many other researchers in a specific domain of expertise. ...
Preprint
A widely used measure of scientific impact is citations. However, due to their heavy-tailed distribution, citations are fundamentally difficult to predict. Instead, to characterize scientific impact, we address two analogous questions asked by many scientific researchers: "How will my h-index evolve over time, and which of my previously or newly published papers will contribute to it?" To answer these questions, we perform two related tasks. First, we develop a model to predict authors' future h-indices based on their current scientific impact. Second, we examine the factors that drive papers---either previously or newly published---to increase their authors' predicted future h-indices. By leveraging relevant factors, we can predict an author's h-index in five years with an R2 value of 0.92 and whether a previously (newly) published paper will contribute to this future h-index with an F1 score of 0.99 (0.77). We find that topical authority and publication venue are crucial to these effective predictions, while topic popularity is surprisingly inconsequential. Further, we develop an online tool that allows users to generate informed h-index predictions. Our work demonstrates the predictability of scientific impact, and can help scholars to effectively leverage their position of "standing on the shoulders of giants."
... 29-30]). Web search results ranking algorithms take into account additional parameters such as the number of links pointing to the given page [9,10], the anchor text of the links pointing to the page, the placement of the search terms in the document (terms occurring in title or header may get a higher weight), the distance between the search terms, popularity of the page (in terms of the number of times it is visited), the text appearing in metatags [11], subject-specific authority of the page [12,13], recency in search index, and exactness of match [14]. ...
Preprint
In this paper we present a number of measures that compare rankings of search engine results. We apply these measures to five queries that were monitored daily for two periods of about 21 days each. Rankings of the different search engines (Google, Yahoo and Teoma for text searches and Google, Yahoo and Picsearch for image searches) are compared on a daily basis, in addition to longitudinal comparisons of the same engine for the same query over time. The results and rankings of the two periods are compared as well.
... Hyperlink-Induced Topic Search (HITS) [23] ranks web pages in a web linkage graph W by a 2-phase iterative update, the authority update and the hub update. Similar to Adsorption, the authority update requires each node i to generate the output values damped by d and scaled by A(i, j), where matrix A = W^T W, while the hub update scales a node's output values by ...
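The matrix form referenced here (authority scores driven by A = W^T W) can be sketched as follows; the damping by d used in the framework discussed in the text is omitted, and the link-matrix convention is an assumption for illustration.

```python
import numpy as np

def hits_matrix(W, iterations=50):
    # W[i, j] = 1 if page i links to page j (assumed convention); the authority
    # vector is effectively iterated with A = W^T W and the hub vector with W W^T.
    W = np.asarray(W, dtype=float)
    a = np.ones(W.shape[0])
    h = np.ones(W.shape[0])
    for _ in range(iterations):
        a = W.T @ h                              # authority update: hub mass of in-neighbours
        h = W @ a                                # hub update: authority mass of out-neighbours
        a /= np.linalg.norm(a) or 1.0            # normalize to keep the iteration bounded
        h /= np.linalg.norm(h) or 1.0
    return a, h
```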
Preprint
Myriad of graph-based algorithms in machine learning and data mining require parsing relational data iteratively. These algorithms are implemented in a large-scale distributed environment in order to scale to massive data sets. To accelerate these large-scale graph-based iterative computations, we propose delta-based accumulative iterative computation (DAIC). Different from traditional iterative computations, which iteratively update the result based on the result from the previous iteration, DAIC updates the result by accumulating the "changes" between iterations. By DAIC, we can process only the "changes" to avoid the negligible updates. Furthermore, we can perform DAIC asynchronously to bypass the high-cost synchronous barriers in heterogeneous distributed environments. Based on the DAIC model, we design and implement an asynchronous graph processing framework, Maiter. We evaluate Maiter on local cluster as well as on Amazon EC2 Cloud. The results show that Maiter achieves as much as 60x speedup over Hadoop and outperforms other state-of-the-art frameworks.
... This leads to a recursive definition of status which is mathematically addressed by eigenvector analysis. Since the web's hyperlink structure mimics the properties of a social network graph (WWW pages are nodes, hyperlinks are edges), eigenvector analysis can also be used to measure the prestige of web pages; well-known algorithms include PageRank, SALSA (Lempel and Moran, 2000) and HITS (Kleinberg, 1999). However, in these algorithms all edges by definition have binary weights: a hyperlink either exists or does not exist, and a social relationship exists or does not exist. ...
Preprint
The field of digital libraries (DLs) coalesced in 1994: the first digital library conferences were held that year, awareness of the World Wide Web was accelerating, and the National Science Foundation awarded $24 Million (U.S.) for the Digital Library Initiative (DLI). In this paper we examine the state of the DL domain after a decade of activity by applying social network analysis to the co-authorship network of the past ACM, IEEE, and joint ACM/IEEE digital library conferences. We base our analysis on a common binary undirectional network model to represent the co-authorship network, and from it we extract several established network measures. We also introduce a weighted directional network model to represent the co-authorship network, for which we define AuthorRank as an indicator of the impact of an individual author in the network. The results are validated against conference program committee members in the same period. The results show clear advantages of PageRank and AuthorRank over degree, closeness and betweenness centrality metrics. We also investigate the amount and nature of international participation in the Joint Conference on Digital Libraries (JCDL).
... Among the several proposed methods for measuring word influence, the majority of them focus on directed weighted graphs (e.g., the web, social networks, citations), in which influence is considered to spread through the edges. Methods such as PageRank [10], authority computation [45] and random graph simulations [46] all make use of the link structure. In this paper, we utilize the influence calculating algorithm proposed by [20], where word influence is obtained by random walk. ...
Preprint
Academic literature retrieval is concerned with the selection of papers that are most likely to match a user's information needs. Most of the retrieval systems are limited to list-output models, in which the retrieval results are isolated from each other. In this work, we aim to uncover the relationships of the retrieval results and propose a method for building structural retrieval results for academic literatures, which we call a paper evolution graph (PEG). A PEG describes the evolution of the diverse aspects of input queries through several evolution chains of papers. By utilizing the author, citation and content information, PEGs can uncover the various underlying relationships among the papers and present the evolution of articles from multiple viewpoints. Our system supports three types of input queries: keyword, single-paper and two-paper queries. The construction of a PEG mainly consists of three steps. First, the papers are soft-clustered into communities via metagraph factorization during which the topic distribution of each paper is obtained. Second, topically cohesive evolution chains are extracted from the communities that are relevant to the query. Each chain focuses on one aspect of the query. Finally, the extracted chains are combined to generate a PEG, which fully covers all the topics of the query. The experimental results on a real-world dataset demonstrate that the proposed method is able to construct meaningful PEGs.
... In a road network for example the degree of an intersection would just be proportional to the number of cars passing through it. As another example, consider eigenvector centrality [10,11,12], a measure of centrality akin to an extended form of degree centrality and closely related to "PageRank" and similar centrality measures used in web search engines [13,14]. The eigenvector centrality x i of a vertex in an unweighted network is defined to be proportional to the sum of the centralities of the vertex's neighbors, so that a vertex can acquire high centrality either by being connected to a lot of others (as with simple degree centrality) or by being connected to others that themselves are highly central. ...
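In symbols (the standard eigenvector-centrality definition, consistent with the prose above), with adjacency matrix A and its largest eigenvalue \lambda:

x_i = \frac{1}{\lambda} \sum_j A_{ij} x_j, \qquad \text{equivalently} \qquad A \mathbf{x} = \lambda \mathbf{x},

so the centrality vector x is the leading eigenvector of A.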
Preprint
The connections in many networks are not merely binary entities, either present or not, but have associated weights that record their strengths relative to one another. Recent studies of networks have, by and large, steered clear of such weighted networks, which are often perceived as being harder to analyze than their unweighted counterparts. Here we point out that weighted networks can in many cases be analyzed using a simple mapping from a weighted network to an unweighted multigraph, allowing us to apply standard techniques for unweighted graphs to weighted ones as well. We give a number of examples of the method, including an algorithm for detecting community structure in weighted networks and a new and simple proof of the max-flow/min-cut theorem.
... In this work, we used FastText [24] for the text embedding task (both because of the speed of training and because in [47] it was found to be more effective than other models). Subsequently, we identified central users in the graph using the HITS algorithm [26] (corresponding to the 30% of users with the highest hub score and the 30% of users with the highest authority score), and utilized their 627 embeddings to calculate the centroids of the network. With one quantity defined as the sum of distances between the embeddings of central users and the centroids of their respective groups, and another as the sum of distances from the global centroid, we finally computed the Echo Chamber Risk score from these two sums. ...
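A minimal sketch of the central-user selection step described above, assuming NetworkX; the interaction graph and user names are purely illustrative, while the 30% threshold is taken from the quoted text:

    import networkx as nx

    # Hypothetical directed interaction graph (e.g., user A replies to / retweets user B).
    G = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")])

    hubs, authorities = nx.hits(G, max_iter=1000, normalized=True)

    # Keep the top 30% of users by hub score and by authority score,
    # as in the selection step quoted above.
    k = max(1, int(0.3 * G.number_of_nodes()))
    top_hubs = sorted(hubs, key=hubs.get, reverse=True)[:k]
    top_auth = sorted(authorities, key=authorities.get, reverse=True)[:k]
    central_users = set(top_hubs) | set(top_auth)
    print(central_users)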
Article
Full-text available
Social media platforms have become central arenas for public discourse, enabling the exchange of ideas and information among diverse user groups. However, the rise of echo chambers, where individuals reinforce their existing beliefs through repeated interactions with like-minded users, poses significant challenges to the democratic exchange of ideas and the potential for polarization and information disorder. This paper presents a comparative analysis of the main metrics that have been proposed in the literature for echo chamber detection, with a focus on their application in a cross-platform scenario constituted by the two major social media platforms, i.e., Twitter (now renamed X) and Reddit. The echo chamber detection metrics considered encompass network analysis, content analysis, and hybrid solutions. The findings of this work shed light on the unique dynamics of echo chambers present on the two social media platforms, while also highlighting the strengths and limitations of various metrics employed to identify them, and their transversality to the different social graph modeling and domains considered.
... The authors used recommendation algorithms for Web Pages to select the most informative sentences. The proposed algorithms used both Google's PageRank [23] and HITS [24]. ...
Preprint
Huge volumes of textual information are produced every single day. In order to organize and understand such large datasets, summarization techniques have become popular in recent years. These techniques aim at finding relevant, concise and non-redundant content in such big data. While network methods have been adopted to model texts in some scenarios, a systematic evaluation of multilayer network models in the multi-document summarization task has been limited to a few studies. Here, we evaluate the performance of a multilayer-based method to select the most relevant sentences in the context of an extractive multi-document summarization (MDS) task. In the adopted model, nodes represent sentences and edges are created based on the number of shared words between sentences. Differently from previous studies in multi-document summarization, we make a distinction between edges linking sentences from different documents (inter-layer) and those connecting sentences from the same document (intra-layer). As a proof of principle, our results reveal that such a distinction between intra- and inter-layer edges in a multilayer representation is able to improve the quality of the generated summaries. This piece of information could be used to improve current statistical methods and related textual models.
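A minimal sketch of the graph construction described in this abstract, under the stated assumptions of a toy two-document corpus and naive whitespace tokenization: sentences become nodes, edge weights count shared words, and each edge is labeled intra- or inter-layer depending on whether its endpoints come from the same document.

    import itertools
    import networkx as nx

    # Toy corpus: two documents, each a list of sentences (illustrative only).
    docs = {
        "d1": ["the cat sat on the mat", "the dog chased the cat"],
        "d2": ["a dog barked at the mailman", "the mailman ran away"],
    }

    G = nx.Graph()
    sentences = [(doc, i, s.split()) for doc, sents in docs.items() for i, s in enumerate(sents)]
    for doc, i, words in sentences:
        G.add_node((doc, i))

    # Edge weight = number of shared words; label each edge intra- or inter-layer
    # depending on whether the two sentences belong to the same document.
    for (d1, i1, w1), (d2, i2, w2) in itertools.combinations(sentences, 2):
        shared = len(set(w1) & set(w2))
        if shared > 0:
            layer = "intra" if d1 == d2 else "inter"
            G.add_edge((d1, i1), (d2, i2), weight=shared, layer=layer)

    for u, v, data in G.edges(data=True):
        print(u, v, data)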
... However, due to the prohibitive cost of computing the rank-k truncated SVD, in practical applications one typically computes a rank-k approximate SVD that satisfies some tolerance requirements [17,27,30,40,64]. The rank-k approximate SVD has been applied in many research areas, including principal component analysis (PCA) [36,56], web search models [37], information retrieval [4,23], and face recognition [50,68]. ...
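As a reference point for the rank-k SVD mentioned in this context, a minimal NumPy sketch on a random stand-in matrix; a practical system would use a sparse or randomized solver rather than the full SVD computed here:

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.standard_normal((200, 100))   # stand-in for a term-document or link matrix
    k = 10

    # Full SVD, then keep the k leading singular triplets; in practice one would use
    # a sparse/randomized solver (e.g. scipy.sparse.linalg.svds) instead.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_k = U[:, :k] * s[:k] @ Vt[:k, :]

    # The truncated SVD is the best rank-k approximation in the Frobenius norm:
    # the error equals the norm of the discarded singular values.
    print(np.linalg.norm(M - M_k), np.sqrt(np.sum(s[k:] ** 2)))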
Preprint
We present Flip-Flop Spectrum-Revealing QR (Flip-Flop SRQR) factorization, a significantly faster and more reliable variant of the QLP factorization of Stewart, for low-rank matrix approximations. Flip-Flop SRQR uses SRQR factorization to initialize a partial column-pivoted QR factorization and then computes a partial LQ factorization. As observed by Stewart in his original QLP work, Flip-Flop SRQR tracks the exact singular values with "considerable fidelity". We develop singular value lower bounds and residual error upper bounds for the Flip-Flop SRQR factorization. In situations where the singular values of the input matrix decay relatively quickly, the low-rank approximation computed by SRQR is guaranteed to be as accurate as the truncated SVD. We also perform a complexity analysis to show that, for the same accuracy, Flip-Flop SRQR is faster than randomized subspace iteration for approximating the SVD, the standard method used in the Matlab tensor toolbox. We also compare Flip-Flop SRQR with alternatives on two applications, tensor approximation and nuclear norm minimization, to demonstrate its efficiency and effectiveness.
... Articles with a truly extraordinary knack for attracting citations are the authorities of the citation network. Articles, like review papers, that cite a considerable number of references are the network hubs (Kleinberg, 1999). Highly cited review papers are both authorities and hubs: they are connectors, with the peculiar ability to relate ostensibly different topics and to create short citation paths between any two nodes in the system, making the citation network look like a small world. ...
Preprint
Computer science is a relatively young discipline combining science, engineering, and mathematics. The main flavors of computer science research involve the theoretical development of conceptual models for the different aspects of computing and the more applicative building of software artifacts and assessment of their properties. In the computer science publication culture, conferences are an important vehicle to quickly move ideas, and journals often publish deeper versions of papers already presented at conferences. These peculiarities of the discipline make computer science an original research field within the sciences, and, therefore, the assessment of classical bibliometric laws is particularly important for this field. In this paper, we study the skewness of the distribution of citations to papers published in computer science publication venues (journals and conferences). We find that the skewness in the distribution of mean citedness of different venues combines with the asymmetry in citedness of articles in each venue, resulting in a highly asymmetric citation distribution with a power law tail. Furthermore, the skewness of conference publications is more pronounced than the asymmetry of journal papers. Finally, the impact of journal papers, as measured with bibliometric indicators, largely dominates that of proceedings papers.
... Ranking or sorting items can be seen as deducing a linear order for the items. Applications for ranking include, for example, finding relevant web pages [14,15] or ranking database query results [16]. One of the key differences between these approaches and ours is that in our case the reversed order is as good as the original. ...
Preprint
Items in many datasets can be arranged into a natural order. Such orders are useful since they can provide new knowledge about the data and may ease further data exploration and visualization. Our goal in this paper is to define a statistically well-founded and objective score measuring the quality of an order. Such a measure can be used to determine whether the current order contains any valuable information or whether it can be discarded. Intuitively, we say that the order is good if dependent attributes are close to each other. To define the order score, we fit an order-sensitive model to the dataset. Our model resembles a Markov chain model, that is, the attributes depend only on their immediate neighbors. The score of the order is the BIC score of the best model. For computing the measure we introduce a fast dynamic program. The score is then compared against random orders: if it is better than the scores of the random orders, we say that the order is good. We also show the asymptotic connection between the score function and the number of free parameters of the model. In addition, we introduce a simple greedy approach for finding an order with a good score. We evaluate the score for synthetic and real datasets using different spectral orders and the orders obtained with the greedy method.
... In the context of the web of documents, a hyperlink indicates a relationship between pieces of information carried by two web pages. Although such relationships are generally of rather coarse granularity, they nevertheless form an essential component of the most widely recognized ranking algorithms (PageRank (Page et al., 1999), HITS (Kleinberg, 1999), SALSA (Lempel, Moran, 2001)). ...
Preprint
The advances of the Linked Open Data (LOD) initiative are giving rise to a more structured web of data. Indeed, a few datasets act as hubs (e.g., DBpedia) connecting many other datasets. They have also made possible new web services for entity detection inside plain text (e.g., DBpedia Spotlight), thus allowing for new applications that will benefit from a combination of the web of documents and the web of data. To ease the emergence of these new use cases, we propose a query-biased algorithm for the ranking of entities detected inside a web page. Our algorithm combines link analysis with dimensionality reduction. We use crowdsourcing to build a publicly available and reusable dataset on which we compare our algorithm to the state of the art. Finally, we use this algorithm for the construction of semantic snippets, for which we evaluate the usability and the usefulness with a crowdsourcing-based approach.
... Low-rank matrix approximation is an integral component of tools such as principal component analysis (PCA) [29]. It is also an important instrument used in many applications like computer vision (e.g., face recognition) [49], signal processing (e.g., adaptive beamforming) [41], recommender systems [19], information retrieval and latent semantic indexing [7], [6], web search modeling [30], DNA microarray data [3], [42] and text mining, to name a few examples. Several algorithms have been proposed in the literature for finding low rank approximations of matrices [35], [57], [26], [18], [11]. ...
Preprint
Low rank approximation is an important tool used in many applications of signal processing and machine learning. Recently, randomized sketching algorithms were proposed to effectively construct low rank approximations and obtain approximate singular value decompositions of large matrices. Similar ideas were used to solve least squares regression problems. In this paper, we show how matrices from error correcting codes can be used to find such low rank approximations and matrix decompositions, and extend the framework to linear least squares regression problems. The benefits of using these code matrices are the following: (i) They are easy to generate and they reduce randomness significantly. (ii) Code matrices with mild properties satisfy the subspace embedding property, and have a better chance of preserving the geometry of an entire subspace of vectors. (iii) For parallel and distributed applications, code matrices have significant advantages over structured random matrices and Gaussian random matrices. (iv) Unlike Fourier or Hadamard transform matrices, which require sampling O(k\log k) columns for a rank-k approximation, the log factor is not necessary for certain types of code matrices. That is, (1+\epsilon) optimal Frobenius norm error can be achieved for a rank-k approximation with O(k/\epsilon) samples. (v) Fast multiplication is possible with structured code matrices, so fast approximations can be achieved for general dense input matrices. (vi) For least squares regression problem \min\|Ax-b\|_2 where A\in\mathbb{R}^{n\times d}, the (1+\epsilon) relative error approximation can be achieved with O(d/\epsilon) samples, with high probability, when certain code matrices are used.
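The following is a generic randomized range-finder sketch (Gaussian sketching matrix, not the error-correcting-code matrices proposed in this abstract), included only to illustrate the sketch-then-decompose idea; matrix sizes and ranks are illustrative:

    import numpy as np

    def randomized_low_rank(A, k, oversample=10, rng=None):
        """Gaussian-sketch range finder: a generic stand-in for the structured
        sketching matrices discussed in the abstract above."""
        rng = np.random.default_rng(rng)
        S = rng.standard_normal((A.shape[1], k + oversample))  # sketching matrix
        Q, _ = np.linalg.qr(A @ S)     # orthonormal basis for the sketched range
        B = Q.T @ A                    # small projected matrix
        U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
        return (Q @ U_b)[:, :k], s[:k], Vt[:k, :]

    rng = np.random.default_rng(1)
    A = rng.standard_normal((500, 40)) @ rng.standard_normal((40, 300))  # low-rank test matrix
    U, s, Vt = randomized_low_rank(A, k=40)
    print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))  # ~0 for this rank-40 input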
... Furthermore, the model has been applied to calculate structural node properties in large networks. HITS [17] and PageRank [4,25] rank network nodes according to their values in the stationary distribution of the random surfer model. Especially for the latter, there exists a detailed analysis ranging from the efficiency of its calculation to its robustness [3,18]. ...
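A minimal sketch of the random-surfer stationary distribution mentioned above: damped power iteration on a toy four-node directed graph (the adjacency list, damping factor and iteration count are illustrative, and dangling nodes are not handled):

    import numpy as np

    # Toy directed graph as an adjacency list (illustrative; every node has out-links).
    links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
    n, d = 4, 0.85

    # Column-stochastic transition matrix of the random surfer.
    P = np.zeros((n, n))
    for src, outs in links.items():
        for dst in outs:
            P[dst, src] = 1.0 / len(outs)

    # Power iteration on the damped surfer; the fixed point is the PageRank vector.
    r = np.full(n, 1.0 / n)
    for _ in range(100):
        r = (1 - d) / n + d * (P @ r)

    print(np.round(r, 3))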
Preprint
Websites have an inherent interest in steering user navigation in order to, for example, increase sales of specific products or categories, or to guide users towards specific information. In general, website administrators can use the following two strategies to influence their visitors' navigation behavior. First, they can introduce click biases to reinforce specific links on their website by changing their visual appearance, for example, by locating them on the top of the page. Second, they can utilize link insertion to generate new paths for users to navigate over. In this paper, we present a novel approach for measuring the potential effects of these two strategies on user navigation. Our results suggest that, depending on the pages for which we want to increase user visits, optimal link modification strategies vary. Moreover, simple topological measures can be used as proxies for assessing the impact of the intended changes on the navigation of users, even before these changes are implemented.
... While a great many centrality measures exist for a given network, each with slightly different meanings, all centralities measure in some sense each node's role or importance in connecting the network [11]. Moreover, many of the most useful centrality measures are represented by eigenvectors of a matrix, for instance PageRank centrality [31], hub and authority centrality [32], dynamical importance [33], and classical eigenvector centrality [11]. For a single-layered network, any eigenvector-based centrality measure is described by the dominant eigenvector of a matrix C that is some function of the adjacency matrix A [34]. ...
Preprint
Today's colleges and universities consist of highly complex structures that dictate interactions between the administration, faculty, and student body. These structures can play a role in dictating the efficiency of policy enacted by the administration and determine the effect that curriculum changes in one department have on other departments. Despite the fact that the features of these complex structures have a strong impact on the institutions, they remain by-and-large unknown in many cases. In this paper we study the academic structure of our home institution of Trinity College in Hartford, CT using the major and minor patterns between graduating students to build a temporal multiplex network describing the interactions between different departments. Using recent network science techniques developed for such temporal networks we identify the evolving community structures that organize departments' interactions, as well as quantify the interdisciplinary centrality of each department. We implement this framework for Trinity College, finding practical insights and applications, but also present it as a general framework for colleges and universities to better understand their own structural makeup in order to better inform academic and administrative policy.
... Several modifications have been proposed to handle this situation. The Hub and Authority scores [35] are two complementary measures computed by the HITS algorithm (Hyperlink-Induced Topic Search). They solve the issue by splitting the centrality value into two parts: one for the incoming influence (Authority), and the other for the outgoing one (Hub). ...
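A minimal sketch of the hub/authority split described above, assuming NumPy and a toy directed adjacency matrix: the two scores are updated alternately and normalized, as in the HITS iteration.

    import numpy as np

    # Adjacency matrix of a small directed graph: A[i, j] = 1 if i links to j (toy values).
    A = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 0]], dtype=float)

    h = np.ones(4)   # hub scores (outgoing influence)
    a = np.ones(4)   # authority scores (incoming influence)
    for _ in range(50):
        a = A.T @ h   # a node is a good authority if good hubs point to it
        h = A @ a     # a node is a good hub if it points to good authorities
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)

    print(np.round(a, 3), np.round(h, 3))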
Preprint
Full-text available
Moderation of user-generated content in an online community is a challenge that has great socio-economic ramifications. However, the costs incurred by delegating this work to human agents are high. For this reason, an automatic system able to detect abuse in user-generated content is of great interest. There are a number of ways to tackle this problem, but the most commonly seen in practice are word filtering or regular expression matching. The main limitations are their vulnerability to intentional obfuscation on the part of the users, and their context-insensitive nature. Moreover, they are language-dependent and may require appropriate corpora for training. In this paper, we propose a system for automatic abuse detection that completely disregards message content. We first extract a conversational network from raw chat logs and characterize it through topological measures. We then use these as features to train a classifier on our abuse detection task. We thoroughly assess our system on a dataset of user comments originating from a French Massively Multiplayer Online Game. We identify the most appropriate network extraction parameters and discuss the discriminative power of our features, relative to their topological and temporal nature. Our method reaches an F-measure of 83.89 when using the full feature set, improving on existing approaches. With a selection of the most discriminative features, we dramatically cut computing time while retaining most of the performance (82.65).
... Web graphs: broad-topic queries [55] (6175 nodes, 16150 edges), google.com internal [56] (15763 nodes, 171206 edges), nd.edu domain [57] (325729 nodes, 1497134 edges), Baidu articles [58] (415641 nodes, 3284387 edges) ...
Preprint
Many real-world networks are large, complex and thus hard to understand, analyze or visualize. The data about networks are not always complete, their structure may be hidden, or they may change quickly over time. Therefore, understanding how an incomplete system differs from a complete one is crucial. In this paper, we study the changes in networks under simplification (i.e., reduction in size). We simplify 30 real-world networks with six simplification methods and analyze the similarity between original and simplified networks based on the preservation of several properties, for example degree distribution, clustering coefficient, betweenness centrality, density and degree mixing. We propose an approach for assessing the effectiveness of the simplification process, to define the most appropriate size of simplified networks and to determine the method which preserves the most properties of the original networks. The results reveal that the type and size of the original networks do not influence the changes of networks under the simplification process, while the size of the simplified networks does. Moreover, we investigate the performance of simplification methods when the size of the simplified networks is 10% of the original networks. The findings show that sampling methods outperform merging ones; in particular, random node selection based on degree and breadth-first sampling perform the best.
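As an illustration of one of the sampling strategies mentioned in this abstract (breadth-first sampling down to 10% of the original size), a minimal sketch assuming NetworkX; the Barabási–Albert graph stands in for a real-world network and all parameters are illustrative:

    import random
    import networkx as nx

    def bfs_sample(G, target_fraction=0.1, seed=None):
        """Breadth-first sampling: grow a connected subgraph from a random start node
        until it reaches the target fraction of the original network's size."""
        rng = random.Random(seed)
        target = max(1, int(target_fraction * G.number_of_nodes()))
        start = rng.choice(list(G.nodes))
        visited, queue = {start}, [start]
        while queue and len(visited) < target:
            node = queue.pop(0)
            for nbr in G.neighbors(node):
                if nbr not in visited:
                    visited.add(nbr)
                    queue.append(nbr)
                    if len(visited) >= target:
                        break
        return G.subgraph(visited).copy()

    G = nx.barabasi_albert_graph(1000, 3, seed=42)   # stand-in for a real-world network
    S = bfs_sample(G, 0.1, seed=1)
    print(S.number_of_nodes(), S.number_of_edges())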
... Of course, true ontologies also require concept nodes to be connected by informative relations, and in Section 5 we will see researchers mine such relations [Bellomi and Bonato 2005]. The two most prominent techniques for web analysis are Google's PageRank [Brin and Page 1998] and the HITS algorithm [Kleinberg 1998]; Bellomi and Bonato [2005] apply both to Wikipedia. The Wikipedia category graph also forms a network structure. ...
Preprint
Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval; using it for information extraction; and using it as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced.
... Because global searchers are not confined to some local neighborhood, they are able to consider vertices that are more than a single hop away from the query node. The class of global network ranking measures, typified by PageRank [30], HITS [16] and SALSA [20], is excellent at computing the global importance of objects according to their overall connectivity. Numerous additional studies stem from these initial approaches. ...
Preprint
Full-text available
Similarity search is a fundamental problem in social and knowledge networks like GitHub, DBLP, Wikipedia, etc. Existing network similarity measures are limited because they only consider similarity from the perspective of the query node. However, due to the complicated topology of real-world networks, ignoring the preferences of target nodes often results in odd or unintuitive performance. In this work, we propose a dual perspective similarity metric called Forward Backward Similarity (FBS) that efficiently computes topological similarity from the perspective of both the query node and the perspective of candidate nodes. The effectiveness of our method is evaluated by traditional quantitative ranking metrics and large-scale human judgement on four large real world networks. The proposed method matches human preference and outperforms other similarity search algorithms on community overlap and link prediction. Finally, we demonstrate top-5 rankings for five famous researchers on an academic collaboration network to illustrate how our approach captures semantics more intuitively than other approaches.
... [13] and [14] assign a global trust rating to each data source on a P2P network. Authority-hub analysis [15] and PageRank [16] decide trustworthiness based on link analysis [17]. As mentioned in Section II, our work is based on [3] to obtain the true probability of values in data sources for the following two reasons: [3] considers both source accuracy and the copying relationships between sources, and thus it can return a true probability with high precision. ...
Preprint
In the Big Data era, information integration often requires abundant data extracted from massive numbers of data sources. Because of the large number of data sources, data source selection plays a crucial role in information integration, since it is costly and even impossible to access all data sources. Data source selection should consider both efficiency and effectiveness issues. For efficiency, the approach should achieve high performance and be scalable to large numbers of data sources. For effectiveness, data quality and the overlap of sources have to be considered, since data quality varies widely across data sources, with significant differences in the accuracy and coverage of the data provided, and overlapping sources can even lower the quality of the data integrated from the selected sources. In this paper, we study the source selection problem in the Big Data era and propose methods that can scale to datasets with up to millions of data sources and guarantee the quality of the results. Motivated by this, we propose a new objective function that takes the expected number of true values a source can provide as a criterion to evaluate the contribution of a data source. Based on our proposed index, we present a scalable algorithm and two pruning strategies to improve efficiency without sacrificing precision. Experimental results on both real-world and synthetic data sets show that our methods can select sources providing a large proportion of true values efficiently and can scale to massive numbers of data sources.
... Hence, a knowledge graph constructed with the proposed methodology represents a network of hyperlinks that connect the different documents in the corpus. In order to measure artist relevance in our constructed graph, we applied the PageRank (Brin & Page, 1998) and HITS (Kleinberg, 1999) algorithms. PageRank outputs a measure of relevance for each node, and HITS gives two different results: authority and hubness. ...
Preprint
Today, a massive amount of musical knowledge is stored in written form, with testimonies dated as far back as several centuries ago. In this work, we present different Natural Language Processing (NLP) approaches to harness the potential of these text collections for automatic music knowledge discovery, covering different phases in a prototypical NLP pipeline, namely corpus compilation, text-mining, information extraction, knowledge graph generation and sentiment analysis. Each of these approaches is presented alongside different use cases (i.e., flamenco, Renaissance and popular music) where large collections of documents are processed, and conclusions stemming from data-driven analyses are presented and discussed.
Chapter
Generative Artificial Intelligence (AI) offers powerful tools that fundamentally change the design of information access systems; however, it is unclear how to use them to best serve the needs of people. At present, Large Language Models (LLMs) process natural language (and multi-modal) input and present credible-appearing but often completely untrue multi-modal output. This opens the door to research into how to produce true, complete, relevant information, where and how to design retrieval augmentation to personalize and ground the system, and how to evaluate beyond relevance for truth, completeness, utility, and satisfaction. The applications of generative AI for information-seeking tasks are broad. In this chapter, we present recent developments in four domains that have been well studied in the information retrieval community (education, biomedical, legal, and finance). We follow with a discussion of new challenges (agentic systems) and research areas that are common to most applications of generative AI to information seeking tasks (credibility and veracity, new paradigms for evaluation, and synthetic data generation). The field of Information Retrieval (IR) is at the leading edge of a transformation in how people access information and accomplish tasks. We have the rare opportunity to design and build the future we want to live in.
Preprint
A defining feature of many large empirical networks is their intrinsic complexity. However, many networks also contain a large degree of structural repetition. An immediate question then arises: can we characterize essential network complexity while excluding structural redundancy? In this article we utilize inherent network symmetry to collapse all redundant information from a network, resulting in a coarse-graining which we show to carry the essential structural information of the 'parent' network. In the context of algebraic combinatorics, this coarse-graining is known as the quotient. We systematically explore the theoretical properties of network quotients and summarize key statistics of a variety of 'real-world' quotients with respect to those of their parent networks. In particular, we find that quotients can be substantially smaller than their parent networks yet typically preserve various key functional properties such as complexity (heterogeneity and hub vertices) and communication (diameter and mean geodesic distance), suggesting that quotients constitute the essential structural skeleton of their parent network. We summarize with a discussion of potential uses of quotients in analysis of biological regulatory networks and ways in which using quotients can reduce the computational complexity of network algorithms.
Preprint
Methods for efficiently controlling dynamics propagated on networks are usually based on identifying the most influential nodes. Knowledge of these nodes can be used for the targeted control of dynamics such as epidemics, or for modifying biochemical pathways relating to diseases. Similarly they are valuable for identifying points of failure to increase network resilience in, for example, social support networks and logistics networks. Many measures, often termed `centrality', have been constructed to achieve these aims. Here we consider Katz centrality and provide a new interpretation as a steady-state solution to continuous-time dynamics. This enables us to implement a sensitivity analysis which is similar to metabolic control analysis used in the analysis of biochemical pathways. The results yield a centrality which quantifies, for each node, the net impact of its absence from the network. It also has the desirable property of requiring a node with a high centrality to play a central role in propagating the dynamics of the system by having the capacity to both receive flux from others and then to pass it on. This new perspective on Katz centrality is important for a more comprehensive analysis of directed networks.
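For reference, a minimal sketch of the standard Katz centrality formula on a toy directed graph (NumPy; the attenuation factor is chosen below 1/λ_max); the sensitivity analysis and continuous-time interpretation developed in the abstract above are not reproduced here:

    import numpy as np

    # Toy directed adjacency matrix: A[i, j] = 1 if i links to j (illustrative values).
    A = np.array([[0, 1, 0, 1],
                  [0, 0, 1, 0],
                  [1, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    lam_max = np.abs(np.linalg.eigvals(A)).max()
    alpha, beta = 0.8 / lam_max, 1.0            # alpha must stay below 1 / lambda_max

    # Katz centrality: x = beta * (I - alpha * A^T)^{-1} * 1, i.e. each node counts
    # incoming walks of all lengths, attenuated by alpha per step.
    n = A.shape[0]
    x = beta * np.linalg.solve(np.eye(n) - alpha * A.T, np.ones(n))
    print(np.round(x, 3))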
Article
Modeling the number of citations from one journal to another may be done by assuming independent contributions from the referencing journal and from the cited journal. Empirical and theoretical evidence, however, indicates that self-citations are different from interjournal citations. For this reason a model is proposed that separates the analysis of self-citations from inter-citations. In addition, a model is proposed that adjusts the expected citation counts by the journal-to-journal similarity. Computational procedures for fitting coefficients of the models to the observed citation pattern are described along with a statistical method for evaluating the validity of the model.
Article
Co-citation analysis is based on the assumption that all citing articles view the scientific literature from a common point-of-view. When a co-citation matrix is analyzed, this assumption affects measures of the dimensionality and clustering of articles. Therefore, before a co-citation matrix is constructed, the citing articles should be limited to those written by individuals in an invisible college.
Article
Using Markov Chain theory we give further insight into the citation influence methodology for scientific publications which was initially described by Pinski and Narin.
Article
Measures of cluster-based retrieval effectiveness are computed for five composite representations in the cystic fibrosis (CF) Document Collection. The composite representations are constructed from combinations of two subject representations, based on Medical Subject Headings and subheadings, and two citation representations, consisting of the complete list of cited references and a comprehensive list of citations for each document. Experimental retrieval results are presented as a function of the exhaustivity and similarity of the composite representations and reveal consistent patterns from which optimal performance levels can be identified. The optimal performance values provide an assessment of the absolute capacity of each composite representation to associate documents relevant to the same query and discriminate between documents relevant to different queries in single-link hierarchies. The optimal performance values for all composite representations are completely comparable and are superior to the optimal performance of constituent representations. Optimal performance consistently occurs at low levels of exhaustivity. Exhaustive composite representations that include subject descriptions produce the lowest levels of performance; retrieval results derived from random structures are comparable to the observed results. The effectiveness of the exhaustive representation composed of references and citations is materially superior to the effectiveness of exhaustive composite representations that include subject descriptions. © 1991 John Wiley & Sons, Inc.
Article
Given a journal-to-journal citation matrix, it is straightforward to construct measures of relative standing for the journals in the citation network. The methodology is completely portable and can be used for citation networks where the units are national scientific communities. The new application points to the need to have a measure of standing that takes a wider citation environment into account. A new measure of standing that does this is proposed. The new measure imposes no additional computational burden, making it prudent to use the measure for any citation matrix.
Article
Instead of the two-year impact factor as used in the Journal Citation Reports, there is much in favor of using x-year impact factors (x > 0). These impact factors are studied as a function of x and compared with the average number of citations per paper to papers that appeared in the journal x years ago. It is shown that both are equal if and only if the derivative of the impact-factor function is zero. Based on this, a simple classification of impact-factor curves versus mean citation curves is established and examples are given. These results are also applied to recent practical data that were obtained by Rousseau.
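To illustrate the x-year impact factor as a function of x, a small worked sketch with hypothetical publication and citation counts (all numbers are invented for illustration): citations received in a given year to papers from the previous x years, divided by the number of papers published in those years.

    # Hypothetical per-year data for one journal: papers published, and citations
    # received in 2023 to that year's papers (illustrative numbers only).
    papers     = {2022: 120, 2021: 110, 2020: 100, 2019: 95}
    cites_2023 = {2022: 300, 2021: 280, 2020: 210, 2019: 150}

    def impact_factor(x, year=2023):
        """x-year impact factor: citations in `year` to papers from the previous
        x years, divided by the number of papers published in those years."""
        window = range(year - x, year)
        return sum(cites_2023[y] for y in window) / sum(papers[y] for y in window)

    for x in (2, 3, 4):
        print(x, round(impact_factor(x), 3))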
Article
A measure of the relative standing of professional journals in an aggregated journal citation matrix is considered. The measure can be viewed as an iteration of input-output methods or as a normalized eigenvector of a transformation matrix. The measure is applied in a network of 22 geographical journals for 1970–1972 and 1980–1982. The relative locations of the journals in the network for each period are discussed as well as the relative and absolute changes between the two periods.
Article
We empirically test existing theories on the provision of public goods, in particular air quality, using data on sulfur dioxide (SO2) concentrations from the Global Environment Monitoring Projects for 107 cities in 42 countries from 1971 to 1996. The results are as follows: First, we provide additional support for the claim that the degree of democracy has an independent positive effect on air quality. Second, we find that among democracies, presidential systems are more conducive to air quality than parliamentary ones. Third, in testing competing claims about the effect of interest groups on public goods provision in democracies we establish that labor union strength contributes to lower environmental quality, whereas the strength of green parties has the opposite effect.