Chapter

Fast and Attributed Change Detection on Dynamic Graphs with Density of States

Abstract

How can we detect traffic disturbances from international flight transportation logs, or changes to collaboration dynamics in academic networks? These problems can be formulated as detecting anomalous change points in a dynamic graph. Current solutions do not scale well to large real-world graphs, lack robustness to large numbers of node additions and deletions, and overlook changes in node attributes. To address these limitations, we propose a novel spectral method: Scalable Change Point Detection (SCPD). SCPD generates an embedding for each graph snapshot by efficiently approximating the distribution of the Laplacian spectrum at each step. SCPD can also capture shifts in node attributes by tracking correlations between attributes and eigenvectors. Through extensive experiments using synthetic and real-world data, we show that SCPD (a) achieves state-of-the-art performance, (b) is significantly faster than the state-of-the-art methods and can easily process millions of edges in a few CPU minutes, (c) can effectively tackle a large quantity of node attributes, additions or deletions and (d) discovers interesting events in large real-world graphs. Code is publicly available at https://github.com/shenyangHuang/SCPD.git.

Keywords: Anomaly Detection, Dynamic Graphs, Spectral Methods
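As a rough illustration of the idea in the abstract, the sketch below turns each snapshot into a fixed-length histogram of normalized-Laplacian eigenvalues. It is not the SCPD implementation: it computes the spectrum exactly with a dense eigendecomposition, whereas the abstract describes an efficient approximation, and the function names, the bin count and the L2 change score are assumptions made for the example.

```python
import numpy as np

def normalized_laplacian(adj):
    """Symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2} of an undirected graph."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
    return np.eye(len(adj)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]

def dos_embedding(adj, bins=10):
    """Histogram of Laplacian eigenvalues over [0, 2], normalized to sum to 1.

    The length depends only on `bins` (an assumed parameter), not on the
    number of nodes, so snapshots of different sizes remain comparable.
    """
    eigvals = np.linalg.eigvalsh(normalized_laplacian(adj))
    eigvals = np.clip(eigvals, 0.0, 2.0)  # guard against round-off outside [0, 2]
    hist, _ = np.histogram(eigvals, bins=bins, range=(0.0, 2.0))
    return hist / hist.sum()

def change_score(emb_prev, emb_curr):
    """Assumed change score: L2 distance between consecutive embeddings."""
    return float(np.linalg.norm(np.asarray(emb_curr) - np.asarray(emb_prev)))
```

Because the embedding length is fixed by the bin count, a snapshot that gains or loses nodes still maps to a vector of the same size, which is one plausible reading of why a spectral-distribution embedding tolerates node additions and deletions.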

... • Scalability: Due to their computational efficiency and ability to utilize only a small portion of graph nodes and edges to analyze overall behavior, the InnerCore discovery and expansion/decay calculations are suitable for large temporal graphs including Ethereum transaction and stablecoin networks. InnerCore is more effective and efficient than baselines (Batagelj and Zaversnik, 2011; Victor et al., 2021) and the state-of-the-art attributed change detection method in dynamic graphs (Huang et al., 2023) (§6). ...
... We refer to §5.1, InnerCore vs Alphacore for their differences. Additionally, we compare against Scalable Change Point Detection (SCPD) (Huang et al., 2023), the state-of-the-art attributed change detection method in dynamic graphs. ...
... SCPD is a state-of-the-art method to identify anomalies from attributed graph snapshots (Huang et al., 2023). Due to its spectral approach, we find it slower: InnerCore discovery runs nearly 7x faster on each daily graph snapshot, which demonstrates the scalability of our solution. ...
Article
Blockchains are significantly easing trade finance, with billions of dollars worth of assets being transacted daily. However, analyzing these networks remains challenging due to the sheer volume and complexity of the data. We introduce a method named InnerCore that detects market manipulators within blockchain-based networks and offers a sentiment indicator for these networks. This is achieved through data depth-based core decomposition and centered motif discovery, ensuring scalability. InnerCore is a computationally efficient, unsupervised approach suitable for analyzing large temporal graphs. We demonstrate its effectiveness by analyzing and detecting three recent real-world incidents from our datasets: the catastrophic collapse of LunaTerra, the Proof-of-Stake switch of Ethereum, and the temporary peg loss of USDC, while also verifying our results against external ground truth. Our experiments show that InnerCore can match the qualified analysis accurately without human involvement, automating blockchain analysis in a scalable manner, while being more effective and efficient than baselines and the state-of-the-art attributed change detection approach in dynamic graphs.
... This approach enables scalable hallucination detection while maintaining performance. By utilizing tools such as the Density of States (DOS) and the kernel polynomial method (KPM) for approximating EigenScore (Huang et al., 2023b; Lin et al., 2014), we aim to enhance the efficiency of our analysis in the context of confabulations, which we will demonstrate empirically with EES and SeND. ...
Preprint
As large language models (LLMs) become increasingly deployed across various industries, concerns regarding their reliability, particularly due to hallucinations (outputs that are factually inaccurate or irrelevant to user input), have grown. Our research investigates the relationship between the training process and the emergence of hallucinations to address a key gap in existing research that focuses primarily on post hoc detection and mitigation strategies. Using models from the Pythia suite (70M-12B parameters) and several hallucination detection metrics, we analyze hallucination trends throughout training and explore LLM internal dynamics. We introduce SEnsitive Neuron Dropout (SeND), a novel training protocol designed to mitigate hallucinations by reducing variance during training. SeND achieves this by deterministically dropping neurons with significant variability on a dataset, referred to as Sensitive Neurons. In addition, we develop an unsupervised hallucination detection metric, Efficient EigenScore (EES), which approximates the traditional EigenScore at 2x speed. This efficient metric is integrated into our protocol, allowing SeND to be both computationally scalable and effective at reducing hallucinations. Our empirical evaluation demonstrates that our approach improves LLM reliability at test time by up to 40% compared to normal training while also providing an efficient method to improve factual accuracy when adapting LLMs to domains such as Wikipedia and Medical datasets.
... However, when forecasting the trajectory of an ongoing epidemic (or simply epidemic forecasting), the dynamic contact network structure is inaccessible, as the future contacts have not been observed. In addition, accurately predicting future contact patterns proves challenging since the graph structures often undergo significant changes over time [6,7]. As such, utilizing a static, approximated network based on the past contact networks emerges as a more practical approach for real-time epidemic forecasting. ...
Article
Epidemic modeling is essential in understanding the spread of infectious diseases like COVID-19 and devising effective intervention strategies to control them. Recently, network-based disease models have integrated traditional compartment-based modeling with real-world contact graphs and shown promising results. However, in an ongoing epidemic, future contact network patterns are not observed yet. To address this, we use aggregated static networks to approximate future contacts for disease modeling. The standard method in the literature concatenates all edges from a dynamic graph into one collapsed graph, called the full static graph. However, the full static graph often leads to severe overestimation of key epidemic characteristics. Therefore, we propose two novel static network approximation methods, DegMST and EdgeMST, designed to preserve the sparsity of real-world contact networks while remaining connected. DegMST and EdgeMST use the frequency of temporal edges and the node degrees respectively to preserve sparsity. Our analysis shows that our models more closely resemble the network characteristics of the dynamic graph compared to the full static ones. Moreover, our analysis on seven real-world contact networks suggests EdgeMST yields more accurate estimations of disease dynamics for epidemic forecasting when compared to the standard full static method.
Article
Blockchain technology and cryptocurrencies have garnered considerable attention over the past fifteen years. The term Web3 (sometimes Web 3.0) has been coined to define a possible direction for the web based on the use of decentralisation via blockchain. Cryptocurrencies are characterised by high market volatility and susceptibility to substantial crashes, issues that require temporal analysis methodologies able to tackle the high temporal resolution, heterogeneity and scale of blockchain data. While existing research attempts to analyse crash events, fundamental questions persist regarding the optimal time scale for analysis, differentiation between long-term and short-term trends, and the identification and characterisation of shock events within these decentralised systems. This paper addresses these issues by examining cryptocurrencies traded on the Ethereum blockchain, with a spotlight on the crash of the stablecoin TerraUSD (UST) and the currency LUNA designed to stabilise it. Utilising complex network analysis and a multi-layer temporal graph allows the study of the correlations between the layers representing the currencies and system evolution across diverse time scales. The investigation sheds light on the strong interconnections among stablecoins pre-crash and the significant post-crash transformations. We identify anomalous signals before, during, and after the collapse, emphasising their impact on graph structure metrics and user movement across layers. This paper is novel in its use of temporal, cross-chain graph analysis to explore a cryptocurrency collapse. It emphasises the importance of temporal analysis for studies on web-derived data. In addition, the methodology shows how graph-based analysis can enhance traditional econometric results. Overall, this research carries implications beyond its field; for example, regulatory agencies aiming to safeguard users could use multi-layer temporal graphs as part of their suite of analysis tools.
Article
Dynamic graphs are rich data structures that are used to model complex relationships between entities over time. In particular, anomaly detection in temporal graphs is crucial for many real-world applications such as intrusion identification in network systems, detection of ecosystem disturbances and detection of epidemic outbreaks. In this paper, we focus on change point detection in dynamic graphs and address three main challenges associated with this problem: i) how to compare graph snapshots across time, ii) how to capture temporal dependencies, and iii) how to combine different views of a temporal graph. To solve the above challenges, we first propose Laplacian Anomaly Detection (LAD), which uses the spectrum of the graph Laplacian as the low-dimensional embedding of the graph structure at each snapshot. LAD explicitly models short-term and long-term dependencies by applying two sliding windows. Next, we propose MultiLAD, a simple and effective generalization of LAD to multi-view graphs. MultiLAD provides the first change point detection method for multi-view dynamic graphs. It aggregates the singular values of the normalized graph Laplacian from different views through the scalar power mean operation. Through extensive synthetic experiments, we show that i) LAD and MultiLAD are accurate and outperform state-of-the-art baselines and their multi-view extensions by a large margin, ii) MultiLAD’s advantage over contenders significantly increases when additional views are available, and iii) MultiLAD is highly robust to noise from individual views. In five real-world dynamic graphs, we demonstrate that LAD and MultiLAD identify significant events as top anomalies, such as the implementation of government COVID-19 interventions which impacted the population mobility in multi-view traffic networks.
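The two-window idea in this abstract can be sketched as follows. This is a simplified stand-in and not the LAD algorithm itself: it assumes each snapshot has already been reduced to a nonzero, nonnegative embedding vector (for example a spectrum), summarizes each window by its normalized mean, and scores a snapshot by one minus the cosine similarity to that summary. The window lengths and function names are hypothetical.

```python
import numpy as np

def window_scores(embeddings, window):
    """Score each step by 1 - cosine similarity to the mean of the previous `window` steps."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # assumes no all-zero embedding
    scores = np.zeros(len(X))
    for t in range(window, len(X)):
        context = X[t - window:t].mean(axis=0)   # summary of recent "normal" behaviour
        context /= np.linalg.norm(context)
        scores[t] = 1.0 - float(X[t] @ context)
    return scores

def change_scores(embeddings, short=2, long=4):
    """Combine a short and a long window by taking the larger score at each step."""
    return np.maximum(window_scores(embeddings, short),
                      window_scores(embeddings, long))
```

The short window reacts to abrupt shifts, while the long window keeps a snapshot flagged when the recent past is itself unusual; taking the maximum is one assumed way of combining the two.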
Article
Networks provide a powerful formalism for modeling complex systems, by representing the underlying set of pairwise interactions. But much of the structure within these systems involves interactions that take place among more than two nodes at once; for example, communication within a group rather than person-to-person, collaboration among a team rather than a pair of co-authors, or biological interaction between a set of molecules rather than just two. We refer to these types of simultaneous interactions on sets of more than two nodes as higher-order interactions; they are ubiquitous, but the empirical study of them has lacked a general framework for evaluating higher-order models. Here we introduce such a framework, based on link prediction, a fundamental problem in network analysis. The traditional link prediction problem seeks to predict the appearance of new links in a network, and here we adapt it to predict which (larger) sets of elements will have future interactions. We study the temporal evolution of 19 datasets from a variety of domains, and use our higher-order formulation of link prediction to assess the types of structural features that are most predictive of new multi-way interactions. Among our results, we find that different domains vary considerably in their distribution of higher-order structural parameters, and that the higher-order link prediction problem exhibits some fundamental differences from traditional pairwise link prediction, with a greater role for local rather than long-range information in predicting the appearance of new interactions.
Article
A number of real-world problems in many domains (e.g. sociology, biology, political science and communication networks) can be modeled as dynamic networks with nodes representing entities of interest and edges representing interactions among the entities at different points in time. A common representation for such models is the snapshot model, where a network is defined at logical time-stamps. An important problem under this model is change point detection. In this work we devise an effective and efficient three-step approach for detecting change points in dynamic networks under the snapshot model. Our algorithm achieves up to 9X speedup over the state-of-the-art while improving quality on both synthetic and real-world networks.
Conference Paper
Automatic Dependent Surveillance-Broadcast (ADS-B) is one of the key components of the next generation air transportation system. Since ADS-B will become mandatory by 2020 for most airspaces, it is important that aspects such as capacity, applications, and security are investigated by an independent research community. However, large-scale real-world data was previously only accessible to a few closed industrial and governmental groups because it required specialized and expensive equipment. To enable researchers to conduct experimental studies based on real data, we developed OpenSky, a sensor network based on low-cost hardware connected over the Internet. OpenSky is based on off-the-shelf ADS-B sensors distributed to volunteers throughout Central Europe. It covers 720,000 km², is able to capture more than 30% of the commercial air traffic in Europe, and enables researchers to analyze billions of ADS-B messages. In this paper, we report on the challenges we faced during the development and deployment of this participatory network and the insights we gained over the last two years of operations as a service to academic research groups. We go on to provide real-world insights about the possibilities and limitations of such low-cost sensor networks concerning air traffic surveillance and further applications such as multilateration.
Article
How can we spot anomalies in large, time-evolving graphs? When we have multi-aspect data, e.g. who published which paper at which conference and in which year, how can we combine this information in order to obtain good summaries thereof and unravel hidden anomalies and patterns? Such multi-aspect data, including time-evolving graphs, can be successfully modelled using tensors. In this paper, we show that when we have multiple dimensions in the dataset, tensor analysis is a powerful and promising tool. Our method TENSORSPLAT, at the heart of which lies the "PARAFAC" decomposition method, can give good insights about the large networks that are of interest nowadays, and contributes to spotting micro-clusters, changes and, in general, anomalies. We report extensive experiments on a variety of datasets (co-authorship network, time-evolving DBLP network, computer network and Facebook wall posts) and show how tensors can prove useful in detecting "strange" behaviors.
Article
This paper explains the multi-way decomposition method PARAFAC and its use in chemometrics. PARAFAC is a generalization of PCA to higher order arrays, but some of the characteristics of the method are quite different from the ordinary two-way case. There is no rotation problem in PARAFAC, and e.g., pure spectra can be recovered from multi-way spectral data. Unlike in PCA, one cannot estimate components successively, as this gives a model with a poorer fit than the simultaneous solution. Finally, scaling and centering are not as straightforward in the multi-way case as in the two-way case. An important advantage of using multi-way methods instead of unfolding methods is that the estimated models are very simple in a mathematical sense, and therefore more robust and easier to interpret. All these aspects plus more are explained in this tutorial, and an implementation in Matlab code is available that contains most of the features explained in the text. Three examples show how PARAFAC can be used for specific problems. The applications include subjects such as: analysis of variance by PARAFAC, a five-way application of PARAFAC, PARAFAC with half the elements missing, PARAFAC constrained to positive solutions and PARAFAC for regression as in principal component regression.
Conference Paper
We report on an automated runtime anomaly detection method at the application layer of multi-node computer systems. Although several network management systems are available in the market, none of them have sufficient capabilities to detect faults in multi-tier Web-based systems with redundancy. We model a Web-based system as a weighted graph, where each node represents a "service" and each edge represents a dependency between services. Since the edge weights vary greatly over time, the problem we address is that of anomaly detection from a time sequence of graphs. In our method, we first extract a feature vector from the adjacency matrix that represents the activities of all of the services. The heart of our method is to use the principal eigenvector of the eigenclusters of the graph. Then we derive a probability distribution for an anomaly measure defined for a time-series of directional data derived from the graph sequence. Given a critical probability, the threshold value is adaptively updated using a novel online algorithm. We demonstrate that a fault in a Web application can be automatically detected and the faulty services are identified without using detailed knowledge of the behavior of the system.
Conference Paper
Spectral analysis connects graph structure to the eigenvalues and eigenvectors of associated matrices. Much of spectral graph theory descends directly from spectral geometry, the study of differentiable manifolds through the spectra of associated differential operators. But the translation from spectral geometry to spectral graph theory has largely focused on results involving only a few extreme eigenvalues and their associated eigenvectors. Unlike in geometry, the study of graphs through the overall distribution of eigenvalues, the spectral density, is largely limited to simple random graph models. The interior of the spectrum of real-world graphs remains largely unexplored, difficult to compute and to interpret. In this paper, we delve into the heart of spectral densities of real-world graphs. We borrow tools developed in condensed matter physics, and add novel adaptations to handle the spectral signatures of common graph motifs. The resulting methods are highly efficient, as we illustrate by computing spectral densities for graphs with over a billion edges on a single compute node. Beyond providing visually compelling fingerprints of graphs, we show how the estimation of spectral densities facilitates the computation of many common centrality measures, and use spectral densities to estimate meaningful information about graph structure that cannot be inferred from the extremal eigenpairs alone.
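One tool from condensed matter physics that fits this description is the kernel polynomial method (KPM), named elsewhere on this page. The sketch below is a minimal, assumed version of its core step rather than the paper's implementation: it estimates the Chebyshev moments d_m = tr(T_m(H)) / n of the spectral density using only matrix-vector products and random probe vectors. Here H = -D^{-1/2} A D^{-1/2} is the normalized Laplacian shifted so that its eigenvalues lie in [-1, 1]; all function names are hypothetical, and the adaptations for graph motifs mentioned in the abstract are not covered.

```python
import numpy as np

def shifted_operator(adj):
    """H = L - I = -D^{-1/2} A D^{-1/2}; its eigenvalues lie in [-1, 1]."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
    return -(inv_sqrt[:, None] * adj * inv_sqrt[None, :])

def rademacher_probes(n, k, seed=None):
    """k random +/-1 probe vectors as columns; they satisfy E[z z^T] = I."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(n, k))

def chebyshev_moments(H, num_moments, probes):
    """Estimate d_m = tr(T_m(H)) / n for m = 0 .. num_moments - 1.

    Uses the recurrence T_{m+1}(H) z = 2 H T_m(H) z - T_{m-1}(H) z, so the
    cost is one matrix-vector product per moment and per probe.
    """
    n = H.shape[0]
    prev = probes          # T_0(H) z = z
    curr = H @ probes      # T_1(H) z = H z
    moments = [np.sum(probes * prev), np.sum(probes * curr)]
    for _ in range(2, num_moments):
        prev, curr = curr, 2.0 * (H @ curr) - prev
        moments.append(np.sum(probes * curr))
    return np.array(moments[:num_moments]) / (n * probes.shape[1])
```

The moment vector can serve directly as a fixed-length spectral signature, or be expanded back into a smoothed density. Passing `np.sqrt(n) * np.eye(n)` as probes gives the exact moments, which is a convenient check on small graphs.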
Conference Paper
How do we spot interesting events from e-mail or transportation logs? How can we detect port scan or denial of service attacks from IP-IP communication data? In general, given a sequence of weighted, directed or bipartite graphs, each summarizing a snapshot of activity in a time window, how can we spot anomalous graphs containing the sudden appearance or disappearance of large dense subgraphs (e.g., near bicliques) in near real-time using sublinear memory? To this end, we propose a randomized sketching-based approach called SpotLight, which guarantees that an anomalous graph is mapped 'far' away from 'normal' instances in the sketch space with high probability for appropriate choice of parameters. Extensive experiments on real-world datasets show that SpotLight (a) improves accuracy by at least 8.4% compared to prior approaches, (b) is fast and can process millions of edges within a few minutes, (c) scales linearly with the number of edges and sketching dimensions and (d) leads to interesting discoveries in practice.
Conference Paper
In this paper we describe a new release of a Web scale entity graph that serves as the backbone of Microsoft Academic Service (MAS), a major production effort with a broadened scope to the namesake vertical search engine that has been publicly available since 2008 as a research prototype. At the core of MAS is a heterogeneous entity graph comprised of six types of entities that model the scholarly activities: field of study, author, institution, paper, venue, and event. In addition to obtaining these entities from the publisher feeds as in the previous effort, in this version we include data mining results from the Web index and an in-house knowledge base from Bing, a major commercial search engine. As a result of the Bing integration, the new MAS graph sees a significant increase in size, with fresh information streaming in automatically following their discoveries by the search engine. In addition, the rich entity relations included in the knowledge base provide additional signals to disambiguate and enrich the entities within and beyond the academic domain. The number of papers indexed by MAS, for instance, has grown from low tens of millions to 83 million while maintaining an above 95% accuracy based on test data sets derived from academic activities at Microsoft Research. Based on the data set, we demonstrate two scenarios in this work: a knowledge driven, highly interactive dialog that seamlessly combines reactive search and proactive suggestion experience, and a proactive heterogeneous entity recommendation.
Article
Networks are an important tool for describing and quantifying data on interactions among objects or people, e.g., online social networks, offline friendship networks, and object-user interaction networks, among others. When interactions are dynamic, their evolving pattern can be represented as a sequence of networks each giving the interactions among a common set of vertices at consecutive points in time. An important task in analyzing such evolving networks, and for predicting their future evolution, is change-point detection, in which we identify moments in time across which the large-scale pattern of interactions changes fundamentally. Here, we formalize the network change point detection problem within a probabilistic framework and introduce a method that can reliably solve it in data. This method combines a generalized hierarchical random graph model with a generalized likelihood ratio test to quantitatively determine if, when, and precisely how a change point has occurred. Using synthetic data with known structure, we characterize the difficulty of detecting change points of different types, e.g., groups merging, splitting, forming or fragmenting, and show that this method is more accurate than several alternatives. Applied to two high-resolution evolving social networks, this method identifies a sequence of change points that align with known external shocks to these networks.
Article
A stochastic model is proposed for social networks in which the actors in a network are partitioned into subgroups called blocks. The model provides a stochastic generalization of the blockmodel. Estimation techniques are developed for the special case of a single relation social network, with blocks specified a priori. An extension of the model allows for tendencies toward reciprocation of ties beyond those explained by the partition. The extended model provides a one degree-of-freedom test of the model. A numerical example from the social network literature is used to illustrate the methods.
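To make the block structure concrete, here is a small sampler for a blockmodel with blocks specified a priori. It is only an illustration under assumed conventions (directed ties, no self-ties, one independent Bernoulli draw per ordered pair), and the function and parameter names are hypothetical; it does not implement the estimation techniques or the reciprocity extension described in the abstract.

```python
import numpy as np

def sample_blockmodel(blocks, tie_prob, seed=None):
    """Sample a directed adjacency matrix from a stochastic blockmodel.

    blocks:   block index of each actor, e.g. [0, 0, 1]
    tie_prob: tie_prob[a][b] is the probability of a tie from an actor
              in block a to an actor in block b
    """
    blocks = np.asarray(blocks)
    tie_prob = np.asarray(tie_prob, dtype=float)
    rng = np.random.default_rng(seed)
    # Look up the tie probability for every ordered pair of actors.
    probs = tie_prob[blocks[:, None], blocks[None, :]]
    adj = (rng.random(probs.shape) < probs).astype(int)
    np.fill_diagonal(adj, 0)  # no self-ties
    return adj
```

Setting the within-block probabilities high and the between-block probabilities low produces the planted community structure that change point detectors are often evaluated on.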
Article
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
Article
In this paper the solution to the problem of placing n connected points (or nodes) in r-dimensional Euclidean space is given. The criterion for optimality is minimizing a weighted sum of squared distances between the points subject to quadratic constraints of the form X'X = 1, for each of the r unknown coordinate vectors. It is proved that the problem reduces to the minimization of a sum of r positive semi-definite quadratic forms which, under the quadratic constraints, reduces to the problem of finding r eigenvectors of a special "disconnection" matrix. It is shown, by example, how this can serve as a basis for cluster identification.
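A minimal sketch of this placement, assuming the "disconnection" matrix is the graph Laplacian D - W of a connected, symmetric weight matrix W (an assumption, since the abstract does not define it): the r coordinate vectors are taken as the unit-norm eigenvectors with the smallest nonzero eigenvalues, skipping the constant eigenvector, which would place every point at the same location. The function name is hypothetical.

```python
import numpy as np

def hall_placement(weights, r=1):
    """Return an (n, r) array of coordinates for n nodes.

    weights: symmetric nonnegative (n, n) matrix of a connected graph.
    Each column has unit norm, matching the constraint X'X = 1.
    """
    W = np.asarray(weights, dtype=float)
    laplacian = np.diag(W.sum(axis=1)) - W   # assumed form of the "disconnection" matrix
    _, vecs = np.linalg.eigh(laplacian)      # eigenvalues in ascending order
    return vecs[:, 1:r + 1]                  # skip the constant eigenvector (eigenvalue 0)
```

For two triangles joined by a single edge, the one-dimensional placement puts the triangles on opposite sides of the origin, which is the sense in which the coordinates can serve as a basis for cluster identification.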
Conference Paper
Spectral graph theory is the study of the eigenvalues and eigenvectors of matrices associated with graphs. In this tutorial, we will try to provide some intuition as to why these eigenvectors and eigenvalues have combinatorial significance, and will survey some of their applications.
Philosophical transactions of the royal society a: mathematical, physical and engineering sciences
  • A L Barabási
Robust random cut forest based anomaly detection on streams
  • S Guha
  • N Mishra
  • G Roy
  • O Schrijvers