Article

Graph-based Anomaly Detection and Description: A Survey

Abstract

Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we provide a comprehensive exploration of both data mining and machine learning algorithms for these detection tasks. We give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the 'why', of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.


... Applications in biology include protein detection [22], biological network comparison [29] and disease gene identification [23]. In network science, researchers have applied graphlets for web spam detection [5], anomaly detection [4], social network structure analysis [35], and friendship recommendation, e.g., there are two "types" of 3-node graphlets: (1) a 3-node line subgraph ...
... (b) Suppose we want to explore possible graphlets $g^4_i$ and have decided to perform a random walk on $G^{(2)}$ (i.e., d = 2). Assume we make l = 3 transitions on the following states: (1, 2) → (1, 3) → (3, 4). We then obtain a 4-node graphlet sample induced by the node set {1, 2, 3, 4}, because {1, 2, 3, 4} is contained in the three states (1, 2), (1, 3) and (3, 4). In this case, the obtained graphlet sample corresponds to $g^4_5$ in Figure 2. The main technical challenge is to remove the bias of the obtained graphlet samples. ...
... • Effect of the concentration value. From Figure 5b we can see that SRW2 and SRW2CSS perform better than SRW3 for all the 4-node graphlets except $g^4_3$ (because the weighted concentration of $g^4_3$ computed with SRW3 is higher than that of SRW2). Besides, the smaller the concentration value, the higher the estimation error, which is consistent with our analysis in Theorem 3. ...
Preprint
Full-text available
Graphlets are induced subgraph patterns and have been frequently applied to characterize the local topology structures of graphs across various domains, e.g., online social networks (OSNs) and biological networks. Discovering and computing graphlet statistics are highly challenging. First, the massive size of real-world graphs makes the exact computation of graphlets extremely expensive. Second, the graph topology may not be readily available, so one has to resort to web crawling using the available application programming interfaces (APIs). In this work, we propose a general and novel framework to estimate graphlet statistics of "any size". Our framework is based on collecting samples through consecutive steps of random walks. We derive an analytical bound on the sample size (via the Chernoff-Hoeffding technique) to guarantee the convergence of our unbiased estimator. To further improve the accuracy, we introduce two novel optimization techniques to reduce the lower bound on the sample size. Experimental evaluations demonstrate that our methods outperform the state-of-the-art method by up to an order of magnitude in terms of both accuracy and time cost.
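The random-walk example quoted in the excerpts above, with states (1, 2) → (1, 3) → (3, 4), can be illustrated with a minimal Python sketch. It only shows how the visited states yield a node set and its induced subgraph; it does not implement the bias correction or the estimator from the preprint. The function name `graphlet_sample` and the edge list of the toy graph are assumptions made for illustration, since the excerpt does not give the graph.

```python
from itertools import combinations


def graphlet_sample(states, edges):
    """Collect the nodes covered by the visited states and return them
    together with the edges of the subgraph that they induce."""
    nodes = set()
    for state in states:
        nodes.update(state)
    edge_set = {frozenset(e) for e in edges}
    induced = sorted(
        (u, v)
        for u, v in combinations(sorted(nodes), 2)
        if frozenset((u, v)) in edge_set
    )
    return nodes, induced


# Hypothetical toy graph (assumption: not taken from the preprint).
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
# States visited by the walk; consecutive states share one node.
walk = [(1, 2), (1, 3), (3, 4)]

nodes, induced = graphlet_sample(walk, toy_edges)
print(nodes)    # the node set covered by the three states
print(induced)  # edges of the induced 4-node subgraph
```

Which graphlet type the sample corresponds to depends on the induced edges, so the same walk can yield different graphlets on different graphs.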
... For example, outlier nodes in a social network graph may include: scammers who steal users' personal information; fake accounts that manipulate the reputation management system; or spammers who send free and mostly false advertisements [2], [3]. Researchers have been working on algorithms to detect these malicious outlier nodes in graphs [4], [5], [6], [7]. Outlier edges are also common in graphs. ...
... ${}_{b,a} = S_{b \setminus a}$ and $R^{(4)}_{b,a} = S_{a \setminus b}$. Using Theorem 3, we can easily get $\alpha(P^{(4)}_{a,b}, R^{(4)}_{a,b}) = \alpha(P^{(4)}_{b,a}, R^{(4)}_{b,a})$. To prove Theorem 4 for scheme 2, we divide the nodes in edge-ego-network $G_{ab}$ into five mutually exclusive sets: ...
Preprint
Outliers are samples that are generated by different mechanisms from other normal data samples. Graphs, in particular social network graphs, may contain nodes and edges that are made by scammers, malicious programs or mistakenly by normal users. Detecting outlier nodes and edges is important for data mining and graph analytics. However, previous research in the field has merely focused on detecting outlier nodes. In this article, we study the properties of edges and propose outlier edge detection algorithms using two random graph generation models. We found that the edge-ego-network, which can be defined as the induced graph that contains two end nodes of an edge, their neighboring nodes and the edges that link these nodes, contains critical information to detect outlier edges. We evaluated the proposed algorithms by injecting outlier edges into some real-world graph data. Experiment results show that the proposed algorithms can effectively detect outlier edges. In particular, the algorithm based on the Preferential Attachment Random Graph Generation model consistently gives good performance regardless of the test graph data. Furthermore, the proposed algorithms are not limited to the area of outlier edge detection. We demonstrate three different applications that benefit from the proposed algorithms: 1) a preprocessing tool that improves the performance of graph clustering algorithms; 2) an outlier node detection algorithm; and 3) a novel noisy data clustering algorithm. These applications show the great potential of the proposed outlier edge detection techniques.
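The edge-ego-network defined in this abstract (the two end nodes of an edge, their neighboring nodes, and the edges linking these nodes) can be sketched in a few lines of Python. This is a minimal illustration of the definition only, not the authors' detection algorithm; the adjacency-dictionary representation, the function name `edge_ego_network`, and the toy graph are assumptions.

```python
def edge_ego_network(adj, a, b):
    """Return the node set and edge set of the edge-ego-network of (a, b).

    adj maps each node to the set of its neighbors (undirected graph).
    """
    # End nodes of the edge plus all of their neighbors.
    nodes = {a, b} | adj[a] | adj[b]
    # Keep every edge whose two endpoints both lie in that node set.
    edges = {
        frozenset((u, v))
        for u in nodes
        for v in adj[u]
        if v in nodes
    }
    return nodes, edges


# Hypothetical toy graph (assumption): edges 1-2, 1-3, 2-3, 2-4, 4-5.
adj = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2},
    4: {2, 5},
    5: {4},
}

nodes, edges = edge_ego_network(adj, 1, 2)
print(sorted(nodes))
print(sorted(tuple(sorted(e)) for e in edges))
```

In this toy graph, node 5 and the edge 4-5 fall outside the edge-ego-network of (1, 2), because node 5 is a neighbor of neither end node.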
... In many instances, the relationships between vertices evolve as a function of time: edges may appear and disappear, the weights along the edges may change. The study of such dynamic graphs often involves the identification of patterns that couple changes in the network topology with the latent dynamical processes that drive the evolution of the connectivity of the network [2,24,30,31,42]. ...
... In this paper we are concerned with two undirected weighted graphs $G^{(1)}$ and $G^{(2)}$ with a common vertex set $V = \{1, \ldots, n\}$, two edge sets $E^{(1)}$ and $E^{(2)}$, and two symmetric weight functions $w^{(1)}$ and $w^{(2)}$. We denote by $A^{(1)}$ and $A^{(2)}$ the corresponding weighted adjacency matrices. ...
Preprint
To quantify the fundamental evolution of time-varying networks, and detect abnormal behavior, one needs a notion of temporal difference that captures significant organizational changes between two successive instants. In this work, we propose a family of distances that can be tuned to quantify structural changes occurring on a graph at different scales: from the local scale formed by the neighbors of each vertex, to the largest scale that quantifies the connections between clusters, or communities. Our approach results in the definition of a true distance, and not merely a notion of similarity. We propose fast (linear in the number of edges) randomized algorithms that can quickly compute an approximation to the graph metric. The third contribution involves a fast algorithm to increase the robustness of a network by optimally decreasing the Kirchhoff index. Finally, we conduct several experiments on synthetic graphs and real networks, and we demonstrate that we can detect configurational changes that are directly related to the hidden variables governing the evolution of dynamic networks.
... For alleviating anomalies' detrimental impact and protecting the public interests, anomaly detection has attracted extensive research interests across disciplines since 1969 [9]. Although numerous anomaly/outlier detection techniques have been developed, the majority of them only focus on tabular data and overlook the complex relations and interactions between objects [10], [11] that widely exist in real information networks. Such complex relations provide important clues for detecting anomalies. ...
... Graph anomaly detection in this work is dedicated to anomalous node detection, which aims to detect rare, abnormal or unusual nodes in a graph [9], [10]. The well-known applications of this research direction include fraud detection [21], fake news detection [4], and intrusion detection [6]. ...
... Since anomalies are generated by significantly different and unknown mechanisms [10], their data distributions are essentially different from the majority. Given a model that is capable of capturing the underlying major data distribution in the embedding space, normal nodes adhering to the distribution can be seamlessly reconstructed or decoded using the embeddings while anomalies are harder to reconstruct. ...
Article
Anomalies often occur in real-world information networks/graphs, such as malevolent users in online review networks and fake news in social media. When representing such structured network data as graphs, anomalies usually appear as anomalous nodes that exhibit significantly deviated structure patterns, or different attributes, or both. To date, numerous unsupervised methods have been developed to detect anomalies based on residual analysis, which assumes that anomalies will introduce larger residual errors (i.e., graph reconstruction loss). While these existing works achieved encouraging performance, in this paper, we formally prove that their employed learning objectives, i.e., MSE and cross-entropy losses, encounter significant limitations in learning the major data distributions, particularly for anomaly detection, and through our preliminary study, we reveal that the vanilla residual analysis-based methods cannot effectively investigate the rich graph structure. Building on these discoveries, we propose a novel structure-biased graph anomaly detection framework (SALAD) to attain anomalies' divergent patterns with the assistance of a specially designed node representation augmentation approach. We further present two effective training objectives to empower SALAD to effectively capture the major structure and attribute distributions by placing less emphasis on anomalies that introduce higher reconstruction errors under the encoder-decoder framework. The detection performance on eight widely-used datasets demonstrates SALAD's superiority over twelve state-of-the-art baselines. Additional ablation and case studies validate that our data augmentation method and training objectives result in the impressive performance.
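The residual-analysis assumption described in this abstract (anomalies introduce larger reconstruction errors) can be made concrete with a small Python sketch. It illustrates only the vanilla scoring step that the abstract critiques, not SALAD itself: the encoder-decoder is omitted, and the reconstructed attribute vectors are hard-coded. The function names and all numbers are assumptions for illustration.

```python
def residual_scores(original, reconstructed):
    """Squared reconstruction error per node, used as an anomaly score."""
    scores = {}
    for node, x in original.items():
        x_hat = reconstructed[node]
        scores[node] = sum((xi - xhi) ** 2 for xi, xhi in zip(x, x_hat))
    return scores


def rank_by_residual(scores):
    """Nodes ordered from largest to smallest residual."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical node attributes and their reconstructions (assumption:
# in practice these would come from a trained encoder-decoder model).
original = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [5.0, 5.0]}
reconstructed = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [1.0, 1.0]}

scores = residual_scores(original, reconstructed)
ranking = rank_by_residual(scores)
print(ranking[0])  # the node that is hardest to reconstruct
```

Node "c" is reconstructed poorly relative to "a" and "b", so it ranks first under this scoring rule.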
... Many real-world networks such as transport, electricity, and social networks can be represented as graphs, which model the relationships between sets of entities. An ordered sequence of such graphs can be used to model temporal changes, such as the evolution of relationships as well as the addition and removal of entities [1], [8], [20], [22]. We refer to sequences of graphs as dynamic graphs. ...
... In contrast to detecting anomalous vertices in a graph [19], in this work we focus on graph-level anomaly detection, which is determining whether an entire graph is abnormal within a sequence of graphs [22]. Anomaly detection techniques for dynamic graphs can be categorised into four broad types, based on: (i) features, (ii) decompositions, (iii) clustering, and (iv) moving windows [1]. ...
Conference Paper
Full-text available
Detecting anomalies in a temporal sequence of graphs can be applied in areas such as the detection of accidents in transport networks and cyber attacks in computer networks. Existing methods for detecting abnormal graphs can suffer from multiple limitations, such as high false positive rates as well as difficulties with handling variable-sized graphs and non-trivial temporal dynamics. To address this, we propose a technique where temporal dependencies are explicitly modelled via time series analysis of a large set of pertinent graph features, followed by using residuals to remove the dependencies. Extreme Value Theory is then used to robustly model and classify any remaining extremes, aiming to produce low false positive rates. Comparative evaluations on a multitude of graph instances show that the proposed approach obtains considerably better accuracy than TensorSplat and Laplacian Anomaly Detection.
... Graph anomaly detection (GAD) refers to the tasks of identifying anomalous graph objects, such as nodes, edges or sub-graphs, in an individual graph (Akoglu et al, 2015; Ma et al, 2021), or identifying anomalous graphs from a set of graphs (Ma et al, 2022; Li et al, 2024a). GAD has numerous successful applications, e.g., in finance fraud detection (Motie and Raahemi, 2023), fake news detection (Xu et al, 2022a), system fault diagnosis (Li et al, 2024b), and network intrusion detection (Garcia-Teodoro et al, 2009). ...
... We use three popular citation networks, namely Cora, Citeseer, and Pubmed (Akoglu, 2015). The resulting datasets are summarized in Table 2. ...
Preprint
Full-text available
Self-supervised learning (SSL) is an emerging paradigm that exploits supervisory signals generated from the data itself, and many recent studies have leveraged SSL to conduct graph anomaly detection. However, we empirically found that three important factors can substantially impact detection performance across datasets: 1) the specific SSL strategy employed; 2) the tuning of the strategy's hyperparameters; and 3) the allocation of combination weights when using multiple strategies. Most SSL-based graph anomaly detection methods circumvent these issues by arbitrarily or selectively (i.e., guided by label information) choosing SSL strategies, hyperparameter settings, and combination weights. While an arbitrary choice may lead to subpar performance, using label information in an unsupervised setting is label information leakage and leads to severe overestimation of a method's performance. Leakage has been criticized as "one of the top ten data mining mistakes", yet many recent studies on SSL-based graph anomaly detection have been using label information to select hyperparameters. To mitigate this issue, we propose to use an internal evaluation strategy (with theoretical analysis) to select hyperparameters in SSL for unsupervised anomaly detection. We perform extensive experiments using 10 recent SSL-based graph anomaly detection algorithms on various benchmark datasets, demonstrating both the prior issues with hyperparameter selection and the effectiveness of our proposed strategy.
... The rising use of social networks and various other sensor networks in real-world applications ranging from politics to healthcare has made computational analysis of graphs a very important area of research today. The use of machine learning for various graph analysis tasks such as community detection [16], link analysis [9], node classification [2] and anomaly detection [4] has exponentially increased over the last few years, considering the direct impact of the success of these methods on business outcomes. A wide variety of methods have been proposed in the aforementioned areas over the last few years. ...
... Epinion Dataset: Epinion is a popular product review site where users write critical reviews about products from various categories. The Epinion dataset contains review details such as user ids, product ids, category ids, the time-stamp when the ratings were created, along with a few other fields. This dataset has been used in earlier work such as [22] and [23]. ...
Preprint
Full-text available
Analyzing the temporal behavior of nodes in time-varying graphs is useful for many applications such as targeted advertising, community evolution and outlier detection. In this paper, we present a novel approach, STWalk, for learning trajectory representations of nodes in temporal graphs. The proposed framework makes use of structural properties of graphs at current and previous time-steps to learn effective node trajectory representations. STWalk performs random walks on a graph at a given time step (called space-walk) as well as on graphs from past time-steps (called time-walk) to capture the spatio-temporal behavior of nodes. We propose two variants of STWalk to learn trajectory representations. In one algorithm, we perform space-walk and time-walk as part of a single step. In the other variant, we perform space-walk and time-walk separately and combine the learned representations to get the final trajectory embedding. Extensive experiments on three real-world temporal graph datasets validate the effectiveness of the learned representations when compared to three baseline methods. We also show the goodness of the learned trajectory embeddings for change point detection, as well as demonstrate that arithmetic operations on these trajectory representations yield interesting and interpretable results.
... The detection of anomalies is extremely useful, because the discovery of irregularities without any prior knowledge or help of an expert is a necessity in many domains [4]. The analysis of graphical data has grown in popularity [16] over the past two decades, and this has been accompanied by increasing quantities of research on anomaly detection in complex networks. ...
... Currently, there is a very limited number of publicly available datasets with known anomalies, and manual labeling is a challenging task [4]. To deal with these issues and evaluate the proposed anomaly detection algorithm on various types of networks, we used simulated anomalous vertices (see Algorithm 2) for different scenarios. ...
Preprint
In the past decade, network structures have penetrated nearly every aspect of our lives. The detection of anomalous vertices in these networks has become increasingly important, such as in exposing computer network intruders or identifying fake online reviews. In this study, we present a novel unsupervised two-layered meta-classifier that can detect irregular vertices in complex networks solely by using features extracted from the network topology. Following the reasoning that a vertex with many improbable links has a higher likelihood of being anomalous, we employed our method on 10 networks of various scales, from a network of several dozen students to online social networks with millions of users. In every scenario, we were able to identify anomalous vertices with lower false positive rates and higher AUCs compared to other prevalent methods. Moreover, we demonstrated that the presented algorithm is efficient both in revealing fake users and in disclosing the most influential people in social networks.
... Fraud detection: Due to the openness and anonymity of the Internet, online platforms attract a large number of malicious users, such as vandals, trolls, and sockpuppets. Many fraud detection techniques have been developed in recent years [1,4,16,20,47], including content-based approaches and graph-based approaches. The content-based approaches extract content features (i.e., text, URL) to identify malicious users from user activities on social networks [2]. ...
... Research in [48] proposed two deep neural networks for fraud detection on a signed graph. Often based on unsupervised learning, the graph-based approaches consider fraud as anomalies and extract various graph features associated with nodes, edges, ego-net, or communities from the graph [1,28,32]. ...
Preprint
Many online applications, such as online social networks or knowledge bases, are often attacked by malicious users who commit different types of actions such as vandalism on Wikipedia or fraudulent reviews on eBay. Currently, most of the fraud detection approaches require a training dataset that contains records of both benign and malicious users. However, in practice, there are often no or very few records of malicious users. In this paper, we develop one-class adversarial nets (OCAN) for fraud detection using training data with only benign users. OCAN first uses LSTM-Autoencoder to learn the representations of benign users from their sequences of online activities. It then detects malicious users by training a discriminator with a complementary GAN model that is different from the regular GAN model. Experimental results show that our OCAN outperforms the state-of-the-art one-class classification models and achieves comparable performance with the latest multi-source LSTM model that requires both benign and malicious users in the training phase.
... These nonconforming patterns, occurring rarely in datasets, are often referred to as anomalies, outliers, exceptions, aberrations, surprises, or contaminants, depending on the field of research. In short, our studies aim to warn of problems as they happen, or to predict the long-term evolution of the system, by detecting local or global unusual changes [2]-[4]. ...
... Subsequently, Le et al. [10] used complex network concepts such as degree distribution, maximum degree and dK-2 distance to detect anomalous network traffic. In fact, detecting anomalies in datasets depicted by dynamic networks has received much attention in recent years, since dynamic networks can provide powerful machinery for effectively capturing the long-range correlations among inter-dependent data objects [4]. One example is intentional attacks on the Internet [11]. ...
Preprint
Detecting anomalous behaviors such as network failures or intentional attacks in the large-scale Internet is a vital but challenging task. While numerous techniques based on Internet traffic have been developed in past years, anomaly detection for structured datasets using complex networks has only recently come into focus. In this paper, an anomaly detection method for large-scale Internet topology is proposed by considering the changes of network crashes. To quantify the dynamic changes of Internet topology, the network path changes coefficient (NPCC) is put forward, which highlights the abnormal state of the Internet after it is attacked continuously. Furthermore, we propose a decision function inspired by the Fibonacci sequence to determine whether the Internet is abnormal or not: the current Internet is abnormal if its NPCC is beyond the normal domain structured by the previous k NPCCs of the Internet topology. Finally, the new Internet anomaly detection method was tested on the topology data of three Internet anomaly events. The results show that the detection accuracy for all events is over 97%, and the detection precision for the three events is 90.24%, 83.33% and 66.67%, respectively, when k = 36. According to the experimental values of the F_1 index, we found that the larger k is, the better the detection performance, and that our method performs better for anomalous behaviors caused by network failure than for those caused by intentional attack. Compared with traditional anomaly detection, our work may be simpler and more powerful for governments or organizations in terms of detecting large-scale abnormal events.
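The decision rule in this abstract compares the current NPCC with a normal domain built from the previous k NPCC values. The abstract does not specify the Fibonacci-inspired decision function, so the Python sketch below substitutes a plain stand-in: the normal domain is taken to be the window mean plus or minus `width` standard deviations. That choice, the function name `is_abnormal`, the `width` parameter, and the sample values are all assumptions, not the paper's method.

```python
from statistics import mean, pstdev


def is_abnormal(history, current, k, width=3.0):
    """Flag `current` if it lies outside the normal domain derived from
    the last k values in `history`.

    Stand-in rule (assumption): normal domain = mean +/- width * std.
    """
    window = history[-k:]
    if len(window) < k:
        return False  # not enough history to define a normal domain
    mu = mean(window)
    sigma = pstdev(window)
    return abs(current - mu) > width * sigma


# Hypothetical coefficient values for five past snapshots (assumption).
past = [0.10, 0.12, 0.11, 0.09, 0.10]

print(is_abnormal(past, 0.50, k=5))  # far outside the window's range
print(is_abnormal(past, 0.11, k=5))  # within the window's range
```

Any other definition of the normal domain can be swapped in by changing the last three lines of `is_abnormal`; the sliding-window structure stays the same.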
... Anomaly detection on networks is a problem arising in various areas: from intrusion detection [10] to fraud detection [6], from email network [17] to fMRI image [13]. One problem of particular interest is change point detection on dynamic social networks [1,19]. Social networks are known to have the hierarchical structure, where the most well-known one is the community structure [9,20]. ...
... There are two recent surveys [1,19] on change point detection. Most state-of-the-art works [13,22] do not consider the hierarchical structure in a network; other works [4,16,17] mention the hierarchy in their papers, and they all make specific assumptions about the underlying generative model: Moreno's work [16] assumes a network is generated from a mixed Kronecker product graph model (mKPGM), which is generated recursively from a seed matrix. ...
Preprint
This paper studies change point detection on networks with community structures. It proposes a framework that can detect both local and global changes in networks efficiently. Importantly, it can clearly distinguish the two types of changes. The framework design is generic and as such several state-of-the-art change point detection algorithms can fit in this design. Experiments on both synthetic and real-world networks show that this framework can accurately detect changes while achieving up to 800X speedup.
... The general approach for detecting such anomalies is to extract some features of the network (such as centrality measures, degree distribution, etc.), monitor these features over time, and raise a signal when these observed features cross a specified threshold. A rich class of anomaly detection techniques have been developed for dynamic networks, e.g., density based techniques (Papadimitriou et al., 2003), clustering based techniques (Wang et al., 2012), distribution based techniques (Akoglu et al., 2015; Šaltenis, 2004), and scan methods (Priebe et al., 2005). On the other hand, the goal of anomaly detection in a static network is to detect a subgraph that is significantly different from the overall network (Miller et al., 2015; Sengupta, 2018). ...
... On the other hand, the goal of anomaly detection in a static network is to detect a subgraph that is significantly different from the overall network (Miller et al., 2015; Sengupta, 2018). Some popular approaches include network analysis at the egonet level (Akoglu et al., 2015; Sengupta, 2018), spatial autocorrelation (Chawla and Sun, 2006), and modularity maximization (Newman, 2016; Sun et al., 2005; Haveliwala, 2003). In our paper, we restrict our attention to static networks. ...
Preprint
Full-text available
Monitoring of networks for anomaly detection has attracted a lot of attention in recent years, especially with the rise of connected devices and social networks. This is of importance as anomaly detection could span a wide range of applications, from detecting terrorist cells in counter-terrorism efforts to phishing attacks in social network circles. For this reason, numerous techniques for anomaly detection have been introduced. However, application of these techniques to more complex network models is hindered by various challenges such as the size of the network being investigated, how much a priori information is needed, the size of the anomalous graph, among others. A recent technique introduced by Miller et al., which relies on a spectral framework for anomaly detection, has the potential to address many of these challenges. In their discussion of the spectral framework, three algorithms were proposed that relied on the eigenvalues and eigenvectors of the residual matrix of a binary network. The authors demonstrated the ability to detect anomalous subgraphs that were less than 1% of the network size. However, to date, there is little work that has been done to evaluate the statistical performance of these algorithms. This study investigates the statistical properties of the spectral methods, specifically the Chi-square and L1 norm algorithms proposed by Miller. We will analyze the performance of the algorithms using simulated networks and also extend the method's application to count networks. Finally, we will make some methodological improvements and recommendations to both algorithms.
... While early research established core principles for identifying rare events [11], recent advances in deep learning have revolutionized network traffic analysis through multi-layered pattern recognition [27]. Graph-based approaches [3] are particularly good at modeling complex relationships between objects of interest, using advanced node feature learning techniques [29]. ...
... Recent advances in graph-based detection have yielded promising results in network security applications. While earlier work explored basic graph structures for relationship modeling [3], the development of graph convolutional networks [18] has dramatically improved detection accuracy in real-world security scenarios. These improvements, combined with advanced clustering techniques [26], demonstrate particular effectiveness in identifying coordinated attack patterns across distributed networks. ...
Preprint
Full-text available
As cyber threats continue to evolve in sophistication and scale, the ability to detect anomalous network behavior has become critical for maintaining robust cybersecurity defenses. Modern cybersecurity systems face the overwhelming challenge of analyzing billions of daily network interactions to identify potential threats, making efficient and accurate anomaly detection algorithms crucial for network defense. This paper investigates the use of variations of the Isolation Forest (iForest) machine learning algorithm for detecting anomalies in internet scan data. In particular, it presents the Set-Partitioned Isolation Forest (siForest), a novel extension of the iForest method designed to detect anomalies in set-structured data. By treating instances such as sets of multiple network scans with the same IP address as cohesive units, siForest effectively addresses some challenges of analyzing complex, multidimensional datasets. Extensive experiments on synthetic datasets simulating diverse anomaly scenarios in network traffic demonstrate that siForest has the potential to outperform traditional approaches on some types of internet scan data.
... Deng et al. [28] introduced a graph-based framework that combines transaction analysis and entity-level anomaly detection to enhance the identification of money laundering schemes. Unsupervised GNN models based on the GAE framework may also allow us to detect money laundering in a manner that is free of labeled data, but still informative enough for real world criminal activity investigation and prevention [26]. FlowScope [28] proposes a new metric, Anomalousness, which computes the likelihood for money laundering in multistep streaming transaction graphs of high density with an end goal of tracing the full money flow (payments) from sources to destinations. ...
Conference Paper
Full-text available
Detecting money laundering within financial networks is challenging due to the complexity of illicit transactions and the scarcity of labeled data. In this study, we model accounts as nodes and transactions as edges to develop an unsupervised anomaly detection framework, AMLGaurd, utilizing Graph Auto-Encoders (GAEs). GAEs encode the structural and transactional information of financial entities into a latent space and reconstruct the network to identify anomalies based on reconstruction discrepancies. We classify anomalies into four categories: contextual, structural, joint-type, and structure-type, each capturing different irregular patterns in the network. Furthermore, GAE encoder based embeddings can also be used for supervised edge classification, which can promote a multi-task learning setting. Although our present findings offer encouraging insights, this methodology establishes the groundwork for future progress in identifying complex money laundering schemes. This research enhances the application of graph-based unsupervised learning techniques in anti-money laundering (AML) systems, contributing to more effective and scalable financial security measures.
... In anomaly detection methodologies, particularly within graphs with node attributes like text graphs, encoding processes typically involve encoding the textual attributes of nodes [2,14]. For instance, in Graphformers [26] the textual features of nodes are independently encoded by language models. ...
Article
Full-text available
Dynamic graphs represent connections in complex systems changing over time, posing unique challenges for anomaly detection. Traditional static graph models and shallow dynamic graph methods often fail to capture the temporal dynamics and interactions effectively, limiting their ability to detect anomalies accurately. In this work, we introduce the Attribute Encoding Transformer (AET), a novel framework specifically designed for anomaly detection in unattributed dynamic graphs. The AET integrates advanced encoding strategies that leverage both spatial and historical interaction data, enhancing the model’s ability to identify anomalous patterns. Our approach includes a Link Prediction Pre-training methodology that optimizes the transformer architecture for dynamic contexts by pre-training on link prediction tasks, followed by fine-tuning for anomaly detection. Comprehensive experiments on four real-world datasets demonstrate that our framework outperforms the state-of-the-art methods in detecting anomalies, thereby addressing key challenges in dynamic graph analysis. This study not only advances the field of graph anomaly detection but also sets a new benchmark for future research on dynamic graph data analysis.
... Real-time monitoring of abnormal activities in financial markets not only helps improve market transparency and prevent systemic risks but also provides investors with more timely early warning information, thus enhancing the efficiency and fairness of the market [9][10][11]. Currently, deep learning-based image recognition technology has made significant progress in multiple fields, and it also shows great potential in the analysis of financial market time-series data [12,13]. ...
... The graph anomaly is defined as an abnormal or unusual pattern of nodes, edges, or subgraphs in the graph data (Akoglu, Tong, and Koutra 2015). From the perspective of the task, current graph anomaly detection methods can be divided into two main categories, node-level anomaly detection, and graph-level anomaly detection. ...
Preprint
Full-text available
Graph neural networks (GNNs) have shown promise in integrating protein-protein interaction (PPI) networks for identifying cancer genes in recent studies. However, due to the insufficient modeling of the biological information in PPI networks, a more faithful depiction of complex protein interaction patterns for cancer genes within the graph structure remains largely unexplored. This study takes a pioneering step toward bridging biological anomalies in protein interactions caused by cancer genes to statistical graph anomalies. We find a unique graph anomaly exhibited by cancer genes, namely weight heterogeneity, which manifests as significantly higher variance in edge weights of cancer gene nodes within the graph. Additionally, from the spectral perspective, we demonstrate that the weight heterogeneity could lead to the "flattening out" of spectral energy, with a concentration towards the extremes of the spectrum. Building on these insights, we propose the HIerarchical-Perspective Graph Neural Network (HIPGNN) that not only determines spectral energy distribution variations from the spectral perspective, but also perceives detailed protein interaction context from the spatial perspective. Extensive experiments are conducted on two reprocessed datasets, STRINGdb and CPDB, and the experimental results demonstrate the superiority of HIPGNN.
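As a rough illustration of the "weight heterogeneity" described in this abstract (higher variance in the weights of the edges incident to a node), the following minimal sketch computes per-node edge-weight variance on a toy weighted graph. The edge list, the node names, and the function `edge_weight_variance` are hypothetical and are not taken from HIPGNN; the sketch shows only the statistic, not the model.

```python
from collections import defaultdict
from statistics import pvariance


def edge_weight_variance(edges):
    """Return {node: population variance of the weights of its incident edges}.

    `edges` is an iterable of (u, v, weight) tuples for an undirected graph.
    """
    incident = defaultdict(list)
    for u, v, w in edges:
        incident[u].append(w)
        incident[v].append(w)
    return {node: pvariance(ws) for node, ws in incident.items()}


# Toy graph with made-up weights: node "g1" mixes strong and weak links.
toy_edges = [
    ("g1", "a", 0.9),
    ("g1", "b", 0.1),
    ("g1", "c", 0.5),
    ("a", "b", 0.5),
    ("b", "c", 0.5),
]
variances = edge_weight_variance(toy_edges)
# Node whose incident edge weights vary the most.
most_heterogeneous = max(variances, key=variances.get)
```

Ranking nodes by this variance is only a first-order check; the abstract's spectral analysis is not reproduced here.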
... Graphs offer a powerful representation for many types of structured data, including chemical processes, molecules, financial or social networks. There is much work focused on the task of detecting anomalous nodes and edges within a graph [5]. However, in many applications, it is much more relevant to ask whether an entire graph is abnormal. ...
Article
Full-text available
Data augmentation plays a critical role in self-supervised learning, including anomaly detection. While hand-crafted transformations such as image rotations can achieve impressive performance on image data, effective transformations of non-image data are lacking. In this work, we study learning such transformations for end-to-end anomaly detection on arbitrary data. We find that a contrastive loss – which encourages learning diverse data transformations while preserving the relevant semantic content of the data – is more suitable than previously proposed losses for transformation learning, a fact that we prove theoretically and empirically. We demonstrate that anomaly detection using neural transformation learning can achieve state-of-the-art results for time series data, tabular data, text data and graph data. Furthermore, our approach can make image anomaly detection more interpretable by learning transformations at different levels of abstraction.
... While the first question relates to obtaining a statistically representative sample, the sampling bias is best understood as a question about the impact of outliers for the structure of a DNS graph. Here, an outlier is understood broadly as a graph object that is rare and differs considerably from the majority of graph objects sampled [9]. Fast flux networks are a good example of such outliers seen in empirical DNS graph mining applications. ...
Preprint
The concept of agile domain name system (DNS) refers to dynamic and rapidly changing mappings between domain names and their Internet protocol (IP) addresses. This empirical paper evaluates the bias from this kind of agility for DNS-based graph theoretical data mining applications. By building on two conventional metrics for observing malicious DNS agility, the agility bias is observed by comparing bipartite DNS graphs to different subgraphs from which vertices and edges are removed according to two criteria. According to an empirical experiment with two longitudinal DNS datasets, irrespective of the criterion, the agility bias is observed to be severe particularly regarding the effect of outlying domains hosted and delivered via content delivery networks and cloud computing services. With these observations, the paper contributes to the research domains of cyber security and DNS mining. In a larger context of applied graph mining, the paper further elaborates the practical concerns related to the learning of large and dynamic bipartite graphs.
... Anomaly detection has been widely investigated in previous work (Akoglu, Tong, and Koutra 2015). Anomaly detection in networks aims to infer the structural inconsistencies, which means the anomalous nodes that connect to various diverse influential communities (Burt 2004;Hu et al. 2016), such as the red node in Fig. 13. ...
Preprint
Network embedding assigns nodes in a network to low-dimensional representations and effectively preserves the network structure. Recently, significant progress has been made toward this emerging network analysis paradigm. In this survey, we focus on categorizing and then reviewing the current development of network embedding methods, and point out future research directions. We first summarize the motivation of network embedding. We discuss the classical graph embedding algorithms and their relationship with network embedding. Afterwards and primarily, we provide a comprehensive overview of a large number of network embedding methods in a systematic manner, covering the structure- and property-preserving network embedding methods, the network embedding methods with side information and the advanced information preserving network embedding methods. Moreover, several evaluation approaches for network embedding and some useful online resources, including network data sets and software, are reviewed as well. Finally, we discuss the framework of exploiting these network embedding methods to build an effective system and point out some potential future directions.
... Among the many, anomaly detection in graphs emerged as a problem of particular relevance, as a consequence of the ever growing possibility to monitor and collect data coming from natural and man-made systems of various size. An overview of proposed approaches for anomaly and change detection on time-variant graphs is reported in [35], [37], where the authors distinguish the level of influence of a change. They identify changes affecting vertices and edges, or involving entire subnetworks of different size; this type of change usually concerns static networks, where the topology is often fixed. ...
Preprint
Graph representations offer powerful and intuitive ways to describe data in a multitude of application domains. Here, we consider stochastic processes generating graphs and propose a methodology for detecting changes in stationarity of such processes. The methodology is general and considers a process generating attributed graphs with a variable number of vertices/edges, without the need to assume one-to-one correspondence between vertices at different time steps. The methodology acts by embedding every graph of the stream into a vector domain, where a conventional multivariate change detection procedure can be easily applied. We ground the soundness of our proposal by proving several theoretical results. In addition, we provide a specific implementation of the methodology and evaluate its effectiveness on several detection problems involving attributed graphs representing biological molecules and drawings. Experimental results are contrasted with respect to suitable baseline methods, demonstrating the effectiveness of our approach.
... To identify "camouflaged" fraud, Hooi et al. [14] introduced "suspiciousness" metrics that apply to bipartite user-to-item graphs, and developed a greedy algorithm to find the subgraph with the highest suspiciousness. Akoglu et al. [2] survey graph based online fraud detection. [13] provide a survey of community detection methods, evaluation scores and techniques for general networks. ...
Preprint
The profitability of fraud in online systems such as app markets and social networks marks the failure of existing defense mechanisms. In this paper, we propose FraudSys, a real-time fraud preemption approach that imposes Bitcoin-inspired computational puzzles on the devices that post online system activities, such as reviews and likes. We introduce and leverage several novel concepts that include (i) stateless, verifiable computational puzzles, that impose minimal performance overhead, but enable the efficient verification of their authenticity, (ii) a real-time, graph-based solution to assign fraud scores to user activities, and (iii) mechanisms to dynamically adjust puzzle difficulty levels based on fraud scores and the computational capabilities of devices. FraudSys does not alter the experience of users in online systems, but delays fraudulent actions and consumes significant computational resources of the fraudsters. Using real datasets from Google Play and Facebook, we demonstrate the feasibility of FraudSys by showing that the devices of honest users are minimally impacted, while fraudster controlled devices receive daily computational penalties of up to 3,079 hours. In addition, we show that with FraudSys, fraud does not pay off, as a user equipped with mining hardware (e.g., AntMiner S7) will earn less than half through fraud than from honest Bitcoin mining.
... In particular, when prior knowledge about the correspondence between the vertices can be incorporated into the algorithms, the GMP can be approximately solved efficiently for graphs with more than 10^5 vertices [11,12] without the need for sophisticated modern parallel computing to be brought to bear. A more challenging problem that we will not consider is to detect anomalous subgraphs within a collection of graphs [13]. In the anomaly detection setting, the structure of the anomalous subgraph may be only known up to certain graph characteristics or deviations from the structure of the remaining graph. ...
Preprint
The problem of finding the vertex correspondence between two noisy graphs with different numbers of vertices, where the smaller graph is still large, has many applications in social networks, neuroscience, and computer vision. We propose a solution to this problem via a graph matching matched filter: centering and padding the smaller adjacency matrix and applying graph matching methods to align it to the larger network. The centering and padding schemes can be incorporated into any algorithm that matches using adjacency matrices. Under a statistical model for correlated pairs of graphs, which yields a noisy copy of the small graph within the larger graph, the resulting optimization problem can be guaranteed to recover the true vertex correspondence between the networks. However, there are currently no efficient algorithms for solving this problem. To illustrate the possibilities and challenges of such problems, we use an algorithm that can exploit a partially known correspondence and show via varied simulations and applications to Drosophila and human connectomes that this approach can achieve good performance.
... Some users would have different types of interactions with different communities. Such social relationship among friends can be used to create dependency graph among participants and these graphs can be used to detect anomalies [100] [2]. These graphs help in finding unexpected behavior or any user or community that shows an abnormal activity. ...
Preprint
Pervasive Online Social Networks (POSNs) are extensions of Online Social Networks (OSNs) that facilitate connectivity irrespective of the domain and properties of users. POSNs have emerged from the convergence of a plethora of social networking platforms, with the motivation of bridging the gaps between them. Over the last decade, OSNs have seen tremendous advancement in terms of both the number of users and technology enablers. A single OSN is the property of one organization, which ensures the smooth functioning of its services to provide a quality experience to its users. However, with POSNs, multiple OSNs have coalesced through communities, circles, or only properties, which makes service provisioning tedious and arduous to sustain. Challenges become especially severe when the focus is on the security of cross-platform OSNs, which are an integral part of POSNs. Thus, it is of utmost importance to highlight this requirement and understand the current situation while discussing the available state of the art. With the modernization of OSNs and the convergence towards POSNs, it is necessary to understand the impact and reach of current solutions for enhancing the security of users as well as associated services. This survey addresses this need and focuses on different sets of studies presented over the last few years, surveying them for their applicability to POSNs...
... Change-point and anomaly detection for temporal networks is also concerned with detecting changes of networks over time [7,17-21]. A main difference between our system state dynamics and these methods is that detection of system state dynamics is concerned with not only the change, but what is before and after the change. ...
Preprint
Many time-evolving systems in nature, society and technology leave traces of the interactions within them. These interactions form temporal networks that reflect the states of the systems. In this work, we pursue a coarse-grained description of these systems by proposing a method to assign discrete states to the systems and inferring the sequence of such states from the data. Such states could, for example, correspond to a mental state (as inferred from neuroimaging data) or the operational state of an organization (as inferred by interpersonal communication). Our method combines a graph distance measure and hierarchical clustering. Using several empirical data sets of social temporal networks, we show that our method is capable of inferring the system's states such as distinct activities in a school and a weekday state as opposed to a weekend state. We expect the methods to be equally useful in other settings such as temporally varying protein interactions, ecological interspecific interactions, functional connectivity in the brain and adaptive social networks.
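The combination named in this abstract, a graph distance measure followed by hierarchical clustering, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the distance is taken to be the Jaccard distance between snapshot edge sets, the clustering is average-linkage agglomerative clustering, and the snapshots and the function names `jaccard_distance` and `cluster_snapshots` are hypothetical.

```python
def jaccard_distance(edges_a, edges_b):
    """Jaccard distance between two snapshots given as sets of edges."""
    union = edges_a | edges_b
    if not union:
        return 0.0
    return 1.0 - len(edges_a & edges_b) / len(union)


def cluster_snapshots(snapshots, n_states):
    """Average-linkage agglomerative clustering of snapshots into n_states.

    Returns one state label per snapshot, in the original time order.
    """
    clusters = [[i] for i in range(len(snapshots))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(
            jaccard_distance(snapshots[i], snapshots[j]) for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > n_states:
        pairs = (
            (p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        # Merge the two closest clusters.
        a, b = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(snapshots)
    for state, members in enumerate(clusters):
        for i in members:
            labels[i] = state
    return labels


# Toy sequence of snapshots: two recurring interaction patterns.
snapshots = [
    {(1, 2), (2, 3), (3, 4)},
    {(1, 2), (2, 3)},
    {(1, 5), (5, 6)},
    {(1, 5), (5, 6), (6, 7)},
    {(1, 2), (2, 3), (3, 4)},
]
labels = cluster_snapshots(snapshots, n_states=2)
```

The resulting label sequence is the coarse-grained "state" time series; the number of states is fixed by hand here, whereas a real analysis would have to choose it from the data.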
... One-class classification has many applications such as anomaly or abnormality detection [1], [2], [3], [4], [5], novelty detection [6], [7], [8], and user authentication [9], [10], [11], [12], [13], [14]. For example, in novelty detection, it is normally assumed that one does not have a priori knowledge of the novel class data. ...
Preprint
We present a novel Convolutional Neural Network (CNN) based approach for one class classification. The idea is to use zero-centered Gaussian noise in the latent space as the pseudo-negative class and train the network using the cross-entropy loss to learn a good representation as well as the decision boundary for the given class. A key feature of the proposed approach is that any pre-trained CNN can be used as the base network for one class classification. The proposed One Class CNN (OC-CNN) is evaluated on the UMDAA-02 Face, Abnormality-1001, and FounderType-200 datasets. These datasets are related to a variety of one class application problems such as user authentication, abnormality detection and novelty detection. Extensive experiments demonstrate that the proposed method achieves significant improvements over the recent state-of-the-art methods. The source code is available at: github.com/otkupjnoz/oc-cnn.
... Research in this area has primarily focused on individual aspects of anomaly detection or dimensionality reduction. Comprehensive surveys [1,3,6,22,28] have explored various anomaly detection techniques, offering valuable insights and highlighting unresolved issues. Dimensionality reduction, a key enabler of scalable anomaly detection, has also been extensively reviewed [40]. ...
Preprint
As command-line interfaces remain an integral part of high-computation environments, the risk of exploitation through stealthy, complex command-line abuse continues to grow. Conventional security solutions often struggle with these command-line-based anomalies due to their context-specific nature and lack of labeled data, especially in detecting rare, malicious patterns amidst legitimate, high-volume activity. This gap has left organizations vulnerable to sophisticated threats like Living-off-the-Land (LOL) attacks, where standard detection tools frequently miss or misclassify anomalous command-line behavior. We introduce the Scalable Command-Line Anomaly Detection Engine (SCADE), which addresses these challenges with a dual-layered detection framework that combines a global statistical analysis with local context-specific anomaly detection, using a novel ensemble of statistical models such as BM25 and Log Entropy, adapted for command-line data. The framework also features a dynamic thresholding mechanism for adaptive anomaly detection, ensuring high precision and recall even in environments with extremely high Signal-to-Noise Ratios (SNRs). Initial experimental results demonstrate the effectiveness of the framework, achieving above 98% SNR in identifying unusual command-line behavior while minimizing false positives. In this paper, we present SCADE's core architecture, including its metadata-enriched approach to anomaly detection and the design choices behind its scalability for enterprise-level deployment. We argue that SCADE represents a significant advancement in command-line anomaly detection, offering a robust, adaptive framework for security analysts and researchers seeking to enhance detection accuracy in high-computation environments.
... Our goal is to detect anomalous edges in these different datasets. As in these cases, the true labels for anomalous or regular edges are unavailable, we manually inject random edges into the dataset [8,19,23] and label them as anomalous. Then, we run our anomaly detection algorithm and evaluate the model's ability to identify anomalies. ...
Article
Full-text available
Anomaly detection is an essential task in the analysis of dynamic networks, offering early warnings of abnormal behavior. We present a principled approach to detect anomalies in dynamic networks that integrates community structure as a foundational model for regular behavior. Our model identifies anomalies as irregular edges while capturing structural changes. Our approach leverages a Markovian framework for temporal transitions and latent variables for community and anomaly detection, inferring hidden parameters to detect unusual interactions. Evaluations on synthetic and real-world datasets show strong anomaly detection across various scenarios. In a case study on professional football player transfers, we detect patterns influenced by club wealth and country, as well as unexpected transactions both within and across community boundaries. This work provides a framework for adaptable anomaly detection, highlighting the value of integrating domain knowledge with data-driven techniques for improved interpretability and robustness in complex networks.
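The evaluation protocol quoted in the citing snippet for this entry (injecting random edges into a dataset and labelling them as anomalous) can be sketched as below. This is an illustrative sketch only, assuming a simple undirected graph without self-loops; the function `inject_random_edges`, the toy path graph, and the fixed seed are hypothetical and do not come from the paper.

```python
import random


def inject_random_edges(edges, nodes, n_inject, seed=0):
    """Return [(edge, label)] with label 0 for original and 1 for injected edges.

    Injected edges are sampled uniformly among node pairs that are not
    already connected, so they never duplicate an original edge.
    """
    rng = random.Random(seed)
    existing = {frozenset(e) for e in edges}
    max_new = len(nodes) * (len(nodes) - 1) // 2 - len(existing)
    if n_inject > max_new:
        raise ValueError("not enough unconnected node pairs to inject into")

    injected = set()
    while len(injected) < n_inject:
        u, v = rng.sample(nodes, 2)  # two distinct nodes
        candidate = frozenset((u, v))
        if candidate not in existing and candidate not in injected:
            injected.add(candidate)

    labelled = [(tuple(sorted(e)), 0) for e in existing]
    labelled += [(tuple(sorted(e)), 1) for e in injected]
    return sorted(labelled)


# Toy path graph on six nodes with three injected "anomalous" edges.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
labelled = inject_random_edges(edges, nodes, n_inject=3)
```

A detector's edge scores can then be compared against the 0/1 labels, for example with a ranking metric; which metric the authors use is not stated in the snippet.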
... Behavioral and Device Pattern Graph. Constructing graphs based on user behaviors and device usage patterns (e.g., logins, queries, transaction requests, device types, operating systems, and device IDs) can reveal the similarities or abnormal patterns between different accounts [81][82][83]. By connecting accounts whose behavior and device usage similarity exceeds a certain threshold, it's possible to uncover networks of accounts potentially operated by fraudsters. ...
Preprint
The landscape of financial transactions has grown increasingly complex due to the expansion of global economic integration and advancements in information technology. This complexity poses greater challenges in detecting and managing financial fraud. This review explores the role of Graph Neural Networks (GNNs) in addressing these challenges by proposing a unified framework that categorizes existing GNN methodologies applied to financial fraud detection. Specifically, by examining a series of detailed research questions, this review delves into the suitability of GNNs for financial fraud detection, their deployment in real-world scenarios, and the design considerations that enhance their effectiveness. This review reveals that GNNs are exceptionally adept at capturing complex relational patterns and dynamics within financial networks, significantly outperforming traditional fraud detection methods. Unlike previous surveys that often overlook the specific potentials of GNNs or address them only superficially, our review provides a comprehensive, structured analysis, distinctly focusing on the multifaceted applications and deployments of GNNs in financial fraud detection. This review not only highlights the potential of GNNs to improve fraud detection mechanisms but also identifies current gaps and outlines future research directions to enhance their deployment in financial systems. Through a structured review of over 100 studies, this review paper contributes to the understanding of GNN applications in financial fraud detection, offering insights into their adaptability and potential integration strategies.
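The graph construction described in the citing snippet for this entry (connecting accounts whose behavior and device usage similarity exceeds a threshold, then looking for groups of linked accounts) can be sketched as follows. This is a minimal sketch under assumptions: similarity is taken to be Jaccard similarity over sets of observed attributes, groups are connected components found with a union-find structure, and all account names, attribute values, and the function `linked_account_groups` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def linked_account_groups(features, threshold):
    """Group accounts whose pairwise similarity exceeds `threshold`.

    `features` maps account id -> set of observed attributes.
    Returns the connected components that contain more than one account.
    """
    parent = {acct: acct for acct in features}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(features, 2):
        if jaccard(features[a], features[b]) > threshold:
            parent[find(a)] = find(b)  # add an edge by merging components

    groups = {}
    for acct in features:
        groups.setdefault(find(acct), set()).add(acct)
    return [g for g in groups.values() if len(g) > 1]


# Toy data: three accounts share a device and operating system.
features = {
    "acct1": {"dev-A", "ios", "login"},
    "acct2": {"dev-A", "ios", "transfer"},
    "acct3": {"dev-A", "ios", "login", "transfer"},
    "acct4": {"dev-Z", "android"},
}
groups = linked_account_groups(features, threshold=0.4)
```

The pairwise comparison is quadratic in the number of accounts, so a sketch like this only suits small samples; the threshold value is arbitrary here.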
Chapter
This chapter delves into the realm of anomaly detection in Wireless Sensor Networks (WSNs) and the Internet of Things (IoT), emphasizing their pivotal role in bolstering security. Focusing on diverse domains such as healthcare, environmental monitoring, and process industries, the chapter consolidates findings from various studies employing innovative anomaly detection techniques. One notable approach integrates supervised and unsupervised methods for continuous patient monitoring, showcasing successful anomaly detection in physiological variables using an autoencoder and XGBoost algorithm. The survey extends its scope to large-scale environmental sensing systems, where the proposed Anomaly Detection Framework demonstrates effectiveness in detecting emission events. Moreover, the chapter explores sustainability initiatives, utilizing contextual anomaly detection in collaboration with Powersmiths. The proposed algorithm, validated in simulation environments using historical data, exhibits promising real-time performance. An array of anomaly detection algorithms is presented, addressing challenges in diverse domains. These include a variance-based algorithm for sensor data, BRBAR for handling uncertain sensor data, anomaly detection in medical data, outlier detection in big sensor data, integration of SVM and YASA for activity recognition, density estimation for anomaly detection, and biomedical signal analysis. The survey concludes by highlighting future research directions, emphasizing the importance of addressing challenges in WSNs and IoT, such as resource constraints and collaboration with prevention-based techniques. Ongoing research aims to incorporate data stream mining techniques, apply anomaly detection methods to specific industries, and explore benchmark data selection for comprehensive evaluations.
The taxonomy presented in the survey categorizes techniques, models, and architectures, providing a valuable guide for researchers and practitioners navigating the intricate landscape of anomaly detection in sensor systems. Open research inquiries pave the way for future investigations, contributing to the continuous evolution and improvement of anomaly detection methodologies.
Article
Change point detection is crucial for identifying state transitions and anomalies in dynamic systems, with applications in network security, health care, and social network analysis. Dynamic systems are represented by dynamic graphs with spatial and temporal dimensions. As objects and their relations in a dynamic graph change over time, detecting these changes is essential. Numerous methods for change point detection in dynamic graphs have been developed, but no systematic review exists. This paper addresses this gap by introducing change point detection tasks in dynamic graphs, discussing two tasks based on input data types: detection in graph snapshot series (focusing on graph topology changes) and time series on graphs (focusing on changes in graph entities with temporal dynamics). We then present related challenges and applications, provide a comprehensive taxonomy of surveyed methods, including datasets and evaluation metrics, and discuss promising research directions.
Article
Fraud detection has always been one of the primary concerns in social and economic activities and is becoming a decisive force in the booming digital economy. Graph structures formed by rich user interactions naturally serve as important clues for identifying fraudsters. While numerous graph neural network-based methods have been proposed, the diverse interactive connections within graphs and the heterophilic connections deliberately established by fraudsters to normal users as camouflage pose new research challenges. In this light, we propose H²IDE (Homophily and Heterophily Identification with Disentangled Embeddings) for accurate fraud detection in multi-relation graphs. H²IDE features an independence-constrained disentangled representation learning scheme to capture various latent behavioral patterns in graphs, along with a supervised identification task to specifically model the factor-wise heterophilic connections, both of which are proven crucial to fraud detection. We also design a relation-aware attention mechanism for hierarchical and adaptive neighborhood aggregation in H²IDE. Extensive comparative experiments with state-of-the-art baseline methods on two real-world multi-relation graphs and two large-scale homogeneous graphs demonstrate the superiority and scalability of our proposed method and highlight the key role of disentangled representation learning with homophily and heterophily identification.
Article
Full-text available
Graph Neural Networks (GNNs) are neural models that use message passing between graph nodes to represent the dependencies in graphs. Variants of GNNs, such as graph recurrent networks (GRN), graph attention networks (GAT), and graph convolutional networks (GCN), have shown remarkable results on a variety of deep learning tasks in recent years. In this study, we offer a generic design pipeline for GNN models, go over the variations of each part, classify the applications in an organized manner, and suggest four outstanding research issues. Dealing with graph data, which provides extensive connection information among elements, is necessary for many learning tasks. A model that learns from graph inputs is required for modelling physics systems, learning molecular fingerprints, predicting protein interfaces, and identifying illnesses. Reasoning on extracted structures (such as the dependency trees of sentences and the scene graphs of photos) is an important research issue that also requires graph reasoning models in other domains, such as learning from non-structural data like texts and images. GNNs are primarily designed for dealing with graph-structured data, where relationships between entities are modeled as edges in a graph. While GNNs are not traditionally applied to image classification problems, researchers have explored ways to leverage graph-based structures to enhance the performance of Convolutional Neural Networks (CNNs) in certain scenarios. GNNs have been increasingly applied to Natural Language Processing (NLP) tasks, leveraging their ability to model structured data and capture relationships between elements in a graph. GNNs are also applied to traffic-related problems, particularly in modeling and optimizing traffic flow, analyzing transportation networks, and addressing congestion issues. GNNs can be used for traffic flow prediction, dynamic routing and navigation, anomaly detection, public transport network utilization, etc.
Article
The growing volume of graph data may exhaust the main memory. It is crucial to design a disk-based graph storage system to ingest updates and analyze graphs efficiently. However, existing dynamic graph storage systems suffer from read or write amplification and face the challenge of optimizing both read and write performance simultaneously. To address this challenge, we propose LSMGraph, a novel dynamic graph storage system that combines the write-friendly LSM-tree and the read-friendly CSR. It leverages the multi-level structure of LSM-trees to optimize write performance while utilizing the compact CSR structures embedded in the LSM-trees to boost read performance. LSMGraph uses a new memory structure, MemGraph, to efficiently cache graph updates and uses a multi-level index to speed up reads within the multi-level structure. Furthermore, LSMGraph incorporates a vertex-grained version control mechanism to mitigate the impact of LSM-tree compaction on read performance and ensure the correctness of concurrent read and write operations. Our evaluation shows that LSMGraph significantly outperforms state-of-the-art (graph) storage systems on both graph update and graph analytical workloads.
Article
Full-text available
Anomaly detection identifies objects or events that do not behave as expected or do not correlate with other data points. It has been used to identify and investigate abnormal data components. Detecting anomalous activities is challenging because of insufficient anomalous data and ground-truth training data, differences in environmental conditions, the position of the capturing cameras, and illumination. Anomaly detection has numerous applications, including (but not limited to) industrial damage prevention, sensor networks, health-care services, traffic surveillance, and violence prediction. Machine learning techniques, particularly deep learning, have enabled tremendous advancements in the area of anomaly detection. In this paper, we present a comprehensive review of up-to-date research on anomaly detection techniques. We aim to provide an extensive review of machine learning and deep learning anomaly detection techniques over the three years 2019-2021. In particular, we discuss both machine learning and deep learning anomaly detection applications, performance measurements, and anomaly detection classification. We also point out various datasets that have been applied in anomaly detection, along with some fairly new real-world datasets. Finally, we investigate current challenges and future research prospects in this area.
Conference Paper
Full-text available
If a friend called you 50 times last month, how many times did you call him back? Does the answer change if we ask about SMS, or e-mails? We want to quantify reciprocity between individuals in weighted networks, and we want to discover whether it depends on their topological features (like degree, or number of common neighbors). Here we answer these questions, by studying the call- and SMS records of millions of mobile phone users from a large city, with more than 0.5 billion phone calls and 60 million SMSs, exchanged over a period of six months. Our main contributions are: (1) We propose a novel distribution, the Triple Power Law (3PL), that fits the reciprocity behavior of all 3 datasets we study, with a better fit than older competitors, (2) 3PL is parsimonious; it has only three parameters and thus avoids over-fitting, (3) 3PL can spot anomalies, and we report the most surprising ones, in our real networks, (4) We observe that the degree of reciprocity between users is correlated with their local topological features; reciprocity is higher among mutual users with larger local network overlap and greater degree similarity.
Article
Full-text available
This paper proposes an innovative fraud detection method, built upon existing fraud detection research and Minority Report, to deal with the data mining problem of skewed data distributions. This method uses backpropagation (BP), together with naive Bayesian (NB) and C4.5 algorithms, on data partitions derived from minority oversampling with replacement. Its originality lies in the use of a single meta-classifier (stacking) to choose the best base classifiers, and then combine these base classifiers' predictions (bagging) to improve cost savings (stacking-bagging). Results from a publicly available automobile insurance fraud detection data set demonstrate that stacking-bagging performs slightly better than the best performing bagged algorithm, C4.5, and its best classifier, C4.5 (2), in terms of cost savings. Stacking-bagging also outperforms the common technique used in industry (BP without both sampling and partitioning). Subsequently, this paper compares the new fraud detection method (meta-learning approach) against C4.5 trained using undersampling, oversampling, and SMOTEing without partitioning (sampling approach). Results show that, given a fixed decision threshold and cost matrix, the partitioning and multiple algorithms approach achieves marginally higher cost savings than varying the entire training data set with different class distributions. The most interesting find is confirming that the combination of classifiers to produce the best cost savings has its contributions from all three algorithms.
Article
Full-text available
Evolutionary network analysis has attracted increasing interest in the literature because of the importance of different kinds of dynamic social networks, email networks, biological networks, and social streams. When a network evolves, the results of data mining algorithms such as community detection need to be correspondingly updated. Furthermore, the specific kinds of changes to the structure of the network, such as the impact on community structure or the impact on network structural parameters, such as node degrees, also need to be analyzed. Some dynamic networks have a much faster rate of edge arrival and are referred to as network streams or graph streams. The analysis of such networks is especially challenging, because it needs to be performed with an online approach, under the one-pass constraint of data streams. The incorporation of content can add further complexity to the evolution analysis process. This survey provides an overview of the vast literature on graph evolution analysis and the numerous applications that arise in different contexts.
Chapter
Scan statistics have been used extensively in several areas of science and technology to test for the occurrence of clusters of rare events. In this article we present a survey of results in the area of scan statistics that are applicable to quality control and reliability theory. In quality control, scan statistics have been used in the design and analysis of acceptance sampling schemes. We survey results that are useful in evaluating the operating characteristics of such sampling schemes. In reliability theory, scan statistics have been used in the analysis and design of a k-within-consecutive-m-out-of-n:F system of n components arranged in a linear or circular fashion. This system fails if k components fail within any segment of m consecutive components. In this article, we survey results for approximations and bounds for the system reliability of such systems. Two-dimensional scan statistics have been used in the analysis and design of a k-within-consecutive-(m1, m2)-out-of-(n1, n2):F system of n1n2 components arranged in an n1 by n2 rectangular or cylindrical lattice. This system fails if k components fail within an m1 by m2 rectangular subregion in the n1 by n2 lattice. We survey results for the approximations and bounds for system reliability of these systems. In material science, reliability of certain materials can be compromised by the appearance of a large number of microcracks in a small area or volume. We present a survey of two- and three-dimensional scan statistics that have been used in analyzing the occurrences of such microcracks. Keywords: acceptance sampling; control charts; k-within-consecutive-m-out-of-n system; k-within-consecutive-(m1, m2)-out-of-(n1, n2) system; material fatigue
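The failure rule of the linear k-within-consecutive-m-out-of-n:F system can be illustrated with a short sketch. It assumes "k components fail" means at least k failures in a window, and that components fail independently with the same probability q; the function names `system_fails` and `reliability` are hypothetical, and the brute-force enumeration is only practical for small n (the surveyed approximations and bounds exist precisely because exact computation is expensive).

```python
from itertools import product


def system_fails(failed, k, m):
    """Return True if some window of m consecutive components has >= k failures.

    `failed` is a sequence of booleans (True = failed), arranged linearly.
    """
    n = len(failed)
    m = min(m, n)
    # Count failures in the first window, then slide it one position at a time.
    count = sum(1 for x in failed[:m] if x)
    if count >= k:
        return True
    for i in range(m, n):
        count += int(failed[i]) - int(failed[i - m])
        if count >= k:
            return True
    return False


def reliability(n, k, m, q):
    """Exact reliability by enumerating all 2^n failure patterns.

    Assumption: components fail independently, each with probability q.
    """
    total = 0.0
    for state in product([False, True], repeat=n):
        if not system_fails(state, k, m):
            f = sum(state)  # number of failed components in this pattern
            total += (q ** f) * ((1 - q) ** (n - f))
    return total


pattern = [True, False, False, True, False, True, False, False]
print(system_fails(pattern, k=2, m=3))
print(reliability(n=3, k=2, m=2, q=0.5))
```

The sliding window keeps the failure check linear in n, while the enumeration in `reliability` grows exponentially and serves only as a reference for checking approximations on small systems.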
Conference Paper
How can we visualize billion-scale graphs? How to spot outliers in such graphs quickly? Visualizing graphs is the most direct way of understanding them; however, billion-scale graphs are very difficult to visualize since the amount of information overflows the resolution of a typical screen. In this paper we propose Net-Ray, an open-source package for visualization-based mining on billion-scale graphs. Net-Ray visualizes graphs using the spy plot (adjacency matrix patterns), distribution plot, and correlation plot which involve careful node ordering and scaling. In addition, Net-Ray efficiently summarizes scatter clusters of graphs in a way that finds outliers automatically, and makes it easy to interpret them visually. Extensive experiments show that Net-Ray handles very large graphs with billions of nodes and edges efficiently and effectively. Specifically, among the various datasets that we study, we visualize in multiple ways the YahooWeb graph which spans 1.4 billion webpages and 6.6 billion links, and the Twitter who-follows-whom graph, which consists of 62.5 million users and 1.8 billion edges. We report interesting clusters and outliers spotted and summarized by Net-Ray.
Article
The central idea of the MDL (Minimum Description Length) principle is to represent a class of models (hypotheses) by a universal model capable of imitating the behavior of any model in the class. The principle calls for a model class whose representative assigns the largest probability or density to the observed data. Two examples of universal models for parametric classes $M$ are the normalized maximum likelihood (NML) model $\hat{f}(x^n \mid M) = f(x^n \mid \hat{\theta}(x^n)) / \int_{\Omega} f(y^n \mid \hat{\theta}(y^n))\,dy^n$, where $\Omega$ is an appropriately selected set, and a mixture $f_w(x^n \mid M) = \int f(x^n \mid \theta)\,w(\theta)\,d\theta$ as a convex linear functional of the models. In this interpretation a Bayes factor is the ratio of mixture representatives of two model classes. However, mixtures need not be the best representatives, and as will be shown the NML model provides a strictly better test for the mean being zero in the Gaussian cases where the variance is known or taken as a parameter.
Article
Given a graph with node attributes, how can we find meaningful patterns such as clusters, bridges, and outliers? Attributed graphs appear in the real world in the form of social networks with user interests, gene interaction networks with gene expression information, phone call networks with customer demographics, and many others. In effect, we want to group the nodes into clusters with similar connectivity and homogeneous attributes. Most existing graph clustering algorithms either consider only the connectivity structure of the graph and ignore the node attributes, or require several user-defined parameters such as the number of clusters. We propose PICS, a novel, parameter-free method for mining attributed graphs. Two key advantages of our method are that (1) it requires no user-specified parameters such as the number of clusters and similarity functions, and (2) its running time scales linearly with total graph and attribute size. Our experiments show that PICS reveals meaningful and insightful patterns and outliers in both synthetic and real datasets, including call networks, political books, political blogs, and collections from Twitter and YouTube which have more than 70K nodes and 30K attributes.
Article
Random walk graph kernel has been used as an important tool for various data mining tasks including classification and similarity computation. Despite its usefulness, however, it suffers from the expensive computational cost which is at least O(n^3) or O(m^2) for graphs with n nodes and m edges. In this paper, we propose Ark, a set of fast algorithms for random walk graph kernel computation. Ark is based on the observation that real graphs have much lower intrinsic ranks, compared with the orders of the graphs. Ark exploits the low rank structure to quickly compute random walk graph kernels in O(n^2) or O(m) time. Experimental results show that our method is up to 97,865× faster than the existing algorithms, while providing more than 91.3% of the accuracies.
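For readers unfamiliar with the quantity being accelerated, the following is a minimal sketch of a plain random walk graph kernel, not of Ark itself. It assumes the commonly used geometric form, a sum over walk lengths l of c^l q^T W^l p with W the Kronecker product of the two adjacency matrices, uniform start and stop distributions p and q, and a truncated series; the function name and the parameters `decay` and `steps` are hypothetical.

```python
def random_walk_kernel(a1, a2, decay=0.5, steps=60):
    """Truncated geometric random walk kernel between two graphs.

    a1, a2: adjacency matrices as lists of lists.
    Assumptions: uniform start/stop distributions, and a decay small enough
    for the series to converge (decay * rho(a1) * rho(a2) < 1).
    """
    n1, n2 = len(a1), len(a2)
    weight = 1.0 / (n1 * n2)
    # x[i][j] is the current walk mass on product-graph node (i, j).
    x = [[weight] * n2 for _ in range(n1)]
    kernel, scale = 0.0, 1.0
    for _ in range(steps + 1):
        # Contribution of walks of the current length: scale * q^T x.
        kernel += scale * weight * sum(sum(row) for row in x)
        # One simultaneous step in both graphs: X <- A1 * X * A2^T,
        # which equals applying the Kronecker product without building it.
        ax = [[sum(a1[i][k] * x[k][j] for k in range(n1)) for j in range(n2)]
              for i in range(n1)]
        x = [[sum(ax[i][k] * a2[j][k] for k in range(n2)) for j in range(n2)]
             for i in range(n1)]
        scale *= decay
    return kernel


edge = [[0, 1], [1, 0]]  # a single undirected edge
print(random_walk_kernel(edge, edge))
```

This baseline only illustrates the definition; the low-rank approximation that the abstract describes would replace the dense matrix products with operations on a few leading eigenpairs.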
Article
We focus on the problem of query rewriting for sponsored search. We base rewrites on a historical click graph that records the ads that have been clicked on in response to past user queries. Given a query q, we first consider Simrank [7] as a way to identify queries similar to q, i.e., queries whose ads a user may be interested in. We argue that Simrank fails to properly identify query similarities in our application, and we present two enhanced versions of Simrank: one that exploits weights on click graph edges and another that exploits "evidence." We experimentally evaluate our new schemes against Simrank, using actual click graphs and queries from Yahoo!, and using a variety of metrics. Our results show that the enhanced methods can yield more and better query rewrites.
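As background for the baseline mentioned above, here is a minimal sketch of basic SimRank, not of the weighted or evidence-based variants the abstract proposes. It assumes the usual recursive definition (two nodes are similar if their neighbors are similar, scaled by a decay constant), treats the click graph as an undirected bipartite graph, and uses made-up query and ad names; `simrank`, `decay`, and `iterations` are hypothetical names.

```python
def simrank(neighbors, decay=0.8, iterations=10):
    """Basic SimRank by fixed-point iteration.

    neighbors: dict mapping every node to the list of its (in-)neighbors.
    Returns a nested dict sim[a][b] of similarity scores.
    """
    nodes = list(neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                na, nb = neighbors[a], neighbors[b]
                if not na or not nb:
                    new[a][b] = 0.0
                    continue
                # Average similarity over all pairs of neighbors, then decay.
                total = sum(sim[i][j] for i in na for j in nb)
                new[a][b] = decay * total / (len(na) * len(nb))
        sim = new
    return sim


# Hypothetical click graph: two camera queries share an ad, "flower" does not.
clicks = {
    "camera": ["ad_1"],
    "digital camera": ["ad_1"],
    "flower": ["ad_2"],
    "ad_1": ["camera", "digital camera"],
    "ad_2": ["flower"],
}
scores = simrank(clicks)
print(scores["camera"]["digital camera"], scores["camera"]["flower"])
```

In this toy graph the two camera queries receive a high score purely because they share one ad, regardless of how often it was clicked, which is the kind of behavior that motivates taking edge weights into account.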
Article
Systems as diverse as genetic networks or the World Wide Web are best described as networks with complex topology. A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected. A model based on these two ingredients reproduces the observed stationary scale-free distributions, which indicates that the development of large networks is governed by robust self-organizing phenomena that go beyond the particulars of the individual systems.
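The two ingredients, growth and preferential attachment, can be sketched in a few lines. The abstract does not fix the details, so this sketch assumes a small complete seed graph, m new edges per arriving vertex, and attachment probability proportional to current degree; the function name and parameters are hypothetical.

```python
import random


def preferential_attachment(n, m, seed=0):
    """Grow a graph to n vertices; each new vertex attaches to m existing ones.

    Assumptions: the seed graph is complete on m + 1 vertices, and targets are
    chosen with probability proportional to their current degree.
    """
    rng = random.Random(seed)
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Every vertex appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over vertices.
    endpoints = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges


edges = preferential_attachment(n=50, m=2, seed=1)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(len(edges), max(degree.values()))
```

Keeping a flat list of edge endpoints avoids recomputing degree-based probabilities at every step, at the cost of memory linear in the number of edges.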
Article
User-generated online reviews can play a significant role in the success of retail products, hotels, restaurants, etc. However, review systems are often targeted by opinion spammers who seek to distort the perceived quality of a product by creating fraudulent reviews. We propose a fast and effective framework, FRAUDEAGLE, for spotting fraudsters and fake reviews in online review datasets. Our method has several advantages: (1) it exploits the network effect among reviewers and products, unlike the vast majority of existing methods that focus on review text or behavioral analysis, (2) it consists of two complementary steps: scoring users and reviews for fraud detection, and grouping for visualization and sensemaking, (3) it operates in a completely unsupervised fashion requiring no labeled data, while still incorporating side information if available, and (4) it is scalable to large datasets as its run time grows linearly with network size. We demonstrate the effectiveness of our framework on synthetic and real datasets, where FRAUDEAGLE successfully reveals fraud-bots in a large online app review database. Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Article
Given large, multimillion-node graphs (e.g., Facebook, Web-crawls, etc.), how do they evolve over time? How are they connected? What are the central nodes and the outliers? In this article we define the Radius plot of a graph and show how it can answer these questions. However, computing the Radius plot is prohibitively expensive for graphs reaching the planetary scale. There are two major contributions in this article: (a) we propose HADI (HAdoop DIameter and radii estimator), a carefully designed and fine-tuned algorithm to compute the radii and the diameter of massive graphs, that runs on top of the Hadoop / MapReduce system, with excellent scale-up on the number of available machines; (b) we run HADI on several real-world datasets including YahooWeb (6B edges, 1/8 of a Terabyte), one of the largest public graphs ever analyzed. Thanks to HADI, we report fascinating patterns on large networks, like the surprisingly small effective diameter, the multimodal/bimodal shape of the Radius plot, and its palindrome motion over time.
Article
Distance or similarity measures are essential to solve many pattern recognition problems such as classification, clustering, and retrieval problems. Various distance/similarity measures that are applicable to compare two probability density functions, pdf in short, are reviewed and categorized in both syntactic and semantic relationships. A correlation coefficient and a hierarchical clustering technique are adopted to reveal similarities among numerous distance/similarity measures.
Article
Detecting outliers in data is an important problem with interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches. In this paper we assess several distance-based outlier detection approaches and evaluate them. We begin by surveying and examining the design landscape of extant approaches, while identifying key design decisions of such approaches. We then implement an outlier detection framework and conduct a factorial design experiment to understand the pros and cons of various optimizations proposed by us as well as those proposed in the literature, both independently and in conjunction with one another, on a diverse set of real-life datasets. To the best of our knowledge this is the first such study in the literature. The outcome of this study is a family of state of the art distance-based outlier detection algorithms. Our detailed empirical study supports the following observations. The combination of optimization strategies enables significant efficiency gains. Our factorial design study highlights the important fact that no single optimization or combination of optimizations (factors) always dominates on all types of data. Our study also allows us to characterize when a certain combination of optimizations is likely to prevail and helps provide interesting and useful insights for moving forward in this domain.
Conference Paper
How can we find the virtual twin (i.e., the same or similar user) on LinkedIn for a user on Facebook? How can we effectively link an information network with a social network to support cross-network search? Graph alignment - the task of finding the node correspondences between two given graphs - is a fundamental building block in numerous application domains, such as social network analysis, bioinformatics, chemistry, and pattern recognition. In this work, we focus on aligning bipartite graphs, a problem which has been largely ignored by the extensive existing work on graph matching, despite the ubiquity of those graphs (e.g., users-groups network). We introduce a new optimization formulation and propose an effective and fast algorithm to solve it. We also propose a fast generalization of our approach to align unipartite graphs. The extensive experimental evaluations show that our method outperforms the state-of-the-art graph matching algorithms in both alignment accuracy and running time, being up to 10x more accurate or 174x faster on real graphs.
Article
Online shopping reviews provide valuable information for customers to compare the quality of products, store services, and many other aspects of future purchases. However, spammers are joining this community, trying to mislead and confuse consumers by writing fake or unfair reviews. Previous attempts have used reviewers’ behaviors, such as text similarity and rating patterns, to detect spammers. These studies are able to identify certain types of spammers, for instance, those who post many similar reviews about one target. However, in reality, there are other kinds of spammers who can manipulate their behaviors to act just like normal reviewers, and thus cannot be detected by the available techniques. In this article, we propose a novel concept of review graph to capture the relationships among all reviewers, reviews and stores that the reviewers have reviewed as a heterogeneous graph. We explore how interactions between nodes in this graph could reveal the cause of spam and propose an iterative computation model to identify suspicious reviewers. In the review graph, we have three kinds of nodes, namely, reviewer, review, and store. We capture their relationships by introducing three fundamental concepts, the trustiness of reviewers, the honesty of reviews, and the reliability of stores, and identifying their interrelationships: a reviewer is more trustworthy if the person has written more honest reviews; a store is more reliable if it has more positive reviews from trustworthy reviewers; and a review is more honest if many other honest reviews support it. This is the first time such intricate relationships have been identified for spam detection and captured in a graph model. We further develop an effective computation method based on the proposed graph model. Different from any existing approaches, we do not use any review text information. Our model is thus complementary to existing approaches and able to find more difficult and subtle spamming activities, which are agreed upon by human judges after they evaluate our results.
Book
The problem of outliers is one of the oldest in statistics, and during the last century and a half interest in it has waxed and waned several times. Currently it is once again an active research area after some years of relative neglect, and recent work has solved a number of old problems in outlier theory, and identified new ones. The major results are, however, scattered amongst many journal articles, and for some time there has been a clear need to bring them together in one place. That was the original intention of this monograph: but during execution it became clear that the existing theory of outliers was deficient in several areas, and so the monograph also contains a number of new results and conjectures. In view of the enormous volume of literature on the outlier problem and its cousins, no attempt has been made to make the coverage exhaustive. The material is concerned almost entirely with the use of outlier tests that are known (or may reasonably be expected) to be optimal in some way. Such topics as robust estimation are largely ignored, being covered more adequately in other sources. The numerous ad hoc statistics proposed in the early work on the grounds of intuitive appeal or computational simplicity also are not discussed in any detail.
Article
This paper postulates that there are natural distributions of opinions in product reviews. In particular, we hypothesize that for a given domain, there is a set of representative distributions of review rating scores. A deceptive business entity that hires people to write fake reviews will necessarily distort its distribution of review scores, leaving distributional footprints behind. In order to validate this hypothesis, we introduce strategies to create a dataset with a pseudo-gold standard that is labeled automatically based on different types of distributional footprints. A range of experiments confirm the hypothesized connection between the distributional anomaly and deceptive reviews. This study also provides novel quantitative insights into the characteristics of natural distributions of opinions in the TripAdvisor hotel review and the Amazon product review domains.
Article
Anomaly and event detection has been studied widely because of its many applications in fraud detection, network intrusion detection, detection of epidemic outbreaks, and so on. In this paper we propose an algorithm that operates on a time-varying network of agents with edges representing interactions between them and (1) spots "anomalous" points in time at which many agents "change" their behavior in a way that deviates from the norm; and (2) attributes the detected anomaly to those agents that contribute to the "change" the most. Experiments on a large mobile phone network (of 2 million anonymous customers with 50 million interactions over a period of 6 months) show that the "change"-points detected by our algorithm coincide with the social events and the festivals in our data.
Article
Network dynamics has become a popular area of study as it is well known that networks evolve and adapt over time. With this in mind, abnormal change detection is critical to the understanding and control of network dynamics. This paper presents differences in graph diameter as a method for detecting abnormal change in a network time series. A formal definition of graph diameter is presented, with theoretical implications, examples and computational results. Also presented is an apparent means for characterization of network state without dependence on other networks in the time series. This leads directly to the ability to identify anomalous change and characterize the effects on the network communications.
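The idea of tracking diameter differences across a network time series can be sketched as follows. The abstract does not reproduce the paper's formal definition, so this sketch assumes the unweighted hop-count diameter (the longest shortest path among reachable pairs) and flags a time step when the diameter changes by more than a chosen threshold; all function names and the threshold are hypothetical.

```python
from collections import deque


def build_adj(nodes, edges):
    """Undirected adjacency sets from a node list and an edge list."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def eccentricity(adj, source):
    """Largest hop distance from `source` to any node reachable from it (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(adj):
    """Hop-count diameter: the maximum eccentricity over all nodes."""
    return max(eccentricity(adj, s) for s in adj)


def diameter_changes(snapshots, threshold):
    """Return per-snapshot diameters and the time steps whose change exceeds threshold."""
    diams = [diameter(g) for g in snapshots]
    flagged = [t for t in range(1, len(diams))
               if abs(diams[t] - diams[t - 1]) > threshold]
    return diams, flagged


nodes = [0, 1, 2, 3]
path = build_adj(nodes, [(0, 1), (1, 2), (2, 3)])
clique = build_adj(nodes, [(a, b) for a in nodes for b in nodes if a < b])
diams, flagged = diameter_changes([path, path, clique], threshold=1)
print(diams, flagged)
```

Running one breadth-first search per node is quadratic in graph size, so a production version on large networks would likely need sampling or an approximate diameter estimator.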
Article
Graph clustering and graph outlier detection have been studied extensively on plain graphs, with various applications. Recently, algorithms have been extended to graphs with attributes as often observed in the real-world. However, all of these techniques fail to incorporate the user preference into graph mining, and thus, lack the ability to steer algorithms to more interesting parts of the attributed graph. In this work, we overcome this limitation and introduce a novel user-oriented approach for mining attributed graphs. The key aspect of our approach is to infer user preference by the so-called focus attributes through a set of user-provided exemplar nodes. In this new problem setting, clusters and outliers are then simultaneously mined according to this user preference. Specifically, our FocusCO algorithm identifies the focus, extracts focused clusters and detects outliers. Moreover, FocusCO scales well with graph size, since we perform a local clustering of interest to the user rather than global partitioning of the entire graph. We show the effectiveness and scalability of our method on synthetic and real-world graphs, as compared to both existing graph clustering and outlier detection approaches.
Article
Data sources representing social networks with additional attribute information about the nodes are widely available in today's applications. Recently, combined clustering methods were introduced that consider graph information and attribute information simultaneously to detect meaningful clusters in such networks. In many cases, such attributed graphs also evolve over time. Therefore, there is a need for clustering methods that are able to trace clusters over different time steps and analyze their evolution over time. In this paper, we extend our combined clustering method DB-CSC to the analysis of evolving combined clusters.