Conference Paper

Robust Anomaly Detection for Large-Scale Sensor Data


Abstract

Large-scale sensor networks are ubiquitous nowadays. An important objective of deploying sensors is to detect anomalies in the monitored system or infrastructure, so that remedial measures can be taken to prevent failures, inefficiencies, and security breaches. Most existing sensor anomaly detection methods are local: they neither capture the global dependency structure of the sensors nor perform well in the presence of missing or erroneous data. In this paper, we propose an anomaly detection technique for large-scale sensor data that leverages relationships between sensors to improve robustness even when data is missing or erroneous. We develop a probabilistic graphical-model-based global outlier detection technique that represents a sensor network as a pairwise Markov Random Field and uses graphical model inference to detect anomalies. We show that our model is more robust than local models and detects anomalies with 90% accuracy even when 50% of the sensors are erroneous. We also build a synthetic graphical model generator that preserves statistical properties of a real data set, to test our outlier detection technique at scale.
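The inference step the abstract describes (belief propagation on a pairwise Markov Random Field) can be sketched in a few lines. Below is a minimal illustration, not the authors' implementation: the graph, the homophilic edge potential psi, and the per-sensor priors phi are all toy assumptions.

```python
# Minimal sketch (not the paper's implementation): loopy belief propagation
# on a pairwise MRF over a sensor graph, with binary states
# (0 = normal, 1 = anomalous). Graph, potentials, and priors are toy values.
import networkx as nx
import numpy as np

def loopy_bp(G, node_potential, edge_potential, n_iters=20):
    # messages[(i, j)] = message vector from node i to neighbor j
    messages = {(i, j): np.ones(2)
                for e in G.edges() for (i, j) in (e, e[::-1])}
    for _ in range(n_iters):
        new = {}
        for (i, j) in messages:
            # m_ij(x_j) = sum_{x_i} phi_i(x_i) psi(x_i, x_j) * prod(incoming)
            msgs_in = [messages[(k, i)] for k in G.neighbors(i) if k != j]
            incoming = np.prod(msgs_in, axis=0) if msgs_in else np.ones(2)
            m = edge_potential.T @ (node_potential[i] * incoming)
            new[(i, j)] = m / m.sum()          # normalize for stability
        messages = new
    beliefs = {}
    for i in G.nodes():
        b = node_potential[i] * np.prod(
            [messages[(k, i)] for k in G.neighbors(i)], axis=0)
        beliefs[i] = b / b.sum()               # beliefs[i][1] = P(anomalous)
    return beliefs

# Toy usage: a homophilic edge potential says neighboring sensors tend to agree.
G = nx.karate_club_graph()
psi = np.array([[0.9, 0.1], [0.1, 0.9]])
phi = {i: np.array([0.8, 0.2]) for i in G.nodes()}   # local evidence
phi[0] = np.array([0.05, 0.95])                      # one suspicious sensor
beliefs = loopy_bp(G, phi, psi)
```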


... Sensors are often used to track various environmental and location parameters in many real-world applications. Anomalies in sensor data refer to sensor faults or unexpected events such as intrusions (Rajasegarar et al., 2008; Hayes and Capretz, 2014; Rabatel et al., 2011; Chakrabarti et al., 2016). Sensor data can be binary, discrete, continuous, audio, video, etc. ...
Thesis
Supervision and monitoring tools are commonly used in industry to analyze data coming from different sensors. These data are often affected by unusual events or temporary changes, and tend to contain irregularities and outliers whose detection requires domain knowledge and human intervention. In such situations, anomaly detection can be a crucial means of identifying abnormal events and unusual behavior, allowing experts to act quickly and mitigate the effects of an undesirable situation. In this thesis, we focus on the use of machine learning techniques to automate and consolidate the anomaly detection process in sensor network data. These data come from sensors and take the form of time series. To this end, we defined two main objectives: the detection of multiple anomalies, and the generation of human-interpretable rules for anomaly detection. The first objective is to detect different types of anomalies in sensor data. There is extensive existing research on anomaly detection; however, most techniques look for individual objects or data sequences that differ from normal ones, and do not address the detection of multiple anomalies. To solve this problem and meet our first objective, we created a configurable multiple-anomaly detection system that uses patterns to detect anomalies in time series. The algorithm we propose, Composition of Remarkable Points (CoRP), is based on the principle of pattern matching. It applies a set of patterns to annotate the remarkable points in a univariate time series, then detects anomalies by composing patterns. The annotation patterns and pattern compositions are defined with the help of a domain expert. Our method has the advantage of localizing and categorizing the different types of detected anomalies. The second objective of the thesis is the generation of rules that experts can interpret and understand for anomaly detection. To this end, we proposed an algorithm, Composition-based Decision Tree (CDT), which automatically produces rules that experts can adjust and modify. We designed a variable model of the patterns for detecting remarkable points in order to label the time series. Based on the labeled time series, a decision tree is built whose nodes are treated as compositions of patterns. Finally, the tree is converted into a set of decision rules understandable by experts, and we defined a quality measure for the produced rules. We tested the performance of CoRP and CDT against competing methods, on real data and on benchmark data from the literature. Both methods prove effective for detecting multiple anomalies, achieving good precision with a high detection rate and a low false-positive rate.
The work developed in this thesis was carried out within the neoCampus project and funded by the Service de Gestion et d'Exploitation attached to the rectorate of Toulouse.
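The thesis entry is a summary only; to make the pattern-composition idea concrete, here is a toy sketch, not the CoRP implementation. The annotation patterns (peak, trough, jump), the threshold, and the composition rule are all invented for illustration; in CoRP they come from the domain expert.

```python
# Toy illustration of the pattern idea behind CoRP (not the thesis code):
# annotate "remarkable points" in a univariate series with labels, then flag
# anomalies as expert-defined compositions of consecutive labels.
import numpy as np

def annotate(series, jump=3.0):
    labels = []
    for t in range(1, len(series) - 1):
        d_prev = series[t] - series[t - 1]
        d_next = series[t + 1] - series[t]
        if d_prev > jump and d_next < -jump:
            labels.append((t, "peak"))
        elif d_prev < -jump and d_next > jump:
            labels.append((t, "trough"))
        elif abs(d_prev) > jump:
            labels.append((t, "jump"))
    return labels

def detect(labels, composition=("peak", "jump")):
    # an anomaly = consecutive remarkable points matching the composition
    return [(t1, t2) for (t1, l1), (t2, l2) in zip(labels, labels[1:])
            if (l1, l2) == composition]

x = np.array([0, 0, 5, 0, 0, 6, 0, 0], dtype=float)
print(detect(annotate(x)))  # [(2, 3), (5, 6)]
```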
... Data aggregation [45], consumption analysis [20, 1, 15, 46, 60, 93-98], detection of anomalies [38, 50, 99-117]:
• Principal component analysis (PCA) [50]
• ARIMA and adaptive artificial neural network (ANN) [49]
• K-nearest neighbor (KNN) [102]
• PARX [118]
• Log-normal distribution function [118]
• Statistical-based [119]
• Nearest-neighbor-based [119]
• Cluster-based [119]
• Classification-based [119]
• Spectral decomposition-based techniques [119]
• Unsupervised contextual and collective detection approach [120]
• Stacked sparse autoencoder [121] ...
Article
Purpose: This paper focuses on the data analytic tools and integrated data-analysis approaches used on smart energy meters (SEMs). After surveying the diverse techniques and frameworks for SEM data analysis, the authors propose a novel framework that uses a gamification approach to increase consumers' involvement in conserving energy and improving efficiency.
Design/methodology/approach: A few research strategies have been reported for analyzing the raw data, yet considerable work remains to make them commercially viable. The proposed framework combines data analytic tools and integrated data-analysis approaches for SEMs with gamification. The advantages of SEMs are also discussed to motivate consumers, utilities and their respective partners.
Findings: Consumers, utilities and researchers can benefit from the recommended framework by planning their routine activities and enjoying the rewards offered by the gamification approach. Through gamification, consumer engagement increases, and less sustainable behavior changes on a voluntary basis. Practical implementations of such approaches have shown improved energy efficiency as a consequence.
Chapter
In current energy production and distribution systems, the smart energy meter has become a significant conceptual paradigm. There is a dire need to make energy usage more efficient and effective, given limited nonrenewable energy resources and the high cost of renewable energies (REs). This creates a critical environment for future economic development and social improvements such as smart cities. In recent years, large numbers of smart meters have been installed in residential areas and other sites of smart cities. Smart meters can provide numerous informative recordings of electricity consumption, along with accurate billing, Automated Meter Reading (AMR) data processing, detection of energy theft, early warning of blackouts, fast detection of disturbances in energy supply, real-time pricing updates, and a Demand Response (DR) system for energy saving and efficient use of the energy generated. To take full advantage of smart metering intelligence, a number of technical issues must be addressed. The major concern is working with very large volumes of data, which requires efficient data fusion and integration techniques. Numerous big data integration and analytics engines are needed to perform tasks such as outage management, asset management, fault detection (especially for DR systems), customer segmentation, load forecasting, and targeting. Data analytic approaches transform volumes of data into actionable information for consumers, utilities, and authorities. Although numerous analytical algorithms can process huge volumes of data, many cannot complete these tasks adequately. A few research procedures for analyzing streaming data have been reported, but much work remains to make them commercially viable. In this chapter, a data analytics framework for smart energy meters is proposed that employs the latest data processing techniques and tools along with a gamification approach for enhancing consumer engagement. The benefits of smart energy meter analytics are also discussed to motivate consumers, utilities, and stakeholders. Researchers, utilities, and authorities can benefit from the proposed algorithm by planning future actions with the additional participation of real-time consumers enabled by the gamification approach. Through gamification, consumer engagement improves, altering less sustainable behavior on a voluntary basis.
Conference Paper
Full-text available
The increasing sophistication of malicious software calls for new defensive techniques that are harder to evade, and are capable of protecting users against novel threats. We present Aesop, a scalable algorithm that identifies malicious executable files by applying Aesop's moral that "a man is known by the company he keeps." We use a large dataset voluntarily contributed by the members of Norton Community Watch, consisting of partial lists of the files that exist on their machines, to identify close relationships between files that often appear together on machines. Aesop leverages locality-sensitive hashing to measure the strength of these inter-file relationships to construct a graph, on which it performs large-scale inference by propagating information from the labeled files (as benign or malicious) to the preponderance of unlabeled files. Aesop attained early labeling of 99% of benign files and 79% of malicious files, over a week before they are labeled by the state-of-the-art techniques, with a 0.9961 true positive rate at flagging malware, at 0.0001 false positive rate.
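As a rough illustration of how locality-sensitive hashing can score inter-file relationships, here is a MinHash sketch in the spirit of (but not taken from) Aesop; the file-to-machine sets below are made up.

```python
# Sketch "in the spirit of" Aesop (not Symantec's code): estimate how
# strongly two files co-occur across machines via MinHash signatures of
# their machine sets; strongly related files then get a graph edge, over
# which benign/malicious labels can be propagated.
import hashlib

def minhash(machine_ids, num_hashes=64):
    # one signature slot per hash seed: the minimum hash over the set
    return [min(int(hashlib.md5(f"{seed}:{m}".encode()).hexdigest(), 16)
                for m in machine_ids)
            for seed in range(num_hashes)]

def est_jaccard(sig_a, sig_b):
    # fraction of matching slots estimates Jaccard similarity of the sets
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

f1 = minhash({"m1", "m2", "m3", "m4"})
f2 = minhash({"m1", "m2", "m3", "m5"})
print(est_jaccard(f1, f2))  # estimates the true Jaccard 3/5 = 0.6
```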
Article
Full-text available
Recently, researchers have demonstrated that loopy belief propagation (the use of Pearl's polytree algorithm in a Bayesian network with loops) can perform well in the context of error-correcting codes. The most dramatic instance of this is the near Shannon-limit performance of Turbo Codes, whose decoding algorithm is equivalent to loopy belief propagation in a chain-structured Bayesian network. In this paper we ask: is there something special about the error-correcting code context, or does loopy propagation work as an approximate inference scheme in a more general setting? We compare the marginals computed using loopy propagation to the exact ones in four Bayesian network architectures, including two real-world networks: ALARM and QMR. We find that the loopy beliefs often converge, and when they do, they give a good approximation to the correct marginals. However, on the QMR network, the loopy beliefs oscillated and had no obvious relationship to the correct posteriors. We present some initial investigations into the cause of these oscillations, and show that some simple methods of preventing them lead to the wrong results.
Article
Full-text available
In the field of wireless sensor networks, those measurements that significantly deviate from the normal pattern of sensed data are considered as outliers. The potential sources of outliers include noise and errors, events, and malicious attacks on the network. Traditional outlier detection techniques are not directly applicable to wireless sensor networks due to the nature of sensor data and specific requirements and limitations of the wireless sensor networks. This survey provides a comprehensive overview of existing outlier detection techniques specifically developed for the wireless sensor networks. Additionally, it presents a technique-based taxonomy and a comparative table to be used as a guideline to select a technique suitable for the application at hand based on characteristics such as data type, outlier type, outlier identity, and outlier degree.
Conference Paper
Full-text available
We present an efficient algorithm to generate random graphs with a given sequence of expected degrees. Existing algorithms run in O(N^2) time, where N is the number of nodes. We prove that our algorithm runs in O(N + M) expected time, where M is the expected number of edges. If the expected degrees are chosen from a distribution with finite mean, this is O(N) as N → ∞.
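This algorithm is implemented in NetworkX as expected_degree_graph (which cites Miller and Hagberg); a quick usage sketch, with an illustrative expected-degree sequence:

```python
# The O(N + M) generator described above, via NetworkX's implementation.
# The power-law-ish expected-degree sequence is illustrative only.
import networkx as nx

w = [10 / (i + 1) ** 0.5 for i in range(10_000)]   # expected degrees
G = nx.expected_degree_graph(w, selfloops=False, seed=42)
print(G.number_of_nodes(), G.number_of_edges())
```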
Article
Full-text available
We describe a learning-based method for low-level vision problems—estimating scenes from images. We generate a synthetic world of scenes and their corresponding rendered images, modeling their relationships with a Markov network. Bayesian belief propagation allows us to efficiently find a local maximum of the posterior probability for the scene, given an image. We call this approach VISTA—Vision by Image/Scene TrAining. We apply VISTA to the “super-resolution” problem (estimating high frequency details from a low-resolution image), showing good results. To illustrate the potential breadth of the technique, we also apply it in two other problem domains, both simplified. We learn to distinguish shading from reflectance variations in a single image under particular lighting conditions. For the motion estimation problem in a “blobs world”, we show figure/ground discrimination, solution of the aperture problem, and filling-in arising from application of the same probabilistic machinery.
Article
Full-text available
Numerous real-world applications produce networked data such as web data (hypertext documents connected via hyperlinks) and communication networks (people connected via communication links). A recent focus in machine learning research has been to extend traditional machine learning classification techniques to classify nodes in such data. In this report, we attempt to provide a brief introduction to this area of research and how it has progressed during the past decade. We introduce four of the most widely used inference algorithms for classifying networked data and empirically compare them on both synthetic and real-world data.
Article
Full-text available
For each category, we identify key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques, since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Article
Full-text available
In this paper, we propose a belief-propagation (BP)-based decoding algorithm which utilizes normalization to improve the accuracy of the soft values delivered by a previously proposed simplified BP-based algorithm. The normalization factors can be obtained not only by simulation, but also, importantly, theoretically. This new BP-based algorithm is much simpler to implement than BP decoding, as it requires only additions of the normalized received values, and it is universal, i.e., the decoding is independent of the channel characteristics. Some simulation results are given, which show this new decoding approach can achieve an error performance very close to that of BP on the additive white Gaussian noise channel, especially for low-density parity-check (LDPC) codes whose check sums have large weights. The principle of normalization can also be used to improve the performance of the max-log-MAP algorithm in turbo decoding, and some coding gain can be achieved if the code length is long enough.
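For reference, one common way to write the normalized (min-sum style) check-node update is shown below; the notation is ours, not the paper's, with alpha the normalization factor and L the log-likelihood-ratio messages.

```latex
% Normalized check-node update (common form): alpha is the normalization
% factor, L are LLR messages, N(c) the variable nodes of check node c.
L_{c \to v} = \alpha \left( \prod_{v' \in N(c) \setminus \{v\}}
    \operatorname{sign}\, L_{v' \to c} \right)
    \min_{v' \in N(c) \setminus \{v\}} \left| L_{v' \to c} \right|,
\qquad 0 < \alpha \le 1 .
```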
Article
Full-text available
We attempt to trace the history and development of Markov chain Monte Carlo (MCMC) from its early inception in the late 1940s through its use today. We see how the earlier stages of Monte Carlo (MC, not MCMC) research have led to the algorithms currently in use. More importantly, we see how the development of this methodology has not only changed our solutions to problems, but has changed the way we think about problems.
Chapter
The kernel estimator can be motivated not only as the limiting case of the averaged shifted histogram but also by other techniques. The kernel density estimate inherits all the properties of its kernel. It is easy to check that the ratio of asymptotic integrated variance (AIV) to asymptotic integrated squared bias (AISB) in the asymptotic mean integrated squared error (AMISE) is 4:1. The theoretical analysis of multivariate kernel estimators is the same as for frequency polygons, save for a few details. The unbiased and biased cross-validation algorithms for the histogram are easily extended to both kernel and averaged shifted histogram (ASH) estimators. Adaptive estimators are made feasible by reducing the dimension of the adaptive smoothing function. This chapter presents a survey of kernel methods by revisiting several options for computing a kernel estimator on an equally spaced mesh based on a sample of size n.
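A minimal kernel density estimate on an equally spaced mesh, assuming SciPy is acceptable; the mixture sample is invented, and gaussian_kde applies Scott's rule for bandwidth by default.

```python
# Minimal kernel density estimate evaluated on an equally spaced mesh.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0, 1, 400), rng.normal(4, 0.5, 100)])
mesh = np.linspace(-4, 7, 512)     # equally spaced evaluation mesh
kde = gaussian_kde(sample)         # bandwidth via Scott's rule by default
density = kde(mesh)
```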
Conference Paper
Large-scale deployment of sensors is essential to practical applications in cyber-physical systems. For instance, instrumenting a commercial building for "smart energy" management requires deployment and operation of thousands of measurement and metering sensors and actuators that direct operation of the HVAC system. Each of these sensors needs to be named consistently and constantly calibrated. Doing this manually is not only time-consuming but also error-prone, given the scale, heterogeneity and complexity of buildings as well as the lack of uniform naming schemas. To address this challenge, we propose Zodiac, a framework for automatically classifying, naming and managing sensors based on active learning from sensor metadata. In contrast to prior work, Zodiac requires minimal user input in terms of labeling examples while being more accurate. To evaluate Zodiac, we deploy it across four real buildings on our campus and manually label the ground-truth metadata for all the sensors in these buildings. Using a combination of hierarchical clustering and random forest classifiers, we show that Zodiac can successfully classify sensors with an average accuracy of 98% with 28% fewer training examples than a regular-expression-based method.
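This is not the Zodiac pipeline itself (which combines hierarchical clustering, active learning, and random forests); the sketch below only illustrates the core idea of classifying sensor types from metadata strings with a random forest over character n-grams. The point names and labels are invented.

```python
# Toy sketch of metadata-based sensor classification (not Zodiac itself):
# character n-gram features feed a random forest classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

metadata = ["AHU1.SUPPLY.TEMP", "AHU1.RETURN.TEMP", "RM101.ZONE.TEMP",
            "AHU2.SUPPLY.FLOW", "RM102.ZONE.CO2"]
labels = ["supply_temp", "return_temp", "zone_temp", "supply_flow", "zone_co2"]

clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(metadata, labels)
print(clf.predict(["RM103.ZONE.TEMP"]))  # expected: ['zone_temp']
```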
Article
We present Polonium, a novel Symantec technology that detects malware through large-scale graph inference. Based on the scalable Belief Propagation algorithm, Polonium infers every file's reputation, flagging files with low reputation as malware. We evaluated Polonium with a billion-node graph constructed from the largest file submissions dataset ever published (60 terabytes). Polonium attained a high true positive rate of 87% in detecting malware; in the field, Polonium lifted the detection rate of existing methods by 10 absolute percentage points. We detail Polonium's design and implementation features instrumental to its success. Polonium has served 120 million people and helped answer more than one trillion queries for file reputation.
Conference Paper
Online social networks have become ubiquitous to today's society and the study of data from these networks has improved our understanding of the processes by which relationships form. Research in statistical relational learning focuses on methods to exploit correlations among the attributes of linked nodes to predict user characteristics with greater accuracy. Concurrently, research on generative graph models has primarily focused on modeling network structure without attributes, producing several models that are able to replicate structural characteristics of networks such as power law degree distributions or community structure. However, there has been little work on how to generate networks with real-world structural properties and correlated attributes. In this work, we present the Attributed Graph Model (AGM) framework to jointly model network structure and vertex attributes. Our framework learns the attribute correlations in the observed network and exploits a generative graph model, such as the Kronecker Product Graph Model (KPGM) and Chung Lu Graph Model (CL), to compute structural edge probabilities. AGM then combines the attribute correlations with the structural probabilities to sample networks conditioned on attribute values, while keeping the expected edge probabilities and degrees of the input graph model. We outline an efficient method for estimating the parameters of AGM, as well as a sampling method based on Accept-Reject sampling to generate edges with correlated attributes. We demonstrate the efficiency and accuracy of our AGM framework on two large real-world networks, showing that AGM scales to networks with hundreds of thousands of vertices, as well as having high attribute correlation.
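The edge-sampling step AGM builds on can be illustrated with a small accept-reject sketch, not the AGM implementation: propose an edge from a structural model (here simple degree-weighted, Chung-Lu-style probabilities) and accept it with a probability reflecting attribute correlation. All numbers are invented.

```python
# Toy accept-reject sketch of attribute-correlated edge sampling (not the
# AGM code): propose endpoints proportional to degree weights, then accept
# with a probability that favors same-attribute edges.
import numpy as np

rng = np.random.default_rng(0)
w = np.array([5.0, 4.0, 3.0, 2.0, 1.0])   # expected degrees
attr = np.array([0, 0, 1, 1, 1])          # binary vertex attribute
p = w / w.sum()
accept_same, accept_diff = 1.0, 0.4       # invented attribute weights

edges = set()
while len(edges) < 6:
    i, j = rng.choice(len(w), size=2, p=p)
    if i == j:
        continue
    a = accept_same if attr[i] == attr[j] else accept_diff
    if rng.random() < a:                  # accept-reject step
        edges.add((min(i, j), max(i, j)))
print(sorted(edges))
```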
Book
The problem of outliers is one of the oldest in statistics, and during the last century and a half interest in it has waxed and waned several times. Currently it is once again an active research area after some years of relative neglect, and recent work has solved a number of old problems in outlier theory, and identified new ones. The major results are, however, scattered amongst many journal articles, and for some time there has been a clear need to bring them together in one place. That was the original intention of this monograph: but during execution it became clear that the existing theory of outliers was deficient in several areas, and so the monograph also contains a number of new results and conjectures. In view of the enormous volume of literature on the outlier problem and its cousins, no attempt has been made to make the coverage exhaustive. The material is concerned almost entirely with the use of outlier tests that are known (or may reasonably be expected) to be optimal in some way. Such topics as robust estimation are largely ignored, being covered more adequately in other sources. The numerous ad hoc statistics proposed in the early work on the grounds of intuitive appeal or computational simplicity also are not discussed in any detail.
Conference Paper
In sequence modeling, we often wish to represent complex interaction between labels, such as when performing multiple, cascaded labeling tasks on the same sequence, or when long-range dependencies exist. We present dynamic conditional random fields (DCRFs), a generalization of linear-chain conditional random fields (CRFs) in which each time slice contains a set of state variables and edges---a distributed state representation as in dynamic Bayesian networks (DBNs)---and parameters are tied across slices. Since exact inference can be intractable in such models, we perform approximate inference using several schedules for belief propagation, including tree-based reparameterization (TRP). On a natural-language chunking task, we show that a DCRF performs better than a series of linear-chain CRFs, achieving comparable performance using only half the training data.
Conference Paper
Large-scale graph-structured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graph-parallel abstractions including Pregel and GraphLab. However, the natural graphs commonly found in the real-world have highly skewed power-law degree distributions, which challenge the assumptions made by these abstractions, limiting performance and scalability. In this paper, we characterize the challenges of computation on natural graphs in the context of existing graph-parallel abstractions. We then introduce the PowerGraph abstraction which exploits the internal structure of graph programs to address these challenges. Leveraging the PowerGraph abstraction we introduce a new approach to distributed graph placement and representation that exploits the structure of power-law graphs. We provide a detailed analysis and experimental evaluation comparing PowerGraph to two popular graph-parallel systems. Finally, we describe three different implementation strategies for PowerGraph and discuss their relative merits with empirical evaluations on large-scale real-world problems demonstrating order of magnitude gains.
Conference Paper
We explore the feasibility of using commercial aircraft as sensors for observing weather phenomena at a continental scale. We focus specifically on the problem of wind forecasting and explore the use of machine learning and inference methods to harness air and ground speeds reported by aircraft at different locations and altitudes. We validate the learned predictive model with a field study in which we release an instrumented high-altitude balloon and compare the predicted trajectory with the sensed winds. The experiments show the promise of using airplanes in flight as a large-scale sensor network. Beyond making predictions, we explore the guidance of sensing with value-of-information analyses, where we consider the uncertainties and needs of sets of routes and maximize information value in light of the costs of acquiring data from airplanes. The methods can be used to select ideal subsets of planes to serve as sensors and also to evaluate the value of requesting shifts in the trajectories of flights for sensing.
Article
Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we provide a comprehensive exploration of both data mining and machine learning algorithms for these detection tasks. We give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the "why", of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.
Chapter
"Inference" problems arise in statistical physics, computer vision, error-correcting coding theory, and AI. We explain the principles behind the belief propagation (BP) algorithm, which is an efficient way to solve inference problems based on passing local messages. We develop a unified approach, with examples, notation, and graphical models borrowed from the relevant disciplines. We explain the close connection between the BP algorithm and the Bethe approximation of statistical physics. In particular, we show that BP can only converge to a fixed point that is also a stationary point of the Bethe approximation to the free energy. This result helps explain the successes of the BP algorithm, and enables connections to be made with variational approaches to approximate inference. The connection of BP with the Bethe approximation also suggests a way to construct new message-passing algorithms based on improvements to Bethe's approximation introduced by Kikuchi and others. The new generalized belief propagation (GBP) algorithms are significantly more accurate than ordinary BP for some problems. We illustrate how to construct GBP algorithms with a detailed example.
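For concreteness, the local message-passing rule this abstract refers to takes the following standard form on a pairwise model (our notation: phi node potentials, psi edge potentials, N(i) the neighbors of node i):

```latex
% Standard BP message and belief updates on a pairwise graphical model.
m_{i \to j}(x_j) \propto \sum_{x_i} \phi_i(x_i)\, \psi_{ij}(x_i, x_j)
    \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i),
\qquad
b_i(x_i) \propto \phi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i).
```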
Article
We give a Markov chain that converges to its stationary distribution very slowly. It has the form of a Gibbs sampler running on a posterior distribution of a parameter θ given data X. Consequences for Gibbs sampling are discussed.
Article
Network Intrusion Detection Systems (NIDSs) have become an important component in network security infrastructure. Currently, many NIDSs are rule-based systems whose performances highly depend on their rule sets. Unfortunately, due to the huge volume of network traffic, coding the rules by security experts becomes difficult and time-consuming. Since data mining techniques can build intrusion detection models adaptively, data mining-based NIDSs have significant advantages over rule-based NIDSs. Therefore, we apply one of the efficient data mining algorithms called random forests for network intrusion detection. The NIDS can be employed to detect intrusions online. In this paper, we discuss the approaches for handling imbalanced intrusions, selecting features, and optimizing the parameters of random forests. We also report our experimental results over the KDD'99 datasets. The results show that the proposed approach provides better performance compared to the best results from the KDD'99 contest.
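A minimal sketch of this setup, not the paper's exact configuration: a random forest with class_weight="balanced" as one simple way to handle the imbalanced intrusion classes the paper discusses, with synthetic data standing in for KDD'99.

```python
# Minimal random-forest intrusion-detection sketch (not the paper's setup):
# synthetic imbalanced data stands in for KDD'99 connection records.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```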
Conference Paper
In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.
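The outlier definition itself is easy to operationalize; below is a minimal sketch (not the paper's partition-based algorithm, which exists precisely to avoid this brute-force neighbor search).

```python
# Minimal k-th-nearest-neighbor distance outlier scoring: score each point
# by the distance to its k-th NN and report the top n as outliers.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outliers(X, k=5, n=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: self is the 0-th NN
    dists, _ = nn.kneighbors(X)
    scores = dists[:, k]                   # distance to the k-th neighbor
    return np.argsort(scores)[-n:][::-1]   # indices of the top-n outliers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.uniform(-6, 6, (5, 2))])
print(knn_outliers(X))
```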
Conference Paper
How do we find patterns and anomalies on graphs with billions of nodes and edges, which do not fit in memory? How can we use parallelism for such terabyte-scale graphs? In this work, we focus on inference, which often corresponds, intuitively, to "guilt by association" scenarios. For example, if a person is a drug abuser, their friends probably are, too; if a node in a social network is male, his dates are probably female. We show how to do inference on such huge graphs through our proposed HADOOP LINE GRAPH FIXED POINT (HA-LFP), an efficient parallel algorithm for sparse billion-scale graphs, using the HADOOP platform. Our contributions include (a) the design of HA-LFP, observing that it corresponds to a fixed point on a line graph induced from the original graph; (b) scalability analysis, showing that our algorithm scales up well with the number of edges, as well as with the number of machines; and (c) experimental results on two private graphs, as well as two of the largest publicly available graphs: the Web Graph from Yahoo! (6.6 billion edges, 0.24 terabytes) and the Twitter graph (3.7 billion edges, 0.13 terabytes). We evaluated our algorithm using M45, one of the top 50 fastest supercomputers in the world, and we report patterns and anomalies discovered by our algorithm that would be invisible otherwise.
Conference Paper
Event detection is a critical task in sensor networks for a variety of real-world applications. Many real-world events often exhibit complex spatio-temporal patterns whereby they manifest themselves via observations over time and space proximities. These spatio-temporal events cannot be handled well by many of the previous approaches. In this paper, we propose a new Spatio-Temporal Event Detection (STED) algorithm in sensor networks based on a dynamic conditional random field (DCRF) model. Our STED method handles the uncertainty of sensor data explicitly and permits neighborhood interactions in both observations and event labels. Experiments on both real data and synthetic data demonstrate that our STED method can provide accurate event detection in near real time even for large-scale sensor networks.
Article
Designing and implementing efficient, provably correct parallel machine learning (ML) algorithms is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance. We demonstrate the expressiveness of the GraphLab framework by designing and implementing parallel versions of belief propagation, Gibbs sampling, Co-EM, Lasso and Compressed Sensing. We show that using GraphLab we can achieve excellent parallel performance on large scale real-world problems.
Conference Paper
The occurrence of outliers in industrial data is often the rule rather than the exception. Many standard outlier detection methods fail to detect outliers in industrial data because of the high dimensionality of the data. Outlier detection in the case of chemical plant data can be particularly difficult since these data sets are often rank deficient. These problems can be solved by using robust model-based methods that do not require the data to be of full rank. We explore the use of a robust model-based outlier detection approach that makes use of the characteristics of the support vectors obtained by the support vector machine method.
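The chapter's exact method is not reproduced here; as a related, commonly used model-based detector, a one-class SVM illustrates how support vectors can delimit the region of normal operation without requiring full-rank data. All data below are synthetic.

```python
# One-class SVM as a related model-based outlier detector (not the authors'
# exact method): fit on normal operation, then flag deviating samples.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (300, 10))    # normal operating data
faulty = rng.normal(5, 1, (10, 10))     # shifted, fault-like data
det = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(normal)
print((det.predict(faulty) == -1).mean())  # fraction of faults flagged
```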
Outlier detection techniques
  • H.-P. Kriegel
  • P. Kröger
  • A. Zimek
Efficient generation of networks with given expected degrees
  • J. C. Miller
  • A. Hagberg
Introduction to statistical relational learning
  • L. Getoor
Spatio-temporal event detection using dynamic conditional random fields
  • J. Yin
  • D. H. Hu
  • Q. Yang