Article
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

The biggest shortcoming of efficient Bayesian learning is the attribute independence assumption. Current state-of-the-art learners weaken this to capture pairwise correlations with quadratic efficiency. This paper details a recent advance in mining correlated data through the integration of techniques from the fields of outlier detection, clustering and supervised learning. The algorithm described here is able to learn and exploit correlations between a potentially unbounded number of attributes in one pass over the data, with strictly bounded memory and processing costs.


... To discover arbitrary clusters in data stream and to handle outliers, Cao et al [6] have proposed a density-based clustering algorithm, called DenStream. Finally, unsupervised naive-Bayesian network algorithms which are designed to discover clusters in distributed data streams [17,5] may be relevant for this work. In fact, a network is here employed to model spatial arrangement of data. ...
Article
Full-text available
Many emerging applications are characterized by real-time stream data acquisition through sensors which have geographical locations and/or spatial extents. Streaming prevents storing all data from the stream and performing multiple scans of the entire data set, as is normally done in traditional applications. The drift of data distribution poses additional challenges to spatio-temporal data mining techniques. We address these challenges for a class of spatio-temporal patterns, called trend-clusters, which combine the semantics of both clusters and trends in spatio-temporal environments. We propose an algorithm to interleave spatial clustering and trend discovery in order to continuously cluster geo-referenced data which vary according to a similar trajectory (trend) in the recent past (window time). An experimental study demonstrates the effectiveness of our algorithm.
... It is noteworthy that no technique reported above takes into account the possibly distributed arrangement of readings. The exceptions are the naive-Bayesian network techniques [17,5] which permit to discover clusters from distributed data streams. Although these works are close to our research, the kind of pattern we discover, clusters with a trend polyline to describe how readings vary in time, requires a time-series processing of the readings which is not performed by these naive Bayesian network techniques. ...
Conference Paper
Full-text available
Emerging real-life applications, such as environmental compliance, ecological studies and meteorology, are characterized by real-time data acquisition through remote sensor networks. The most important aspect of the sensor readings is that they comprise a space dimension and a time dimension which are both information bearing. Additionally, they usually arrive at a rapid rate in a continuous, unbounded stream. Streaming prevents us from storing all readings and performing multiple scans of the entire data set. The drift of data distribution poses the additional problem of mining patterns which may change over time. We address these challenges for trend cluster discovery, that is, the discovery of clusters of spatially close sensors which transmit readings whose temporal variation, called the trend polyline, is similar along the time horizon of a window. We present a stream framework which segments the stream into equally-sized windows, computes online intra-window trend clusters and stores these trend clusters in a database. Trend clusters are queried offline at any time, to determine trend clusters along larger windows (i.e. windows of windows). Experiments with several streams demonstrate the effectiveness of the proposed framework in discovering trend clusters that are accurate and relevant to humans.
Article
We present a novel statistical analysis of legislative rhetoric in the US Senate that sheds light on hidden patterns in the behaviour of Senators as a function of their time in office. Using natural language processing, we create a novel comprehensive data set based on the speeches of all Senators who served on the US Senate Committee on Energy and Natural Resources in 2001–2011. We develop a new measure of congressional speech, based on Senators’ attitudes towards the dominant energy interests. To evaluate the intrinsically dynamic formation of groups among Senators, we adopt a model‐free unsupervised space–time data mining algorithm that has been proposed in the context of tracking dynamic clusters in environmental georeferenced data streams. Our approach, based on a two‐stage hybrid supervised–unsupervised learning methodology, is innovative and data driven and transcends conventional disciplinary borders. We discover that legislators become much more alike after the first few years of their term, regardless of their partisanship and campaign promises.
Article
Stream mining is the process of mining a continuous, ordered sequence of data items in real-time. Naïve Bayes (NB) classification is one of the popular classification methods for stream mining because it is an incremental classification method whose model can be easily updated as new data arrives. It has been observed in the literature that the performance of the NB classifier improves when irrelevant features are eliminated from the modeling process. This paper reports studies that were conducted to identify efficient computational methods for selecting relevant features for NB classification based on the sliding window method of stream mining. The paper also provides experimental results which demonstrate that continuous feature selection for NB stream mining provides high levels of predictive performance.
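The incremental property mentioned above can be illustrated with a minimal sketch: a categorical naive Bayes model stored purely as counts, so that each arriving example only increments a few counters. The class name, the `alpha` smoothing parameter and the toy data are assumptions made for illustration; the sliding-window feature selection studied in the paper is not reproduced here.

```python
from collections import defaultdict
import math


class IncrementalNaiveBayes:
    """Categorical naive Bayes whose counts are updated one example at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing (assumed value)
        self.class_counts = defaultdict(int)    # N(y)
        self.feature_counts = defaultdict(int)  # N(feature index, value, y)
        self.values = defaultdict(set)          # distinct values seen per feature
        self.n = 0                              # examples seen so far

    def update(self, x, y):
        """Absorb one labelled example; cost is linear in the number of features."""
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.feature_counts[(i, v, y)] += 1
            self.values[i].add(v)

    def predict(self, x):
        """Return the class with the highest smoothed log-posterior."""
        best, best_score = None, -math.inf
        for y, ny in self.class_counts.items():
            score = math.log(ny / self.n)
            for i, v in enumerate(x):
                k = max(len(self.values[i]), 1)
                count = self.feature_counts.get((i, v, y), 0)
                score += math.log((count + self.alpha) / (ny + self.alpha * k))
            if score > best_score:
                best, best_score = y, score
        return best


# Toy stream: the model is usable after every single update.
nb = IncrementalNaiveBayes()
stream = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
]
for features, label in stream:
    nb.update(features, label)
```

Because the model is only a table of counts, forgetting an old example (as a sliding window would require) amounts to decrementing the same counters, which is one reason count-based learners suit this setting.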
Article
Full-text available
We present a collective approach to mine Bayesian networks from distributed heterogeneous web-log data streams. In this approach we first learn a local Bayesian network at each site using the local data. Then each site identifies the observations that are most likely to be evidence of coupling between local and non-local variables and transmits a subset of these observations to a central site. Another Bayesian network is learnt at the central site using the data transmitted from the local site. The local and central Bayesian networks are combined to obtain a collective Bayesian network that models the entire data. This technique is then suitably adapted to an online Bayesian learning technique, where the network parameters are updated sequentially based on new data from multiple streams. We applied this technique to mine multiple data streams where data centralization is difficult because of large response time and scalability issues. This approach is particularly suitable for mining applications with distributed sources of data streams in an environment with non-zero communication cost (e.g. wireless networks). Experimental results and theoretical justification that demonstrate the feasibility of our approach are presented.
Article
Full-text available
Outlier detection is a fundamental issue in data mining, specifically in fraud detection, network intrusion detection, network monitoring, etc. SmartSifter is an outlier detection engine addressing this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SmartSifter and empirically demonstrates its effectiveness. SmartSifter detects outliers in an on-line process through the on-line unsupervised learning of a probabilistic model (using a finite mixture model) of the information source. Each time a datum is input SmartSifter employs an on-line discounting learning algorithm to learn the probabilistic model. A score is given to the datum based on the learned model with a high score indicating a high possibility of being a statistical outlier. The novel features of SmartSifter are: (1) it is adaptive to non-stationary sources of data; (2) a score has a clear statistical/information-theoretic meaning; (3) it is computationally inexpensive; and (4) it can handle both categorical and continuous variables. An experimental application to network intrusion detection shows that SmartSifter was able to identify data with high scores that corresponded to attacks, with low computational costs. Further experimental application has identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.
Article
Full-text available
The naive Bayesian classifier provides a simple and effective approach to classifier learning, but its attribute independence assumption is often violated in the real world. A number of approaches have sought to alleviate this problem. A Bayesian tree learning algorithm builds a decision tree, and generates a local naive Bayesian classifier at each leaf. The tests leading to a leaf can alleviate attribute inter-dependencies for the local naive Bayesian classifier. However, Bayesian tree learning still suffers from the small disjunct problem of tree learning. While inferred Bayesian trees demonstrate low average prediction error rates, there is reason to believe that error rates will be higher for those leaves with few training examples. This paper proposes the application of lazy learning techniques to Bayesian tree induction and presents the resulting lazy Bayesian rule learning algorithm, called LBR. This algorithm can be justified by a variant of Bayes theorem which supports a weaker conditional attribute independence assumption than is required by naive Bayes. For each test example, it builds a most appropriate rule with a local naive Bayesian classifier as its consequent. It is demonstrated that the computational requirements of LBR are reasonable in a wide cross-section of natural domains. Experiments with these domains show that, on average, this new algorithm obtains lower error rates significantly more often than the reverse in comparison to a naive Bayesian classifier, C4.5, a Bayesian tree learning algorithm, a constructive Bayesian classifier that eliminates attributes and constructs new attributes using Cartesian products of existing nominal attributes, and a lazy decision tree learning algorithm. It also outperforms, although the result is not statistically significant, a selective naive Bayesian classifier.
Conference Paper
Full-text available
In this paper we study the problem of constructing accurate decision tree models from data streams. Data streams are incremental tasks that require incremental, online, and any-time learning algorithms. One of the most successful algorithms for mining data streams is VFDT. In this paper we extend the VFDT system in two directions: the ability to deal with continuous data and the use of more powerful classification techniques at tree leaves. The proposed system, VFDTc, can incorporate and classify new information online, with a single scan of the data, in time constant per example. The most relevant property of our system is the ability to obtain a performance similar to a standard decision tree algorithm even for medium size datasets. This is relevant due to the any-time property. We study the behaviour of VFDTc in different problems and demonstrate its utility in large and medium data sets. Under a bias-variance analysis we observe that VFDTc in comparison to C4.5 is able to reduce the variance component.
Conference Paper
Full-text available
In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.
Conference Paper
Full-text available
In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such applications maintain basic aggregates such as running extrema values (MIN, MAX), averages, standard deviations, etc., that can be computed over data streams with limited space in a straightforward way. However, many applications require knowledge of more complex aggregates relating different attributes, so-called correlated aggregates. As an example, one might be interested in computing the percentage of international phone calls that are longer than the average duration of a domestic phone call. Exact computation of this aggregate requires multiple passes over the data stream, which is infeasible. We propose single-pass techniques for approximate computation of correlated aggregates over both landmark and sliding window views of a data stream of tuples, using a very limited amount of space. We consider both the case where the independent aggregate (average duration in the example above) is an extrema value and the case where it is an average value, with any standard aggregate as the dependent aggregate; these can be used as building blocks for more sophisticated aggregates. We present an extensive experimental study based on some real and a wide variety of synthetic data sets to demonstrate the accuracy of our techniques. We show that this effectiveness is explained by the fact that our techniques exploit monotonicity and convergence properties of aggregates over data streams.
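To make the phone-call example above concrete, the following sketch computes that correlated aggregate exactly over stored data. It is not the paper's single-pass approximation; it only shows why the exact answer needs two passes. The function name and the tuple layout are assumptions made for illustration.

```python
def pct_international_longer_than_domestic_avg(calls):
    """Exact correlated aggregate over stored data.

    calls: list of (is_international: bool, duration_seconds: float) tuples.
    """
    # Pass 1: the independent aggregate, the average domestic call duration.
    domestic = [d for intl, d in calls if not intl]
    if not domestic:
        raise ValueError("no domestic calls")
    avg_domestic = sum(domestic) / len(domestic)

    # Pass 2: the dependent aggregate, which compares every international
    # call against a value that is only known once pass 1 has finished.
    international = [d for intl, d in calls if intl]
    if not international:
        return 0.0
    longer = sum(1 for d in international if d > avg_domestic)
    return 100.0 * longer / len(international)


calls = [(False, 60), (False, 120), (True, 30), (True, 100), (True, 200), (True, 90)]
result = pct_international_longer_than_domestic_avg(calls)
```

In a stream the tuples seen during the first pass are gone by the time the average is known, which is the gap the approximate single-pass techniques are meant to close.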
Article
Full-text available
Of numerous proposals to improve the accuracy of naive Bayes by weakening its attribute independence assumption, both LBR and super-parent TAN have demonstrated remarkable error performance. However, both techniques obtain this outcome at a considerable computational cost. We present a new approach to weakening the attribute independence assumption by averaging all of a constrained class of classifiers. In extensive experiments this technique delivers comparable prediction accuracy to LBR and super-parent TAN with substantially improved computational efficiency at test time relative to the former and at training time relative to the latter. The new algorithm is shown to have low variance and is suited to incremental learning.
Article
Full-text available
In many applications such as IP network management, data arrives in streams and queries over those streams need to be processed online using limited storage. Correlated-sum (CS) aggregates are a natural class of queries formed by composing basic aggregates on (x, y) pairs and are of the form SUM{g(y) : x ≤ f(AGG(x))}, where AGG(x) can be any basic aggregate and f(), g() are user-specified functions. CS-aggregates cannot be computed exactly in one pass through a data stream using limited storage; hence, we study the problem of computing approximate CS-aggregates. We guarantee a priori error bounds when AGG(x) can be computed in limited space (e.g., MIN, MAX, AVG), using two variants of Greenwald and Khanna's summary structure for the approximate computation of quantiles. Using real data sets, we experimentally demonstrate that an adaptation of the quantile summary structure uses much less space, and is significantly faster, than a more direct use of the quantile summary structure, for the same a posteriori error bounds. Finally, we prove that, when AGG(x) is a quantile (which cannot be computed over a data stream in limited space), the error of a CS-aggregate can be arbitrarily large.
Article
Full-text available
We present a method for automatically clustering similar attribute values in a database system spanning multiple domains. The method constructs an attribute abstraction hierarchy for each attribute using rules that are derived from the database instance. The rules have a confidence and popularity that combine to express the "usefulness" of the rule. Attribute values are clustered if they are used as the premise for rules with the same consequence. By iteratively applying the algorithm, a hierarchy of clusters can be found. The algorithm can be improved by allowing domain expert supervision during the clustering process. An example as well as experimental results from a large transportation database are included. 1 Introduction In a conventional database system, queries are answered with absolute certainty. If a query has no exact answer, then the user's needs remain unsatisfied. A cooperative query answering (CQA) system behaves like a conventional system when the query can be answe...
Article
Full-text available
The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a...
Article
Full-text available
Naive Bayes is a simple, computationally efficient and remarkably accurate approach to classification learning. These properties have led to its wide deployment in many online applications. However, it is based on an assumption that all attributes are conditionally independent given the class. This assumption leads to decreased accuracy in some applications. AODE overcomes the attribute independence assumption of naive Bayes by averaging over all models in which all attributes depend upon the class and a single other attribute. The resulting classification learning algorithm for nominal data is computationally efficient and achieves very low error rates.
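The averaging described above is commonly written as follows. The notation is assumed for illustration and is not taken from the abstract: y is the class, x_1, ..., x_n are the attribute values of the instance to classify, and each term of the sum is one model in which every attribute depends on the class and on the single attribute x_i.

```latex
\hat{P}(y \mid x_1, \dots, x_n) \;\propto\; \sum_{i=1}^{n} \hat{P}(y, x_i) \prod_{\substack{j=1 \\ j \neq i}}^{n} \hat{P}(x_j \mid y, x_i)
```

Naive Bayes corresponds to the special case with no second parent, \hat{P}(y) \prod_{j} \hat{P}(x_j \mid y). Practical implementations typically restrict the sum to values x_i that occur often enough in the training data for the conditional estimates to be reliable; that filter is omitted here for brevity.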
Article
Full-text available
Recently, mining data streams with concept drifts for actionable insights has become an important and challenging task for a wide range of applications including credit card fraud protection, target marketing, network intrusion detection, etc. Conventional knowledge discovery tools are facing two challenges, the overwhelming volume of the streaming data, and the concept drifts. In this paper, we propose a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.
Conference Paper
Clustering is traditionally viewed as an unsupervised method for data analysis. However, in some cases information about the problem domain is available in addition to the data instances themselves. In this paper, we demonstrate how the popular k-means clustering algorithm can be profitably modified to make use of this information. In experiments with artificial constraints on six data sets, we observe improvements in clustering accuracy. We also apply this method to the real-world problem of automatically detecting road lanes from GPS data and observe dramatic increases in performance.
Conference Paper
Reverse Nearest Neighbor (RNN) queries have been studied for finite, stored data sets and are of interest for decision support. However, in many applications such as fixed wireless telephony access and sensor-based highway traffic monitoring, the data arrives in a stream and cannot be stored. Exploratory analysis on this data stream can be formalized naturally using the notion of RNN aggregates (RNNAs), which involve the computation of some aggregate (such as COUNT or MAX DISTANCE) over the set of reverse nearest neighbor "clients" associated with each "server". In this paper, we introduce and investigate the problem of computing three types of RNNA queries over data streams of "client" locations: (i) Max-RNNA: given K servers, return the maximum RNNA over all clients to their closest servers; (ii) List-RNNA: given K servers, return a list of RNNAs over all clients to each of the K servers; and (iii) Opt-RNNA: find a subset of at most K servers for which their RNNAs are below a given threshold. While exact computation of these queries is not possible in the data stream model, we present efficient algorithms to approximately answer these RNNA queries over data streams with error guarantees. We provide analytical proofs of constant factor approximations for many RNNA queries, and complement our analyses with experimental evidence of the accuracy of our techniques.
Conference Paper
This chapter discusses a framework for clustering evolving data streams. The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a few one-pass clustering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evolution of the data and do not address the following issues: (1) the quality of the clusters is poor when the data evolves considerably over time; (2) a data stream clustering algorithm requires much greater functionality in discovering and exploring clusters over different portions of the stream. The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not very useful from an application point of view. The chapter discusses a fundamentally different philosophy for data stream clustering which is guided by application-centered requirements. It divides the clustering process into an online component, which periodically stores detailed summary statistics, and an offline component, which uses only these summary statistics. The problems of efficient choice, storage, and use of this statistical data for a fast data stream turn out to be quite tricky. The concepts of a pyramidal time frame in conjunction with a micro-clustering approach are used.
Article
This paper presents a Bayesian method for constructing probabilistic networks from databases. In particular, we focus on constructing Bayesian belief networks. Potential applications include computer-assisted hypothesis testing, automated scientific discovery, and automated construction of probabilistic expert systems. We extend the basic method to handle missing data and hidden (latent) variables. We show how to perform probabilistic inference by averaging over the inferences of multiple belief networks. Results are presented of a preliminary evaluation of an algorithm for constructing a belief network from a database of cases. Finally, we relate the methods in this paper to previous work, and we discuss open problems.
Conference Paper
Algorithms for tracking concept drift are important for many applications. We present a general method based on the weighted majority algorithm for using any online learner for concept drift. Dynamic weighted majority (DWM) maintains an ensemble of base learners, predicts using a weighted-majority vote of these "experts", and dynamically creates and deletes experts in response to changes in performance. We empirically evaluated two experimental systems based on the method using incremental naive Bayes and incremental tree inducer [ITI] as experts. For the sake of comparison, we also included Blum's implementation of weighted majority. On the STAGGER concepts and on the SEA concepts, results suggest that the ensemble method learns drifting concepts almost as well as the base algorithms learn each concept individually. Indeed, we report the best overall results for these problems to date.
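The expert-management loop described above can be sketched roughly as follows. This is one simplified reading of the method, not the authors' implementation: the parameter names `beta` (weight decay factor), `theta` (removal threshold) and `period` (how often weights and the expert pool are revised), as well as the `MajorityLearner` base learner, are assumptions made for illustration. The paper itself uses incremental naive Bayes and ITI as experts.

```python
from collections import Counter


class MajorityLearner:
    """Hypothetical stand-in base learner: predicts the most frequent label seen."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


class DynamicWeightedMajority:
    def __init__(self, make_learner, beta=0.5, theta=0.01, period=1):
        self.make_learner = make_learner
        self.beta, self.theta, self.period = beta, theta, period
        self.experts = [[make_learner(), 1.0]]  # [learner, weight] pairs
        self.t = 0

    def predict(self, x):
        """Weighted-majority vote over the current experts."""
        votes = Counter()
        for learner, weight in self.experts:
            votes[learner.predict(x)] += weight
        return votes.most_common(1)[0][0]

    def learn(self, x, y):
        self.t += 1
        at_period = self.t % self.period == 0
        votes = Counter()
        for expert in self.experts:
            guess = expert[0].predict(x)
            if guess != y and at_period:
                expert[1] *= self.beta          # penalise a wrong expert
            votes[guess] += expert[1]
        ensemble_guess = votes.most_common(1)[0][0]
        if at_period:
            top = max(weight for _, weight in self.experts)
            for expert in self.experts:
                expert[1] /= top                # rescale so the best weight is 1
            # Drop experts whose weight has decayed below the threshold.
            self.experts = [e for e in self.experts if e[1] >= self.theta]
            if ensemble_guess != y:
                # The ensemble as a whole was wrong: add a fresh expert.
                self.experts.append([self.make_learner(), 1.0])
        for learner, _ in self.experts:
            learner.learn(x, y)                 # every expert trains on the example


# Abrupt drift: 50 examples of concept "a", then 30 of concept "b".
labels = ["a"] * 50 + ["b"] * 30
dwm = DynamicWeightedMajority(MajorityLearner)
plain = MajorityLearner()
for label in labels:
    dwm.learn(None, label)
    plain.learn(None, label)
```

In this toy run the single majority learner still predicts the old concept after the drift, while the ensemble has discarded its stale experts and follows the new one.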
Conference Paper
This paper presents a novel Fourier analysis-based technique to aggregate, communicate and visualize decision trees in a mobile environment. A Fourier representation of a decision tree has several properties that are particularly useful for mining continuous data streams from small mobile computing devices. This paper presents algorithms to compute the Fourier spectrum of a decision tree and vice versa. It offers a framework to aggregate decision trees in their Fourier representations. It also describes a touchpad/ticker-based approach to visualize decision trees using their Fourier spectrum and an implementation for PDAs.
Conference Paper
We study clustering under the data stream model of computation where, given a sequence of points, the objective is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. The data stream model is relevant to new classes of applications involving massive data sets, such as Web click stream analysis and multimedia data analysis. We give constant-factor approximation algorithms for the k-median problem in the data stream model of computation in a single pass. We also show negative results implying that our algorithms cannot be improved in a certain sense.
Article
Decision tree construction is a well studied problem in data mining. Recently, there has been much interest in mining streaming data. Domingos and Hulten have presented a one-pass algorithm for decision tree construction. Their work uses Hoeffding inequality to achieve a probabilistic bound on the accuracy of the tree constructed. In this paper, we revisit this problem. We make the following two contributions: 1) We present a numerical interval pruning (NIP) approach for efficiently processing numerical attributes. Our results show an average of 39% reduction in execution times. 2) We exploit the properties of the gain function entropy (and gini) to reduce the sample size required for obtaining a given bound on the accuracy. Our experimental results show a 37% reduction in the number of data instances required.
Article
Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous data streams brings unique opportunities, but also new challenges. This paper describes and evaluates VFDT, an anytime system that builds decision trees using constant memory and constant time per example. VFDT can incorporate tens of thousands of examples per second using off-the-shelf hardware. It uses Hoeffding bounds to guarantee that its output is asymptotically nearly identical to that of a conventional learner. We study VFDT's properties and demonstrate its utility through an extensive set of experiments on synthetic data. We apply VFDT to mining the continuous stream of Web access data from the whole University of Washington main campus.
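The bound referred to above is the standard Hoeffding inequality: after n independent observations of a variable with range R, the true mean is at least the observed mean minus epsilon = sqrt(R^2 ln(1/delta) / (2n)) with probability 1 - delta. The sketch below shows how such a bound can drive a split decision. The function names and the example numbers are assumptions made for illustration, and the tie-breaking and other refinements of the actual system are omitted.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean is at
    least the mean of n observations minus epsilon."""
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0 and 0 < delta < 1")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain, second_gain, value_range, delta, n):
    """Split once the observed gap between the two best attributes exceeds
    the bound, so the leader is very likely the true best attribute."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# The bound shrinks as more examples reach the leaf.
eps_1k = hoeffding_bound(1.0, 1e-7, 1000)
eps_10k = hoeffding_bound(1.0, 1e-7, 10000)
```

Because epsilon depends only on n, R and delta, a leaf needs to keep sufficient statistics rather than the examples themselves, which is consistent with the constant-memory claim above.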
Article
Theory refinement is the task of updating a domain theory in the light of new cases, to be done automatically or with some expert assistance. The problem of theory refinement under uncertainty is reviewed here in the context of Bayesian statistics, a theory of belief revision. The problem is reduced to an incremental learning task as follows: the learning system is initially primed with a partial theory supplied by a domain expert, and thereafter maintains its own internal representation of alternative theories which is able to be interrogated by the domain expert and able to be incrementally refined from data. Algorithms for refinement of Bayesian networks are presented to illustrate what is meant by "partial theory", "alternative theory representation", etc. The algorithms are an incremental variant of batch learning algorithms from the literature so can work well in batch and incremental mode. 1 Introduction Theory refinement is the task of updating a domain theory in the light of...
Article
Algorithms for tracking concept drift are important for many applications. We present a general method based on the Weighted Majority algorithm for using any on-line learner for concept drift. Dynamic Weighted Majority (DWM) maintains an ensemble of base learners, predicts using a weighted-majority vote of these "experts", and dynamically creates and deletes experts in response to changes in performance. We empirically evaluated two experimental systems based on the method using incremental naive Bayes and Incremental Tree Inducer (ITI) as experts.
Article
Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.
J. Quinlan. C4.5 – programs for machine learning. The Morgan Kaufmann series in machine learning, 1993.