Conference Paper

ν-Anomica: A Fast Support Vector Based Novelty Detection Technique.

DOI: 10.1109/ICDM.2009.42 Conference: ICDM 2009, The Ninth IEEE International Conference on Data Mining, Miami, Florida, USA, 6-9 December 2009
Source: DBLP

ABSTRACT In this paper we propose ν-Anomica, a novel anomaly detection technique that can be trained on huge data sets with much reduced running time compared to the benchmark one-class Support Vector Machines algorithm. In ν-Anomica, the idea is to train the machine such that it can provide a close approximation to the exact decision plane using fewer training points and without losing much of the generalization performance of the classical approach. We have tested the proposed algorithm on a variety of continuous data sets under different conditions. We show that under all test conditions the developed procedure closely preserves the accuracy of standard one-class Support Vector Machines while reducing both the training time and the test time by 5 to 20 times.
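
For context, the abstract does not state the formulation of the benchmark it compares against. A reasonable assumption is that it means the standard ν-parameterized one-class SVM, whose textbook primal problem is sketched below. The symbols n (number of training points), Φ (kernel feature map), ξ_i (slack variables), ρ (offset) and α_i (dual coefficients) are not taken from the abstract and are introduced here only for illustration.

```latex
% Standard one-class SVM primal (assumed benchmark; not quoted from the paper)
\[
\begin{aligned}
\min_{w,\,\xi,\,\rho}\quad
  & \frac{1}{2}\lVert w\rVert^{2}
    + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho \\
\text{subject to}\quad
  & \langle w, \Phi(x_i)\rangle \ge \rho - \xi_i,
    \qquad \xi_i \ge 0, \qquad i = 1,\dots,n .
\end{aligned}
\]

% Resulting decision function, written in its kernel expansion
\[
f(x) = \operatorname{sgn}\Bigl(\sum_{i=1}^{n} \alpha_i\, k(x_i, x) - \rho\Bigr),
\qquad k(x_i, x) = \langle \Phi(x_i), \Phi(x)\rangle .
\]
```

In this standard form, ν in (0, 1] is an upper bound on the fraction of training points treated as outliers and a lower bound on the fraction of support vectors. Only points with nonzero α_i contribute to f(x), so evaluation cost grows with the number of support vectors. That is one plausible reading of why a method that uses fewer training points could reduce test time as well as training time.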

    ABSTRACT: There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of data sets need to be analyzed to find outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only because of the massive volume of data but also because these data sets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data set without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available data sets: (i) the NASA MODIS satellite images and (ii) a simulated aviation data set generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS). © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 393–406, 2011
    Statistical Analysis and Data Mining 01/2011; 4:393-406.
