iSAX: disk-aware mining and indexing of massive time series datasets

Data Mining and Knowledge Discovery (Impact Factor: 1.74). 08/2009; 19(1):24-57. DOI: 10.1007/s10618-009-0125-6

ABSTRACT Current research in indexing and mining time series data has produced many interesting algorithms and representations. However,
the algorithms and the size of data considered have generally not been representative of the increasingly massive datasets
encountered in science, engineering, and business domains. In this work, we introduce a novel multi-resolution symbolic representation
which can be used to index datasets which are several orders of magnitude larger than anything else considered in the literature.
To demonstrate the utility of this representation, we constructed a simple tree-based index structure which facilitates fast
exact search and orders-of-magnitude faster approximate search. For example, with a database of one hundred million time
series, the approximate search can retrieve high quality nearest neighbors in slightly over a second, whereas a sequential
scan would take tens of minutes. Our experimental evaluation demonstrates that our representation allows index performance
to scale well with increasing dataset sizes. Additionally, we provide analysis concerning parameter sensitivity, approximate
search effectiveness, and lower bound comparisons between time series representations in a bit-constrained environment. We
further show how to exploit the combination of both exact and approximate search as subroutines in data mining algorithms,
allowing for the exact mining of truly massive real-world datasets, containing tens of millions of time series.
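
The abstract does not spell out how the multi-resolution symbolic representation works, so the following is only a minimal sketch under stated assumptions: that it builds on SAX-style discretization (z-normalize, average over equal-width segments, then quantize against equiprobable Gaussian breakpoints) and that cardinalities are powers of two. Under those assumptions the breakpoints at a coarse cardinality are a subset of those at a finer one, so a symbol can be moved to a coarser resolution by dropping its low-order bits. All function names here (`breakpoints`, `znorm`, `paa`, `sax_word`, `demote`) are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def breakpoints(cardinality):
    """Return the cardinality - 1 cut points that split N(0, 1) into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(i / cardinality) for i in range(1, cardinality)]


def znorm(series):
    """Z-normalize a series; a constant series maps to all zeros."""
    mu, sigma = fmean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-width segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of the segment count")
    width = len(series) // segments
    return [fmean(series[i * width:(i + 1) * width]) for i in range(segments)]


def sax_word(series, segments, bits):
    """Quantize each segment mean to an integer symbol in [0, 2**bits)."""
    cuts = breakpoints(2 ** bits)
    return [bisect_right(cuts, v) for v in paa(znorm(series), segments)]


def demote(word, from_bits, to_bits):
    """Lower the resolution of every symbol by dropping its low-order bits."""
    shift = from_bits - to_bits
    if shift < 0:
        raise ValueError("demote can only reduce resolution")
    return [symbol >> shift for symbol in word]


if __name__ == "__main__":
    series = [0, 1, 2, 3, 4, 5, 6, 7]
    fine = sax_word(series, segments=4, bits=3)
    coarse = sax_word(series, segments=4, bits=2)
    # Demoting the 3-bit word matches computing the 2-bit word directly.
    print(fine, coarse, demote(fine, 3, 2) == coarse)
```

The bit-dropping step is what would let one word be compared at several resolutions without recomputing it from the raw series; whether the index in the paper uses exactly this encoding is an assumption of this sketch.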

  • ABSTRACT: Time series data appear in a broad variety of economic, medical, and scientific applications. Because of their high dimensionality, time series data are managed by using representation methods. Symbolic representation has attracted particular attention because of the possibility it offers to benefit from algorithms and techniques of other fields in computer science. The symbolic aggregate approximation method (SAX) is one of the most important symbolic representation techniques of time series data. SAX is based on the assumption of "high Gaussianity" of normalized time series, which permits it to use breakpoints obtained from Gaussian lookup tables. The use of these breakpoints is the heart of SAX. In this paper we show that this assumption of Gaussianity oversimplifies the problem and can result in very large errors in time series mining tasks. We present an alternative scheme, based on genetic algorithms (GASAX), to find the breakpoints. The new scheme does not assume any particular distribution of the data, and it does not require normalizing the data either. We conduct experiments on different datasets and we show that the new scheme clearly outperforms the original scheme.
    Proceedings of the 14th international conference on Data Warehousing and Knowledge Discovery; 09/2012
  • ABSTRACT: A new framework for the analysis of large-scale, multi-rate power data is introduced. The system comprises high-rate power grid data acquisition devices, software modules for big data management and large-scale time series analysis. The power grid modeling and simulation modules make it possible to run power flow simulations. Visualization methods support data exploration for captured, simulated and analyzed energy data. A remote software control module for the proposed tools is provided.
    KIT Scientific Working Papers; 21; 08/2014; KIT, Karlsruhe; ISSN: 2194-1629
  • ABSTRACT: Modeling strategic objectives has been shown to be useful both for understanding a business and for planning and guiding the overall activities within an enterprise. Business strategy is modeled according to human expertise, setting up the goals as well as the indicators that monitor activities and goals. However, indicators usually provide high-level aggregated views of data, making it difficult to pinpoint problems within specific sub-areas until they have a significant impact on the aggregated value. By the time these problems become evident, they have already hindered the performance of the organization. Performing a detailed analysis manually, however, can be a daunting task due to the size of the data space. To solve this problem, we propose a user-driven method to analyze the data related to each business indicator by means of data mining. We illustrate our approach with a real-world example based on the Europe 2020 framework. Our approach allows us not only to identify latent problems, but also to highlight deviations from anticipated trends that may represent opportunities and exceptional situations, thereby enabling an organization to take advantage of them.
    Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research; 11/2013

