iSAX: disk-aware mining and indexing of massive time series datasets

Data Mining and Knowledge Discovery (Impact Factor: 2.88). 08/2009; 19(1):24-57. DOI: 10.1007/s10618-009-0125-6

ABSTRACT: Current research in indexing and mining time series data has produced many interesting algorithms and representations. However,
the algorithms and the size of data considered have generally not been representative of the increasingly massive datasets
encountered in science, engineering, and business domains. In this work, we introduce a novel multi-resolution symbolic representation
which can be used to index datasets which are several orders of magnitude larger than anything else considered in the literature.
To demonstrate the utility of this representation, we constructed a simple tree-based index structure which facilitates fast
exact search and orders of magnitude faster approximate search. For example, with a database of one hundred million time
series, the approximate search can retrieve high-quality nearest neighbors in slightly over a second, whereas a sequential
scan would take tens of minutes. Our experimental evaluation demonstrates that our representation allows index performance
to scale well with increasing dataset sizes. Additionally, we provide analysis concerning parameter sensitivity, approximate
search effectiveness, and lower bound comparisons between time series representations in a bit-constrained environment. We
further show how to exploit the combination of both exact and approximate search as sub-routines in data mining algorithms,
allowing for the exact mining of truly massive real-world datasets containing tens of millions of time series.

  • ABSTRACT: To understand the dependencies in the power system, we try to derive state information by combining high-rate voltage time series captures at different locations with data analysis at different scales. This may enable large-scale simulation and modeling of the grid. Data captured by our recently introduced Electrical Data Recorders (EDR) and power grid simulation data are stored in the Large Scale Data Facility (LSDF) at Karlsruhe Institute of Technology (KIT) and are growing rapidly in size. In this article, we compare classic sequential multithreaded time series data processing to distributed processing using Pig on a Hadoop cluster. Further, we present our ideas for a better organization of our raw data and metadata that is indexable, searchable, and suitable for big data.
    PDP 2013 - 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing; 02/2013
  • ABSTRACT: A new framework for the analysis of large-scale, multi-rate power data is introduced. The system comprises high-rate power grid data acquisition devices, software modules for big data management and large-scale time series analysis. The power grid modeling and simulation modules make it possible to run power flow simulations. Visualization methods support data exploration for captured, simulated, and analyzed energy data. A remote software control module for the proposed tools is provided.
    KIT Scientific Working Papers; 21 08/2014; KIT, Karlsruhe. ISSN: 2194-1629
  • ABSTRACT: Modeling strategic objectives has been shown to be useful both for understanding a business and for planning and guiding the overall activities within an enterprise. Business strategy is modeled according to human expertise, setting up the goals as well as the indicators that monitor activities and goals. However, indicators usually provide high-level aggregated views of data, making it difficult to pinpoint problems within specific sub-areas until they have a significant impact on the aggregated value. By the time these problems become evident, they have already hindered the performance of the organization. Performing a detailed analysis manually, however, can be a daunting task due to the size of the data space. To solve this problem, we propose a user-driven method to analyze the data related to each business indicator by means of data mining. We illustrate our approach with a real-world example based on the Europe 2020 framework. Our approach allows us not only to identify latent problems, but also to highlight deviations from anticipated trends that may represent opportunities and exceptional situations, thereby enabling an organization to take advantage of them.
    Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research; 11/2013

