iSAX: Disk-aware mining and indexing of massive time series datasets

Data Mining and Knowledge Discovery (Impact Factor: 1.99). 08/2009; 19(1):24-57. DOI: 10.1007/s10618-009-0125-6


Current research in indexing and mining time series data has produced many interesting algorithms and representations. However,
the algorithms and the size of data considered have generally not been representative of the increasingly massive datasets
encountered in science, engineering, and business domains. In this work, we introduce a novel multi-resolution symbolic representation
which can be used to index datasets which are several orders of magnitude larger than anything else considered in the literature.
To demonstrate the utility of this representation, we constructed a simple tree-based index structure which facilitates fast
exact search and approximate search that is orders of magnitude faster. For example, with a database of one hundred million time
series, the approximate search can retrieve high-quality nearest neighbors in slightly over a second, whereas a sequential
scan would take tens of minutes. Our experimental evaluation demonstrates that our representation allows index performance
to scale well with increasing dataset sizes. Additionally, we provide analysis concerning parameter sensitivity, approximate
search effectiveness, and lower bound comparisons between time series representations in a bit-constrained environment. We
further show how to exploit the combination of both exact and approximate search as sub-routines in data mining algorithms,
allowing for the exact mining of truly massive real-world datasets, containing tens of millions of time series.
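
The abstract does not spell out how the multi-resolution symbolic representation is built. As a rough illustration only, the sketch below follows the standard SAX recipe (z-normalize, average over equal-width segments, then discretize with equiprobable Gaussian breakpoints) and assumes cardinalities that are powers of two, so that a coarser symbol is a bit-prefix of a finer one. All function names are hypothetical, the series length is assumed to be a multiple of the segment count, and the tree index and search procedures are not modeled.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def znorm(series):
    """Rescale a series to zero mean and unit variance."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Piecewise aggregate approximation: mean of each equal-width segment.

    Assumes len(series) is a multiple of `segments` (a simplification).
    """
    size = len(series) // segments
    return [mean(series[i * size:(i + 1) * size]) for i in range(segments)]


def breakpoints(cardinality):
    """Cut points that split a standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(i / cardinality) for i in range(1, cardinality)]


def sax_word(series, segments, cardinality):
    """One integer symbol per segment, numbered from lowest to highest region."""
    cuts = breakpoints(cardinality)
    return [bisect_right(cuts, v) for v in paa(znorm(series), segments)]


def demote(symbol, from_bits, to_bits):
    """Drop low-order bits to express a symbol at a coarser cardinality."""
    return symbol >> (from_bits - to_bits)


series = [1, 2, 3, 4, 5, 6, 7, 8]
fine = sax_word(series, segments=4, cardinality=8)    # 3 bits per symbol
coarse = sax_word(series, segments=4, cardinality=4)  # 2 bits per symbol
print(fine, coarse)
print([demote(s, 3, 2) for s in fine] == coarse)
```

Because the breakpoints for cardinality 4 are a subset of those for cardinality 8, demoting the fine word reproduces the coarse word; this prefix property is what lets words of different resolutions be compared in a sketch like this one.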

  • Source
    • "Discretize the real-valued signal into letters, using the Symbolic Aggregate Approximation (SAX) algorithm (Shieh and Keogh, 2009). "
    ABSTRACT: BigWig files are a compressed, indexed, binary format for genome-wide signal data for calculations (e.g. GC percent) or experiments (e.g. ChIP-seq/RNA-seq read depth). bwtool is a tool designed to read bigWig files rapidly and efficiently, providing functionality for extracting data and summarizing it in several ways, globally or at specific regions. Additionally, the tool enables the conversion of the positions of signal data from one genome assembly to another, also known as ‘lifting’. We believe bwtool can be useful for the analyst frequently working with bigWig data, which is becoming a standard format to represent functional signals along genomes. The article includes supplementary examples of running the software. Availability and implementation: The C source code is freely available under the GNU public license v3 at, andypohl@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.
    Bioinformatics 01/2014; 30(11). DOI:10.1093/bioinformatics/btu056 · 4.98 Impact Factor
  • Source
    • "It uses HBase (a column based Hadoop Storage) and iSAX [8] to achieve both scale and index/search respectively. Lumberyard is based on iSAX and uses the jMotif Java library that provides symbolic aggregate approximation (SAX) [9] and iSAX for time series data enabling outlier detection and the search for often occurring patterns. "
    ABSTRACT: In order to understand the dependencies in the power system we try to derive state information by combining high-rate voltage time series captures at different locations together with data analysis at different scales. This may enable large-scale simulation and modeling of the grid. Data captured by our recently introduced Electrical Data Recorders (EDR) and power grid simulation data are stored in the large scale data facility (LSDF) at Karlsruhe Institute of Technology (KIT) and growing rapidly in size. In this article we compare classic sequential multithreaded time series data processing to a distributed processing using Pig on a Hadoop cluster. Further we present our ideas for a better organization for our raw- and metadata that is indexable, searchable and suitable for big data.
    PDP 2013 - 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing; 02/2013
  • Source
    • "Table 2 summarizes the comparison. It is worth to note that even though suffix trees are useful in a wide range of applications, there also exist specialized index structures for particular applications like genome data [11] and time series analysis [16]. "
    ABSTRACT: The suffix tree is a data structure for indexing strings. It is used in a variety of applications such as bioinformatics, time series analysis, clustering, text editing and data compression. However, when the string and the resulting suffix tree are too large to fit into the main memory, most existing construction algorithms become very inefficient. This paper presents a disk-based suffix tree construction method, called Elastic Range (ERa), which works efficiently with very long strings that are much larger than the available memory. ERa partitions the tree construction process horizontally and vertically and minimizes I/Os by dynamically adjusting the horizontal partitions independently for each vertical partition, based on the evolving shape of the tree and the available memory. Where appropriate, ERa also groups vertical partitions together to amortize the I/O cost. We developed a serial version; a parallel version for shared-memory and shared-disk multi-core systems; and a parallel version for shared-nothing architectures. ERa indexes the entire human genome in 19 minutes on an ordinary desktop computer. For comparison, the fastest existing method needs 15 minutes using 1024 CPUs on an IBM BlueGene supercomputer.