Article

iSAX: Disk-aware mining and indexing of massive time series datasets

Data Mining and Knowledge Discovery (Impact Factor: 1.99). 08/2009; 19(1):24-57. DOI: 10.1007/s10618-009-0125-6

ABSTRACT

Current research in indexing and mining time series data has produced many interesting algorithms and representations. However,
the algorithms and the size of data considered have generally not been representative of the increasingly massive datasets
encountered in science, engineering, and business domains. In this work, we introduce a novel multi-resolution symbolic representation
that can be used to index datasets several orders of magnitude larger than any previously considered in the literature.
To demonstrate the utility of this representation, we constructed a simple tree-based index structure that facilitates fast
exact search and approximate search that is orders of magnitude faster. For example, with a database of one hundred million time
series, the approximate search can retrieve high quality nearest neighbors in slightly over a second, whereas a sequential
scan would take tens of minutes. Our experimental evaluation demonstrates that our representation allows index performance
to scale well with increasing dataset sizes. Additionally, we provide analysis concerning parameter sensitivity, approximate
search effectiveness, and lower bound comparisons between time series representations in a bit-constrained environment. We
further show how to exploit the combination of both exact and approximate search as sub-routines in data mining algorithms,
allowing for the exact mining of truly massive real-world datasets containing tens of millions of time series.

Available from: escholarship.org
    • "This is a generator, where a random number is drawn from a Gaussian distribution N(0, 1), then at each time point a new number is drawn from this distribution and added to the value of the last number. This kind of data generation has been extensively used in the past [1,10,25,2,29,30,6], and has been shown to effectively model real-world financial data [10]. Each dataset is z-normalized before being indexed."
    ABSTRACT: Numerous applications continuously produce large amounts of data series, and in several time-critical scenarios analysts need to be able to query these data as soon as they become available, which is not currently possible with state-of-the-art indexing methods for very large data series collections. In this paper, we present the first adaptive indexing mechanism specifically tailored to solve the problem of indexing and querying very large data series collections. The main idea is that instead of building the complete index over the complete data set up-front and querying only later, we interactively and adaptively build parts of the index, only for the parts of the data on which the users pose queries. The net effect is that instead of waiting for extended periods of time for the index creation, users can immediately start exploring the data series. We present a detailed design and evaluation of adaptive data series indexing over both synthetic data and real-world workloads. The results show that our approach can gracefully handle large data series collections, while drastically reducing the data-to-query delay: by the time state-of-the-art indexing techniques finish indexing 1 billion data series (and before answering even a single query), adaptive data series indexing has already answered $3 \times 10^5$ queries.
    Full-text · Conference Paper · Jun 2014
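
    The random-walk generator quoted in the excerpt above can be sketched in a few lines of Python. This is only an illustration of the description given there, not the cited authors' code: the function names `random_walk` and `z_normalize`, the `seed` parameter, and the use of the population standard deviation are assumptions made for this example.

```python
import random
import statistics


def random_walk(length, seed=None):
    """Cumulative sum of N(0, 1) draws (hypothetical helper, not from the paper)."""
    rng = random.Random(seed)
    value = rng.gauss(0.0, 1.0)      # first point: a single Gaussian draw
    series = [value]
    for _ in range(length - 1):
        value += rng.gauss(0.0, 1.0)  # add a fresh draw to the previous value
        series.append(value)
    return series


def z_normalize(series):
    """Rescale to zero mean and unit variance (population std is an assumption)."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return [0.0 for _ in series]  # constant series: nothing to scale
    return [(x - mean) / std for x in series]


walk = z_normalize(random_walk(256, seed=42))
```

    After normalization the series has mean 0 and standard deviation 1 up to floating-point error, which is the form in which the excerpt says each dataset is indexed.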
    • "Discretize the real-valued signal into letters, using the Symbolic Aggregate Approximation (SAX) algorithm (Shieh and Keogh, 2009)."
    ABSTRACT: BigWig files are a compressed, indexed, binary format for genome-wide signal data for calculations (e.g. GC percent) or experiments (e.g. ChIP-seq/RNA-seq read depth). bwtool is a tool designed to read bigWig files rapidly and efficiently, providing functionality for extracting data and summarizing it in several ways, globally or at specific regions. Additionally, the tool enables the conversion of the positions of signal data from one genome assembly to another, also known as ‘lifting’. We believe bwtool can be useful for the analyst frequently working with bigWig data, which is becoming a standard format to represent functional signals along genomes. The article includes supplementary examples of running the software. Availability and implementation: The C source code is freely available under the GNU public license v3 at http://cromatina.crg.eu/bwtool. Contact: andrew.pohl@crg.eu, andypohl@gmail.com. Supplementary information: Supplementary data are available at Bioinformatics online.
    Full-text · Article · Jan 2014 · Bioinformatics
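
    The SAX discretization step mentioned in the excerpt above can be illustrated with a minimal sketch. It assumes the commonly described SAX pipeline (z-normalize, average over equal-width segments, then map each average to a letter using equiprobable breakpoints of the standard normal distribution). The function name `sax`, its parameters, and the restriction to series whose length divides evenly into segments are assumptions for this example; it does not reproduce bwtool's or the iSAX authors' implementation.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def sax(series, segments, alphabet_size):
    """Convert a numeric series to a SAX word (illustrative sketch only)."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of segments")
    if not 2 <= alphabet_size <= 26:
        raise ValueError("alphabet_size must be between 2 and 26")

    # Step 1: z-normalize (population std; a constant series maps to zeros).
    mean, std = fmean(series), pstdev(series)
    normalized = [(x - mean) / std if std else 0.0 for x in series]

    # Step 2: piecewise aggregate approximation, one mean per segment.
    width = len(normalized) // segments
    paa = [fmean(normalized[i * width:(i + 1) * width]) for i in range(segments)]

    # Step 3: breakpoints that cut N(0, 1) into equiprobable regions.
    gauss = NormalDist()
    breakpoints = [gauss.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]

    # Step 4: each segment mean becomes the letter of the region it falls in.
    return "".join(chr(ord("a") + bisect_right(breakpoints, v)) for v in paa)


word = sax([0, 0, 0, 0, 10, 10, 10, 10], segments=2, alphabet_size=4)
```

    Here the low half of the signal maps to the lowest letter and the high half to the highest, so `word` is `"ad"`.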
    • "It uses HBase (a column based Hadoop Storage) and iSAX [8] to achieve both scale and index/search respectively. Lumberyard is based on iSAX and uses the jMotif Java library that provides symbolic aggregate approximation (SAX) [9] and iSAX for time series data enabling outlier detection and the search for often occurring patterns."
    ABSTRACT: In order to understand the dependencies in the power system we try to derive state information by combining high-rate voltage time series captured at different locations with data analysis at different scales. This may enable large-scale simulation and modeling of the grid. Data captured by our recently introduced Electrical Data Recorders (EDR) and power grid simulation data are stored in the large scale data facility (LSDF) at Karlsruhe Institute of Technology (KIT) and are growing rapidly in size. In this article we compare classic sequential multithreaded time series data processing to distributed processing using Pig on a Hadoop cluster. Further, we present our ideas for a better organization of our raw data and metadata that is indexable, searchable and suitable for big data.
    Full-text · Conference Paper · Feb 2013