ABSTRACT: Current research in indexing and mining time series data has produced many interesting algorithms and representations. However,
the algorithms and the size of data considered have generally not been representative of the increasingly massive datasets
encountered in science, engineering, and business domains. In this work, we introduce a novel multi-resolution symbolic representation
that can be used to index datasets several orders of magnitude larger than any previously considered in the literature.
To demonstrate the utility of this representation, we constructed a simple tree-based index structure that facilitates fast
exact search and approximate search that is orders of magnitude faster. For example, with a database of one hundred million time
series, the approximate search can retrieve high-quality nearest neighbors in slightly over a second, whereas a sequential
scan would take tens of minutes. Our experimental evaluation demonstrates that our representation allows index performance
to scale well with increasing dataset sizes. Additionally, we provide analysis concerning parameter sensitivity, approximate
search effectiveness, and lower-bound comparisons between time series representations in a bit-constrained environment. We
further show how to combine exact and approximate search as subroutines in data mining algorithms,
allowing for the exact mining of truly massive real-world datasets containing tens of millions of time series.
Data Mining and Knowledge Discovery 04/2012; 19(1):24-57.