This study addresses the task of automatically identifying water mixing events in the multivariate time series of salinity, temperature and dissolved oxygen provided by the Koljö fjord observatory. The observatory is used to test new underwater sensor technology and to monitor water quality with respect to hypoxia and oxygenation in the fjord, and has been collecting data since April 2011. Inflows of new water originating from the open sea or from rivers connected to the fjord system change the fjord's water properties, which manifests as peaks or drops in dissolved oxygen, salinity and temperature. An acute state of oxygen depletion can harm wildlife and damage the ecosystem permanently. The major challenge for the analysis is that the water property changes exhibit highly varying peak strength and correlation between the signals. The proposed data-driven analysis method extends existing univariate outlier detection approaches, based on clustering techniques, to identify the water mixing events. It comprises three major steps: (1) smoothing of the input data to counter noise, (2) individual outlier detection within the separate variables, and (3) clustering of the results with the DBSCAN algorithm to determine the anomalous events. The proposed approach detects the water mixing events with an F1-measure of 0.885, a precision of 0.931 (93.1% of the detected events are true mixing events) and a recall of 0.843 (84.3% of the events that should have been found were indeed detected). Using the proposed method, oceanographers can be informed automatically about the status of the fjord without manual interaction or physical presence at the experiment site.
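A minimal sketch of such a three-step pipeline is shown below; it is not the study's actual implementation, and the column names, smoothing window, z-score threshold and DBSCAN parameters are illustrative assumptions only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

def detect_mixing_events(df, window=24, z_thresh=3.0, eps=3.0, min_samples=5):
    """Flag candidate water mixing events in a multivariate time series.

    df: DataFrame indexed by time, with columns such as 'salinity',
    'temperature' and 'oxygen' (hypothetical names).
    """
    # Step 1: smooth each variable with a rolling mean to counter sensor noise.
    smoothed = df.rolling(window, center=True, min_periods=1).mean()

    # Step 2: univariate outlier detection -- flag samples whose deviation from
    # the smoothed series exceeds a z-score threshold, separately per variable.
    residual = df - smoothed
    z = (residual - residual.mean()) / residual.std()
    outlier_idx = np.where((z.abs() > z_thresh).any(axis=1))[0]
    if outlier_idx.size == 0:
        return []

    # Step 3: cluster the temporal positions of the flagged samples with DBSCAN;
    # dense groups of outliers form candidate mixing events, isolated flags
    # are discarded as noise (label -1).
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        outlier_idx.reshape(-1, 1))
    return [outlier_idx[labels == k] for k in set(labels) if k != -1]
```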
Clustering algorithms in the field of data mining are used to aggregate similar objects into common groups. One of the best-known of these algorithms is DBSCAN. Its distinct design enables the search for an a priori unknown number of arbitrarily shaped clusters and, at the same time, allows noise to be filtered out. Due to its sequential formulation, the parallelization of DBSCAN poses a challenge. In this paper we present a new parallel approach, which we call HPDBSCAN. It employs three major techniques in order to break the sequentiality, enable workload balancing and speed up neighborhood searches in distributed parallel processing environments: (i) a computation split heuristic for domain decomposition, (ii) a data index preprocessing step, and (iii) a rule-based cluster merging scheme.
As a proof of concept we implemented HPDBSCAN as a hybrid OpenMP/MPI application. Using real-world data sets, such as a point cloud of the old town of Bremen, Germany, we demonstrate that our implementation achieves significant speed-up and scale-up in common HPC setups. Moreover, we compare our approach with previous attempts to parallelize DBSCAN, showing an order-of-magnitude improvement in terms of computation time and memory consumption.
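The spatial-grid indexing that underlies techniques (i) and (ii) can be sketched in a simplified, single-process form as below; the real HPDBSCAN distributes contiguous blocks of these cells across MPI ranks and reconciles the partial results with its rule-based merging, neither of which is shown here.

```python
import numpy as np
from collections import defaultdict

def build_grid_index(points, eps):
    """Hash points into hypercube cells of side length eps.

    A neighborhood query then only has to inspect a point's own cell and the
    directly adjacent cells, and contiguous blocks of cells give a natural
    domain decomposition for parallel processing.
    """
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple((p // eps).astype(int))].append(i)
    return cells

def region_query(points, cells, eps, i):
    """Return the indices of all points within distance eps of point i."""
    dim = points.shape[1]
    center = np.array((points[i] // eps).astype(int))
    # Enumerate the 3**dim neighboring cells (including the center cell).
    offsets = np.stack(np.meshgrid(*([[-1, 0, 1]] * dim)), -1).reshape(-1, dim)
    candidates = [j for off in offsets
                  for j in cells.get(tuple(center + off), [])]
    dist = np.linalg.norm(points[candidates] - points[i], axis=1)
    return [candidates[k] for k in np.where(dist <= eps)[0]]
```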
In this paper, the combination of unsupervised clustering algorithms with feedforward neural networks for exchange rate time series forecasting is studied. Unsupervised clustering algorithms have the desirable property of deciding on the number of partitions required to accurately segment the input space during the clustering process, thus relieving the user of this ad hoc choice. Combining this input space partitioning with feedforward neural networks acting as local predictors for each identified cluster helps alleviate the problem of nonstationarity frequently encountered in real-life applications. An improvement in one-step-ahead forecasting accuracy was achieved compared to a global feedforward neural network model for the time series of the exchange rate of the German Mark to the US Dollar.
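The cluster-then-predict scheme described above could be sketched roughly as follows; the choice of MeanShift (which determines the number of clusters from the data), the lag length and the network sizes are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.neural_network import MLPRegressor

def fit_local_predictors(series, lags=5):
    """Cluster lag vectors, then train one small feedforward network per cluster
    as a local one-step-ahead predictor for that region of the input space."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + lags] for i in range(len(series) - lags)])
    y = series[lags:]
    assigner = MeanShift().fit(X)          # number of clusters decided by the data
    models = {}
    for k in np.unique(assigner.labels_):
        net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
        models[k] = net.fit(X[assigner.labels_ == k], y[assigner.labels_ == k])
    return assigner, models

def predict_next(assigner, models, recent_lags):
    # Route the most recent lag vector to its cluster's local network.
    k = assigner.predict([recent_lags])[0]
    return models[k].predict([recent_lags])[0]
```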
Subsequence clustering of multivariate time series is a useful tool for discovering repeated patterns in temporal data. Once these patterns have been discovered, seemingly complicated datasets can be interpreted as a temporal sequence of only a small number of states, or clusters. For example, raw sensor data from a fitness-tracking application can be expressed as a timeline of a select few actions (e.g., walking, sitting, running). However, discovering these patterns is challenging because it requires simultaneous segmentation and clustering of the time series. Furthermore, interpreting the resulting clusters is difficult, especially when the data is high-dimensional. Here we propose a new method of model-based clustering, which we call Toeplitz Inverse Covariance-based Clustering (TICC). Each cluster in the TICC method is defined by a correlation network, or Markov random field (MRF), characterizing the interdependencies between different observations in a typical subsequence of that cluster. Based on this graphical representation, TICC simultaneously segments and clusters the time series data. We solve the TICC problem through alternating minimization, using a variation of the expectation maximization (EM) algorithm. We derive closed-form solutions to efficiently solve the two resulting subproblems in a scalable way, through dynamic programming and the alternating direction method of multipliers (ADMM), respectively. We validate our approach by comparing TICC to several state-of-the-art baselines in a series of synthetic experiments, and we then demonstrate on an automobile sensor dataset how TICC can be used to learn interpretable clusters in real-world scenarios.
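A heavily simplified sketch of that alternating minimization is given below; it windows the series, alternates between likelihood-based assignment and per-cluster precision estimation, and substitutes an off-the-shelf graphical lasso for TICC's Toeplitz-constrained ADMM solver while omitting the dynamic-programming switching penalty, so it should be read as an illustration of the structure rather than as TICC itself.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def gaussian_loglik(W, model):
    """Per-row Gaussian log-likelihood (up to a shared constant)."""
    diff = W - model.location_
    _, logdet = np.linalg.slogdet(model.precision_)
    return 0.5 * (logdet - np.einsum('ij,jk,ik->i', diff, model.precision_, diff))

def simplified_ticc(X, n_clusters=3, window=5, n_iter=10, alpha=0.1, seed=0):
    """X: (T, d) multivariate time series. Returns one cluster label per window."""
    # Stack each length-`window` subsequence into a single observation vector.
    W = np.array([X[t:t + window].ravel() for t in range(len(X) - window + 1)])
    labels = np.random.default_rng(seed).integers(n_clusters, size=len(W))
    for _ in range(n_iter):
        # Update step: fit one sparse Gaussian MRF (inverse covariance) per cluster.
        # (Empty or very small clusters are not handled in this sketch.)
        models = [GraphicalLasso(alpha=alpha).fit(W[labels == k])
                  for k in range(n_clusters)]
        # Assignment step: give each window to the cluster with the highest likelihood.
        labels = np.stack([gaussian_loglik(W, m) for m in models], axis=1).argmax(axis=1)
    return labels
```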
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
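The recipe (an implicit non-linear feature map induced by a polynomial kernel, a linear decision surface in that feature space, and a soft margin for non-separable data) corresponds to what is now standard soft-margin SVM training; below is a minimal two-class example on an OCR-style dataset using scikit-learn, which is of course not the paper's original experimental setup.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small handwritten-digit data as a stand-in for an OCR benchmark;
# restrict to two classes to match the two-group formulation.
X, y = load_digits(return_X_y=True)
X, y = X[y < 2], y[y < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel="poly" realizes the polynomial input transformation implicitly;
# C controls the soft margin that tolerates non-separable training data.
clf = SVC(kernel="poly", degree=3, C=1.0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```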
Drinking water networks are vulnerable to toxic chemicals. Anomaly-detection-based event detection can provide a reliable indication of contamination by analyzing real-time water quality data collected by online distributed sensors in the water network. This article reviews water quality event detection methodologies based on the correlation between water quality parameters and contaminants. Further, we review how to reduce the impact of contamination in the water distribution network, including sensor placement optimization and contamination source determination.
Rapid detection of anomalous operating conditions within a water distribution network is desirable for the protection of the network against both accidental and malevolent contamination events. In the absence of a suite of in-situ, real-time sensors that can accurately identify a wide range of contaminants, we focus on detecting changes in water quality through analysis of existing data streams from in-situ water quality sensors. Three different change detection algorithms are tested: time series increments, linear filter and multivariate distance. Each of these three algorithms uses previous observations of the water quality to predict future water quality values. Large deviations between the predicted or previously measured values and the values observed at future times indicate a change in the expected water quality. What constitutes a large deviation is quantified by a threshold value applied to the observed differences. Both simulated time series of water quality and measured chlorine residual values from two different locations within a distribution network are used as the background water quality values. The simulated time series are created specifically to challenge the change detection algorithms with bimodally distributed water quality values having square wave and sine wave time series, with and without correlated noise. Additionally, a simulated time series resembling observed water quality time series is created with different levels of variability. The algorithms are tested in two different ways. First, background water quality without any anomalous events is used to test the ability of each algorithm to identify the water quality value at the next time step. Summary statistics on the prediction errors, as well as the number of false positive detections, quantify the ability of each algorithm to predict the background water quality. The performance of the algorithms with respect to limiting false positives is also compared against a simpler "set point" approach to detecting water quality changes. The second mode of testing employs events in the form of square waves superimposed on modeled or measured background water quality data. Three different event strengths are examined and the event detection capabilities of each algorithm are evaluated through the use of receiver operating characteristic (ROC) curves. The area under the ROC curve provides a quantitative basis of comparison across the three algorithms. Results show that the multivariate algorithm produces the lowest prediction errors for all cases of background water quality. A comparison of the number of false positives reported by the change detection algorithms and by a set point approach highlights the efficiency of the change detection algorithms. Across all three algorithms, most prediction errors are within one standard deviation of the mean water quality. The event detection results show that the best-performing algorithm varies across the different background water quality models and simulated event strengths. This paper was presented at the 8th Annual Water Distribution Systems Analysis Symposium, which was held with the generous support of the Awwa Research Foundation (AwwaRF).
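The simplest of the three schemes, the time series increments algorithm, amounts to using the previous observation as the prediction and thresholding the deviation; a rough sketch follows, with the threshold value and the ground-truth event labels as illustrative inputs rather than values from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def increment_detector(values, threshold):
    """Flag a water-quality change whenever the absolute increment between
    consecutive observations exceeds the threshold.  The linear-filter and
    multivariate-distance schemes replace the previous observation with a
    more elaborate prediction but threshold the deviation in the same way."""
    values = np.asarray(values, dtype=float)
    deviations = np.abs(np.diff(values, prepend=values[0]))
    return deviations, deviations > threshold

# Hypothetical usage: `chlorine_series` is a measured chlorine residual series
# and `event_truth` marks the superimposed square-wave events.
# deviations, alarms = increment_detector(chlorine_series, threshold=0.1)
# print("area under the ROC curve:", roc_auc_score(event_truth, deviations))
```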