
Pattern Recognition in Non-Stationary Environmental Time Series Using Sparse Regression

Abstract

Real-world environmental management tasks require hindcasts and forecasts of natural events (wind, ocean waves and currents, sea ice, etc.), which can be obtained using data-driven techniques for metocean process simulation. The models can be fitted individually to specific fragments of a non-stationary multivariate time series to reproduce a metocean environment with desired characteristics. In this paper, an approach based on LASSO regularised regression is proposed for clustering environmental time series. It allows identifying situations with a specific interaction between variables, which can be interpreted through the values of the regression coefficients. A weather generator was used to produce synthetic time series similar to both the general dataset and the identified clusters. The obtained results can be used to improve the identification and interpretation of computationally lightweight environmental models.
ScienceDirect
Available online at www.sciencedirect.com
Procedia Computer Science 156 (2019) 357–366
1877-0509 © 2019 The Authors. Published by Elsevier Ltd.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 8th International Young Scientist Conference on Computational Science.
10.1016/j.procs.2019.08.212
Irina Deeva a,*, Nikolay O. Nikitin a, Anna V. Kaluyzhnaya a
a ITMO University, 49 Kronverksky Pr., St. Petersburg, 197101, Russian Federation
Keywords: data-driven models; metocean simulation; time series clustering; synthetic data; pattern mining
1. Introduction
Current offshore development tasks raise problems of structural optimisation for nearshore constructions. One of the main external concerns in this task is the environmental factors [7], especially extreme events and univariate and multivariate anomalies in atmospheric, hydrodynamic or wave processes [22].
To obtain a more reliable solution and decrease the environmental risks, long-term climate datasets for metocean variables at specific points are required. They can be obtained from historical reanalyses: global (for example, the ERA Interim reanalysis [6] for 1979-2019) or regional (for example, the pan-Arctic reanalysis GLORYS2v4 [10] for 1992-2015). However, their spatio-temporal coverage and resolution are often insufficient for a correct representation of the processes in small-scale domains [19]. Satellite-based observations have many similar restrictions.
Corresponding author. Tel.: +7-928-292-0889
E-mail address: iriny.deeva@gmail.com
A possible solution to this problem is long-term simulation using regional high-resolution configurations of coupled numerical metocean models. However, these models are computationally very expensive and often require an expert to tune the configuration for specific regional conditions. Another problem of the simulated data is the rare occurrence of some events that should be taken into account when optimising coastal and nearshore constructions.
A promising solution is to build an artificial environment for the structure being optimised, using synthetic environmental datasets with desired characteristics. Various data-driven models can be applied to generate the dataset and preserve physical consistency between the variables (for example, sea surface height, wave height, wind speed, etc.). However, the quality of a data-driven model can be increased by segmenting the time series into different metocean events (storms, calm periods, swells).
This approach is widely applied to reproduce metocean events with specific properties (for example, flood-causing cyclones [28]). It can be considered an unsupervised learning task. The labelled dataset can then be used to train a classification model for metocean events.
In this paper, regression model-based clustering is proposed to solve this task. Special attention is paid to the interpretability of the obtained results. This is important because the data generation process should be understandable to the domain expert to ensure the correctness of the dataset.
As a case study, the Kara Sea domain was used. To generate the reference long-term dataset, an offline-coupled system of pan-Arctic ocean, ice, atmospheric and wave models was configured and executed on the Lomonosov-2 supercomputer. We conducted a set of experiments to compare different approaches to sub-model identification.
This paper is structured as follows. Sec. 2 describes the problem statement and the mathematical formalisation of the synthetic data generation task. Sec. 3 provides an overview of existing research in the field of meteorological time series clustering and pattern recognition. Sec. 4 describes the metocean datasets and the methods that were used for clustering and searching for patterns in the source data. Sec. 5 provides the results of the conducted experiments and their analysis. Finally, Sec. 6 provides a summary of the paper and a generalisation of the obtained experimental results.
2. Problem statement
The initial problem in using data-driven models for environmental tasks is the multidimensional relationship among many processes. An additional problem is the non-stationarity of the processes and, in the general case, the lack of information about the particular type of non-stationarity. Under such limitations, any prior assumption about the model of the process is very uncertain. This affects the choice of appropriate mathematical methods and biases the conclusions drawn from them (e.g. using methods for Gaussian processes when the target process is far from Gaussian). To overcome this issue, it seems reasonable to first identify and split the initial data into intervals that have homogeneous statistical characteristics. For environmental data, homogeneous statistical characteristics are usually associated with certain weather patterns. Such homogeneous chunks are more suitable and easier for identifying the structure of a prediction model. In this article, we investigate the possibility of identifying patterns with a homogeneous structure of the approximation function. Sparse polynomial regression with L1 regularisation (LASSO) was chosen as the class of approximation functions. This choice is motivated by the ability of curvilinear regression to approximate some kinds of non-linearities and by the possibility of discovering its structure by neglecting insignificant terms.
The objective function with the L1 penalty for the LASSO model can be written as follows:
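$$\min_{\beta \in \mathbb{R}^p} \left\{ (Y - X\beta)^T (Y - X\beta) + \lambda |\beta|_1 \right\}, \quad \text{where } |\beta|_1 = \sum_{j=1}^{p} |\beta_j|. \qquad (1)$$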
In formulation (1), X is the matrix of predictors (including curvilinear terms), Y is the response vector (the predicted variable) and λ is the degree of penalisation. A proof-of-concept example of the visible separation of the LASSO coefficients into patterns, and of their alignment with different environmental situations, is presented in Fig. 1.
Fig. 1. An example of the interaction between a wave-breaker and the metocean environment. The red rectangles highlight the coefficients for the open-water period (without ice coverage); the small blue rectangles inside them highlight the anomalous negative values of the regression coefficients that indicate the existence of a specific pattern.
3. Related work
The clustering of environmental data is a special case of multivariate non-stationary time series clustering [13]. The clustering and classification of time series is a challenging problem that is widely studied but still has no universal solution [1]. Weather data form multidimensional time series, which require dedicated clustering methods. For example, in [12] the application of the Finite Element Method is considered in the context of time series analysis. The approach allows the identification of hidden regimes in time series characterised by a set of model-specific parameters and a model distance functional. Other approaches can also be applied, such as deep neural networks [9] for classification, and the autoregressive metric [5] and the dynamic time warping metric [17] for clustering.
A common approach to the interpretation of multivariate time series and the clusters in them is to use dimensionality reduction and visualisation techniques [26]. However, the internal structure of some types of data-driven models (regression-based models, tree-based models) can also be interpreted directly [21]. For this reason, the model-based approach to time series clustering can be chosen as an appropriate method according to several studies [11,34,4]. Dimensionality reduction methods can also be applied to the parameters of weather models [15], [14].
A separate important issue is determining the number of clusters for weather time series. Since we do not know in advance how many hidden regimes (clusters) can be observed in a time series, we can set a metric by which the quality of the identified clusters is evaluated for different numbers of clusters [2].
There are also several approaches to the task of synthetic environment identification that can be applied to process the time series in the identified clusters. The first is based on the application of data-driven or statistical models to restore the main variable from a set of other variables. Stochastic models [29], autoregressive models [3] and neural networks [30] are widely applied in this area. The main disadvantage of many of these methods is the insufficient quality of rare event reconstruction and the unreliable preservation of data consistency, whereas metocean data are characterised by a strong physical dependence between the different environmental processes represented as variables in a dataset.
The other approach is weather generators [40]. They allow creating consistent time series based on an existing environmental dataset. A promising approach is to identify the multi-year variability using ARIMA or wavelet-based models [36] and then apply Markov chain models [35] to reconstruct the daily variability (most weather generators do not represent hourly variability; however, there are a few techniques that support hourly simulations too [25,33]).
The application of data-driven models to wind wave forecasting is also widely discussed in the literature. They can be applied both to forecasting [41] and to the extension of real observations with synthetic data [30].
Usually, wind variables shifted in time and space are used to reproduce the corresponding wave characteristics at a point [20]. Neural networks can also be applied to reproduce the whole wave field in a region [18].
To increase the quality of the simulation for a specific group of events, a problem-specific model can be applied instead of common data-driven wave models. This specific model can differ in structure or in training dataset.
The described methods can be used to generate synthetic external data for various simulations, but the specific task of building an artificial environment requires a more flexible and interpretable approach. Based on the results of these studies, we decided to base the research on linear regression and build a proof-of-concept model for non-stationary metocean time series based on interpretable clustering.
4. Data and methods
4.1. Dataset overview
We chose the Arctic region as a case study for several reasons. First, the dynamics of the sea ice cover change the wind-wave interaction, which makes the task of reliable data-driven model identification more complicated. Second, the insufficient spatio-temporal coverage of observations and reanalyses in polar regions makes the synthetic data generation task relevant. Finally, the climate changes of the last decades make it necessary to take long-term trends into account during data analysis.
The reference dataset was obtained using a system of state-of-the-art high-resolution computer models. It includes the NEMO ocean model [24], the LIM3 ice model [32], the WRF atmospheric model [31] and the Wave Watch III model [38]. The configuration was specifically adapted to regional Arctic simulation as described in [16]. The time resolution of all models is 1 hour. The 30-year run was executed on the Lomonosov-2 supercomputer.
The final datasets contain 8 variables that describe wave-wind processes (Datetime, Wind speed, Wind direction, Sea surface height, Significant wave height, Wave peak period, Mean wave period, Ice concentration). As a case study, 18 points were chosen in a small area of the Kara Sea near the Ob Bay coast. Their locations and the sea depths at the points are presented in Fig. 2. To analyse the variability at different scales, we prepared daily- and yearly-averaged time series.
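The averaging step can be sketched with pandas as shown below. This is an illustrative example on randomly generated hourly data; the column names and the two-year period are assumptions and do not reproduce the paper's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two non-leap years (2001-2002) of hypothetical hourly values.
n_hours = 24 * 365 * 2
index = pd.date_range("2001-01-01", periods=n_hours, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame(
    {
        "wind_speed": rng.gamma(2.0, 3.0, n_hours),  # m/s, synthetic
        "hs": rng.gamma(2.0, 0.5, n_hours),          # significant wave height, m
    },
    index=index,
)

# Daily averages: one row per calendar day.
daily = hourly.resample("D").mean()

# Yearly averages: group the hourly rows by calendar year.
yearly = hourly.groupby(hourly.index.year).mean()

print(daily.shape, yearly.shape)
```

A plain arithmetic mean as used here is not appropriate for circular variables such as wind direction, which would need separate handling.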
Fig. 2. The locations and indexes of the selected data points in the Kara Sea. The colormap represents the sea depth in each grid cell.
To obtain a synthetic dataset that is similar to the real (model-based) one, a stochastic weather generator was configured using the R package weathermen obtained from the repository [39]. It uses ARIMA and Markov chain models to generate a dataset of the desired length. The source code was modified to support arbitrary input and output variables (the reference package was strictly configured for a specific set of variables only).
4.2. Model-based clustering and pattern mining
Metocean multivariate time series are characterised by specific annual and inter-annual patterns that can be identified using statistical techniques [37]. The application of clustering methods also allows identifying local sea wave events (storms, calm periods, swells, ice and ice-free periods) even without manual pre-labelling. At the same time, the clustering results should be interpretable for the domain expert.
To reach these aims, we decided to build a data-driven model with the significant wave height (hs) as the target variable and a set of predictor variables that includes the wind, ice and wave characteristics at the target and neighbouring points.
To reduce the dimension of the original data and select the most significant predictors, we used the LASSO method with the alpha parameter (the constant that multiplies the L1 term) equal to 0.1. It can be seen from Fig. 3 that in some cases the coefficients of the predictors turn to zero, which indicates the insignificance of the corresponding component of the model and can therefore be associated with a particular weather pattern.
Fig. 3. Coefficients of LASSO. In some cases the coefficients of the predictors turn to zero, which indicates the insignificance of the corresponding component of the model and can therefore be associated with a particular weather pattern.
The concept of model-based clustering of time series is illustrated in Fig. 4a, which also demonstrates the chunk concept that was built into the algorithm as a sliding window of pre-defined size.
First, all observations were divided into equal chunks. It should be noted that the quality of clustering depends on the size of the chunk. The metric used was the adjusted Rand score, which computes a similarity measure between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in the predicted and real clusterings. During the experiments, it was found that the most efficient chunk size is 72 hours; the values of the metric for different chunk sizes can be seen in Fig. 4b. The architecture of the proposed solution is presented in Fig. 5. The density estimates for the distribution of the coefficients in different environmental cases are presented in Fig. 6.
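The chunking step and the metric can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper chunk_coefficients is hypothetical, the chunks are taken as non-overlapping, and the data are synthetic. Only the chunk size of 72 hours and alpha = 0.1 come from the text.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import adjusted_rand_score


def chunk_coefficients(X, y, chunk_size=72, alpha=0.1):
    """Fit one LASSO model per non-overlapping chunk and stack the coefficients."""
    n_chunks = len(y) // chunk_size  # a trailing incomplete chunk is dropped
    rows = []
    for k in range(n_chunks):
        sl = slice(k * chunk_size, (k + 1) * chunk_size)
        rows.append(Lasso(alpha=alpha).fit(X[sl], y[sl]).coef_)
    return np.vstack(rows)


rng = np.random.default_rng(1)
n_hours = 72 * 20
X = rng.normal(size=(n_hours, 3))  # stand-ins for three hourly predictors
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=n_hours)

coefs = chunk_coefficients(X, y)
print(coefs.shape)  # one row of coefficients per 72-hour chunk

# The adjusted Rand score ignores how cluster labels are named:
# identical partitions with swapped labels still score 1.0.
ari_same = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print(ari_same)
```

Repeating the computation for several values of chunk_size and comparing the resulting clusterings with reference labels would give a curve of the kind described for Fig. 4b.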
Fig. 4. The visualisation of the model-based clustering approach. The upper plot contains the time series from the dataset; the lower plot contains the regression coefficients identified for every subset.
Fig. 5. The proposed workflow for the interpretable generation of case-specific synthetic environmental datasets.
Fig. 6. The density estimates of the regression coefficient distributions for different cases: storm and calm sea. The difference between the distributions supports the idea of coefficient-based clustering for the identification of specific environmental cases.
5. Experimental studies
Several experiments were conducted to examine the feasibility of regression-based time series clustering in different cases. The main aspects to be validated are the practical possibility of identifying interpretable clusters and the effectiveness of the model trained on the real dataset for the simulated synthetic time series.
5.1. Chunk-based clustering using k-means
This experiment was aimed at testing the possibility of clustering weather patterns in the parameter space of a linear regression built on the data at a single point. The wave height was taken as the response of the linear regression, and three parameters selected by LASSO were taken as predictors: wind speed, wave peak period and mean wave period.
Then, a linear regression model was trained on each chunk, and its coefficients were stored in sequence. The resulting visualisation in the space of model coefficients, however, demonstrates that it is rather difficult to identify explicit patterns purely visually. Therefore, the obtained coefficients were divided into groups using the K-means method. This separation revealed two clusters, which may indicate the presence of two periods in the observations: ice and ice-free.
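Clustering in the coefficient space can be illustrated with the sketch below. The coefficient vectors are generated synthetically as two well-separated groups (an assumption made for the example), and the number of clusters is passed to K-means explicitly, since K-means does not estimate it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)

# Hypothetical per-chunk coefficient vectors for three predictors
# (wind speed, wave peak period, mean wave period).
ice_free = rng.normal(loc=[0.8, 0.3, 0.2], scale=0.05, size=(60, 3))
ice_covered = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.01, size=(40, 3))
coefs = np.vstack([ice_free, ice_covered])
true_labels = np.array([0] * 60 + [1] * 40)

# K-means with two clusters on the coefficient vectors.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coefs)

# Agreement with the known grouping, independent of label naming.
ari = adjusted_rand_score(true_labels, kmeans.labels_)
print(f"adjusted Rand score: {ari:.3f}")
print("cluster centres:\n", kmeans.cluster_centers_)
```

The cluster centres are themselves coefficient vectors, so each cluster can be read as a typical regression model for the corresponding period.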
5.2. Dimensionality reduction and density-based clustering
The second set of experiments was aimed at checking the effectiveness of clustering under different ice conditions. The t-SNE [23] dimensionality reduction method was used to convert the coefficient space to 2D. Then, the DBSCAN [8] algorithm was applied for the clustering, because it estimates not just the labels but also the number of clusters. The results of processing the full dataset are presented in Fig. 7a. Two clear clusters are highlighted.
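The combination of t-SNE and DBSCAN can be sketched as follows. The coefficient vectors are synthetic, and the rescaling of the embedding as well as the values of perplexity, eps and min_samples are assumptions chosen for this example; the paper does not report its settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical per-chunk coefficient vectors forming two groups.
group_a = rng.normal(loc=[0.8, 0.3, 0.2], scale=0.05, size=(100, 3))
group_b = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.05, size=(100, 3))
coefs = np.vstack([group_a, group_b])

# Project the coefficient space to 2D.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(coefs)

# Rescale so that the DBSCAN radius does not depend on the t-SNE output scale.
embedding = StandardScaler().fit_transform(embedding)

# Density-based clustering; the label -1 marks points treated as noise.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(embedding)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```

Because t-SNE does not preserve global distances, the number of clusters found this way depends on the chosen eps and perplexity and should be checked against the time series, as is done with the cluster-specific markup.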
Fig. 7. The results of t-SNE clustering for a) the full dataset, b) the ice-free period.
To demonstrate the interpretation of the clusters, the time series with cluster-specific markup are presented in Fig. 8a. However, it is still hard to identify the sub-clusters for the ice-free period. Therefore, the ice period (when the wave height equals zero) was removed from the time series and the data were clustered again. As can be seen in Fig. 7b, the density-based clustering still yields two clear groups. To interpret the clusters, the time series for 3 years were visualised again in Fig. 8b. It can be seen that the ice melting and freezing cases are covered by a smaller cluster in the reduced space, which allows separating them from both the ice-covered and the ice-free cases using only the obtained regression coefficients.
Fig. 8. The clustered time series: a) for the full dataset, where the green colour represents the cluster for the ice-free and freeze-melt periods and the red colour represents the cluster for the ice-covered sea; b) for the ice-free period in the dataset, where the green colour represents the cluster for the ice-free period and the red colour represents the freeze-melt periods.
5.3. Cluster-specific synthetic data generation
The next experiment consists of synthetic data generation based on one of the clusters. The modified weather generator described above was initialised with the ice melt-freeze cluster (green colour in Fig. 7b), and the data from the identified cluster were used as a training set for the generator. The comparison of the obtained synthetic data with the real data and the generic synthetic data is presented in Fig. 9.
Fig. 9. The estimated distribution density of the environmental variables from the cluster-based synthetic data (red), real cluster data (blue) and generic synthetic data (black).
It can be seen that the distribution of the variables in the common time series differs from both the cluster data and the synthetic cluster-based values. At the same time, the real and synthetic data for the cluster are sufficiently similar.
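A comparison of this kind can be sketched with kernel density estimates. The two-sample Kolmogorov-Smirnov statistic is added here as one possible quantitative complement to the visual comparison; it is not the procedure used in the paper, and all samples below are synthetic stand-ins.

```python
import numpy as np
from scipy.stats import gaussian_kde, ks_2samp

rng = np.random.default_rng(4)

# Hypothetical wave height samples (m); none of these come from the paper.
real_cluster = rng.gamma(4.0, 0.1, 1000)       # real data in one cluster
synthetic_cluster = rng.gamma(4.0, 0.1, 1000)  # generator trained on that cluster
generic_synthetic = rng.gamma(2.0, 1.0, 1000)  # generator trained on all data

# Kernel density estimates on a common grid, as would be drawn in a density plot.
grid = np.linspace(0.0, 6.0, 200)
density_real = gaussian_kde(real_cluster)(grid)
density_cluster = gaussian_kde(synthetic_cluster)(grid)
density_generic = gaussian_kde(generic_synthetic)(grid)

# Two-sample KS statistic: a smaller value means more similar distributions.
ks_cluster = ks_2samp(real_cluster, synthetic_cluster).statistic
ks_generic = ks_2samp(real_cluster, generic_synthetic).statistic
print(f"real vs cluster-based synthetic: {ks_cluster:.3f}")
print(f"real vs generic synthetic:       {ks_generic:.3f}")
```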
6. Conclusion
In this paper, an analysis of different approaches to data-driven model identification and interpretation was provided. The time series for each spatial point were separated into chunks, and a LASSO regression was fitted in each chunk. The optimal chunk size of 72 hours was estimated using a set of experiments with the adjusted Rand
index metric. Several interpretable clusters were recognised and analysed by comparing the distributions of variables in the clusters. Then, synthetic time series were generated for the identified cluster.
The obtained results confirm the practical effectiveness of the clustering in the regression coefficient space described in the paper. The proposed approach to metocean time series clustering allows increasing the quality of synthetic data generation for specific cases.
All source code and materials used in the paper are available in the repository [27].
Acknowledgements
This research is financially supported by The Russian Scientific Foundation, Agreement #19-71-00150.
The research is carried out using the equipment of the shared research facilities of HPC computing resources at
Lomonosov Moscow State University.
References
[1] Aghabozorgi, S., Shirkhorshidi, A.S., Wah, T.Y., 2015. Time-series clustering–a decade review. Information Systems 53, 16–38.
[2] Andrew, C., Julie, K., L., 2012. Data clustering reveals climate impacts on local wind phenomena. Journal of Applied Meteorology and
Climatology 51, 1547–1557.
[3] Biller, B., Nelson, B.L., 2003. Modeling and generating multivariate time-series input processes using a vector autoregressive technique. ACM
Transactions on Modeling and Computer Simulation (TOMACS) 13, 211–237.
[4] Chamroukhi, F., Samé, A., Aknin, P., Govaert, G., 2011. Model-based clustering with hidden markov model regression for time series with
regime changes, in: The 2011 International Joint Conference on Neural Networks, IEEE. pp. 2814–2821.
[5] Corduas, M., Piccolo, D., 2008. Time series clustering and classification by the autoregressive metric. Computational statistics & data analysis
52, 1860–1872.
[6] Dee, D.P., Uppala, S., Simmons, A., Berrisford, P., Poli, P., Kobayashi, S., Andrae, U., Balmaseda, M., Balsamo, G., Bauer, d.P., et al., 2011.
The era-interim reanalysis: Configuration and performance of the data assimilation system. Quarterly Journal of the royal meteorological
society 137, 553–597.
[7] Ehlers, S., Kujala, P., Veitch, B., Khan, F., Vanhatalo, J., 2014. Scenario based risk management for arctic shipping and operations, in:
ASME 2014 33rd International Conference on Ocean, Offshore and Arctic Engineering, American Society of Mechanical Engineers. pp.
V010T07A006–V010T07A006.
[8] Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with
noise., in: Kdd, pp. 226–231.
[9] Fawaz, H.I., Forestier, G., Weber, J., Idoumghar, L., Muller, P.A., 2019. Deep learning for time series classification: a review. Data Mining
and Knowledge Discovery , 1–47.
[10] Garric, G., Parent, L., Greiner, E., Drévillon, M., Hamon, M., Lellouche, J.M., Régnier, C., Desportes, C., Le Galloudec, O., Bricaud, C., et al.,
2017. Performance and quality assessment of the global ocean eddy-permitting physical reanalysis glorys2v4., in: EGU General Assembly
Conference Abstracts, p. 18776.
[11] Grun, B., 2018. Model-based clustering. arXiv preprint arXiv:1807.01987 .
[12] Horenko, I., 2010a. Finite element approach to clustering of multidimensional time series. SIAM J. Sci. Comput. 32, 62–83.
[13] Horenko, I., 2010b. On clustering of non-stationary meteorological time series. Dynamics of Atmospheres and Oceans 49, 164–187.
[14] Horenko, I., Johannes, S.E., Christof, S., 2006. Set-oriented dimension reduction: Localizing principal component analysis via hidden markov
models. Lecture Notes in Bioinformatics 4216, 98–115.
[15] Horenko, I., Rupert, K., Stamen, D., Christof, S., 2008. Automated generation of reduced stochastic weather models i: simultaneous dimension
and model reduction for time series analysis. SIAM Mult. Mod. Sim. 6, 1125–1145.
[16] Hvatov, A., Nikitin, N.O., Kalyuzhnaya, A.V., Kosukhin, S.S., 2018. Adaptation of NEMO-LIM3 model for multigrid high resolution Arctic
simulation. arXiv preprint arXiv:1810.03657.
[17] Izakian, H., Pedrycz, W., Jamal, I., 2015. Fuzzy clustering of time series data using dynamic time warping distance. Engineering Applications
of Artificial Intelligence 39, 235–244.
[18] James, S.C., Zhang, Y., O’Donncha, F., 2018. A machine learning framework to forecast wave conditions. Coastal Engineering 137, 1–10.
[19] Jianping, T., Ming, Z., Bingkai, S., 2007. The effects of model resolution on the simulation of regional climate extreme events. Journal of
Meteorological Research 21, 129–140.
[20] Kambekar, A., Deo, M., 2010. Wave simulation and forecasting using wind time history and data-driven methods. Ships and Offshore
Structures 5, 253–266.
[21] Lipton, Z.C., 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490.
[22] Lopatoukhin, L., Boukhanovsky, A., 2009. Multivariable extremes of metocean events (extreme and freak as the examples), in: EGU General
Assembly Conference Abstracts, p. 2098.
[23] Maaten, L.v.d., Hinton, G., 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605.
[24] Madec, G., et al., 2015. NEMO ocean engine.
[25] Mezghani, A., Hingray, B., 2009. A combined downscaling-disaggregation weather generator for stochastic generation of multisite hourly
weather variables over complex terrain: Development and multi-scale validation for the Upper Rhone River basin. Journal of Hydrology 377,
245–260.
[26] Nguyen, M., Purushotham, S., To, H., Shahabi, C., 2017. m-TSNE: A framework for visualizing high-dimensional multivariate time series. arXiv
preprint arXiv:1708.07942.
[27] Nikitin, N., Deeva, I., 2019. The source code for the experimental studies presented in the paper. URL:
https://github.com/Anaxagor/Wave_wind_model.git.
[28] Nikitin, N.O., Spirin, D.S., Visheratin, A.A., Kalyuzhnaya, A.V., 2016. Statistics-based models of flood-causing cyclones for the Baltic Sea
region. Procedia Computer Science 101, 272–281.
[29] Papalexiou, S.M., 2018. Unified theory for stochastic modelling of hydroclimatic processes: Preserving marginal distributions, correlation
structures, and intermittency. Advances in Water Resources 115, 234–252.
[30] Peres, D., Iuppa, C., Cavallaro, L., Cancelliere, A., Foti, E., 2015. Significant wave height record extension by neural networks and reanalysis
wind data. Ocean Modelling 94, 128–140.
[31] Powers, J.G., Klemp, J.B., Skamarock, W.C., Davis, C.A., Dudhia, J., Gill, D.O., Coen, J.L., Gochis, D.J., Ahmadov, R., Peckham, S.E., et al.,
2017. The Weather Research and Forecasting model: Overview, system efforts, and future directions. Bulletin of the American Meteorological
Society 98, 1717–1737.
[32] Rousset, C., Vancoppenolle, M., Madec, G., Fichefet, T., Flavoni, S., Barthélemy, A., Benshila, R., Chanut, J., Levy, C., Masson, S., Vivier, F.,
2015. The Louvain-la-Neuve sea ice model LIM3.6: global and regional capabilities. Geoscientific Model Development 8, 2991–3005. URL:
https://www.geosci-model-dev.net/8/2991/2015/, doi:10.5194/gmd-8-2991-2015.
[33] Safeeq, M., Fares, A., 2011. Accuracy evaluation of ClimGen weather generator and daily to hourly disaggregation methods in tropical
conditions. Theoretical and Applied Climatology 106, 321–341.
[34] Samé, A., Chamroukhi, F., Govaert, G., Aknin, P., 2011. Model-based clustering and segmentation of time series with changes in regime.
Advances in Data Analysis and Classification 5, 301–321.
[35] Shamshad, A., Bawadi, M., Hussin, W.W., Majid, T., Sanusi, S., 2005. First and second order Markov chain models for synthetic generation of
wind speed time series. Energy 30, 693–708.
[36] Steinschneider, S., Brown, C., 2013. A semiparametric multivariate, multisite weather generator with low-frequency variability for use in
climate risk assessments. Water Resources Research 49, 7205–7220.
[37] Stopa, J.E., Cheung, K.F., Tolman, H.L., Chawla, A., 2013. Patterns and cycles in the Climate Forecast System Reanalysis wind and wave data.
Ocean Modelling 70, 207–220.
[38] Tolman, H.L., et al., 2009. User manual and system documentation of WAVEWATCH III TM version 3.14. Technical note, MMAB Contribution
276, 220.
[39] Walker, J.D., 2019. Introduction to the weathergen package. URL: http://walkerjeffd.github.io/weathergen/.
[40] Wilks, D.S., Wilby, R.L., 1999. The weather generation game: a review of stochastic weather models. Progress in Physical Geography 23,
329–357.
[41] Zamani, A., Solomatine, D., Azimian, A., Heemink, A., 2008. Learning from data for wind–wave forecasting. Ocean Engineering 35, 953–962.