ScienceDirect
Available online at www.sciencedirect.com
Procedia Computer Science 156 (2019) 357–366
1877-0509 © 2019 The Authors. Published by Elsevier Ltd.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 8th International Young Scientist Conference on Computational Science.
10.1016/j.procs.2019.08.212
8th International Young Scientist Conference on Computational Science
Pattern Recognition in Non-Stationary Environmental Time Series Using Sparse Regression
Irina Deeva a,∗, Nikolay O. Nikitin a, Anna V. Kaluyzhnaya a
a ITMO University, 49 Kronverksky Pr., St. Petersburg, 197101, Russian Federation
Abstract
Many real-world environmental management tasks require hindcasts and forecasts of natural events (wind, ocean waves and currents, sea ice, etc.) obtained with data-driven techniques for metocean process simulation. The models can be fitted individually to specific fragments of the non-stationary multivariate time series to reproduce a metocean environment with desired characteristics. In this paper, an approach based on LASSO regularised regression is proposed for clustering environmental time series. It allows identifying situations with a specific interaction between variables, which can be interpreted through the values of the regression coefficients. A weather generator was used to produce synthetic time series similar to both the general dataset and the identified clusters. The obtained results can be used to improve the identification and interpretation of computationally lightweight environmental models.
Keywords: data-driven models; metocean simulation; time series clustering; synthetic data; pattern mining
1. Introduction
Current offshore development tasks raise the problem of structural optimisation for nearshore constructions. One of the main external concerns in this task is environmental factors [7], especially extreme events and univariate and multivariate anomalies in atmospheric, hydrodynamic or wave processes [22].
To obtain a more reliable solution and decrease the environmental risks, a long-term climate dataset of metocean variables at specific points is required. It can be obtained from a historical reanalysis: global (for example, the ERA-Interim reanalysis [6] for 1979-2019) or regional (for example, the pan-Arctic reanalysis GLORYS2v4 [10] for 1992-2015). However, their spatio-temporal coverage and resolution are often insufficient for a correct representation of the processes in small-scale domains [19]. Satellite-based observations have many similar restrictions.
∗Corresponding author. Tel.: +7-928-292-0889
E-mail address: iriny.deeva@gmail.com
A possible solution to this problem is long-term simulation using regional high-resolution configurations of coupled numerical metocean models. However, these models are computationally very expensive and often require an expert to tune the configuration for specific regional conditions. Another problem of the simulated data is the rare occurrence of some events that should be taken into account when optimising coastal and nearshore constructions.
A promising solution is to build an artificial environment for the structure being optimised using synthetic environmental datasets with desired characteristics. Various data-driven models can be applied to generate the dataset and preserve physical consistency between the variables (for example, sea surface height, wave height, wind speed, etc.). However, the quality of a data-driven model can be increased by segmenting the time series into different metocean events (storms, calm periods, swells).
This approach is widely applied to reproduce metocean events with specific properties (for example, flood-causing cyclones [28]). It can be considered an unsupervised learning task. The labelled dataset can then be used to train a classification model for metocean events.
In this paper, regression model-based clustering is proposed to solve this task. Special attention is paid to the interpretability of the obtained results. This is important because the data generation process should be understandable to the domain expert to ensure the correctness of the dataset.
The Kara Sea domain was used as a case study. To generate the reference long-term dataset, an offline-coupled system of pan-Arctic ocean, ice, atmospheric and wave models was configured and executed on the Lomonosov-2 supercomputer. We conducted a set of experiments to compare different approaches to sub-model identification.
The paper is structured as follows. Sec. 2 describes the problem statement and the mathematical formalisation of the synthetic data generation task. Sec. 3 provides an overview of related work on meteorological time series clustering and pattern recognition. Sec. 4 describes the metocean datasets and the methods used for clustering and pattern search in the source data. Sec. 5 presents the results of the conducted experiments and their analysis. Finally, Sec. 6 summarises the paper and generalises the obtained experimental results.
2. Problem statement
The initial problem in using data-driven models for environmental tasks is the multidimensional relation among many processes. An additional problem is the non-stationarity of the process and, in the general case, the lack of information about the type of non-stationarity. Under such limitations, any prior assumption about the model of the process is very uncertain. This affects the choice of appropriate mathematical methods and makes the conclusions biased (e.g. using methods for Gaussian processes when the target process is far from Gaussian). To overcome this issue, it seems reasonable to first identify and split the initial data into intervals that have homogeneous statistical characteristics. For environmental data, homogeneous statistical characteristics are usually associated with certain weather patterns. Such homogeneous chunks are more suitable for identifying the structure of a prediction model. In this article, we investigate the possibility of identifying patterns with a homogeneous structure of the approximation function. Sparse polynomial regression with L1 regularisation (LASSO) was chosen as the class of approximation functions. This choice is motivated by the ability of curvilinear regression to approximate some kinds of non-linearities and by the possibility of discovering its structure by neglecting the insignificant terms.
The fitness function with the L1 penalty for the LASSO model can be written as follows:
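$$\min_{\beta \in \mathbb{R}^{p}} \left\{ (Y - X\beta)^{T} (Y - X\beta) + \lambda \|\beta\|_{1} \right\}, \quad \text{where } \|\beta\|_{1} = \sum_{j=1}^{p} |\beta_{j}|. \qquad (1)$$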
In Eq. (1), X is the matrix of predictors (including curvilinear terms), Y is the response vector (the predicted variable), and λ is the degree of penalisation. As a proof of concept, an example of the visible separation of LASSO coefficients into patterns and their alignment with different environmental situations is presented in Fig. 1.
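To make the role of the penalty in Eq. (1) concrete, the following minimal sketch minimises that objective by cyclic coordinate descent in plain Python. It is an illustration under stated assumptions, not the implementation used in the paper: the function names, the toy design matrix and the value of λ are all hypothetical. Note that many library solvers rescale the squared-error term, so their penalty parameter is not numerically identical to λ in Eq. (1).

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0.0 inside the dead zone."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_objective(X, y, beta, lam):
    """Value of (Y - X beta)^T (Y - X beta) + lam * |beta|_1 from Eq. (1)."""
    residual_sq = 0.0
    for row, target in zip(X, y):
        prediction = sum(x_ij * b_j for x_ij, b_j in zip(row, beta))
        residual_sq += (target - prediction) ** 2
    return residual_sq + lam * sum(abs(b_j) for b_j in beta)


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimise Eq. (1) by updating one coefficient at a time."""
    n_features = len(X[0])
    beta = [0.0] * n_features
    for _ in range(n_sweeps):
        for j in range(n_features):
            rho = 0.0  # inner product of column j with the partial residual
            z = 0.0    # squared norm of column j
            for row, target in zip(X, y):
                partial = sum(row[k] * beta[k] for k in range(n_features) if k != j)
                rho += row[j] * (target - partial)
                z += row[j] ** 2
            # Eq. (1) has no 1/2 factor on the squared error, so the threshold is lam / 2.
            beta[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return beta


# Hypothetical toy data: an orthogonal design with one strong and one weak predictor.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2]
beta = lasso_coordinate_descent(X, y, lam=1.0)
```

In this toy case the weak predictor is set exactly to zero while the strong one is only shrunk, which is the sparsity property that makes the fitted coefficients usable as an interpretable signature of a chunk.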
Fig. 1. An example of the interaction between a wave-breaker and the metocean environment. The red rectangles highlight the coefficients for the open water period (without ice coverage); the small blue rectangles inside them highlight the anomalous negative values of the regression coefficients that indicate the existence of a specific pattern.
3. Related work
The clustering of environmental data is a special case of multivariate non-stationary time series clustering [13]. The clustering and classification of time series is a challenging problem that is widely studied but still has no universal solution [1]. Weather data form multidimensional time series, and specific methods are used to cluster them. For example, in [12] the application of the Finite Element Method is considered in the context of time series analysis. The approach allows the identification of hidden regimes in time series characterised by a set of model-specific parameters and a model distance functional. Other approaches can also be applied, such as deep neural networks [9] for classification, and the autoregressive metric [5] and the dynamic time warping metric [17] for clustering.
A common approach to the interpretation of multivariate time series and the clusters in them is dimensionality reduction and visualisation techniques [26]. However, the internal structure of some types of data-driven models (regression-based models, tree-based models) can also be interpreted directly [21]. For this reason, the model-based approach to time series clustering can be chosen as an appropriate method according to several studies [11,34,4]. Dimensionality reduction methods can also be applied to the parameters of weather models [15,14].
A separate important issue is determining the number of clusters for the subsequent clustering of weather time series. Since we do not know in advance how many hidden modes (clusters) can be observed in a time series, we can define a metric by which the quality of the identified clusters is evaluated for different numbers of clusters [2].
There are also several approaches to the task of synthetic environment identification that can be applied to process the time series in the identified clusters. The first is based on the application of data-driven or statistical models to restore the main variable from a set of other variables. Stochastic models [29], autoregressive models [3] and neural networks [30] are widely applied in this area. The main disadvantage of many methods is the insufficient quality of rare event reconstruction and the unreliable preservation of data consistency, while a specific feature of metocean data is the strong physical dependence between the different environmental processes represented as variables in a dataset.
The other approach is weather generators [40]. They allow creating consistent time series based on an existing environmental dataset. A promising approach is to identify the multi-year variability using ARIMA or wavelet-based models [36] and then apply Markov chain models [35] to reconstruct the daily variability (most weather generators do not represent hourly variability; however, there are a few techniques that support hourly simulations too [25,33]).
The application of data-driven models to wind wave forecasting is also widely discussed in the literature. They can be applied both to forecasting [41] and to the extension of real observations with synthetic data [30].
Usually, wind variables shifted in time and space are used to reproduce the corresponding wave characteristics at a point [20]. Neural networks can also be applied to reproduce the whole wave field in a region [18]. To increase the quality of the simulation for a specific group of events, a problem-specific model can be applied instead of common data-driven wave models. This specific model can differ in structure or in training dataset.
The described methods can be used to generate synthetic external data for various simulations, but the specific task of the artificial environment requires a more flexible and interpretable approach. Based on the results of these studies, we decided to base the research on linear regression to build a proof-of-concept model for non-stationary metocean time series based on interpretable clustering.
4. Data and methods
4.1. Dataset overview
We chose the Arctic region as a case study for several reasons. First of all, the dynamics of the sea ice cover changes the wind-wave interaction, which makes the task of reliable data-driven model identification more complicated. Also, the insufficient spatio-temporal coverage of observations and reanalyses in polar regions makes the synthetic data generation task highly relevant. Finally, the climate changes of the last decades make it necessary to take long-term trends into account during data analysis.
The reference dataset was obtained using a system of state-of-the-art high-resolution computer models. It includes the NEMO ocean model [24], the LIM3 ice model [32], the WRF atmospheric model [31] and the WAVEWATCH III model [38]. The configuration was specifically adapted to the regional Arctic simulation as described in [16]. The time resolution of all models is 1 hour. The 30-year run was executed on the Lomonosov-2 supercomputer.
The final datasets contain 8 variables that describe wave-wind processes (datetime, wind speed, wind direction, sea surface height, significant wave height, wave peak period, mean wave period, ice concentration). As a case study, 18 points were chosen in a small area of the Kara Sea near the Ob Bay coast. Their locations and the sea depths at the points are presented in Fig. 2. To analyse the variability at different scales, we prepared daily- and yearly-averaged time series.
Fig. 2. The locations and indexes of the selected data points in the Kara Sea. The colormap represents the sea depth in each grid cell.
To obtain a synthetic dataset similar to the real (model-based) one, a stochastic weather generator was configured using the R package weathergen obtained from the repository [39]. It uses ARIMA and Markov chain models to generate a dataset of the desired length. The source code was modified to support various input and output variables (the reference package was strictly configured for a specific set of variables only).
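The Markov chain component of such a generator can be illustrated with a short, self-contained sketch. It is not the code of the R package referenced above: it assumes the series has already been discretised into a small number of states (for example 0 for calm and 1 for storm, a hypothetical encoding), estimates a first-order transition matrix from the observed sequence and samples a synthetic sequence of the desired length.

```python
import random


def estimate_transitions(states, n_states):
    """Row-normalised transition counts of a first-order Markov chain."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that was never left gets a uniform row so sampling never fails.
        matrix.append([c / total for c in row] if total else [1.0 / n_states] * n_states)
    return matrix


def simulate(matrix, start, length, seed=0):
    """Sample a state sequence of the given length from the transition matrix."""
    rng = random.Random(seed)
    sequence = [start]
    for _ in range(length - 1):
        weights = matrix[sequence[-1]]
        sequence.append(rng.choices(range(len(weights)), weights=weights)[0])
    return sequence


# Hypothetical observed state sequence (0 = calm, 1 = storm).
observed = [0, 0, 1, 1, 1, 0, 0, 0, 1, 0]
P = estimate_transitions(observed, n_states=2)
synthetic = simulate(P, start=observed[0], length=20)
```

A full weather generator additionally models low-frequency variability (the ARIMA part) and maps the sampled states back to continuous variable values; both steps are omitted here.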
4.2. Model-based clustering and pattern mining
Metocean multivariate time series are characterised by specific annual and inter-annual patterns that can be identified using statistical techniques [37]. Clustering methods also allow identifying local sea wave events (storms, calm periods, swells, ice and ice-free periods) even without manual pre-labelling. At the same time, the clustering results should be interpretable for the domain expert.
To reach these aims, we decided to build a data-driven model with the significant wave height (hs) as the target variable and a set of predictor variables that includes wind, ice and wave characteristics at the target and neighbouring points.
To reduce the dimension of the original data and select the most significant predictors, we used the LASSO method with the alpha parameter (the constant that multiplies the L1 term) equal to 0.1. It can be seen from Fig. 3 that in some cases the coefficients of the predictors turn to zero, which indicates the insignificance of the corresponding component of the model and can therefore be associated with a weather pattern.
Fig. 3. Coefficients of LASSO. In some cases the coefficients of the predictors turn to zero, which indicates the insignificance of the corresponding component of the model and can therefore be associated with a weather pattern.
The concept of the model-based clustering of time series is illustrated in Fig. 4a, which also demonstrates the chunk concept used in the algorithm as a sliding window of pre-defined size.
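The chunk concept can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes non-overlapping chunks, uses a single predictor (wind speed) with ordinary least squares instead of the multi-predictor LASSO model, and all function names and the toy series are hypothetical. The default chunk size of 72 hours is the value reported in this section.

```python
def split_into_chunks(values, chunk_size):
    """Non-overlapping chunks; a trailing remainder shorter than chunk_size is dropped."""
    n_full = len(values) // chunk_size
    return [values[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope


def chunk_coefficients(wind, waves, chunk_size=72):
    """Fit one regression per chunk and return the (intercept, slope) pairs in order."""
    pairs = zip(split_into_chunks(wind, chunk_size), split_into_chunks(waves, chunk_size))
    return [fit_line(x, y) for x, y in pairs]


# Toy hourly series: waves respond to wind in the first half and not in the second
# (a crude stand-in for open-water and ice-covered conditions).
wind = [float(t % 12) for t in range(288)]
waves = [0.2 * w if t < 144 else 0.0 for t, w in enumerate(wind)]
coefs = chunk_coefficients(wind, waves)
```

The resulting list of per-chunk coefficients is the object that is clustered afterwards: chunks from the same regime end up with similar coefficient values.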
At first, all observations were divided into equal chunks. It should be noted that the quality of clustering depends on the chunk size. The metric used was the adjusted Rand score, which computes a similarity measure between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or different clusters in the predicted and real clusterings. During the experiments, it was found that the most efficient chunk size is 72 hours; the values of the metric for different chunk sizes can be seen in Fig. 4b. The architecture of the proposed solution is presented in Fig. 5. The density estimations for the distribution of the coefficients in different environmental cases are presented in Fig. 6.
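For reference, the pair-counting definition of the adjusted Rand score can be written in a few lines of plain Python. The sketch below follows the standard formula (pair counts corrected for chance agreement); the function name is illustrative, and in practice a library implementation would normally be used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting agreement between two labelings, corrected for chance."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Numbers of sample pairs that share a cluster in both, the first, or the second labeling.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    maximum = 0.5 * (sum_true + sum_pred)
    if maximum == expected:  # degenerate case, e.g. a single cluster in both labelings
        return 1.0
    return (sum_pairs - expected) / (maximum - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The score does not depend on how the clusters are numbered, so two labelings that describe the same partition score 1 even when the label values are swapped, while labelings that agree less than expected by chance score below 0.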
Fig. 4. The visualisation of the model-based clustering approach. The upper plot contains the time series from the dataset; the lower plot contains the regression coefficients identified for every subset.
Fig. 5. The proposed workflow for interpretable generation of case-specific synthetic environmental datasets.
Fig. 6. The density estimation of the distribution of the regression coefficient values for different cases: storm and calm sea. The difference between the distributions confirms the idea of coefficient-based clustering for the identification of specific environmental cases.
5. Experimental studies
Several experiments were conducted to examine the possibility of regression-based time series clustering in different cases. The main aspects to be validated are the practical possibility of identifying interpretable clusters and the effectiveness of the model trained on the real dataset for the simulated synthetic time series.
5.1. Chunk-based clustering using k-means
The following experiment was aimed at testing the possibility of clustering weather patterns in the parameter space of a linear regression built on the data at a single point. The wave height was taken as the response of the linear regression, and three parameters selected by LASSO were taken as predictors: wind speed, wave peak period and mean wave period.
Then, a linear regression model was trained on each chunk, and its coefficients were stored in sequence. The resulting visualisation in the space of model coefficients, however, demonstrates that it is rather difficult to identify explicit patterns purely visually. Therefore, an attempt was made to partition the obtained coefficients using the k-means method. This partition revealed two clusters, which again may indicate the presence of two periods in the observations: ice and ice-free.
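The clustering step in coefficient space can be illustrated with a minimal k-means (Lloyd's algorithm) in plain Python. The coefficient vectors below are made-up values for illustration only, not results from the paper, and the starting centroids are passed explicitly to keep the sketch deterministic.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, initial_centroids, n_iter=100):
    """Lloyd's algorithm with explicit starting centroids (for reproducibility)."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                updated.append(old)  # keep the centroid of an empty cluster unchanged
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical per-chunk coefficients (wind speed, wave peak period, mean wave period).
coefficients = [
    [0.21, 0.05, 0.10], [0.19, 0.04, 0.12], [0.20, 0.06, 0.11],  # open-water-like chunks
    [0.00, 0.00, 0.00], [0.01, 0.00, 0.00], [0.00, 0.01, 0.00],  # ice-covered-like chunks
]
labels, centroids = kmeans(coefficients, initial_centroids=[coefficients[0], coefficients[-1]])
```

Each label can then be mapped back to the time interval of its chunk, which is how a cluster in coefficient space becomes a labelled period in the original time series.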
5.2. Dimensionality reduction and density-based clustering
The second subset of experiments was aimed at checking the effectiveness of clustering under different ice conditions. The t-SNE [23] dimensionality reduction method was used to project the coefficient space to 2D. Then, the DBSCAN [8] algorithm was applied for the clustering, because it estimates not just the labels but also the number of clusters. The results of processing the full dataset are presented in Fig. 7a. Two clear clusters are highlighted.
Fig. 7. The results of t-SNE clustering for a) the full dataset b) the ice-free period
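The property of DBSCAN used here, namely that the number of clusters is an output rather than an input, can be seen in a compact plain-Python sketch. It assumes the 2D embedding has already been computed (the t-SNE step is not reproduced), and the points, eps and min_samples values are hypothetical; a library implementation would normally be preferred for real data.

```python
def region_query(points, index, eps):
    """Indices of all points within eps of points[index] (including itself)."""
    px, py = points[index]
    return [
        j for j, (qx, qy) in enumerate(points)
        if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2
    ]


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


# Hypothetical 2D embedding of per-chunk coefficients: two dense groups and one outlier.
embedding = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0),
]
labels = dbscan(embedding, eps=0.5, min_samples=3)
n_clusters = len(set(labels) - {-1})
```

The result depends on eps and min_samples, and because t-SNE does not preserve global distances, these parameters have to be chosen for the embedding at hand rather than for the original coefficient space.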
To demonstrate the interpretation of the clusters, the time series with cluster-specific markup are presented in Fig. 8a. However, it is still hard to identify the sub-clusters for the ice-free period. Therefore, the ice period (when the wave height equals zero) was removed from the time series and the data were clustered again. As can be seen in Fig. 7b, the density-based clustering still yields two clear groups. To interpret the clusters, the time series for 3 years were visualised again in Fig. 8b. It can be seen that the ice melting and freezing cases are covered by a smaller cluster in the reduced space, which allows separating them from both ice-covered and ice-free cases using only the obtained regression coefficients.
Fig. 8. The clustered time series: a) for the full dataset: the green colour represents the cluster for ice-free and freeze-melt periods, the red colour represents the cluster for the ice-covered sea; b) for the ice-free period in the dataset: the green colour represents the cluster for the ice-free period, the red colour represents freeze-melt periods.
5.3. Cluster-specific synthetic data generation
The next experiment consists of synthetic data generation based on one of the clusters. The modified weather generator described above was initialised with the ice melt-freeze cluster (green colour in Fig. 7b). Then, the data from the identified cluster were used as a training set for the generator. The comparison of the obtained synthetic data with the real data and the generic synthetic data is presented in Fig. 9.
Fig. 9. The estimation of the distribution density for environmental variables from the cluster-based synthetic data (red), real cluster data (blue) and
generic synthetic data (black).
It can be seen that the distribution of the variables in the common time series differs from both the cluster data and the synthetic cluster-based values. At the same time, the real and synthetic data for the cluster are similar enough.
6. Conclusion
In this paper, an analysis of different approaches to data-driven model identification and interpretation was provided. The time series for each spatial point were separated into chunks. A LASSO regression was fitted to each chunk. The optimal chunk size of 72 hours was estimated using a set of experiments with the adjusted Rand index metric. Several interpretable clusters were recognised and analysed by comparing the distributions of the variables in the clusters. Then, synthetic time series were generated for the identified cluster.
The obtained results confirm the practical effectiveness of the clustering in the regression coefficient space described in the paper. The proposed approach to metocean time series clustering allows increasing the quality of synthetic data generation for specific cases.
All source code and materials used in the paper are available in the repository [27].
Acknowledgements
This research is financially supported by The Russian Scientific Foundation, Agreement #19-71-00150.
The research is carried out using the equipment of the shared research facilities of HPC computing resources at
Lomonosov Moscow State University.
References
[1] Aghabozorgi, S., Shirkhorshidi, A.S., Wah, T.Y., 2015. Time-series clustering–a decade review. Information Systems 53, 16–38.
[2] Andrew, C., Julie, K., L., 2012. Data clustering reveals climate impacts on local wind phenomena. Journal of Applied Meteorology and
Climatology 51, 1547–1557.
[3] Biller, B., Nelson, B.L., 2003. Modeling and generating multivariate time-series input processes using a vector autoregressive technique. ACM
Transactions on Modeling and Computer Simulation (TOMACS) 13, 211–237.
[4] Chamroukhi, F., Samé, A., Aknin, P., Govaert, G., 2011. Model-based clustering with hidden markov model regression for time series with
regime changes, in: The 2011 International Joint Conference on Neural Networks, IEEE. pp. 2814–2821.
[5] Corduas, M., Piccolo, D., 2008. Time series clustering and classification by the autoregressive metric. Computational statistics & data analysis
52, 1860–1872.
[6] Dee, D.P., Uppala, S., Simmons, A., Berrisford, P., Poli, P., Kobayashi, S., Andrae, U., Balmaseda, M., Balsamo, G., Bauer, d.P., et al., 2011.
The era-interim reanalysis: Configuration and performance of the data assimilation system. Quarterly Journal of the royal meteorological
society 137, 553–597.
[7] Ehlers, S., Kujala, P., Veitch, B., Khan, F., Vanhatalo, J., 2014. Scenario based risk management for arctic shipping and operations, in:
ASME 2014 33rd International Conference on Ocean, Offshore and Arctic Engineering, American Society of Mechanical Engineers. pp.
V010T07A006–V010T07A006.
[8] Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with
noise., in: Kdd, pp. 226–231.
[9] Fawaz, H.I., Forestier, G., Weber, J., Idoumghar, L., Muller, P.A., 2019. Deep learning for time series classification: a review. Data Mining
and Knowledge Discovery , 1–47.
[10] Garric, G., Parent, L., Greiner, E., Drévillon, M., Hamon, M., Lellouche, J.M., Régnier, C., Desportes, C., Le Galloudec, O., Bricaud, C., et al.,
2017. Performance and quality assessment of the global ocean eddy-permitting physical reanalysis glorys2v4., in: EGU General Assembly
Conference Abstracts, p. 18776.
[11] Grun, B., 2018. Model-based clustering. arXiv preprint arXiv:1807.01987 .
[12] Horenko, I., 2010a. Finite element approach to clustering of multidimensional time series. SIAM J. Sci. Comput. 32, 62–83.
[13] Horenko, I., 2010b. On clustering of non-stationary meteorological time series. Dynamics of Atmospheres and Oceans 49, 164–187.
[14] Horenko, I., Johannes, S.E., Christof, S., 2006. Set-oriented dimension reduction: Localizing principal component analysis via hidden markov
models. Lecture Notes in Bioinformatics 4216, 98–115.
[15] Horenko, I., Rupert, K., Stamen, D., Christof, S., 2008. Automated generation of reduced stochastic weather models i: simultaneous dimension
and model reduction for time series analysis. SIAM Mult. Mod. Sim. 6, 1125–1145.
[16] Hvatov, A., Nikitin, N.O., Kalyuzhnaya, A.V., Kosukhin, S.S., 2018. Adaptation of nemo-lim3 model for multigrid high resolution arctic
simulation. arXiv preprint arXiv:1810.03657 .
[17] Izakian, H., Pedrycz, W., Jamal, I., 2015. Fuzzy clustering of time series data using dynamic time warping distance. Engineering Applications
of Artificial Intelligence 39, 235–244.
[18] James, S.C., Zhang, Y., O’Donncha, F., 2018. A machine learning framework to forecast wave conditions. Coastal Engineering 137, 1–10.
[19] Jianping, T., Ming, Z., Bingkai, S., 2007. The effects of model resolution on the simulation of regional climate extreme events. Journal of
Meteorological Research 21, 129–140.
[20] Kambekar, A., Deo, M., 2010. Wave simulation and forecasting using wind time history and data-driven methods. Ships and Offshore
Structures 5, 253–266.
[21] Lipton, Z.C., 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490 .
[22] Lopatoukhin, L., Boukhanovsky, A., 2009. Multivariable extremes of metocean events (extreme and freak as the examples), in: EGU General
Assembly Conference Abstracts, p. 2098.
[23] Maaten, L.v.d., Hinton, G., 2008. Visualizing data using t-sne. Journal of machine learning research 9, 2579–2605.
[24] Madec, G., et al., 2015. Nemo ocean engine .
[25] Mezghani, A., Hingray, B., 2009. A combined downscaling-disaggregation weather generator for stochastic generation of multisite hourly
weather variables over complex terrain: Development and multi-scale validation for the upper rhone river basin. Journal of Hydrology 377,
245–260.
[26] Nguyen, M., Purushotham, S., To, H., Shahabi, C., 2017. m-tsne: A framework for visualizing high-dimensional multivariate time series. arXiv
preprint arXiv:1708.07942 .
[27] Nikitin, N., Deeva, I., 2019. The source code for the experimental studies presented in the paper. URL: https://github.com/Anaxagor/Wave_wind_model.git.
[28] Nikitin, N.O., Spirin, D.S., Visheratin, A.A., Kalyuzhnaya, A.V., 2016. Statistics-based models of flood-causing cyclones for the baltic sea
region. Procedia Computer Science 101, 272–281.
[29] Papalexiou, S.M., 2018. Unified theory for stochastic modelling of hydroclimatic processes: Preserving marginal distributions, correlation
structures, and intermittency. Advances in water resources 115, 234–252.
[30] Peres, D., Iuppa, C., Cavallaro, L., Cancelliere, A., Foti, E., 2015. Significant wave height record extension by neural networks and reanalysis
wind data. Ocean Modelling 94, 128–140.
[31] Powers, J.G., Klemp, J.B., Skamarock, W.C., Davis, C.A., Dudhia, J., Gill, D.O., Coen, J.L., Gochis, D.J., Ahmadov, R., Peckham, S.E., et al.,
2017. The weather research and forecasting model: Overview, system efforts, and future directions. Bulletin of the American Meteorological
Society 98, 1717–1737.
[32] Rousset, C., Vancoppenolle, M., Madec, G., Fichefet, T., Flavoni, S., Barthélemy, A., Benshila, R., Chanut, J., Levy, C., Masson, S., Vivier, F.,
2015. The louvain-la-neuve sea ice model lim3.6: global and regional capabilities. Geoscientific Model Development 8, 2991–3005. URL:
https://www.geosci-model-dev.net/8/2991/2015/, doi:10.5194/gmd-8-2991-2015.
[33] Safeeq, M., Fares, A., 2011. Accuracy evaluation of climgen weather generator and daily to hourly disaggregation methods in tropical conditions.
Theoretical and applied climatology 106, 321–341.
[34] Samé, A., Chamroukhi, F., Govaert, G., Aknin, P., 2011. Model-based clustering and segmentation of time series with changes in regime.
Advances in Data Analysis and Classification 5, 301–321.
[35] Shamshad, A., Bawadi, M., Hussin, W.W., Majid, T., Sanusi, S., 2005. First and second order markov chain models for synthetic generation of
wind speed time series. Energy 30, 693–708.
[36] Steinschneider, S., Brown, C., 2013. A semiparametric multivariate, multisite weather generator with low-frequency variability for use in
climate risk assessments. Water resources research 49, 7205–7220.
[37] Stopa, J.E., Cheung, K.F., Tolman, H.L., Chawla, A., 2013. Patterns and cycles in the climate forecast system reanalysis wind and wave data.
Ocean Modelling 70, 207–220.
[38] Tolman, H.L., et al., 2009. User manual and system documentation of wavewatch iii tm version 3.14. Technical note, MMAB Contribution
276, 220.
[39] Walker, J.D., 2019. Introduction to the weathergen package. URL: http://walkerjeffd.github.io/weathergen/.
[40] Wilks, D.S., Wilby, R.L., 1999. The weather generation game: a review of stochastic weather models. Progress in physical geography 23,
329–357.
[41] Zamani, A., Solomatine, D., Azimian, A., Heemink, A., 2008. Learning from data for wind–wave forecasting. Ocean engineering 35, 953–962.