Assessing methods for multiple imputation of systematic missing data in marine fisheries time series with a new validation algorithm

Time series from fisheries often contain multiple missing data. This is a severe limitation that prevents using the data for research on population dynamics, stock assessment, forecasting, and, hence, decision-making around marine resources. Several methods have been proposed to impute missing data in univariate time series, but their performance depends not only on the amount of missing data but also on the data structure. This study compares the performance of twelve imputation methods on the time series of marine fishery landings for six species in the Colombian Pacific Ocean. Unlike other studies, we validate the precision of the imputations in the same target time series that include missing data, using the Known Sub-Sequence Algorithm (KSSA), a novel validation approach that simulates missing data in known sub-sequences of the target time series. The results showed that the best methods for imputation are Seasonal Decomposition with Kalman filters and Structural Models with Kalman filters fitted by maximum likelihood. Results also show that validating the imputation methods with time series other than the target time series leads to wrong choices of imputation method. It is noteworthy that these methods and the validation framework are mainly suited to time series with a non-random distribution of missing data, that is, missing data produced systematically in chunks or clusters with predictable frequency, which are common in marine sciences.
Aquaculture and Fisheries xxx (xxxx) xxx
Please cite this article as: Iván F. Benavides, Aquaculture and Fisheries, https://doi.org/10.1016/j.aaf.2021.12.013
2468-550X/© 2021 Shanghai Ocean University. Publishing services by Elsevier B.V. on behalf of KeAi Communications Co. Ltd. This is an open access article
under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Original research article
Iván F. Benavides a, Marlon Santacruz b,*, Jhoana P. Romero-Leiton c,d, Carlos Barreto e, John Josephraj Selvaraj a
a Instituto de Estudios del Pacífico, Universidad Nacional de Colombia, Tumaco Campus, Kilómetro 30-31, Cajapí Vía Nacional Tumaco Pasto, Tumaco, Nariño, 522020, Colombia
b Ingeniería en Producción Acuícola, Universidad de Nariño, Cra. 22 #30-63, Torre 4 apto. 703, San Juan de Pasto, Nariño, 520004, Colombia
c Facultad de ingeniería, Universidad Cesmag, Pasto, 520004, Colombia
d Datambiente, 520001, Colombia
e Autoridad Nacional de Pesca y Acuicultura (AUNAP), 520001, Colombia
ARTICLE INFO
Keywords:
Data gaps
Fish populations
Artisanal fishing
Time series imputation
1. Introduction
Marine fisheries represent one of the most important activities for the economy and development of societies worldwide, and the data they provide are fundamental for research and decision-making on marine resources (FAO, 2020). Time series are the most common and useful type of data obtained from marine fish landings, which normally involve monthly or yearly records of various lengths (Farmer & Froeschke, 2015). These time series constitute the basis for studies on population dynamics, stock assessment, food webs, and forecasting, among others (Litzow & Urban, 2009; Pikitch et al., 2004; Selvaraj et al., 2020; Stergiou & Christou, 1996).
However, marine fishery landings time series frequently contain multiple missing data (MD) that hinder their use for research because most of the statistical models in marine science require complete series (De Valpine, 2002; Peacock et al., 2020; Rudd & Branch, 2016; Selvaraj et al., 2020). This often makes time series useless, in spite of the good information they may contain. Causes of MD are diverse, though common examples include discontinuity in data recording, instrument failure, bureaucratic problems in the institutions in charge of data collection and management, and dubious values that must be discarded from the datasets (Kihoro et al., 2013; Afrifa-Yamoah et al., 2020). Some researchers just remove MD and work with the available information left in the time series, or fill MD with means or medians (Engels & Diehr, 2003; Wu et al., 2015). Nevertheless, these practices are dangerous because they corrupt the nature of the series, alter the outcome of the statistical models, and bias the parameter estimations (Junger & Ponce de Leon, 2015), leading to wrong conclusions and costly decisions.
* Corresponding author.
E-mail address: ibenavidesm@unal.edu.co (M. Santacruz).
Received 24 June 2021; Received in revised form 24 November 2021; Accepted 24 December 2021
Imputation methods are a reasonable way to handle multiple MD and help to prevent misleading conclusions and to avoid these issues. These methods fill in MD according to different criteria (Afrifa-Yamoah et al., 2020; Rubin, 1987; Spratt et al., 2010). However, imputation in univariate time series such as marine fishery landings, which do not include accompanying independent covariates, implies severe challenges since the only information available for the imputation is the values throughout the time series itself (Moritz et al., 2015). In fisheries, where the univariate landings somehow reflect a natural behavior of the ocean in terms of periodic high and low fish production, time dependency structures such as trend, seasonality, and cycles are of high importance for MD imputation. Therefore, these time dependency structures must be captured by any method that intends to perform reliable imputations of multiple MD in univariate fishery time series (Shumway & Stoffer, 2011, p. 218).
Many methods that meet these requirements are being proposed and continuously implemented as libraries in the different statistical programming software. Some of these algorithms, such as Kalman filters (Afrifa-Yamoah et al., 2020), Structural models (Moritz & Bartz-Beielstein, 2017), ARIMA models (Kihoro et al., 2013; Magare et al., 2020), Singular Spectrum Analysis (Mahmoudvand & Canas-Rodrigues, 2016) and others, have been successfully used in fields other than fisheries such as engineering, energy, meteorology, econometrics, medicine, biochemistry, etc. Some of these studies have compared the efficiency of different imputation methods on other time series structures (Demirhan & Renwick, 2018; Engels & Diehr, 2003; Hassani et al., 2019; Huque et al., 2018; Liu et al., 2020; Wei et al., 2018; Yozgatligil et al., 2013), to show the importance of choosing the best-suited method for a specific time series. For the fishery and marine ecology sciences, some studies have covered the issue of MD in univariate time series by using imputation methods such as linear interpolation (Coro et al., 2016), structural models (Selvaraj et al., 2020), ARIMA (Preciado et al., 2006), transfer functions (Preciado et al., 2006) and expansion factors (Peacock et al., 2020). However, these studies have not focused on comparing different imputation methods but only on using them to impute a few values in order to proceed to other analyses such as forecasting or stock assessment. A study conducted by Genolini et al. (2013) used a time series from an automatic pattern recognition system applied to the monitoring of fish migration and compared the efficiency of 12 imputation methods with an MD simulation framework. However, this study was not aimed at fishery or marine sciences specifically; it only used this fish dataset as an example.
One very important observation that arises from reviewing several published papers on method comparison for multiple MD imputation in disciplines other than fisheries is that the efficiency of each method is dataset-dependent (Schaffer, 1997; Yozgatligil et al., 2013; Schafer & Graham, 2002). This means that each method performs differently depending on the structure of each time series; in other words, there is no single method suited to all types of data. This needs to be taken seriously into account when attempting to choose an appropriate method for multiple MD imputation in marine fishery landings time series, which are characterized by being heterogeneous in structural elements such as length, seasonality, trend, autocorrelation and stochastic components (Coro et al., 2016; FAO, 2017; Koslow & Davison, 2016; Selvaraj et al., 2020). Thus, the best approach would always be to compare several imputation methods for the same target time series (time series of interest), perform validations, and finally make a decision about the best approach.
Things get more complicated when considering the type of MD in time series. For example, when data are Missing Completely at Random (MCAR), the imputations are more easily done than when dealing with data Not Missing at Random (NMAR) (Little & Rubin, 1987). This is because NMAR usually implies missing observations in chunks that might be big enough to hide the structural components of the time series (Schafer, 1997; Donders et al., 2006; Beck et al., 2018). However, if there is enough temporal structure in the series, imputation methods that leverage characteristics such as seasonality, cycles, and autocorrelation will have an advantage over other more simplistic methods. Hence, different methods will impute multiple MD in marine fisheries time series with different precision, and accuracy will depend on the number of MD chunks and the MD size (Beck et al., 2018).
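To make the distinction concrete, the following minimal Python sketch builds two missingness masks with the same number of MD: one scattered completely at random and one produced systematically in chunks. It is an illustration only, not code from this study (which was written in R); the function names, the 96-month length and the three-month gap at the start of each year are assumptions chosen for the example.

```python
import random

def simulate_mcar(n, n_missing, rng):
    """Boolean mask with n_missing positions drawn uniformly at random."""
    missing = set(rng.sample(range(n), n_missing))
    return [i in missing for i in range(n)]

def simulate_chunks(n, period, chunk_len, offset=0):
    """Boolean mask with one contiguous gap of chunk_len in every period."""
    mask = [False] * n
    for start in range(offset, n, period):
        for i in range(start, min(start + chunk_len, n)):
            mask[i] = True
    return mask

def longest_gap(mask):
    """Length of the longest run of consecutive missing positions."""
    best = run = 0
    for is_missing in mask:
        run = run + 1 if is_missing else 0
        best = max(best, run)
    return best

rng = random.Random(1)
n = 96  # hypothetical: eight years of monthly records
mcar_mask = simulate_mcar(n, 24, rng)
chunk_mask = simulate_chunks(n, period=12, chunk_len=3)
print(longest_gap(mcar_mask), longest_gap(chunk_mask))
```

Both masks hide 24 of 96 values, but the chunked mask removes the same months every year, which is the situation in which a gap can hide part of the seasonal pattern.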
From the above information, we can highlight three important points that could determine the efficiency of methods for multiple MD imputation in marine fishery time series: 1) the type of method, 2) the structure of the time series, and 3) the type and size of MD. This implies that, to choose the best imputation method for a particular time series, the validation of methods is a keystone aspect for the success of the process. Validation of imputation methods is commonly done using complete data (without MD) other than the target time series (Hassani et al., 2019; Moritz et al., 2015; Phan et al., 2020) (classical validation). However, for the reasons explained before, this kind of validation can lead to wrong choices of imputation methods. This happens, for example, if the target time series is seasonal but the validation is performed with non-seasonal data, if the target time series has a trend but the validation data is stationary, or if the target time series is cyclical but the validation data is not. These kinds of mismatches can lead to wrong choices of imputation methods, unreliable imputed values, and therefore to corrupted time series and unrealistic conclusions if they are used in statistical analysis and decision making. Additionally, if one wishes to validate imputation methods with data other than the target time series, it could be difficult to find available complete data from marine fishery landings with a structure very similar to that of the target time series.
Therefore, a possible solution for the problem described above is to conduct validations on the same target time series, ensuring that the method selected for making imputations is best suited for the data of interest. Although this raises the well-known restriction of making imputations in series that already contain MD (Phan et al., 2020), here we propose and develop a validation algorithm that overcomes this obstacle by taking advantage of sub-sequences within the time series without MD that hold enough length and structure to allow us to simulate new MD, impute it with different methods, and compare the results with performance metrics in order to select a method that reduces the overall error of imputations (see further details in Materials and Methods). We assess and compare the imputation efficiency of twelve methods for multiple MD imputation on marine fishery landings time series of six species from the Colombian Pacific Ocean, and validate the results with KSSA in order to find the method best suited to each time series. To evaluate the efficiency of KSSA, we compare its results with classical validations performed using real-life time series from different data sources.
2. Materials and methods
2.1. Data source
Data encompass monthly time series of fishery landings (tons) for eight years and five months (January 2012 to May 2020) in the Colombian Pacific Ocean (CPO). These time series were gathered from the website of the Colombian fisheries statistical service (Servicio Estadístico Pesquero Colombiano, SEPEC, by its acronym in Spanish), which is the main tool of the national fisheries and aquaculture authority (Autoridad Nacional de Pesca y Acuicultura, AUNAP, by its acronym in Spanish) to provide public domain statistics of fishery production. We acquired the time series of the following six species: catfish (Bagre spp), white shrimp (Litopenaeus occidentalis), Pacific red snapper (Lutjanus peru), Mexican barracuda (Sphyraena ensis), seer fish (Scomberomorus spp), and cachema weakfish (Cynoscion phoxocephalus). These data belong to the monthly reports of marine fishery landings made by artisanal boats that fish from the coastline up to two miles out to sea (Fig. 1), and include a quality control certified by the National Department of Statistics in Colombia (Departamento Administrativo Nacional de Estadística, DANE, by its acronym in Spanish). This quality control emerges from a stratified and probabilistic sampling in space and time of each fishing species, which operates within homogeneous groups of fishing gear, stratum, and ecological characteristics of the fishing grounds (SEPEC 2018).
From the 148 species that SEPEC monitors in the CPO, we chose these six species because they represent the most important fishery products for regional commerce, that is, more than 80% of the economic income for artisanal fisheries (Duarte et al., 2019). Therefore, sound analyses and decision making around these important resources based on their time series would first require reliable imputations in order to start with complete series.
2.2. Data exploration and preparation
The time series for each species in SEPEC are available by municipality. There are 13 main municipalities across the CPO where fishing landings are reported. The data was downloaded for each municipality and summed up to obtain a series for the whole CPO by each species, from January 2012 to May 2020, with a monthly resolution. The structure of the MD was analyzed in terms of number of MD, gap size (number of MD per gap), gap position across the time series and the distribution of MD within and between gaps.
Fig. 1. The red stripe on the map indicates the coastal area of artisanal fishing for which SEPEC records the fishery landings time series used in this study. EEZ = Exclusive Economic Zone.

Periodogram analyses (R package 'TSA') (Chan & Ripley, 2020), Mann-Kendall tests (R package 'trend') (Thorsten 2020) and Seasonal Decomposition by moving averages (R package 'stats') (R Core Team, 2021) were used to evaluate the seasonality and trend of the time series. Runs tests (R package 'snpar') were performed to check if the distribution of MD in each univariate time series occurred at random, and a Little's test (R package 'MissMech') (Jamshidian et al., 2014) was performed to check if data was MCAR in the multivariate dataset of all species. Rejection of the null hypotheses of these tests suggests a systematic process generating NMAR-type MD across the series. Table 1 summarizes the structure of MD for each series.
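A runs test on a missingness pattern can be sketched as follows. This is a Wald-Wolfowitz runs test with the normal approximation, applied to a missing/observed indicator and written in Python purely for illustration. It is an assumption that this mirrors how the test was applied in this study, and the 'snpar' implementation may differ in its details. The example mask is hypothetical.

```python
import math

def runs_test(mask):
    """Wald-Wolfowitz runs test on a boolean missingness mask.

    Returns (number of runs, z statistic, two-sided p-value) using the
    normal approximation. Few runs (z << 0) indicate clustered MD.
    """
    n1 = sum(mask)           # missing positions
    n2 = len(mask) - n1      # observed positions
    if n1 == 0 or n2 == 0:
        raise ValueError("mask needs both missing and observed positions")
    runs = 1 + sum(1 for a, b in zip(mask, mask[1:]) if a != b)
    n = n1 + n2
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return runs, z, p_value

# Hypothetical mask: a three-month gap at the start of each of eight years.
mask = [(t % 12) < 3 for t in range(96)]
runs, z, p_value = runs_test(mask)
print(runs, round(z, 2), p_value < 0.05)
```

With gaps clustered in the same months every year, the number of runs falls far below the value expected under randomness, so the null hypothesis of a random distribution of MD is rejected.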
2.3. Methods for multiple missing data imputation
Since there is no available set of independent predictors in the study area with which to try imputation with multivariate approaches, we relied exclusively on attributes of the time series themselves, such as trend, seasonality and time-dependency structures. This is one of the constraints of multiple imputations when using univariate time series. However, when there is enough data in the series, it is possible to produce reliable imputations based on capturing these time-dependency attributes. Available published literature on multiple imputation for time series offers different methods to solve this issue, whose performance and robustness depend on the structure and quality of the data, and the structure of MD across the time series.
Twelve methods for multiple imputations in time series were used from three R packages. From package 'imputeTS' (Moritz & Bartz-Beielstein, 2017) we used: 1) Structural models with Kalman filters fitted by maximum likelihood (STRUCT); 2) Autoregressive Integrated Moving Average in State Space (ARIMA_SS); 3) Linear Interpolation (LIT); 4) Spline Interpolation (SIT); 5) Stineman Interpolation (STIT); 6) Simple Moving Average (SMA); 7) Linear Weighted Moving Average (LWMA); 8) Exponential Weighted Moving Average (EWMA); 9) Last Observation Carried Forward (LOCF) and 10) Seasonal Decomposition with Kalman filters (SEADEC). From package 'forecast' we used: 11) Linear Interpolation with Trend and Seasonal Decomposition (LTS) (Hyndman & Khandakar, 2008). From package 'Rssa', we used 12) Singular Spectrum Analysis (SSA) (Golyandina & Korobeynikov, 2014).
SEADEC, STRUCT, ARIMA, LTS and SSA offer more sophisticated approaches for missing data imputation in univariate series because they capture time-dependency attributes such as trend, seasonality and cycles. However, they require higher computation times. SIT, STIT, LIT and LOCF are based on inter-attribute correlations, whilst SMA, EWMA and LWMA are half-way between both.
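The simplest of these methods can be sketched in a few lines. The Python functions below are simplified illustrations of LOCF, linear interpolation and a simple moving average, not the 'imputeTS' implementations. How gaps at the edges of the series are handled, and the window half-width `k`, are assumptions made for this example.

```python
def locf(series):
    """Last observation carried forward; a leading gap is back-filled."""
    out, last = list(series), None
    for i, value in enumerate(out):
        if value is None:
            out[i] = last
        else:
            last = value
    first = next(v for v in out if v is not None)
    return [first if v is None else v for v in out]

def linear_interpolation(series):
    """Straight line between observed neighbours; edges held constant."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    for i in range(known[0]):
        out[i] = out[known[0]]
    for i in range(known[-1] + 1, len(out)):
        out[i] = out[known[-1]]
    return out

def moving_average(series, k=2):
    """Mean of observed values within k positions on each side of the gap.

    The window grows when it contains no observed value.
    """
    out, n = list(series), len(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        width, window = k, []
        while not window and width <= n:
            lo, hi = max(0, i - width), min(n, i + width + 1)
            window = [series[j] for j in range(lo, hi) if series[j] is not None]
            width += 1
        out[i] = sum(window) / len(window) if window else None
    return out

example = [1.0, None, None, 4.0, None]
print(locf(example))
print(linear_interpolation(example))
print(moving_average(example, k=2))
```

None of these three functions uses seasonality, which is why they are expected to degrade faster than the state-space methods as gaps grow.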
2.4. Simulation of MD, comparison and validation of methods
When comparing methods for MD imputation in univariate time series, the typical approach is to take complete series (without MD), simulate MD (MCAR or NMAR), perform the imputations and compare them with the true values. Then a measure of error such as the Root Mean Squared Error (RMSE) or Mean Absolute Scaled Error (MASE) is calculated to assess the difference between true and imputed values. This process sheds light on the best method for a particular dataset. For validation of methods, previous studies have used real complete series from sources other than the target series of interest, as well as simulated series. This is due to the claim that validation cannot be performed in series with true MD (Phan et al., 2020).
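The two error measures can be written compactly. The Python sketch below is illustrative only. In particular, the MASE shown here scales the mean absolute error by the mean absolute one-step difference of the actual series, which is one common definition; it is an assumption that this matches the variant computed in this study. The example values are invented.

```python
import math

def rmse(actual, imputed):
    """Root Mean Squared Error between true and imputed values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, imputed)]
    return math.sqrt(sum(squared) / len(squared))

def mase(actual, imputed, step=1):
    """Mean Absolute Scaled Error.

    The mean absolute error is divided by the mean absolute error of a
    naive forecast that repeats the value observed `step` periods earlier.
    """
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, imputed)) / n
    naive = sum(abs(actual[i] - actual[i - step]) for i in range(step, n)) / (n - step)
    return mae / naive

actual = [10.0, 12.0, 14.0, 16.0]    # hypothetical true landings
imputed = [10.0, 13.0, 13.0, 16.0]   # hypothetical imputed landings
print(rmse(actual, imputed), mase(actual, imputed))
```

RMSE is expressed in the units of the series (tons here), whereas MASE is unitless, which makes it easier to compare across species with landings of different magnitude.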
SEPEC does not offer any complete time series on which to simulate MD, and since the success of the imputation depends on characteristics of the data itself, we wanted to use the same SEPEC time series with true MD to perform algorithm validation, in order to achieve more reliable conclusions about their effectiveness. The next section describes a proposed method to simulate MD in target time series already containing true MD and thereby perform algorithm validation (see Fig. 2).
2.5. Known Sub-Sequence Algorithm (KSSA)
The Known Sub-Sequence Algorithm (KSSA) (Fig. 2) is a method to validate the efficiency of imputation algorithms that was developed for this research in order to support the selection of the best algorithm to impute MD in the marine fishery landing time series from SEPEC. It consists of simulating MD only in known sub-sequences of a target time series that already contains true MD. KSSA has the advantage of giving more realistic validations of imputation algorithms, since the precision of imputation methods is dataset-dependent (Genolini et al., 2013). KSSA takes a target time series of interest for imputation that contains true MD and splits it into segments according to a pre-established useful criterion (e.g., years, months, etc.). In this study, the criterion was splitting it into years, because of the dominant annual seasonality (see Table 1). Then, a first imputation of true MD is performed with a particular imputation algorithm in order to obtain a complete series on which to simulate new MD. New MD is simulated in known sub-sequences of the target time series that do not contain true MD, using a simulation window that randomly slides through the sub-sequence within each segment. Simulated MD is then imputed with the same algorithm used previously, and the obtained values are statistically compared to the real values. This process is repeated for several runs by changing the position and size of the simulation window across the known sub-sequences within each segment. Thus, high amounts of MD of varying sizes and positions can be generated across the target time series (TTS) through iterative programming. KSSA is performed with different imputation algorithms for comparative purposes.
We wrote code in RStudio (R Core Team, 2021) that performed the following eight steps: 1) imputed true MD in the original target time series of a single species with a single imputation algorithm; 2) divided the time series into 8 equal segments (one for each year); 3) set a moving and expanding window within each segment that simulated MD chunks of varying sizes (1 to 10) in known sub-sequences (avoiding previously imputed data in true MD); 4) imputed simulated MD chunks in each segment with a single algorithm and calculated RMSE and MASE (R package 'metrics') between the actual and the imputed time series; 5) iterated steps 1 to 4 10,000 times with bootstrapping; 6) repeated steps 1 to 5 for all algorithms; 7) repeated steps 1 to 6 for all species, and 8) stored all the results in a final data frame containing MD size, RMSE and MASE for the 10,000 runs of each algorithm and species. Fig. 3 clarifies this process in a flowchart.
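Steps 1 to 5 above can be sketched for a single series and a single algorithm as follows. This Python sketch is an interpretation of the description in the text, not the authors' R code. The function names, the linear-interpolation stand-in for the imputation algorithm, the way the window is drawn, and the synthetic seasonal series are all assumptions, and only RMSE is computed.

```python
import math
import random

def linear_interpolation(series):
    """Stand-in imputation algorithm (edges held constant)."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    for i in range(known[0]):
        out[i] = out[known[0]]
    for i in range(known[-1] + 1, len(out)):
        out[i] = out[known[-1]]
    return out

def known_runs(true_missing, start, end):
    """(start, length) of each run without true MD inside [start, end)."""
    runs, run_start = [], None
    for i in range(start, end):
        if not true_missing[i] and run_start is None:
            run_start = i
        if true_missing[i] and run_start is not None:
            runs.append((run_start, i - run_start))
            run_start = None
    if run_start is not None:
        runs.append((run_start, end - run_start))
    return runs

def kssa(series, impute, segment_len=12, max_gap=10, n_runs=100, seed=0):
    """Return one (simulated MD size, RMSE) pair per run."""
    rng = random.Random(seed)
    true_missing = [v is None for v in series]
    complete = impute(series)                      # step 1: first imputation
    results = []
    for _ in range(n_runs):                        # step 5: repeated runs
        simulated, hidden = list(complete), []
        for seg in range(0, len(series), segment_len):       # step 2
            runs = known_runs(true_missing, seg, min(seg + segment_len, len(series)))
            if not runs:
                continue
            run_start, run_len = rng.choice(runs)            # step 3: window
            size = rng.randint(1, min(max_gap, run_len))
            first = rng.randint(run_start, run_start + run_len - size)
            for i in range(first, first + size):
                simulated[i] = None
                hidden.append(i)
        if not hidden:
            continue
        imputed = impute(simulated)                          # step 4
        mse = sum((complete[i] - imputed[i]) ** 2 for i in hidden) / len(hidden)
        results.append((len(hidden), math.sqrt(mse)))
    return results

# Synthetic seasonal series of four years with a few true gaps (None).
demo = [10 + 5 * math.sin(2 * math.pi * t / 12) for t in range(48)]
for t in (3, 4, 5, 15, 16, 27):
    demo[t] = None
scores = kssa(demo, linear_interpolation, n_runs=20)
print(scores[:3])
```

Because simulated MD is placed only where the original series was observed, every error is measured against a real value rather than against a previously imputed one. Running the same function with other imputation callables and comparing the resulting error distributions corresponds to steps 6 to 8.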
As a result of this simulation, new simulated MD chunks varied in size within each segment, producing a range of 10 to 70 MD across the whole target time series. The range of MD sizes of interest for this research goes from 20 to 40, since this is the range of MD in the SEPEC time series. However, we kept MD sizes above 40 for the analysis, in order to show a broad range of MD sizes that allows the visualization of the increase in errors. Nonetheless, it is worth keeping in mind that imputations for MD above 40 are increasingly unreliable. MD chunks also varied in position within segments and across the whole time series. However, position was not considered for further statistical analysis and interpretation in this study.

Table 1
Structure of Missing Data (MD), trend and seasonality in the SEPEC time series from Jan-2012 to May-2020. NMAR-chunks were identified when the runs and Little tests were significant.

| Species | Total MD | Total MD gaps | Range of MD in gaps | Gap distribution | Trend and seasonality after imputation with SD and ST |
|---|---|---|---|---|---|
| B. pinnimaculatus | 24 (23%) | 8 | 1 to 6 | NMAR-chunks | it, sea(12) |
| L. occidentalis | 23 (22%) | 8 | 1 to 6 | NMAR-chunks | sta, sea(12) |
| L. peru | 28 (27%) | 8 | 1 to 6 | NMAR-chunks | it, sea(12) |
| S. ensis | 33 (32%) | 8 | 1 to 10 | NMAR-chunks | it, sea(12) |
| S. sierra | 24 (23%) | 8 | 1 to 6 | NMAR-chunks | it, sea(12) |
| C. phoxocephalus | 26 (25%) | 9 | 1 to 6 | NMAR-chunks | it, sea(12) |

Abbreviations: it: increasing trend; sta: stationary; sea: seasonal. Months of seasonality appear in brackets.
2.6. Validation of imputation methods with different real-life time series
We used ve real-life time series (complete data without MD) from
different sources to perform a classical validation of the imputation
methods. With ‘classical, we refer to validation of imputation methods
with complete time series different from the target time series. We
compared the results of these classical validations against the results of
Fig. 2. Graphical representation of the Known Sub-Sequence Algorithm (KSSA) for validating and comparing algorithms for MD imputation in time series data. A)
Target Time Series (TTS) with true MD. B) A rst imputation is carried out with algorithm x in order to obtain a full time series prior to simulation of MD and
validation of imputation algorithms. C) Time series is split in segments and a sliding and expanding window is set within each segment to simulate MD in known sub-
subsequences, avoiding previously imputed values. D) TTS with new simulated MD. E) Simulated MD is imputed with algorithm x. F) Imputed values with algorithm
x are compared to real values from known sub-sequences. The process is repeated with different imputation algorithms for comparative purposes.
Fig. 3. Flowchart showing the action steps followed in this study for the simulation of MD in marine shery landing time series and the validation of imputation
algorithms with the Known Sub-Sequence Algorithm. TTS: Target Time Series.
I.F. Benavides et al.
Aquaculture and Fisheries xxx (xxxx) xxx
6
KSSA. All these time series were originally larger than our target time
series, so they were cut to match the 8-years span with monthly reso-
lution from 2012 to 2020 as the SEPEC time series. Simulation of MD in
these time series was performed in a NMAR-chunk fashion following
steps 2 to 8 described in Fig. 3. MD chunks were allowed to vary from 1
to 10 individual MD and where located in the same positions of the
target time series (rst semester of each year). This conguration
allowed us to keep a reasonable framework for realistic comparisons
between classical validation and KSSA. Time series for this section are
described in Table 5 and comprise: 1) the CPUE (Catch per Unit of Effort)
of the yellown tuna (Thunnus albacares), 2) the skipjack tuna (Katsu-
nowus pelamis), and 3) the bigeye tuna (Thunnus obesus) in the Eastern
Tropical Pacic from the purse seine records of the Inter American
Tropical Tuna Commission (IATTC); 4) the average surface sea
Table 2
Resulting RMSE values from the MD simulation and KSSA validation in the SEPEC shery landings time series with the 12 imputation algorithms. Cells in bold
represent the lowest values from contrasts among algorithms (row-wise). Results are shown by ranges of increasing 10 MD from 1 to 70.
MD size =1 to 10
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 1.50
a
1.90
b
1.47
c
1.82
b
1.91
b
1.96
b
1.39
d
1.93
b
1.06
b
1.32
c
1.19
c
2.53
c
L. occidentalis 1.69
a
1.01
b
1.40
c
1.24
b
1.88
b
1.31
b
1.43
d
1.58
b
1.71
b
1.48
b
1.41
b
2.66
d
L. peru 0.12
b
0.10
b
0.23
a
0.26
a
1.11
b
1.46
c
2.94
d
1.28
c
0.42
a
0.34
a
0.33
a
2.42
a
S. ensis 1.52
a
1.12
b
1.46
b
1.00
c
1.82
b
1.35
b
1.26
d
1.39
b
1.75
b
1.46
b
1.97
b
3.00
d
S. sierra 1.06
a
1.02
b
1.47
b
1.10
b
1.54
b
1.15
b
1.91
c
1.47
b
1.32
b
1.25
b
1.73
b
3.40
b
C. phoxocephalus 1.43
a
1.42
a
1.58
a
1.64
a
4114
b
1.58
a
1.49
b
1.57
a
1.01
b
1.96
b
1.89
b
3.24
b
MD size =10 to 20
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 2.02
a
3.07
b
3.47
b
4.15
c
4.33
c
3.30
b
4.92
c
4.38
c
4.85
c
4.06
c
2.77
b
2.80
b
L. occidentalis 2.59
a
4.60
b
10.11
c
10.61
c
5.37
b
6.73
b
8.20
c
3.84
b
8.60
c
5.90
d
5.32
d
5.60
d
L. peru 1.06
a
1.67
b
1.90
b
1.68
b
1.42
b
2.06
c
1.86
b
2.13
c
2.02
c
2.45
c
1.51
b
2.02
c
S. ensis 1.53
a
3.46b 2.97
b
2.05
b
2.27
b
2.09
b
2.19
b
2.26
b
2.05
b
2.92
b
3.30
c
3.15
c
S. sierra 6.05
a
9.75
b
9.12
b
9.52
b
9.06
b
8.00
b
11.83
c
9.66
b
10.27
c
8.49
b
7.97
b
12.60
c
C. phoxocephalus 3.95
b
3.69
b
3.20
b
2.59
b
8.29
c
3.18
a
5.87
d
2.73
e
3.41
e
3.94
e
4.05
d
5.05
d
MD Size =20 to 30
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 4.50
b
4.90
b
5.47
b
4.82
b
4.91
b
4.96
b
8.39
a
4.93
b
5.06
b
5.32
b
5.19
b
5.53
b
L. occidentalis 7.87
a
8.34
a
11.47
b
8.72
a
11.44
b
9.32
c
15.99
d
9.9
e
9.72
e
9.35
e
9.78
e
14.23
d
L. peru 2.19
d
2.10
e
2.23
d
1.99
e
3.11
f
3.46
b
4.94
a
3.28
h
2.42
c
2.34
c
2.33
i
2.66
i
S. ensis 3.52
b
4.12
b
4.46
b
3.00
b
3.82
b
3.35
b
8.26
a
3.39
b
3.75
b
4.46
b
3.97
b
6.00
ab
S. sierra 10.06
b
11.02
b
11.47
b
13.10
b
12.54
b
11.15
b
19.91
a
11.47
b
11.32
b
11.25
b
10.73
b
13.40
b
C. phoxocephalus 4.43
c
4.42
c
4.58
c
4.64
c
6.14
b
4.58
c
7.49
a
4.57
c
5.01
c
4.96
c
4.89
c
6.24
b
MD Size =30 to 40
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 5.70
a
5.87
b
6.26
c
5.97
d
6.77
e
5.92
d
10.57
f
6.23
d
6.31
c
6.40
g
6.29
c
7.58
h
L. occidentalis 9.52
a
9.97
a
13.77
b
10.47
c
14.32
b
11.55
d
22.47
e
11.96
f
11.64
g
11.83
f
11.75
f
13.88
b
L. peru 2.53
a
2.57
a
2.84
b
2.54
a
3.76
c
3.90
e
6.97
d
3.90
e
3.06
f
3.31
f
2.85
b
5.68
g
S. ensis 4.46
a
5.20
b
5.66
c
5.56
c
5.35
b
4.96b 12.07
d
5.02
b
5.09
b
5.42
c
4.87
b
6.66
e
S. sierra 13.00
a
12.93
a
13.92
b
15.18
c
15.43
d
13.86
e
25.64
f
14.26
f
14.27
d
13.85
a
13.72
a
16.38
g
C. phoxocephalus 5.36
a
5.71
b
5.88
c
5.71
b
6.55
d
5.72
b
10.49
e
5.84
b
6.19
f
6.11
g
6.12
g
6.87
h
MD Size =40 to 50
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 6.55
a
6.85
b
7.29
b
7.05
b
8.18
c
6.89
b
14.40
d
7.15
b
7.49
e
7.54
d
7.51
d
8.55
e
L. occidentalis 10.99
a
11.76
a
16.13
b
12.23
c
15.96
d
13.93
e
29.68
f
14.22
e
13.63
g
13.43
g
13.78
e
17.16
b
L. peru 3.15
a
3.28
b
3.45
b
3.16
a
4.26
c
4.29
c
10.27
d
4.34
c
3.79
e
3.91
f
3.54
b
6.21
g
S. ensis 5.42
a
6.12
b
6.72
c
6.35
b
6.60b 6.09b 15.44f 6.14b 6.03b 6.48b 5.89b 7.38b
S. sierra 15.11
a
15.32
b
17.07
c
17.45
d
20.91
e
16.70
c
34.75
f
17.19
g
16.72
c
16.29
c
16.54
c
19.30
h
C. phoxocephalus 6.36
a
6.68
b
6.95
c
6.78
c
7.78
d
6.78
c
13.02
e
6.94
c
7.44
f
7.23
g
7.29
g
8.13
h
MD Size =50 to 60
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 7.52
a
7.96
b
8.39
b
8.12
b
9.88
c
7.89
b
16.30
d
8.20
b
8.52
e
8.41
b
8.42
b
9.66
c
L. occidentalis 13.02
a
13.32
a
17.99
b
14.35
c
18.11
b
15.88
d
40.19f 16.77
g
15.75
d
15.23
h
15.62
d
18.82
i
L. peru 3.61
a
3.82
b
4.44
c
4.04
c
4.69
d
4.65
e
12.97
f
4.67
d
4.42
c
4.51
e
4.23
c
7.20
g
S. ensis 6.33
a
6.95
b
7.58
c
7.31
d
8.10
e
7.22
f
18.62
g
7.44
d
7.15
f
7.31
d
6.99
b
8.27
h
S. sierra 17.77
a
16.98
b
19.31
c
20.12
d
25.51
e
19.41
d
41.33
f
20.28
g
18.95
c
18.48
h
18.74
h
21.24
i
C. phoxocephalus 7.22
a
7.42
b
8.23
c
7.70
d
8.90
e
7.82
d
16.03
f
8.24
c
8.50
g
8.40
g
8.27
c
8.90
e
MD Size =60 to 70
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 8.53a 8.74a 9.85b 10.02b 12.45c 8.62a 19.53d 9.29a 9.38a 9.40a 9.64b 10.53b
L. occidentalis 15.28a 15.06a 20.17a 16.42a 19.84a 19.12a 44.77b 18.12a 17.13a 16.81a 17.94a 20.13a
L. peru 4.46c 4.71c 5.06c 4.92c 6.32d 5.29c 14.66a 5.34c 5.16c 5.33c 5.12c 8.36b
S. ensis 7.81b 7.60b 8.62b 8.62b 10.07b 8.25b 25.10a 8.27b 8.06b 7.86b 8.13b 9.11b
S. sierra 19.74a 18.07b 23.28c 24.31d 29.53e 23.78c 35.90f 23.52c 20.93g 20.54g 21.67g 24.03c
C. phoxocephalus 8.38b 8.45b 8.85b 9.08b 10.39b 8.90b 25.52a 8.76b 9.01b 9.44b 9.16b 9.92b
Different superscript letters indicate statistically significant differences in RMSE values between imputation methods at 0.05 level (row-wise comparisons), according to
the Tukey HSD tests. Same superscript letters indicate statistically non-significant differences.
temperature in the Colombian Pacific Ocean (SST-CPO), and 5) the
average concentration of Chlorophyll-a in the Colombian Pacific Ocean
(CHLA-CPO) from the Marine Copernicus remote sensing product Global
Analysis Forecast Bio 001-0.28.
2.7. Data analysis
RMSE and MASE were compared among algorithms using simple
analysis of variance (ANOVA) plus Tukey HSD tests (Zar, 2009) for
balanced data (R package 'agricolae'). Ordinary Least Squares (OLS)
regressions were fitted to estimate the slope of RMSE and MASE as a
function of MD size. Slopes among algorithms were compared using
Table 3
Resulting MASE values from MD simulation and KSSA validation in the SEPEC fishery landings time series with 12 imputation algorithms and six species. Cells in bold
represent the lowest values from contrasts among algorithms (row-wise). Results are shown by ranges of 10 MD, from 1 to 70.
MD Size =1 to 10
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.05a 0.09a 0.16b 0.11a 0.06a 0.11a 0.24c 0.10b 0.08a 0.11b 0.11b 0.17b
L. occidentalis 0.02a 0.07b 0.15c 0.01d 0.13c 0.09c 0.30e 0.10c 0.09b 0.19c 0.19c 0.24f
L. peru 0.02a 0.01a 0.10a 0.01b 0.10a 0.37c 0.28c 0.33c 0.31c 0.39c 0.31c 0.31c
S. ensis 0.01a 0.05a 0.08a 0.04a 0.02b 0.03a 0.19c 0.16c 0.15c 0.16c 0.14c 0.27d
S. sierra 0.06a 0.10b 0.13b 0.24c 0.09a 0.13b 0.27c 0.13b 0.29d 0.25d 0.29d 0.28d
C. phoxocephalus 0.07a 0.08a 0.11b 0.08a 0.15b 0.19b 0.23c 0.20c 0.20c 0.25c 0.21c 0.22c
MD Size =10 to 20
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.16a 0.11a 0.20b 0.21b 0.22b 0.18b 0.18b 0.19b 0.20b 0.20b 0.19b 0.19b
Cynoscion spp 0.10a 0.15a 0.20b 0.12c 0.16a 0.16a 0.22d 0.20d 0.21d 0.22d 0.24d 0.24d
L. peru 0.08a 0.13b 0.27c 0.14b 0.16b 0.18b 0.22c 0.26c 0.26c 0.26c 0.14b 0.16b
S. ensis 0.06a 0.17b 0.15b 0.15b 0.19b 0.16b 0.17b 0.16b 0.18b 0.23c 0.27c 0.27c
S. sierra 0.18a 0.24b 0.23b 0.23b 0.19c 0.18c 0.24b 0.23b 0.23b 0.28b 0.19c 0.20c
C. phoxocephalus 0.12a 0.20a 0.20a 0.14b 0.40c 0.17a 0.30c 0.15a 0.16a 0.20a 0.20a 0.20a
MD Size =20 to 30
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.23a 0.29b 0.36c 0.31d 0.28a 0.31d 0.44e 0.30d 0.28f 0.31d 0.31d 0.37g
L. occidentalis 0.21a 0.26a 0.36b 0.23a 0.33b 0.29c 0.48d 0.31e 0.31e 0.29e 0.31e 0.5f
L. peru 0.22a 0.20a 0.30a 0.21b 0.30a 0.57c 0.48c 0.53c 0.31d 0.29a 0.31d 0.31d
S. ensis 0.18a 0.25b 0.28b 0.24b 0.21a 0.23b 0.39c 0.26b 0.20a 0.26b 0.24b 0.47d
S. sierra 0.26a 0.30a 0.33b 0.44c 0.29a 0.33d 0.47e 0.33d 0.29a 0.30a 0.29a 0.38e
C. phoxocephalus 0.27a 0.28a 0.31b 0.28a 0.35c 0.29a 0.43d 0.30a 0.30a 0.30a 0.30a 0.42d
MD Size =30 to 40
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.39a 0.41b 0.47c 0.44d 0.41a 0.42b 0.66e 0.44f 0.40a 0.42b 0.42b 0.58g
L. occidentalis 0.29a 0.35a 0.54b 0.32c 0.47b 0.42d 0.76e 0.43f 0.41d 0.42f 0.41f 0.55b
L. peru 0.29d 0.30d 0.45c 0.30d 0.40c 0.72b 0.76b 0.72b 0.44c 0.47c 0.42c 1.06a
S. ensis 0.27a 0.36b 0.41c 0.48d 0.36b 0.39c 0.64e 0.39c 0.32f 0.37b 0.33f 0.56g
S. sierra 0.40a 0.41a 0.46b 0.56c 0.44d 0.47e 0.70f 0.48e 0.43g 0.42g 0.43g 0.57c
C. phoxocephalus 0.37a 0.41b 0.45c 0.39d 0.45e 0.42f 0.68g 0.43h 0.42f 0.42i 0.42f 0.56j
MD Size =40 to 50
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.51a 0.53b 0.61c 0.59d 0.55e 0.56f 0.96g 0.57h 0.54b 0.55f 0.56f 0.75i
L. occidentalis 0.42a 0.46a 0.65b 0.39c 0.59d 0.57b 1.11e 0.58d 0.53f 0.53f 0.55d 0.77b
L. peru 0.40a 0.42a 0.59b 0.42a 0.52c 0.87d 1.20e 0.86d 0.60f 0.61f 0.58b 1.27g
S. ensis 0.37a 0.48b 0.54c 0.60d 0.50e 0.53f 0.91g 0.53h 0.43i 0.50j 0.45k 0.70l
S. sierra 0.53d 0.54d 0.64c 0.72b 0.66c 0.64c 1.05a 0.64c 0.56d 0.56d 0.58d 0.75b
C. phoxocephalus 0.49a 0.54b 0.60c 0.53d 0.60e 0.57f 0.93g 0.58h 0.56f 0.55i 0.56f 0.72j
MD Size =50 to 60
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.64a 0.69b 0.78c 0.74d 0.74d 0.70b 1.21e 0.72d 0.68b 0.68b 0.70b 0.92f
L. occidentalis 0.55a 0.57a 0.8b 0.51c 0.74b 0.73d 1.61e 0.77f 0.67d 0.66g 0.68g 0.94h
L. peru 0.50a 0.54a 0.84b 0.60c 0.65d 1.02e 1.63f 0.99e 0.75g 0.77b 0.76g 1.59f
S. ensis 0.49a 0.59b 0.67c 0.75d 0.68e 0.69e 1.20f 0.70g 0.58b 0.62h 0.60b 0.85i
S. sierra 0.69a 0.66a 0.79b 0.90c 0.89d 0.82e 1.35f 0.85g 0.70a 0.70a 0.72h 0.91c
C. phoxocephalus 0.62a 0.65b 0.78c 0.66d 0.75e 0.72f 1.24g 0.76e 0.70h 0.70h 0.70h 0.88i
MD Size =60 to 70
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.79a 0.81a 1.00b 0.98b 0.94b 0.85a 1.52c 0.90b 0.81a 0.82a 0.85a 1.07d
L. occidentalis 0.72a 0.7a 1.02a 0.66a 0.88a 0.98a 1.85b 0.89a 0.8a 0.81a 0.85a 1.1a
L. peru 0.66a 0.72b 1.07c 0.79b 0.93d 1.22e 2.04f 1.22e 0.93d 0.98d 0.97d 1.96f
S. ensis 0.66a 0.70a 0.83b 0.96c 0.92d 0.85b 1.66e 0.86b 0.70a 0.72f 0.76f 1.01c
S. sierra 0.83a 0.76a 1.07b 1.18c 1.06b 1.12d 1.36e 1.07b 0.83a 0.83a 0.91f 1.13d
C. phoxocephalus 0.77b 0.81b 0.92b 0.85b 0.93b 0.89b 1.96a 0.88b 0.82b 0.85b 0.85b 1.05b
Different superscript letters indicate statistically significant differences in MASE values between imputation methods at 0.05 level (row-wise comparisons), according to
the Tukey HSD tests. Same superscript letters indicate statistically non-significant differences.
ANOVA and Tukey HSD tests (Zar, 2009). The best algorithms for
imputation were determined as those having lower RMSE and MASE, as
well as smaller slopes, which indicate less imputation error on average
and a smaller decay of imputation precision with increasing MD. Data
analysis for classical simulation focused on MD sizes from 20 to 40,
since this is the range of MD of most importance in the target time series.
The level of significance of statistical tests was set at 0.05.
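The quantities used in this analysis can be illustrated with a minimal Python sketch. It assumes the common definitions of RMSE and MASE (here, mean absolute error scaled by the mean absolute one-step difference of a reference series) and a plain least-squares slope. The function names and the numbers in the usage lines are hypothetical and are not taken from the study, whose analysis was run in R.

```python
import math


def rmse(actual, imputed):
    """Root mean squared error between true and imputed values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, imputed)) / n)


def mase(actual, imputed, reference):
    """Mean absolute error scaled by the mean absolute one-step
    difference of a fully observed reference series (assumed scaling)."""
    mae = sum(abs(a - p) for a, p in zip(actual, imputed)) / len(actual)
    steps = [abs(reference[i] - reference[i - 1]) for i in range(1, len(reference))]
    return mae / (sum(steps) / len(steps))


def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical error values at increasing MD sizes (illustration only).
md_sizes = [10, 20, 30, 40]
errors = [1.0, 2.0, 3.0, 4.0]
slope = ols_slope(md_sizes, errors)  # error gained per additional MD
```

A smaller slope means that the error of a method grows more slowly as more data are missing, which is the criterion described above.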
3. Results
3.1. Structure of time series and MD
All time series showed a 12-month seasonality. Regarding the
trend, L. occidentalis was stationary and the rest of the species showed a
significant increasing trend (see Table 1). Time series for all species
showed NMAR-type MD (see Table 1), with sequential MD occurring
in chunks, most of them at the beginning of each year. The largest chunk
was found for S. ensis, with 10 MD from January to October 2012. For
the rest of the species, the largest chunk had 6 MD from January to June
2018. The remaining chunks ranged from 2 to 5 MD and varied in position
across the months from January to December (see Fig. 6). The difference
in total MD among series was due to a few individual MD (<25%)
scattered throughout the series. Total MD in the series ranged from 20 to 33
(19%–32%), being highest for S. ensis and lowest for L. occidentalis.
3.2. Performance metrics among algorithms and MD size
Tables 2 and 3 summarize the mean RMSE and MASE resulting from
the MD simulation and KSSA validation for each algorithm, species and
MD size. For the whole experiment, 64% of the simulations resulted in
lowest RMSE values for the SEADEC algorithm, 25% for STRUCT and
11% for LTS. The remaining algorithms did not show lowest RMSE values for
any species at any MD size. SEADEC was consistently better for all
species, whilst LTS showed lowest RMSE values only for S. ensis and
L. peru at MD sizes from 1 to 10, 10 to 20 and 20 to 30. However, in most
cases these differences were not statistically significant. Regarding
MASE, the results were very similar to those for RMSE, except
that the SSA algorithm achieved the lowest values for B. pinnimaculatus
at MD sizes from 1 to 10 and 20 to 30, and for S. ensis at MD sizes from 1
to 10.
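As a minimal sketch of how such tallies can be obtained, the Python snippet below picks the algorithm with the lowest mean RMSE in each row and counts how often each algorithm wins. The two rows are copied from the "MD Size = 40 to 50" block of Table 2; the helper name best_method is hypothetical, and the sketch ignores the Tukey HSD letters, so a winner here is not necessarily significantly better than the runner-up.

```python
from collections import Counter

METHODS = ["SEADEC", "STRUCT", "ARIMA", "LTS", "SSA", "LIT",
           "SIT", "STIT", "SMA", "LWMA", "EWMA", "LOCF"]

# Mean RMSE values from two rows of Table 2 (MD Size = 40 to 50).
rmse_rows = {
    "B. pinnimaculatus": [6.55, 6.85, 7.29, 7.05, 8.18, 6.89,
                          14.40, 7.15, 7.49, 7.54, 7.51, 8.55],
    "L. peru": [3.15, 3.28, 3.45, 3.16, 4.26, 4.29,
                10.27, 4.34, 3.79, 3.91, 3.54, 6.21],
}


def best_method(values, methods=METHODS):
    """Name of the method with the lowest error in one table row."""
    lowest = min(range(len(values)), key=values.__getitem__)
    return methods[lowest]


best = {species: best_method(row) for species, row in rmse_rows.items()}
tally = Counter(best.values())
# Percentage of rows won by each method.
share = {method: 100.0 * count / len(best) for method, count in tally.items()}
```

For L. peru the margin between SEADEC (3.15) and LTS (3.16) is tiny, which is why the significance letters in the table matter when interpreting such counts.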
Both RMSE and MASE increased linearly and consistently with MD
size in all algorithms and species. At low MD sizes (1–10), RMSE and
MASE were the lowest, but the difference among algorithms was also the
lowest. When MD was low and/or scattered, all algorithms performed
similarly. Simulations with low MD sizes often resulted in 1 or 2 MD per
segment, which were easily imputed by all algorithms with similar
values. LOCF was an exception to this pattern because even at low MD
sizes, its RMSE and MASE values were higher. With few MD per segment,
imputations are made based on inter-attribute correlations, with no need
to account for time dependency structures. Hence, under this situation,
there was no distinguishable advantage of any algorithm over others.
As MD increased and chunks got larger, differences in RMSE and
MASE among algorithms were more pronounced. However, the statistical
significance of these differences did not increase proportionally
with MD size. The difference in RMSE among algorithms was more
statistically significant for MD sizes from 10 to 50, where SEADEC and STRUCT had
up to 20%–50% more precision than algorithms with low performance.
Above 60 MD, the statistical significance of the differences
among algorithms decreased. This is related to the fact that the variance
of imputed values among runs increases in simulations with high MD size. When
MD size was over 60, most of the essential structure of the time series
(seasonality, trend, and cycles) was lost, making it very difficult for any
algorithm to capture it and impute reliable values. As the simulation
runs evolved, some high RMSE and MASE values emerged due to those
MD structures that left no time structure in the time series, thus
increasing the variance of imputations. As a consequence, at MD sizes
above 60, the imputations of all algorithms are not reliable.
Table 4 and Figs. 4 and 5 show the results of the OLS regressions to
estimate the slopes of RMSE and MASE as a function of MD size.
Regarding RMSE, SEADEC and STRUCT had the lowest slopes among
algorithms. Slopes for SEADEC were significantly lower for
B. pinnimaculatus, S. ensis and C. phoxocephalus. For the rest of the species,
either SEADEC or STRUCT obtained the lowest slopes. Regarding
MASE, results were similar, except that for L. occidentalis the algorithm
LTS obtained the lowest slope. This means that the decay of imputation
precision with increasing MD is lower when using the SEADEC and
STRUCT algorithms. Taking into account both performance metrics
(RMSE and MASE) and the slopes from OLS regression, these results point to
SEADEC and STRUCT as the two best algorithms for MD imputation in
the SEPEC marine fishery landings time series. However, it should be
noted that according to our results, the higher performance of these
algorithms is applicable for MD sizes between 20 and 50 (Figs. 4 and 5),
and for MD following a NMAR-type.
3.3. Comparison between KSSA and classical validation
Table 6 shows the results of classical validation of imputation
methods in terms of the best method resulting for each time series and
MD size. According to these results, ARIMA was the best method for
T. obesus; LTS, SMA and SSA the best for T. albacares; LTS, SEADEC and
SSA the best for K. pelamis; SSA the best for CHLA-CPO, and LTS the best
for SST-CPO. When focusing on MD sizes from 20 to 40, which are of
main interest for us since this range is dominant in our target time series
from SEPEC, the results show that ARIMA was the best method for
T. obesus; SMA and SSA the best for T. albacares; SSA and SEADEC the
best for K. pelamis; SSA the best for CHLA-CPO, and LTS the best for
SST-CPO. These results contrast with those of KSSA, which revealed SEADEC and
STRUCT as the first and second best imputation methods respectively for
all of our six target time series.
From Figs. 4 and 5 one can visually estimate the increase in the
overall imputation errors for a particular MD size when switching from
one method to another. To understand the utility of KSSA, it must be
observed that at any MD size, both for RMSE and MASE, imputations
with ARIMA, SSA and LTS, suggested as the best by the classical
validation, increase the error.
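The size of such an increase can be expressed as a percentage of the error of the better method. The short Python sketch below shows this arithmetic under the assumption that both errors are measured at the same MD size and for the same series; the function name and the two error values are hypothetical and serve only as an illustration.

```python
def relative_increase(error_best, error_other):
    """Percentage by which error_other exceeds error_best."""
    if error_best <= 0:
        raise ValueError("error_best must be positive")
    return 100.0 * (error_other - error_best) / error_best


# Hypothetical RMSE values of two methods at the same MD size.
extra_error = relative_increase(10.0, 14.0)
```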
4. Discussion
In this work, we assessed and compared 12 state-of-the-art algorithms
for multiple imputation of missing data in univariate marine
fishery landing time series from six species of the Colombian Pacific
Ocean, which frequently include missing data, preventing their use for
quantitative analysis and decision making in fisheries. We validated the
results with our novel KSSA approach, which allowed us to simulate
missing data in the same target time series, hence giving us good
confidence in our results, which revealed SEADEC and STRUCT as the best
algorithms for multiple missing data imputation in these marine fishery
time series. The RMSE and MASE values for SEADEC were significantly
lower in most of the simulations and most of the MD sizes, with highly
similar results among species. This indicates that for our specific target
time series, this imputation method was not only effective but
consistent.
SEADEC emerged as the best imputation method due to the marked
seasonal nature of the fishery landings. SEADEC takes a time series and
removes its seasonal component, then performs the imputations
on the deseasonalized series using different algorithms such as the
Kalman filters (Moritz & Bartz-Beielstein, 2017). Among other imputation
options, we chose these filters because, beyond the 12-month
seasonality, our target time series contain different levels of time
dependency structures, such as autocorrelation at lags 1 and 2 (as evidenced
from Partial Autocorrelation Functions not shown in the Results section),
which would have been poorly captured by simple methods such
as interpolation, mean values, random values or moving averages
(available options in the imputeTS R-package). In fact, previous tests of
SEADEC using these options yielded unsatisfactory results. Hence, our
results outline SEADEC as an effective method for seasonal and autocorrelated
series such as the marine fishery landings, whose behavior reflects the
cyclical productivity of the ocean (Lloret et al., 2000; Preciado et al.,
2006; Coro et al., 2016).
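The idea of imputing on a deseasonalized series can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the imputeTS implementation: it assumes the seasonal component can be approximated by the mean of the observed values at each position of the cycle, it uses linear interpolation where the study used Kalman filters, and it assumes the seasonal component is added back to the imputed values. The function names are hypothetical.

```python
def interpolate(series):
    """Linear interpolation of None gaps; edge gaps take the nearest known value."""
    known = [i for i, v in enumerate(series) if v is not None]
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            w = (i - left) / (right - left)
            out[i] = series[left] + w * (series[right] - series[left])
    return out


def seasonal_impute(series, period=12):
    """Remove a crude seasonal profile, impute, then add the profile back."""
    observed = [v for v in series if v is not None]
    overall = sum(observed) / len(observed)
    # Seasonal profile: mean deviation at each position of the cycle.
    profile = []
    for p in range(period):
        vals = [v for i, v in enumerate(series) if v is not None and i % period == p]
        profile.append(sum(vals) / len(vals) - overall if vals else 0.0)
    deseason = [None if v is None else v - profile[i % period]
                for i, v in enumerate(series)]
    filled = interpolate(deseason)  # simplification: not a Kalman filter
    return [v if v is not None else filled[i] + profile[i % period]
            for i, v in enumerate(series)]


# Toy series with a 4-step cycle and two missing values.
toy = [10.0 + (i % 4) for i in range(16)]
toy[5] = toy[6] = None
result = seasonal_impute(toy, period=4)
```

On this purely seasonal toy series the gaps are recovered exactly, whereas plain linear interpolation across the gap would ignore the cycle.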
Other imputation methods such as the Seasonal Window Moving
Algorithm (SWMA) (Chandrasekaran et al., 2016), the Seasonally Split
Missing Value Imputation (SEASPLIT) (Moritz & Bartz-Beielstein, 2017),
the Pattern Sequence Forecasting (PSF) (Bokde et al., 2018), the
Table 4
Slopes estimated from OLS regression for RMSE and MASE as a function of MD size. Cells in bold represent the lowest values for contrasts among algorithms (rowwise).
RMSE
SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.1481a 0.1558b 0.1642d 0.1593b 0.1869b 0.1539b 0.3265e 0.1604b 0.1669c 0.1681c 0.167c 0.1767d
L. occidentalis 0.2472a 0.2663a 0.3662b 0.278a 0.4315d 0.3105c 0.6576e 0.3215c 0.3099c 0.3057c 0.3072c 0.3527b
L. peru 0.0701a 0.0721a 0.0796b 0.073a 0.0973c 0.0899c 0.2261e 0.0908c 0.0825b 0.0872c 0.0791b 0.1310d
S. ensis 0.1215a 0.137b 0.1496c 0.1343b 0.1484c 0.1331b 0.3465e 0.1353b 0.133b 0.1445c 0.1305b 0.1463c
S. sierra 0.3435a 0.3400a 0.3791b 0.3806b 0.4584c 0.3744b 0.7512d 0.3878b 0.3739b 0.3685b 0.3695b 0.4021b
C. phoxocephalus 0.1434a 0.1495b 0.1576b 0.1527b 0.1757d 0.1522b 0.3005e 0.1566b 0.1656c 0.1633c 0.1628c 0.1676c
MASE
B. pinnimaculatus 0.0114a 0.0121b 0.0133c 0.0132c 0.0123b 0.0125b 0.0219e 0.0129c 0.0120b 0.0123b 0.0124b 0.01665d
L. occidentalis 0.0091a 0.0100a 0.0141c 0.0088a 0.0152c 0.0127b 0.0246d 0.0131b 0.0120b 0.0120b 0.0123b 0.01708c
L. peru 0.0089a 0.0094a 0.0140b 0.0099a 0.0118b 0.0190c 0.0269d 0.0187c 0.0135b 0.0137b 0.0135b 0.03072d
S. ensis 0.0083a 0.0106b 0.0119b 0.0143c 0.0112b 0.0122b 0.0207d 0.0124b 0.0100b 0.0109b 0.0105b 0.01554c
S. sierra 0.0120a 0.0120a 0.0141b 0.0168c 0.0144b 0.0143b 0.0226d 0.0145b 0.0124a 0.0124a 0.0128a 0.01687c
C. phoxocephalus 0.0109a 0.0119d 0.0134d 0.0117d 0.0134d 0.0127d 0.0215e 0.0131c 0.0124d 0.0124d 0.0125d 0.01607b
Different superscript letters indicate statistically significant differences in slope values between imputation methods at 0.05 level (row-wise comparisons), according to
the Tukey HSD tests. Same superscript letters indicate statistically non-significant differences.
Table 5
Time series from sources different than SEPEC used for classical validation of
imputation methods.
Time series Seasonality Trend Source
T. albacares 12-Months Stationary https://www.iattc.org/
K. pelamis 12-Months Stationary
T. obesus 6-Months Stationary
SST-CPO 12-Months Stationary https://marine.copernicus.eu/
CHLA-CPO 12-Months Decreasing
Fig. 4. OLS regressions for RMSE as a function of MD size for all imputation algorithms and species after applying KSSA. Lines represent the fitted values for the 12
algorithms on each species, and the vertical gray bands represent MD sizes in the range of 20–40, which are typical in the SEPEC time series analyzed in this study.
Coefficients of determination R2 for the linear regressions ranged from 72% to 81%.
Dynamic Time Warping (DTW) (Phan et al., 2020), the Gaussian
Autoregressive (AR-G) and Student's Autoregressive models (AR-T) (Liu
et al., 2019), which have been claimed as efficient imputation methods
for univariate sets, were also tested in our target time series. However,
due to their low imputation efficiency (higher RMSE and MASE values
than those reported here) or unsatisfactory implementation for NMAR MD
in the case of DTW (this method works only for one gap of MD), the
results were not included in this paper. In any case, these results point
firstly to SEADEC and secondly to STRUCT as the best imputation
methods for the time series that we addressed here, and to KSSA as a
powerful validation approach to reveal this and to prevent misleading
results.
As demonstrated here, validating the methods with time series
other than the target time series of interest (classical validation) leads to
the choice of inaccurate imputation methods, given that the efficiency of
the methods is dataset-dependent. The consequence of this bad choice is
increased error in imputed values, which any statistical procedure
should seek to avoid. For example, if our validation had been
done with the time series of T. albacares, K. pelamis, T. obesus or
CHLA-CPO, ARIMA, SSA or SMA would have been chosen as the best methods
for MD sizes between 20 and 40. This would have produced imputed
values with up to 40% more error. This same mistake would have
occurred if the validations had been done with SST-CPO, which would
have pointed to LTS as the best imputation method, implying up to 35%
more error in imputed values. These increased errors could inflate the
random component of time series, risking distortion of their seasonality,
trend, cycle and autocorrelation structures, making them useless
for further fishery analysis and decision making. KSSA can work with
any structure of MD. However, one important restriction is the length of
MD chunks. If MD chunks are big enough to hide the essential
structure of the time series, imputation methods will not be able
to capture temporal components, making estimations unreliable.
The MD size at which this restriction applies will depend, however, on the
total length and temporal resolution of the target time series.
As observed during our experiments, some drawbacks need to
be considered when using SEADEC or STRUCT. Both methods are good
at capturing the general properties of time series but at the price of
missing high peaks or low valleys (because of the filtering effect). Other
methods such as SSA, SWMA and DTW are better at reconstructing gaps
that include very high or low values, but at the price of not capturing the
general pattern of seasons and trend along the whole time series. This
suggests possible trade-offs among imputation methods, worth
exploring in further research. We also encourage research that combines
different imputation methods operating on one single target time
series. This approach might be able to impute both general patterns and
finer details, improving the imputation efficiency of univariate
time series.
Chunk or gap size is also a limitation for SEADEC and STRUCT.
According to our results, both performed reasonably well under a mean
chunk size of 8 individual MD. However, above this number, results are
not trustworthy. Liu et al. (2020) mention that large gaps for SEADEC will
result in long-time-period linear interpolation before seasonal
decomposition, affecting the seasonality component and leading to poor
imputation results. Regarding computation times, SEADEC and STRUCT
Fig. 5. OLS regressions for MASE as a function of MD size for all imputation algorithms and species after simulation and validation with the Known Sub-Sequence
Algorithm (KSSA). Lines represent the fitted values for the 12 algorithms on each species, and the vertical gray bands represent MD sizes in the range of 20–40, which
are typical in the SEPEC time series analyzed in this study. Coefficients of determination R2 for the linear regressions ranged from 75% to 84%.
were halfway (30 min–1 h) between methods that required just a few
seconds to perform the 10,000 simulations, such as LOCF, SIT, STIT, LIT,
SMA, EWMA, LWMA and LTS, and methods such as ARIMA and SSA that
took up to 3 h. Thus, computation times can be a limitation depending
on the computer used for simulations. A final consideration is
negative imputations. Occasionally during simulations (3% of
the time in our experiment) SEADEC and STRUCT return negative
values, which are worth omitting in further statistical analysis if the time
series comprises only positive values, such as the marine fishery
landings.
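One simple way to handle this, sketched below in Python, is to set negative imputed values back to missing and to report how often this happens. This is only one possible treatment, assumed here for illustration; the function name and the toy values are hypothetical.

```python
def drop_negative_imputations(original, imputed):
    """Set negative imputed values back to None and return the share of
    imputed positions that were negative."""
    gaps = [i for i, v in enumerate(original) if v is None]
    negatives = [i for i in gaps if imputed[i] < 0]
    cleaned = list(imputed)
    for i in negatives:
        cleaned[i] = None
    share = len(negatives) / len(gaps) if gaps else 0.0
    return cleaned, share


# Toy landings series (None marks MD) and a hypothetical imputation of it.
original = [5.0, None, 1.0, None, 4.0]
imputed = [5.0, 3.0, 1.0, -0.4, 4.0]
cleaned, share = drop_negative_imputations(original, imputed)
```

Only positions that were missing in the original series are checked, so observed values are never altered.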
The purpose of imputations always needs to be taken into account to
understand which range of imputed values is acceptable for a
particular target time series. If the purpose is, for example, to produce a forecast
of the general trend of the time series, very high or low values should not
be problematic. Conversely, very high or low values are indeed a problem
when the purpose is to reconstruct particular missing values for
stock assessment, price analysis or other finer-detail tasks for fisheries. In
these cases, the experience of the researcher with the study system and
the data is highly valuable, and should always complement the
information yielded by imputation methods.
The results of this research highlight the convenience of conducting
KSSA to choose ad-hoc methods best suited to MD imputation in
particular target time series. As stated by Hyndman & Khandakar (2008)
and Moritz and Bartz-Beielstein (2017), there is no single univariate
imputation technique suited to all types of data. Imputation perfor-
mance is always very dependent on the characteristics of the input time
series. Time series comprise various known components such as length,
resolution, trend, seasonality, cycles, autocorrelation and randomness,
Fig. 6. Fishery landings time series after true MD imputation with algorithms SEADEC and STRUCT. Black lines represent real data, red lines represent imputations
with SEADEC and blue lines represent imputations with STRUCT. (For interpretation of the references to colour in this figure legend, the reader is referred to the
Web version of this article.)
Table 6
Results of classical validation of imputation methods with five real-life time
series. The best method for each time series and MD size is shown, since it obtained
the lowest RMSE and MASE values. MD sizes from 20 to 40 are highlighted
because they represent the range of dominant MD size in the target time series
from SEPEC.
MD Size T. obesus T. albacares K. pelamis CHLA-CPO SST-CPO
RMSE
1 to 10 ARIMA LTS LTS SSA LTS
10 to 20 ARIMA SMA SEADEC SSA LTS
20 to 30 ARIMA SSA SSA SSA LTS
30 to 40 ARIMA SSA SEADEC SSA LTS
40 to 50 ARIMA SSA SEADEC SSA LTS
50 to 60 ARIMA SSA SEADEC SSA LTS
60 to 70 ARIMA SSA SEADEC SSA LTS
MASE
1 to 10 ARIMA LTS SSA SSA LTS
10 to 20 ARIMA SMA SEADEC SSA LTS
20 to 30 ARIMA SMA SSA SSA LTS
30 to 40 ARIMA SSA SEADEC SSA LTS
40 to 50 ARIMA SSA SEADEC SSA LTS
50 to 60 ARIMA SSA SEADEC SSA LTS
60 to 70 ARIMA SSA SEADEC SSA LTS
whose possible combinations are virtually infinite. In addition, it
must be recalled that time series are gathered from observational or
experimental units that imply random variation in space and time,
whose causes are unknown. Although target time series are artificially split
for KSSA, the segments are only used operationally for simulating NMAR
MD and imputation, but the imputation itself is made based on the
statistical properties of the whole target time series.
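The following Python sketch illustrates this idea in a simplified form and is not the authors' implementation. It assumes that each segment contributes one chunk of simulated MD drawn from positions whose true values are known, that the imputation runs on the whole masked series, and that the error is scored only on the hidden positions. The function names, the two toy imputers (linear interpolation and LOCF) and all parameter values are hypothetical.

```python
import math
import random


def interpolate(series):
    """Linear interpolation of None gaps; edge gaps take the nearest known value."""
    known = [i for i, v in enumerate(series) if v is not None]
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            w = (i - left) / (right - left)
            out[i] = series[left] + w * (series[right] - series[left])
    return out


def locf(series):
    """Last observation carried forward; a leading gap takes the first known value."""
    out = list(series)
    last = next(v for v in series if v is not None)
    for i, v in enumerate(out):
        if v is None:
            out[i] = last
        else:
            last = v
    return out


def simulate_chunks(series, n_segments, chunk_len, rng):
    """Hide one chunk of known values inside each segment of the series."""
    seg_len = len(series) // n_segments
    masked, hidden = list(series), []
    for s in range(n_segments):
        start, end = s * seg_len, (s + 1) * seg_len
        candidates = [i for i in range(start, end - chunk_len + 1)
                      if all(series[j] is not None for j in range(i, i + chunk_len))]
        if not candidates:
            continue  # no fully known sub-sequence in this segment
        first = rng.choice(candidates)
        for j in range(first, first + chunk_len):
            masked[j] = None
            hidden.append(j)
    return masked, hidden


def mean_rmse(series, impute, n_segments, chunk_len, runs, seed=0):
    """Average RMSE over repeated simulations, scored on hidden positions only."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        masked, hidden = simulate_chunks(series, n_segments, chunk_len, rng)
        if not hidden:
            continue
        filled = impute(masked)  # imputation sees the whole series
        sq = sum((filled[j] - series[j]) ** 2 for j in hidden)
        scores.append(math.sqrt(sq / len(hidden)))
    if not scores:
        raise ValueError("no known sub-sequence long enough to simulate MD")
    return sum(scores) / len(scores)


# Toy target series: linear trend with a real gap at positions 5 and 6.
target = [float(i) for i in range(48)]
target[5] = target[6] = None
rmse_interp = mean_rmse(target, interpolate, n_segments=4, chunk_len=3, runs=20, seed=1)
rmse_locf = mean_rmse(target, locf, n_segments=4, chunk_len=3, runs=20, seed=1)
```

Because both imputers are scored with the same seed, they face the same simulated gaps, so the comparison reflects the methods rather than the random draw of chunks.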
Each time series is a unique 'fingerprint' that cannot be captured by a
single method. This is why researchers working with time series in
marine sciences should have a broad toolkit of imputation methods to
handle multiple missing data and consider KSSA as a potential approach
that can better help decide which method to use for a particular series.
This of course applies not only to marine science, but to all quantitative
sciences that use time series.
Caution is therefore warranted with every newly published paper
claiming a new imputation method to be better than the rest, because it
might be better only for the time series that the authors used for validation.
This kind of research can significantly impact marine science because,
beyond mathematical details, fisheries and researchers in marine
science need sound and easily implemented methods for imputation of MD
and validation of methods. Improving imputation methods, combined
with improving validation algorithms and improving the researchers'
knowledge of the marine systems and the data they produce, helps to
design, implement and evaluate management strategies and recovery
plans for threatened or exploited species (Altamar et al., 2020;
Peacock et al., 2020). Furthermore, the specific results produced in this
research will be useful for researchers working with marine fisheries
time series around the world, and especially for researchers in AUNAP
and SEPEC who are willing to use the time series from the six species
covered here for stock assessment and forecasting. Finally, KSSA might
also be useful for other research fields apart from fisheries, where
imputation methods need to be validated in order to choose the best
one.
Declaration of competing interest
The authors declare that there are no conflicts of interest.
CRediT authorship contribution statement
Iván F. Benavides: Conceptualization, Data curation, Formal analysis,
Funding acquisition, Investigation, Methodology, Software,
Supervision, Validation, Writing – original draft. Marlon Santacruz: Data
curation, Investigation, Methodology, Writing – review & editing.
Jhoana P. Romero-Leiton: Investigation, Visualization, Writing –
review & editing. Carlos Barreto: Data curation, Methodology, Writing –
review & editing. John Josephraj Selvaraj: Funding acquisition,
Investigation, Project administration, Resources, Supervision, Writing –
review & editing.
Acknowledgments
The authors thank SEPEC (Colombia) for providing the datasets
and the technical training on them; the Instituto de Estudios del Pacífico
(IEP) at Universidad Nacional de Colombia, Tumaco for providing
logistic support for the research; and professor Luis Enriquez Benavides at
Universidad de Nariño (Colombia) for providing logistic support. F.
Benavides and J. Romero appreciate the support of Fundación CEIBA
(Colombia).
References
Afrifa-Yamoah, E., Mueller, U. A., Taylor, S. M., & Fisher, A. J. (2020). Missing data
imputation of high-resolution temporal climate time series data. Meteorological
Applications, 27(1), 1–18. https://doi.org/10.1002/met.1873
Altamar, J., Correa-Helbrum, J., Restrepo-Leal, D., & Robles-Algarín, C. (2020).
Reconstructed data of landings for the artisanal beach seine fishery in the marine-
coastal area of Taganga, Colombian Caribbean Sea. Data in Brief, 30.
https://doi.org/10.1016/j.dib.2020.105604
Beck, M. W., Bokde, N., Asencio-Cortés, G., & Kulat, K. (2018). R package imputeTestbench
to compare imputation methods for univariate time series.
Bokde, N., Beck, M. W., Martínez Álvarez, F., & Kulat, K. (2018). A novel imputation
methodology for time series based on pattern sequence forecasting. Pattern
Recognition Letters, 116, 88–96. https://doi.org/10.1016/j.patrec.2018.09.020
Chandrasekaran, S., Moritz, S., Zaefferer, M., Stork, J., Bartz-Beielstein, T., & Bartz-
Beielstein, T. (2016). Data preprocessing: A new algorithm for univariate imputation
designed specifically for industrial needs (pp. 1–20). January: Workshop Computational
Intelligence. https://cos.bibl.thkoeln.de/frontdoor/deliver/index/docId/433/file/Chan16a.pdf.
Chan, K.-S., & Ripley, B. (2020). TSA: Time series analysis. R package version 1.3
https://CRAN.R-project.org/package=TSA.
Coro, G., Large, S., Magliozzi, C., & Pagano, P. (2016). Analysing and forecasting
fisheries time series: Purse seine in Indian ocean as a case study. ICES Journal of
Marine Science: Journal Du Conseil, 73(10), 2552–2571.
https://doi.org/10.1093/icesjms/fsw131
De Valpine, P. (2002). Review of methods for fitting time-series models with process and
observation error and likelihood calculations for nonlinear, non-Gaussian state-space
models. Bulletin of Marine Science, 70(2), 455–471.
Demirhan, H., & Renwick, Z. (2018). Missing value imputation for short to mid-term
horizontal solar irradiance data. Applied Energy, 225(May), 998–1012.
https://doi.org/10.1016/j.apenergy.2018.05.054
Donders, A. R. T., van der Heijden, G. J. M. G., Stijnen, T., & Moons, K. G. M. (2006).
Review: A gentle introduction to imputation of missing values. Journal of Clinical
Epidemiology, 59(10), 1087–1091.
Duarte, L. O., Manjarr´
es, L., & Reyes-Ardila, M.y H. (2019). Estadísticas de desembarco y
esfuerzo de las pesquerías artesanales e industriales de Colombia entre febrero y diciembre
de 2019 (p. 95). Bogot´
a: Autoridad Nacional de Acuicultura y Pesca (AUNAP).
Engels, J. M., & Diehr, P. (2003). Imputation of missing longitudinal data: A comparison
of methods. Journal of Clinical Epidemiology, 56(10), 968976. https://doi.org/
10.1016/S0895-4356(03)00170-7
FAO. (2017). Fisheries and aquaculture software. FishStatJ - software for fishery statistical
time series. Version 3.01.0. Rome: FAO Fisheries and Aquaculture Department.
Updated 14 2017 http://fao.org/fishery/statistics/software/fishstatj/en.
FAO. (2020). The state of world fisheries and aquaculture. Sustainability in Action, 225,
978-92-5-132756-2 http://www.fao.org/3/ca9229en/ca9229en.pdf.
Farmer, N. A., & Froeschke, J. T. (2015). Forecasting for recreational fisheries
management: what's the catch? North American Journal of Fisheries Management, 35,
720–735. https://doi.org/10.1080/02755947.2015.1044628
Genolini, C., Ecochard, R., & Jacqmin-Gadda, H. (2013). Copy mean: A
new method to impute intermittent missing values in longitudinal studies. Open
Journal of Statistics, 3. https://doi.org/10.4236/ojs.2013.34A004
Golyandina, N., & Korobeynikov, A. (2014). Basic singular spectrum analysis and
forecasting with R. Computational Statistics & Data Analysis, 71, 934–954.
https://doi.org/10.1016/j.csda.2013.04.009
Hassani, H., Kalantari, M., & Ghodsi, Z. (2019). Evaluating the performance of multiple
imputation methods for handling missing values in time series data: A study focused
on East Africa, soil-carbonate-stable isotope data. Stats, 2(4), 457–467.
https://doi.org/10.3390/stats2040032
Huque, M. H., Carlin, J. B., Simpson, J. A., & Lee, K. J. (2018).
A comparison of multiple imputation methods for missing data in longitudinal
studies. BMC Medical Research Methodology, 18(168).
https://doi.org/10.1186/s12874-018-0615-6
Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast
package for R. Journal of Statistical Software, 27(3), 22.
http://www.jstatsoft.org/v27/i03/paper.
Jamshidian, M., Jalal, S., & Jansen, C. (2014). MissMech: An R package for testing
homoscedasticity, multivariate normality, and missing completely at random
(MCAR). Journal of Statistical Software, 56(6), 1–31.
https://doi.org/10.18637/jss.v056.i06
Junger, W. L., & Ponce de Leon, A. (2015). Imputation of missing data in time series for
air pollutants. Atmospheric Environment, 102, 96–104.
https://doi.org/10.1016/j.atmosenv.2014.11.049
Kihoro, J., & Athiany, K. (2013). Imputation of incomplete nonstationary seasonal time
series data. Mathematical Theory and Modeling, 3(12), 142–154.
Koslow, J. A., & Davison, P. C. (2016). Productivity and biomass of fishes in the California
current large marine ecosystem: Comparison of fishery-dependent and -independent time
series (Vol. 17). Environmental Development. https://doi.org/10.1016/j.envdev.2015.08.005
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John
Wiley.
Liu, Y., Dillon, T., Yu, W., Rahayu, W., & Mostafa, F. (2020). Missing value imputation
for industrial IoT sensor data with large gaps. IEEE Internet of Things Journal, 7(8),
6855–6867. https://doi.org/10.1109/JIOT.2020.2970467
Litzow, M., Urban, D., et al. (2009). Fishing through (and up) Alaskan food
webs. Canadian Journal of Fisheries and Aquatic Sciences, 26(2).
https://doi.org/10.1139/F08-207
Liu, J., Kumar, S., & Palomar, D. P. (2019). Parameter estimation of heavy-tailed AR
model with missing data via stochastic EM. IEEE Transactions on Signal Processing,
67(8), 2159–2172. https://doi.org/10.1109/TSP.2019.2899816
Lloret, J., Lleonart, J., & Solé, I. (2000). Time series modelling of landings in
Northwest Mediterranean Sea. ICES Journal of Marine Science, 57.
https://doi.org/10.1006/jmsc.2000.0570
Magare, D., Labde, S., Gofane, M., & Vyawahare, V. (2020). Imputation of missing data in
time series by different computation methods in various data set applications. ITM
Web of Conferences, 32, 03010. https://doi.org/10.1051/itmconf/20203203010
Mahmoudvand, R., & Rodrigues, P. C. (2016). Missing value imputation in time series
using Singular Spectrum Analysis. International Journal of Energy and Statistics,
04(01), 1650005. https://doi.org/10.1142/s2335680416500058
Moritz, S., & Bartz-Beielstein, T. (2017). imputeTS: Time series missing value imputation
in R. R Journal, 9(1), 207–218. https://doi.org/10.32614/rj-2017-009
Moritz, S., Sardá, A., Bartz-Beielstein, T., Zaefferer, M., & Stork, J. (2015). Comparison of
different methods for univariate time series imputation in R. arXiv.
https://arxiv.org/abs/1510.03924
Peacock, S. J., Hertz, E., Holt, C. A., Connors, B., Freshwater, C., & Connors, K. (2020).
Evaluating the consequences of common assumptions in run reconstructions on
Pacific salmon biological status assessments. Canadian Journal of Fisheries and
Aquatic Sciences, 77(12), 1904–1920. https://doi.org/10.1139/cjfas-2019-0432
Phan, T.-T.-H., Caillault, É. P., Lefebvre, A., Bigand, A., & others. (2020). Dynamic time
warping-based imputation for univariate time series data. Pattern Recognition Letters,
139. https://doi.org/10.1016/j.patrec.2017.08.019
Pikitch, E., Santora, C., Babcock, E., Bakun, A., Bonfil, R., Conover, D., & Dayton, P.
(2004). Ecosystem-based fishery management. Science (Washington), 305, 346–347.
Preciado, I., Punzón, A., Gallego, J. L., & Vila, Y. (2006). Using time series methods for
completing fisheries historical series. Boletin del Instituto Espanol de Oceanografia,
22(4), 83–90.
R Core Team. (2021). R: A language and environment for statistical computing. Vienna,
Austria: R Foundation for Statistical Computing. URL https://www.R-project.org/.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Rudd, M. B., & Branch, T. A. (2016). Does unreported catch lead to overfishing?
Fish and Fisheries, 18(2). https://doi.org/10.1111/faf.12181
Schafer, J. L. (1997a). Analysis of incomplete multivariate data. Boca Raton, Florida:
Chapman and Hall, CRC, ISBN 978-0412040610 [p218].
Schafer, J. L. (1997b). Analysis of incomplete multivariate data. London: Chapman and
Hall/CRC.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art.
Psychological Methods, 7(2), 147–177. https://doi.org/10.1037/1082-989X.7.2.147
Selvaraj, J. J., Arunachalam, V., Coronado-Franco, K. V., Romero-Orjuela, L. V., &
Ramírez-Yara, Y. N. (2020). Time-series modeling of fishery landings in the
Colombian Pacific Ocean using an ARIMA model. Regional Studies in Marine Science,
39, 101477. https://doi.org/10.1016/j.rsma.2020.101477
SEPEC, Servicio Estadístico Pesquero Colombiano. (2018). Operación estadística:
Estimación de volúmenes artesanales desembarcados en sitios pesqueros. Ficha
metodológica.
http://sepec.aunap.gov.co/Archivos/Estadisticas/Metodologia/SPC-DIFM15_FICHA_METODOLOGICA.pdf
Shumway, R. H., & Stoffer, D. S. (2011). Time series analysis and its applications: With R
examples (3rd ed., p. 218). New York, New York: Springer-Verlag.
Spratt, M., Carpenter, J., Sterne, J. A. C., Carlin, J. B., Heron, J., Henderson, J., &
Tilling, K. (2010). Strategies for multiple imputation in longitudinal studies.
American Journal of Epidemiology, 172, 478–487.
Stergiou, K., & Christou, E. (1996). Modelling and forecasting annual fisheries catches:
Comparison of regression, univariate and multi-variate time series methods. Fisheries
Research, 25, 105–138.
Wei, R., Wang, J., Su, M., Jia, E., Chen, S., Chen, T., &
Ni, Y. (2018). Missing value imputation approach for mass spectrometry-based
metabolomics data. Nature Scientific Reports, 8, Article 663.
https://doi.org/10.1038/s41598-017-19120-0
Wu, S., Chang, C., & Lee, S. (2015). Time series forecasting with missing values. In 1st Int.
Conf. Ind. Networks Intell. Syst. (pp. 151–156).
https://doi.org/10.4108/icst.iniscom.2015.258269
Yozgatligil, C., Aslan, S., Iyigun, C., & Batmaz, I. (2013). Comparison of missing value
imputation methods in time series: The case of Turkish meteorological data.
Theoretical and Applied Climatology, 112(1–2), 143–167.
https://doi.org/10.1007/s00704-012-0723-x
Zar, J. H. (2009). Biostatistical analysis (5th ed.). Pearson.