Aquaculture and Fisheries xxx (xxxx) xxx

Please cite this article as: Iván F. Benavides, Aquaculture and Fisheries, https://doi.org/10.1016/j.aaf.2021.12.013

2468-550X/© 2021 Shanghai Ocean University. Publishing services by Elsevier B.V. on behalf of KeAi Communications Co. Ltd. This is an open access article

under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Original research article

Assessing methods for multiple imputation of systematic missing data in marine fisheries time series with a new validation algorithm

Iván F. Benavides a, Marlon Santacruz b,*, Jhoana P. Romero-Leiton c,d, Carlos Barreto e, John Josephraj Selvaraj a

a Instituto de Estudios del Pacífico, Universidad Nacional de Colombia, Tumaco Campus, Kilómetro 30-31, Cajapí Vía Nacional Tumaco – Pasto, Tumaco, Nariño, 522020, Colombia

b Ingeniería en Producción Acuícola, Universidad de Nariño, Cra. 22 #30-63, Torre 4 apto. 703, San Juan de Pasto, Nariño, 520004, Colombia

c Facultad de ingeniería, Universidad Cesmag, Pasto, 520004, Colombia

d Datambiente, 520001, Colombia

e Autoridad Nacional de Pesca y Acuicultura (AUNAP), 520001, Colombia

ARTICLE INFO

Keywords:

Data gaps

Fish populations

Artisanal fishing

Time series imputation

ABSTRACT

Time series from fisheries often contain multiple missing data. This is a severe limitation that prevents using the data for research on population dynamics, stock assessment, forecasting, and, hence, decision-making around marine resources. Several methods have been proposed to impute missing data in univariate time series. Still, their performance depends not only on the amount of missing data but also on the data structure. This study compares the performance of twelve imputation methods on the time series of marine fishery landings for six species in the Colombian Pacific Ocean. Unlike other studies, we validate the precision of the imputations in the same target time series that include missing data, using the Known Sub-Sequence Algorithm (KSSA), a novel validation approach that simulates missing data in known sub-sequences of the target time series. The results showed that the best methods for imputation are Seasonal Decomposition with Kalman filters and Structural Models with Kalman filters fitted by maximum likelihood. Results also show that validating the imputation methods with time series other than the target time series leads to wrong choices of imputation method. It is noteworthy that these methods and the validation framework are mainly suited to time series with a non-random distribution of missing data, that is, missing data produced systematically in chunks or clusters with predictable frequency, which are common in marine sciences.

1. Introduction

Marine fisheries represent one of the most important activities for the economy and development of societies worldwide, and the data they provide are fundamental for research and decision-making on marine resources (FAO, 2020). Time series are the most common and useful type of data obtained from marine fish landings, which normally involve monthly or yearly records of various lengths (Farmer & Froeschke, 2015). These time series constitute the basis for studies on population dynamics, stock assessment, food webs, and forecasting, among others (Litzow & Urban, 2009; Pikitch et al., 2004; Selvaraj et al., 2020; Stergiou & Christou, 1996).

However, marine fishery landings time series frequently contain multiple missing data (MD) that hinder their use for research because most of the statistical models in marine science require complete series (De Valpine, 2002; Peacock et al., 2020; Rudd & Branch, 2016; Selvaraj et al., 2020). This often makes time series useless, in spite of the good information they may contain. Causes of MD are diverse, though common examples include discontinuity in data recording, instrument failure, bureaucratic problems in the institutions in charge of data collection and management, and dubious values that must be discarded from the datasets (Kihoro et al., 2013; Afrifa-Yamoah et al., 2020). Some researchers just remove MD and work with the information left in the time series, or fill MD with means or medians (Engels & Diehr, 2003; Wu et al., 2015). Nevertheless, these practices are dangerous because they corrupt the nature of the series, alter the outcome of the statistical models, and bias the parameter estimates (Junger & Ponce de Leon, 2015), leading to wrong conclusions and costly decisions.

* Corresponding author.

E-mail address: ibenavidesm@unal.edu.co (M. Santacruz).

Contents lists available at ScienceDirect

Aquaculture and Fisheries

journal homepage: www.keaipublishing.com/en/journals/aquaculture-and-ﬁsheries

https://doi.org/10.1016/j.aaf.2021.12.013

Received 24 June 2021; Received in revised form 24 November 2021; Accepted 24 December 2021


Imputation methods are a reasonable solution to handle multiple MD, which helps to prevent misleading conclusions or to archive the issues. These methods fill in MD according to different criteria (Afrifa-Yamoah et al., 2020; Rubin, 1987; Spratt et al., 2010). However, imputation in univariate time series such as marine fishery landings, which do not include accompanying independent covariates, poses severe challenges, since the only information available for the imputation is the values of the time series itself (Moritz et al., 2015). In fisheries, where the univariate landings somehow reflect a natural behavior of the ocean in terms of periodic high and low fish production, time dependency structures such as trend, seasonality, and cycles are of high importance for MD imputation. Therefore, these time dependency structures must be captured by any method that intends to perform reliable imputations of multiple MD in univariate fishery time series (Shumway & Stoffer, 2011, p. 218).

Many methods that meet these requirements are being proposed and continuously implemented as libraries in different statistical programming software. Some of these algorithms, such as Kalman filters (Afrifa-Yamoah et al., 2020), structural models (Moritz & Bartz-Beielstein, 2017), ARIMA models (Kihoro et al., 2013; Magare et al., 2020), Singular Spectrum Analysis (Mahmoudvand & Canas-Rodrigues 2016) and others, have been successfully used in fields other than fisheries, such as engineering, energy, meteorology, econometrics, medicine, and biochemistry. Some of these studies have compared the efficiency of different imputation methods on other time series structures (Demirhan & Renwick, 2018; Engels & Diehr, 2003; Hassani et al., 2019; Huque et al., 2018; Liu et al., 2020; Wei et al., 2018; Yozgatligil et al., 2013), to show the importance of choosing the best-suited method for a specific time series. For the fishery and marine ecology sciences, some studies have covered the issue of MD in univariate time series by using imputation methods such as linear interpolation (Coro et al., 2016), structural models (Selvaraj et al., 2020), ARIMA (Preciado et al., 2006), transfer functions (Preciado et al., 2006) and expansion factors (Peacock et al., 2020). However, these studies have not focused on comparing different imputation methods but only on using them to impute a few values before proceeding to other analyses such as forecasting or stock assessment. A study conducted by Genolini et al. (2013) used a time series from an automatic pattern recognition system applied to the monitoring of fish migration and compared the efficiency of 12 imputation methods with an MD simulation framework. However, this study was not aimed at fishery or marine sciences specifically; it only used this fish dataset as an example.

One very important observation that arises from reviewing several published papers on method comparison for multiple MD imputation in disciplines other than fishery is that the efficiency of each method is dataset-dependent (Schaffer, 1997; Yozgatligil et al., 2013; Schafer & Graham, 2002). This means that each method performs differently depending on the structure of each time series; in other words, there is no single method suited to all types of data. This needs to be taken seriously into account when attempting to choose an appropriate method for multiple MD imputation in marine fishery landings time series, which are heterogeneous in structural elements such as length, seasonality, trend, autocorrelation and stochastic components (Coro et al., 2016; FAO, 2017; Koslow & Davison, 2016; Selvaraj et al., 2020). Thus, the best approach is always to compare several imputation methods for the same target time series (time series of interest), perform validations, and finally decide on the best method.

Things get more complicated when considering the type of MD in time series. For example, when data are Missing Completely at Random (MCAR), imputations are easier than when data are Not Missing at Random (NMAR) (Little & Rubin, 1987). This is because NMAR usually implies missing observations in chunks that might be big enough to hide the structural components of the time series (Schafer, 1997; Donders et al., 2006; Beck et al., 2018). However, if there is enough temporal structure in the series, imputation methods that leverage characteristics such as seasonality, cycles, and autocorrelation will have an advantage over more simplistic methods. Hence, different methods will impute multiple MD in marine fisheries time series with different precision, and accuracy will depend on the number and size of MD chunks (Beck et al., 2018).

From the above information, we can highlight three important points that could determine the efficiency of methods for multiple MD imputation in marine fishery time series: 1) the type of method, 2) the structure of the time series, and 3) the type and size of MD. This implies that, to choose the best imputation method for a particular time series, the validation of methods is a keystone of the process. Validation of imputation methods is commonly done using complete data (without MD) other than the target time series (Hassani et al., 2019; Moritz et al., 2015; Phan et al., 2020) (classical validation). However, for the reasons explained before, this kind of validation can lead to wrong choices of imputation methods. For example, the target time series may be seasonal while the validation is performed with non-seasonal data, the target time series may have a trend while the validation data are stationary, or the target time series may be cyclical while the validation data are not. Such mismatches can lead to wrong choices of imputation methods, unreliable imputed values, and therefore to corrupted time series and unrealistic conclusions if they are used in statistical analysis and decision-making. Additionally, if one wishes to validate imputation methods with data other than the target time series, it could be tough to find available complete data from marine fishery landings with a structure very similar to that of the target time series.

Therefore, a possible solution for the problem described above is to conduct validations on the same target time series, ensuring that the method selected for making imputations is best suited for the data of interest. Although this raises the well-known restriction of making imputations in series that already contain MD (Phan et al., 2020), here we propose and develop a validation algorithm that overcomes this obstacle by taking advantage of sub-sequences within the time series without MD that hold enough length and structure to simulate new MD, impute it with different methods, and compare the results with performance metrics in order to select a method that reduces the overall error of imputations (see further details in Materials and Methods). We assess and compare the imputation efficiency of twelve methods for multiple MD imputation on marine fishery landings time series of six species from the Colombian Pacific Ocean, and validate the results with the Known Sub-Sequence Algorithm (KSSA) in order to find the best method for each time series. To evaluate the efficiency of KSSA, we compare its results with classical validations performed using real-life time series from different data sources.

2. Materials and methods

2.1. Data source

Data encompass monthly time series of fishery landings (tons) for eight years and five months (January 2012 to May 2020) in the Colombian Pacific Ocean (CPO). These time series were gathered from the website of the Colombian fisheries statistical service (Servicio Estadístico Pesquero Colombiano, SEPEC, by its acronym in Spanish), which is the main tool of the national fisheries and aquaculture authority (Autoridad Nacional de Pesca y Acuicultura, AUNAP, by its acronym in Spanish) to provide public domain statistics of fishery production. We acquired the time series of the following six species: catfish (Bagre spp), white shrimp (Litopenaeus occidentalis), Pacific red snapper (Lutjanus peru), Mexican barracuda (Sphyraena ensis), seer fish (Scomberomorus spp), and cachema weakfish (Cynoscion phoxocephalus). These data belong to the monthly reports of marine fishery landings made by artisanal boats that fish from the coastline up to two miles out to sea (Fig. 1), and include a quality control certified by the National Department of Statistics in Colombia (Departamento Administrativo Nacional de Estadística, DANE by its acronym in Spanish). This quality control emerges from a stratified and probabilistic sampling in space and time of each fishing species, which operates within homogeneous groups of fishing gear, stratum, and ecological characteristics of the fishing grounds (SEPEC 2018).

From the 148 species that SEPEC monitors in the CPO, we chose these six species because they are the most important fishery products for regional commerce, that is, they account for more than 80% of the economic income of artisanal fisheries (Duarte et al., 2019). Therefore, sound analysis and decision-making around these important resources based on their time series first requires reliable imputations, so as to start with complete series.

2.2. Data exploration and preparation

The time series for each species in SEPEC are available by municipality. There are 13 main municipalities across the CPO where fishing landings are reported. The data were downloaded for each municipality and summed up to obtain a series for the whole CPO for each species, from January 2012 to May 2020, with a monthly resolution. The structure of the MD was analyzed in terms of the number of MD, gap size (number of MD per gap), gap position across the time series, and the distribution of MD within and between gaps.
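The gap descriptors listed above (number of MD, gap size, gap position) can be illustrated with a minimal Python sketch. This is not the authors' R code; the function names `find_gaps` and `summarize_gaps` and the short `landings` series are hypothetical and used only for illustration, with `None` marking a missing month.

```python
import math


def find_gaps(series):
    """Return (start_index, length) for every run of consecutive missing values."""
    gaps = []
    start = None
    for i, value in enumerate(series):
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        if missing and start is None:
            start = i  # a new gap opens here
        elif not missing and start is not None:
            gaps.append((start, i - start))  # the gap closed just before i
            start = None
    if start is not None:  # the series ends inside a gap
        gaps.append((start, len(series) - start))
    return gaps


def summarize_gaps(series):
    """Summarize the missing-data structure of one univariate series."""
    gaps = find_gaps(series)
    sizes = [length for _, length in gaps]
    total = sum(sizes)
    return {
        "total_md": total,
        "percent_md": 100.0 * total / len(series),
        "n_gaps": len(gaps),
        "gap_size_range": (min(sizes), max(sizes)) if sizes else None,
        "gap_positions": [start for start, _ in gaps],
    }


# Hypothetical 12-month series with three gaps (illustrative values only).
landings = [12.0, None, None, 9.5, 10.1, None, 8.7, 11.2, None, None, None, 7.9]
summary = summarize_gaps(landings)
print(summary)
```

A summary of this kind, computed per species, corresponds to the type of information reported in the columns of Table 1.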

Fig. 1. The red stripe on the map indicates the coastal area of artisanal fishing for which SEPEC records the fishery landings time series used in this study. EEZ = Exclusive Economic Zone.

Periodogram analyses (R-package ‘TSA’) (Chan & Ripley, 2020), Mann-Kendall tests (R-package ‘trend’) (Thorsten 2020) and Seasonal Decomposition by moving averages (R-package ‘stats’) (R Core Team, 2021) were used to evaluate the seasonality and trend of the time series. Runs tests (R-package ‘snpar’) were performed to check if the distribution of MD in each univariate time series occurred at random, and Little’s test (R-package ‘MissMech’) (Jamshidian et al., 2014) was performed to check if data were MCAR in the multivariate dataset of all species. Rejection of the null hypotheses of these tests suggests a systematic process generating NMAR-type MD across the series. Table 1 summarizes the structure of MD for each series.
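As an illustration of how a runs test can flag clustered missingness, the sketch below applies the normal approximation of the Wald-Wolfowitz runs test to a missing/observed indicator. It is a simplified Python stand-in, not the ‘snpar’ implementation used in the study; the function name `runs_test` and the example indicator are assumptions.

```python
import math


def runs_test(indicator):
    """Wald-Wolfowitz runs test (normal approximation) on a boolean sequence.

    Returns (number_of_runs, z_statistic, two_sided_p_value).
    """
    n1 = sum(1 for x in indicator if x)   # e.g. missing months
    n2 = len(indicator) - n1              # e.g. observed months
    if n1 == 0 or n2 == 0:
        raise ValueError("both categories must be present")
    # A new run starts whenever two neighbours differ.
    runs = 1 + sum(1 for a, b in zip(indicator, indicator[1:]) if bool(a) != bool(b))
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return runs, z, p_value


# Hypothetical indicator: four blocks of 6 observed months followed by 4 missing months.
is_missing = ([False] * 6 + [True] * 4) * 4
n_runs, z, p = runs_test(is_missing)
print(n_runs, round(z, 2), p)
```

Fewer runs than expected under randomness (a strongly negative z) is the pattern that chunked, systematic gaps would produce.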

2.3. Methods for multiple missing data imputation

Since no set of independent predictors is available for the study area to attempt imputation with multivariate approaches, we relied exclusively on attributes of the time series themselves, such as trend, seasonality and time-dependency structures. This is one of the constraints of multiple imputation when using univariate time series. However, when there is enough data in the series, it is possible to produce reliable imputations by capturing these time-dependency attributes. The published literature on multiple imputation for time series offers different methods to solve this issue, whose performance and robustness depend on the structure and quality of the data and on the structure of MD across the time series.

Twelve methods for multiple imputation in time series were used from three R-packages. From package ‘imputeTS’ (Moritz & Bartz-Beielstein, 2017) we used: 1) Structural models with Kalman filters fitted by maximum likelihood (STRUCT); 2) Autoregressive Integrated Moving Average in State Space (ARIMA_SS); 3) Linear Interpolation (LIT); 4) Spline Interpolation (SIT); 5) Stineman Interpolation (STIT); 6) Simple Moving Average (SMA); 7) Linear Weighted Moving Average (LWMA); 8) Exponential Weighted Moving Average (EWMA); 9) Last Observation Carried Forward (LOCF) and 10) Seasonal Decomposition with Kalman filters (SEADEC). From package ‘forecast’ we used: 11) Linear Interpolation with Trend and Seasonal Decomposition (LTS) (Hyndman & Khandakar, 2008). From package ‘Rssa’, we used 12) Singular Spectrum Analysis (SSA) (Golyandina & Korobeynikov, 2014).

SEADEC, STRUCT, ARIMA, LTS and SSA offer more sophisticated approaches for missing data imputation in univariate series because they capture time-dependency attributes such as trend, seasonality and cycles. However, they require longer computation times. SIT, STIT, LIT and LOCF are based on inter-attribute correlations, whilst SMA, EWMA and LWMA are half-way between both.
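To make the simpler end of this list concrete, the following Python sketch shows simplified versions of three of the methods (LOCF, LIT and SMA). These are illustrative re-implementations under stated assumptions, not the ‘imputeTS’ functions: edge handling (leading gaps, window widening) is a simplification chosen here, and `None` marks a missing value.

```python
def locf(series):
    """Last observation carried forward; a leading gap takes the first observed value."""
    first = next(v for v in series if v is not None)
    out, last = [], first
    for v in series:
        last = last if v is None else v
        out.append(last)
    return out


def linear_interpolation(series):
    """Straight line between the nearest observed neighbours on each side."""
    known = [i for i, v in enumerate(series) if v is not None]
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:          # gap at the start: copy the next observation
            out[i] = series[right]
        elif right is None:       # gap at the end: copy the previous observation
            out[i] = series[left]
        else:
            w = (i - left) / (right - left)
            out[i] = series[left] + w * (series[right] - series[left])
    return out


def simple_moving_average(series, k=2):
    """Mean of observed values within k steps on each side of the missing point.

    The window is widened when it contains no observed value (an assumption
    of this sketch).
    """
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        width, window = k, []
        while not window and width <= len(series):
            window = [x for x in series[max(0, i - width): i + width + 1] if x is not None]
            width += 1
        out[i] = sum(window) / len(window) if window else None
    return out


x = [10.0, None, None, 16.0, 14.0, None, 12.0]  # illustrative values only
print(locf(x))
print(linear_interpolation(x))
print(simple_moving_average(x, k=2))
```

None of these three uses seasonality, which is why methods such as SEADEC or STRUCT can be expected to behave differently on strongly seasonal landings.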

2.4. Simulation of MD, comparison and validation of methods

When comparing methods for MD imputation in univariate time series, the typical approach is to take complete series (without MD), simulate MD (MCAR or NMAR), perform the imputations and compare them with the true values. Then a measure of error such as the Root Mean Squared Error (RMSE) or the Mean Absolute Scaled Error (MASE) is calculated to assess the difference between true and imputed values. This process sheds light on the best method for a particular dataset. For validation of methods, previous studies have used real complete series from sources other than the target series of interest, as well as simulated series. This is due to the claim that validation cannot be performed in series with true MD (Phan et al., 2020).
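The two error measures can be sketched in Python as follows. The MASE version shown scales the mean absolute error by the mean absolute one-step naive difference of the true series, which is one common definition; whether it matches the exact variant used in the study is an assumption. The example values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error between true and imputed values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mase(actual, predicted, step=1):
    """Mean Absolute Scaled Error.

    MAE of the imputation divided by the MAE of a naive predictor that
    repeats the value observed `step` periods earlier.
    """
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    naive_mae = sum(abs(actual[i] - actual[i - step]) for i in range(step, n)) / (n - step)
    return mae / naive_mae


true_values = [10.0, 12.0, 14.0, 13.0, 15.0]     # illustrative values only
imputed_values = [10.0, 11.0, 14.0, 15.0, 15.0]
print(rmse(true_values, imputed_values))
print(mase(true_values, imputed_values))
```

RMSE is expressed in the units of the series (tons here), whereas MASE is scale-free, which makes it easier to compare across species with different landing volumes.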

SEPEC does not offer any complete time series on which to simulate MD, and since the success of the imputation depends on characteristics of the data itself, we wanted to use the same SEPEC time series with true MD to perform algorithm validation, in order to reach more reliable conclusions about their effectiveness. The next section describes a proposed method to simulate MD in target time series already containing true MD and thereby perform algorithm validation (see Fig. 2).

2.5. Known Sub-Sequence Algorithm (KSSA)

Known Sub-Sequence Algorithm (KSSA) (Fig. 2) is a method to validate the efficiency of imputation algorithms that was developed for this research in order to support the selection of the best algorithm to impute MD in the marine fishery landing time series from SEPEC. It consists of simulating MD only in known sub-sequences of the target time series already containing true MD. KSSA gives the advantage of more realistic validations of imputation algorithms, since the precision of imputation methods is dataset-dependent (Genolini et al., 2013). KSSA takes a target time series of interest for imputation that contains true MD and splits it into segments according to a pre-established useful criterion (e.g., years or months). In this study, the criterion was splitting it by year, because of the dominant annual seasonality (see Table 1). Then, a first imputation of true MD is performed with a particular imputation algorithm in order to obtain a complete series on which to simulate new MD. New MD is simulated in known sub-sequences of the target time series that do not contain true MD, using a simulation window that randomly slides through the sub-sequence within each segment. Simulated MD is then imputed with the same algorithm used previously, and the obtained values are statistically compared to real values. This process is repeated for several runs by changing the position and size of the simulation window across the known sub-sequences within each segment. Thus, large amounts of MD of varying sizes and positions can be generated across the target time series (TTS) through iterative programming. KSSA is performed with different imputation algorithms for comparative purposes.

We wrote code in RStudio (R Core Team, 2021) that performed the following eight steps: 1) imputed true MD in the original target time series of a single species with a single imputation algorithm; 2) divided the time series into 8 equal segments (one for each year); 3) set a moving and expanding window within each segment that simulated MD chunks of varying sizes (1–10) in known sub-sequences (avoiding previously imputed data in true MD); 4) imputed simulated MD chunks in each segment with a single algorithm and calculated RMSE and MASE (R-package ‘Metrics’) between the actual and the imputed time series; 5) iterated steps 1 to 4 10,000 times with bootstrapping; 6) repeated steps 1 to 5 for all algorithms; 7) repeated steps 1 to 6 for all species; and 8) stored all the results in a final data frame containing MD size, RMSE and MASE for the 10,000 runs of each algorithm and species. Fig. 3 clarifies this process in a flowchart.
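The steps above can be sketched in Python as follows. This is a minimal illustration of the idea, not the authors' R code: it uses two simple stand-in imputers (LOCF and linear interpolation), scores only the simulated positions with RMSE, runs far fewer iterations, and relies on a synthetic seasonal series. All function names, parameters and data here are assumptions.

```python
import math
import random


def locf(series):
    """Last observation carried forward; a leading gap takes the first observed value."""
    first = next(v for v in series if v is not None)
    out, last = [], first
    for v in series:
        last = last if v is None else v
        out.append(last)
    return out


def linear(series):
    """Linear interpolation between the nearest observed neighbours."""
    known = [i for i, v in enumerate(series) if v is not None]
    out = list(series)
    for i, v in enumerate(series):
        if v is None:
            left = max((k for k in known if k < i), default=None)
            right = min((k for k in known if k > i), default=None)
            if left is None:
                out[i] = series[right]
            elif right is None:
                out[i] = series[left]
            else:
                w = (i - left) / (right - left)
                out[i] = series[left] + w * (series[right] - series[left])
    return out


def known_runs(segment, truly_missing):
    """Split a segment's indices into runs of consecutive originally observed points."""
    runs, current = [], []
    for i in segment:
        if i in truly_missing:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(i)
    if current:
        runs.append(current)
    return runs


def kssa_run(series, imputer, rng, segment_length=12, max_window=10):
    """One KSSA iteration: returns (number of simulated MD, RMSE on those points)."""
    truly_missing = {i for i, v in enumerate(series) if v is None}
    complete = imputer(series)                 # step 1: first imputation of true MD
    masked, simulated = list(complete), []
    for start in range(0, len(series), segment_length):   # step 2: yearly segments
        segment = range(start, min(start + segment_length, len(series)))
        runs = known_runs(segment, truly_missing)
        if not runs:
            continue
        run = rng.choice(runs)                 # step 3: random window in a known run
        size = rng.randint(1, min(max_window, len(run)))
        offset = rng.randint(0, len(run) - size)
        for i in run[offset:offset + size]:
            masked[i] = None
            simulated.append(i)
    reimputed = imputer(masked)                # step 4: impute simulated MD and score
    squared = [(complete[i] - reimputed[i]) ** 2 for i in simulated]
    return len(simulated), math.sqrt(sum(squared) / len(squared))


def kssa(series, imputers, n_runs=200, seed=1):
    """Steps 5-6: repeat the simulation for every candidate imputer (mean RMSE)."""
    results = {}
    for name, imputer in imputers.items():
        rng = random.Random(seed)              # same windows for every imputer
        scores = [kssa_run(series, imputer, rng)[1] for _ in range(n_runs)]
        results[name] = sum(scores) / len(scores)
    return results


# Synthetic 8-year monthly series with a 3-month gap at the start of each year.
series = [50 + 10 * math.sin(2 * math.pi * t / 12) + 0.1 * t for t in range(96)]
for year_start in range(0, 96, 12):
    for i in range(year_start, year_start + 3):
        series[i] = None

results = kssa(series, {"LOCF": locf, "LIT": linear})
print(results)
```

Seeding the generator identically for each imputer makes every method face the same simulated gaps, which is a design choice of this sketch to keep the comparison paired.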

As a result of this simulation, new simulated MD chunks varied in size within each segment, producing a range of 10–70 MD across the whole target time series. The range of MD sizes of interest for this research goes from 20 to 40, since this is the range of MD in the SEPEC time series. However, we kept MD sizes above 40 for the analysis, in order to show a broad range of MD sizes that allows the visualization of the increase in errors. Nonetheless, it is worth keeping in mind that imputations for MD above 40 are increasingly unreliable. MD chunks also varied in position within segments and across the whole time series. However, position was not considered for further statistical analysis and interpretation in this study.

Table 1
Structure of Missing Data (MD), trend and seasonality in the SEPEC time series from Jan-2012 to May-2020. NMAR-chunks were identified when the runs and Little tests were significant.

| Species | Total MD | Total MD gaps | Range of MD in gaps | Gap distribution | Trend and seasonality after imputation with SD and ST |
|---|---|---|---|---|---|
| B. pinnimaculatus | 24 (23%) | 8 | 1 to 6 | NMAR-chunks | it, sea(12) |
| L. occidentalis | 23 (22%) | 8 | 1 to 6 | NMAR-chunks | sta, sea(12) |
| L. peru | 28 (27%) | 8 | 1 to 6 | NMAR-chunks | it, sea(12) |
| S. ensis | 33 (32%) | 8 | 1 to 10 | NMAR-chunks | it, sea(12) |
| S. sierra | 24 (23%) | 8 | 1 to 6 | NMAR-chunks | it, sea(12) |
| C. phoxocephalus | 26 (25%) | 9 | 1 to 6 | NMAR-chunks | it, sea(12) |

Abbreviations: it: increasing trend; sta: stationary; sea: seasonal. Months of seasonality appear in brackets.

Fig. 2. Graphical representation of the Known Sub-Sequence Algorithm (KSSA) for validating and comparing algorithms for MD imputation in time series data. A) Target Time Series (TTS) with true MD. B) A first imputation is carried out with algorithm x in order to obtain a full time series prior to simulation of MD and validation of imputation algorithms. C) Time series is split in segments and a sliding and expanding window is set within each segment to simulate MD in known sub-sequences, avoiding previously imputed values. D) TTS with new simulated MD. E) Simulated MD is imputed with algorithm x. F) Imputed values with algorithm x are compared to real values from known sub-sequences. The process is repeated with different imputation algorithms for comparative purposes.

Fig. 3. Flowchart showing the action steps followed in this study for the simulation of MD in marine fishery landing time series and the validation of imputation algorithms with the Known Sub-Sequence Algorithm. TTS: Target Time Series.

2.6. Validation of imputation methods with different real-life time series

We used five real-life time series (complete data without MD) from different sources to perform a classical validation of the imputation methods. By ‘classical’, we refer to validation of imputation methods with complete time series different from the target time series. We compared the results of these classical validations against the results of KSSA. All these time series were originally longer than our target time series, so they were cut to match the 8-year span with monthly resolution from 2012 to 2020 of the SEPEC time series. Simulation of MD in these time series was performed in a NMAR-chunk fashion following steps 2 to 8 described in Fig. 3. MD chunks were allowed to vary from 1 to 10 individual MD and were located in the same positions as in the target time series (first semester of each year). This configuration allowed us to keep a reasonable framework for realistic comparisons between classical validation and KSSA. Time series for this section are described in Table 5 and comprise: 1) the CPUE (Catch per Unit of Effort) of the yellowfin tuna (Thunnus albacares), 2) the skipjack tuna (Katsuwonus pelamis), and 3) the bigeye tuna (Thunnus obesus) in the Eastern Tropical Pacific from the purse seine records of the Inter-American Tropical Tuna Commission (IATTC); 4) the average surface sea

Table 2

Resulting RMSE values from the MD simulation and KSSA validation in the SEPEC shery landings time series with the 12 imputation algorithms. Cells in bold

represent the lowest values from contrasts among algorithms (row-wise). Results are shown by ranges of increasing 10 MD from 1 to 70.

MD size =1 to 10

SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF

B. pinnimaculatus 1.50

a

1.90

b

1.47

c

1.82

b

1.91

b

1.96

b

1.39

d

1.93

b

1.06

b

1.32

c

1.19

c

2.53

c

L. occidentalis 1.69

a

1.01

b

1.40

c

1.24

b

1.88

b

1.31

b

1.43

d

1.58

b

1.71

b

1.48

b

1.41

b

2.66

d

L. peru 0.12

b

0.10

b

0.23

a

0.26

a

1.11

b

1.46

c

2.94

d

1.28

c

0.42

a

0.34

a

0.33

a

2.42

a

S. ensis 1.52

a

1.12

b

1.46

b

1.00

c

1.82

b

1.35

b

1.26

d

1.39

b

1.75

b

1.46

b

1.97

b

3.00

d

S. sierra 1.06

a

1.02

b

1.47

b

1.10

b

1.54

b

1.15

b

1.91

c

1.47

b

1.32

b

1.25

b

1.73

b

3.40

b

C. phoxocephalus 1.43

a

1.42

a

1.58

a

1.64

a

4114

b

1.58

a

1.49

b

1.57

a

1.01

b

1.96

b

1.89

b

3.24

b

MD size =10 to 20

SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF

B. pinnimaculatus 2.02

a

3.07

b

3.47

b

4.15

c

4.33

c

3.30

b

4.92

c

4.38

c

4.85

c

4.06

c

2.77

b

2.80

b

L. occidentalis 2.59

a

4.60

b

10.11

c

10.61

c

5.37

b

6.73

b

8.20

c

3.84

b

8.60

c

5.90

d

5.32

d

5.60

d

L. peru 1.06

a

1.67

b

1.90

b

1.68

b

1.42

b

2.06

c

1.86

b

2.13

c

2.02

c

2.45

c

1.51

b

2.02

c

S. ensis 1.53

a

3.46b 2.97

b

2.05

b

2.27

b

2.09

b

2.19

b

2.26

b

2.05

b

2.92

b

3.30

c

3.15

c

S. sierra 6.05

a

9.75

b

9.12

b

9.52

b

9.06

b

8.00

b

11.83

c

9.66

b

10.27

c

8.49

b

7.97

b

12.60

c

C. phoxocephalus 3.95

b

3.69

b

3.20

b

2.59

b

8.29

c

3.18

a

5.87

d

2.73

e

3.41

e

3.94

e

4.05

d

5.05

d

MD Size =20 to 30

SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF

B. pinnimaculatus 4.50

b

4.90

b

5.47

b

4.82

b

4.91

b

4.96

b

8.39

a

4.93

b

5.06

b

5.32

b

5.19

b

5.53

b

L. occidentalis 7.87

a

8.34

a

11.47

b

8.72

a

11.44

b

9.32

c

15.99

d

9.9

e

9.72

e

9.35

e

9.78

e

14.23

d

L. peru 2.19

d

2.10

e

2.23

d

1.99

e

3.11

f

3.46

b

4.94

a

3.28

h

2.42

c

2.34

c

2.33

i

2.66

i

S. ensis 3.52

b

4.12

b

4.46

b

3.00

b

3.82

b

3.35

b

8.26

a

3.39

b

3.75

b

4.46

b

3.97

b

6.00

ab

S. sierra 10.06

b

11.02

b

11.47

b

13.10

b

12.54

b

11.15

b

19.91

a

11.47

b

11.32

b

11.25

b

10.73

b

13.40

b

C. phoxocephalus 4.43

c

4.42

c

4.58

c

4.64

c

6.14

b

4.58

c

7.49

a

4.57

c

5.01

c

4.96

c

4.89

c

6.24

b

MD Size =30 to 40

SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF

B. pinnimaculatus 5.70

a

5.87

b

6.26

c

5.97

d

6.77

e

5.92

d

10.57

f

6.23

d

6.31

c

6.40

g

6.29

c

7.58

h

L. occidentalis 9.52

a

9.97

a

13.77

b

10.47

c

14.32

b

11.55

d

22.47

e

11.96

f

11.64

g

11.83

f

11.75

f

13.88

b

L. peru 2.53

a

2.57

a

2.84

b

2.54

a

3.76

c

3.90

e

6.97

d

3.90

e

3.06

f

3.31

f

2.85

b

5.68

g

S. ensis 4.46

a

5.20

b

5.66

c

5.56

c

5.35

b

4.96b 12.07

d

5.02

b

5.09

b

5.42

c

4.87

b

6.66

e

S. sierra 13.00

a

12.93

a

13.92

b

15.18

c

15.43

d

13.86

e

25.64

f

14.26

f

14.27

d

13.85

a

13.72

a

16.38

g

C. phoxocephalus 5.36

a

5.71

b

5.88

c

5.71

b

6.55

d

5.72

b

10.49

e

5.84

b

6.19

f

6.11

g

6.12

g

6.87

h

MD Size =40 to 50

SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF

B. pinnimaculatus 6.55

a

6.85

b

7.29

b

7.05

b

8.18

c

6.89

b

14.40

d

7.15

b

7.49

e

7.54

d

7.51

d

8.55

e

L. occidentalis 10.99

a

11.76

a

16.13

b

12.23

c

15.96

d

13.93

e

29.68

f

14.22

e

13.63

g

13.43

g

13.78

e

17.16

b

L. peru 3.15

a

3.28

b

3.45

b

3.16

a

4.26

c

4.29

c

10.27

d

4.34

c

3.79

e

3.91

f

3.54

b

6.21

g

S. ensis 5.42

a

6.12

b

6.72

c

6.35

b

6.60b 6.09b 15.44f 6.14b 6.03b 6.48b 5.89b 7.38b

S. sierra 15.11

a

15.32

b

17.07

c

17.45

d

20.91

e

16.70

c

34.75

f

17.19

g

16.72

c

16.29

c

16.54

c

19.30

h

C. phoxocephalus 6.36

a

6.68

b

6.95

c

6.78

c

7.78

d

6.78

c

13.02

e

6.94

c

7.44

f

7.23

g

7.29

g

8.13

h

MD Size =50 to 60

SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF

B. pinnimaculatus 7.52

a

7.96

b

8.39

b

8.12

b

9.88

c

7.89

b

16.30

d

8.20

b

8.52

e

8.41

b

8.42

b

9.66

c

L. occidentalis 13.02

a

13.32

a

17.99

b

14.35

c

18.11

b

15.88

d

40.19f 16.77

g

15.75

d

15.23

h

15.62

d

18.82

i

L. peru 3.61

a

3.82

b

4.44

c

4.04

c

4.69

d

4.65

e

12.97

f

4.67

d

4.42

c

4.51

e

4.23

c

7.20

g

S. ensis 6.33

a

6.95

b

7.58

c

7.31

d

8.10

e

7.22

f

18.62

g

7.44

d

7.15

f

7.31

d

6.99

b

8.27

h

S. sierra 17.77

a

16.98

b

19.31

c

20.12

d

25.51

e

19.41

d

41.33

f

20.28

g

18.95

c

18.48

h

18.74

h

21.24

i

C. phoxocephalus 7.22

a

7.42

b

8.23

c

7.70

d

8.90

e

7.82

d

16.03

f

8.24

c

8.50

g

8.40

g

8.27

c

8.90

e

MD Size = 60 to 70
Species SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 8.53a 8.74a 9.85b 10.02b 12.45c 8.62a 19.53d 9.29a 9.38a 9.40a 9.64b 10.53b
L. occidentalis 15.28a 15.06a 20.17a 16.42a 19.84a 19.12a 44.77b 18.12a 17.13a 16.81a 17.94a 20.13a
L. peru 4.46c 4.71c 5.06c 4.92c 6.32d 5.29c 14.66a 5.34c 5.16c 5.33c 5.12c 8.36b
S. ensis 7.81b 7.60b 8.62b 8.62b 10.07b 8.25b 25.10a 8.27b 8.06b 7.86b 8.13b 9.11b
S. sierra 19.74a 18.07b 23.28c 24.31d 29.53e 23.78c 35.90f 23.52c 20.93g 20.54g 21.67g 24.03c
C. phoxocephalus 8.38b 8.45b 8.85b 9.08b 10.39b 8.90b 25.52a 8.76b 9.01b 9.44b 9.16b 9.92b

Different superscript letters indicate statistically significant differences in RMSE values between imputation methods at the 0.05 level (row-wise comparisons), according to the Tukey HSD tests. Same superscript letters indicate statistically non-significant differences.


temperature in the Colombian Pacific Ocean (SST-CPO), and 5) the average concentration of Chlorophyll-a in the Colombian Pacific Ocean (CHLA-CPO) from the Marine Copernicus remote sensing product Global Analysis Forecast Bio 001-0.28.

2.7. Data analysis

RMSE and MASE were compared among algorithms using simple analysis of variance (ANOVA) plus Tukey HSD tests (Zar, 2009) for balanced data (R-package ‘agricolae’). Ordinary Least Squares (OLS) regressions were fitted to estimate the slope of RMSE and MASE as a function of MD size. Slopes among algorithms were compared using

Table 3

Resulting MASE values from MD simulation and KSSA validation in the SEPEC fishery landings time series with 12 imputation algorithms and six species. Cells in bold represent the lowest values from contrasts among algorithms (row-wise). Results are shown by ranges of 10 MD, from 1 to 70.

MD Size = 1 to 10
Species SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.05a 0.09a 0.16b 0.11a 0.06a 0.11a 0.24c 0.10b 0.08a 0.11b 0.11b 0.17b
L. occidentalis 0.02a 0.07b 0.15c 0.01d 0.13c 0.09c 0.30e 0.10c 0.09b 0.19c 0.19c 0.24f
L. peru 0.02a 0.01a 0.10a 0.01b 0.10a 0.37c 0.28c 0.33c 0.31c 0.39c 0.31c 0.31c
S. ensis 0.01a 0.05a 0.08a 0.04a 0.02b 0.03a 0.19c 0.16c 0.15c 0.16c 0.14c 0.27d
S. sierra 0.06a 0.10b 0.13b 0.24c 0.09a 0.13b 0.27c 0.13b 0.29d 0.25d 0.29d 0.28d
C. phoxocephalus 0.07a 0.08a 0.11b 0.08a 0.15b 0.19b 0.23c 0.20c 0.20c 0.25c 0.21c 0.22c

MD Size = 10 to 20
Species SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.16a 0.11a 0.20b 0.21b 0.22b 0.18b 0.18b 0.19b 0.20b 0.20b 0.19b 0.19b
Cynoscion spp 0.10a 0.15a 0.20b 0.12c 0.16a 0.16a 0.22d 0.20d 0.21d 0.22d 0.24d 0.24d
L. peru 0.08a 0.13b 0.27c 0.14b 0.16b 0.18b 0.22c 0.26c 0.26c 0.26c 0.14b 0.16b
S. ensis 0.06a 0.17b 0.15b 0.15b 0.19b 0.16b 0.17b 0.16b 0.18b 0.23c 0.27c 0.27c
S. sierra 0.18a 0.24b 0.23b 0.23b 0.19c 0.18c 0.24b 0.23b 0.23b 0.28b 0.19c 0.20c
C. phoxocephalus 0.12a 0.20a 0.20a 0.14b 0.40c 0.17a 0.30c 0.15a 0.16a 0.20a 0.20a 0.20a

MD Size = 20 to 30
Species SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.23a 0.29b 0.36c 0.31d 0.28a 0.31d 0.44e 0.30d 0.28f 0.31d 0.31d 0.37g
L. occidentalis 0.21a 0.26a 0.36b 0.23a 0.33b 0.29c 0.48d 0.31e 0.31e 0.29e 0.31e 0.5f
L. peru 0.22a 0.20a 0.30a 0.21b 0.30a 0.57c 0.48c 0.53c 0.31d 0.29a 0.31d 0.31d
S. ensis 0.18a 0.25b 0.28b 0.24b 0.21a 0.23b 0.39c 0.26b 0.20a 0.26b 0.24b 0.47d
S. sierra 0.26a 0.30a 0.33b 0.44c 0.29a 0.33d 0.47e 0.33d 0.29a 0.30a 0.29a 0.38e
C. phoxocephalus 0.27a 0.28a 0.31b 0.28a 0.35c 0.29a 0.43d 0.30a 0.30a 0.30a 0.30a 0.42d

MD Size = 30 to 40
Species SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.39a 0.41b 0.47c 0.44d 0.41a 0.42b 0.66e 0.44f 0.40a 0.42b 0.42b 0.58g
L. occidentalis 0.29a 0.35a 0.54b 0.32c 0.47b 0.42d 0.76e 0.43f 0.41d 0.42f 0.41f 0.55b
L. peru 0.29d 0.30d 0.45c 0.30d 0.40c 0.72b 0.76b 0.72b 0.44c 0.47c 0.42c 1.06a
S. ensis 0.27a 0.36b 0.41c 0.48d 0.36b 0.39c 0.64e 0.39c 0.32f 0.37b 0.33f 0.56g
S. sierra 0.40a 0.41a 0.46b 0.56c 0.44d 0.47e 0.70f 0.48e 0.43g 0.42g 0.43g 0.57c
C. phoxocephalus 0.37a 0.41b 0.45c 0.39d 0.45e 0.42f 0.68g 0.43h 0.42f 0.42i 0.42f 0.56j

MD Size = 40 to 50
Species SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.51a 0.53b 0.61c 0.59d 0.55e 0.56f 0.96g 0.57h 0.54b 0.55f 0.56f 0.75i
L. occidentalis 0.42a 0.46a 0.65b 0.39c 0.59d 0.57b 1.11e 0.58d 0.53f 0.53f 0.55d 0.77b
L. peru 0.40a 0.42a 0.59b 0.42a 0.52c 0.87d 1.20e 0.86d 0.60f 0.61f 0.58b 1.27g
S. ensis 0.37a 0.48b 0.54c 0.60d 0.50e 0.53f 0.91g 0.53h 0.43i 0.50j 0.45k 0.70l
S. sierra 0.53d 0.54d 0.64c 0.72b 0.66c 0.64c 1.05a 0.64c 0.56d 0.56d 0.58d 0.75b
C. phoxocephalus 0.49a 0.54b 0.60c 0.53d 0.60e 0.57f 0.93g 0.58h 0.56f 0.55i 0.56f 0.72j

MD Size = 50 to 60
Species SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.64a 0.69b 0.78c 0.74d 0.74d 0.70b 1.21e 0.72d 0.68b 0.68b 0.70b 0.92f
L. occidentalis 0.55a 0.57a 0.8b 0.51c 0.74b 0.73d 1.61e 0.77f 0.67d 0.66g 0.68g 0.94h
L. peru 0.50a 0.54a 0.84b 0.60c 0.65d 1.02e 1.63f 0.99e 0.75g 0.77b 0.76g 1.59f
S. ensis 0.49a 0.59b 0.67c 0.75d 0.68e 0.69e 1.20f 0.70g 0.58b 0.62h 0.60b 0.85i
S. sierra 0.69a 0.66a 0.79b 0.90c 0.89d 0.82e 1.35f 0.85g 0.70a 0.70a 0.72h 0.91c
C. phoxocephalus 0.62a 0.65b 0.78c 0.66d 0.75e 0.72f 1.24g 0.76e 0.70h 0.70h 0.70h 0.88i

MD Size = 60 to 70
Species SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.79a 0.81a 1.00b 0.98b 0.94b 0.85a 1.52c 0.90b 0.81a 0.82a 0.85a 1.07d
L. occidentalis 0.72a 0.7a 1.02a 0.66a 0.88a 0.98a 1.85b 0.89a 0.8a 0.81a 0.85a 1.1a
L. peru 0.66a 0.72b 1.07c 0.79b 0.93d 1.22e 2.04f 1.22e 0.93d 0.98d 0.97d 1.96f
S. ensis 0.66a 0.70a 0.83b 0.96c 0.92d 0.85b 1.66e 0.86b 0.70a 0.72f 0.76f 1.01c
S. sierra 0.83a 0.76a 1.07b 1.18c 1.06b 1.12d 1.36e 1.07b 0.83a 0.83a 0.91f 1.13d
C. phoxocephalus 0.77b 0.81b 0.92b 0.85b 0.93b 0.89b 1.96a 0.88b 0.82b 0.85b 0.85b 1.05b

Different superscript letters indicate statistically significant differences in MASE values between imputation methods at the 0.05 level (row-wise comparisons), according to the Tukey HSD tests. Same superscript letters indicate statistically non-significant differences.


ANOVA and Tukey HSD tests (Zar, 2009). The best algorithms for imputation were determined as those having lower RMSE and MASE, as well as smaller slopes, which indicate less imputation error on average and a slower decay of imputation precision with increasing MD. Data analysis for the classical validation focused on MD sizes from 20 to 40, since this is the range of MD of most importance in the target time series. The level of significance of statistical tests was set at 0.05.
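The two error metrics and the slope criterion described in this subsection can be sketched as follows. This is only a minimal Python illustration (the analysis in this study was run with R packages); the function names, the one-step naive scaling chosen for MASE, and the use of MD-range midpoints as the regression predictor are assumptions of the sketch, not details taken from the paper.

```python
from math import sqrt

def rmse(actual, imputed):
    """Root mean squared error between held-out true values and their imputations."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, imputed)) / n)

def mase(actual, imputed, reference, season=1):
    """Mean absolute error scaled by the error of a naive forecast on `reference`.

    `season=1` gives the plain one-step naive scaling (an assumption here).
    """
    mae = sum(abs(a - p) for a, p in zip(actual, imputed)) / len(actual)
    naive = [abs(reference[i] - reference[i - season])
             for i in range(season, len(reference))]
    return mae / (sum(naive) / len(naive))

def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b * x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

# Mean RMSE of SEADEC for B. pinnimaculatus at MD sizes 40-50, 50-60 and 60-70
# (values from Table 2), regressed on the midpoint of each MD range.
# Using midpoints as predictor is an assumption of this sketch.
md_midpoints = [45, 55, 65]
mean_rmse = [6.55, 7.52, 8.53]
print(round(ols_slope(md_midpoints, mean_rmse), 4))
```

A smaller slope means that the error of a method grows more slowly as more data go missing, which is the second selection criterion used here alongside the mean RMSE and MASE.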

3. Results

3.1. Structure of time series and MD

All time series showed a 12-month seasonality. Regarding the trend, L. occidentalis was stationary and the rest of the species showed a significant increasing trend (see Table 1). Time series for all species showed NMAR-type MD (see Table 1), with sequential MD occurring in chunks, most of them at the beginning of each year. The largest chunk was found for S. ensis, with 10 MD from January to October 2012. For the rest of the species, the largest chunk had 6 MD, from January to June 2018. The remaining chunks ranged from 2 to 5 MD and varied in position across the months from January to December (see Fig. 6). The difference in total MD among series was due to a few individual MD (<25%) scattered throughout the series. Total MD in the series ranged from 20 to 33 (19%–32%), being highest for S. ensis and lowest for L. occidentalis.

3.2. Performance metrics among algorithms and MD size

Tables 2 and 3 summarize the mean RMSE and MASE resulting from the MD simulation and KSSA validation for each algorithm, species and MD size. For the whole experiment, 64% of the simulations resulted in the lowest RMSE values for the SEADEC algorithm, 25% for STRUCT and 11% for LTS. The remaining algorithms did not show the lowest RMSE values for any species at any MD size. SEADEC was consistently better for all species, whilst LTS showed the lowest RMSE values only for S. ensis and L. peru at MD sizes from 1 to 10, 10 to 20 and 20 to 30. However, in most cases these differences were not statistically significant. Regarding MASE, the results were very similar to those for RMSE, except that the SSA algorithm achieved the lowest values for B. pinnimaculatus at MD sizes from 1 to 10 and 20 to 30, and for S. ensis at MD sizes from 1 to 10.

Both RMSE and MASE increased linearly and consistently with MD size in all algorithms and species. At low MD sizes (1–10), RMSE and MASE were the lowest, but the difference among algorithms was also the smallest. When MD was low and/or scattered, all algorithms performed similarly. Simulations with low MD sizes often resulted in 1 or 2 MD per segment, which were easily imputed by all algorithms with similar values. LOCF was an exception to this pattern because, even at low MD sizes, its RMSE and MASE values were higher. With few MD per segment, imputations are made based on inter-attribute correlations, with no need to account for time dependency structures. Hence, under this situation, there was no distinguishable advantage of any algorithm over the others.

As MD increased and chunks got larger, differences in RMSE and MASE among algorithms became more pronounced. However, the statistical significance of these differences did not increase proportionally with MD size. The difference in RMSE among algorithms was more statistically significant for MD sizes from 10 to 50, where SEADEC and STRUCT had up to 20%–50% more precision than algorithms with low performance. Above 60 MD, the statistical significance of the differences among algorithms decreased. This is related to the fact that the variance of imputed values among runs increases in simulations with high MD size. When MD size was over 60, most of the essential structure of the time series (seasonality, trend, and cycles) was lost, making it very difficult for any algorithm to capture it and impute reliable values. As the simulation runs evolved, some high RMSE and MASE values emerged due to those MD structures that left no temporal structure in the time series, thus increasing the variance of imputations. As a consequence, at MD sizes above 60, the imputations of all algorithms are not reliable.

Table 4 and Figs. 4 and 5 show the results of the OLS regressions to estimate the slopes of RMSE and MASE as a function of MD size. Regarding RMSE, SEADEC and STRUCT had the lowest slopes among algorithms. Slopes for SEADEC were significantly lower for B. pinnimaculatus, S. ensis and C. phoxocephalus. For the rest of the species, either SEADEC or STRUCT obtained the lowest slopes. Regarding MASE, results were similar, except that for L. occidentalis the LTS algorithm obtained the lowest slope. This means that the decay of imputation precision with increasing MD is slower when using the SEADEC and STRUCT algorithms. Taking into account both performance metrics (RMSE and MASE) and the slopes from the OLS regressions, these results point to SEADEC and STRUCT as the two best algorithms for MD imputation in the SEPEC marine fishery landings time series. However, it should be noted that, according to our results, the higher performance of these algorithms applies to MD sizes between 20 and 50 (Figs. 4 and 5), and to NMAR-type MD.

3.3. Comparison between KSSA and classical validation

Table 6 shows the results of the classical validation of imputation methods in terms of the best method for each time series and MD size. According to these results, ARIMA was the best method for T. obesus; LTS, SMA and SSA the best for T. albacares; LTS, SEADEC and SSA the best for K. pelamis; SSA the best for CHLA-CPO; and LTS the best for SST-CPO. When focusing on MD sizes from 20 to 40, which are of main interest to us since this range is dominant in our target time series from SEPEC, the results show that ARIMA was the best method for T. obesus; SMA and SSA the best for T. albacares; SSA and SEADEC the best for K. pelamis; SSA the best for CHLA-CPO; and LTS the best for SST-CPO. These results contrast with those of KSSA, which revealed SEADEC and STRUCT as the first and second best imputation methods, respectively, for all six of our target time series.

From Figs. 4 and 5 one can visually estimate the increase in the overall imputation error for a particular MD size when switching from one method to another. To understand the utility of KSSA, it must be observed that at any MD size, both for RMSE and MASE, imputations with ARIMA, SSA and LTS, which the classical validation suggested as the best, increase the error.
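The size of this increase can be made concrete with a short calculation. The Python sketch below takes the mean RMSE values reported in Table 2 for L. occidentalis at MD sizes 40 to 50 and expresses the error of ARIMA, SSA and LTS relative to SEADEC; the function name is hypothetical and the choice of this species and MD range is only for illustration.

```python
def relative_increase(best_error, other_error):
    """Fractional increase in error when the best method is replaced by another."""
    return (other_error - best_error) / best_error

# Mean RMSE for L. occidentalis at MD sizes 40 to 50 (Table 2).
seadec = 10.99
alternatives = {"ARIMA": 16.13, "SSA": 15.96, "LTS": 12.23}

for name, value in alternatives.items():
    pct = 100 * relative_increase(seadec, value)
    print(f"{name}: {pct:.1f}% more error than SEADEC")
```

The same calculation can be repeated for any row of Tables 2 and 3 to quantify the cost of choosing a method from a validation run on a different time series.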

4. Discussion

In this work, we assessed and compared 12 state-of-the-art algorithms for multiple imputation of missing data in univariate marine fishery landing time series from six species of the Colombian Pacific Ocean. These series frequently include missing data, which prevents their use for quantitative analysis and decision making in fisheries. We validated the results with our novel KSSA approach, which allowed us to simulate missing data in the same target time series, hence giving us good confidence in our results, which revealed SEADEC and STRUCT as the best algorithms for multiple missing data imputation in these marine fishery time series. The RMSE and MASE values for SEADEC were significantly lower in most of the simulations and most of the MD sizes, with highly similar results among species. This indicates that, for our specific target time series, this imputation method was not only effective but also consistent.

SEADEC emerged as the best imputation method due to the marked seasonal nature of the fishery landings. SEADEC takes a time series and removes its seasonal component, then performs the imputations on the deseasonalized series using different algorithms such as Kalman filters (Moritz & Bartz-Beielstein, 2017). Among other imputation options, we chose these filters because, beyond the 12-month seasonality, our target time series contain different levels of time dependency structures, such as autocorrelation at lags 1 and 2 (as evidenced by Partial Autocorrelation Functions not shown in the Results section), which would have been poorly captured by simple methods such as interpolation, mean values, random values or moving averages (available options in the imputeTS R-package). In fact, previous tests of SEADEC using these options yielded unsatisfactory results. Hence, this outlines SEADEC as an effective method for seasonal and autocorrelated series such as marine fishery landings, whose behavior reflects the cyclical productivity of the ocean (Lloret et al., 2000; Preciado et al., 2006; Coro et al., 2016).
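The deseasonalize, impute, reseasonalize idea behind SEADEC can be sketched in a few lines. The Python code below is a deliberately simplified illustration and not the imputeTS implementation: it assumes that the seasonal component can be approximated by the mean deviation of each position in the cycle, and it swaps the Kalman filter for plain linear interpolation on the deseasonalized series. All function names are hypothetical.

```python
def interpolate(values):
    """Linear interpolation over None gaps; edge gaps copy the nearest observation."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = [k for k in known if k < i]
        right = [k for k in known if k > i]
        if not left:
            out[i] = values[right[0]]
        elif not right:
            out[i] = values[left[-1]]
        else:
            a, b = left[-1], right[0]
            out[i] = values[a] + (values[b] - values[a]) * (i - a) / (b - a)
    return out

def seasonal_component(values, period):
    """Mean deviation from the overall mean per position in the cycle (observed values only)."""
    observed = [v for v in values if v is not None]
    overall = sum(observed) / len(observed)
    season = []
    for p in range(period):
        vals = [v for i, v in enumerate(values) if v is not None and i % period == p]
        season.append(sum(vals) / len(vals) - overall if vals else 0.0)
    return season

def impute_seasonal_decomposition(values, period=12):
    """Remove the seasonal component, impute the gaps, then add the season back."""
    season = seasonal_component(values, period)
    deseasonalized = [None if v is None else v - season[i % period]
                      for i, v in enumerate(values)]
    filled = interpolate(deseasonalized)
    # Observed values are returned untouched; only the gaps receive imputations.
    return [v if v is not None else filled[i] + season[i % period]
            for i, v in enumerate(values)]

# Toy series with a 4-step cycle and one missing value at a seasonal peak.
series = [10.0, 20.0, 30.0, 20.0] * 3
series[6] = None
print(impute_seasonal_decomposition(series, period=4))
```

On a strongly seasonal series, interpolating the raw values across a gap at a seasonal peak would cut the peak off, whereas interpolating the deseasonalized values and restoring the season keeps it. This is the intuition for why a seasonal decomposition step suits landings data with a marked 12-month cycle.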

Other imputation methods such as the Seasonal Window Moving Algorithm (SWMA) (Chandrasekaran et al., 2016), the Seasonally Split Missing Value Imputation (SEASPLIT) (Moritz & Bartz-Beielstein, 2017), the Pattern Sequence Forecasting (PSF) (Bokde et al., 2018), the

Table 4

Slopes estimated from OLS regression for RMSE and MASE as a function of MD size. Cells in bold represent the lowest values for contrasts among algorithms (row-wise).

RMSE
Species SEADEC STRUCT ARIMA LTS SSA LIT SIT STIT SMA LWMA EWMA LOCF
B. pinnimaculatus 0.1481a 0.1558b 0.1642d 0.1593b 0.1869b 0.1539b 0.3265e 0.1604b 0.1669c 0.1681c 0.167c 0.1767d
L. occidentalis 0.2472a 0.2663a 0.3662b 0.278a 0.4315d 0.3105c 0.6576e 0.3215c 0.3099c 0.3057c 0.3072c 0.3527b
L. peru 0.0701a 0.0721a 0.0796b 0.073a 0.0973c 0.0899c 0.2261e 0.0908c 0.0825b 0.0872c 0.0791b 0.1310d
S. ensis 0.1215a 0.137b 0.1496c 0.1343b 0.1484c 0.1331b 0.3465e 0.1353b 0.133b 0.1445c 0.1305b 0.1463c
S. sierra 0.3435a 0.3400a 0.3791b 0.3806b 0.4584c 0.3744b 0.7512d 0.3878b 0.3739b 0.3685b 0.3695b 0.4021b
C. phoxocephalus 0.1434a 0.1495b 0.1576b 0.1527b 0.1757d 0.1522b 0.3005e 0.1566b 0.1656c 0.1633c 0.1628c 0.1676c

MASE
B. pinnimaculatus 0.0114a 0.0121b 0.0133c 0.0132c 0.0123b 0.0125b 0.0219e 0.0129c 0.0120b 0.0123b 0.0124b 0.01665d
L. occidentalis 0.0091a 0.0100a 0.0141c 0.0088a 0.0152c 0.0127b 0.0246d 0.0131b 0.0120b 0.0120b 0.0123b 0.01708c
L. peru 0.0089a 0.0094a 0.0140b 0.0099a 0.0118b 0.0190c 0.0269d 0.0187c 0.0135b 0.0137b 0.0135b 0.03072d
S. ensis 0.0083a 0.0106b 0.0119b 0.0143c 0.0112b 0.0122b 0.0207d 0.0124b 0.0100b 0.0109b 0.0105b 0.01554c
S. sierra 0.0120a 0.0120a 0.0141b 0.0168c 0.0144b 0.0143b 0.0226d 0.0145b 0.0124a 0.0124a 0.0128a 0.01687c
C. phoxocephalus 0.0109a 0.0119d 0.0134d 0.0117d 0.0134d 0.0127d 0.0215e 0.0131c 0.0124d 0.0124d 0.0125d 0.01607b

Different superscript letters indicate statistically significant differences in slope values between imputation methods at the 0.05 level (row-wise comparisons), according to the Tukey HSD tests. Same superscript letters indicate statistically non-significant differences.

Table 5

Time series from sources other than SEPEC used for classical validation of imputation methods.

Time series Seasonality Trend Source
T. albacares 12-Months Stationary https://www.iattc.org/
K. pelamis 12-Months Stationary https://www.iattc.org/
T. obesus 6-Months Stationary https://www.iattc.org/
SST-CPO 12-Months Stationary https://marine.copernicus.eu/
CHLA-CPO 12-Months Decreasing https://marine.copernicus.eu/

Fig. 4. OLS regressions for RMSE as a function of MD size for all imputation algorithms and species after applying KSSA. Lines represent the fitted values for the 12 algorithms on each species, and the vertical gray bands represent MD sizes in the range of 20–40, which are typical in the SEPEC time series analyzed in this study. Coefficients of determination R² for the linear regressions ranged from 72% to 81%.


Dynamic Time Warping (DTW) (Phan et al., 2020), the Gaussian Autoregressive (AR-G) and Student’s Autoregressive models (AR-T) (Liu et al., 2019), which have been claimed as efficient imputation methods for univariate sets, were also tested on our target time series. However, due to their low imputation efficiency (higher RMSE and MASE values than those reported here) or unsatisfactory implementation for NMAR MD in the case of DTW (this method works only for one gap of MD), the results were not included in this paper. In any case, these results point firstly to SEADEC and secondly to STRUCT as the best imputation methods for the time series that we addressed here, and to KSSA as a powerful validation approach to reveal this and to prevent misleading results.

As demonstrated here, validating the methods with time series other than the target time series of interest (classical validation) leads to the choice of inaccurate imputation methods, given that the efficiency of the methods is dataset-dependent. The consequence of this bad choice is increased error in imputed values, which is one of the outcomes to avoid in any statistical procedure. For example, if our validation had been done with the time series of T. albacares, K. pelamis, T. obesus or CHLA-CPO, then ARIMA, SSA or SMA would have been chosen as the best methods for MD sizes between 20 and 40. This would have produced imputed values with up to 40% more error. The same mistake would have occurred if the validations had been done with SST-CPO, which would have pointed to LTS as the best imputation method, implying up to 35% more error in imputed values. These increased errors could inflate the random component of the time series, risking the distortion of their seasonality, trend, cycle and autocorrelation structures and making them useless for further fishery analysis and decision making. KSSA can work with any structure of MD. However, one important restriction is the length of MD chunks. If MD chunks are big enough to hide the essential structure of the time series, imputation methods will not be able to capture temporal components, making estimations unreliable. The MD size at which this restriction applies will depend, however, on the total length and temporal resolution of the target time series.

As observed during our experiments, some drawbacks need to be considered when using SEADEC or STRUCT. Both methods are good at capturing the general properties of time series, but at the price of missing high peaks or low valleys (because of the filtering effect). Other methods such as SSA, SWMA and DTW are better at reconstructing gaps that include very high or low values, but at the price of not capturing the general pattern of seasons and trend along the whole time series. This suggests possible trade-offs among imputation methods, worth exploring in further research. We also encourage research that combines different imputation methods operating on a single target time series. This approach might be able to impute both general patterns and finer details, improving the imputation efficiency for univariate time series.

Chunk or gap size is also a limitation for SEADEC and STRUCT. According to our results, both performed reasonably well under a mean chunk size of 8 individual MD. However, above this number, results are not trustworthy. Liu et al. (2020) mention that large gaps for SEADEC will result in long-time-period linear interpolation before seasonal decomposition, affecting the seasonality component and leading to poor imputation results. Regarding computation times, SEADEC and STRUCT

Fig. 5. OLS regressions for MASE as a function of MD size for all imputation algorithms and species after simulation and validation with the Known Sub-Sequence Algorithm (KSSA). Lines represent the fitted values for the 12 algorithms on each species, and the vertical gray bands represent MD sizes in the range of 20–40, which are typical in the SEPEC time series analyzed in this study. Coefficients of determination R² for the linear regressions ranged from 75% to 84%.


were halfway (30 min–1 h) between methods that required just a few seconds to perform the 10,000 simulations, such as LOCF, SIT, STIT, LIT, SMA, EWMA, LWMA and LTS, and methods such as ARIMA and SSA that took up to 3 h. Thus, computation times can be a limitation depending on the computer used for the simulations. A final consideration is negative imputations. Occasionally during simulations (3% of the time in our experiment), SEADEC and STRUCT return negative values, which are worth omitting from further statistical analysis if the time series comprises only positive values, as is the case for marine fishery landings.
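One way to handle such negative imputations is to flag them and set them back to missing before any further analysis. The Python sketch below shows this under the assumption that missing observations are marked with None; the function name is hypothetical, and resetting to missing is only one reading of "omitting" (clipping at zero would be a different choice with different consequences).

```python
def drop_negative_imputations(original, imputed):
    """Set negative imputed values back to missing and report their positions.

    `original` marks missing observations with None; `imputed` is the same
    series after imputation. Observed values are never modified.
    """
    cleaned = list(imputed)
    flagged = []
    for i, (obs, value) in enumerate(zip(original, imputed)):
        if obs is None and value < 0:
            cleaned[i] = None
            flagged.append(i)
    return cleaned, flagged

# Toy landings series (arbitrary units) with two gaps, one imputed below zero.
original = [5.0, None, None, 3.0]
imputed = [5.0, -0.4, 2.1, 3.0]
cleaned, flagged = drop_negative_imputations(original, imputed)
print(cleaned, flagged)
```

Keeping the list of flagged positions makes it possible to report how often a method produced impossible values for a given series.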

The purpose of the imputations always needs to be taken into account to understand which range of imputed values is acceptable for a particular target time series. If the purpose is, for example, to produce a forecast of the general trend of the time series, very high or low values should not be problematic. Conversely, very high or low values are indeed a problem when the purpose is to reconstruct particular missing values for stock assessment, price analysis or other finer-detail tasks for fisheries. In these cases, the experience of the researcher with the study system and the data is highly valuable, and should always complement the information yielded by imputation methods.

The results of this research highlight the convenience of conducting KSSA to choose ad hoc methods best suited to MD imputation in particular target time series. As stated by Hyndman & Khandakar (2008) and Moritz & Bartz-Beielstein (2017), there is no single univariate imputation technique suited to all types of data. Imputation performance always depends strongly on the characteristics of the input time series. Time series comprise various known components such as length, resolution, trend, seasonality, cycles, autocorrelation and randomness,

Fig. 6. Fishery landings time series after imputation of the true MD with the SEADEC and STRUCT algorithms. Black lines represent real data, red lines represent imputations with SEADEC and blue lines represent imputations with STRUCT. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

Table 6

Results of classical validation of imputation methods with five real-life time series. The best method for each time series and MD size, i.e. the one that obtained the lowest RMSE and MASE values, is shown. MD sizes from 20 to 40 are highlighted because they represent the range of dominant MD size in the target time series from SEPEC.

MD Size T. obesus T. albacares K. pelamis CHLA-CPO SST-CPO

RMSE

1 to 10 ARIMA LTS LTS SSA LTS

10 to 20 ARIMA SMA SEADEC SSA LTS

20 to 30 ARIMA SSA SSA SSA LTS

30 to 40 ARIMA SSA SEADEC SSA LTS

40 to 50 ARIMA SSA SEADEC SSA LTS

50 to 60 ARIMA SSA SEADEC SSA LTS

60 to 70 ARIMA SSA SEADEC SSA LTS

MASE

1 to 10 ARIMA LTS SSA SSA LTS

10 to 20 ARIMA SMA SEADEC SSA LTS

20 to 30 ARIMA SMA SSA SSA LTS

30 to 40 ARIMA SSA SEADEC SSA LTS

40 to 50 ARIMA SSA SEADEC SSA LTS

50 to 60 ARIMA SSA SEADEC SSA LTS

60 to 70 ARIMA SSA SEADEC SSA LTS


whose possible combinations are virtually infinite. In addition, it must be recalled that time series are gathered from observational or experimental units that imply random variation in space and time, whose causes are unknown. Although target time series are artificially split for KSSA, the segments are only used operationally for simulating NMAR MD and imputation; the imputation itself is based on the statistical properties of the whole target time series.
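The general principle of validating on the target series itself can be illustrated with a generic hide-and-score loop. The Python sketch below is not the KSSA implementation and makes several assumptions of its own: missing values are marked with None, a single chunk of known values is hidden per run, only two simple imputation methods (LOCF and linear interpolation) are compared, and the score is the mean RMSE over runs. All function names and parameters are hypothetical.

```python
import random
from math import sqrt

def locf(values):
    """Last observation carried forward (leading gaps copy the first observation)."""
    last = next(v for v in values if v is not None)
    out = []
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

def linear(values):
    """Linear interpolation between the nearest observed neighbours."""
    known = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    for i, v in enumerate(values):
        if v is None:
            left = [k for k in known if k < i]
            right = [k for k in known if k > i]
            if not left:
                out[i] = values[right[0]]
            elif not right:
                out[i] = values[left[-1]]
            else:
                a, b = left[-1], right[0]
                out[i] = values[a] + (values[b] - values[a]) * (i - a) / (b - a)
    return out

def hide_chunk(values, start, length):
    """Mask `length` positions from `start`; return the masked copy and the hidden known indices."""
    masked = list(values)
    hidden = [i for i in range(start, min(start + length, len(values)))
              if values[i] is not None]
    for i in hidden:
        masked[i] = None
    return masked, hidden

def validate_on_target(values, methods, chunk=3, runs=200, seed=1):
    """Mean RMSE per method over `runs` random chunks hidden in the target series."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in methods}
    done = 0
    while done < runs:
        # Keep the first and last observations so interpolation has anchors.
        start = rng.randrange(1, len(values) - chunk)
        masked, hidden = hide_chunk(values, start, chunk)
        if not hidden:
            continue  # the chunk fell entirely on true gaps, nothing to score
        done += 1
        for name, impute in methods.items():
            filled = impute(masked)  # imputation uses the whole masked series
            totals[name] += sqrt(sum((filled[i] - values[i]) ** 2 for i in hidden)
                                 / len(hidden))
    return {name: total / runs for name, total in totals.items()}

# Toy target series: a linear trend with two true gaps.
target = [float(i) for i in range(30)]
target[10] = target[11] = None
print(validate_on_target(target, {"locf": locf, "linear": linear}, runs=50))
```

Because the hidden values come from the target series, the ranking of methods reflects the structure of that particular series rather than that of an unrelated benchmark, which is the property the comparison with classical validation relies on.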

Each time series is a unique ‘fingerprint’ that cannot be captured by a single method. This is why researchers working with time series in marine sciences should have a broad toolkit of imputation methods to handle multiple missing data, and should consider KSSA as a potential approach to help decide which method to use for a particular series. This of course applies not only to marine science, but to all quantitative sciences that use time series.

Caution must therefore be exercised with every newly published paper claiming that a new imputation method is better than the rest, because it might be better only for the time series that the authors used for validation. This kind of research can significantly impact marine science because, beyond mathematical details, fisheries and researchers in marine science need sound and easily implemented methods for imputation of MD and validation of methods. Improving imputation methods, combined with improving validation algorithms and the researcher’s knowledge of marine systems and the data they produce, helps to design, implement and evaluate management strategies and recovery plans for threatened or exploited species (Altamar et al., 2020; Peacock et al., 2020). Furthermore, the specific results produced in this research will be useful for researchers working with marine fisheries time series around the world, and especially for researchers in AUNAP and SEPEC who are willing to use the time series from the six species covered here for stock assessment and forecasting. Finally, KSSA might also be useful for research fields other than fisheries, where imputation methods need to be validated in order to choose the best one.

Declaration of competing interest

The authors declare that there are no conflicts of interest.

CRediT authorship contribution statement

Iván F. Benavides: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Supervision, Validation, Writing – original draft. Marlon Santacruz: Data curation, Investigation, Methodology, Writing – review & editing. Jhoana P. Romero-Leiton: Investigation, Visualization, Writing – review & editing. Carlos Barreto: Data curation, Methodology, Writing – review & editing. John Josephraj Selvaraj: Funding acquisition, Investigation, Project administration, Resources, Supervision, Writing – review & editing.

Acknowledgments

The authors thank SEPEC (Colombia) for providing the datasets and the technical training on them; the Instituto de Estudios del Pacífico (IEP) at Universidad Nacional de Colombia, Tumaco, for providing logistic support for the research; and Professor Luis Enriquez Benavides at Universidad de Nariño (Colombia) for providing logistic support. F. Benavides and J. Romero appreciate the support of Fundación CEIBA (Colombia).

References

Afrifa-Yamoah, E., Mueller, U. A., Taylor, S. M., & Fisher, A. J. (2020). Missing data

imputation of high-resolution temporal climate time series data. Meteorological

Applications, 27(1), 1–18. https://doi.org/10.1002/met.1873

Altamar, J., Correa-Helbrum, J., Restrepo-Leal, D., & Robles-Algarín, C. (2020).

Reconstructed data of landings for the artisanal beach seine shery in the marine-

coastal area of Taganga, Colombian Caribbean Sea. Data in Brief, 30. https://doi.org/

10.1016/j.dib.2020.105604

Beck, M. W., Bokde, N., Asencio-Cort´

es, G., & Kulat, K. (2018). R package imputeTestbench

to compare imputation methods for univariate time series.

Bokde, N., Beck, M. W., Martínez ´

Alvarez, F., & Kulat, K. (2018). A novel imputation

methodology for time series based on pattern sequence forecasting. Pattern

Recognition Letters, 116, 88–96. https://doi.org/10.1016/j.patrec.2018.09.020

Chandrasekaran, S., Moritz, S., Zaefferer, M., Stork, J., Bartz-Beielstein, T., & Bartz-

Beielstein, T. (2016). Data preprocessing: A new algorithm for univariate imputation

designed specically for industrial needs (pp. 1–20). January: Workshop Computational

Intelligence. https://cos.bibl.thkoeln.de/frontdoor/deliver/index/docId/433/f

ile/Chan16a.pdf.

Chan, K.-S., & Ripley, B. (2020). TSA: Time series analysis. R package version 1.3 https:

//CRAN.R-project.org/package=TSA.

Coro, G., Large, S., Magliozzi, C., & Pagano, P. (2016). Analysing and forecasting

fisheries time series: Purse seine in Indian ocean as a case study. ICES Journal of Marine Science: Journal Du Conseil, 73(10), 2552–2571. https://doi.org/10.1093/icesjms/fsw131

De Valpine, P. (2002). Review of methods for fitting time-series models with process and

observation error and likelihood calculations for nonlinear, non-Gaussian state-space

models. Bulletin of Marine Science, 70(2), 455–471.

Demirhan, H., & Renwick, Z. (2018). Missing value imputation for short to mid-term

horizontal solar irradiance data. Applied Energy, 225(May), 998–1012. https://doi.org/10.1016/j.apenergy.2018.05.054

Donders, A. R. T., van der Heijden, G. J. M. G., Stijnen, T., & Moons, K. G. M. (2006).

Review: A gentle introduction to imputation of missing values. Journal of Clinical

Epidemiology, 59(10), 1087–1091.

Duarte, L. O., Manjarrés, L., & Reyes-Ardila, M.y H. (2019). Estadísticas de desembarco y esfuerzo de las pesquerías artesanales e industriales de Colombia entre febrero y diciembre de 2019 (p. 95). Bogotá: Autoridad Nacional de Acuicultura y Pesca (AUNAP).

Engels, J. M., & Diehr, P. (2003). Imputation of missing longitudinal data: A comparison

of methods. Journal of Clinical Epidemiology, 56(10), 968–976. https://doi.org/10.1016/S0895-4356(03)00170-7

FAO. (2017). Fisheries and aquaculture software. FishStatJ - software for fishery statistical time series. Version 3.01.0. Rome: FAO Fisheries and Aquaculture Department. Updated 14 2017 http://fao.org/fishery/statistics/software/fishstatj/en.

FAO. (2020). The state of world fisheries and aquaculture. Sustainability in Action, 225,

978-92-5-132756-2 http://www.fao.org/3/ca9229en/ca9229en.pdf.

Farmer, N. A., & Froeschke, J. T. (2015). Forecasting for recreational fisheries management: what’s the catch? North American Journal of Fisheries Management, 35, 720–735. https://doi.org/10.1080/02755947.2015.1044628

Genolini, C., Ecochard, R., & Jacqmin-Gadda, H. (2013). Copy mean: A new method to impute intermittent missing values in longitudinal studies. Open Journal of Statistics, 3. https://doi.org/10.4236/ojs.2013.34A004

Golyandina, N., & Korobeynikov, A. (2014). Basic singular Spectrum analysis and

forecasting with R. Computational Statistics & Data Analysis, 71, 934–954. https://doi.org/10.1016/j.csda.2013.04.009

Hassani, H., Kalantari, M., & Ghodsi, Z. (2019). Evaluating the performance of multiple

imputation methods for handling missing values in time series data: A study focused

on East Africa, soil-carbonate-stable isotope data. Stats, 2(4), 457–467. https://doi.org/10.3390/stats2040032

Huque, M. H., Carlin, J. B., Simpson, J. A., & Lee, K. J. (2018). A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Medical Research Methodology, 18(168). https://doi.org/10.1186/s12874-018-0615-6

Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast

package for R. Journal of Statistical Software, 27(3), 22. http://www.jstatsoft.org/v27/i03/paper.

Jamshidian, M., Jalal, S., & Jansen, C. (2014). MissMech: An R package for testing

homoscedasticity, multivariate normality, and missing completely at random

(MCAR). Journal of Statistical Software, 56(6), 1–31. https://doi.org/10.18637/jss.v056.i06

Junger, W. L., & Ponce de Leon, A. (2015). Imputation of missing data in time series for

air pollutants. Atmospheric Environment, 102, 96–104. https://doi.org/10.1016/j.atmosenv.2014.11.049

Kihoro, J., & Athiany, K. (2013). Imputation of incomplete nonstationary seasonal time

series data. Mathematical Theory and Modeling, 3(12), 142–154.

Koslow, J. A., & Davison, P. C. (2016). Productivity and biomass of fishes in the California current large marine ecosystem: Comparison of fishery-dependent and -independent time series (Vol. 17). Environmental Development. https://doi.org/10.1016/j.envdev.2015.08.005

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John

Wiley.

Liu, Y., Dillon, T., Yu, W., Rahayu, W., & Mostafa, F. (2020). Missing value imputation

for industrial IoT sensor data with large gaps. IEEE Internet of Things Journal, 7(8),

6855–6867. https://doi.org/10.1109/JIOT.2020.2970467

Litzow, M., Urban, D., et al. (2009). Fishing through (and up) Alaskan food webs. Canadian Journal of Fisheries and Aquatic Sciences, 26(2). https://doi.org/10.1139/F08-207

Liu, J., Kumar, S., & Palomar, D. P. (2019). Parameter estimation of heavy-tailed AR

model with missing data via stochastic EM. IEEE Transactions on Signal Processing, 67(8), 2159–2172. https://doi.org/10.1109/TSP.2019.2899816

Lloret, J., Lleonart, J., & Solé, I. (2000). Time series modelling of landings in Northwest Mediterranean Sea. ICES Journal of Marine Science, 57. https://doi.org/10.1006/jmsc.2000.0570


Magare, D., Labde, S., Gofane, M., & Vyawahare, V. (2020). Imputation of missing data in

time series by different computation methods in various data set applications. ITM Web of Conferences, 32, 03010. https://doi.org/10.1051/itmconf/20203203010

Mahmoudvand, R., & Rodrigues, P. C. (2016). Missing value imputation in time series

using Singular Spectrum Analysis. International Journal of Energy and Statistics, 04(01), 1650005. https://doi.org/10.1142/s2335680416500058

Moritz, S., & Bartz-Beielstein, T. (2017). imputeTS: Time series missing value imputation

in R. R Journal, 9(1), 207–218. https://doi.org/10.32614/rj-2017-009

Moritz, S., Sardá, A., Bartz-Beielstein, T., Zaefferer, M., & Stork, J. (2015). Comparison of different methods for univariate time series imputation in R. arXiv. https://arxiv.org/abs/1510.03924.

Peacock, S. J., Hertz, E., Holt, C. A., Connors, B., Freshwater, C., & Connors, K. (2020).

Evaluating the consequences of common assumptions in run reconstructions on

Pacific salmon biological status assessments. Canadian Journal of Fisheries and

Aquatic Sciences, 77(12), 1904–1920. https://doi.org/10.1139/cjfas-2019-0432

Phan, T.-T.-H., Caillault, É. P., Lefebvre, A., Bigand, A., & others. (2020). Dynamic time

warping-based imputation for univariate time series data. Pattern Recognition Letters,

139. https://doi.org/10.1016/j.patrec.2017.08.019

Pikitch, E., Santora, C., Babcock, E., Bakun, A., Bonfil, R., Conover, D., & Dayton, P. (2004). Ecosystem-based fishery management. Science (Washington), 305, 346–347.

Preciado, I., Punzón, A., Gallego, J. L., & Vila, Y. (2006). Using time series methods for completing fisheries historical series. Boletin del Instituto Espanol de Oceanografia, 22(4), 83–90.

R Core Team. (2021). R: A language and environment for statistical computing. Vienna,

Austria: R Foundation for Statistical Computing. URL https://www.R-project.org/.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.

Rudd, M. B., & Branch, T. A. (2016). Does unreported catch lead to overfishing? Fish and Fisheries, 18(2). https://doi.org/10.1111/faf.12181

Schafer, J. L. (1997a). Analysis of incomplete multivariate data. Boca Raton, Florida:

Chapman and Hall, CRC, ISBN 978-0412040610 [p218].

Schafer, J. L. (1997b). Analysis of incomplete multivariate data. London: Chapman and

Hall/CRC.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art.

Psychological Methods, 7(2), 147–177. https://doi.org/10.1037/1082-989X.7.2.147

Selvaraj, J. J., Arunachalam, V., Coronado-Franco, K. V., Romero-Orjuela, L. V., &

Ramírez-Yara, Y. N. (2020). Time-series modeling of fishery landings in the Colombian Pacific Ocean using an ARIMA model. Regional Studies in Marine Science,

39, 101477. https://doi.org/10.1016/j.rsma.2020.101477

SEPEC, Servicio Estadístico Pesquero Colombiano. (2018). Operación estadística: Estimación de volúmenes artesanales desembarcados en sitios pesqueros. Ficha metodológica. http://sepec.aunap.gov.co/Archivos/Estadisticas/Metodologia/SPC-DIFM15_FICHA_METODOLOGICA.pdf.

Shumway, R. H., & Stoffer, D. S. (2011). Time series analysis and its applications: With R

examples (3rd ed., p. 218). New York, New York: Springer-Verlag.

Spratt, M., Carpenter, J., Sterne, J. A. C., Carlin, J. B., Heron, J., Henderson, J., &

Tilling, K. (2010). Strategies for multiple imputation in longitudinal studies.

American Journal of Epidemiology, 172, 478–487.

Stergiou, K., & Christou, E. (1996). Modelling and forecasting annual fisheries catches:

Comparison of regression, univariate and multi-variate time series methods. Fisheries

Research, 25, 105–138.

Wei, R., Wang, J., Su, M., Jia, E., Chen, S., Chen, T., & Ni, Y. (2018). Missing value imputation approach for mass spectrometry-based metabolomics data. Nature Scientific Reports, 8, Article 663. https://doi.org/10.1038/s41598-017-19120-0

Wu, S., Chang, C., & Lee, S. (2015). Time series forecasting with missing values. In 1st Int.

Conf. Ind. Networks Intell. Syst. (pp. 151–156). https://doi.org/10.4108/icst.iniscom.2015.258269

Yozgatligil, C., Aslan, S., Iyigun, C., & Batmaz, I. (2013). Comparison of missing value

imputation methods in time series: The case of Turkish meteorological data.

Theoretical and Applied Climatology, 112(1–2), 143–167. https://doi.org/10.1007/s00704-012-0723-x

Zar, J. H. (2009). Biostatistical analysis (5th ed.). Pearson.
