
Big Maritime Data Management


Abstract

Maritime stakeholders are continuously collecting large volumes of heterogeneous spatiotemporal data from various sources, for example, sensor data, AIS data, traffic data, port call data, and environmental monitoring data. The maritime data value chain defines the series of four key activities needed to appropriately manage this data, namely data acquisition, pre-processing, storage, and usage. As described in this chapter, a large arsenal of technological tools and frameworks is currently available for efficiently collecting, cleaning, integrating, storing, and analysing the data in order to extract value and useful insights that will satisfy several critical applications in the maritime industry (e.g., optimising port operations, planning optimised routes, performing predictive maintenance). Nevertheless, the large volume and variety of data, in combination with the unique characteristics of spatiotemporal data, turn data mining, big data analytics, and data visualisation into significantly challenging tasks in the maritime domain due to high computation and communication complexities. In addition, the integration of data management technologies that span multiple ships and ports is still an open challenge, mainly because of unreliable and slow transmissions as well as incompatible application programming interfaces. With regard to spatiotemporal systems, current distributed ones (e.g., GeoMesa, SpatialSpark, GeoSpark) are capable of handling large volumes but lack support for advanced operations for geometric, geographic, and topological processing and analysis. Hence, a multi-discipline, coordinated effort is still needed to advance the features and functionalities provided by the most relevant prior research projects and large-scale data processing systems used by the maritime industry today.
Big Maritime Data Management
Herodotos Herodotou , Sheraz Aslam , Henrik Holm ,
and Socrates Theodossiou
1 Maritime Data Value Chain
Modern maritime equipment constructors, ship owners and agents, transport and
logistics companies, and port authorities are collecting enormous amounts of
heterogeneous data at an unprecedented scale and pace. Almost all kinds of vessels
are now equipped not only with satellite positioning sensors for collecting posi-
tioning information, but also sensor devices recording ship performance, condition,
temperature, and humidity (Lytra, Vidal, Orlandi, & Attard, 2017). AIS (Automatic
Identification System) data involving the position, course, and speed of vessels
travelling on the ocean is openly available and mandatory for ships of over 300 gross
tonnage and all passenger vessels. MarineTraffic, an AIS vessel tracking web site,
reports collecting 520 million AIS messages daily involving 180 thousand distinct
vessels from 3000 active AIS stations worldwide (Perobelli, 2016). Port authorities
and various port actors (e.g., cargo terminals, tug operators) are also collecting port
call data related to the activities of arrival, berthing, loading/unloading, shifting,
anchorage, and departure of ships from ports (Michaelides, Herodotou, Lind, &
Watson, 2019). From an environmental perspective, various sensors are also being
deployed at sea, recording data related to various oceanographic, environmental,
and meteorological parameters of interest, with data volumes reaching up to 5 GB
per day (Lytra et al., 2017).

H. Herodotou · S. Aslam
Cyprus University of Technology, Limassol, Cyprus

H. Holm
Svenska Beräkningsbyrån AB, Torslanda, Gothenburg, Sweden

S. Theodossiou
Tototheo Maritime, Limassol, Cyprus

© The Editor(s) (if applicable) and The Author(s), under exclusive licence
to Springer Nature Switzerland AG 2021
M. Lind et al. (eds.), Maritime Informatics, Progress in IS
As more data is acquired, stored, and analysed, all maritime stakeholders
are focusing on performing timely and cost-effective analytical processing for
generating value and deep insights to automate various decision-making processes.
Analytical processing is already driving several new application scenarios with sig-
nificant impact across the maritime industry such as optimising marine transport and
preventing accidents (Zhao, Li, Feng, Ochieng, & Schuster, 2014), improving fuel
consumption (Beşikçi, Arslan, Turan, & Ölçer, 2016), optimising port operational
efficiency (Yang et al., 2018), environment preservation and monitoring (Akyuz,
Ilbahar, Cebi, & Celik, 2017), and real-time cargo tracking (Yeoh et al., 2011).
The maritime data value chain defines the series of activities needed to appro-
priately manage data during the entire data life-cycle as well as to extract value and
useful insights from maritime data (Cavanillas, Curry, & Wahlster, 2016; Ferreira
et al., 2017). The European Commission considers the data value chain to be ‘the
centre of the future knowledge economy, bringing the opportunities of the digital
developments to the more traditional sectors (e.g., transport, financial services,
health, manufacturing, retail)’ (DGConnect, 2013). In the maritime domain, four
key activities are identified: (1) Data Acquisition for collecting the data across
different and geographically-dispersed data sources; (2) Data Pre-processing for
transforming, integrating, and assessing the quality of the data; (3) Data Storage
for storing data in a persistent and scalable way; and (4) Data Usage for processing
the data and extracting value. Figure 1 outlines the main activities that comprise the
maritime data value chain.
Data Acquisition is the process of collecting data from several sources before
it is stored in a data warehouse or some other storage system (Curry, 2016).
The maritime data sources generate structured, semi-structured, and unstructured
data about ships, routes and trajectories, port operations, fishing and maritime
biodiversity, oceans, and environmental conditions (Ferreira et al., 2017). This data
is often spatiotemporal in nature, including both a geographical position and a time
component, while it is reported in various formats such as GeoJSON, KML, CSV,
Fig. 1 The maritime data value chain: data acquisition (diverse data sources; various data formats and encodings), data pre-processing (data curation and cleaning, data integration, data reduction), data storage (e.g., distributed file systems), and data usage (exploration, analytics, visualisation)
or RDF. The new data generated by each source is collected using a data acquisition
framework following the message queuing, publish/subscribe, or event processing
paradigms (Curry, 2016). The data acquisition process is described in Sect. 2.
Data Pre-processing consists of a set of methods for transforming, linking, and
cleaning the data to make certain that it meets the desired data quality requirements
for its effective usage (Yablonsky, 2018). Data, depending on the source and
communication link characteristics, can suffer from a variety of faults including
noise, outliers, and bias. Furthermore, certain parts of the data may be entirely
missing due to a faulty communication link or sensor. Various data cleaning and
reconstruction techniques are used for removing faulty data as well as replacing
missing data. After cleaning the data from each source individually, data integration
techniques are used for combining data based on a common timeline and frame of
reference in order to identify and exclude inconsistent data. This involves dealing
with cases of delayed data, or an out-of-order sequence, or having different sampling
rates. Finally, data integration also aims at increasing data interoperability via
linking existing repositories of relevant scientific open data with raw maritime
data coming from the various sources (Lytra et al., 2017). The data pre-processing
methods are categorised and detailed in Sect. 3.
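The integration step described above, combining sources on a common timeline despite different sampling rates, can be sketched as a nearest-timestamp match. This is a minimal illustration, not a method from the chapter; the field values and the tolerance are hypothetical assumptions:

```python
from datetime import datetime, timedelta

def align_nearest(primary, secondary, tolerance_s=60):
    """For each (timestamp, value) pair in `primary`, attach the value from
    `secondary` whose timestamp is closest, provided the gap is within
    `tolerance_s` seconds; otherwise mark the reading as missing (None)."""
    aligned = []
    for t, v in primary:
        best = min(secondary, key=lambda s: abs((s[0] - t).total_seconds()))
        gap = abs((best[0] - t).total_seconds())
        aligned.append((t, v, best[1] if gap <= tolerance_s else None))
    return aligned

# Hypothetical data: AIS positions every 10 s, an engine sensor every 30 s
t0 = datetime(2021, 1, 1, 12, 0, 0)
ais = [(t0 + timedelta(seconds=10 * i), (34.7 + i * 1e-4, 33.0)) for i in range(4)]
sensor = [(t0, 78.2), (t0 + timedelta(seconds=30), 78.9)]

merged = align_nearest(ais, sensor, tolerance_s=15)
```

Out-of-order or delayed records would additionally require sorting and buffering before such a match is applied.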
Data Storage refers to technological solutions for storing data in a way to ensure
data persistence, consistency, availability, and scalability (Ferreira et al., 2017).
These solutions often rely on partitioning, distribution, compression, and indexing
of data for ensuring applications have fast and easy access to the data. While
relational database management systems (RDBMS) are widely used for storing
maritime data, the semi- or unstructured nature of data as well as the prominent
spatiotemporal component of data are challenging the status quo. NoSQL technolo-
gies (e.g., MongoDB, HBase, Cassandra) have been designed for supporting more
flexible data models while storing the data in more scalable ways. At the same time,
specialised spatial or spatiotemporal systems such as PostGIS and GeoMesa have
emerged for dealing with such type of data efficiently and effectively. Data storage
solutions are presented in Sect. 4.
Data Usage involves query processing, analytics, and visualisation techniques and
tools for accessing the underlying maritime data and generating value for various
data-driven business activities. Query processing may involve browsing, searching,
reporting, finding correlations, identifying patterns, and predicting relations across
maritime data (Curry, 2016). Several query processing engines offer declarative or
scripting languages and different execution methods for managing and processing
data over different infrastructures and data models. Section 5 discusses query
processing while the next two chapters cover data analytics and visualisation.
2 Data Acquisition
Data acquisition is the first fundamental phase of the maritime data value chain
(recall Fig. 1) and involves the processes of (1) identifying the relevant data
sources such as port calls, AIS stations, Internet of Things (IoT) devices, and
weather stations (Sect. 2.1), (2) collecting the structured, semi-structured, and/or
unstructured data in various formats and encodings (Sect. 2.2), and (3) employing a
data acquisition framework for gathering and delivering information (Sect. 2.3).
2.1 Data Sources
The maritime domain includes a wide variety of heterogeneous data sources that
complement each other and generate data about port visits, vessel routes and trajec-
tories, sensory information about equipment, as well as oceans and environmental
conditions (Lytra et al., 2017).
Port Operations Systems Port community, transport, and logistics systems are
considered an important source of maritime data related to the activities of arrival,
berthing, loading/unloading, shifting, anchorage, and departure of ships from ports.
Examples of such data include arrival/departure timestamps, cargo information,
crew lists, customs declarations, and much more. The data tend to be structured
or semi-structured in nature and fairly accurate as it is often used to calculate port
fees and other commercial operations (Rødseth, Perera, & Mo, 2016).
AIS Stations AIS is an automatic tracking system installed on vessels and used
primarily by vessel traffic services (VTS). AIS allows maritime stakeholders to
monitor and track vessel movements in real time within the range of very high
frequency (VHF) radios on base stations located along coastlines or satellites. An
AIS module typically contains a standardised VHF transceiver, a positioning system
(e.g., global navigation satellite system (GNSS) receiver), and other navigation
sensors (e.g., gyrocompass). AIS data is transmitted by ships every 3–10 s encoded
in a specialised format and contains information about the vessel (e.g., unique
identification, name, flag, type, dimension), the current status (e.g., longitude,
latitude, speed, heading), and its voyage (e.g., destination, estimated time of arrival,
draught). It is estimated that over 520 million vessel positions are recorded and
processed daily (Perobelli, 2016).
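To illustrate how compactly AIS packs this information, the sketch below de-armours an AIVDM payload into its 6-bit groups and extracts the common header fields (message type and MMSI). This is a simplified illustration; a production decoder must also handle signed coordinate fields, multi-sentence messages, and the many message types:

```python
def ais_payload_to_bits(payload):
    """De-armour an AIS NMEA payload: each character carries 6 bits
    (subtract 48 from its ASCII code; subtract a further 8 if the
    result exceeds 40)."""
    bits = []
    for ch in payload:
        v = ord(ch) - 48
        if v > 40:
            v -= 8
        bits.append(format(v, "06b"))
    return "".join(bits)

def decode_header(bits):
    """Extract the common AIS message header: the message type
    (bits 0-5) and the 30-bit MMSI identifier (bits 8-37)."""
    return {"msg_type": int(bits[0:6], 2), "mmsi": int(bits[8:38], 2)}
```

For example, `decode_header` applied to a bit string whose first six bits are `000001` identifies a Class A position report (message type 1).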
Sensing Devices In the current maritime era, most of the ships and ports are
connected with IoT devices containing an array of sensors that generate a huge
amount of data in real time. For instance, modern vessels use built-in sensors,
advanced dynamic position systems, control systems, navigational sensor systems,
etc., to enable automatic fault detection and preemptive maintenance, cargo track-
ing, and energy-efficient operations. IoT-based sensors are also used in containers
for monitoring and maintaining temperature and humidity to ensure the viability of
perishable goods.
Oceanic and Weather Stations Various decisions in the shipping sector, such
as route planning and voyage optimisation, are directly impacted by oceanic
and weather data as discussed in Chapter 22 (Kyriakides, Hayes, & Tsiantis,
2020). Such data typically includes temperature, humidity, wind speed, rainfall, sea
currents, tidal variations, and wave characteristics. Usually, the maritime industry
forecasts weather information using ensemble models, historical data, and current
measurements. Historical weather data is available online and comes in various
formats and resolutions. However, this data is less beneficial in areas where weather
phenomena are affected by small-scale geographic features such as being near the
coast or within narrow ocean currents (Rødseth et al., 2016).
2.2 Data Formats and Encodings
Various data formats and encodings are used for both storing and exchanging mar-
itime data. The various formats have been developed targeting different objectives
and are often used for specific applications. Some of the most common formats are described below.
Network Common Data Form (NetCDF) is a set of machine-independent
data formats employed to create, share, and access array-oriented scientific
data. One of its key features is that the underlying data structure, the variable
names, and necessary metadata are embedded with the actual data. NetCDF
is commonly used in various oceanography and GIS applications, including
weather forecasting and climate change.
Geospatial data interchange format based on JSON (GeoJSON) is an open
standard format based on the popular JSON format and it is used to describe
simple geographical features, along with non-spatial attributes (Gelernter &
Maheshwari, 2019). GeoJSON is a widely used format, especially for map
visualisations and other map-related functionalities.
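As a minimal example of the format, the following builds a GeoJSON Feature for a single (hypothetical) vessel position using Python's standard json module; note that GeoJSON orders coordinates as [longitude, latitude]:

```python
import json

def vessel_feature(mmsi, lon, lat, sog):
    """Build a GeoJSON Feature (a Point geometry plus non-spatial
    properties) for one vessel position report."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"mmsi": mmsi, "speed_over_ground": sog},
    }

collection = {
    "type": "FeatureCollection",
    "features": [vessel_feature(123456789, 33.64, 34.92, 12.3)],
}
doc = json.dumps(collection)  # serialised GeoJSON document
```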
Geography Markup Language (GML) is an XML-based modelling language
for representing geographical features in a standard fashion, including their
properties and interrelationships. In addition to geometric properties, GML can
represent physical entities (e.g., ships, rivers) and sensor data that may or may not
have geometric aspects. Finally, GML also serves as an open interchange format
for online geographic transactions.
Keyhole Markup Language (KML) is another XML-based language used
primarily for the visualisation of geographic information on two or three
dimensional maps. It was made popular by its use on Google Earth. Even though
the KML grammar has several similarities with the GML grammar, the two
formats are not compatible with each other.
Parquet, RCFile, and ORC are open-source column-oriented data formats
that are widely used within the Apache Hadoop ecosystem. These formats are
optimised for large streaming reads of specific data columns, but also support
finding required rows quickly. In addition, these formats typically offer very
efficient compression and enable fast query processing.
2.3 Data Acquisition Frameworks
Most data acquisition frameworks follow the message queuing, publish/subscribe,
or event processing paradigms for collecting new data generated by the data
sources and sending it to the data storage system (Lyko, Nitzschke, & Ngomo,
2016). Internally, the data acquisition frameworks typically implement a predefined
protocol. While several organisations have devised their own proprietary, enterprise-
specific protocols, a few open protocols have been widely adopted over the last few years.
The Advanced Message Queuing Protocol (AMQP) (Kramer, 2009) was the
result of a collaboration among 23 companies aiming to compile a protocol that
(1) is easily extensible and simple to implement; (2) allows message encryption;
(3) has reliable failure semantics; (4) can support different messaging patterns (e.g.,
direct messaging, publish/subscribe); (5) is independent of specific implementations
or vendors. To enable these features, AMQP relies on four key layers, namely the
transport, messaging, transaction, and security layers. According to the transport
layer, messages originate from sender nodes, are forwarded by relay nodes, and
consumed by receiver nodes. The messaging layer is responsible for ensuring the
structure of valid messages, while the transaction one controls the transfers between
senders and receivers. Finally, the security layer enables the encryption of messages.
There are several data acquisition frameworks that are employed to collect,
access, and share data in the maritime industry. Apache Kafka (Apache Kafka,
2019) is a distributed publish-subscribe messaging system designed to transfer
data from various sources to downstream storage systems for batch processing or
real-time processing with stream engines. Hence, Kafka enables using a single
data pipeline for both offline and online data consumers. Kafka also provides
mechanisms for data partitioning as well as parallel load into Hadoop-compatible
systems. Apache Flume (Apache Flume, 2019) is another open-source system
specialised in collecting, aggregating, and moving large amounts of log data from
multiple sources into a single (often distributed) data storage system. Flume has a
flexible architecture based on streaming data flows and supports a variety of failover
and recovery mechanisms (Cavanillas et al., 2016).
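The publish/subscribe paradigm that systems such as Kafka implement at scale can be illustrated with a minimal in-process sketch. This is not the Kafka API; the broker class, topic name, and message contents are illustrative assumptions:

```python
import queue

class PubSubBroker:
    """A toy in-process publish/subscribe broker: producers publish to
    named topics, and every subscriber of a topic receives its own copy
    of each message via a dedicated queue."""

    def __init__(self):
        self._topics = {}

    def subscribe(self, topic):
        # Each subscriber gets its own queue for the topic.
        q = queue.Queue()
        self._topics.setdefault(topic, []).append(q)
        return q

    def publish(self, topic, message):
        # Fan the message out to all current subscribers of the topic.
        for q in self._topics.get(topic, []):
            q.put(message)

broker = PubSubBroker()
storage_feed = broker.subscribe("ais-positions")    # e.g., a storage consumer
analytics_feed = broker.subscribe("ais-positions")  # e.g., a stream-analytics consumer
broker.publish("ais-positions", {"mmsi": 123456789, "sog": 12.3})
```

Both consumers receive the same message independently, which is what lets a single data pipeline serve offline and online consumers at once.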
3 Data Pre-processing
Data pre-processing is an essential step in the maritime big data value chain in
order to enable effective data exploration and data analysis. The data pre-processing
pipeline, illustrated in Fig. 2, consists of (1) data curation and cleaning to ensure the
quality of the data (Sect. 3.1); (2) data integration to efficiently link heterogeneous
data (Sect. 3.2); (3) data transformation to simplify or speedup further analysis
(Sect. 3.3); (4) data reduction to reduce the amount of data needed for analysis
(Sect. 3.4).
3.1 Data Curation and Cleaning
The potential impact of data analysis depends heavily on the quality of the
underlying data used for the analysis. For instance, if dirty data (i.e., data pol-
luted with misspellings, truncations, corruptions, unexpected notations, and other
irregularities) is not properly cleaned before analysis, then the quality of the
data analysis will suffer. Hence, data quality issues can have a major impact on
data management operations, particularly on the decision-making phase of any
organisation (Curry, Freitas, & O’Riáin, 2010). In order to maximise data utilisation,
data curation provides technological and methodological data management support
by enhancing the quality of the data. Data curation is defined as ‘the active and
on-going management of data through its life-cycle of interest and usefulness;
curation activities enable data discovery and retrieval, maintain quality, add value,
and provide for reuse over time’ (Cragin, Heidorn, Palmer, & Smith, 2007). In
Fig. 2 Taxonomy for data pre-processing techniques: data curation and cleaning (missing values, noisy data, inconsistent data), data integration (data consolidation, data federation, data propagation), data transformation, and data reduction (attribute subset selection, dimensionality reduction, data cube aggregation, numerosity reduction)
simple terms, curation is a key activity for managing and promoting the data usage
from its point of creation and for ensuring the quality of the data for future purposes
or reuse.
With reference to the maritime industry, several data management methods
focus on AIS data cleaning tasks. There are several issues with AIS data related
to the fact that it is a 30-year-old technology that was not designed for large-scale acquisition (Holm & Mellegård, 2018; Svanberg, Santén, Hörteborn, Holm,
& Finnsgård, 2019). Firstly, AIS messages may get lost or become corrupt during
the unsecured VHF transmission. Secondly, some message fields, such as the next
port destination and estimated time of arrival, are manually entered by officers on
the bridge and thus frequently contain inconsistent or non-standardised values. In
addition, the timestamps associated with AIS messages are recorded by AIS stations
and, unless they are in the same time zone and time-synced, can often lead to time
discrepancies that cause a ship’s record to skip in time. Finally, AIS contains a lot
of special cases that need to be handled carefully, such as the ‘UTC second’ field
allowing the values 0–59 (as expected) and the value 61 to indicate that there is
something wrong with the GNSS.
In general, data cleaning is the process of managing and maintaining accurate,
clean, and consistent data through identification and removal of duplicate or
inaccurate data. There are three key tasks in data cleaning: (1) fill missing values,
(2) smooth out noisy data, and (3) correct inconsistent data (Han, Pei, & Kamber, 2011).
Missing Values Sometimes, data values are missed during recording due to, for
example, faulty equipment, power outages, or unreliable network transfers. There
exist several data cleaning techniques used to handle missing values, including:
Manually entering missing values, which can be time-consuming and not
scalable for large datasets with too many missing values;
Replacing missing values with the same predefined constant, which represents
either a real default value or a special value to indicate that it is unknown;
Replacing missing values with the mean value computed from all the other values
observed so far or from values observed during a specific time interval;
Replacing missing values with the most probable one, which can be determined
using some inference-based technique such as Bayesian inference or machine
learning (e.g., regression, neural networks).
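The constant- and mean-replacement strategies above can be sketched as follows; the sensor readings are illustrative:

```python
from statistics import mean

def impute(values, strategy="mean", constant=None):
    """Replace None entries in a series of readings. `strategy` is
    'constant' (fill with the given default) or 'mean' (fill with the
    mean of the observed, non-missing values)."""
    observed = [v for v in values if v is not None]
    fill = constant if strategy == "constant" else mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical temperature readings with two gaps
temps = [21.0, None, 23.0, 22.0, None]
```

The inference-based strategies (e.g., regression) follow the same pattern but compute a per-gap prediction instead of a single fill value.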
Noisy Data Distorted or corrupted data that may contain meaningless information
is considered noisy data. The three main methods that can be used to remove noise
from data are the following (Han et al., 2011):
Binning: This method sorts and distributes the data into equi-width buckets/bins.
Data smoothing is then performed by replacing the values in each bin by either
the bin’s median, mean, or boundary value.
Regression: This method smooths noisy data by fitting it into a regression
function. Linear regression and multiple regression are often used for fitting the
data and smoothing out the noise.
Clustering: This method groups similar values into clusters and marks the values
that fall outside the clusters as outliers.
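Smoothing by bin means, the first method above, can be sketched as follows (the input values are illustrative):

```python
def smooth_by_bin_means(values, num_bins):
    """Equi-width binning: partition the value range into `num_bins`
    buckets of equal width, then replace every value by the mean of
    its bucket (smoothing by bin means)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    bins = [[] for _ in range(num_bins)]
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)  # clamp the maximum value
        bins[idx].append(v)
    means = [sum(b) / len(b) if b else None for b in bins]
    return [means[min(int((v - lo) / width), num_bins - 1)] for v in values]

smoothed = smooth_by_bin_means([1, 2, 2, 5, 6, 9, 10], 3)
```

Replacing values by the bin median or by the closest bin boundary follows the same structure with a different per-bin statistic.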
Inconsistent Data Inconsistencies in the recorded data for some entries can be
rectified manually by external references. For example, inconsistencies or mistakes
in textual entries such as the manually entered port destination field in AIS
signals can be rectified using natural language processing techniques or matching
against known data sources (e.g., the list of ports worldwide) (Abdallah, Iphar,
Arcieri, & Jousselme, 2019). Moreover, knowledge engineering tools (i.e., software
applications defined by domain experts for a specific purpose) can be used to
find inconsistencies in known data constraints. For instance, known functional
dependencies among attributes can be exploited to find values that contradict the
functional constraints (Malik, Goyal, & Sharma, 2010).
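Matching a noisy, manually entered destination field against a known list of ports can be sketched with a simple fuzzy match; the port list here is an illustrative assumption, not the worldwide reference list mentioned above:

```python
from difflib import get_close_matches

# Hypothetical reference list of standardised port names
KNOWN_PORTS = ["LIMASSOL", "ROTTERDAM", "GOTHENBURG", "HAMBURG", "PIRAEUS"]

def rectify_destination(raw, cutoff=0.6):
    """Match a free-text AIS destination field against a reference list
    of ports, correcting typos and non-standard spellings; return None
    when no sufficiently close match exists."""
    matches = get_close_matches(raw.strip().upper(), KNOWN_PORTS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Unmatched entries (returning None) would then be routed to manual review or flagged as inconsistent.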
3.2 Data Integration
The process of data integration involves combining and merging data originating
from several sources into a single, organised, and coherent view, often in a data
warehouse. Data integration is considered a key component of various mission-
critical projects of data management, such as building an enterprise data warehouse
for a maritime organisation, migrating data from single or multiple databases to
a new destination in order to organise data in a coherent way, and synchronising
data among several applications (Wang et al., 2016). Consequently, there exists a
variety of data integration techniques exploited by the maritime industry to merge
data from different sources (e.g., AIS data, port call data) to create a single unified
view. This section provides an overview of three key data integration techniques,
namely data consolidation, data federation, and data propagation. Furthermore,
data integration techniques are adapted based on the complexity, heterogeneity, and
volume of involved data sources.
1. Data consolidation refers to the process of consolidating/combining data from
various sources into a centralised and single destination data repository. This
unified data repository is then utilised for different purposes like data analysis.
Moreover, it can also serve as a data source for downstream applications. There
are two operational paradigms for data consolidation: batch and inline (Loshin,
2010). The batch paradigm gathers snapshots of datasets into a single location
(e.g., a temporary location or the unified database), and then performs on them
a variety of consolidation tasks such as parsing, standardisation, blocking, and
matching. The inline paradigm employs some of the consolidation tasks as soon
as new data enters the system within the operational services of the system.
For example, new data instances are parsed and standardised on the fly before
being compared and/or consolidated with existing data instances already stored
in the centralised repository. Overall, data consolidation methods minimise
inefficiencies in data management systems, like duplication of data, cost related
to reliance on several databases, and maintaining multiple data management systems.
2. Data federation is another technique of data integration, which is employed to
consolidate data and simplify access for end-users and front-end applications.
In this technique, distributed data with different data models is integrated into a
virtual database that features a unified data model. This means that there is no
physical data integration involved behind the virtual database. Rather, a uniform
user interface is created by data abstraction for accessing and retrieving data.
Consequently, when an application and/or database users query the federated
virtual database, the query is decomposed and sent to the relevant underlying
data source. Hence, unlike a centralised data repository, data federation provides
on-demand data access to users and applications for data stored in various source systems.
3. Data propagation is used for data integration when data is distributed from one
or multiple enterprise data sources to one or several local access databases. Data
warehouses need to organise a huge amount of data on a daily basis. They may
start with a small amount of data and start growing day by day via continuously
receiving and sharing data from/to different data sources. Data warehouses, data
stores, and operational databases are becoming indispensable tools in today’s
organisations. These data sources need to be continuously updated and the update
process often involves shifting data from one system to another. To make the
shifting process more efficient, the moves need to be performed in batches
within a brief period without affecting the performance, i.e., the availability
of data from the warehouse. To tackle the above-mentioned issues, there exist
several technologies used for data propagation such as enterprise data replication
(EDR), enterprise application integration (EAI), and bulk extract (Alguliyev,
Aliguliyev, & Hajirahimova, 2016). Furthermore, big data integration solutions
are remarkably efficient compared to traditional methods of data integration in
the maritime industry and offer multiple features to support large volume, huge
data diversity, and high speed of data retrieval (Dong & Srivastava, 2013).
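A batch-style consolidation of two sources (standardise, then match on a shared identifier) can be sketched as follows; the record keys ('mmsi', 'name', 'port') and the sample records are illustrative assumptions:

```python
def consolidate(ais_records, port_calls):
    """Consolidation sketch: standardise two sources and match them on
    a shared vessel identifier, producing one unified record per vessel."""
    unified = {}
    for rec in ais_records:
        # Standardisation step: trim and upper-case the vessel name.
        unified[rec["mmsi"]] = {"mmsi": rec["mmsi"],
                                "name": rec["name"].strip().upper()}
    for call in port_calls:
        # Matching step: attach port calls to the same vessel record.
        entry = unified.setdefault(call["mmsi"], {"mmsi": call["mmsi"], "name": None})
        entry.setdefault("port_calls", []).append(call["port"])
    return unified

ais = [{"mmsi": 1, "name": " mv example "}]
calls = [{"mmsi": 1, "port": "LIMASSOL"}, {"mmsi": 1, "port": "PIRAEUS"}]
merged = consolidate(ais, calls)
```

In the inline paradigm, the same standardise-and-match steps would instead run on each new record as it enters the system.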
3.3 Data Transformation
Data transformation is the process of converting data formats and structures into
other forms that are more appropriate for data storage or data mining. The plurality
of data formats that appear in the maritime industry has led to the use of a
variety of data transformation techniques, including the following:
1. Numeric normalisation involves scaling attribute data to fall within a predefined
range, such as between 0 and 1. Two common forms of normalisation are min-
max normalisation, which uses the minimum and maximum values to perform
the transformation, and z-score normalisation, which uses the mean and standard
deviation to transform the values.
2. Aggregation applies various operations on data in order to present information
in summary form, and it is particularly useful for performing statistical analysis.
For instance, the data from daily ship arrivals at a specific port can be aggregated
to calculate weekly or monthly statistics. Aggregation is often employed to
construct a data cube for analysing the data at multiple granularities (Han et al., 2011).
3. Generalisation refers to the process of replacing the low-level or primitive data
with higher-level concepts by using concept hierarchies (Narang, Kumar, & Verma, 2017). For instance, numerical attributes describing the size of vessels
can be generalised to higher-level concepts, like small, medium, and large.
4. Attribute construction is utilised to construct new attributes from a given set
of values in order to enhance accuracy and to assist the data mining procedure
(Malik et al., 2010). For instance, we can add the attribute ‘area’ for any maritime
object, like the waiting area at a port terminal, based on the attributes ‘length’and
width’. By combining various attributes, new information can be discovered and
3.4 Data Reduction
A data warehouse may store terabytes or sometimes petabytes of data and, hence,
it can be very time-consuming to perform data analysis operations on the raw data.
Data reduction methods can be applied to attain a reduced form of data, which is
smaller than the original data but still maintains the key properties of the dataset.
In this way, mining on the reduced data offers more efficiency and produces the
same or very similar statistical results (Narang et al., 2017). There exist various data
reduction strategies, the most common of which are explained below:
1. Attribute subset selection: Sometimes, data contains hundreds of attributes,
some of which may be irrelevant to the data analysis or mining task. For instance,
suppose the task is to classify or compute statistics for the arrived vessels at
a port based on the country or region of origin. In this case, some attributes
are irrelevant, such as the ship name, ship type, etc. Hence, the key objective
of attribute subset selection is to select the relevant attributes to make the data
mining process more efficient. Heuristic and meta-heuristic algorithms are often
employed for attribute selection because of their lower time complexity (Min &
Xu, 2016).
2. Dimensionality reduction: In dimensionality reduction, data encoding is
applied in order to achieve a reduced size of the original data. If the original
dataset can be reconstructed from a compressed dataset without losing any
information, it is called lossless data reduction. On the contrary, if some
information is lost during reconstructing the original dataset, then it is called
lossy data reduction. Principal components analysis and discrete wavelet
transform are common procedures that can lead to lossy data reduction
(Cunningham & Ghahramani, 2015).
3. Data cube aggregation: Data aggregation operations to construct multi-
dimensional data cubes can also lead to data reduction. For instance, consider
a dataset that contains a daily report of arriving and departing vessels at a port.
However, suppose the task at hand involves generating reports of monthly and
annual vessel traffic at the port. Aggregation operations can be performed to get
a dataset that is smaller in volume but still contains the necessary information
to complete the task (Narang et al., 2017). Data cubes store data in multi-
dimensional arrays, making it efficient to access specific values with fixed
offsets instead of moving across tables, and thus can greatly assist the data
analysis tasks.
4. Numerosity reduction: The volume of data can also be reduced by exploiting
alternative, more condensed data representations. These representations can be
either parametric (where a model is employed to capture the data characteristics
and thus only the model parameters are needed) or non-parametric (e.g., samples,
clusters, sketches, histograms).
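The data cube aggregation described in item 3 (daily port arrivals rolled up to monthly totals) can be sketched in a few lines of Python; the dates and counts below are invented purely for illustration:

```python
from collections import defaultdict

# Daily vessel arrivals at a port: (date, arrivals) -- illustrative values.
daily = [
    ("2020-01-05", 12), ("2020-01-17", 9),
    ("2020-02-03", 15), ("2020-02-21", 7),
]

# Roll the daily records up to monthly totals (a 1-D "data cube" slice).
monthly = defaultdict(int)
for date, arrivals in daily:
    month = date[:7]          # "YYYY-MM"
    monthly[month] += arrivals

print(dict(monthly))          # {'2020-01': 21, '2020-02': 22}
```

The monthly dataset is smaller than the daily one but still answers the monthly-traffic reporting task exactly.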
4 Data Storage
Storage of large-scale data is a flourishing area that has gained a lot of attention in
the last decade, both commercially and from academia (Kokkinakos et al., 2017).
Especially in the maritime domain, where a lot of sensors are used on board ships
and generate vast volumes, efficient and reliable storage methods are needed to
ensure data quality, scalability, and availability (Ferreira et al., 2017). Usually, data
is stored in formats suited to fast sequential logging, without any indexing or
other analytics capabilities (Wang et al., 2016). For instance, due to small storage
requirements and fast loading times, CSV (comma-separated values) files, Matlab
data files, or custom binary files are typically used to store the data. However,
with these formats, every basic operation requires a linear scan over the data.
Basic operations include selecting data within a specific time window, selecting
columns, finding specific values, etc. Apart from the traditional
data warehouses, in the current technology era, distributed file systems (Sect. 4.1),
NoSQL data stores (Sect. 4.2), and spatiotemporal systems (Sect. 4.3) are considered
the latest storage technologies and are explained in detail below. Table 1 offers a
comparison between the different data storage technologies.
4.1 Distributed File Systems
A distributed file system (DFS) is employed to build a hierarchical view of multiple
file servers and to persistently store files on a set of nodes, which are connected
Table 1 Comparison of different data storage technologies

Distributed file systems (e.g., HDFS, NFS). Clients and storage resources are
dispersed in the network, and the goal is to provide a unified file system view even
though it has a distributed implementation. Data are stored in files of various formats.
Benefits: data availability and reliability; load balancing and scalability; flexibility
in storing unstructured data in any format. Limitations: no efficient access to
particular parts of the data; no indexes.

NoSQL key-value stores (e.g., Voldemort, Redis). Data is represented as a collection
of key-value pairs. Create, read, update, and delete operations are supported based
only on user-defined keys. Benefits: operational simplicity for lowest total cost of
ownership; great scalability; low access latency. Limitations: only simple operations
supported; no complex queries or data joins; a value can be updated only as a whole.

NoSQL document stores (e.g., SimpleDB, CouchDB, MongoDB). Data are organised
in documents based on user-defined keys. The system is aware of the (arbitrary)
document structure and supports indexing on multiple fields. Works best with
semi-structured data. Benefits: support for lists, nested documents, and searches
based on multiple fields; efficient and scale-out architectures. Limitations: storage
consumption is somewhat higher because of de-normalisation; no support for data
joins.

NoSQL wide-columnar stores (e.g., BigTable, HBase, Cassandra). Data is organised
into tables and attributes are organised into column families. A column family
resembles a document, so the system has knowledge of its underlying structure.
Benefits: improved cache locality; efficient data compression; good performance on
aggregation queries. Limitations: no transactions; increased tuple reconstruction
costs; increased cost of inserts.

NoSQL graph stores (e.g., Neo4j, ArangoDB). Specialised in the efficient
management of graph data, which contains nodes connected by edges. The graph
representation shows not only useful information about entities but also how each
entity connects with or is related to others. Benefits: efficient handling of highly
connected data; fast searching based on relationships. Limitations: optimised only
for graph-like data; not efficient in handling queries that span the entire database.

Spatiotemporal systems (e.g., PostGIS, GeoMesa). Database management systems
adapted to manage entities/objects containing space and time information.
Spatiotemporal objects can dynamically update their spatial location and/or extents
along with time. Benefits: optimised for receiving intensive updates; efficient access
methods to store and retrieve spatiotemporal objects. Limitations: most existing
systems are centralised and offer limited scalability; distributed systems lack
support for advanced operations.
through a network (Thanh, Mohan, Choi, Kim, & Kim, 2008). In a DFS, the
data is stored on storage media located on multiple (perhaps remote) servers;
however, the stored data is accessed as if it was stored locally. A DFS allows users
to organise, manipulate, and share data seamlessly, regardless of the actual data
location on the network. However, file availability and location are considered key
issues in DFS, which are solved using the replication of files (Thanh et al., 2008).
Replication methods are categorised into two types, namely pessimistic replication
and optimistic replication (Toader & Toader, 2017). There exist several distributed
file systems, such as the Hadoop Distributed File System (HDFS), Network File
System (NFS), NetWare, and CIFS/SMB.
HDFS (Shvachko, Kuang, Radia, & Chansler, 2010) is the main storage system
utilised in the Apache ecosystem and it is widely used by popular big data analytics
platforms such as Hadoop and Spark. The files stored on HDFS are broken
down into relatively large blocks (64–128 MB), which are then replicated and
distributed over the cluster nodes. HDFS implements a distributed file system using
a master/worker architecture, which provides fault tolerance and high-performance
access to data over highly-scalable clusters with commodity hardware. The master
node, called NameNode, is responsible for maintaining the file system namespace
and executing the typical file system operations (e.g., create directory, open file).
The worker nodes, called DataNodes, are responsible for storing the actual data on
locally-attached storage media as well as performing read/write operations and other
various tasks such as block creation, replication, and deletion.
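As a rough sketch of the block-splitting and replication ideas described above (plain Python, not actual HDFS code; the 300 MB file size, node names, and round-robin placement policy are simplifying assumptions for illustration):

```python
# Toy illustration of HDFS-style block splitting and replica placement.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS block size
REPLICATION = 3                  # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte ranges of the blocks a file would be split into."""
    return [(start, min(start + block_size, file_size))
            for start in range(0, file_size, block_size)]

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` DataNodes, round-robin style."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [datanodes[(i + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file -> 3 blocks
nodes = ["dn1", "dn2", "dn3", "dn4"]
print(len(blocks))                               # 3
print(place_replicas(blocks, nodes)[blocks[0]])  # ['dn1', 'dn2', 'dn3']
```

The real NameNode uses rack-aware placement rather than round-robin, but the bookkeeping it maintains has this same shape: block ranges mapped to sets of DataNodes.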
4.2 NoSQL Data Stores
NoSQL data stores have emerged to address horizontal scalability and high avail-
ability requirements of data management in the Cloud, which are hard to achieve
with traditional relational database management systems. NoSQL stores follow a
shared-nothing architecture and work by replicating and partitioning data over large
clusters of network-connected servers. The data storage models do not rely on the
relational data model. In addition, they distribute workloads of simple operations
such as key-based searches, while they support reads and writes of one or a small
number of records. There is no or very little support for complex queries or joins.
NoSQL databases are particularly useful in two cases: (1) to store data without a
predefined schema; and (2) to analyse streams of live data in real time, where pre-
processing and indexing of records is not possible (Wang et al., 2016).
There are four key types of NoSQL stores and multiple systems for each type
(Davoudian, Chen, & Liu, 2018). In particular, there are key-value stores, document
stores, extensible record or wide-columnar stores, and graph stores. Below we
outline the key features of each type.
Key-Value Stores Key-value stores (e.g., Dynamo, Voldemort, Riak) offer a basic
key-value interface. The values are stored based on user-defined keys, while the
system is unaware of the structure of the value. In a distributed setting, data is
typically replicated and partitioned so that each server contains a subset of the
total data. Moreover, several existing stores, such as Voldemort (Sumbaly et al.,
2012), provide tunable consistency (i.e., strong or eventual consistency), which
offers tradeoffs between availability, latency, and consistency. Key-value stores can
be used with simple applications that are mainly developed for storing and finding
data based on a single attribute. For example, a web interface can visualise aggregate
information for AIS or maritime sensor data that are precomputed and stored in a
key-value store.
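The precomputed-aggregate use case above boils down to puts and gets on user-defined keys. A plain in-process dictionary can stand in for a distributed store such as Voldemort or Redis in a minimal sketch (the key naming scheme and values are invented):

```python
class KeyValueStore:
    """Minimal key-value interface: the store never inspects the value."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
# Precomputed AIS aggregate keyed by a single compound attribute (port + day).
store.put("arrivals:CYLMS:2020-03-01", 42)
print(store.get("arrivals:CYLMS:2020-03-01"))  # 42
```

Because lookups use only the full key, this model scales easily by hashing keys across servers, which is exactly why richer queries (ranges, joins) are not supported.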
Document Stores Document-oriented NoSQL systems (e.g., SimpleDB,
CouchDB, MongoDB) store data in JSON or JSON-like format and support queries
using specific APIs. The primary unit of data is a document, which is a JSON object
comprising named fields. Field values may be strings, numbers, dates, or even
ordered lists and associative maps. Documents are stored based on user-defined
keys but, unlike key-value stores, the system has knowledge of the document
structure. Document stores are useful for applications that store multiple different
kinds of objects and need to search data based on multiple fields. For example, a
document store can be used for storing position and voyage AIS vessel information
to support computational maritime situational awareness applications (Cazzanti,
Millefiori, & Arcieri, 2015).
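A toy sketch of the document-store idea, assuming a single secondary index on one field (the field names, keys, and values are invented; a real system such as MongoDB maintains such indexes internally):

```python
from collections import defaultdict

class DocumentStore:
    """Toy document store: JSON-like dicts keyed by id, with one field index."""
    def __init__(self, indexed_field):
        self.docs = {}
        self.indexed_field = indexed_field
        self._index = defaultdict(set)       # field value -> set of doc keys

    def insert(self, key, doc):
        self.docs[key] = doc
        self._index[doc.get(self.indexed_field)].add(key)

    def find_by_index(self, value):
        """Retrieve all documents whose indexed field equals `value`."""
        return [self.docs[k] for k in sorted(self._index[value])]

store = DocumentStore(indexed_field="ship_type")
store.insert("v1", {"mmsi": 123456789, "ship_type": "cargo", "speed": 11.2})
store.insert("v2", {"mmsi": 987654321, "ship_type": "tanker", "speed": 9.8})
store.insert("v3", {"mmsi": 111222333, "ship_type": "cargo", "speed": 13.0})

print(len(store.find_by_index("cargo")))   # 2
```

Unlike the key-value sketch, the store here understands document structure, so queries by any indexed field avoid a full scan.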
Wide-Columnar Stores Wide-columnar NoSQL databases such as BigTable,
HBase, and Cassandra follow a sparse data model that resembles a three-
dimensional sorted map from the triplet (row key, column key, timestamp) to a
value. The row key identifies a full record, while a column key comprises a column
family and a column name (i.e., a named field). A column family resembles a
document, so the system has knowledge of its underlying structure. The column
family information is then used to replicate and distribute data both horizontally
and vertically. Overall, wide-columnar databases are optimised for storing large
amounts of semi-structured data such as AIS data and ship routes, which can then
be efficiently analysed by higher-level data processing systems such as Spark (Qin,
Ma, & Niu, 2018).
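The three-dimensional map described above can be sketched directly in Python (a toy stand-in for systems like HBase; the row key format, column family, and timestamps are invented):

```python
# Toy wide-columnar cell map: (row key, "family:qualifier", timestamp) -> value.
cells = {}

def put(row, column, timestamp, value):
    cells[(row, column, timestamp)] = value

def get_latest(row, column):
    """Return the value with the newest timestamp for this row/column."""
    versions = {ts: v for (r, c, ts), v in cells.items()
                if r == row and c == column}
    return versions[max(versions)] if versions else None

# Two timestamped positions for one vessel row, in a "pos" column family.
put("vessel:123456789", "pos:lat", 1577836800, 34.91)
put("vessel:123456789", "pos:lat", 1577840400, 34.95)

print(get_latest("vessel:123456789", "pos:lat"))  # 34.95
```

Real systems keep the map sorted by row key and prune old timestamps automatically; the point here is only the shape of the data model, with the timestamp dimension giving free versioning of cells.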
Graph Stores Graph database systems (e.g., Neo4j, ArangoDB, TitanDB) are
optimised for storing graph data, i.e., data consisting of nodes (vertices) connected
by edges (links). This representation shows not only useful information about
entities but also how each individual entity connects with or is related to others.
Internally, the data is stored as nodes and edges, both of which can have multiple
attributes. Secondary indexes for both nodes and edges are also supported, enabling
the fast retrieval of nodes by attributes and edges by type. For example, a graph
can be built consisting of vessel voyages between ports based on the arrivals and
departures of vessels from ports, which can then be used for analysing vessel
movement patterns (Carlini et al., 2020).
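The port-voyage graph mentioned above can be sketched with plain adjacency lists (port names and voyages are invented; a graph store such as Neo4j would additionally persist attributes and secondary indexes):

```python
from collections import defaultdict

# Toy voyage graph: nodes are ports, one directed edge per observed voyage.
edges = defaultdict(list)          # origin port -> list of destination ports

def add_voyage(origin, destination):
    edges[origin].append(destination)

add_voyage("Limassol", "Piraeus")
add_voyage("Limassol", "Beirut")
add_voyage("Piraeus", "Valletta")

def reachable(start):
    """Ports reachable from `start` by following voyage edges (DFS)."""
    seen, stack = set(), [start]
    while stack:
        port = stack.pop()
        for nxt in edges[port]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(reachable("Limassol")))  # ['Beirut', 'Piraeus', 'Valletta']
```

Relationship traversals like `reachable` are the operations graph stores optimise for; expressing the same query over relational tables would require repeated self-joins.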
4.3 Spatiotemporal Systems
A spatiotemporal or spatial-temporal system is a database management system
that is adapted to manage entities/objects containing space and time information.
A spatiotemporal object is a special type of object that dynamically updates
its spatial location and/or extents along with time. Some typical examples from
the maritime domain are a moving vessel whose location continuously changes
over time, or oceanic measurements collected by sensors on-board a drifting
buoy. Spatiotemporal systems have a lot of key applications in various domains,
especially in the maritime industry, and are used with Environmental Information
Systems (EIS), Traffic Control Systems (TCS), Location-Aware Systems (LAS),
and Geographic Information Systems (GIS). Unlike conventional data storage
systems, spatiotemporal ones are able to manage dynamic properties of objects in
an efficient way. In particular, they are optimised for receiving intensive updates
due to the constantly changing properties of spatiotemporal objects. Furthermore,
sampling-based or velocity-based updating methods are adopted to minimise the
objects’ updates (Xiong, Mokbel, & Aref, 2017). Besides the updating methods,
spatiotemporal systems also adopt new access methods to efficiently store and
retrieve spatiotemporal objects (Nguyen-Dinh, Aref, & Mokbel, 2010).
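A minimal sketch of a spatiotemporal range query (vessel positions within a bounding box and time window); the linear scan below is exactly what the access methods cited above are designed to avoid, and all coordinates and timestamps are invented:

```python
# Toy spatiotemporal range query over vessel position reports.
# Each record: (mmsi, lat, lon, unix_time) -- values are illustrative.
reports = [
    (123456789, 34.92, 33.63, 1577836800),
    (123456789, 35.10, 33.90, 1577840400),
    (987654321, 34.70, 33.00, 1577838000),
]

def range_query(records, lat_min, lat_max, lon_min, lon_max, t_min, t_max):
    """Linear-scan filter; real systems replace this with a spatiotemporal
    index so that only candidate records are ever touched."""
    return [r for r in records
            if lat_min <= r[1] <= lat_max
            and lon_min <= r[2] <= lon_max
            and t_min <= r[3] <= t_max]

hits = range_query(reports, 34.5, 35.0, 33.0, 34.0, 1577836800, 1577839000)
print(len(hits))  # 2
```

The query combines a spatial predicate (the bounding box) with a temporal one (the time window), which is the characteristic access pattern spatiotemporal systems must serve efficiently under a heavy update load.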
PostGIS (Corti, Kraft, Mather, & Park, 2014) is a spatial database extension
for the popular PostgreSQL database system. PostGIS provides new spatial-specific
data types to PostgreSQL such as support for points, lines, polygons, and other
geometric objects. In addition, it offers specialised features for geometric and
geographic processing, topogeometry functions and topologies, as well as raster
processing and analysis. Finally, it implements several standards such as GML,
KML, and GeoJSON (recall Sect. 2.2). PostGIS-T (Simoes, Queiroz, Ferreira,
Vinhas, & Camara, 2016) is a recent extension of PostGIS that focuses on the
temporal dimension of spatial data. It implements a formal spatiotemporal algebra
and adds support for three types of spatiotemporal data, namely time series,
trajectories, and coverages. Unlike PostGIS-T that builds on top of PostgreSQL,
TerraLib (Câmara et al., 2008) extends object-relational database techniques to
support spatiotemporal data types, while supporting multiple underlying DBMSs,
including PostgreSQL, MySQL, and Oracle. Furthermore, TerraLib enables spatial,
temporal, and attribute queries on the stored data. Such systems have been used in
the past to perform various AIS-based route analyses and to help visualise ship
routes from raw AIS data (Fiorini, Capata, & Bloisi, 2016).
In the distributed systems arena, GeoMesa (GeoMesa, 2019) is an open-source
suite of tools that can be used for storing, transforming, indexing, and querying
spatiotemporal data on top of other distributed computing systems. Specifically,
GeoMesa offers spatiotemporal indexing capabilities for point, line, and polygon
data stored in Accumulo, HBase, Google BigTable, and Cassandra. In addition,
GeoMesa provides stream processing in near real time for spatiotemporal data
streamed through Apache Kafka. GeoMesa is currently used by exactEarth, an
AIS vessel tracking data service company, for storing more than 25 million AIS
messages per day (exactEarth, 2020). Recently, several approaches have extended
Apache Spark for supporting spatial and spatiotemporal data. LocationSpark and
SpatialSpark directly extend Spark’s core data model to provide users with granular
control over the spatial operation execution plan, while GeoSpark builds specialised
indexes over spatiotemporal data (Yu & Sarwat, 2019).
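Many of these distributed systems, GeoMesa among them, index spatiotemporal data with space-filling curves so that nearby points map to nearby keys in an ordinary sorted key-value store. A minimal Z-order (bit-interleaving) sketch, with an arbitrary resolution and no claim to match any particular system's actual encoding:

```python
def z_order_key(lat, lon, bits=16):
    """Interleave the bits of discretised lat/lon into one sortable key.
    A minimal sketch of the space-filling-curve idea behind systems like
    GeoMesa; real implementations add time and finer encodings."""
    # Discretise each coordinate into [0, 2^bits) cells.
    y = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    x = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits on even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits on odd positions
    return key

# Nearby points receive nearby keys, so a spatial range scan touches
# only a few contiguous key ranges in the underlying store.
k1 = z_order_key(34.92, 33.63)
k2 = z_order_key(34.93, 33.64)
k3 = z_order_key(-10.0, 150.0)
print(abs(k1 - k2) < abs(k1 - k3))  # True
```

This is what lets point, line, and polygon data live inside key-ordered stores such as Accumulo, HBase, or Cassandra without any native spatial support.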
5 Data Usage
Data usage involves query processing, analytics, and visualisation methods and
tools for accessing the underlying maritime data and generating value for different
data-driven business activities. The proper data usage in the decision-making of
a growing industry like the maritime industry can improve competitiveness by
reducing operational costs, providing better services to end-users, or improving any
other parameter that can be measured against existing performance criteria
(Cavanillas et al., 2016). Query processing may involve browsing, searching, reporting,
finding correlations, identifying patterns, and predicting relations across maritime
data (Curry, 2016). A plethora of query processing systems have been developed
over the years, the most popular of which are described in this section.
5.1 Query Processing Systems
Query processing systems provide query facades on top of storage or file systems.
They usually offer an SQL-like query interface to access the data; however, they
follow different approaches and exhibit different performance compared to tradi-
tional DBMSs. For example, Hive and SparkSQL provide SQL-like functionality
over large-scale distributed data, Presto and Impala specialise in interactive data
processing, and GeoMesa and GeoSpark provide spatiotemporal querying and
analytics capabilities.
Apache Hive Apache Hive (Apache Hive, 2019) is built on top of HDFS to
offer query and analysis over large-scale structured data. It provides an SQL-like
interface, termed HiveQL, to query data stored in different storage systems and
databases, which integrate with HDFS. Hive runs queries by translating them into
either Tez or MapReduce jobs. Hence, Hive queries often suffer from high latencies,
even for smaller datasets. In addition, Hive offers the flexibility to evolve schemas
efficiently, as schemas are stored independently and data is validated only at query
time. This is known as schema-on-read compared to the schema-on-write approach
of traditional relational DBMSs. Hive works best for analysing and generating
reports for structured data such as port management, transport, and logistics data.
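The schema-on-read idea can be illustrated without Hive itself: raw records stay on storage as untyped text, and a column-to-type schema is applied only when a query reads them (the file content, column names, and types below are invented):

```python
import csv
import io

# Raw port-call records sit on storage as untyped text (schema-on-read):
# nothing is validated or converted at write time.
raw = io.StringIO(
    "vessel,arrival,teu\n"
    "Aphrodite,2020-03-01,1200\n"
    "Poseidon,2020-03-02,800\n"
)

# The schema (column -> type) is applied only at query time.
schema = {"vessel": str, "arrival": str, "teu": int}

def read_with_schema(fobj, schema):
    """Parse rows lazily, casting each column per the query-time schema."""
    for row in csv.DictReader(fobj):
        yield {col: cast(row[col]) for col, cast in schema.items()}

rows = list(read_with_schema(raw, schema))
total_teu = sum(r["teu"] for r in rows)
print(total_teu)  # 2000
```

Under schema-on-write, by contrast, the cast and validation would happen once at load time, which speeds up queries but makes schema changes far more expensive.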
SparkSQL SparkSQL offers a programming abstraction called DataFrames and
acts as a distributed SQL query engine. In contrast to Apache Hive, SparkSQL
provides query processing with lower latency due to the in-memory processing
capabilities of Spark. SparkSQL supports the HiveQL interface and enables HiveQL
queries to run faster than Hive, even without any modifications in data and queries
(Xin et al., 2013). Moreover, it integrates tightly with the rest of the Spark
ecosystem, for example, combining SQL query processing with machine learning applications.
SparkSQL has recently been extended with several geospatial functions, which
enable the execution of complex geospatial analytics tasks in the maritime domain
(Hezbor & Hughes, 2017).
Apache Impala Apache Impala (Apache Impala, 2019) also provides query
processing with low latency. It employs massively parallel processing database
techniques to enable users to issue low latency SQL queries on data stored in
Apache HBase and HDFS without requiring data transformation or movement.
Impala employs an SQL-like interface similar to Apache Hive; however, in order
to achieve low latency, it adopts its own distributed query engine. Impala has been
employed in the past for mass spatiotemporal trajectory data sharing (Zhou, Chen,
Yuan, & Chen, 2016).
Presto Presto (Presto SQL, 2019) is an open-source, distributed SQL query engine
optimised for executing interactive analytical queries over large-scale datasets. It
allows querying data from various sources, e.g., relational DBMSs, Cassandra,
Hive, or even proprietary data stores. A Presto query combines data from several
sources and allows analytics across the entire software stack. Additionally, unlike
other Hadoop-based tools (e.g., Impala), Presto is able to work with any flavour of
Hadoop or without it.
GeoSpark GeoSpark (Yu, Zhang, & Sarwat, 2019) is a cluster computing frame-
work that is developed to process large-scale spatial and spatiotemporal data (e.g.,
vessel location data, weather maps, etc.). It extends SparkSQL and Apache Spark with
SpatialSQL to efficiently load, process, and analyse huge amounts of spatial data.
In addition, GeoSpark has introduced a novel interface that follows the SQL/MM Part 3
standard, an international standard specifying the storage, retrieval, and processing
of spatial data using SQL. Overall, GeoSpark is able to produce optimised spatial
query plans and to run spatial queries on large datasets efficiently (Yu et al., 2019).
6 Conclusion
Maritime stakeholders are continuously collecting large volumes of heterogeneous
spatiotemporal data from various sources, for example, sensor data, AIS data, traffic
data, port call data, and environmental monitoring data. The maritime data value
chain defines the series of the four key activities needed to appropriately manage
this data, namely data acquisition, pre-processing, storage, and usage. As described
in this chapter, a large arsenal of technological tools and frameworks are currently
available for efficiently collecting, cleaning, integrating, storing, and analysing the
data in order to extract value and useful insights that will satisfy several critical
applications in the maritime industry (e.g., optimising port operations, planning
optimised routes, performing predictive maintenance).
Nevertheless, the large volume and variety of data in combination with the
unique characteristics of spatiotemporal data, are turning data mining, big data
analytics, and data visualisation into significantly challenging issues in the maritime
domain due to high computation and communication complexities. In addition, the
integration of data management technologies that span multiple ships and ports is
still an open challenge mainly because of unreliable and slow transmissions as well
as incompatible application programming interfaces. With regard to spatiotemporal
systems, current distributed ones (e.g., GeoMesa, SpatialSpark, GeoSpark) are
capable of handling large volumes but lack support for advanced operations
for geometric, geographic, and topogeometric processing and analysis. Hence, a
multi-discipline, coordinated effort is still needed to advance the features and
functionalities provided by the most relevant prior research projects and large-scale
data processing systems used by the maritime industry today.
Acknowledgments This work was co-funded by the European Regional Development Fund
and the Republic of Cyprus through the Research and Innovation Foundation (STEAM Project:
References
Abdallah, N. B., Iphar, C., Arcieri, G., & Jousselme, A.-L. (2019). Fixing errors in the AIS
destination field. In Oceans 2019-Marseille (pp. 1–5).
Akyuz, E., Ilbahar, E., Cebi, S., & Celik, M. (2017). Maritime environmental disaster management
using intelligent techniques. In Intelligence systems in environmental management: Theory and
applications (pp. 135–155). Berlin: Springer.
Alguliyev, R. M., Aliguliyev, R. M., & Hajirahimova, M. S. (2016). Big data integration
architectural concepts for oil and gas industry. In Proceedings of the IEEE 10th International
Conference on Application of Information and Communication Technologies (AICT) (pp. 1–5).
Apache Flume. (2019). Last accessed: November 22, 2019.
Apache Hive. (2019). Last accessed: November 22, 2019.
Apache Impala. (2019). Last accessed: November 22, 2019.
Apache Kafka. (2019). Last accessed: November 22, 2019.
Be¸sikçi, E. B., Arslan, O., Turan, O., & Ölçer, A. (2016). An artificial neural network based deci-
sion support system for energy efficient ship operations. Computers & Operations Research,
66, 393–401.
Câmara, G., Vinhas, L., Ferreira, K. R., De Queiroz, G. R., De Souza, R. C. M., Monteiro, et
al. (2008). TerraLib: An open source GIS library for large-scale environmental and socio-
economic applications. In Open source approaches in spatial data handling (pp. 247–270).
Berlin: Springer.
Carlini, E., de Lira, V. M., Soares, A., Etemad, M., Machado, B. B., & Matwin, S. (2020).
Uncovering vessel movement patterns from AIS data with graph evolution analysis. In
Proceedings of the 23rd International Conference on Extending Database Technology (EDBT)
(p. 7).
Cavanillas, J. M., Curry, E., & Wahlster, W. (2016). New horizons for a data-driven economy: A
roadmap for usage and exploitation of big data in Europe. Berlin: Springer.
Cazzanti, L., Millefiori, L. M., & Arcieri, G. (2015). A document-based data model for large scale
computational maritime situational awareness. In Proceedings of the 2015 IEEE International
Conference on Big Data (Big Data) (pp. 1350–1356).
Corti, P., Kraft, T. J., Mather, S. V., & Park, B. (2014). PostGIS cookbook. Birmingham: Packt
Publishing Ltd.
Cragin, M. H., Heidorn, P. B., Palmer, C. L., & Smith, L. C. (2007). An educational program on
data curation. In Science and Technology Section of the Annual American Library Association
Cunningham, J. P., & Ghahramani, Z. (2015). Linear dimensionality reduction: Survey, insights,
and generalizations. The Journal of Machine Learning Research, 16(1), 2859–2900.
Curry, E. (2016). The big data value chain: Definitions, concepts, and theoretical approaches. In
J. M. Cavanillas, E. Curry, & W. Wahlster (Eds.), New horizons for a data-driven economy:
A roadmap for usage and exploitation of big data in Europe (pp. 29–37). Cham: Springer
International Publishing.
Curry, E., Freitas, A., & O’Riáin, S. (2010). The role of community-driven data curation for
enterprises. In Linking enterprise data (pp. 25–47). Berlin: Springer.
Davoudian, A., Chen, L., & Liu, M. (2018). A survey on NoSQL stores. ACM Computing Surveys,
51(2), 40.
DGConnect. (2013). A European strategy on the data value chain. Tech. Rep. Brussels: European Commission.
Dong, X. L., & Srivastava, D. (2013). Big data integration. In Proceedings of the IEEE 29th
International Conference on Data Engineering (ICDE) (pp. 1245–1248).
exactEarth AIS Vessel Tracking. (2020). Last accessed: March 30, 2020. https://www.exactearth.
Ferreira, J., Agostinho, C., Lopes, R., Chatzikokolakis, K., Zissis, D.,Vidal, M.-E., et al. (2017).
Maritime data technology landscape and value chain exploiting oceans of data for maritime
applications. In Proceedings of the 2017 International Conference on Engineering, Technology
and Innovation (ICE/ITMC) (pp. 1113–1122).
Fiorini, M., Capata, A., & Bloisi, D. D. (2016). AIS data visualization for maritime spatial planning
(MSP). International Journal of e-Navigation and Maritime Economy, 5, 45–60.
Gelernter, J., & Maheshwari, N. (2019). Qualitative study of the incompatibility of indoor map file
formats with location software applications. Open Geospatial Data, Software and Standards,
4(1), 7.
GeoMesa. (2019). Last accessed: November 22, 2019.
Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques. Amsterdam: Elsevier.
Hezbor, A., & Hughes, J. (2017, November). Maritime Location Intelligence with exactEarth data
and GeoMesa. Last accessed: March 30, 2020.
Holm, H., & Mellegård, N. (2018). Fast decoding of automatic identification systems (AIS) data.
In Proceedings of the International Conference on Computer Applications and Information
Technology in the Maritime Industries (COMPIT).
Kokkinakos, P., Michalitsi-Psarrou, A., Mouzakitis, S., Alvertis, I., Askounis, D., & Koussouris,
S. (2017). Big data exploitation for maritime applications: A multi-segment platform to
enable maritime big data scenarios. In Proceedings of the 2017 International Conference on
Engineering, Technology and Innovation (ICE/ITMC) (pp. 1131–1136).
Kramer, J. (2009). Advanced message queuing protocol (AMQP). Linux Journal, 2009(187), 3.
Kyriakides, I., Hayes, D., & Tsiantis, P. (2020). Intelligent maritime information acquisition and
representation for decision support. In M. Lind, M. Michaelides, R. Ward, & R. T. Watson
(Eds.), Maritime informatics (chap. 22). Cham: Springer.
Loshin, D. (2010). Master data management. Burlington: Morgan Kaufmann.
Lyko, K., Nitzschke, M., & Ngomo, A.-C. N. (2016). Big data acquisition. In J. M. Cavanillas, E.
Curry, & W. Wahlster (Eds.), New horizons for a data-driven economy: A roadmap for usage
and exploitation of big data in Europe (pp. 39–62). Cham: Springer International Publishing.
Lytra, I., Vidal, M.-E., Orlandi, F., & Attard, J. (2017). A big data architecture for managing oceans
of data and maritime applications. In Proceedings of the 2017 International Conference on
Engineering, Technology and Innovation (ICE/ITMC) (pp. 1216–1226).
Malik, J. S., Goyal, P., & Sharma, A. K. (2010). A comprehensive approach towards data
preprocessing techniques & association rules. In Proceedings of the 4th National Conference.
Michaelides, M. P., Herodotou, H., Lind, M., & Watson, R. T. (2019). Port-2-port communication
enhancing short sea shipping performance: The case study of Cyprus and the Eastern
Mediterranean. Sustainability, 11(7), 1912.
Min, F., & Xu, J. (2016). Semi-greedy heuristics for feature selection with test cost constraints.
Granular Computing, 1(3), 199–211.
Narang, S. K., Kumar, S., & Verma, V. (2017). Knowledge discovery from massive data streams. In
Web semantics for textual and visual information retrieval (pp. 109–143). Hershey: IGI Global.
Nguyen-Dinh, L.-V., Aref, W. G., & Mokbel, M. (2010). Spatio-temporal access methods (Part 2).
IEEE Data Engineering Bulletin, 33(2), 46–55.
Perobelli, N. (2016, June). MarineTraffic - A day in numbers. Last accessed: March 22, 2019.
Presto SQL. (2019). Last accessed: November 22, 2019.
Qin, J., Ma, L., & Niu, J. (2018). Massive AIS data management based on HBase and Spark.
In Proceedings of the 3rd Asia-Pacific Conference on Intelligent Robot Systems (ACIRS) (pp.
Rødseth, Ø. J., Perera, L. P., & Mo, B. (2016). Big data in shipping - Challenges and opportunities.
In Proceedings of the 15th International Conference on Computer and IT Applications in the
Maritime Industries (COMPIT).
Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010). The Hadoop distributed file system. In
Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
(Vol. 10, pp. 1–10).
Simoes, R. E., de Queiroz, G. R., Ferreira, K. R., Vinhas, L., & Camara, G. (2016). PostGIS-T:
Towards a spatiotemporal PostgreSQL database extension. In Proceedings of the XVII Brazilian
Symposium on Geoinformatics (GeoInfo) (pp. 252–262).
Sumbaly, R., Kreps, J., Gao, L., Feinberg, A., Soman, C., & Shah, S. (2012). Serving large-scale
batch computed data with project Voldemort. In Proceedings of the 10th USENIX Conference
on File and Storage Technologies (pp. 18–30).
Svanberg, M., Santén, V., Hörteborn, A., Holm, H., & Finnsgård, C. (2019). AIS in maritime
research. Marine Policy, 106, 103520.
Thanh, T. D., Mohan, S., Choi, E., Kim, S., & Kim, P. (2008). A taxonomy and survey on
distributed file systems. In Proceedings of the Fourth International Conference on Networked
Computing and Advanced Information Management (Vol. 1, pp. 144–149).
Toader, C., & Toader, D. C. (2017). Modelling a reliable distributed system based on the
management of replication processes. North Economic Review, 1(1), 312–320.
Wang, H., Zhuge, X., Strazdins, G., Wei, Z., Li, G., & Zhang, H. (2016). Data integration and
visualisation for demanding marine operations. In Proceedings of the MTS/IEEE OCEANS
2016 Conference (pp. 1–7).
Xin, R. S., Rosen, J., Zaharia, M., Franklin, M. J., Shenker, S., & Stoica, I. (2013). Shark: SQL and
rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on
Management of Data (pp. 13–24).
Xiong, X., Mokbel, M. F., & Aref, W. G. (2017). Spatiotemporal database. In Encyclopedia of GIS
(pp. 2150–2151). Berlin: Springer.
Yablonsky, S. (2018). Innovation platforms: Data and analytics platforms. In Multi-Sided Platforms
(MSPs) and sharing strategies in the digital economy: Emerging research and opportunities
(pp. 72–95). Hershey: IGI Global.
Yang, Y., Zhong, M., Yao, H., Yu, F., Fu, X., & Postolache, O. (2018). Internet of things for smart
ports: Technologies and challenges. IEEE Instrumentation & Measurement Magazine, 21(1),
Yeoh, C.-M., Chai, B.-L., Lim, H., Kwon, T.-H., Yi, K.-O., Kim, T.-H., et al. (2011). Ubiquitous
containerized cargo monitoring system development based on wireless sensor network technol-
ogy. International Journal of Computers Communications & Control, 6(4), 779–793.
Yu, J., & Sarwat, M. (2019). Geospatial data management in apache spark: A tutorial. In
Proceedings of the IEEE 35th International Conference on Data Engineering (ICDE) (pp.
Yu, J., Zhang, Z., & Sarwat, M. (2019). Spatial data management in apache spark: The GeoSpark
perspective and beyond. Geoinformatica, 23(1), 37–78.
Zhao, Y.-x., Li, W., Feng, S., Ochieng, W. Y., & Schuster, W. (2014). An improved differential
evolution algorithm for maritime collision avoidance route planning. Abstract and Applied
Analysis, 2014, 10 pp.
Zhou, L., Chen, N., Yuan, S., & Chen, Z. (2016). An efficient method of sharing mass spatio-
temporal trajectory data based on Cloudera Impala. Sensors, 16(11), 1813.