2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2019.2907999, IEEE Access
VOLUME XX, 2017 1
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.Doi Number
A Survey of Big Data Analytics for Smart
Forestry
WEITAO ZOU1, WEIPENG JING1, (Member, IEEE), GUANGSHENG CHEN1, YANG LU1,
AND HOUBING SONG1,2, (Senior Member, IEEE)
1College of Information and Computer Engineering, Northeast Forestry University, Harbin, HLJ China (e-mail: zouweitao1996@nefu.edu.cn;
jwp@nefu.edu.cn; icec@nefu.edu.cn; lowyard@163.com)
2Department of Electrical, Computer, Software, and Systems Engineering, Embry-Riddle Aeronautical University, Daytona Beach, FL 32114 USA
(e-mail: SONGH4@erau.edu)
Corresponding author: Houbing Song (e-mail: SONGH4@erau.edu).
This work was supported in part by the National Natural Science Foundation of China under Grant 31770768, in part by
the Natural Science Foundation of Heilongjiang Province of China under Grant F2017001, in part by Heilongjiang Province
Applied Technology Research and Development Program Major Project under Grant GA18B301, and in part by China State
Forestry Administration Forestry Industry Public Welfare Project under Grant 201504307.
ABSTRACT Accurate and reliable forestry data can be obtained by means of continuous monitoring of
forests using advanced technologies, which provides a major opportunity for the development of smart
forestry. However, with the improvement of the precision and acquisition speed of data, the traditional data
analysis and storage technology cannot meet the performance requirements of current applications. Forestry
big data, which refers to the application of big data technology to forestry data processing, offers a new
solution to the difficulties encountered in the course of forestry development. In this paper, we
summarize recent research and work on big data in smart forestry. First, we review the
history of the emergence and development of forestry big data, and then briefly summarize the
opportunities that big data technology brings to forestry. One of the most important tasks of forestry big
data is to organize massive data reasonably and effectively and to compute on it quickly. Therefore, we propose
a five-layer architecture model of forestry big data, and summarize the related work on data storage, query,
analysis and application. Finally, we analyze the challenges of forestry big data and outline future
development trends from three aspects.
INDEX TERMS Big data, Forestry, Forestry big data, Smart forestry, Survey
I. INTRODUCTION
In the last few decades, the scale of global data has grown
at an unprecedented rate. The overall volume of data created
and copied in the world increased nearly ninefold
within five years, and the figure at least doubles every
two years [1]. As the amount of data exploded,
the concept of "big data" came into being. Although
there is currently no authoritative, unified
definition of big data in the scientific, academic, and
government sectors [2], big data has become a leading term
in all fields [3].
Big data in a broad sense includes not only the structured,
semi-structured and unstructured data that are growing, but
also the processing techniques and processing tools for
these data. With the development of technology and the
increasing demand for computer response speed, big data
technology has been widely used and developed in various
data processing fields. However, the data format and
processing flow of big data are quite different from the
traditional data processing methods. Traditional computing
and statistical methods are not effective enough to handle
big data. The main characteristics of big data are high
volume, variety and velocity (referred to as the “3Vs”) [4].
Because applications and services place increasing demands
on the quality and security of data, “veracity” is now also
regarded as a characteristic of big data, so the “3Vs” have
gradually become the “4Vs” [5]. In addition to the
problems caused by the data itself, the development of big
data technology raises security and privacy issues.
Nevertheless, the advancement of big data brings
unprecedented opportunities to various industries [6].
Currently, big data is widely used in the fields of urban
infrastructure construction [7, 115, 116], smart grid [8],
intelligent transportation [9], emergency communication
networks [10], social networks [117], and agriculture and
forestry monitoring [11]. The application of big data
technology to forestry is still in its infancy, so the
relevant technology of forestry big data needs
further study.
Forest management is a long-term process, which requires
dynamic information on the distribution, composition,
structure, and disturbance of forests over time. Therefore,
monitoring forests can protect forest ecosystems effectively
and promote the sustainable development of forest
resources. Forestry monitoring relies primarily on remote
sensing and positioning technology to capture a variety of
data. The platforms involved include ground-based
systems, aircraft-based systems (manned aircraft or UAVs)
and satellite-based systems [12], as well as GPS positioning
systems. Ground systems include sensors that monitor
a fixed location on the ground and the surrounding
environment, as well as movable monitoring instruments.
A ground system mainly provides real-time
monitoring data on specific attributes within a small,
specified range, whereas satellite and aerial remote
sensing provide large-scale, global observation data.
Forestry information systems need to provide different
services for different needs. Therefore, in contrast to
data in other fields, forestry data is
updated quickly and has no uniform format. These data
also have multidimensional spatial features, which make
forestry data processing more difficult. With the
development of technologies such as sensors, satellites and
drones, the resolution of remote sensing observation data
has improved significantly, the number of bands
has gradually increased, and the data
acquisition period has shortened. As a result, the volume of
remote sensing data has exploded. Currently, the scale of
data in forestry applications far exceeds that of earlier
data, and the demands on service response time are higher.
With traditional data storage and computation methods,
it is difficult to guarantee the desired results in
an acceptable time. Especially for real-time applications
such as disaster monitoring, monitoring the forest is of
little value if massive data cannot be processed
in time. For forestry applications
that need to process massive amounts of data, big data
technologies can be introduced to remove the current
bottlenecks [13].
To fully understand the development of forestry big data
(the application of big data in forestry), in this paper we
review its historical background, architecture, key
technologies and applications, and future
development trends. The rest of this
article is organized as follows. In Section II, we review
the origin of forestry big data and the main technologies,
and summarize the types of forestry data. In Section III,
we outline forestry big data, mainly to illustrate the
opportunities that forestry big data technology brings. In
Section IV, we summarize forestry big data applications and
technologies. In Section V, we briefly outline some of the
challenges of forestry big data technology based on open
source big data systems. In Section VI, we point out several
current problems in the development of forestry big
data and future development trends. Section VII
concludes the paper.
II. FORESTRY BIG DATA'S PAST LIFE AND PRESENT LIFE
A forest is a complex and widely distributed ecosystem, and
monitoring forest resources is costly. In forestry
monitoring, the forest resource survey is an important
process, and the data obtained through such
surveys are the basis for forestry management. In this
section, we first trace the key methods and techniques of
forestry surveys in various periods and explain the
development of forestry big data, and then abstract
the data types of forestry data.
A. THE ORIGIN AND DEVELOPMENT OF FORESTRY BIG DATA
Forest resource surveys began when
humans started forest management activities [14]. Early
forestry surveys were based entirely on visual observation.
In the 18th century, the development of mathematics promoted
forest mensuration, but it was not until 1927 that a forest
mensuration system was formed. After the 1920s,
mathematical statistics and the theory of aerial
photogrammetry and photo interpretation were introduced
into forest surveys, and forest survey technology
improved greatly and developed rapidly. However, at that
time, there was still no unified technical system among
countries for the investigation of forest resources [13].
The defects of these technologies mainly include the
following: (1) because aerial photos were applicable
in only a few applications, a large amount of information
was not mined and utilized; (2) because aerial photos are
obtained through central projection, the transfer workload
is huge and large errors occur when calculating area
[15].
The International Forest Survey (Monitoring) Guide,
published by the International Union of Forest Research
Organizations in the 1990s, indicates that the integration of
remote sensing (RS), geographic information systems (GIS)
and forest sampling technology (FS) is the future direction
of forest surveys, with the three technologies deeply and
organically integrated. The Global Positioning System
(GPS) is only used as an auxiliary technology for locating
forest sample plots. This technology system can perform
forestry tasks faster and more efficiently, thus reducing the
loss of manpower and material resources [16].
GPS is a highly accurate satellite-based radio navigation
system, which provides information of location, speed and
time. In forestry monitoring, GPS [17] can accurately
measure the location of forestry objects including points,
lines, and polygons. Based on monitoring data from GPS,
researchers can perform a range of forest management tasks
such as forest mapping, forest compartment boundary
surveys, forest road surveys, ground live events (remote
sensing), and resource inventory.
With remote sensing technology, continuous, dynamic,
large-scale and low-cost forestry data sources can be
obtained, which promotes the research and development of
forestry technologies. Remote sensing mainly includes
satellite remote sensing, UAV remote sensing and wireless
sensor networks [18]. Among them, the wireless sensor
network (WSN), also known as ground remote sensing
[19], is an extension of remote sensing, and the two
complement each other. Different
forestry applications should obtain their data from the
remote sensing platforms that suit them.
GIS [20-22] is an important tool for forestry data analysis
and forestry monitoring. It has become a basic tool for
forest workers. At present, the application of GIS
technology is diverse. It is used in many applications such
as landscape mapping and yield forecasting. To manage
natural resources on public lands, a GIS needs to store, in
addition to topographic data, all of the data about
the entire course of timber forests. Each process describes
the progress of the felling cycle, which allows long-term
prediction of timber yield. The growth rate over time can be
compared in terms of forest, soil and afforestation
treatments. GIS also provides a set of external variables to
predict the spread of disease and the behavior of forest
fires. In addition, it can handle issues related to spatial
relationships among different spatial objects (points, lines,
and regions), which provides valuable information
for spatial forest management planning. GIS can
also configure forest patches based on spatial properties,
including patch size (openness), distribution, shape,
adjacency, connectivity, proximity, and core regions.
Currently, forestry surveys are based on "3S" technology.
More advanced data acquisition and processing
technologies are being introduced, which make surveys more
precise, intelligent and refined. In forestry data
collection, methods are gradually diversifying and
accuracy is improving. For instance, we can use ground
laser scanners [23] to generate point cloud data to build 3D
data models for forest surveys, or use radio frequency
technology [24] to collect real-time data on trees and the
environment. Combining big data technology with
traditional forestry data analysis methods improves data
processing efficiency and meets the need for
real-time response and accuracy. Therefore, big data has
become an important technical support for forestry
information systems. Big data has changed the process and
structure of forestry data processing, and the data can be
effectively managed by abstracting a data model.
B. FORESTRY DATA TYPES
Forestry data is involved in a wide range of fields and
applications. Forestry management is often analyzed in
multi-dimensional scenarios, including time and space.
Therefore, the spatial attribute is an important attribute of
forestry data. According to the characteristics and sources
of the data, spatial data is mainly divided into five
categories: remote sensing data, surveying data,
location-based data, Internet of Things data and social
network data [25]. The data commonly used in forestry are
the first four types. The forestry data model is abstracted
into two categories based on the logical structure of the
data. One is the raster data model, which divides geographic
information into finite cells, each of which is assigned a
specific value according to its attributes [26]; the other
is the vector data model, represented by a sequence of
finite points and line segments. Both data models can be
used to represent any spatial conceptual model. However,
based on their characteristics, vector data models are
often used for object-based models, such as a building, a
river, or a type of land, while raster data models are
commonly used for models with "field" properties, such as
elevation models and temperature models. The remote sensing
image is a typical raster data model and one of the most
important types of forestry data.
FIGURE 1. Three-band raster data case
Raster data is composed of a grid of rows and columns,
i.e., a matrix. For a remote sensing image,
the grid cells are pixels, and each pixel
represents an attribute value. Raster data can be represented
by a matrix or a set of matrices, and different raster data
have different numbers of bands. As shown in Fig. 1, a 24-bit
RGB image can be represented as a dataset with three bands.
For hyperspectral remote sensing images, the number of bands
can reach several tens. In principle, the logical structures
are similar, but the data processing efficiencies differ
greatly.
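The band structure described above can be illustrated with a
minimal sketch. It assumes each pixel of a 24-bit RGB image is
packed as 0xRRGGBB; the function name split_rgb and the sample
image are hypothetical and are not taken from any cited system.

```python
from typing import List

Band = List[List[int]]  # one matrix of pixel values (rows x columns)


def split_rgb(packed: List[List[int]]) -> List[Band]:
    """Split a grid of packed 24-bit RGB pixels into three 8-bit bands."""
    red = [[(p >> 16) & 0xFF for p in row] for row in packed]
    green = [[(p >> 8) & 0xFF for p in row] for row in packed]
    blue = [[p & 0xFF for p in row] for row in packed]
    return [red, green, blue]


# Hypothetical 2 x 2 image; each pixel is packed as 0xRRGGBB.
image = [[0xFF0000, 0x00FF00],
         [0x0000FF, 0x102030]]
bands = split_rgb(image)  # a "set of matrices": one matrix per band
```

A hyperspectral image would follow the same layout with more
matrices in the list, which is why the logical structure is
similar while the processing cost differs.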
For remote sensing data sets, the image data files can be
divided into image pixel data (i.e. pixel data of remote
sensing image) and image metadata. Metadata is used to
describe the attributes of remote sensing data [27],
including image information, geographic information and
satellite sensor information. To process the data, the two
structures can be stored separately, or integrated into one
structure.
Vector data is a data structure that uses points, lines, and
polygons to describe things in space. It is an object-oriented
data model. A vector object describes a spatial object
through a combination of geographic information and
object properties. Moreover, vector data has unique
advantages in representing the positional relationship of
spatial objects. Vector data files can be converted to binary
or text formats for further processing.
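As a rough illustration of a vector object and its conversion
to a text format, the sketch below pairs a geometry with object
properties and serializes the geometry as Well-Known Text
(WKT). The class name VectorObject, its fields and the sample
coordinates are assumptions made for this example only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Coord = Tuple[float, float]  # (x, y), e.g. longitude and latitude


@dataclass
class VectorObject:
    """A spatial object: geometry plus descriptive properties."""
    kind: str                       # "POINT", "LINESTRING" or "POLYGON"
    coords: List[Coord]
    properties: Dict[str, str] = field(default_factory=dict)

    def to_wkt(self) -> str:
        """Serialize the geometry as Well-Known Text (WKT)."""
        pts = list(self.coords)
        if self.kind == "POLYGON" and pts[0] != pts[-1]:
            pts.append(pts[0])      # a WKT polygon ring must be closed
        body = ", ".join(f"{x} {y}" for x, y in pts)
        if self.kind == "POINT":
            return f"POINT ({body})"
        if self.kind == "LINESTRING":
            return f"LINESTRING ({body})"
        if self.kind == "POLYGON":
            return f"POLYGON (({body}))"
        raise ValueError(f"unsupported geometry kind: {self.kind}")


# Hypothetical forest compartment with made-up coordinates.
stand = VectorObject("POLYGON", [(0, 0), (4, 0), (4, 3)],
                     {"species": "larch"})
stand_wkt = stand.to_wkt()
```

Keeping the properties next to the geometry mirrors the
object-oriented view of vector data, while the WKT string is
one possible text format for further processing.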
By representing data as vector data and raster data, it is
possible to reduce the complexity of processing
large amounts of data. When processing forestry
data, it also becomes easier to keep the data consistent
across its various attributes, which improves the efficiency
of data processing and the accuracy of the results.
III. OPPORTUNITIES BROUGHT BY FORESTRY BIG DATA
In traditional forestry monitoring, data is stored and
processed on a single machine. When remote sensing
technology was in its infancy, the amount of data was small
and the quality of data was poor, so data obtained by
techniques such as remote sensing could not be effectively
applied to forestry decision-making. With the development
of remote sensing technology, the resolution of remote
sensing data has improved significantly and
positioning accuracy is higher, but the scale of remote
sensing data has greatly increased. Traditional computing
and storage methods are no longer suitable for the current
large-scale forestry data and the performance requirements
of forestry services.
With the increasing popularity of the concept of precision
forestry [28], the integration of technologies such as RS,
GPS and GIS has enabled remote sensing technology to be
comprehensively upgraded and applied in practice. The
speed of remote sensing data acquisition is constantly
increasing. Current remote sensing images are
high-resolution and hyperspectral, which
provides more information for forestry applications but also
increases the data processing burden. The remote sensing
data from various satellites consists of many large images,
which are very complex in structure, spectrum, and text
features [29], making data management difficult.
Forestry data analysis based on stand-alone GIS tools
cannot meet the requirements of speed and accuracy in
massive heterogeneous forestry data analysis. Applying big
data technology to forestry can reduce the computational
complexity, which not only facilitates the mining and
utilization of data, but also improves the processing
efficiency of forestry data. Therefore, the development of
forestry big data brings new opportunities for forestry
management.
Compared with general big data, forestry big data is not
just a huge amount of data. The greater challenge is its
complex data organization and its diverse data structures.
In forestry information systems, many services use more
than one type of data for calculation and analysis. A system
that provides common services needs to deal with vast
amounts of heterogeneous, irregular forestry data.
Moreover, forestry services tend to have strict time
requirements. Responding in real time to requests
submitted by different forestry applications is
an important task in the field of forestry big data.
The main task of forestry big data is to organize
and store forestry data reasonably according to its
characteristics, with the help of existing big data
software and high-performance hardware, so as to achieve
efficient query and analysis. In
recent years, big data technology relevant to forestry
data, including research on remote sensing big data and
spatial big data, has developed rapidly. General-purpose
platforms include Hadoop, Spark, Storm [30], and Flink [31],
and typical spatial computing frameworks include
SpatialHadoop [32], based on the Hadoop platform, and
GeoSpark [33], based on Apache Spark. These technologies
can effectively manage massive amounts of data, and applying
them to forestry greatly promotes its development.
FIGURE 2. Forestry big data Management platform
IV. RESEARCH AND APPLICATION OF FORESTRY BIG DATA
According to the logical processing flow of data, a forestry
big data management platform can be abstracted into five
layers: data acquisition, storage, query, analysis
and application. The data acquisition layer
refers to the process in which various data are
acquired through different platforms and technologies and
initially processed at the data collection end. At the
storage layer, the main task is to design and improve the
way data is stored and indexed. The query layer mainly
handles requesting and filtering data. The main task of the
analysis layer is to analyze and mine forestry data to assist
forestry decision-making. The application layer
transfers traditional forestry information technology to the
big data platform for high-performance computing based on
the analysis of forestry big data. Fig. 2 shows a
schematic diagram of a five-layer platform for forestry big
data management. The forestry big data management
framework can be deployed directly on a single computer
cluster, or constructed based on cloud computing and
related technologies.
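The five-layer flow described above can be sketched as a toy,
in-memory pipeline. Everything in this sketch is hypothetical
(the sensor readings, the plot-based grouping, the temperature
threshold and all function names); it only shows how the
layers hand data to one another, not how a real platform is
built.

```python
from statistics import mean
from typing import Dict, List

Reading = Dict[str, float]


def acquire() -> List[Reading]:
    """Acquisition layer: hypothetical, hard-coded sensor readings."""
    return [
        {"plot": 1, "temp_c": 21.5},
        {"plot": 2, "temp_c": 35.0},
        {"plot": 1, "temp_c": 22.5},
    ]


class Storage:
    """Storage layer: keeps readings grouped (indexed) by plot."""

    def __init__(self) -> None:
        self._by_plot: Dict[float, List[Reading]] = {}

    def put(self, reading: Reading) -> None:
        self._by_plot.setdefault(reading["plot"], []).append(reading)

    def get(self, plot: float) -> List[Reading]:
        return list(self._by_plot.get(plot, []))


def query(store: Storage, plot: float) -> List[Reading]:
    """Query layer: request only the readings of one plot."""
    return store.get(plot)


def analyze(readings: List[Reading]) -> float:
    """Analysis layer: mean temperature of the selected readings."""
    return mean(r["temp_c"] for r in readings)


def fire_risk_report(plot: float, mean_temp: float,
                     limit: float = 30.0) -> str:
    """Application layer: turn the analysis result into a decision aid."""
    level = "HIGH" if mean_temp > limit else "normal"
    return f"plot {plot}: mean {mean_temp:.1f} C, risk {level}"


store = Storage()
for reading in acquire():
    store.put(reading)
report = fire_risk_report(1, analyze(query(store, 1)))
```

Because each layer only depends on the interface of the layer
below it, a component such as the storage class could in
principle be swapped for a cluster-backed implementation
without changing the analysis or application code.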
Because the type and structure of forestry data are complex,
storing the data in a database or a local file system alone
cannot meet the performance requirements. In practical
applications, it is necessary to apply big data technology
to organize and store data reasonably, so that when data is
requested, it is read into memory at minimum cost according
to the storage structure. In the rest of this section, we
review the technology of each layer.
A. DATA STORAGE AND QUERY
At present, data is mainly stored in
relational databases, non-relational databases and
distributed file systems. Taking full advantage of the
different storage methods can reduce the
pressure on the storage system and improve the efficiency
of data requests. After the user makes a request to the
system, the query layer requests data
from the storage layer according to the storage structure of
the data and the organization of the storage system, and
keeps the required data in memory, which reduces
unnecessary disk I/O. The query layer is the bridge through
which the system loads data from disk into memory, and it is
a configurable, pluggable component in the big data model.
1) STORAGE BASED ON DISTRIBUTED FILE SYSTEM
Common distributed file systems include HDFS [34], Ceph
[35], GridFS [36], and GlusterFS. Fig. 3 shows the
general pattern of storing forestry data. In a forestry
data storage model based on a distributed file system, the
data needs to be formatted. In
the case of raster data, the pixel data is split into rows,
columns, or blocks and then stored in the file system.
Metadata can be stored with the pixel data in the file
system or stored separately in a relational database.
Forestry data can also be converted into objects with
spatial attributes, following an object-oriented approach.
After the data is divided, data locality needs to be
considered when storing it:
if data objects are adjacent in space, they should be
stored on the same node or rack to reduce the cost of
querying across nodes. We can assign a unique ID to each
data slice using a space-filling curve (such as the Hilbert
curve [37]) and organize and store the data according to
the IDs.
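The ID assignment described above can be sketched as follows.
The function hilbert_index uses the standard iterative
conversion from grid coordinates to a position along the
Hilbert curve; the helper block_ids, the block size and the
raster dimensions are hypothetical choices for illustration
and do not reproduce any cited system.

```python
from math import ceil
from typing import Dict, Tuple


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                 # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def block_ids(rows: int, cols: int, block: int) -> Dict[Tuple[int, int], int]:
    """Assign a Hilbert ID to every block of a rows x cols raster."""
    n_rows, n_cols = ceil(rows / block), ceil(cols / block)
    n = 1
    while n < max(n_rows, n_cols):  # smallest power of two covering the grid
        n *= 2
    return {(br, bc): hilbert_index(n, bc, br)
            for br in range(n_rows) for bc in range(n_cols)}


ids = block_ids(rows=4, cols=4, block=2)    # a 2 x 2 grid of blocks
storage_order = sorted(ids, key=ids.get)    # blocks ordered by Hilbert ID
```

Blocks with consecutive IDs are neighbors in the grid, so
writing blocks in ID order tends to keep spatially adjacent
data close together in storage.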
HPGFS [27] is an object-oriented distributed file system
implemented on OrangeFS. The system provides an
application-aware data layout strategy and an I/O interface
based on remote sensing data objects. Therefore, it can
effectively support various data access patterns on the
server side.
In HDFS, data is managed by the NameNode and DataNodes.
The DataNodes are responsible for storing data, and the
NameNode mainly stores metadata related to data storage,
such as replica information. Because HDFS is not suitable
for storing small files [38], Weili Kou et al. [39] propose
a dual storage mechanism. By separately storing and managing
remote sensing images and attribute data (such as metadata)
in HDFS and relational databases, the load on the
NameNode can be reduced, and the efficiency of
storage and query can be improved. To cope with the
increasing data volume in sensor networks, Huixiang
Zhou et al. [40] propose a cloud-based storage scheme built
on HDFS for sensor network data. It not only effectively
alleviates the pressure of massive data storage in the
sensor network, but also improves the scalability of the
storage system and the data storage security of the sensor
network. Guangsheng Chen et al. [41] store remote sensing
data based on MapFile, and improve the cohesion of the
system by storing remote sensing images and remote sensing
metadata in a single structure in HDFS. Fei Xiao [42]
proposes a parallel computing framework for the National
Geographic Conditions Monitoring project of China based on
Spark and HDFS. The entire framework is divided into four
layers: the data storage layer, RDDs layer, space operation
layer and query language layer. In the data storage layer
of the framework, HDFS is used to store vector data and
raster data.
FIGURE 3. General mode of forestry storage
When data is stored based on a distributed file system, the
data structure can be flexibly designed according to
requirements. The distributed file system can be used to
improve the throughput and reliability of the system, so that
the data can be stored stably and efficiently.
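The idea of a dual storage mechanism, with pixel data in a
file system and attribute data in a relational database, can
be sketched with standard-library stand-ins. In this sketch a
temporary local directory plays the role of the distributed
file system and an in-memory SQLite database plays the role
of the relational database; the class name DualStore, the
table layout and the sample scenes are all assumptions, not
the design of any cited work.

```python
import os
import sqlite3
import tempfile
from typing import List


class DualStore:
    """Pixel data goes to files; metadata goes to a relational table."""

    def __init__(self, root: str) -> None:
        self.root = root  # stand-in for a distributed file system path
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE image_meta ("
            "image_id TEXT PRIMARY KEY, sensor TEXT, acquired TEXT, path TEXT)"
        )

    def put(self, image_id: str, pixels: bytes,
            sensor: str, acquired: str) -> None:
        path = os.path.join(self.root, image_id + ".bin")
        with open(path, "wb") as f:
            f.write(pixels)
        self.db.execute(
            "INSERT INTO image_meta VALUES (?, ?, ?, ?)",
            (image_id, sensor, acquired, path),
        )

    def find(self, sensor: str) -> List[str]:
        """Metadata-only query: no pixel file is opened."""
        rows = self.db.execute(
            "SELECT image_id FROM image_meta "
            "WHERE sensor = ? ORDER BY image_id",
            (sensor,),
        ).fetchall()
        return [row[0] for row in rows]

    def read(self, image_id: str) -> bytes:
        """Look up the file path in the metadata, then read the pixels."""
        row = self.db.execute(
            "SELECT path FROM image_meta WHERE image_id = ?", (image_id,)
        ).fetchone()
        if row is None:
            raise KeyError(image_id)
        with open(row[0], "rb") as f:
            return f.read()


with tempfile.TemporaryDirectory() as root:
    store = DualStore(root)
    store.put("scene_001", b"\x01\x02\x03", "MODIS", "2018-06-01")
    store.put("scene_002", b"\x04\x05", "Landsat", "2018-06-02")
    modis_ids = store.find("MODIS")
    first_scene = store.read(modis_ids[0])
```

Attribute queries such as find touch only the small metadata
table, so the large pixel files are opened only once a
specific image has been selected.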
2) DATABASE-BASED STORAGE
Relational databases can only store data with a fixed
structure, but forestry big data is irregular, and much of
it cannot be stored according to a fixed schema. As the
amount of data increases, the load on the database also
increases, which places high demands on the system's
response speed and load tolerance. Therefore,
traditional relational databases cannot guarantee the
stability and availability of the system, and are not
suitable for storing and querying massive forestry data.
Commonly, relational databases are used to assist
non-relational databases or distributed file systems in data
storage. For data with a small size and a regular pattern,
one can choose a relational database for storage. This keeps
the structure of the storage system simple and clear and
reduces its load.
Non-relational databases mainly include columnar
databases, key/value pairs-based databases, and array
databases. Commonly used columnar databases are HBase,
MongoDB, etc. Shuo Xiao et al. [43] store the monitoring
data of wireless sensor network based on the MongoDB on
the cloud platform. HBGIS [44] is a geographic information
system based on HBase. It builds a pyramid storage model
for image data based on sensor type, data acquisition order,
spatial extent and other dimensions, and indexes the data.
This system also ensures that the browser can quickly
render the map, and can quickly query the data and display
the results on the map. Weipeng Jing et al. [45] store
remote sensing images in HBase. They combine the
Hilbert curve with a grid index and use MapReduce for
parallel access to the data. MD-HBase [46] is a data store
for location-based services built on a key-value store. The
underlying key-value storage lets the system sustain a high
insert throughput while providing fault tolerance and high
availability. HBaseSpatial [47] is an HBase-based scalable
spatial data storage system. The system first converts
shapefiles into WKT format and indexes the data, and then
stores the data in HBase. It is effective for range searches
over vector data.
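To illustrate how a Hilbert curve can be combined with a grid index to produce row keys, the following is a minimal sketch. It is not the scheme of [45]: the function names, the bounding box parameter and the grid order are assumptions made for illustration, and the curve mapping is the standard iterative Hilbert algorithm.

```python
def hilbert_index(order, x, y):
    """Position of grid cell (x, y) along a Hilbert curve
    covering a 2**order by 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def grid_cell(lon, lat, bbox, order):
    """Grid cell (col, row) of a coordinate inside bbox
    (min_lon, min_lat, max_lon, max_lat); hypothetical helper."""
    min_lon, min_lat, max_lon, max_lat = bbox
    n = 1 << order
    col = min(n - 1, int((lon - min_lon) / (max_lon - min_lon) * n))
    row = min(n - 1, int((lat - min_lat) / (max_lat - min_lat) * n))
    return col, row


def spatial_key(lon, lat, bbox, order):
    """One-dimensional key usable as a row-key prefix (an assumption)."""
    col, row = grid_cell(lon, lat, bbox, order)
    return hilbert_index(order, col, row)
```

Cells that are consecutive along the curve are always neighbours in the grid, so keys that sort close together tend to describe nearby areas, which is the property such a storage layout relies on.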
SciDB [48] is a new open source database that can be
applied to large-scale array data. SciDB has an inherent
advantage in storing and computing big multidimensional
array-based data, such as raster data. SciDB is a system
with parallel storage structure which can be used for
massively parallel data processing [49]. EarthDB [50] is a
scalable system for data storage and analysis for MODIS
remote sensing image data. It can import MODIS data into
SciDB seamlessly and filter out redundant data. To facilitate
calculation and analysis, Marius Appel et al. [51] propose a
method for automatically converting earth observation data
into a multi-dimensional array format by combining and
expanding GDAL and SciDB. Therefore, data-intensive
calculations can be performed based on the GDAL code
base. Zhenyu Tan et al. [52] use the mixed storage mode of
DBMS and SciDB to store earth observation data. When
requesting data, the query layer first determines the
structure of the data by accessing the metadata in the
relational database, and then accesses the SciDB to request
the data.
3) DATA INDEX
If the data are stored in a distributed file system or a
non-relational database without an index, the system still
needs to traverse all the data when searching. For forestry big data systems
that process massive amounts of data, although each node
reads data in parallel, it still generates a lot of I/O and
memory consumption. Moreover, the process requires a
large amount of CPU and wastes hardware resources.
Therefore, to efficiently and accurately filter unnecessary
data in the data query process, it is necessary to index the
data in the storage system.
Forestry services are often based on spatial attributes, such
as resource inventory for a forest farm or a specific area of
the forest farm, so the index in forestry data is mainly based
on spatial attributes. The two most important concepts in
spatial indexing are data partition and index structure. In a
distributed environment, the partition here refers not only to
the data partition in the storage system, but also to the
partition of the compute node after the data is loaded into
the compute node’s memory (such as the logical partition of
RDD or the split partition of Hadoop). The index structure
includes R-tree, KD-tree, quadtree, grid index, and so on.
FIGURE 4. secondary index model
Currently, indexes in distributed systems based on forestry
big data are mainly built on Hadoop or Spark. Because of
memory limitations, most big data systems, such as
SpatialHadoop [32, 53, 54], use a secondary (two-level)
index model, in which data is indexed by a global index
and local indexes. Fig. 4 shows this secondary index
structure. Typically, in the Hadoop architecture, a local
index covers the spatial data inside one partition and is
stored on a DataNode. The global index covers the
partitions themselves and is stored on the NameNode.
GeoSpark [33, 55-57] is a Spark-based spatial data parallel
computing framework that supports multiple types of
spatial indexes by extending the Spark kernel. With the
update of the version, the performance of the system has
been greatly improved. Currently the framework provides a
SQL-based access interface and integrates Hive into the
framework. The GeoSpark framework provides three
spatial RDDs and a series of spatial operations for
accessing this spatial data. When creating RDDs, the data is
first divided into global grids according to spatial locations.
Then, based on the memory in the cluster and the CPU
situation, the system adaptively decides whether to create a
local index in each partition. Similarly, SparkGIS [58] is a
memory-based spatial data query system that can partition
data dynamically and generate multi-level spatial indexes in
advance or on demand.
R. Phani Bhushan et al. [59] partition raster data into
blocks, thereby maximizing the locality of the data. Fei Hu
et al. [60] propose a three-layer hierarchical indexing
strategy based on HDFS and Spark, consisting of a global
index, a local index and an RDD index. The system can
dynamically merge and split partitions according to their
size, and thus adjust the degree of parallelism of the
calculation. It is worth noting that, when constructing a
hierarchical index on data, each layer can adopt a different
index structure as needed.
To create an index, we first need to design an index format
based on the storage system. For instance, to build an index
in HDFS, it is necessary to determine the fields stored in the
index node, the structure of the index, and so on. To build
an index in Spark, the first step is to define a custom RDD
for the index. Then, we build the index based on the index
data structure. Figure 5 is a schematic diagram of building
an index with Spark, showing the main transformation and
action operations involved. The process is as follows. Spark
first reads the data from disk into memory and then
partitions it according to the logical space of the data. The
K value in the figure is the logical space partition ID of the
data (such as the Hilbert partition ID), and the data is
repartitioned according to this ID. This preserves the
locality of the data and relieves the network I/O burden
caused by frequent communication among nodes.
Moreover, reasonable data partitioning helps balance the
load of the distributed cluster, thereby reducing the storage
and computing pressure on each node. After that, we build
a local index for each partition by calling the
"RDD.mapPartitionsWithIndex()" operator and passing a
custom index construction function as the parameter. Once
each partition has generated its index, the "RDD.collect()"
operator is used to send the local indexes to the master
node, which builds a global index from them. This method
is not limited to building an index in memory; it can also
persist the index to the storage system as an index file.
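The three steps above (repartition by key K, build a local index per partition, collect the local summaries into a global index) can be mimicked in plain Python without a cluster. The sketch below is only an emulation under simplifying assumptions: the partition key is a uniform grid cell rather than a Hilbert ID, the local index is a sorted list rather than an R-tree, and all function names are hypothetical.

```python
from collections import defaultdict


def partition_id(point, cell_size):
    """Hypothetical key K: the grid cell that contains the point."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def build_local_index(points):
    """Local index of one partition: its points sorted by (x, y)
    plus the minimum bounding rectangle (MBR) of the partition."""
    pts = sorted(points)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return {"mbr": (min(xs), min(ys), max(xs), max(ys)),
            "sorted_points": pts}


def build_two_level_index(points, cell_size):
    # Step 1: repartition the records by their logical partition key.
    partitions = defaultdict(list)
    for p in points:
        partitions[partition_id(p, cell_size)].append(p)
    # Step 2: build one local index per partition (the role played by
    # the function passed to mapPartitionsWithIndex in Spark).
    local = {k: build_local_index(v) for k, v in partitions.items()}
    # Step 3: gather only the partition MBRs (the role of collect)
    # and build the global index from them.
    global_index = sorted((idx["mbr"], k) for k, idx in local.items())
    return global_index, local
```

Only the small per-partition summaries travel to the node that builds the global index; the records themselves stay in their partitions.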
FIGURE 5. The process of building a two-layers index using Spark
4) DATA QUERY
Data query (or data request) is an operation closely related
to the storage strategy and storage structure of the data, the
main purpose of which is to load the required data into
memory quickly and accurately for calculation. An
effective data query strategy can avoid unnecessary
network I/O and disk I/O, as well as reduce the load
pressure of CPU and memory. In the forestry information
system, there are different query methods for different
services, such as range query, K-NN query, and join query.
In distributed computing frameworks based on Hadoop
or Spark, a data query mainly consists of two stages:
"before reading" and "after reading". The "before reading"
stage refers to the process in which the system reads data
from the storage layer and loads the partitions into the
memory of the compute nodes in parallel. In this process, the
system needs to logically determine which data must be
loaded into memory according to the spatiotemporal
attributes of the data. In the "after reading" stage, the system
further refines the in-memory data according to the query
conditions, retaining only the data that satisfies them.
When requesting data with an index, the system first
accesses and loads the index file and logically determines
the data partitions to be loaded into memory [61-64]. It then
reads the data into memory through custom "FileSplitter"
and "FileRecordReader" classes, which define the form in
which data is read and the data filtering strategy. This
avoids traversing the entire data set, which greatly reduces
data retrieval time.
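The two stages can be sketched as a small range query. This is an illustrative model rather than the implementation of any cited system: the global index is assumed to be a dictionary from partition ID to MBR, and `load_partition` is a hypothetical callback standing in for the storage read.

```python
def intersects(a, b):
    """True if two boxes (min_x, min_y, max_x, max_y) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(box, p):
    """True if point p = (x, y) lies inside the box."""
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def range_query(global_index, load_partition, query_box):
    # "Before reading": the global index decides which partitions
    # have to be loaded at all.
    selected = [pid for pid, mbr in global_index.items()
                if intersects(mbr, query_box)]
    # "After reading": loaded records are refined against the query.
    result = []
    for pid in selected:
        result.extend(p for p in load_partition(pid)
                      if contains(query_box, p))
    return selected, result
```

Partitions whose MBR does not overlap the query box are never read, which is where the saving in disk and network I/O comes from.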
The data loaded into memory is stored in the cluster in the
form of splits. When the data in the memory needs to be
further refined, the system can prune the splits using a
specific strategy according to the spatial relationship of the
query object, such as the thin-MBR and fat-MBR strategies
[65, 66]. For a range query, after the client submits the
request, the system accesses the local index of each
partition in parallel and quickly finds the data that meets
the requirements in each partition. The system then returns
the collection of data that satisfies the query criteria to the
client. For K-NN queries, the system usually performs the
query in two phases [32]. First, the K-NN query is executed
on the local index of the partition that contains the query
point, the distance from each result to the query point is
computed, and the maximum of these distances is recorded
as d. Then, a range query is executed over the circular area
centered at the query point with radius d. The K objects
closest to the query point in this result are returned to the
master node as the final result. Join queries are highly
complex and take a variety of forms, such as chain joins and
star joins. Queries can be optimized and made more
efficient with control copy strategies [67] and dynamic
planning strategies. The BSMWSJ algorithm [68] is a join
query algorithm that further improves on the control copy
strategy, reducing the amount of data to be copied and the
computational cost. The kNN-DP algorithm [69]
dynamically and judiciously partitions data to optimize
kNN-join performance by suppressing data skew on
Hadoop clusters.
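The two-phase K-NN query can be sketched as follows. The sketch assumes a uniform grid partitioning and scans each partition linearly instead of using a local index. The fallback for a home partition with fewer than K points (search everything) is our own assumption, not part of [32].

```python
import heapq
import math
from collections import defaultdict


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def cell_of(p, cell_size):
    return (math.floor(p[0] / cell_size), math.floor(p[1] / cell_size))


def partition_points(points, cell_size):
    parts = defaultdict(list)
    for p in points:
        parts[cell_of(p, cell_size)].append(p)
    return parts


def min_dist_to_cell(q, cell, cell_size):
    """Distance from q to the closest point of a square grid cell."""
    x0, y0 = cell[0] * cell_size, cell[1] * cell_size
    dx = max(x0 - q[0], 0.0, q[0] - (x0 + cell_size))
    dy = max(y0 - q[1], 0.0, q[1] - (y0 + cell_size))
    return math.hypot(dx, dy)


def knn(parts, q, k, cell_size):
    # Phase 1: K-NN inside the partition that contains the query point.
    home = parts.get(cell_of(q, cell_size), [])
    first = heapq.nsmallest(k, home, key=lambda p: dist(p, q))
    # d bounds the true answer only if phase 1 found k points.
    d = dist(first[-1], q) if len(first) == k else math.inf
    # Phase 2: range query over the circle (q, d), skipping partitions
    # that the circle cannot reach.
    candidates = []
    for cell, pts in parts.items():
        if min_dist_to_cell(q, cell, cell_size) <= d:
            candidates.extend(p for p in pts if dist(p, q) <= d)
    return heapq.nsmallest(k, candidates, key=lambda p: dist(p, q))
```

The second phase matters when the query point lies near a partition border, where a closer neighbour may sit in an adjacent partition.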
If the system needs to request the remote sensing image of
a specified area, it first reads into memory the image data
blocks that overlap the MBR (minimum bounding
rectangle) of the area, together with their metadata such as
spatial coordinate information. Then, the spatial vector data
of the query area and the image data blocks in memory are
mapped to the same scale, and the vector data is
transformed into binary raster data with the value 1 inside
the region and 0 outside it. Finally, the image data is masked
with this raster, and the result is the requested target data.
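The masking step can be sketched in a few lines. The sketch makes several assumptions that are not in the text: the polygon is simple (no holes), pixels are tested at their centers, the origin is the corner of pixel (row 0, column 0) with coordinates increasing along rows and columns, and 0 is used as the no-data value.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test for a simple polygon given as (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygon, origin, pixel_size, rows, cols):
    """Binary raster: 1 where the pixel center is inside the polygon."""
    ox, oy = origin
    return [[1 if point_in_polygon(ox + (c + 0.5) * pixel_size,
                                   oy + (r + 0.5) * pixel_size,
                                   polygon) else 0
             for c in range(cols)]
            for r in range(rows)]


def apply_mask(image, mask, nodata=0):
    """Keep image values where the mask is 1, nodata elsewhere."""
    return [[v if m else nodata for v, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]
```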
The query layer is the bridge between data and computing
in the forestry big data framework. Based on the query
layer, the system can reasonably allocate I/O and computing
resources. By scheduling and controlling query requests, it
also keeps the load of the cluster balanced, thereby
improving the performance of the entire framework.
B. DATA ANALYSIS AND VISUALIZATION
Massive forestry data includes both spatial and non-spatial
data [70]. By analyzing and processing these data and
extracting hidden information and knowledge from the
data, more valuable data can be provided for forestry
development and planning. Knowledge acquisition involves
determining the knowledge representation and reasoning
methods, establishing the knowledge base, writing inference
procedures, and debugging and modifying them [71]. At
present, data mining technology is considered as an
important tool for analyzing massive data and extracting
knowledge from data. The application of data mining and
visualization technology to forestry data analysis has
greatly promoted the development of forestry.
1) FORESTRY DATA MINING
Based on GIS, remote sensing and GPS, forestry data
mining can effectively analyze spatial data features by
finding implicit connections in data. This will deepen
people's understanding and mastery of spatial data features
and laws, and help researchers make correct predictions
about future trends.
Data mining mainly includes classification and clustering
techniques and is widely used in forestry, for example using
k-means to group soil types [70]. Monidipa Das et al.
[72] propose a spatiotemporal prediction method for
satellite remote sensing data based on deep learning, which
addresses the difficulties that massive remote sensing data
poses for smart image interpretation. Yushi Chen et al. [73]
propose a new feature fusion framework based on the
convolutional neural network, which can extract the
discriminant and invariant features of remote sensing data,
and obtain the final classification result based on logistic
regression.
In traditional forestry research, forestry data analysis is
based on GIS and spatial data mining techniques [74]. The
biggest problem with this strategy is that the complexity of
forestry data makes data mining and analysis more
difficult, and data mining algorithms often require many
iterative calculations. Therefore, forestry data mining is a
task that requires a lot of time and computing resources. In
the forestry big data system, the data mining algorithm can
be parallelized by the underlying distributed computing
framework (such as Spark or Hadoop), which can improve
the calculation efficiency and decrease the running time.
Developers can use the Mahout [75] Machine Learning
Toolkit or the Spark MLlib [76, 77] toolkit for forestry data
mining and analysis, or parallelize traditional machine
learning algorithms as needed. Roberto Giachetta et al. [78]
implement a large-scale geospatial and remote sensing data
processing framework based on Hadoop. The framework is
highly scalable and adaptable, allowing existing
algorithms and toolkits to be parallelized easily with few
changes. Hichem Omrani et al. [79] parallelize the K-means
algorithm on Spark and show that the scheme achieves
high performance and scalability.
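To show why K-means parallelizes naturally over partitions, the sketch below splits each iteration into a per-partition "map" step and an associative "reduce" step. It runs sequentially in plain Python and is not the implementation of [79]. The function names are hypothetical and the initial centroids are assumed to be given.

```python
from functools import reduce


def nearest(point, centroids):
    """Index of the centroid closest to the point (squared distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(point, centroids[i])))


def partial_stats(partition, centroids):
    """Map step: per-partition coordinate sums and counts per centroid."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in partition:
        i = nearest(p, centroids)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += p[d]
    return sums, counts


def merge(a, b):
    """Reduce step: combine the statistics of two partitions."""
    sums = [[x + y for x, y in zip(sa, sb)] for sa, sb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def kmeans(partitions, centroids, iterations=10):
    for _ in range(iterations):
        sums, counts = reduce(
            merge, (partial_stats(p, centroids) for p in partitions))
        # An empty cluster keeps its previous centroid.
        centroids = [tuple(s / c for s in sm) if c else old
                     for sm, c, old in zip(sums, counts, centroids)]
    return centroids
```

Each partition sends back only one sum vector and one count per centroid, so the data exchanged per iteration is independent of the number of records.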
For forestry data, before building a computational model,
we need to design a data structure that can be used for
distributed computing based on the data model of the
storage layer. This is a process of abstracting data [80].
Then, after the query layer loads the data into memory, the
data is converted (for example with the RDD.map() operator
in Spark) into the custom data object, and the data mining
algorithm is applied to each partition of that object in
memory. In a distributed environment, it is necessary to
consider the data communication cost among nodes and
the caching and serialization of data before parallelizing the
algorithm. By considering data locality, the system can
reduce the data communication cost between nodes, and it
can handle boundary problems well by loading data
redundantly [81]. When data is computed in memory, it
needs to be cached, compressed and serialized according to
the requirements. By properly optimizing the memory
processing mechanism in the calculation process, the load
pressure of the system memory decreases significantly,
thereby improving the performance of data analysis. Based
on the above analysis, in the forestry big data framework,
the system can encapsulate the algorithms and strategies of
forestry data analysis and provide interfaces that can be
called by other programs. We can also further optimize
forestry big data systems based on existing research. For
instance, remote sensing data can be abstracted into a large
matrix for analytical calculation, so that the performance of
the system can be optimized using existing research on
matrix computation [82].
In forestry big data, the main role of the analysis layer is to
provide tools for forestry data analysis, such as commonly
used data mining algorithms. Therefore, the system can
quickly analyze and calculate forestry data for different
forestry services.
2) FORESTRY DATA VISUALIZATION
Through visualization technology, the system can
intuitively present information such as relationships
between data objects and real-time state changes in multi-
dimensional space to the user. Moreover, users can also
analyze large amounts of complex data simply and
efficiently by interacting with the system in real time and
visualizing the results. Therefore, visualization technology
plays an important role in forest management.
At present, visualization technology for spatial data is
widely used, for example in Google Earth [83]. The ability to
visualize forestry data is an important function of the
forestry big data framework. The main purpose of
visualization is to present the results of big data system
calculations and analysis to users. By combining with the
forestry real-time monitoring system and the forestry
decision support system, it provides great convenience for
forestry data analysis and forestry management.
Because the extent of the area to be presented varies, the
data must adapt to different zoom levels when displayed.
Therefore, when visualizing raster data such as remote
sensing images, the system first divides the image into a set
of tiles, then resamples the tiles and constructs a tile
pyramid model. The model is then stored in memory or in
the file system. Fig. 6 is a schematic diagram of a remote
sensing image pyramid model. During level and tile
partitioning, the number of levels depends on the scale of
the map, and the number of tiles is determined by the size of
the image [84].
After the user submits a visualization request, the system
finds the pyramid level that matches the scope of the request
and the tiles of that level that fall within the requested
extent. It then performs the mosaic [85] operation on these
tiles and shows the result on the client side. Weipeng Jing
et al. [86] optimize a parallel mosaic algorithm for remote
sensing images based on Spark. The method effectively
reduces the frequency of I/O, improves mosaic efficiency,
and is suitable for mosaicking massive image data.
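The layout of a tile pyramid and the tile lookup for a request can be sketched as follows. The sketch assumes a common convention that is not stated in the text: each level halves the resolution of the previous one, tiles are 256 pixels square, and the pyramid stops when the image fits into a single tile. The function names are hypothetical.

```python
import math


def pyramid_levels(width, height, tile_size=256):
    """List of (level, width, height, tiles_x, tiles_y), from full
    resolution (level 0) down to the level that fits in one tile."""
    levels = []
    level, w, h = 0, width, height
    while True:
        tiles_x = math.ceil(w / tile_size)
        tiles_y = math.ceil(h / tile_size)
        levels.append((level, w, h, tiles_x, tiles_y))
        if tiles_x == 1 and tiles_y == 1:
            break
        # The next level is resampled to half the resolution.
        w, h = max(1, math.ceil(w / 2)), max(1, math.ceil(h / 2))
        level += 1
    return levels


def tiles_in_window(x0, y0, x1, y1, tile_size=256):
    """Tile (col, row) indices overlapping the pixel window
    [x0, x1) x [y0, y1) at one pyramid level."""
    return [(c, r)
            for r in range(y0 // tile_size, (y1 - 1) // tile_size + 1)
            for c in range(x0 // tile_size, (x1 - 1) // tile_size + 1)]
```

Under these assumptions a 1024 x 512 image yields three levels, with 4 x 2, 2 x 1 and 1 x 1 tiles; a request then only touches the tiles returned by `tiles_in_window` at the chosen level, and these are the tiles passed to the mosaic step.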
FIGURE 6. Image pyramid model
For vector data visualization, firstly, the spatial objects
need to be rasterized according to different scales, and then
the raster images generated by these objects need to be
aggregated, superimposed and rendered according to the
spatial position [87]. Vector data visualization is often used
to create thematic maps or to support location-based
services. At present, many web-based applications overlay
vector data on raster data, so that forestry scenes and
detailed forestry features can be shown together.
Combined with 3D technology, forestry data
can be visualized in three-dimensional space [88]. Then we
can understand the impact of spatial attributes such as
terrain on forestry through spatial analysis, and visualize
the results, which greatly facilitates forestry decision-
making.
C. APPLICATION OF FORESTRY BIG DATA
Optimizing the storage, query and analysis of forestry data
based on big data technology can improve the response
performance and robustness of current forestry applications.
Through real-time calculation and analysis of forestry data,
we can capture and predict changes in the state of the forest
and make quick decisions, which reduces losses caused by
delays [89]. In this section, we summarize the application
of big data in forestry from three aspects: forest vegetation
classification and change detection, forestry parameter
inversion, and forest growth and disaster monitoring.
1) CLASSIFICATION AND CHANGE DETECTION OF FOREST VEGETATION
A major basis for forest vegetation classification from
remote sensing images is the assumption of "same objects
with same spectrums". However, the problems of "same
object with different spectrums" and "different objects with
same spectrums", which arise as resolution increases, greatly
affect classification accuracy. To address this, spectral
information can be combined with features such as texture
and terrain to classify vegetation cover accurately. Forest
vegetation classification uses multi-temporal data collected
by multiple sensors, and the classification methods are
gradually diversifying. For instance, Landsat-8 images can
be combined with topographic map data, sample plot data,
and forest manager survey data to classify tree species
based on textures [90], and MODIS data and DEM data
can be used to classify phenological attributes of tree
species. It is also possible to extract forest vegetation types
from environmental satellite imagery (HJ-1 CCD) and
MODIS time series data [91], or to classify tree species
finely by fusing SAR or LiDAR data with hyperspectral
remote sensing data [92].
At present, the data is mainly processed with various big
data and machine learning technologies, including deep
learning and support vector machines. For example,
deep learning can be used to extract and classify tree
species from LiDAR data [93]. Facing the huge challenge
to computing performance posed by growing data
volumes, Bo Li et al. [94] parallelize the ISODATA
clustering algorithm based on MapReduce, which is easy
to use. I. Chebbi et al. [95] implement a satellite remote
sensing classification method based on Hadoop, which
includes two parts: data storage and processing. In the
storage part, the system stores heterogeneous data in
HDFS. In the processing part, it uses the MapReduce
programming paradigm to implement the K-means
algorithm. The system can improve the classification
accuracy and also make full use of the integrated
configuration resources.
Building on classification technology, change detection in
remote sensing images can provide timely and consistent
information on forest disturbance, so that forest resources
can be fully utilized and effectively protected. Based on
Hadoop, Dan Hammer et al. implement a high-confidence
forest disturbance warning system, which establishes
appropriate geolocation classification rules and strengthens
local forest conservation efforts.
2) FORESTRY PARAMETER INVERSION
The ecological parameters of forest vegetation include tree
height, DBH (diameter at breast height), and leaf area
index. Vegetation parameter inversion relies mainly on
various kinds of radar data [96]. When inverting
parameters such as tree height, LiDAR and polarimetric
interferometric synthetic aperture radar (PolInSAR) can be
used to obtain measurements of vertical structure [97]; for
example, satellite large-footprint LiDAR ICESat/GLAS
data can be used to obtain tree height information [98]. For
small-scale forests, UAV remote sensing data can be used
to obtain tree height information [99]. Parameters such as
canopy closure are generally calculated from multi-spectral
data sets, such as Landsat TM and MODIS images [100].
Forest parameter inversion based on hyperspectral data
overcomes the shortcoming that multispectral data has
fewer bands and lower spectral resolution, but it requires
additional data preprocessing, which occupies part of the
computing resources [101].
Forestry parameter inversion requires multiple iterative
calculations over forestry data. These calculations are
highly complex and time-consuming, and big data
technology can greatly improve their computational
efficiency. Jianghua Zhao et al. [102] implement a big data
framework for the storage and calculation of remote sensing
data. The framework can calculate the NDVI of a specified
spatiotemporal range. Wei Huang et al. [103] calculate the
NDVI with a Spark-based remote sensing parallel
computing framework on YARN [104], which significantly
improves the computational efficiency. LandQv2 [105] is a
MapReduce-based parallel computing framework that
calculates arable land quality from vector data and provides
visualization based on a pyramid model. Quan Zou et al.
[106] implement a global vegetation drought monitoring
model based on Hadoop with good linear throughput.
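NDVI is a per-pixel quantity, NDVI = (NIR - Red) / (NIR + Red), which is why it distributes well over image blocks. The sketch below is a sequential illustration only, not the framework of [102] or [103]. The block layout `(block_id, red_block, nir_block)` and the use of `None` for pixels where NIR + Red is zero are assumptions.

```python
def ndvi(red, nir):
    """NDVI of one pixel, or None when red + nir is zero."""
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


def ndvi_block(red_block, nir_block):
    """NDVI of one image block given as two equally sized 2-D lists."""
    return [[ndvi(r, n) for r, n in zip(red_row, nir_row)]
            for red_row, nir_row in zip(red_block, nir_block)]


def ndvi_partitions(blocks):
    """Apply ndvi_block to every (block_id, red_block, nir_block).
    Blocks are independent, so a cluster could process them in parallel."""
    return {bid: ndvi_block(red, nir) for bid, red, nir in blocks}
```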
3) FOREST DISASTER MONITORING
The main tasks of forest disaster monitoring include forest
fire prevention and pest and disease detection. Forest
disaster monitoring is highly time-critical, and staff are
sometimes required to analyze and respond to a situation
within a few seconds [107]. In applications such as
forest fire monitoring, the system usually uses remote
sensing data with high temporal resolution and low spatial
resolution, such as NOAA, MODIS, and Fengyun [108]. It
is also possible to combine satellite remote sensing with
ground-based sensor data for fire prediction and monitoring
[109]. By monitoring pests and diseases in forests,
operators can obtain information on changes in the spectral
characteristics of vegetation and take appropriate measures
in a timely manner. The data used for pest and disease
monitoring are mainly data with high spectral and high
spatial resolution, such as MODIS and SPOT [110].
In response to forest fires, T. Rajasekaran et al. [111]
establish a forest fire forecasting and early warning system
based on Hadoop and other open source big data
technologies. The system can store and analyze the data
collected by wireless sensors and forecast a forest fire
before it happens. Shijiao Zhu et al. [112] propose a
video-based fire alarm system for big data environments
built on Hadoop. The system can predict forest fires and
raise alarms in time. For seasonal fires, Saurabh Garg et al.
[43] propose a fire prediction service system based on
cloud computing, which takes advantage of the flexibility
of cloud infrastructure to provide dynamic services to users.
This improves resource utilization and reduces system
operating costs.
Timely detection of disease is a major challenge in the
forest management process. For pest and disease
monitoring, machine vision technologies have a wide range
of applications [113]. At present, to improve monitoring
efficiency, machine vision technology is mainly combined
with big data technology. For example, Shaocan Jiang et al.
[114] analyze and summarize forest pest and disease data in
recent decades and establish a cloud-based forest pest
monitoring platform. The platform stores data in HDFS and
HBase, and processes data based on MapReduce. It can
effectively improve the storage capacity, computing speed
and information sharing and transmission efficiency of the
existing data center.
V. CHALLENGES FACED BY FORESTRY BIG DATA
Currently, forestry big data technology based on open
source big data computing frameworks can effectively
manage and analyze data. However, because performance
depends on the underlying storage system and computing
engine, several problems remain to be solved in some
forestry big data systems.
For data storage, SciDB is well suited to storing remote
sensing data because it uses an array-oriented storage
model. However, since SciDB is still in the development
and improvement phase, currently it only provides a C
language interface. Therefore, it is difficult to call the SciDB
interface from distributed computing frameworks that are
not written in C, such as Hadoop, and system maintenance
costs increase. As a result, SciDB-based forestry data
storage solutions are not currently widely used. NoSQL
databases such as HBase have certain advantages for storing
forestry data in vector format, because they support the
structure of vector data well. However, these NoSQL
databases are not designed for image data, so the data
format must be reorganized during storage, for example by
converting the data into a byte stream. From the perspective
of query efficiency, there is no advantage in storing remote
sensing data in a NoSQL database such as HBase.
In a storage strategy based on a distributed file system, the
index is usually stored in the file system with the data files.
During a data query, the system needs to access the local
disk of the master node and load the global index. It then
accesses the local disks of the slave nodes and loads the
local indexes according to the global index. The disks are
therefore accessed twice before the data is loaded, which
significantly reduces the efficiency of the data query.
Because the index files are relatively small, the system can
keep the index resident in memory to improve query
efficiency, or use other strategies to reduce the number of
disk accesses.
For data calculation and analysis, big data systems based
on Hadoop or Spark can ensure data locality by
partitioning the data according to its spatial attributes. This
strategy can greatly reduce I/O between nodes during
data requests and calculation. However, the Hadoop and
Spark kernels provide no mechanism for controlling which
node a data partition is placed on. Therefore, there is no
guarantee that the partitions are evenly distributed among
the nodes in the cluster. During calculation, the system
can then neither fully utilize the computing resources of the
cluster nor guarantee load balance, which lowers the degree
of parallelism and degrades cluster performance. One remedy
is for the system to select the storage location of each piece
of data by considering the storage load of each node. It is
also possible to make computing resources more reliable by
providing an "available CPU" aware strategy, which addresses
both load balancing and unstable computing parallelism.
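As an illustration of load-aware placement, the sketch below assigns each partition to the node that currently carries the least load. It is a simple greedy heuristic chosen for illustration, not the strategy of any particular surveyed system and not part of the Hadoop or Spark APIs; the function name `place_partitions`, the partition sizes and the node names are hypothetical.

```python
import heapq


def place_partitions(partition_sizes, node_loads):
    """Assign each partition to the currently least-loaded node.

    partition_sizes: dict of partition id -> size (for example in MB)
    node_loads: dict of node id -> load already stored, in the same unit
    Returns (assignment, resulting_loads).
    """
    heap = [(load, node) for node, load in node_loads.items()]
    heapq.heapify(heap)
    assignment = {}
    # Placing the largest partitions first tends to give a tighter balance.
    ordered = sorted(partition_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for partition, size in ordered:
        load, node = heapq.heappop(heap)         # least-loaded node so far
        assignment[partition] = node
        heapq.heappush(heap, (load + size, node))
    return assignment, {node: load for load, node in heap}
```

The same structure could rank nodes by available CPU instead of stored volume, by replacing the load value in the heap with a measure of CPU availability. Note that this sketch balances load only; a practical strategy would also have to weigh it against spatial locality.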
VI. FUTURE TREND OF FORESTRY BIG DATA
Current forestry big data technologies and strategies can
effectively handle massive forestry data and meet the
requirements of real-time query and calculation for forestry
applications. With the development of 5G technology
[118] and the widespread application of Internet of Things
(IoT) technology [119, 120, 121], forestry big data technology
will be further developed and more widely used. In this
section, we discuss the future development trends
of forestry big data from three aspects: data acquisition,
data analysis and open data.
Data acquisition: Data is the basis and prerequisite for the
development of forestry big data. With the development of
big data technology, current forestry big data systems can
support real-time computation over massive data well, but in the
long run, the scale of data will keep growing while
computing and storage resources remain limited.
Therefore, existing computing resources may eventually be
unable to cope with the volume of data.
One way to solve this problem is to reduce the data size
during the data collection phase. With the development of
artificial intelligence (AI), related technologies have created
enormous value in various fields. Applying AI technology
to a remote sensing platform allows the data to be
preprocessed at the collection end. In recent years, AI-based
satellite remote sensing platforms have attracted the attention
of researchers and have been developed further. Running AI
technology on satellites increases their ability to
acquire and process information, which improves data
quality and reduces redundant or worthless data. On other
remote sensing platforms such as drones, AI technology can
also be applied to process information intelligently
while the data is being collected. Using AI
technology on the data collection side reduces the pressure
on data storage and processing, thereby improving
the performance of forestry big data systems.
Data analysis: For forestry data processing, the system is
required not only to respond to user requests
without delay, but also to process collected data (such
as the monitoring data gathered by ground sensors)
in real time. However, relatively little research has
addressed this need in current forestry information
systems and forestry data-based applications. Stream
computing technology such as Storm can therefore be applied
to the processing of forestry data, so that the data
is processed automatically and in real time. For data
visualization, the technology currently applied in forestry
information systems only displays results, which is
not conducive to interactive analysis and immersive
visualization. By combining virtual reality technology
with geographic information systems, VRGIS
(Virtual Reality Geographic Information System) plays an
important role in many frontier fields such as Smart
City [122]. VR technology can therefore be introduced into
forestry data visualization to improve the interaction
capability of the system and the user experience.
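The stream-processing idea in the data analysis trend above can be sketched in a few lines. The example below is a plain-Python illustration of record-at-a-time processing and does not use the Storm API; the function name `rolling_alerts`, the window length, the threshold and the interpretation of the values as sensor temperatures are all assumptions made for the example.

```python
from collections import deque


def rolling_alerts(readings, window=3, threshold=40.0):
    """Yield (timestamp, mean) whenever the rolling mean exceeds the threshold.

    readings: iterable of (timestamp, value) pairs, consumed one at a time,
    so the function also works on an unbounded stream.
    """
    recent = deque(maxlen=window)       # holds only the last `window` values
    for timestamp, value in readings:
        recent.append(value)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > threshold:
                yield timestamp, mean
```

Because the generator keeps only the most recent values in memory, each reading is handled as soon as it arrives and no batch has to be accumulated first. In a distributed stream engine, the same logic would run in parallel per sensor or per region.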
Open data: Forestry big data is still in its infancy, and
forestry data is characterized by dispersion, irregularity and
low quality. From the perspective of technological
development, making a breakthrough first requires
accurate and valuable data. Although this problem is
difficult to solve at present, the demand for new technologies
is constantly increasing as smart forestry develops.
Therefore, forestry data will gradually become open.
VII. CONCLUSION
With the development of technology, the speed and
accuracy of forestry data acquisition have been greatly
improved. To cope with the growth of massive data, we
need to improve our ability to calculate and analyze data.
Based on the current development of big data and forestry
technologies, in this paper we summarize the data types and
data formats of forestry big data. We also propose a
five-layer architecture model of forestry big data, comprising
data acquisition, storage, query, analysis and visualization,
and application. Based on the current state of development,
the key technologies in each layer are analyzed in detail.
Finally, we summarize the challenges faced by forestry big
data and discuss the problems in its development and its
future trends.
In the forestry data storage layer, the main management
objects are vector data and raster data, including
remote sensing images, ground monitoring data, and land
use data. These data carry temporal and spatial attributes,
and their volume is very large. Effective data management
can therefore alleviate the performance problems that
massive data causes in forestry information systems. To this
end, the data first needs to be partitioned according to its
spatial attributes, and data locality needs to be considered
when storing the data blocks on the data nodes. A
hierarchical index then needs to be built in a distributed file
system or a non-relational database to facilitate fast access
and retrieval. During a data query, following the storage
structure of the storage layer, the global index file is
accessed first. The corresponding data block is then
requested according to the global index file, and the local
index in the data block is accessed. Finally, the data
satisfying the query condition is loaded into memory
according to the local index. Big data technology is applied
in forestry mainly to query data quickly, especially in the
many applications for real-time monitoring of forests.
Forestry data visualization makes it easier to manage
forestry data and can assist forestry managers in making
decisions. In practical applications, because forest disasters
are often destructive and difficult to predict in advance,
forestry big data technology can reduce labor costs and
avoid unnecessary property losses.
REFERENCES
[1] M. Chen, S. Mao and Y. Liu, "Big Data: A Survey", Mobile
Networks and Applications, vol. 19, no. 2, pp. 171-209, Apr. 2014,
DOI: 10.1007/s11036-013-0489-0.
[2] J.S. Ward and A. Barker, "Undefined by Data: A Survey of Big
Data Definitions", Sep. 2013.
[3] X. Yao and G. Li, "Big spatial vector data management: a review",
Big Earth Data, vol. 2, no. 1, pp. 108-129, Feb. 2018, DOI:
10.1080/20964471.2018.1432115.
[4] R. Lu, Z. Hui, X. Liu, J.K. Liu, and J. Shao, "Toward Efficient and
Privacy-Preserving Computing in Big Data Era", IEEE Network,
vol. 28, no. 4, pp. 46-50, Jul. 2014,
DOI: 10.1109/MNET.2014.6863131.
[5] H. Guo, L. Wang, F. Chen, and D. Liang, "Scientific big data and
Digital Earth", Chinese Science Bulletin, vol. 59, no. 35, pp. 5066-
5073, Dec. 2014, DOI: 10.1007/s11434-014-0645-3.
[6] M. Chi, A. Plaza, J.A. Benediktsson, Z. Sun, J. Shen, and Y. Zhu,
"Big Data for Remote Sensing: Challenges and Opportunities", P.
IEEE, vol. 104, no. 11, pp. 2207-2219, Sep. 2016, DOI:
10.1109/JPROC.2016.2598228.
[7] E. Al Nuaimi, H. Al Neyadi, N. Mohamed, and J. Al-Jaroodi,
"Applications of big data to smart cities", Journal of Internet
Services and Applications, vol. 6, no. 1, Dec. 2015, DOI:
10.1186/s13174-015-0041-5.
[8] K. Wang, H. Li, Y. Feng, and G. Tian, "Big Data Analytics for
System Stability Evaluation Strategy in the Energy Internet", IEEE
T. Ind. Inform., vol. 13, no. 4, pp. 1969-1978, Apr. 2017, DOI:
10.1109/TII.2017.2692775.
[9] L. Zhu, F.R. Yu, Y. Wang, B. Ning, and T. Tang, "Big Data
Analytics in Intelligent Transportation Systems: A Survey", IEEE T.
Intell. Transp., pp. 1-16, Apr. 2018, DOI:
10.1109/TITS.2018.2815678.
[10] J. Wang, Y. Wu, N. Yen, S. Guo, and Z. Cheng, "Big Data
Analytics for Emergency Communication Networks: A Survey",
IEEE Communications Surveys & Tutorials, vol. 18, no. 3, pp.
1758-1778, Mar. 2016, DOI: 10.1109/COMST.2016.2540004.
[11] K.H. Coble, A.K. Mishra, S. Ferrell, and T. Griffin, "Big Data in
Agriculture: A Challenge for the Future", Appl. Econ. Perspect. P.,
vol. 40, no. 1, pp. 79-96, Feb. 2018, DOI: 10.1093/aepp/ppx056.
[12] C. Yuan, Y. Zhang and Z. Liu, "A survey on technologies for
automatic forest fire monitoring, detection, and fighting using
unmanned aerial vehicles and remote sensing techniques", Can. J.
Forest Res., vol. 45, no. 7, pp. 783-792, Mar. 2015, DOI:
10.1139/cjfr-2014-0347.
[13] L. Liuyu, "Review and Thinking of Forestry Investigation
Technology", Forest Resources Management, no. 5, pp. 50-58, Jul.
1995, DOI: 10.13466/j.cnki.lyzygl.1999.05.015. (in Chinese)
[14] C. Rongsheng, "A Survey of Research on Forest Resources Survey
Technology", Modern Agricultural Science and Technology, no. 7,
pp. 179-181, Mar. 2015. (in Chinese)
[15] Z. Xianwen, L. Chonggui, S. Lin, Y. Kaixian, and T. Yonglin,
"Building a new system of forest resources inventory by information
technology", Journal of Beijing Forestry University, vol. 24, no. 6,
pp. 147-155, 2002. (in Chinese)
[16] F. Zhongke, H. Xiaodong and L. Fang, "Forest Survey Equipment
and Development of Information Technology", Transactions of the
Chinese Society for Agricultural Machinery, vol. 46, no. 9, pp. 257-
265, Jun. 2015, DOI: 10.6041/j.issn.1000-1298.2015.09.038. (in
Chinese)
[17] Y.L. Jay Gao, "Applications of remote sensing, GIS and GPS in
glaciology: a review", Progress in Physical Geography: Earth and
Environment, vol. 25, no. 4, pp. 520-540, Dec. 2001, DOI:
10.1177/030913330102500404.
[18] Q. Zhang, J. Li, J. Rong, X. Weiheng, and H. Jinping, "Application
of WSN in precision forestry" in proc. ICEMI, Chengdu, China,
2011, pp. 320-323, DOI: 10.1109/ICEMI.2011.6038006.
[19] G. Peng, "Wireless Sensor Network as a New Ground Remote
Sensing Technology for Environmental Monitoring", JOURNAL
OF REMOTE SENSING, vol. 11, no. 4, pp. 545-551, Feb. 2007.
[20] L. Wu, "Research and Development of Mobile Forestry GIS Based
on Intelligent Terminal" presented at the 2nd Int. Conf. Remote
Sensing, Environment and Transportation Engineering, Nanjing,
China, Jun. 1-3, 2012.
[21] E.Z. Baskent and S. Keles, "Spatial forest planning: A review", Ecol.
Model., vol. 188, no. 2-4, pp. 145-173, Nov. 2005, DOI:
10.1016/j.ecolmodel.2005.01.059.
[22] M.G. Wing and P. Bettinger, "GIS: An Updated Primer on a
Powerful Management Tool", J. Forest., vol. 101, no. 4, pp. 4-8, Jun.
2003, DOI: 10.1093/jof/101.4.4.
[23] X. Liang, V. Kankare, J. Hyyppä, Y. Wang, A. Kukko, H. Haggrén,
X. Yu, H. Kaartinen, A. Jaakkola, F. Guan, M. Holopainen, and M.
Vastaranta, "Terrestrial laser scanning in forest inventories", ISPRS
J. Photogramm., vol. 115, pp. 63-77, May. 2016, DOI:
10.1016/j.isprsjprs.2016.01.006.
[24] Y. Bo and H. Wang, "The Application of Cloud Computing and the
Internet of Things in Agriculture and Forestry" presented at the Int.
Joint Conf. Service Science, May. 25-27, 2011.
[25] X. Yao and G. Li, "Big spatial vector data management: a review",
Big Earth Data, vol. 2, no. 1, pp. 108-129, Feb. 2018, DOI:
10.1080/20964471.2018.1432115.
[26] N.R. Brisaboa, G.D. Bernardo, G. Gutiérrez, M.R. Luaces, and J.R.
Paramá, "Efficiently Querying Vector and Raster Data", The
Computer Journal, vol. 60, no. 9, pp. 1395-1413, Sep. 2017, DOI:
10.1093/comjnl/bxx011.
[27] L. Wang, Y. Ma, A.Y. Zomaya, R. Ranjan, and D. Chen, "A Parallel
File System with Application-Aware Data Layout Policies for
Massive Remote Sensing Image Processing in Digital Earth", IEEE
T. Parall. Distr., vol. 26, no. 6, pp. 1497-1508, Jun. 2014, DOI:
10.1109/TPDS.2014.2322362.
[28] O. Hassaan, A.K. Nasir, H. Roth, and M.F. Khan, "Precision
Forestry: Trees Counting in Urban Areas Using Visible Imagery
based on an Unmanned Aerial Vehicle", IFAC-PapersOnLine, vol.
49, no. 16, pp. 16-21, 2016, DOI: 10.1016/j.ifacol.2016.10.004.
[29] S. Mulyono and M.I. Fanany, "Remote sensing big data utilization
for paddy growth stages detection" presented at ICARES, Dec. 3-5,
2015.
[30] R. Evans, "Apache Storm, a Hands-on Tutorial" presented at IEEE
Int. Conf. Cloud Engineering, Tempe, AZ, USA, Mar. 9-13, 2015.
[31] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A
comparison on scalability for batch big data processing on Apache
Spark and Apache Flink", Big Data Analytics, vol. 2, no. 1, Oct.
2016, DOI: 10.1186/s41044-016-0020-2.
[32] A. Eldawy, "SpatialHadoop: towards flexible and scalable spatial
processing using mapreduce" in proc. SIGMOD PhD
symposium, Snowbird, Utah, USA, 2014, pp. 46-50.
[33] J. Yu, J. Wu and M. Sarwat, "GeoSpark: a cluster computing
framework for processing large-scale spatial data" presented at Int.
Conf. Advances in Geographic Information Systems, Seattle,
Washington, USA, November 03 - 06, 2015.
[34] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop
Distributed File System" presented at IEEE / NASA Goddard Conf.
Mass Storage Systems and Technologies, Incline Village, NV, USA,
May. 3-7, 2010.
[35] S. Weil, S. Brandt, E. Miller, D. Long, and C. Maltzahn, "Ceph: a
scalable, high-performance distributed file system" in proc. OSDI,
Berkeley, CA, USA, 2006, pp. 307-320.
[36] M.N. Dos Santos and R. Cerqueira, "GridFS: Targeting Data
Sharing in Grid Environments" presented at Int. Symposium
CCGRID, Singapore, Singapore, May. 16-19, 2016.
[37] J.K. Lawder and P.J.H. King, "Querying multi-dimensional data
indexed using the Hilbert space-filling curve", ACM SIGMOD
Record, vol. 30, no. 1, pp. 19-24, Sep. 2000.
[38] G. Mackey, S. Sehrish and J. Wang, "Improving metadata
management for small files in HDFS" presented at Int. conf. Cluster
Computing and Workshops, New Orleans, LA, USA, Aug. 31-Sep.4,
2009.
[39] W. Kou, X. Yang, C. Liang, C. Xie, and S. Gan, "HDFS enabled
storage and management of remote sensing data", presented at IEEE
Int. Conf. Computer and Communications, Chengdu, China, Oct.
14-17, 2016.
[40] H.X. Zhou and Q.Y. Wen, "Data Storage Scheme of Sensor
Network Based on HDFS", Advanced Materials Research, vol. 926-
930, pp. 2462-2465, May. 2014.
[41] G. Chen, "A Distributed Storage Scheme for Remote Sensing Image
based on Mapfile", International Journal of Performability
Engineering, Vol. 14, Issue 10, pp. 2545-2552, OCT. 2018.
[42] F. Xiao, "A Big Spatial Data Processing Framework Applying to
National Geographic Conditions Monitoring", ISPRS - International
Archives of the Photogrammetry, Remote Sensing and Spatial
Information Sciences, vol. XLII-3, pp. 1945-1950, Apr. 2018, DOI:
10.5194/isprs-archives-XLII-3-1945-2018.
[43] S. Xiao, T. Li, B. Guo, and Z. Huang, "Cloud platform wireless
sensor network detection system based on data sharing", Cluster
Computing, Vol. 4, pp. 320-323, Feb. 2018. DOI: 10.1007/s10586-
018-2260-6.
[44] Z. Xiao and Y. Liu, "Remote sensing image database based on
NOSQL database" presented at Int. Conf. Geoinformatics, Shanghai,
China, Jun. 24-26, 2011.
[45] W. Jing and D. Tian, "An improved distributed storage and query
for remote sensing data", Procedia Computer Science, vol. 129, pp.
238-247, 2018. DOI: 10.1016/j.procs.2018.03.071.
[46] S. Nishimura, S. Das, D. Agrawal, and A.E. Abbadi, "MD-HBase:
A Scalable Multi-Dimensional Data Infrastructure for Location
Aware Services" presented at Int. Conf. Mobile Data Management,
Lulea, Sweden, Jun. 6-9, 2011.
[47] N. Zhang, G. Zheng, H. Chen, J. Chen, and X. Chen, "HBaseSpatial:
A Scalable Spatial Data Storage Based on HBase" presented at Int.
Conf. Trust, Security and Privacy in Computing and
Communications, Beijing, China, Sep. 24-26, 2014.
[48] M. Stonebraker, P. Brown, D. Zhang, and J. Becla, "SciDB: A
Database Management System for Applications with Complex
Analytics", Comput. Sci. Eng., vol. 15, no. 3, pp. 54-62, 2013, DOI:
10.1109/MCSE.2013.19.
[49] P. Brown, "Overview of sciDB: large scale array storage, processing
and analysis" presented at Int. Conf. Management of data,
Indianapolis, Indiana, USA, Jun. 06 - 10, 2010.
[50] G. Planthaber, M. Stonebraker and J. Frew, "EarthDB: scalable
analysis of MODIS data using SciDB" presented at Int. Workshop
on Analytics for Big Geospatial Data, Redondo Beach, California,
Nov. 06 - 06, 2012.
[51] M. Appel, F. Lahn, W. Buytaert, and E. Pebesma, "Open and
scalable analytics of large Earth observation datasets: From scenes
to multidimensional arrays using SciDB and GDAL", ISPRS J.
Photogramm and remote sensing, vol. 138, pp. 47-56, Apr. 2018,
DOI: 10.1016/j.isprsjprs.2018.01.014.
[52] Z. Tan, P. Yue and J. Gong, "An Array Database Approach for
Earth Observation Data Management and Processing", ISPRS
International Journal of Geo-Information, vol. 6, no. 7, Jul. 2017,
DOI: 10.3390/ijgi6070220.
[53] A. Eldawy and M.F. Mokbel, "A demonstration of SpatialHadoop:
an efficient mapreduce framework for spatial data", Proceedings of
the VLDB Endowment, vol. 6, no. 12, pp. 1230-1233, 2013, DOI:
10.14778/2536274.2536283.
[54] A. Eldawy, L. Alarabi and M.F. Mokbel, "Spatial partitioning
techniques in SpatialHadoop", Proceedings of the VLDB
Endowment, vol. 8, no. 12, pp. 1602-1605, 2015, DOI:
10.14778/2824032.2824057.
[55] J. Yu, J. Wu and M. Sarwat, "A demonstration of GeoSpark: A
cluster computing framework for processing big spatial data",
presented at Int. Conf. Data Engineering, Helsinki, Finland, May.
16-20, 2016.
[56] Z. Huang, Y. Chen, L. Wan, and X. Peng, "GeoSpark SQL: An
Effective Framework Enabling Spatial Queries on Spark", ISPRS
International Journal of Geo-Information, vol. 6, no. 9, pp. 285, Sep.
2017, DOI: 10.3390/ijgi6090285.
[57] J. Yu, Z. Zhang and M. Sarwat, "Spatial data management in apache
spark: the GeoSpark perspective and beyond", Geoinformatica, pp.
1-42, 2018. DOI: 10.1007/s10707-018-0330-9.
[58] F. Baig, H. Vo, T. Kurc, J. Saltz, and F. Wang, "SparkGIS:
Resource Aware Efficient In-Memory Spatial Query Processing",
presented at Int. Conf. Advances in Geographic Information
Systems, Redondo Beach, CA, USA, Nov. 07 - 10, 2017
[59] R. Phani Bhushan, D.V.L.N. Somayajulu, S. Venkatraman, and
R.B.V. Subramanyam, "A Raster Data Framework Based on
Distributed Heterogeneous Cluster", J. Indian Soc. Remote sensing,
Nov. 2018, DOI: 10.1007/s12524-018-0897-5.
[60] F. Hu, C. Yang, Y. Jiang, Y. Li, W. Song, D.Q. Duffy, J.L. Schnase,
and T. Lee, "A hierarchical indexing strategy for optimizing Apache
Spark with HDFS to efficiently query big geospatial raster data", Int.
J. Digit. Earth, Vol. 10, Issue 1, pp. 1-19, Nov. 2018, DOI:
10.1080/17538947.2018.1523957.
[61] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz,
"Hadoop-GIS: A High-Performance Spatial Data Warehousing
System over MapReduce", Proceedings of the VLDB Endowment.
International Conference on Very Large Data Bases, vol. 6, no. 11,
pp. 1009-1020, Aug. 2013. DOI: 10.14778/2536222.2536227.
[62] S. You, J. Zhang and L. Gruenwald, "Large-scale spatial join query
processing in Cloud", presented at Int. Conf. Data Engineering
Workshops, Seoul, South Korea, Apr. 13-17 2015.
[63] B. Shangguan, P. Yue, Z. Wu, and L. Jiang, "Big spatial data
processing with Apache Spark", presented at Int. Conf. Agro-
Geoinformatics, Fairfax, VA, USA, Aug. 7-10, 2017.
[64] X. Hong, L. Na, T. Weipeng, and C. Yebin, "Range queries in
spatial index research based on the Spark", presented at Int. Conf.
Computational Intelligence and Applications, Beijing, China, Sep.
8-11, 2017.
[65] G. Chen, P. Nie and W. Jing, "A Novel Query Method for Spatial
Data in Mobile Cloud Computing Environment", Wireless
Communications and Mobile Computing, vol. 2018, pp. 1-11, May.
2018. DOI: 10.1155/2018/1059231.
[66] F. Wang, X. Wang, W. Cui, X. Xiao, Y. Zhou, and J. Li,
"Distributed retrieval for massive remote sensing image metadata on
spark", presented at Int. Geoscience and Remote Sensing
Symposium, Beijing, China, Jul. 10-15 2016.
[67] H. Gupta, B. Chawda, S. Negi, T. Faruquie, L. Subramaniam, and
M. Mohania, "Processing multi-way spatial joins on map-reduce",
presented at Int. Conf. Extending Database Technology, Genoa,
Italy, March 18 - 22, 2013.
[68] B. Qiao, J. Zhu, Y. Zheng, M. Shen, and G. Wang, "A Multi-Way
Spatial Join Querying Processing Algorithm Based on Spark",
Journal of Computer Research and Development, vol. 7, no. 54, pp.
1592-1602, 2017. DOI: 10.7544/issn1000-1239.2017.20160558 (in
Chinese)
[69] X. Zhao, J. Zhang and X. Qin, "kNN-DP: Handling Data Skewness
in kNN Joins using MapReduce", IEEE T. Parall. Distr., vol. 29, no.
3, pp. 600-613, Oct. 2017. DOI: 10.1109/TPDS.2017.2767596.
[70] W. Zhangang, Z. Dafang, Q. Dongsheng, and M. Tao, "Application
Analysis of Data Mining and Visualization in Forestry",
GEO-INFORMATION SCIENCE, vol. 9, no. 4, pp. 19-22+141, 2007.
[71] J. Zhao and J. Guo, "Big data analysis technology application in
agricultural intelligence decision system" presented at Int. Conf.
Cloud Computing and Big Data Analysis, Chengdu, China, Apr. 20-
22, 2018.
[72] M. Das and S.K. Ghosh, "Deep-STEP: A Deep Learning Approach
for Spatiotemporal Prediction of Remote Sensing Data", IEEE
Geosci. Remote S., vol. 13, no. 12, pp. 1984-1988, Nov. 2016, DOI:
10.1109/LGRS.2016.2619984.
[73] Y. Chen, C. Li, P. Ghamisi, X. Jia, and Y. Gu, "Deep Fusion of
Remote Sensing Data for Accurate Classification", IEEE Geosci.
Remote S., vol. 14, no. 8, pp. 1253-1257, Jun. 2017. DOI:
10.1109/LGRS.2017.2704625.
[74] H. Goyal, C. Sharma and N. Joshi, "An Integrated Approach of GIS
and Spatial Data Mining in Big Data", International Journal of
Computer Applications, vol. 169, no. 11, pp. 1-6, Jul. 2017. DOI:
10.5120/ijca2017914012.
[75] V.R. Eluri, M. Ramesh, A.S.M. Al-Jabri, and M. Jane, "A
comparative study of various clustering techniques on big data sets
using Apache Mahout", presented at Int. Conf. Big Data and Smart
City, Muscat, Oman, Mar. 15-16, 2016.
[76] S. Harifi, E. Byagowi and M. Khalilian, "Comparative Study of
Apache Spark MLlib Clustering Algorithms", presented at Int. Conf.
Data Mining and Big Data, Fukuoka, Japan, Jul. 27-Aug. 1, 2017.
[77] M. Armbrust, D. Bateman, R. Xin, M. Zaharia, and M.C. Databricks,
"Introduction to Spark 2.0 for Database Researchers", presented at
Int. Conf. Management of Data, San Francisco, California, USA,
Jun. 26-Jul. 01, 2016
[78] R. Giachetta, "A framework for processing large scale geospatial
and remote sensing data in MapReduce environment", Computers &
Graphics, vol. 49, pp. 37-46, Jun. 2015. DOI:
10.1016/j.cag.2015.03.003.
[79] H. Omrani, B. Parmentier, M. Helbich, and B. Pijanowski, "The
land transformation model-cluster framework: Applying k-means
and the Spark computing environment for large scale land change
analytics", Environ. Modell. Softw., vol. 111, pp. 182-191, Jan.
2019. DOI: 10.1016/j.envsoft.2018.10.004.
[80] S. Yue, "Research on Programming Support Tool for Fast
Processing of Remote Sensing Data", Ph.D. dissertation, Institute
of Remote Sensing and Digital Earth Chinese Academy of Sciences,
2017. (in Chinese)
[81] R. Kune, P.K. Konugurthi, A. Agarwal, R.R. Chillarige, and R.
Buyya, "XHAMI - extended HDFS and MapReduce interface for
Big Data image processing applications in cloud computing
environments", Software: Practice and Experience, vol. 47, no. 3, pp.
455-472, Mar. 2017. DOI: 10.1002/spe.2425.
[82] R.B. Zadeh, X. Meng, A. Ulanov, B. Yavuz, L. Pu, S.
Venkataraman, E. Sparks, A. Staple, and M. Zaharia, "Matrix
Computations and Optimization in Apache Spark" in proc. KDD,
San Francisco, California, USA, 2016, pp. 31-38.
[83] Google, "GoogleEarth", Available: https://earth.google.com/web/,
2001.
[84] A. Eldawy, M.F. Mokbel and C. Jonathan, "HadoopViz: A
MapReduce framework for extensible visualization of big spatial
data", presented at Int. Conf. Data Engineering, Helsinki, Finland,
May 16-20, 2016.
[85] J. Yan, Y. Ma, L. Wang, K.R. Choo, and W. Jie, "A cloud-based
remote sensing data production system", Future Generation
Computer Systems, vol. 86, pp. 1154-1166, Sep. 2018. DOI:
10.1016/j.future.2017.02.044.
[86] W. Jing, S. Huo, Q. Miao, and X. Chen, "A Model of Parallel
Mosaicking for Massive Remote Sensing Images Based on Spark",
IEEE Access, vol. 5, pp. 18229-18237, Sep. 2017. DOI:
10.1109/ACCESS.2017.2746098.
[87] J. Yu, A.Z. Zhang and A.M. Sarwat, "GeoSparkViz: A Scalable
Geospatial Data Visualization Framework in the Apache Spark
Ecosystem" presented at Int. Conf. Scientific and Statistical
Database Management, Bozen-Bolzano, Italy, Jul. 09 - 11, 2018.
[88] R.D. Müller, X. Qin, D.T. Sandwell, A. Dutkiewicz, S.E. Williams,
N. Flament, S. Maus, and M. Seton, "The GPlates Portal: Cloud-
Based Interactive 3D Visualization of Global Geophysical and
Geological Data in a Web Browser", PLoS One, vol. 11, no. 3, pp.
1-10, Mar. 2016. DOI: 10.1371/journal.pone.0150883.
[89] M.M. Rathore, A. Ahmad, A. Paul, and A. Daniel, "Hadoop based
real-time Big Data Architecture for remote sensing Earth
Observatory System", presented at Int. Conf. Computing,
Communication and Networking Technologies, Denton, TX, USA,
Jul. 13-15, 2015.
[90] Hao Shuang, Chen Yongfu, Liu Hua, Zhu Xuelin, Dawa Tashi, and
L. Weina, "Object-oriented Forest Classification of Linzhi County
based on CART Decision Tree with Texture Information", Remote
Sensing Technology and Application, vol. 32, no. 2, pp. 386-394,
2017, DOI: 10.11873/j.issn.1004-0323.2017.2.0386. (in Chinese)
[91] Mingming Jia, Chunying Ren, Dianwei Liu, Zongming Wang,
Xuguang Tang, and Zhangyu Dong, "Object-oriented forest
classification based on combination of HJ- 1 CCD and MODIS-
NDVI data", Acta Ecologica Sinica, vol. 34, no. 24, pp. 7167-7174,
2014. DOI: 10.5846/stxb201310112438. (in Chinese)
[92] F.E. Fassnacht, H. Latifi, K. Stereńczak, A. Modzelewska, M.
Lefsky, L.T. Waser, C. Straub, and A. Ghosh, "Review of studies on
tree species classification from remotely sensed data", Remote Sens.
Environ., vol. 186, no. 214, pp. 64-87, Dec. 2016.
[93] H. Guan, Y. Yu, Z. Ji, J. Li, and Q. Zhang, "Deep learning-based
tree classification using mobile LiDAR data", Remote Sens. Lett.,
vol. 6, no. 11, pp. 864-873, Sep. 2015, DOI:
10.1080/2150704X.2015.1088668.
[94] B. Li, H. Zhao and Z. Lv, "Parallel ISODATA Clustering of Remote
Sensing Images Based on MapReduce", 2010, pp. 380-383.
[95] I. Chebbi, W. Boulila and I.R. Farah, "Improvement of Satellite
Image Classification: Approach Based on Hadoop/Map Reduce"
presented at Int. Conf. Advanced Technologies for Signal and
Image Processing, Monastir, Tunisia, Mar. 21-23 2016.
[96] W. Li, Q. Guo, M.K. Jakubowski, and M. Kelly, "A new method for
segmenting individual trees from the lidar point cloud",
Photogrammetric Engineering & Remote Sensing, vol. 78, no. 1, pp.
75-84, Jan. 2012.
[97] M. Chopping, A. Nolin, G.G. Moisen, J.V. Martonchik, and M. Bull,
"Forest canopy height from the Multiangle Imaging
SpectroRadiometer (MISR) assessed with high resolution discrete
return lidar", Remote Sens. Environ., vol. 113, no. 10, pp. 2172-
2185, Oct. 2009.
[98] A. Ashworth, D.L. Evans, W.H. Cooke, A. Londo,
and C. Collins, "Predicting Southeastern
Forest Canopy Heights and Fire Fuel Models using GLAS Data",
Photogrammetric Engineering & Remote Sensing, vol. 76, no. 8, pp.
915-922, Aug. 2010.
[99] P.J. Zarco-Tejada, R. Diaz-Varela, V. Angileri, and P. Loudjani,
"Tree height quantification using very high resolution imagery
acquired from an unmanned aerial vehicle (UAV) and automatic 3D
photo-reconstruction methods", Eur. J. Agron., vol. 55, no. 2, pp.
89-99, Apr. 2014.
[100] Y. Zeng, M.E. Schaepman, B. Wu, J.G.P.W. Clevers, and A.K.
Bregt, "Scaling-based forest structural change detection using an
inverted geometric-optical model in the Three Gorges region of
China", Remote Sens. Environ., vol. 112, no. 12, pp. 4261-4271,
Dec. 2008.
[101] P.U. Ruiliang and P. Gong, "Wavelet transform applied to EO-1
hyperspectral data for forest LAI and crown closure mapping",
Remote Sens. Environ., vol. 91, no. 2, pp. 212-224, May. 2004.
[102] J. Zhao, X. Wang, Y. Zhou, and Q. Qin, "Towards a Framework for
Offering Remote Sensing Data in an Analysis-Ready Format",
presented at Int. Geoscience and Remote Sensing Symposium,
Valencia, Spain, Jul. 22-27, 2018
[103] W. Huang, L. Meng, D. Zhang, and W. Zhang, "In-Memory Parallel
Processing of Massive Remotely Sensed Data Using an Apache
Spark on Hadoop YARN Model", IEEE J.-STARS, vol. 10, no. 1,
pp. 3-19, Jan. 2017. DOI: 10.1109/JSTARS.2016.2547020.
[104] V. Vavilapalli, A. Murthy, C. Douglas, S. Agarwal, M. Konar, R.
Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O.
O'Malley, S. Radia, B. Reed, and E. Baldeschwieler, "Apache
Hadoop YARN: yet another resource negotiator", in proc. SOCC,
Santa Clara, California, 2013, pp. 1-16.
[105] X. Yao, M. Mokbel, S. Ye, G. Li, L. Alarabi, A. Eldawy, Z. Zhao, L.
Zhao, and D. Zhu, "LandQv2: A MapReduce-Based System for
Processing Arable Land Quality Big Data", ISPRS International
Journal of Geo-Information, vol. 7, no. 7, pp. 1-15, Jul. 2018. DOI:
10.3390/ijgi7070271.
[106] Q. Zou, G. Li and W. Yu, "MapReduce functions to remote sensing
distributed data processing-Global vegetation drought monitoring as
example", Software: Practice and Experience, vol. 48, no. 7, pp.
1352-1367, Mar. 2018. DOI: 10.1002/spe.2578.
[107] J. Shao, D. Xu, C. Feng, and M. Chi, "Big data challenges in China
centre for resources satellite data and application", presented in
Workshop on Hyperspectral Image and Signal Processing:
Evolution in Remote Sensing, Tokyo, Japan, Jun. 2-5 2015.
[108] Q. Xianlin, C. Xiaozhong, Z. Xiangqing, Z. Xiaofeng, S. Guifen,
and Y. Lingyu, "Development of Forest Fire Early Warning and
Monitoring Technique System in China", Forest Resources
Management, no. 06, pp. 45-48, 2015. DOI:
10.13466/j.cnki.lyzygl.2015.06.009. (in Chinese)
[109] J. Gao, K. Shalini, N. Gaur, X. Guan, S. Chen, J. Hong, and M.
Mahmoud, "Data-driven forest fire analysis", presented at
SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI, San
Francisco, Aug. 4-8, 2017. DOI: 10.1109/UIC-ATC.2017.8397558
[110] H. Xingyuan, R. Chunying, C. Lin, W. Zongming, and Z. Haifeng,
"The Progress of Forest Ecosystems Monitoring with Remote
Sensing Techniques", Scientia Geographica Sinica, vol. 38, no. 07,
pp. 997-1011, 2018. DOI: 10.13249/j.cnki.sgs.2018.07.001.
[111] T. Rajasekaran, J. Sruthi, S. Revathi, and N. Raveena,
"Forest Fire Prediction and Alert System Using Big Data
Technology", presented at Int. Conf. Information Engineering,
Management and Security, 2015, pp. 23-26.
[112] S. Zhu, J. Zhang, X. Zhou, and J. Yang, "A Simulation Model of
Big Data Analysis for Fire Alarm", in proc. International
Conference on Advances in Energy, Environment and Chemical
Engineering, 2015.
[113] S. Nandhini, I. Govindharaj, N. Nivedhitha, and R. Jayakeerthana,
"A Survey on Forewarning System for Pest Control", International
Journal of Pure and Applied Mathematics, vol. 118, no. 14, pp. 531-
538, 2018.
[114] S. Jiang, C. Shen, Y. Xiao, and X. Huang, "Using MapReduce for
Data Processing in the Cloud for Forest Pest Control", presented at
Int. Conf. Networking and Distributed Computing, Hangzhou,
China, Oct. 21-24, 2010.
[115] H. Song, R. Srinivasan, T. Sookoor, and S. Jeschke,
Smart Cities: Foundations, Principles and Applications.
Hoboken, NJ: Wiley, 2017, pp. 1-906. ISBN: 978-1-119-22639-0.
[116] X. Li, S. Cheng, Z. Lv, H. Song, T. Jia, and N. Lu,
"Data analytics of urban fabric metrics for smart cities", Future
Generation Computer Systems, 2018, ISSN 0167-739X,
https://doi.org/10.1016/j.future.2018.02.017
[117] Z. Lv, H. Song, P. Basanta-Val, A. Steed and M. Jo, "Next-
Generation Big Data Analytics: State of the Art, Challenges, and
Future Research Topics," in IEEE Transactions on Industrial
Informatics, vol. 13, no. 4, pp. 1891-1899, Aug. 2017.
[118] X. Liu, R. Zhu, B. Jalaian, and Y. Sun, "Dynamic Spectrum Access
Algorithm Based on Game Theory in Cognitive Radio Networks", Mobile
Networks and Applications, vol. 20, no. 6, pp. 817-827, Dec. 2015.
[119] R. Zhu, X. Zhang, X. Liu, W. Shu, T. Mao, and B. Jalaeian, "ERDT:
Energy efficient Reliable Decision Transmission for Cooperative
Spectrum Sensing in Industrial IoT", IEEE Access, vol. 3,
pp. 2366-2378, 2015.
[120] G. Dartmann, H. Song, and A. Schmeink, Big Data
Analytics for Cyber-Physical Systems: Machine Learning for the
Internet of Things. Elsevier, 2019, pp. 1-360.
ISBN: 9780128166376.
[121] Y. Sun, H. Song, A. J. Jara and R. Bie, "Internet of Things and Big
Data Analytics for Smart and Connected Communities," in IEEE
Access, vol. 4, pp. 766-773, 2016. DOI:
10.1109/ACCESS.2016.2529723
[122] Z. Lv, T. Yin, X. Zhang, H. Song and G. Chen, "Virtual Reality
Smart City Based on WebVRGIS," in IEEE Internet of Things
Journal, vol. 3, no. 6, pp. 1015-1024, Dec. 2016. DOI:
10.1109/JIOT.2016.2546307
WEITAO ZOU received the B.S. degree from
Northeast Forestry University, China, in 2018.
He is currently pursuing a doctorate at Northeast
Forestry University. His current research interests
include cloud computing and parallel computing,
as well as spatial big data and forestry big data.
WEIPENG JING received the Ph.D. degree
from the Harbin Institute of Technology, China.
He is currently an Associate Professor with
Northeast Forestry University, China. His
research interests include modeling and
scheduling for distributed computing systems,
fault tolerant computing and system reliability,
cloud computing, and spatial data mining. He has
published over 50 research articles in refereed
journals and conference proceedings, such as
CPC, PUC, and FGCS. He is an IEEE Member.
GUANGSHENG CHEN is currently a doctoral
supervisor and professor at Northeast Forestry
University, China. He is a member of the National
Innovation Methods Research Institute and
executive director of the Education Information
Technology Council of the Education Ministry. His
research interests include biomass material
prediction, intelligent detection of new
composite materials and big data on forestry. He
has published over 30 academic papers and one
monograph.
YANG LU received his B.S., M.S. and Ph.D. in
information and communication engineering
from Harbin Institute of Technology, Harbin,
China, in 2007, 2011 and 2015, respectively. In
2019, he joined Northeast Forestry University,
China. His research interests include machine
learning and cloud computing.
HOUBING SONG received the Ph.D.
degree in electrical engineering from the
University of Virginia, Charlottesville,
VA, in August 2012. In August 2017, he
joined the Department of Electrical,
Computer, Software, and Systems
Engineering, Embry-Riddle Aeronautical
University, Daytona Beach, FL, where he
is currently an Assistant Professor and
the Director of the Security and
Optimization for Networked Globe
Laboratory (SONG Lab, www.SONGLab.us). He served on the faculty of
West Virginia University from August 2012 to August 2017. In 2007 he
was an Engineering Research Associate with the Texas A&M
Transportation Institute. He serves as an Associate Technical Editor for
IEEE Communications Magazine. He is the editor of four books, including
Smart Cities: Foundations, Principles and Applications, Hoboken, NJ:
Wiley, 2017, Security and Privacy in Cyber-Physical Systems:
Foundations, Principles and Applications, Chichester, UK: Wiley-IEEE
Press, 2017, Cyber-Physical Systems: Foundations, Principles and
Applications, Boston, MA: Academic Press, 2016, and Industrial Internet
of Things: Cybermanufacturing Systems, Cham, Switzerland: Springer,
2016. He is the author of more than 100 articles. His research interests
include cyber-physical systems, internet of things, cybersecurity and
privacy, edge computing, big data analytics, and wireless communications
and networking. Dr. Song is a senior member of IEEE and ACM. Dr. Song
was the very first recipient of the Golden Bear Scholar Award, the highest
faculty research award at West Virginia University Institute of Technology
(WVU Tech), in 2016.
... The situation is further exacerbated by administrative boundaries that hinder harmonization of procedures and standardization of output formats and content. Indeed, data integration is considered one of the main challenges of forestry science (Zou et al., 2019). To mitigate these issues, organizations and networks such as European National Forest Inventory Network (ENFIN) have devoted significant resources in the harmonization of forest inventories (Vidal et al., 2016). ...
... This information was employed to generate maps displaying dominant species per municipality, offering a fairly accurate reflection of species distribution across Spain (Figure 12). Zou et al. (2019) describe this use case in more depth. ...
... In this way, it is possible to pose arbitrary queries that make use of the different sources through the Cross-Forest endpoint. Note that data integration is one of the main challenges of forestry science, as reported in many studies such as Zou et al. (2019). The Cross-Forest Dataset thus illustrates how the use of Semantic Web technologies and Linked Open Data principles can be applied to address this challenge. ...
Introduction Modern forestry increasingly relies on the management of large datasets, such as forest inventories and land cover maps. Governments are typically in charge of publishing these datasets, but they typically employ disparate data formats (sometimes proprietary ones) and published datasets are commonly disconnected from other sources, including previous versions of such datasets. As a result, the usage of forestry data is very challenging, especially if we need to combine multiple datasets. Methods and results Semantic Web technologies, standardized by the World Wide Web Consortium (W3C), have emerged in the last decades as a solution to publish heterogeneous data in an interoperable way. They enable the publication of self-describing data that can easily interlink with other sources. The concepts and relationships between them are described using ontologies, and the data can be published as Linked Data on the Web, which can be downloaded or queried online. National and international agencies promote the publication of governmental data as Linked Open Data, and research fields such as biosciences or cultural heritage make an extensive use of Semantic Web technologies. In this study, we present the result of the European Cross-Forest project, addressing the integration and publication of national forest inventories and land cover maps from Spain and Portugal using Semantic Web technologies. We used a bottom-up methodology to design the ontologies, with the goal of being generalizable to other countries and forestry datasets. First, we created an ontology for each dataset to describe the concepts (plots, trees, positions, measures, and so on) and relationships between the data in detail. We converted the source data into Linked Open Data by using the ontology to annotate the data such as species taxonomies. 
As a result, all the datasets are integrated into one place, the Cross-Forest dataset, and are available for querying and analysis through a SPARQL endpoint. These data have been used in real-world use cases such as (1) providing a graphical representation of all the data, (2) combining it with spatial planning data to reveal the forestry resources under the management of Spanish municipalities, and (3) facilitating data selection and ingestion to predict the evolution of forest inventories and simulate how different actions and conditions impact this evolution. Discussion The work started in the Cross-Forest project continues in current lines of research, including the addition of the temporal dimension to the data, aligning the ontologies and data with additional well-known vocabularies and datasets, and incorporating additional forestry resources.
... Forestry management is entering a new era of technological innovation, marked by the integration of advanced computational methods and environmental science. The advent of smart forestry, using data-driven approaches, has opened new pathways for sustainable forest management and environmental conservation [5]. Advancements in remote sensing technologies, such as satellite imagery and aerial photography, have propelled the field of forestry into the digital age [6]. ...
... The field of Smart Technology is a rapidly evolving amalgamation of hot topic technologies, including big data, cloud computing, remote sensing (RS), machine learning (ML), artificial intelligence (AI), the Internet of Things, and low-cost sensors, all of which are currently being used in some capacity to enhance agricultural and forest management across the world [1,2]. However, the pace of advancement has often created disparities between the application of these technologies and the accessibility of the data products by end users, especially the broader public outside the scientific community [3]. ...
Remote sensing (RS) and Geographic Information Systems (GIS) provide significant opportunities for monitoring and managing natural resources across various temporal, spectral, and spatial resolutions. There is a critical need for natural resource managers to understand the expanding capabilities of image sources, analysis techniques, and in-situ validation methods. This article reviews a range of image analysis tools applicable to natural resource management, including agriculture, water, forests, soil, and natural hazards, and compares their functionalities. Our study highlights that Google Earth Engine (GEE) is favored for wide-area analysis due to its extensive coverage and free access. Global Mapper excels in 3D and light detection and ranging (LIDAR) data, environment for visualizing images (ENVI) specializes in multi- and hyperspectral image processing, ERDAS IMAGINE is optimal for radar data, and eCognition is used for object-based image analysis. The article emphasizes the importance of in-situ validation data, which provides essential ground truth information to calibrate and validate RS models, thereby enhancing accuracy. Understanding these tools and integrating them with in-situ validation techniques enables natural resource managers to improve their monitoring and decision-making processes, facilitating effective collaboration with RS researchers.
... All these technologies generate a huge amount of data, but advances in the field of AI allow us to deal with big data. However, the lack of homogeneous data sources and different data structures from different organizations and initiatives around the globe remain a barrier for data management (Zou et al., 2019). ...
A decade of French support to Mediterranean forests under the FAO committee Silva Mediterranea has resulted in establishing strategic approaches for Mediterranean forests, improving stakeholder engagement, consolidating information on Mediterranean forests, and developing and implementing several regional projects in line with the priorities set out under the SFMF. One key priority that emerged during this decade is the restoration of forest landscapes in the Mediterranean.
... Figure 16.1. Scheme of big data applied to forest management (source: Zou et al., 2019). ...
In recent decades, access to data at different scales has grown at an unprecedented rate, with the volume of data doubling every two years. In contrast to other fields, forestry data cover very diverse domains that are not always complementary, have a clear territorial and temporal orientation (they are multidimensional), and demand a certain temporal recurrence, all of which makes them harder to process. However, the way in which forest inventory data are acquired, processed, and analyzed has changed substantially since forest measurements began, most notably in the last two decades. Data derived from forest inventories and complementary sources tend to be very voluminous, have widely differing structures, are often organized according to unsystematic criteria, and include information on very diverse aspects (multi-informative); in addition, their acquisition tends to be costly and their full analytical potential is usually not exploited. Emerging technologies have clearly changed the way we approach silviculture: digitalization processes are increasingly used to optimize the use of information from different sources, the capacity for process analysis, and the modeling related to the management of forest resources and systems. This chapter develops, in both theoretical and practical terms, the aspects related to the access, processing, analysis, and practical use of forestry data, to help readers understand the application of databases in silviculture, including the available data, data architecture, the development and application of new statistical approaches, and future trends in the development of forestry big data. Throughout the text, examples of inventory data analysis using different statistical approaches are proposed.
This chapter links the examples to the source code and to the numerical and graphical outputs, for which access to the R libraries on GitHub is provided. Keywords: data mining, inventories, silviculture, forest management planning.
The field of Smart Life represents a large spectrum of application domains, ranging from very established ones such as Smart energy and Smart home to very recent ones such as Smart environment, Smart airport, or Smart Earth. We observed a substantial growth in the scientific literature, with over 126,000 papers containing “Smart” in their titles in 2021. Despite this huge number, we did not identify a detailed classification of these fields in the existing literature, which leaves a notable gap in their classification and systematization. To address this, we developed a generic taxonomy for Smart applications by conducting a systematic mapping study focused on state-of-the-art and research agenda-oriented papers (2341 scientific publications in total).
Big data generally describes data that is too large and diverse to be analyzed using traditional means, and it has a wide array of applications, including research. The Nigerian forestry subsector is pivotal in the country's economic goals and environmental sustainability. However, the subsector has failed to harness its potential and is plagued with challenges that limit its productivity and prevent forestry research from driving innovation. Big data exists as a transformative tool to create opportunities for promoting forestry research in Nigeria. Therefore, this paper examines the potential of big data in catalyzing forestry research in Nigeria. The challenges facing forestry research in Nigeria have been identified, emphasizing the need for data-driven solutions and increased funding.
Remote sensing (RS) and Geographic Information Systems (GISs) provide significant opportunities for monitoring and managing natural resources across various temporal, spectral, and spatial resolutions. There is a critical need for natural resource managers to understand the expanding capabilities of image sources, analysis techniques, and in situ validation methods. This article reviews key image analysis tools in natural resource management, highlighting their unique strengths across diverse applications such as agriculture, forestry, water resources, soil management, and natural hazard monitoring. Google Earth Engine (GEE), a cloud-based platform introduced in 2010, stands out for its vast geospatial data catalog and scalability, making it ideal for global-scale analysis and algorithm development. ENVI, known for advanced multi- and hyperspectral image processing, excels in vegetation monitoring, environmental analysis, and feature extraction. ERDAS IMAGINE specializes in radar data analysis and LiDAR processing, offering robust classification and terrain analysis capabilities. Global Mapper is recognized for its versatility, supporting over 300 data formats and excelling in 3D visualization and point cloud processing, especially in UAV applications. eCognition leverages object-based image analysis (OBIA) to enhance classification accuracy by grouping pixels into meaningful objects, making it effective in environmental monitoring and urban planning. Lastly, QGIS integrates these remote sensing tools with powerful spatial analysis functions, supporting decision-making in sustainable resource management. Together, these tools when paired with in situ data provide comprehensive solutions for managing and analyzing natural resources across scales.
The paper presents the details of designing and developing GeoSpark, which extends the core engine of Apache Spark and SparkSQL to support spatial data types, indexes, and geometrical operations at scale. The paper also gives a detailed analysis of the technical challenges and opportunities of extending Apache Spark to support state-of-the-art spatial data partitioning techniques: uniform grid, R-tree, Quad-Tree, and KDB-Tree. The paper also shows how building local spatial indexes, e.g., R-Tree or Quad-Tree, on each Spark data partition can speed up the local computation and hence decrease the overall runtime of the spatial analytics program. Furthermore, the paper introduces a comprehensive experiment analysis that surveys and experimentally evaluates the performance of running de-facto spatial operations like spatial range, spatial K-Nearest Neighbors (KNN), and spatial join queries in the Apache Spark ecosystem. Extensive experiments on real spatial datasets show that GeoSpark achieves up to two orders of magnitude faster run time performance than existing Hadoop-based systems and up to an order of magnitude faster performance than Spark-based systems.
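The partition-then-prune idea behind such spatial range queries can be sketched on a single machine. The following is a minimal illustration, assuming a uniform grid over 2D points; it does not use GeoSpark's API, and the names `grid_partition` and `range_query` are hypothetical helpers introduced only for this example.

```python
from collections import defaultdict
from math import floor


def grid_partition(points, cell_size):
    """Assign each (x, y) point to a uniform grid cell keyed by (col, row)."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(floor(x / cell_size), floor(y / cell_size))].append((x, y))
    return cells


def range_query(cells, cell_size, xmin, ymin, xmax, ymax):
    """Return points inside the closed query rectangle, scanning only overlapping cells."""
    hits = []
    for col in range(floor(xmin / cell_size), floor(xmax / cell_size) + 1):
        for row in range(floor(ymin / cell_size), floor(ymax / cell_size) + 1):
            # Cells that were never populated are skipped without any scan.
            for x, y in cells.get((col, row), ()):
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append((x, y))
    return hits


# Hypothetical sample data: four points, one unit-sized cell each.
points = [(0.5, 0.5), (1.5, 1.5), (2.5, 0.2), (9.0, 9.0)]
cells = grid_partition(points, cell_size=1.0)
result = sorted(range_query(cells, 1.0, 0.0, 0.0, 2.0, 2.0))
```

In a distributed setting each grid cell would correspond to a data partition, so a query touches only the partitions whose cells overlap the query window; a tree-based local index inside each partition would replace the linear scan used here.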
Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to efficiently query these big raster data due to the inconsistency among the geospatial raster data model, distributed physical data storage model, and the data pipeline in distributed computing frameworks. To efficiently process big geospatial data, this paper proposes a three-layer hierarchical indexing strategy to optimize Apache Spark with Hadoop Distributed File System (HDFS) from the following aspects: (1) improve I/O efficiency by adopting the chunking data structure; (2) keep the workload balance and high data locality by building the global index (k-d tree); (3) enable Spark and HDFS to natively support geospatial raster data formats (e.g., HDF4, NetCDF4, GeoTiff) by building the local index (hash table); (4) index the in-memory data to further improve geospatial data queries; (5) develop a data repartition strategy to tune the query parallelism while keeping high data locality. The above strategies are implemented by developing the customized RDDs, and evaluated by comparing the performance with that of Spark SQL and SciSpark. The proposed indexing strategy can be applied to other distributed frameworks or cloud-based computing systems to natively support big geospatial data query with high efficiency.
Much effort has been devoted to support high performance spatial queries on large volumes of spatial data in distributed spatial computing systems, especially in the MapReduce paradigm. Recent works have focused on extending spatial MapReduce frameworks to leverage high performance in-memory distributed processing capabilities of systems such as Spark. However, the performance advantage comes with the requirement of having enough memory and comprehensive configuration. Failing to fulfill this falls back to disk IO, defeating the purpose of such systems or in worst case gets out of memory and fails the job. The problem is aggravated further for spatial processing since the underlying in-memory systems are oblivious of spatial data features and characteristics. In this paper we present SparkGIS - an in-memory oriented spatial data querying system for high throughput and low latency spatial query handling by adapting Apache Spark's distributed processing capabilities. It supports basic spatial queries including containment, spatial join and k-nearest neighbor and allows extending these to complex query pipelines. SparkGIS mitigates skew in distributed processing by supporting several dynamic partitioning algorithms suitable for a rich set of contemporary application scenarios. Multilevel global and local, pre-generated and on-demand in-memory indexes, allow SparkGIS to prune input data and apply compute intensive operations on a subset of relevant spatial objects only. Finally, SparkGIS employs dynamic query rewriting to gracefully manage large spatial query workflows that exceed available distributed resources. Our comparative evaluation has shown that the performance of SparkGIS is on par with contemporary Spark based platforms for relatively smaller queries and outperforms them for larger data and memory intensive workflows by dynamic query rewriting and efficient spatial data management.
Arable land quality (ALQ) data are a foundational resource for national food security. With the rapid development of spatial information technologies, the annual acquisition and update of ALQ data covering the country have become more accurate and faster. ALQ data are mainly vector-based spatial big data in the ESRI (Environmental Systems Research Institute) shapefile format. Although the shapefile is the most common GIS vector data format, unfortunately, the usage of ALQ data is very constrained due to its massive size and the limited capabilities of traditional applications. To tackle the above issues, this paper introduces LandQv2, which is a MapReduce-based parallel processing system for ALQ big data. The core content of LandQv2 is composed of four key technologies including data preprocessing, the distributed R-tree index, the spatial range query, and the map tile pyramid model-based visualization. According to the functions in LandQv2, firstly, ALQ big data are transformed by a MapReduce-based parallel algorithm from the ESRI Shapefile format to the GeoCSV file format in HDFS (Hadoop Distributed File System), and then, the spatial coding-based partition and R-tree index are executed for the spatial range query operation. In addition, the visualization of ALQ big data with a GIS (Geographic Information System) web API (Application Programming Interface) uses the MapReduce program to generate a single image or pyramid tiles for big data display. Finally, a set of experiments running on a live system deployed on a cluster of machines shows the efficiency and scalability of the proposed system. All of these functions supported by LandQv2 are integrated into SpatialHadoop, and it is also able to efficiently support any other distributed spatial big data systems.
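A map tile pyramid can be illustrated with the widely used XYZ (Web Mercator) tiling scheme. The sketch below assumes that scheme, which the abstract does not confirm for LandQv2; `lonlat_to_tile` and `tiles_per_level` are hypothetical names chosen for this example.

```python
import math


def lonlat_to_tile(lon_deg, lat_deg, zoom):
    """Map a WGS84 longitude/latitude to (x, y) tile indices at a zoom level."""
    n = 2 ** zoom  # tiles per axis at this pyramid level
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or extreme latitudes stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_per_level(max_zoom):
    """Number of tiles at each pyramid level, which is 4**z under this scheme."""
    return [4 ** z for z in range(max_zoom + 1)]
```

Because the tile count quadruples with each level, rendering the deeper levels is the expensive part of pyramid generation, which is one reason to parallelize tile generation across a cluster.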
Hyperspectral images contain large amounts of data with a complex structure. The distributed storage of massive remote sensing data is a hot topic today; however, current research mostly separates the image pixels from the metadata, resulting in poor system cohesion and poor data access performance. At the same time, the needs of various upper-level remote sensing algorithms are not fully considered, which makes the system less available. In view of these problems, this paper presents a distributed image storage model based on HDFS, which stores the entire image data model in one structure to improve system cohesion, and provides a flexible data blocking strategy for upper-level applications to meet a variety of data access needs. Comparison experiments show that the storage model has better access performance than existing schemes.
Advancements in satellite imaging and sensor technologies result in the capture of large amounts of spatial data. Many parallel processing techniques based on data or control parallelism have been attempted during the past two decades to improve performance in image processing applications such as urban sprawl, weather prediction and crop estimation. These techniques have used block-based distributed file processing or the more modern MapReduce-based programming for implementation, which still leave gaps between optimal and best processing in terms of resource scheduling, data distribution and ease of programming. In this paper, we present a layered framework for parallel data processing to improve storage, retrieval and processing performance of spatial data on an underlying distributed file system. The paper presents a data placement strategy across a distributed HDFS cluster in a way to optimize spatial data retrieval and processing. The presence of neighborhood pixels local to the processing node in a distributed environment reduces network latencies and improves the efficiency of applications such as object recognition, change detection and site selection. We evaluate the data placement strategy on a four-node HDFS cluster and show that it can deliver good performance benefits by way of reading blocks of data at almost 10–12 times the default, which contributes to the improvement in efficiency of the various applications that use region growing methods.
This study introduces a novel framework for land change simulation that combines the traditional Land Transformation Model (LTM) with data clustering tools for the purposes of conducting land change simulations of large areas (e.g., continental scale) and over multiple time steps. This framework, called "LTM-cluster", subsets massive land use datasets which are presented to the artificial neural network-based LTM. LTM-cluster uses the k-means clustering algorithm implemented within the Spark high-performance compute environment. To illustrate the framework, we use three case studies in the United States which vary in simulation extents, cell size, time intervals, number of inputs, and quantity of urban change. Findings indicate consistent and substantial improvements in accuracy performance for all three case studies compared to the traditional LTM model implemented without input clustering. Specifically, the percent correct match, the area under the operating characteristics curve, and the error rate improved on average by 9%, 11%, and 4%, respectively. These results confirm that LTM-cluster has high reliability when handling large datasets. Future studies should expand on the framework by exploring other clustering methods and algorithms.
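The clustering step that subsets the input data can be illustrated with a plain-Python version of Lloyd's k-means algorithm. This is only a sketch under stated assumptions: it runs on one machine rather than on Spark, it seeds the centroids with the first k points for determinism, and the function name `kmeans` and the sample cells are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on tuples of floats; seeds with the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels


# Hypothetical feature vectors for six land cells forming two obvious groups.
cells = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(cells, k=2)
```

Each resulting cluster could then be passed to a separate model instance, which is the general idea of training on subsets rather than on the full dataset at once.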
Data visualization allows users to summarize, analyze and reason about data. A map visualization tool first loads the designated geospatial data, processes the data and then applies the map visualization effect. Guaranteeing detailed and accurate geospatial map visualization (e.g., at multiple zoom levels) requires extremely high-resolution maps. Classic solutions suffer from limited computation resources and hence take a tremendous amount of time to generate maps for large-scale geospatial data. The paper presents GeoSparkViz, a large-scale geospatial map visualization framework. GeoSparkViz extends a cluster computing system (Apache Spark in our case) to provide native support for general cartographic design. The proposed system seamlessly integrates with a Spark-based spatial data management system, GeoSpark. It provides the data scientist a holistic system that allows her to perform data management and visualization on spatial data and reduces the overhead of loading the intermediate spatial data generated during the data management phase to the designated map visualization tool. GeoSparkViz also proposes a map tile data partitioning method that achieves load balancing for the map visualization workloads among all nodes in the cluster. Extensive experiments show that GeoSparkViz can generate a high-resolution (i.e., Gigapixel image) heatmap of 1.7 billion OpenStreetMap objects and 1.3 billion NYC taxi trips in ≈4 and 5 minutes on a four-node commodity cluster, respectively.