Conference Paper

Scalable developments for big data analytics in remote sensing

... The average AUC values of the CNN model improved by 10.02%, from 81.44% to 91.47%. Cavallaro et al. [9] implemented a GPU-based support vector machine (SVM) classifier using OpenMP and MPI. ...
Article
Full-text available
Artificial Intelligence (AI) and Machine Learning (ML) are useful in the image processing domain for detecting and recognizing soil types. The main aim of the present work is to process soil images and classify them accurately using the TensorFlow and Keras Deep Learning (DL) frameworks with pre-trained weights. Several ML models have already been implemented for the classification of soil images. The dataset has 903 soil images of four different soil types (alluvial, black, clay, and red). These images were divided into a training dataset and a validation dataset. Image augmentation was applied to the dataset, and the models were then trained on the augmented images. In the present work, a Convolutional Neural Network (CNN) model was implemented to classify the soil images and achieved an accuracy of 99.86% for training and 97.68% for validation. Furthermore, six Deep Convolutional Neural Network (DCNN) models were implemented to classify the soil images: ResNet152V2, VGG-16, VGG-19, Inception-ResNetV2, Xception, and DenseNet201. The accuracies of the ResNet152V2, VGG-16, VGG-19, Inception-ResNetV2, Xception, and DenseNet201 DCNN models were 99.15%, 97.58%, 98.44%, 98.15%, 98.86%, and 98.58%, respectively. The performance of the CNN and DCNN models was evaluated using a confusion matrix and the K-fold technique. The proposed CNN model outperformed the DCNN models as well as works reported in the literature.
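The abstract names a confusion matrix and the K-fold technique as evaluation tools but gives no details. As a minimal sketch of what those two steps compute, in plain Python: the contiguous, unshuffled folds, the function names, and the toy labels are all illustrative assumptions, not the authors' setup.

```python
from collections import Counter

# Soil types named in the abstract.
CLASSES = ["alluvial", "black", "clay", "red"]


def k_fold_indices(n_samples, k):
    """Yield (train, val) index lists for k contiguous, unshuffled folds."""
    # The first (n_samples % k) folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Toy labels (hypothetical): one clay image is misclassified as red.
y_true = ["alluvial", "black", "clay", "red", "red"]
y_pred = ["alluvial", "black", "red", "red", "red"]
matrix = confusion_matrix(y_true, y_pred, CLASSES)
```

In a real evaluation the indices would be shuffled (and usually stratified by class) before splitting, and one confusion matrix would be computed per validation fold.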
... Vitolo et al. [9] gave an overview of currently available implementations related to web-based technologies for processing large and heterogeneous datasets and discussed their relevance within the context of environmental data processing, simulation, and prediction. Cavallaro et al. [13] presented how big data analytics takes advantage of techniques from the fields of data mining, machine learning, and statistics, with a focus on analyzing big data in remote sensing with modern technologies. Kussul et al. [14] proposed a framework based on deep learning for solving large-scale classification and area estimation problems in the remote sensing domain. ...
Article
Full-text available
Aggregation and visualization of geographical data are an important part of environmental data mining, environmental modelling, and agricultural management. However, it is difficult to aggregate geospatial data of the various formats such as maps, census and surveys. This paper presents a framework named PlaniSphere, which can aggregate the various geospatial datasets and synthesizes raw data. We developed an algorithm in PlaniSphere to aggregate remote sensing images with census data for classification and visualization of land use and land cover (LULC). The results show that the framework is able to classify geospatial data sets of LULC from multiple formats. National census data sets can be used for calibration of remote sensing LULC classifications. This provides a new approach for the classification of remote sensing data. The approach proposed in this paper is useful for LULC classification in environmental spatial analysis.
... Their performance tests showed efficient and economical results. Cavallaro et al. [20] used GPUs to implement a support vector machine (SVM) classifier with MPI and openMPI frameworks. As an alternative to using HPC computing frameworks, other researchers have developed their own bespoke parallel large-scale remote sensing data processing platforms. ...
Article
Full-text available
Inquiry using data from remote Earth-observing platforms often confronts a straightforward but particularly thorny problem: huge amounts of data, in ever-replenishing supplies, are available to support inquiry, but scientists’ agility in converting data into actionable information often struggles to keep pace with rapidly incoming streams of data that amass in expanding archival silos. Abstraction of those data is a convenient response, and many studies informed purely by remotely sensed data are by necessity limited to a small study area with relatively few scenes of imagery, or they rely on larger mosaics of images at low resolution. As a result, it is often challenging to thread explanations across scales from the local to the global, even though doing so is often critical to the science under pursuit. Here, a solution is proposed that exploits Apache Spark to implement parallel, in-memory image processing with the ability to rapidly classify large volumes of multiscale remotely sensed images and to perform the analysis necessary to detect changes in the time series. It shows that processing on three different scales of Landsat 8 data (up to ~107.4 GB, five-scene, time series image sets) can be accomplished in 1018 seconds on a local cloud environment. Applying the same framework with slight parameter adjustments, it processed MODIS data of the same coverage in 54 seconds on a commercial cloud platform. Theoretically, the proposed scheme can handle all forms of remote sensing imagery commonly used in the Earth and environmental sciences, requiring only minor adjustments in the parameterization of the computing jobs to adjust to the data. The authors suggest that the “Spark sensing” approach could provide the flexibility, extensibility, and accessibility necessary to keep inquiry in the Earth and environmental sciences at pace with developments in data provision.
... Cavallaro et al. have addressed the classification of land cover types over an image-based dataset as a concrete big data problem in their work [44]. In the scope of the study, PiSVM, an implementation based on LibSVM, was used for classification. ...
... Vitolo et al. [9] gave an overview of currently available implementations related to web-based technologies for processing large and heterogeneous datasets and discussed their relevance within the context of environmental data processing, simulation, and prediction. Cavallaro et al. [13] presented how big data analytics takes advantage of techniques from the fields of data mining, machine learning, and statistics, with a focus on analyzing big data in remote sensing with modern technologies. Kussul et al. [14] proposed a framework based on deep learning for solving large-scale classification and area estimation problems in the remote sensing domain. ...
Article
Full-text available
Geospatial information plays an important role in environmental modelling, resource management, business operations, and government policy. However, the formats of various geospatial data have little or nothing in common, which makes the available geospatial information difficult to use. These disparate data sources must be aggregated before further extraction and analysis can be performed. The objective of this paper is to develop a framework called PlaniSphere, which aggregates various geospatial datasets, synthesizes raw data, and allows for third-party customization of the software. PlaniSphere uses NASA World Wind to access remote data and map servers, with Web Map Service (WMS) as the underlying protocol that supports service-oriented architecture (SOA). The results show that PlaniSphere can aggregate and parse files that reside in local storage and conform to the following formats: GeoTIFF, ESRI shapefiles, and KML. Spatial data retrieved from the Internet using WMS can be used to create geospatial data sets (map data) from multiple sources, regardless of who the data providers are. The plug-in function of this framework can be extended for wider uses, such as aggregating and fusing geospatial data from different data sources, by providing customizations to serve future uses. By contrast, the capacity of the commercial ESRI ArcGIS software to add libraries and tools is limited by its closed-source architecture and proprietary data structures. Analysis and the increasing availability of geo-referenced data may provide an effective way to manage spatial information by using large-scale storage, multidimensional data management, and Online Analytical Processing (OLAP) capabilities in one system.
... An up-to-date list of project publications is available online. In addition, three papers related to the North State project have been accepted as oral presentations at IGARSS 2015, on TanDEM-X and Landsat 8 data fusion [7], morphological attribute profiles [8] and big data analytics in remote sensing [9]. ...
Conference Paper
Full-text available
This is a selection of results from the North State project which demonstrate how innovative methods applied to the new Sentinel data streams can be combined with models to monitor carbon and water fluxes for pan-boreal Europe.
Article
Full-text available
Enormous amounts of Earth information, gathered from satellite sensors, simulations, and other resources, are collectively referred to as Big Earth Observation Data (BEOD). The data contain remarkable insights and spatio-temporal stamps of pertinent Earth phenomena for enhancing our knowledge of, responding to, and addressing demanding situations in Earth sciences and observations. However, traditional data mining algorithms are generally time-inefficient, making it difficult to process and analyze BEOD. To address this challenge, we explore two ways to enhance scalability: 1) improving the algorithm with specific parameters or data modifications when run on a single machine, and 2) making the algorithm parallel through distributed execution on multiple machines, such as with cluster-based implementations. We also suggest improvements for existing techniques and widely used algorithms for processing BEOD. We conduct a systematic review of data mining techniques for classification, clustering, prediction, regression, association rules, pattern mining, and anomaly detection to determine their scalability and performance on various Earth observation use cases. We explored advanced mining techniques, including statistical, machine learning, and deep learning approaches, for handling BEOD at scale. We also identified potential challenges and open research issues, and provide future directions for the development of scalable algorithms for mining BEOD. We observed that applying data mining techniques to BEOD introduces serious concerns, since the data have both spatial and temporal components. Statistical and machine learning models such as ARIMA, SARIMA, Naive Bayes, Bayesian networks, KNN, K-means, and SVM are not suitable for working with the volume and heterogeneity present in the data, but their performance can be improved by employing big data environments.
Nowadays, deep learning-based techniques are popularly used for working with large amounts of data, but they require specialized systems and upgrades as the data volume increases. This issue can be addressed through a big data platform. A unified deep learning architecture can be employed to handle both the spatial and temporal components of BEOD, and performance can be improved by deploying the architecture in a big data environment. This study also reveals that, although deep learning architectures are efficient and trending, traditional statistical methods and machine learning can achieve competitive or sometimes improved performance with the involvement of big data technologies and/or internal data representation.
Chapter
This chapter covers concepts of big data analytics in general, including definitions, categories, and use case overview. Processes and requirements for remote sensing big data analytics are discussed. Two standard efforts on big data analytics are briefly covered, that is, ISO (International Organization for Standardization) Big Data Working Group and IEEE (Institute of Electrical and Electronics Engineers) Big Data Standard group.
Keywords: Big data analytics; Standard; Descriptive analytics; Diagnosis analytics; Predictive analytics; Data science; Machine learning; Data mining; Predictive modeling
Article
Full-text available
Support Vector Machines (SVMs) are powerful classification and regression tools. They have been widely studied and applied in many practical fields. But their compute and storage requirements increase rapidly with the number of training vectors, putting many problems of practical interest out of their reach. To apply SVMs to large-scale data mining, parallel SVMs have been studied and several parallel SVM methods have been proposed. Most current parallel SVM methods are based on the classical MPI model, which is not easy to use in practice, especially for large-scale, data-intensive data mining problems. MapReduce is an efficient distributed computing model for processing large-scale data mining problems, and several MapReduce software packages have been developed, such as Hadoop and Twister. In this paper, a parallel SVM based on the iterative MapReduce model Twister is studied. The program flow is developed. The efficiency of the method is illustrated by analyzing practical problems.
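The abstract does not spell out the program flow, so the following is only a sketch of the general iterative MapReduce pattern, not the paper's algorithm. It swaps in a simpler technique, a linear SVM trained by full-batch subgradient descent on the regularized hinge loss: each map call computes a partial subgradient on one data partition, the reduce step sums the partials, and the driver loop repeats. All function names, hyperparameters, and data points are hypothetical, and the map calls run sequentially here where a framework such as Twister would distribute them.

```python
def map_partition(partition, w, b):
    """Map step: hinge-loss subgradient contribution of one partition."""
    gw = [0.0] * len(w)
    gb = 0.0
    for x, y in partition:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if margin < 1:  # only margin violators contribute
            for j, xj in enumerate(x):
                gw[j] -= y * xj
            gb -= y
    return gw, gb


def reduce_gradients(partials):
    """Reduce step: sum the partial subgradients from all partitions."""
    gw = [sum(column) for column in zip(*(p[0] for p in partials))]
    gb = sum(p[1] for p in partials)
    return gw, gb


def train(partitions, dim, epochs=200, lr=0.1, lam=0.01):
    """Driver loop: one map/reduce round per iteration."""
    n = sum(len(p) for p in partitions)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        # In a real deployment these map calls would run in parallel.
        partials = [map_partition(p, w, b) for p in partitions]
        gw, gb = reduce_gradients(partials)
        w = [wi - lr * (lam * wi + g / n) for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Two partitions of a tiny, linearly separable toy dataset (hypothetical).
partitions = [
    [((2.0, 2.0), 1), ((3.0, 1.0), 1)],
    [((-2.0, -1.0), -1), ((-1.0, -3.0), -1)],
]
w, b = train(partitions, dim=2)
```

The point of the iterative model is visible in the driver loop: the partitions stay fixed across rounds while only the small model state `(w, b)` changes, which is the access pattern that plain, non-iterative MapReduce handles poorly.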
Book
Advances in Environmental Remote Sensing: Sensors, Algorithms, and Applications compiles comprehensive review articles to examine the developments in concepts, methods, techniques, and applications as well as focused articles and case studies on the latest on a particular topic. Divided into four sections, the first deals with various sensors, systems, or sensing operations using different regions of wavelengths. Drawing on the data and lessons learned from the U.S. Landsat remote sensing programs, it reviews key concepts, methods, and practical uses of particular sensors/sensing systems. Section II presents new developments in algorithms and techniques, specifically in image preprocessing, thematic information extraction, and digital change detection. It gives correction algorithms for hyperspectral, thermal, and multispectral sensors, discusses the combined method for performing topographic and atmospheric corrections, and provides examples of correcting non-standard atmospheric conditions, including haze, cirrus, and cloud shadow. Section III focuses on remote sensing of vegetation and related features of the Earth’s surface. It reviews advancements in the remote sensing of ecosystem structure, process, and function, and notes important trade-offs and compromises in characterizing ecosystems from space related to spatial, spectral, and temporal resolutions of the imaging sensors. It discusses the mismatch between leaf-level and species-level ecological variables and satellite spatial resolutions and the resulting difficulties in validating satellite-derived products. Finally, Section IV examines developments in the remote sensing of air, water, and other terrestrial features, reviews MODIS algorithms for aerosol retrieval at both global and local scales, and demonstrates the retrieval of aerosol optical thickness (AOT). 
This section rounds out coverage with a look at remote sensing approaches to measure the urban environment and examines the most important concepts and recent research.
Article
Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let \(n\) denote the number of training instances, \(p\) the reduced matrix dimension after factorization (\(p\) is significantly smaller than \(n\)), and \(m\) the number of machines. PSVM reduces the memory required by the Interior Point Method from \(\mathcal{O}(n^2)\) to \(\mathcal{O}(np/m)\), and improves computation time to \(\mathcal{O}(np^2/m)\). Empirical studies show PSVM to be effective. This chapter was first published in NIPS’07 [1] and the open-source code was made available at [2].
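To get a feel for the stated memory bounds, the sketch below plugs hypothetical values into the two expressions. Big-O notation hides constant factors, so the results are order-of-magnitude illustrations only; the 8-byte double-precision entries and the chosen values of `n`, `p`, and `m` are assumptions, not figures from the chapter.

```python
BYTES_PER_DOUBLE = 8  # assumption: matrix entries stored as 64-bit floats


def full_kernel_bytes(n):
    """Storage for a dense n x n kernel matrix, the O(n^2) term."""
    return n * n * BYTES_PER_DOUBLE


def psvm_bytes_per_machine(n, p, m):
    """Storage for an n x p factor split across m machines, the O(np/m) term."""
    entries = -(-(n * p) // m)  # ceiling division
    return entries * BYTES_PER_DOUBLE


# Hypothetical problem size: one million instances, rank 1000, 100 machines.
n, p, m = 1_000_000, 1_000, 100
dense = full_kernel_bytes(n)                    # 8 * 10**12 bytes, about 8 TB
per_machine = psvm_bytes_per_machine(n, p, m)   # 8 * 10**7 bytes, about 80 MB
```

Under these assumed values the per-machine footprint is five orders of magnitude smaller than the dense kernel matrix, which is what makes the problem fit in commodity memory.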
Article
MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
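The RDD idea described above (a read-only, partitioned collection whose lost partitions can be rebuilt) can be illustrated with a toy, single-process model. This is not Spark's API: the class `ToyRDD` and its methods are hypothetical names, and "losing" a partition is simulated by dropping it from an in-memory cache.

```python
class ToyRDD:
    """Read-only partitioned collection that remembers how to recompute itself."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute  # lineage: partition index -> list of items
        self._cache = {}         # materialized partitions

    @classmethod
    def from_list(cls, data, num_partitions):
        frozen = tuple(data)  # the source data is never mutated
        return cls(num_partitions, lambda i: list(frozen[i::num_partitions]))

    def map(self, fn):
        # A transformation returns a new RDD whose lineage points at the parent.
        parent = self
        return ToyRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in parent.partition(i)],
        )

    def partition(self, i):
        if i not in self._cache:  # rebuild from lineage when missing
            self._cache[i] = self._compute(i)
        return list(self._cache[i])

    def lose_partition(self, i):
        """Simulate a machine failure by dropping one materialized partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


squares = ToyRDD.from_list(range(6), num_partitions=2).map(lambda x: x * x)
before = squares.collect()
squares.lose_partition(0)   # partition 0 is gone from memory...
after = squares.collect()   # ...and is transparently recomputed
```

Because each dataset records the function that produced it rather than replicating its contents, recovery costs only a recomputation of the lost partition. In this toy every partition is cached on first use, whereas real systems let the user decide what to keep in memory.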
Article
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
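The core idea in this abstract, mapping inputs non-linearly into a higher-dimensional feature space and separating them linearly there, can be shown on the classic XOR pattern, which no straight line separates in two dimensions. The sketch below uses a degree-2 polynomial map and hand-picked weights; nothing is trained, and the names `phi`, `W`, and `B` are illustrative assumptions rather than anything from the paper.

```python
def phi(x):
    """Polynomial feature map from 2-D input to 3-D feature space."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def decide(w, b, z):
    """Linear decision rule in feature space."""
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1 if score > 0 else -1


# XOR-style data: the label is +1 when both coordinates share a sign.
POINTS = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]

# Hand-picked separating hyperplane in feature space (not learned).
W = (0, 0, 1)
B = 0

predictions = [decide(W, B, phi(x)) for x, _ in POINTS]
```

In feature space the hyperplane depends only on the product term `x1 * x2`, so the four points that were inseparable in the input plane fall cleanly on either side of it.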
On parallel and scalable classification and clustering techniques for earth science datasets
  • M Goetz
  • M Richerzhagen
  • G Cavallaro
  • C Bodenstein
  • P Glock
  • M Riedel
  • J A Benediktsson
GPU acceleration for support vector machines
  • A Athanasopoulos
  • A Dimou
  • V Mezaris
  • I Kompatsiaris
Parallel support vector machines: The cascade SVM
  • H P Graf
  • E Cosatto
  • L Bottou
  • I Durdanovic
  • V Vapnik
Hierarchical data format, version 5
  • HDF Group
pisvm 1.2.1 analytics 10-fold cross-validation indian pines processed
  • G Cavallaro
  • M Riedel