Developing Scalable Information Extraction Processing Pipelines using R for Earth Observation Applications

Abstract

The amount of spatial information on Earth surface features and human conditions is increasing at a rapid rate. Accessing, processing, and extracting information from such large volumes of data is a challenge for practitioners. Various desktop remote sensing and GIS software packages are available for these applications, but they come at a high cost, offer limited or restricted functionality, and often make it difficult to build customized processing toolchains. This paper describes a custom processing toolchain developed in the open-source R programming language to (i) download earth observation and socioeconomic data, (ii) access crowdsourced open data as ground-truth information to train and validate supervised classification models, and (iii) perform information extraction using decision trees, random forests, and support vector machines. The method can easily be scaled up to larger areas using parallel computing methods.
Ani Ghosh, Alex Mandel, Robert Hijmans
Environmental Science & Policy, University of California, Davis
gfc@ucdavis.edu, biogeo.ucdavis.edu
FOSS4G Boston 2017
What motivated the study?
Who are we?
A USAID Feed the Future funded lab based at UC Davis
What do we do?
Support a large group of interdisciplinary researchers working in developing
countries
What is our goal?
Provide a simple, single interface for specific uses, aimed at users with limited
experience of geospatial technologies
Additional motivation
Land use and land cover (LULC) mapping from remote sensing data
Thousands of research articles have been published on LULC classification
Rarely are the results reproducible or the datasets open
The investment in generating training samples for LULC classes is huge
~90,000 point samples generated globally
How to get the Landsat data?
Other options:
Landsat on AWS
landsat-util
R package: getlandsat
Challenges:
More than 7 million catalogued Landsat tiles are available from USGS
How to identify the tiles?
Which Landsat data to get?
How to filter with dates/time?
How to find the cloud-free tiles?
Getting Landsat tiles: R implementation
User input →
Find the Landsat paths & rows for the AOI from WRS-2, descending [raster, sp] →
Filter the Landsat metadata record on the other inputs [sqldf] →
Download the bands and save to disk [base] →
Stack the downloaded raster bands in R [raster]
Advantages
Minimum user input
No interaction with websites
Only depends on R
Challenges
Depends on Google public bucket
Needs internet
Filtering is slow
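The filter-and-download steps above can be sketched in a few lines of R. This is a minimal sketch, not the authors' code: it assumes a local copy of the scene index CSV from the Google public Landsat bucket, and the column names (WRS_PATH, WRS_ROW, DATE_ACQUIRED, CLOUD_COVER, SCENE_ID, BASE_URL) as well as the example path/row are placeholders to check against the current index.

```r
# Minimal sketch: filter the Landsat scene index with sqldf, then
# download one band from the Google public bucket and load it.
library(sqldf)
library(raster)

# Local copy of the scene index, e.g. from
# https://storage.googleapis.com/gcp-public-data-landsat/index.csv.gz
meta <- read.csv("index.csv", stringsAsFactors = FALSE)

# Filter by WRS-2 path/row (placeholder values), a date window,
# and a cloud-cover threshold; best (clearest) scene first
scenes <- sqldf("SELECT * FROM meta
                 WHERE WRS_PATH = 126 AND WRS_ROW = 52
                   AND DATE_ACQUIRED BETWEEN '2017-01-01' AND '2017-03-31'
                   AND CLOUD_COVER < 10
                 ORDER BY CLOUD_COVER")

# The index stores gs:// URLs; rewrite to HTTPS before downloading
base <- sub("^gs://", "https://storage.googleapis.com/", scenes$BASE_URL[1])
f    <- paste0(scenes$SCENE_ID[1], "_B4.TIF")
download.file(file.path(base, f), f, mode = "wb")
b4   <- raster(f)
```

The same filter would be repeated per band, and the downloaded files stacked with `raster::stack()`.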
Getting Landsat tiles: Phnom Penh, Cambodia
Challenges in getting OSM data
Download
Overpass API
OpenStreetMap website
Planet.osm
Extract
Osmosis
Osmconvert
Osmfilter
Getting OSM data: R implementation
Land use classification schemes/definitions are important for a LULC mapping project
The USGS, ESA, and GlobCover global products each use their own unique LULC classes
Challenges in using OSM data
key: landuse / value: residential
key: building / value: residential
Overlapping polygons
Data quality
Getting OSM data: R implementation
User input →
Get OSM data via the Overpass API for the specified AOI [osmdata, sf] →
Convert sf objects to sp objects and separate the geometry types [raster, sp] →
Randomly generate point samples weighted by area/length →
Combine the point samples for each LULC class [base]
Advantages
Minimum user input
No interaction with websites
Only depends on R
Challenges
Slow for a large number of polygons
Needs internet
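A minimal sketch of the workflow above using the osmdata package; the place name, tag, and sample size are illustrative, and the native pipe requires R >= 4.1:

```r
# Fetch one OSM land-use class via the Overpass API, convert to sp,
# and draw random point samples to use as labelled training data.
library(osmdata)
library(sf)
library(sp)

res <- opq("Phnom Penh, Cambodia") |>
  add_osm_feature(key = "landuse", value = "residential") |>
  osmdata_sf()

# Keep the polygon geometries and convert sf -> sp
polys <- as(res$osm_polygons, "Spatial")

# Random points; spsample spreads n over the layer roughly by area
pts <- spsample(polys, n = 200, type = "random")
pts <- SpatialPointsDataFrame(pts,
         data.frame(class = rep("residential", length(pts))))
```

Repeating this per key/value pair and row-binding the results gives the combined training set.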
Final classification: R implementation
User input →
Clean the samples based on vegetation-index thresholds →
Extract pixel values, remove duplicated cells [raster, base] →
Classification model building/training, accuracy assessment [rpart, e1071, randomForest] →
Raster prediction, LULC map generation [raster, rasterVis]
Advantages
Minimum user input
Only depends on R
Challenges
Depends on OSM coverage
Training phase is slow for large samples
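The training and prediction steps can be sketched with randomForest; `img` (a Landsat band stack) and `samples` (labelled points with a `class` column) are hypothetical objects standing in for the outputs of the earlier steps:

```r
library(raster)
library(randomForest)

# Extract band values at the sample points, tracking cell numbers
vals <- extract(img, samples, cellnumbers = TRUE)
keep <- !duplicated(vals[, "cells"])     # drop points sharing a pixel

train <- data.frame(class = factor(samples$class[keep]),
                    vals[keep, -1])

# Train a random forest; the out-of-bag confusion matrix gives a
# first accuracy assessment without a separate test set
rf <- randomForest(class ~ ., data = train, ntree = 500)
print(rf$confusion)

# Predict over the whole raster to produce the LULC map
lulc <- predict(img, rf)
```

Swapping `randomForest` for `rpart::rpart` or `e1071::svm` gives the decision-tree and SVM variants with the same extract/train/predict pattern.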
Level 1 classification results
This particular study is restricted to Level 1 classes only because of the lack of detailed OSM data
Scalability
Three main functions
Depending on computing resources, multiple areas can be run without modification
Running parallel jobs in R is fairly straightforward
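Running several areas in parallel can look like the sketch below; `run_pipeline()` is a hypothetical wrapper around the three main functions (get Landsat, get OSM samples, classify) for one AOI:

```r
library(parallel)

aois <- c("Phnom Penh, Cambodia", "Hanoi, Vietnam", "Kampala, Uganda")

# One forked worker per area (mclapply forks; use parLapply on Windows)
maps <- mclapply(aois, run_pipeline,
                 mc.cores = min(length(aois), detectCores()))
```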
Future/ongoing activities
Other classification options (including deep learning methods)
Model tuning and robust accuracy assessment
Incorporating data from other crowd-sourced campaigns (e.g. geo-wiki)
Shiny interface
Change detection with OSM revision history and concurrent Landsat data
Seasonal spectral libraries for LULC classes at global scales
At FOSS4G 2017
rspatial.org: Talk by Alex Mandel today
Talks and workshop by Tina Cormier
End-to-End Geo Machine Learning
Digital Globe: SpaceNet overview
If you want to jump into R-Geo world
anighosh@ucdavis.edu
Creators and maintainers of the packages:
e1071, osmdata, randomForest, raster, rpart, sf, sp
Funding
Funded by the United States Agency for International Development (USAID) through the Feed the Future
Innovation Lab for Sustainable Intensification (Cooperative Agreement No. AID-OAA-L-14-00006). The
contents are the sole responsibility of the authors and do not necessarily reflect the views of USAID
or the United States Government.
Questions and acknowledgements