
Geoflow - Novel Workflow Implementations To Facilitate Big EO Data Workflows in Nextflow

Authors:
Humboldt-Universität zu Berlin | Department of Geography | Earth Observation Lab | Unter den Linden 6 | D-10099 Berlin | Tel.: +49.30.2093.6905 | Fax: +49.30.2093.6848 | e-mail: patrick.hostert@geo.hu-berlin.de
Geography Department
Florian Katerndahl¹*, Dirk Pflugmacher¹, Fabian Lehmann², Andreas Janz¹, Ulf Leser², Patrick Hostert¹
¹ Humboldt-Universität zu Berlin | Department of Geography | Earth Observation Lab | Unter den Linden 6 | D-10099 Berlin
² Humboldt-Universität zu Berlin | Institute for Computer Science | Knowledge Management in Bioinformatics | Unter den Linden 6 | D-10099 Berlin
*contact: florian.katerndahl@geo.hu-berlin.de | https://www.geographie.hu-berlin.de/en/professorships/eol
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 414984028 – SFB 1404 FONDA
Challenge
EO analysis workflows lack reusability and scalability across large areas
EO analysis workflows combine tasks with heterogeneous resource requirements
Coupling of specific input data, processing back-ends and execution environments creates software and/or hardware infrastructure lock-ins
As a layer of abstraction, workflow engines promise accessible development of portable, adaptable and dependable processing environments (Lehmann et al. 2021; Leser et al. 2021)
Objectives
Develop a Nextflow workflow that leverages a broad range of existing, widely used open source tools and programs
Map annual land cover between 2000 and 2020 across Germany using Landsat time series and the harmonized European-wide Land Use and Coverage Area frame Survey (LUCAS) (d’Andrimont et al. 2020)
Nextflow
Domain-agnostic workflow (execution) engine with its own Groovy-based DSL and connectors to resource managers
Originated in bioinformatics (Di Tommaso et al. 2017) and has also been used in EO applications (Lehmann et al. 2021)
Workflow Engines
Help scientists organize and execute workflows, and take care of errors and dependencies
Sequences of tasks are declared in a Domain-Specific Language (DSL)
Create and schedule physical tasks
Scaling across computational infrastructure is handled by resource managers (e.g. Kubernetes)
Yield repeatable and portable workflows
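The Workflow Engines points on creating physical tasks and taking care of errors can be illustrated with a minimal Python sketch. It is not Nextflow's implementation. The names `AbstractTask`, `PhysicalTask`, `expand`, `execute`, the tile identifiers and the retry limit are all hypothetical and chosen only for illustration. The sketch assumes that one abstract task declaration is expanded into one physical task per input item, and that a failed physical task is simply retried a fixed number of times.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AbstractTask:
    """A task as declared by the workflow author (hypothetical structure)."""
    name: str
    command: Callable[[str], str]  # maps one input item to a result
    max_retries: int = 2           # assumed retry limit, for illustration only


@dataclass
class PhysicalTask:
    """One concrete execution of an abstract task on a single input item."""
    name: str
    item: str
    attempts: int = 0
    result: str = ""


def expand(task: AbstractTask, items: List[str]) -> List[PhysicalTask]:
    # The engine creates one physical task per input item.
    return [PhysicalTask(f"{task.name}[{item}]", item) for item in items]


def execute(task: AbstractTask, physical: PhysicalTask) -> PhysicalTask:
    # Run the command, retrying on failure up to max_retries additional times.
    last_error = None
    for _ in range(task.max_retries + 1):
        physical.attempts += 1
        try:
            physical.result = task.command(physical.item)
            return physical
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(
        f"{physical.name} failed after {physical.attempts} attempts"
    ) from last_error


def demo():
    failed_once = set()

    def classify(tile: str) -> str:
        # Simulate a transient failure on the first attempt for one tile.
        if tile == "X0002" and tile not in failed_once:
            failed_once.add(tile)
            raise RuntimeError("transient I/O error")
        return f"landcover_{tile}.tif"

    task = AbstractTask("classify", classify)
    finished = [execute(task, p) for p in expand(task, ["X0001", "X0002"])]
    return [(p.name, p.attempts, p.result) for p in finished]
```

Calling `demo()` runs the hypothetical `classify` task on two tiles. The second tile fails once and succeeds on the retry, so the workflow author never has to write the error handling. A real engine would additionally hand each physical task to a resource manager instead of running it in-process.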
Workflows
Wrap single or multiple program calls (e.g. script execution) into abstract tasks
Dependencies between tasks are defined via files
A task can start once all inputs are available
Independent tasks may run in parallel
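The task model described under Workflows (file-based dependencies, a task starts once all its inputs exist, independent tasks may run in parallel) can be sketched in a few lines of Python. This is an illustrative toy scheduler, not how Nextflow is implemented. The `Task` class, the `run_task` stand-in for a real program call and all file names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass(frozen=True)
class Task:
    name: str
    inputs: tuple  # file names the task reads
    output: str    # file name the task writes


def run_task(task: Task, workdir: Path) -> str:
    # Stand-in for a real program call: combine the inputs into the output file.
    parts = [(workdir / f).read_text() for f in task.inputs]
    (workdir / task.output).write_text(f"{task.name}({','.join(parts)})")
    return task.name


def run_workflow(tasks, workdir: Path):
    """Run tasks in waves and return the task names of each wave."""
    pending = list(tasks)
    waves = []
    with ThreadPoolExecutor() as pool:
        while pending:
            # A task is ready once every one of its input files exists.
            ready = [t for t in pending
                     if all((workdir / f).exists() for f in t.inputs)]
            if not ready:
                raise RuntimeError("no runnable task: missing input or cycle")
            # Ready tasks do not wait on each other, so they run in parallel.
            done = list(pool.map(lambda t: run_task(t, workdir), ready))
            waves.append(sorted(done))
            pending = [t for t in pending if t not in ready]
    return waves


def demo():
    with tempfile.TemporaryDirectory() as d:
        workdir = Path(d)
        (workdir / "scene_a.txt").write_text("a")
        (workdir / "scene_b.txt").write_text("b")
        tasks = [
            # Declared first, but must wait for both preprocessing outputs.
            Task("mosaic", ("pre_a.txt", "pre_b.txt"), "mosaic.txt"),
            Task("preprocess_a", ("scene_a.txt",), "pre_a.txt"),
            Task("preprocess_b", ("scene_b.txt",), "pre_b.txt"),
        ]
        waves = run_workflow(tasks, workdir)
        result = (workdir / "mosaic.txt").read_text()
    return waves, result
```

In `demo()`, the two preprocessing tasks form the first wave because their inputs already exist, while `mosaic` only becomes runnable after both output files have been written. The execution order follows from the files alone, not from the order in which the tasks were declared.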
References
d’Andrimont, R., Yordanov, M., Martinez-Sanchez, L., Eiselt, B., Palmieri, A., Dominici, P., Gallego, J., Reuter, H. I., Joebges, C., Lemoine, G., and van der Velde, M. (2020). “Harmonised LUCAS in-situ land cover and use database for field surveys from 2006 to 2018 in the European Union”. In: Scientific Data 7.1, p. 352. doi: 10.1038/s41597-020-00675-z.
Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., and Notredame, C. (2017). “Nextflow enables reproducible computational workflows”. In: Nature Biotechnology 35.4, pp. 316–319. doi: 10.1038/nbt.3820.
EnMAP-Box Developers (2019). EnMAP-Box 3 - A QGIS Plugin to process and visualize hyperspectral remote sensing data. url: https://enmap-box.readthedocs.io.
Frantz, D. (2019). “FORCE—Landsat + Sentinel-2 Analysis Ready Data and Beyond”. In: Remote Sensing 11.9, p. 1124. doi: 10.3390/rs11091124. url: http://www.mdpi.com/2072-4292/11/9/1124.
Lehmann, F., Frantz, D., Becker, S., Leser, U., and Hostert, P. (2021). “FORCE on Nextflow: Scalable Analysis of Earth Observation data on Commodity Clusters”. In: Proceedings of the CIKM 2021 Workshops. url: http://ceur-ws.org/Vol-3052/short12.pdf.
Leser, U., Hilbrich, M., Draxl, C., Eisert, P., Grunske, L., Hostert, P., Kainmüller, D., Kao, O., Kehr, B., Kehrer, T., Koch, C., Markl, V., Meyerhenke, H., Rabl, T., Reinefeld, A., Reinert, K., Ritter, K., Scheuermann, B., Schintke, F., Schweikardt, N., and Weidlich, M. (November 2021). “The Collaborative Research Center FONDA”. In: Datenbank-Spektrum 1610-1995. doi: 10.1007/s13222-021-00397-5.