Data Provenance in Agriculture
Sérgio Manuel Serra da Cruz, Marcos Bacis Ceddia, Renan Carvalho Tàvora Miranda,
Gabriel Rizzo, Filipe Klinger, Renato Cerceau, Ricardo Mesquita, Ricardo Cerceau,
Elton Carneiro Marinho, Eber Assis Schmitz, Elaine Sigette, and Pedro Vieira Cruz
Federal Rural University of Rio de Janeiro, Seropédica, RJ, Brazil
National Agency of Supplementary Health, Rio de Janeiro, RJ, Brazil
Federal Fluminense University, Volta Redonda, RJ, Brazil
SENAI-RJ, Rio de Janeiro, RJ, Brazil
Federal University of Rio de Janeiro, Cidade Universitária, RJ, Brazil
Abstract. Soils are probably the most critical natural resource in agriculture,
and soil security represents a critical and growing global issue. Soil experiments
require vast amounts of high-quality data, are very hard to reproduce, and
there are few studies about the data provenance of such tests. We present OpenSoils,
which shares knowledge about data-centric soil experiments. OpenSoils is a
provenance-oriented and lightweight e-infrastructure that collects, stores,
describes, curates, and harmonizes various soil datasets.
Keywords: Reproducibility · Soil security · Open data · Data quality · Big data
1 Introduction
According to the Food and Agriculture Organization (FAO), an agency of the United
Nations, the world's population is expected to grow to about 9.6 billion by 2050. Thus,
there is widespread concern about the challenges to soil and food systems in meeting
the demand of populations for sufficient, affordable, and nutritious food. There are
similar concerns about how to meet those challenges, and agriculture would benefit
hugely from common, shared global agronomic data spaces.
Modern agriculture is a data-centric interdisciplinary domain that integrates
different subjects (from genomics to soil sciences), different scales (from
genes to geolocalisation), and different markets (from local farmers to multinational
research teams). The ability to manage and explore the resulting datasets is crucial to
tackling the current sustainability challenges. A wide variety of datasets underpin
products and processes; they vary in size, complexity, structure, semantics, subject
matter, and in how they are updated and used.
© Springer Nature Switzerland AG 2018
K. Belhajjame et al. (Eds.): IPAW 2018, LNCS 11017, pp. 257–261, 2018.
Soils are probably the most critical natural resource in agriculture; they generate
environmental, health, and socio-economic benefits that are vital to sustaining life on
Earth [1]. Soil experiments are indispensable sources of knowledge. Researchers
conduct several kinds of soil experiments, which are characterized as long-term field
experiments (LTE) and short-term (in vitro and in silico) lab experiments (STE).
LTE have been running for years in many parts of the world, some for as long as
175 years (e.g., Rothamsted), and need more time to execute the research procedures.
On the other hand, STE can be performed in a few weeks or months and have the
potential to contribute to improving LTE. Thus, it is essential to deliver to the
agronomic community a novel computing infrastructure that can share raw and curated
data and the provenance of STE and LTE and augment the reproducibility of soil
experiments. This paper presents a multi-layer e-infrastructure that brings innovations
to soil science using the FAIR principles (Findable, Accessible, Interoperable, and
Reusable) [2], W3C PROV-DM, open data, and semantic web standards.
2 Experiments in Soils Science
Soil science is the area that studies the soil (and its properties) as a natural
resource, including soil formation, composition, classification, mapping, management,
and use [1,3]; these properties may be physical, chemical, biological, or related to
fertility. Soil experiments are costly because soils are incredibly diverse, and it is
necessary to treat them in a specific manner [3]. Any recommendation fits specific soil
and weather conditions. Besides, soil properties have high spatial and temporal
variability. Finally, changes in soil properties can often be proved and quantified only after
LTE are essential in monitoring and understanding the changes in soil physics or
fertility that occur because of long-term agrotechnical operations. Their scientific and
practical value is immeasurable and keeps increasing over the years. The information
they provide about soil use cannot be replaced by any other means [3]. Additionally, STE
have produced much of the data that built the sciences of soil physics, chemistry, and
biology [1,3]. STE often explore soil processes subject to change over decades, on topics
such as aggregation, weathering, microbial activity, and soil fertility itself.
Although STE enrich soil models, most tend to be reductionist, isolating individual
components, and do not study the whole soil, with its high-order interactions that
become apparent only with time.
3 Open Soils
Data and provenance are the primary and permanent assets in OpenSoils (www. The architecture is an open, provenance-oriented, and lightweight
computational e-infrastructure that relies on layers to store, compute, and share curated
data from soil experiments (STE and LTE) [5]. Figure 1 illustrates a conceptual view and
the flow of information in the architecture.
Layer 1 (End-users layer) - hosts the OpenSoils Web portal; it collects soil data
directly from the LTE into the OpenSoils database. Specialists can use mobile and web
applications (e.g., OpenSoils App, API, and Wet Lab tools) to collect the data directly
in the field (LTE) and trace the route of each soil sample sent to chemistry
and physics laboratories to be analyzed. Usually, the morphological properties of the
soil are analyzed in situ by the specialists. The OpenSoils app sends raw data to the
cloud-based database through the API. After that, each soil sample is tagged and sent to
laboratories, where scientists perform wet experiments and execute STE that evaluate
specific physicochemical properties of each soil horizon; selected soil samples are
shipped to the UFRRJ's soils museum.
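As an illustration of how the route of a tagged sample could be traced, the following is a minimal sketch and not the OpenSoils implementation. The class names, fields, and the tag `SAMPLE-001` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class CustodyEvent:
    """One step in the route of a soil sample (hypothetical structure)."""
    location: str       # e.g. "field", "chemistry lab"
    action: str         # e.g. "collected", "received"
    timestamp: datetime


@dataclass
class SoilSample:
    """A tagged soil sample and the ordered list of places it has been."""
    tag: str
    horizon: str
    route: List[CustodyEvent] = field(default_factory=list)

    def record(self, location: str, action: str) -> None:
        # Append a new event; the timestamp is taken in UTC.
        now = datetime.now(timezone.utc)
        self.route.append(CustodyEvent(location, action, now))

    def current_location(self) -> Optional[str]:
        # The last recorded event tells us where the sample is now.
        return self.route[-1].location if self.route else None


# Hypothetical usage: a sample collected in the field and received by a lab.
sample = SoilSample(tag="SAMPLE-001", horizon="A")
sample.record("field", "collected")
sample.record("chemistry lab", "received")
print(sample.current_location())
print([event.location for event in sample.route])
```

Keeping the route as an append-only list preserves the order of events, which is what a later provenance record would need.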
Layer 2 (Services layer) - hosts soil models and data-centric scientific workflows
that ingest large amounts of legacy data and analyze the consistency of the incoming
data [3].
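The paper does not list the consistency rules applied to incoming data. The sketch below shows what such a check might look like for a single soil horizon record; the field names (`top_cm`, `bottom_cm`, `ph`, `sand_pct`, `silt_pct`, `clay_pct`) and the three rules are assumptions made for illustration.

```python
def check_horizon(record: dict) -> list:
    """Return a list of human-readable problems found in one horizon record."""
    problems = []

    # Rule 1 (assumed): depth limits must exist and be ordered.
    top, bottom = record.get("top_cm"), record.get("bottom_cm")
    if top is None or bottom is None:
        problems.append("missing depth limits")
    elif not 0 <= top < bottom:
        problems.append("depth limits must satisfy 0 <= top < bottom")

    # Rule 2 (assumed): pH, when present, must lie on the 0-14 scale.
    ph = record.get("ph")
    if ph is not None and not 0 <= ph <= 14:
        problems.append("pH outside the 0-14 scale")

    # Rule 3 (assumed): texture fractions should add up to about 100 %.
    fractions = [record.get(k) for k in ("sand_pct", "silt_pct", "clay_pct")]
    if all(f is not None for f in fractions):
        if abs(sum(fractions) - 100.0) > 1.0:
            problems.append("sand + silt + clay should sum to about 100 %")

    return problems


# Hypothetical legacy records: the first is consistent, the second is not.
legacy = [
    {"top_cm": 0, "bottom_cm": 20, "ph": 5.4,
     "sand_pct": 60, "silt_pct": 15, "clay_pct": 25},
    {"top_cm": 20, "bottom_cm": 10, "ph": 15.2,
     "sand_pct": 50, "silt_pct": 30, "clay_pct": 30},
]
report = {index: check_horizon(rec) for index, rec in enumerate(legacy)}
print(report)
```

Returning a list of problems, rather than stopping at the first failure, lets a curator see everything wrong with a legacy record in one pass.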
Layer 3 (Data layer) - stores and describes various soil datasets with metadata.
The internal structure supports varying degrees of data granularity and uses a
database named OpenSoilsDB [5,6], which can store new curated soil data annotated
with provenance metadata. Much of the information needed to assure data quality
and to allow researchers to reproduce STE can be obtained by systematically
capturing data provenance [4]. OpenSoilsDB can store provenance from ETL
workflows and scripts. ETL workflow provenance consists of the record of the
derivation of a result (e.g., a soil experiment, an image, a map) by a computational
process represented as a scientific workflow. Script provenance is obtained by running
the source code of scripts (e.g., R, Python). OpenSoilsDB uses the W3C PROV-DM
recommendation to store provenance and was designed to support the FAIR principles
for scientific data management and data stewardship [2]. The principles ensure
transparency, reproducibility, and reusability of the experiments, facilitating more
systematic data sharing.
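To make the role of PROV-DM concrete, the following sketch records three of its core relations (used, wasGeneratedBy, wasAssociatedWith) in memory and walks them backwards to recover the lineage of a result. It is an illustration only: the `ProvStore` class and the identifiers (`etl:import`, `raw:legacy.csv`, and so on) are hypothetical and do not reflect the OpenSoilsDB schema.

```python
from collections import defaultdict


class ProvStore:
    """Tiny in-memory store for a few PROV-DM relations (illustrative only)."""

    def __init__(self):
        self.used = defaultdict(set)                 # activity -> entities used
        self.was_generated_by = {}                   # entity -> activity
        self.was_associated_with = defaultdict(set)  # activity -> agents

    def record_run(self, activity, inputs, outputs, agent):
        """Record one script or workflow execution."""
        self.used[activity].update(inputs)
        self.was_associated_with[activity].add(agent)
        for entity in outputs:
            self.was_generated_by[entity] = activity

    def lineage(self, entity):
        """Return every upstream entity the given entity depends on."""
        seen, stack = set(), [entity]
        while stack:
            current = stack.pop()
            activity = self.was_generated_by.get(current)
            if activity is None:
                continue  # raw input: no generating activity was recorded
            for source in self.used[activity]:
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


# Hypothetical runs: an ETL import followed by an R script that builds a map.
store = ProvStore()
store.record_run("etl:import", ["raw:legacy.csv"], ["data:profiles"],
                 "agent:curator")
store.record_run("script:texture.R", ["data:profiles"], ["map:clay"],
                 "agent:researcher")
print(sorted(store.lineage("map:clay")))
```

A lineage query of this kind is what lets a researcher see which raw inputs and intermediate datasets a published map depends on before trying to reproduce it.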
The database also supports the ingestion of legacy soil data imported through ETL
workflows. The layer can store scientific and governance data. Besides, to support open
data, we can use general-purpose data repositories (e.g., CKAN, Dataverse, DSpace,
Dryad, DataHub).
A specific thesaurus is used to add semantics and annotate soil data, allowing us to
link it as RDF triples in WikiData. The thesaurus used in the e-infrastructure is
AGROVOC [7], which is a SKOS-XL (Simple Knowledge Organization System eXtension
for Labels) concept scheme published as LOD (Linked Open Data). It covers
several areas of interest of the FAO, including food, agriculture, and environment. This
thesaurus is used by researchers, librarians, and information managers for indexing,
retrieving, and organizing data in agricultural information systems.
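A thesaurus-based annotation can be expressed as RDF triples in N-Triples syntax, as sketched below. The record URI and the concept identifier `c_0000` are placeholders, and the use of the Dublin Core `subject` property is an assumption; the paper does not state which predicate OpenSoils uses.

```python
# AGROVOC concept URIs share this namespace; the identifier below is a placeholder.
AGROVOC = "http://aims.fao.org/aos/agrovoc/"
# Dublin Core Terms "subject" property, assumed here as the linking predicate.
DCT_SUBJECT = "http://purl.org/dc/terms/subject"


def annotate(record_uri: str, concept_ids: list) -> list:
    """Build one N-Triples line per concept, linking a record to AGROVOC."""
    return [
        f"<{record_uri}> <{DCT_SUBJECT}> <{AGROVOC}{concept_id}> ."
        for concept_id in concept_ids
    ]


# Hypothetical soil profile record annotated with one placeholder concept.
triples = annotate("https://example.org/opensoils/profile/1", ["c_0000"])
for line in triples:
    print(line)
```

Each output line is a complete subject, predicate, object statement, so the result can be loaded into any RDF store that accepts N-Triples.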
Data management is not a target in itself, but a key conduit leading to knowledge
discovery and innovation in soil sciences. The OpenSoilsDB database stores scientific and
governance data. The scientific data aim to serve a quality-assessed, georeferenced
database of soil profiles to the Brazilian and international communities upon their
standardization and harmonization. Each soil profile description recorded in the
database has more than 43 entities and 250 attributes to store the soil properties and soil
experiments (mineralogical, morphological, chemical, physical, and environmental
data). Furthermore, the database supports data versioning and provenance, and stores
georeferenced soil data (text and images) about physicochemical analytical data from each
horizon and soil samples analyzed in wet laboratories.
Data governance is an essential block in the knowledge base of information
professionals involved in supporting data-intensive research. Its adoption is advantageous
because it is a service based on standardized, repeatable processes, designed to enable
data discovery and the transparency of data-related transformation processes.
Layer 4 (Governance layer) - hosts data licenses, re-use rights, analytical tools,
visualization and map generation services that can be connected to other software (e.g.,
ArcGIS, R, or Jupyter) to generate analytical reports, predictions, and raster maps.
Although it has received little attention in soil research communities, this layer is
foundational for soil security. The prime function of the layer is to improve and maintain
the citations and quality of the soil datasets; thus, to be successful at governance, quality
must be continuously measured, and the results continuously retrieved by the data and
services layers.
4 Concluding Remarks
Maintaining healthy soils is key to modern agriculture. However, much
computational work remains to be done in soil sciences, along with more in-depth studies
to understand the role of data provenance in agriculture. We introduced OpenSoils,
an e-infrastructure that shares knowledge about STE and LTE in soil security using
FAIR, PROV, and semantic web approaches. The infrastructure is under development and
aims to enhance the reproducibility of experiments and deliver high-quality datasets,
knowledge, and maps based on curated data.
Acknowledgments. This work was supported in part by the Brazilian agencies FNDE/MEC/
SESU, PIBIC/CNPq, Petrobras and CYTED networks BigDSSAgro and SmartLogistcs@IB.
Fig. 1. Overview of the conceptual data-flow in OpenSoils.
References
1. Koch, A., et al.: Soil security: solving the global soil crisis. Glob. Policy 4(4), 434–441 (2013)
2. Wilkinson, M.D., et al.: The FAIR guiding principles for scientific data management and
stewardship. Sci. Data 3, 160018 (2016)
3. Körschens, M.: The importance of long-term field experiments for soil science and
environmental research - a review. Plant Soil Environ. 52, 1–8 (2006)
4. Cruz, S.M.S., do Nascimento, J.A.P.: SisGExp: rethinking long-tail agronomic experiments.
In: Mattoso, M., Glavic, B. (eds.) IPAW 2016. LNCS, vol. 9672, pp. 214–217. Springer,
Cham (2016)
5. Cruz, S.M.S., et al.: Towards an e-infrastructure for open science in soils security. In: XII
Proceedings on Brazilian E-Science Workshop (BRESCI), pp. 59–66. SBC, Natal-RN (2018)
6. Rizzo, G.S.C., Ceddia, M.B., Cruz, S.M.S.: Banco de Dados Pedológico: Primeiros Estudos.
In: 5th Proceedings on Reunião Anual de Iniciação Científica (RAIC), pp. 1–2. UFRRJ,
Seropédica (2017). (in Portuguese)
7. Caracciolo, C., et al.: The AGROVOC linked dataset. Seman. Web 4(3), 341–348 (2013)