Data Provenance and Data Management in eScience
Abstract
eScience allows scientific research to be carried out in highly distributed environments. The complex nature of the interactions in an eScience infrastructure, which often involves a range of instruments, data, models, applications, people and computational facilities, suggests there is a need for data provenance and data management (DPDM). The W3C Provenance Working Group defines the provenance of a resource as a “record that describes entities and processes involved in producing and delivering or otherwise influencing that resource”. It has been widely recognised that provenance is critical to enabling sharing, trust, authentication and reproducibility of eScience processes.
Data Provenance and Data Management in eScience identifies the gaps between DPDM foundations and their practice within eScience domains including clinical trials, bioinformatics and radio astronomy. The book covers important aspects of fundamental research in DPDM including provenance representation and querying. It also explores topics that go beyond the fundamentals including applications. This book is a unique reference for DPDM with broad appeal to anyone interested in the practical issues of DPDM in eScience domains.
... A chart of the overall architecture is shown in Fig. 1. To facilitate the following introduction, we define the following terms: a pipeline refers to a set of data processing steps (although it is also called a workflow, for example in eScience [29]); a processing unit or an operation refers to an executable entity for each step; data refer to both the intermediate and final data output of a unit or a whole pipeline; semantics refer to the relations and dependencies among steps. Our UDFs are extensions of the original data manipulation functions in Pig and can fully replace them. ...
...
// Read two lines from the HTTP response and use them to assemble the data path and part-file name.
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
path += '/' + varname + '_' + in.readLine();
fname = "/part_" + in.readLine();
...
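In the same spirit, the sketch below shows what a provenance-aware Pig UDF can look like. It is a minimal illustration, not LogProv's actual implementation: EvalFunc and Tuple are the standard Pig UDF API, while the operation name and the logEvent helper (which here merely prints a structured event) are hypothetical placeholders for the real structured logging.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Wraps an ordinary data manipulation step and records a provenance event for it.
public class UpperWithProvenance extends EvalFunc<String> {

    // Hypothetical stand-in for structured provenance logging; a real
    // implementation would ship the event to a separate log store.
    static void logEvent(String operation, String input, String output) {
        System.out.println("{\"operation\":\"" + operation + "\","
                + "\"input\":\"" + input + "\",\"output\":\"" + output + "\"}");
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String value = input.get(0).toString();
        String result = value.toUpperCase();
        logEvent("UpperWithProvenance", value, result); // record the pipeline semantics of this unit
        return result;
    }
}

In a Pig script, such a UDF would be registered with REGISTER and invoked in place of the built-in function it replaces, which is how the extended UDFs can substitute the original data manipulation functions without changing the data processing logic.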
Provenance is information about the origin and creation of data. In data science and engineering, such information is useful and sometimes even critical. In spite of that, provenance for big data is under-explored due to the challenges arising from the 'Vs' of big data. In data analytics, users need to query history, reproduce intermediate or final results, tune models, and adjust parameters at runtime to make data-driven decisions. In addition, users need to evaluate data and pipeline trustworthiness. Towards realising these functionalities for big data provenance, we propose a solution, called LogProv, which renovates data pipelines, or even parts of the big data software infrastructure, to generate structured logs for pipeline events, and then stores data and logs separately. The data are explicitly linked to the logs, which implicitly record pipeline semantics. Semantic information can be retrieved from the logs easily since the logs are well defined and structured beforehand. We implemented LogProv in Apache Pig and adopted ElasticSearch to provide the query service. In this paper LogProv is evaluated in a cloud-hosted Hadoop ecosystem and studied empirically through case studies. The results show that LogProv is efficient: the performance overhead is no more than 10%, queries are answered within 1 second, trustworthiness is marked clearly, and there is no impact on the data processing logic of the original pipelines.
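The following sketch illustrates the idea of a structured log event that explicitly links data to pipeline semantics and is then indexed for later querying. It is an assumption-laden illustration: the index name (logprov-events), the event fields, and the example data paths are invented for this sketch and are not LogProv's actual schema; only the document-indexing REST call follows standard ElasticSearch usage (the exact endpoint varies with the ElasticSearch version).

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ProvenanceEventSketch {

    // Build a hypothetical structured log event for one pipeline operation.
    static String buildEvent(String pipeline, String operation,
                             String inputId, String outputId) {
        return "{"
            + "\"pipeline\":\"" + pipeline + "\","
            + "\"operation\":\"" + operation + "\","
            + "\"inputData\":\"" + inputId + "\","   // explicit link to the input data
            + "\"outputData\":\"" + outputId + "\"," // explicit link to the output data
            + "\"timestamp\":" + System.currentTimeMillis()
            + "}";
    }

    public static void main(String[] args) throws Exception {
        String event = buildEvent("wordcount", "FilterStopWords",
                "hdfs:///data/raw/part-00000",
                "hdfs:///data/filtered/part-00000");
        // Index the event via the ElasticSearch REST API (assumed local instance).
        URL url = new URL("http://localhost:9200/logprov-events/_doc");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(event.getBytes("UTF-8"));
        }
        System.out.println("Indexed event, HTTP " + con.getResponseCode());
    }
}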
... Researchers have to ensure that their data are accurate, complete, and authentic. Some of the tools that are used to manage research data depend on the nature of the discipline and the type of experiment, while others are of a more general type that can be used in many research activities, such as data storage and analysis (Liu et al., 2014; Ray, 2014). ...
... Metadata include information about data provenance, a term used to describe the history of data from the moment they were created, through all their modifications in content and location. Data provenance is particularly important to scholarship today, because the digital environment makes it so easy to copy and paste data (Liu et al., 2014). ...
The exponential growth of research data and their complexity are creating significant problems for scientists and organizations. New web technologies, infrastructure, and services that are identified with umbrella terms such as eScience, Science 2.0, digital humanities, Semantic Web, and open science are changing how scientists perform their research, find information, and collaborate with each other. This chapter looks at research data from different perspectives and shows how academic libraries and other stakeholders are engaging in supporting eScience, a term that refers to large datasets and tools that facilitate the acquisition, management, and exchange of digital scientific data.
... Provenance can be employed to find useful information and can be used for the purpose of learning and knowledge discovery (Liu et al., 2013). Huynh et al. (2013) used provenance analytics to assess the quality of crowd-generated data. ...
Sustainable urban environments based on Internet of Things (IoT) technologies require appropriate policy management. However, such policies are established as the result of underlying, potentially complex and long-term policy making processes. Consequently, better policies require improved and verifiable planning processes. To assess and evaluate the planning process, transparency of the system is pivotal, and it can be achieved by tracking the provenance of the policy making process. However, at present no system is available that can track the complete cycle of urban planning and decision making. We propose to capture the complete process of policy making and to investigate the role of IoT provenance in supporting decision-making for policy analytics and implementation. The environment in which this research will be demonstrated is that of Smart Cities, whose requirements will drive the research process.
Provenance is information about the origin and creation of data. In data science and engineering related to cloud environments, such information is useful and sometimes even critical. In data analytics, making data-driven decisions requires tracing back history, reproducing final or intermediate results, and even tuning models and adjusting parameters in a real-time fashion. In particular, in the cloud, users need to evaluate data and pipeline trustworthiness. In this paper, we propose a solution, LogProv, toward realizing these functionalities for big data provenance. LogProv renovates data pipelines, or some of the big data software infrastructure, to generate structured logs for pipeline events, and then stores data and logs separately in cloud space. The data are explicitly linked to the logs, which implicitly record pipeline semantics. Semantic information can be retrieved from the logs easily since they are well defined and structured beforehand. We implemented and deployed LogProv in Nectar Cloud, working with Apache Pig and the Hadoop ecosystem, and adopted Elasticsearch to provide the query service. LogProv was evaluated and empirically case studied. The results show that LogProv is efficient: the performance overhead is no more than 10%; queries are answered within 1 second; trustworthiness is marked clearly; and there is no impact on the data processing logic of the original pipelines.
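The stored events can then be retrieved with a standard Elasticsearch search request, as in the rough sketch below; the index name and field name are the same illustrative assumptions as in the earlier sketch, not LogProv's actual query interface.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ProvenanceQuerySketch {
    public static void main(String[] args) throws Exception {
        // Standard Elasticsearch _search request; index and field names are illustrative.
        String query = "{\"query\":{\"match\":{\"operation\":\"FilterStopWords\"}}}";
        URL url = new URL("http://localhost:9200/logprov-events/_search");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(query.getBytes("UTF-8"));
        }
        // Print the JSON hits describing which operations produced which data.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}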
Agent-Based Models (ABMs) assist with studying emergent collective behavior of individual entities in social, biological, economic, network, and physical systems. Data provenance can support ABM by explaining individual agent behavior. However, there is no provenance support for ABMs in a distributed setting. The Multi-Agent Spatial Simulation (MASS) library provides a framework for simulating ABMs at fine granularity, where agents and spatial data are shared application resources in a distributed memory. We introduce a novel approach to capture ABM provenance in a distributed memory, called ProvMASS. We evaluate our technique with traditional data provenance queries and performance measures. Our results indicate that a configurable approach can capture provenance that explains coordination of distributed shared resources, simulation logic, and agent behavior while limiting performance overhead. We also show the ability to support practical analyses (e.g., agent tracking) and storage requirements for different capture configurations.
Web-based simulation services are actively used to computationally analyze various kinds of real-world phenomena, driven by advances in computing technology and the spread of networks. However, it is hard to share data and information among users of such services, because most web-based simulation services do not open up or share their simulation processing information and results. In this paper, we design a simulation provenance data sharing service on EDISON_CFD (EDucation-research Integration Simulation On the Net for Computational Fluid Dynamics) to share the calculated simulation performance information. To store and share the simulation processing information, we define the simulation processing steps as "Problem → Plan, Design → Mesh → Simulation performance → Result → Report." Users can understand a problem solving method through a computer simulation by searching the simulation performance information with the Search/Share API of the store. In addition, this opened simulation information can reduce the waste of computing resources that would otherwise be spent processing the same simulation jobs.
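A rough sketch of how the processing steps and a search over the store might look is given below; the enum values mirror the steps listed above, but the endpoint and parameter names are purely hypothetical, since the actual Search/Share API of EDISON_CFD is not specified here.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SimulationProvenanceSketch {
    // The simulation processing steps defined above.
    enum Step { PROBLEM, PLAN_DESIGN, MESH, SIMULATION_PERFORMANCE, RESULT, REPORT }

    public static void main(String[] args) throws Exception {
        // Hypothetical search call against the provenance store; the endpoint and
        // parameter names are illustrative, not the actual EDISON_CFD API.
        String step = Step.MESH.name();
        URL url = new URL("http://example.org/edison/search?step="
                + URLEncoder.encode(step, "UTF-8"));
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // shared simulation performance records
            }
        }
    }
}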