Compressing Provenance Graphs
Yulai Xie†‡, Kiran-Kumar Muniswamy-Reddy§, Darrell D. E. Long
Ahmed Amer, Dan Feng, Zhipeng Tan
†Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics
‡University of California, Santa Cruz    §Harvard University    Santa Clara University
The provenance community has built a number of systems to collect provenance, most of which assume that provenance will be retained indefinitely. However, it is not cost-effective to retain provenance information inefficiently. Since provenance can be viewed as a graph, we note the similarities to web graphs and draw upon techniques from the web compression domain to provide our own novel and improved graph compression solutions for provenance graphs. Our preliminary results show that adapting web compression techniques results in a compression ratio of 2.12:1 to 2.71:1, which we can improve upon to reach ratios of up to 3.31:1.
1 Introduction
Provenance, though extremely valuable, can take up substantial storage space. For instance, in the PReServ [9] provenance store, the original data was 100 KB while the provenance reached 1 MB. For MiMI [10], an online protein database, the provenance expands to 6 GB while the base data is only 270 MB. Similar results are observed in other systems [3, 11]. This makes provenance data an increasing overhead on the storage subsystem.

A provenance dataset can be represented as a provenance graph [12]. Thus, an efficient representation of provenance graphs can fundamentally speed up provenance queries, whereas inefficient multi-GB provenance graphs will not fit in limited memory and dramatically reduce query efficiency.
We propose to adapt web compression algorithms to compress provenance graphs. Our motivation comes from our observation that provenance graphs and web graphs have similar structure and some common essential characteristics, i.e., similarity, locality and consecutiveness. We have further discovered that provenance graphs have their own special features, and we propose two new techniques that exploit them to improve provenance graph compression.

We test our hypothesis by compressing the provenance graphs generated by the PASS [3] system. Our results show that web compression algorithms can compress such graphs, and that our improved algorithms can further raise the compression ratio up to 3.31:1.
2 Web & Provenance Graphs
A web graph is a directed graph where each URL is represented as a node, and a link from one page to another is a directed edge. There are three key properties exploited by current web graph compression algorithms:

Locality: Few links cross URL domains; the vast majority point to pages within the same domain, so a page's successors tend to have node numbers close to its own.

Similarity: Pages that are not far from each other have common neighbors with high probability.

Consecutiveness: The node numbers of the successors of a page are often in sequential order.
Provenance graphs have a similar organizational structure and similar characteristics to web graphs. Figure 1 shows the conversion from a snapshot of a NetBSD provenance trace (generated by the PASS system [3]) to an adjacency list that represents the provenance graph. The notation "A INPUT [ANC] B" in the provenance trace means that B is an ancestor of A, indicating that there exists a directed edge from A to B. In this way, a provenance graph is also a directed graph, and each node (e.g., node 2 or 3) has a series of out-neighbors.

Provenance nodes 2 and 3 are similar, as they have common successors in the form of nodes 4, 7, 9 and 14. The reason for this is that many header files or library
2.0 NAME /bin/cp
2.1 INPUT [ANC] 4.0
2.1 INPUT [ANC] 5.0
2.1 INPUT [ANC] 6.0
2.1 INPUT [ANC] 7.0
2.1 INPUT [ANC] 8.0
2.1 INPUT [ANC] 9.0
2.1 INPUT [ANC] 10.0
2.1 INPUT [ANC] 11.0
2.1 INPUT [ANC] 12.0
2.1 INPUT [ANC] 13.0
2.1 INPUT [ANC] 14.0
2.1 INPUT [ANC] 15.0
3.0 NAME /disk/scripts/bulkbuild
3.1 INPUT [ANC] 4.0
3.1 INPUT [ANC] 7.0
3.1 INPUT [ANC] 9.0
3.1 INPUT [ANC] 14.0
3.1 INPUT [ANC] 17.0
3.1 INPUT [ANC] 18.0
3.1 INPUT [ANC] 19.0
3.1 INPUT [ANC] 20.0
3.1 INPUT [ANC] 21.0
Node  Out-degree  Successors
2     12          4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
3     9           4, 7, 9, 14, 17, 18, 19, 20, 21

(Provenance trace → provenance graph → adjacency list)
Figure 1: Mapping from the provenance trace to the adjacency list that represents the provenance graph. The expression "2.1 INPUT [ANC] 4.0" indicates that node 4 is an ancestor of node 2, resulting in a directed edge from node 2 to node 4. The figure also shows that provenance graphs exhibit the same characteristics (i.e., similarity, locality and consecutiveness) as web graphs.
files that are represented as nodes like 4, 7, 9 and 14 are repeatedly used as input by many processes (e.g., nodes 2 and 3). Nodes 2 and 3 also exhibit locality: the successors of node 2 fall only between 4 and 15, and the successors of node 3 only between 4 and 21. This is because many header files or library files (e.g., the successors of nodes 2 and 3) that are used as input by a process are probably in the same directory, so the ordering (and therefore the assigned numbers) of these files is usually close together. Consecutiveness is also clear in the successors of nodes 2 and 3, which run from 4 to 15 and from 17 to 21 respectively. Consecutiveness arises because a process or file may be generated from many files that do not belong to a PASS volume, and such files are usually numbered sequentially.
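The trace-to-adjacency-list mapping in Figure 1 can be sketched in a few lines. This is an illustrative reconstruction: the exact PASS record grammar is assumed from the figure, and the function name is ours, not part of PASS.

```python
import re
from collections import defaultdict

def parse_pass_trace(lines):
    """Build an adjacency list from PASS-style trace records.

    A line of the form "2.1 INPUT [ANC] 4.0" adds a directed edge
    from node 2 to node 4; the number before the dot is taken as
    the node id, and NAME records are ignored here.
    """
    edge_re = re.compile(r"^(\d+)\.\d+\s+INPUT\s*\[ANC\]\s+(\d+)\.\d+$")
    graph = defaultdict(list)
    for line in lines:
        m = edge_re.match(line.strip())
        if m:
            src, dst = int(m.group(1)), int(m.group(2))
            graph[src].append(dst)
    return dict(graph)

trace = [
    "2.0 NAME /bin/cp",
    "2.1 INPUT [ANC] 4.0",
    "2.1 INPUT [ANC] 5.0",
    "3.0 NAME /disk/scripts/bulkbuild",
    "3.1 INPUT [ANC] 4.0",
]
print(parse_pass_trace(trace))  # {2: [4, 5], 3: [4]}
```

Each node's successor list comes out in trace order, which for PASS traces is already close to sorted, matching the locality and consecutiveness properties discussed above.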
3 Related Work
Barga et al. [1] presented a layered provenance model for workflow systems that can efficiently reduce provenance size by minimizing repeated operation information during provenance generation. This is similar to our approach, which exploits similarity between the neighbors of different nodes representing the same manipulation. Chapman et al. [4] proposed two classes of algorithms to compress provenance graphs: provenance factorization and provenance inheritance. The former aims to find common subtrees between different nodes, and the latter focuses on finding similarities between data items that have ancestry relationships or belong to a particular type. These algorithms achieve good compression ratios, but can only be applied to a flat data model, i.e., one where each node has complete provenance, and therefore cannot be used to compress provenance generated by a provenance system such as PASS [3]. Our methods are more general and hence applicable to a wider range of systems.
There has been considerable work [5, 6, 7, 8] in the domain of web graph compression. Adler et al. [5] proposed to use reference compression to compress web graphs. Randall et al. [6] suggested a set of techniques, such as delta codes and variable-length bit encoding, to compress a database providing fast access to web hyperlinks. The critical web graph compression framework was presented by Boldi and Vigna [7], who obtained good compression performance by fully exploiting the locality and similarity of web pages. The algorithm we use is based on this framework.

There are also classical techniques such as LZ-based compression algorithms [2]. These techniques provide an upper bound on the compression that is possible. However, since they do not preserve the structure of the data, the resulting compressed graph is not amenable to querying.
4 Compression Algorithms
Three critical ideas lie behind web compression algorithms [7]. First, encode the successor list of one node by using the similar successor list of another node as a reference, thus efficiently avoiding encoding duplicate data. Second, encode runs of consecutive numbers by recording only the start number and the run length, reducing the number of successors to be encoded. Third, encode the gaps between the successors of a node rather than the successors themselves, which typically requires fewer bits.

Original successor lists:
Node  Out-degree  Successors
15    5           3, 11, 13, 14, 17
16    7           11, 14, 19, 20, 21, 31, 33

After reference compression (node 15 is the reference for node 16):
Node  Out-degree  Bit list  Extra nodes
15    5           -         3, 11, 13, 14, 17
16    7           01010     19, 20, 21, 31, 33

After finding consecutive numbers (left extreme, length):
Node  Out-degree  Bit list  Run      Remaining
15    5           -         (13, 2)  3, 11, 17
16    7           01010     (19, 3)  31, 33

After encoding gaps:
Node  Out-degree  Bit list  Run      Gaps
15    5           -         (13, 2)  -12, 8, 6
16    7           01010     (19, 3)  15, 2

Figure 2: An example of the web compression algorithm.
As we can see in Figure 1, a provenance graph can be represented by a set of provenance nodes, each of which has a series of ancestors as its successors. The provenance nodes are numbered from 0 to N−1, in order, during provenance generation. We use Out(x) to denote the successor list of node x. The web compression algorithm encodes this list as follows:

1. Reference compression: Find the node with the most similar successor list among the preceding W successor lists, where W is a window parameter. Let y be the chosen reference node and Out(y) its successor list (called the reference list). The encoding of Out(x) is then divided into two parts: a bit list that identifies the common successors of Out(x) and Out(y), and extra nodes that identify the remaining successors in Out(x).

2. Find consecutive numbers: Separate the runs of consecutive numbers from the extra nodes, and represent each run by its left extreme and its length.

3. Encode gaps: Let x_1, x_2, x_3, ..., x_k be the successors of x that have not been encoded in the steps above. If x_1 ≤ x_2 ≤ ... ≤ x_k, encode them as x_1 − x, x_2 − x_1, ..., x_k − x_{k−1}.
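The three steps above can be sketched as a single encoder. This is a minimal illustration of the scheme rather than the paper's actual implementation; the function name, the similarity measure (size of the successor-set intersection), and the output dictionary format are our own assumptions.

```python
def encode_node(x, succ, window_lists):
    """Encode successor list `succ` of node x using reference
    compression, consecutive-run extraction, and gap encoding.
    `window_lists` maps each candidate reference node (the
    preceding W nodes) to its successor list."""
    # Step 1: pick the reference with the most common successors.
    ref, ref_succ, best = None, [], 0
    for y, ys in window_lists.items():
        common = len(set(succ) & set(ys))
        if common > best:
            ref, ref_succ, best = y, ys, common
    if ref is not None:
        bits = [1 if s in set(succ) else 0 for s in ref_succ]
        extra = [s for s in succ if s not in set(ref_succ)]
    else:
        bits, extra = [], list(succ)

    # Step 2: pull out runs of consecutive numbers as (start, length).
    runs, rest, i = [], [], 0
    while i < len(extra):
        j = i
        while j + 1 < len(extra) and extra[j + 1] == extra[j] + 1:
            j += 1
        if j > i:
            runs.append((extra[i], j - i + 1))
        else:
            rest.append(extra[i])
        i = j + 1

    # Step 3: gap-encode what is left: the first value relative to x,
    # each following value relative to its predecessor.
    gaps, prev = [], x
    for s in rest:
        gaps.append(s - prev)
        prev = s
    return {"ref": ref, "bits": bits, "runs": runs, "gaps": gaps}
```

Run on the Figure 2 example, node 15 (no reference) yields run (13, 2) and gaps −12, 8, 6, and node 16 (reference 15) yields bit list 01010, run (19, 3) and gaps 15, 2, matching the figure.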
Figure 2 shows an example of these three steps. In this example, node 15 is the reference for node 16 and has no reference itself. The case in Figure 1 is simpler: the successor list of node 2 can be encoded using step 2 alone, and the successor list of node 3 can be encoded using steps 1 and 2, with node 2 serving as the reference list for node 3.
[Figure 3: Name-identified reference list — a table of provenance nodes, their names, and their successor lists; nodes sharing a name (e.g., nodes 18 and 23, both named /usr/bin) have largely overlapping successor lists such as 3, 11, 13, 14, 17 and 3, 11, 13, 14, 18.]
Our Improved Approach
We now describe two improvements beyond existing
web compression algorithms. These are motivated by
the observed properties of datasets generated by the
PASS [3] system.
(a) Name-identified reference list: Web compression algorithms find the most similar reference list among the preceding W nodes. The greater the similarity, the better the compression achieved. However, it is sometimes hard to find a very similar reference list with a small W; and while a larger W would produce a better compression ratio, because it enlarges the scope of possible reference lists, this comes at the expense of slower compression and decompression.
We have found that many provenance datasets record the names of provenance nodes, and nodes with the same name usually have a large set of common successors. For instance, in PASS provenance traces, a process that is represented as a provenance node may be scheduled to execute many times, and each time it is scheduled it will use many of the same header files or library files as input; these are the common successors of the process nodes with the same name. We therefore propose to use the name as an indicator to help find similar reference lists.

Figure 3 shows how this technique functions in an example. When node 18 with the name /usr/bin first appears, the algorithm finds a reference list among the preceding W (here W = 3) nodes. But when node 23 with the name /usr/bin appears, we take the successor list of node 18 as the reference list rather than searching the window. In this algorithm, we need only a hash table to identify the nodes with the same name: each time we encode the successor list of a node, we simply check whether its name is in the hash table, and need not scan backwards through the window (especially beneficial when W is large). This incurs only a minimal time overhead on compression and decompression.
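The reference-selection logic can be sketched as follows. The class and method names are illustrative, not part of PASS; the window fallback reuses the intersection-size similarity measure from the basic algorithm.

```python
class NameIndexedEncoder:
    """Sketch of the name-identified reference list idea: on the
    first appearance of a name, fall back to the usual window
    search; on later appearances, reuse the successor list of the
    previous node with the same name as the reference."""

    def __init__(self, window=10):
        self.window = window
        self.by_name = {}     # name -> id of the last node with that name
        self.succ_lists = {}  # node id -> successor list
        self.order = []       # node ids in encoding order

    def choose_reference(self, node, name, succ):
        ref = None
        if name in self.by_name:
            # O(1) hash lookup instead of scanning the window.
            ref = self.by_name[name]
        else:
            # Fallback: most similar list among the preceding W nodes.
            best = 0
            for y in self.order[-self.window:]:
                common = len(set(succ) & set(self.succ_lists[y]))
                if common > best:
                    best, ref = common, y
        self.by_name[name] = node
        self.succ_lists[node] = list(succ)
        self.order.append(node)
        return ref
```

On the Figure 3 scenario, the first /usr/bin node (18) triggers a window search, while the second (23) is immediately paired with node 18, regardless of how far apart the two nodes are.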
Table 2: Performance of the web compression algorithm and the improved algorithms incorporating the two new techniques, for different W on the various trace workloads

        NetBSD                    am-utils                  blast-lite
W       web   web-name  web-gap   web   web-name  web-gap   web   web-name  web-gap
1       2.00  2.62      2.27      1.60  2.57      1.86      1.66  1.98      1.80
5       2.14  2.62      2.31      1.73  2.57      2.02      1.96  1.98      2.16
10      2.32  2.63      2.69      1.87  2.57      2.23      2.03  2.02      2.25
100     2.71  2.63      3.23      2.58  2.58      3.31      2.12  2.08      2.38
Original successor lists:
Node  Out-degree  Successors
155   1           1325
157   1           1326
158   2           1329, 1331
159   1           1333

After current web compression (only node 158 is gap-encoded):
Node  Out-degree  Encoding
158   2           1171, 2

After node-crossing gap encoding:
Node  Out-degree  Encoding
157   1           2, 1
158   2           1171, 2
159   1           2, 7

Figure 4: Node-crossing gap encoding.
Table 1: The properties of the provenance traces

Trace       Nodes    Size
NetBSD      140146   60.2 MB
am-utils    46100    24.9 MB
blast-lite  240      68 kB
(b) Node-crossing gap: Current web compression algorithms exploit graph locality by encoding the gaps between the successors of a node to improve the compression ratio. However, in PASS provenance traces we find that many provenance nodes have only one successor, and such successors cannot be encoded by current web compression techniques. We therefore propose to also exploit the gaps between the successors of different nodes. Figure 4 compares the gap encoding used by current web compression algorithms with our node-crossing gap encoding. Current encodings can only encode the successors of node 158 (1329 − 158 = 1171, 1331 − 1329 = 2), while our node-crossing approach can further encode the successors of node 157 (157 − 155 = 2, 1326 − 1325 = 1) and node 159 (159 − 157 = 2, 1333 − 1326 = 7).

Note that the successor of node 155 is not encoded as 1325 − 155 = 1170 under the current web compression algorithm, which mainly exploits the locality among the successors of a single node; our improved approach further exploits the locality between the successors of different nodes.
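For the single-successor nodes of Figure 4, the node-crossing idea can be sketched as below. This is only an illustration of the encoding arithmetic under our own assumptions about the format (pairs of node id and successor); the paper does not specify an on-disk layout, and multi-successor nodes such as 158 would still use the ordinary within-node gap encoding.

```python
def node_crossing_gaps(nodes):
    """Gap-encode single-successor nodes across node boundaries.

    `nodes` is a list of (node_id, successor) pairs in node order.
    The first pair is kept verbatim; every later pair is encoded as
    (node_id gap from the previous node, successor gap from the
    previous node's successor).
    """
    out = []
    prev_id = prev_succ = None
    for nid, succ in nodes:
        if prev_id is None:
            out.append((nid, succ))  # anchor: first node kept as-is
        else:
            out.append((nid - prev_id, succ - prev_succ))
        prev_id, prev_succ = nid, succ
    return out
```

Applied to nodes 155, 157 and 159 from Figure 4, this produces (155, 1325), (2, 1) and (2, 7): exactly the small values that within-node gap encoding cannot reach.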
5 Evaluation
Our experimental datasets are generated by the PASS system and drawn from different applications:

1. NetBSD trace: a build of several components of NetBSD.

2. Compilation workload trace: a compilation of am-utils.

3. Scientific trace: a simple instance of the blast biological workload.

Table 1 summarizes the number of nodes and the size of these provenance traces.
Table 2 shows the compression performance for different W on the various trace workloads. "Web" means compressing the provenance graph with the current web compression algorithm; "web-name" means the web compression algorithm incorporating the name-identified reference list technique; and "web-gap" means the web compression algorithm incorporating node-crossing gap encoding.

For all three trace workloads, the web compression algorithm performs better as W increases, because a bigger window increases the likelihood of finding similar reference lists. We also find that, for all workloads, the algorithm with name-identified reference lists exhibits very stable performance, e.g., 2.62–2.63:1 for the NetBSD trace. With our name-identification approach, window size has limited impact, and the approach performs considerably better than basic web compression, which requires a much larger window to match it.
The algorithm with node-crossing gap encoding also outperforms web compression, and even outperforms our name-identification approach as the window size increases: at W = 10 and W = 100 for the NetBSD trace, at W = 100 for am-utils, and at W = 5, 10 and 100 for blast-lite. We also note that node-crossing gap encoding achieves the best compression ratio in all cases when W = 100.
6 Summary and Future Work
We successfully compressed provenance graphs by adapting techniques from web graph compression, achieving compression ratios of 2.12:1 to 2.71:1. We also introduced two new techniques based on provenance graph properties, and demonstrated both increased compression ratios (by up to 28%) and improved computational efficiency.

In the future, we intend to evaluate the impact of web compression on provenance query performance. In addition, we plan to compare our web compression algorithms with LZ-based compression algorithms and with a hybrid scheme combining LZ-based and web-compression-based schemes.
This work was supported in part by the National
Basic Research 973 Program of China under Grant
No. 2011CB302300, 863 project 2009AA01A401/2,
NSFC No. 61025008, 60933002, 60873028, Changjiang
innovative group of Education of China No. IRT0725.
This material is based upon work supported in part by:
the Department of Energy under Award Number DE-
FC02-10ER26017/DE-SC0005417, the Department of
Energy’s Petascale Data Storage Institute (PDSI) un-
der Award Number DE-FC02-06ER25768, and the Na-
tional Science Foundation under awards CCF-0937938
and IIP-0934401 (I/UCRC Center for Research in Intel-
ligent Storage).
This report was prepared as an account of work sponsored by an
agency of the United States Government. Neither the United States
Government nor any agency thereof, nor any of their employees, makes
any warranty, express or implied, or assumes any legal liability or re-
sponsibility for the accuracy, completeness, or usefulness of any in-
formation, apparatus, product, or process disclosed, or represents that
its use would not infringe privately owned rights. Reference herein to
any specific commercial product, process, or service by trade name,
trademark, manufacturer, or otherwise does not necessarily constitute
or imply its endorsement, recommendation, or favoring by the United
States Government or any agency thereof. The views and opinions of
authors expressed herein do not necessarily state or reflect those of the
United States Government or any agency thereof.
References

[1] R. S. Barga and L. A. Digiampietri. Automatic capture and efficient storage of eScience experiment provenance. Concurrency and Computation: Practice and Experience, 2007.

[2] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. on Information Theory, 23(3):337–343, 1977.

[3] K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. I. Seltzer. Provenance-aware storage systems. Proc. USENIX Annual Tech. Conf., 2006.

[4] A. P. Chapman, H. V. Jagadish, and P. Ramanan. Efficient provenance storage. Proc. SIGMOD, 2008.

[5] M. Adler and M. Mitzenmacher. Towards compressing web graphs. Proc. IEEE Data Compression Conf., 2001.

[6] K. Randall, R. Wickremesinghe, and J. Wiener. The link database: Fast access to graphs of the Web. Research Report 175, Compaq Systems Research Center, Palo Alto, CA, 2001.

[7] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. Proc. 13th WWW, 2004.

[8] T. Suel and J. Yuan. Compressing the graph structure of the web. Proc. IEEE Data Compression Conf., 2001.

[9] P. Groth, S. Miles, W. Fang, S. C. Wong, K. Zauner, and L. Moreau. Recording and using provenance in a protein compressibility experiment. Proc. HPDC, 2005.

[10] M. Jayapandian, A. P. Chapman, V. G. Tarcea, C. Yu, A. Elkiss, A. Ianni, B. Liu, A. Nandi, C. Santos, P. Andrews, B. Athey, D. States, and H. V. Jagadish. Michigan Molecular Interactions (MiMI): Putting the jigsaw puzzle together. Nucleic Acids Research, 2007.

[11] Y. Simmhan, B. Plale, and D. Gannon. A framework for collecting provenance in data-centric scientific workflows. Proc. ICWS, 2006.

[12] D. A. Holland, U. Braun, D. Maclean, K.-K. Muniswamy-Reddy, and M. I. Seltzer. Choosing a data model and query language for provenance. Proc. 2nd Int'l Provenance and Annotation Workshop, 2008.
... Add cand to candidate set (7) End For (8) selectedSet = cand from candidateSet with top percent sizedistance performance (9) For every cand in selectedSet Do (10) If candScoreWithScoreIBI( 0 , cand ) is optimal Then (11) prev = (12) = cand (13) End if (14) End For (15) End While (17) If dist( 0 , , Ann ) > TDIST Then (17) return prev (18) End If (19) return keep performing the mapping of annotations and reducing the size of provenance annotations set and stop when TSIZE is reached or the distance exceeds TDIST. ...
... Add cand to candidate set (7) End For (8) selectedSet = cand from candidateSet with top percent sizedistance performance (9) For every cand in selectedSet Do (10) If candScoreWithScoreIBI( 0 , cand ) is optimal Then (11) prev = (12) = cand (13) End if (14) End For (15) End While (17) If dist( 0 , , Ann ) > TDIST Then (17) return prev (18) End If (19) return keep performing the mapping of annotations and reducing the size of provenance annotations set and stop when TSIZE is reached or the distance exceeds TDIST. ...
... In [16], an interactive way for exploring large provenance graph has been proposed to control the complexities presented to the users. In [17] the authors proposed to compress provenance graphs in a lossless manner so as to reduce spatial cost. In [18], abstract provenance graphs have been proposed to provide a homomorphic view of the provenance data to help users spot useful information. ...
Full-text available
Extracting useful knowledge from data provenance information has been challenging because provenance information is often overwhelmingly enormous for users to understand. Recently, it has been proposed that we may summarize data provenance items by grouping semantically related provenance annotations so as to achieve concise provenance representation. Users may provide their intended use of the provenance data in terms of provisioning, and the quality of provenance summarization could be optimized for smaller size and closer distance between the provisioning results derived from the summarization and those from the original provenance. However, apart from the intended provisioning use, we notice that more dedicated and diverse user requirements can be expressed and considered in the summarization process by assigning importance weights to provenance elements. Moreover, we introduce information balance index (IBI), an entropy based measurement, to dynamically evaluate the amount of information retained by the summary to check how it suits user requirements. An alternative provenance summarization algorithm that supports manipulation of information balance is presented. Case studies and experiments show that, in summarization process, information balance can be effectively steered towards user-defined goals and requirement-driven variants of the provenance summarizations can be achieved to support a series of interesting scenarios.
... A large number of studies [1], [4], [5], [6], [7], [8], [9], [10], [11] focus on provenance compression. For instance, Chapman et al. [5] proposed a series of decomposition and inheritance methods for compressing provenance. ...
... There are some studies on provenance management in specific areas. For provenance compression, Xie et al. [7] explored the similarity and locality of provenance nodes and designed a hybrid compression algorithm that combines WEB compression with dictionary encoding. Zhang et al. [10] proposed CPR and PCAR which can aggregate the same type of events with the same attribute and reduce data effectively. ...
Full-text available
Provenance is a type of metadata that records the creation and transformation of data objects. It has been applied to a wide variety of areas such as security, search, and experimental documentation. However, provenance usually has a vast amount of data with its rapid growth rate which hinders the effective extraction and application of provenance. This paper proposes an efficient provenance management system via clustering and hybrid storage. Specifically, we propose a Provenance-Based Label Propagation Algorithm which is able to regularize and cluster a large number of irregular provenance. Then, we use separate physical storage mediums, such as SSD and HDD, to store hot and cold data separately, and implement a hot/cold scheduling scheme which can update and schedule data between them automatically. Besides, we implement a feedback mechanism which can locate and compress the rarely used cold data according to the query request. The experimental test shows that the system can significantly improve provenance query performance with a small run-time overhead.
... Colleagues working with CamFlow have achieved a storage size of 8% of the equivalent PROV-JSON graph size through compression techniques [89]. Others have reported on alternative ways to reduce storage requirements through compression [22,93,94], but such concerns are orthogonal to the work presented here. Regardless of the storage techniques employed, the size of the graph grows over time, and this makes query time grow proportionally. ...
Conference Paper
Full-text available
Data provenance describes how data came to be in its present form. It includes data sources and the transformations that have been applied to them. Data provenance has many uses, from forensics and security to aiding the reproducibility of scientific experiments. We present CamFlow, a whole-system provenance capture mechanism that integrates easily into a PaaS offering. While there have been several prior whole-system provenance systems that captured a comprehensive, systemic and ubiquitous record of a system's behavior, none have been widely adopted. They either A) impose too much overhead, B) are designed for long-outdated kernel releases and are hard to port to current systems, C) generate too much data, or D) are designed for a single system. CamFlow addresses these shortcoming by: 1) leveraging the latest kernel design advances to achieve efficiency; 2) using a self-contained, easily maintainable implementation relying on a Linux Security Module, NetFilter, and other existing kernel facilities ; 3) providing a mechanism to tailor the captured provenance data to the needs of the application; and 4) making it easy to integrate provenance across distributed systems. The provenance we capture is streamed and consumed by tenant-built auditor applications. We illustrate the usability of our implementation by describing three such applications: demonstrating compliance with data regulations; performing fault/intrusion detection; and implementing data loss prevention. We also show how CamFlow can be leveraged to capture meaningful provenance without modifying existing applications.
Transportation and distribution (T8D) of fresh food products is a substantial and increasing part of the economic activities throughout the world. Unfortunately, fresh food T8D not only suffers from significant spoilage and waste, but also from dismal efficiency due to tight transit timing constraints between the availability of harvested food until its delivery to the retailer. Fresh food is also easily contaminated, and together with deteriorated fresh food is responsible for much of food-borne illnesses. The logistics operations are undergoing rapid transformation on multiple fronts, including infusion of information technology in the logistics operations, automation in the physical product handling, standardization of labeling, addressing and packaging, and shared logistics operations under 3rd party logistics (3PL) and related models. In this article, we discuss how these developments can be exploited to turn fresh food logistics into an intelligent cyberphysical system driven by online monitoring and associated operational control to enhance food freshness and safety, reduce food waste, and increase T8D efficiency. Some of the issues discussed in this context are fresh food quality deterioration processes, food quality/contamination sensing technologies, communication technologies for transmitting sensed data through the challenging fresh food media, intelligent management of the T8D pipeline, and various other operational issues. The purpose of this article is to stimulate further research in this important emerging area that lies at the intersection of computing and logistics.
Users and Operating Systems (OSs) have vastly different views of files. OSs use files to persist data and structured information. To accomplish this, OSs treat files as named collections of bytes managed in hierarchical file systems. Despite their critical role in computing, little attention is paid to the lifecycle of the file, the evolution of file contents, or the evolution of file metadata. In contrast, users have rich mental models of files: they group files into projects, send data repositories to others, work on documents over time, and stash them aside for future use. Current OSs and Revision Control Systems ignore such mental models, persisting a selective, manually designated history of revisions. Preserving the mental model allows applications to better match how users view their files, making file processing and archiving tools more effective. We propose two mechanisms that OSs can adopt to better preserve the mental model: File Lifecycle Events (FLEs) that record a file’s progression and Complex File Events (CFEs) that combine them into meaningful patterns. We present the Complex File Events Engine (CoFEE), which uses file system monitoring and an extensible rulebase (Drools) to detect FLEs and convert them into complex ones. CFEs are persisted in NoSQL stores for later querying.
Full-text available
Keeping track of lifecycle history and information origins are important because that is the only key issue to confirm information probative value and integrity. Provenance information management (Data Provenance) has recently been gained significant interest in computer security specifically in the domain of distributed environment. This is due partly to the increased in digital facts (complexity and size) and it widely spread usage in other fields of study, such as medicine, government, commerce, and science. Although its applications and use bring numerous benefit to scholars and practitioners, challenges still exist in its lifecycle history and information management. In this paper, we present a comprehensive survey of potentially influential components on provenance collection and its security, which enables us to investigate and summarize the current trends, methods and techniques addressing the aforementioned elements. This review provides more insight for researchers, system developers and engineers working on data, data mining and information security. More than 90 research publications on provenance collection and security have been examined, classified and listed for reference. © 2018
Conference Paper
The increasing proliferation of the Internet of Things (IoT) devices and systems result in large amounts of highly heterogeneous data to be collected. Although at least some of the collected sensor data is often consumed by the real-time decision making and control of the IoT system, that is not the only use of such data. Invariably, the collected data is stored, perhaps in some filtered or downselected fashion, so that it can be used for a variety of lower-frequency operations. It is expected that in a smart city environment with numerous IoT deployments, the volume of such data can become enormous. Therefore, mechanisms for lossy data compression that provide a trade-off between compression ratio and data usefulness for offline statistical analysis becomes necessary. In this paper, we discuss several simple pattern mining based compression strategies for multi-attribute IoT data streams. For each method, we evaluate the compressibility of the method vs. the level of similarity between original and compressed time series in the context of the home energy management system.
Provenance is an increasingly important tool for understanding and even actively preventing system intrusion, but the excessive storage burden imposed by automatic provenance collection threatens to undermine its value in practice. This situation is made worse by the fact that the majority of this metadata is unlikely to be of interest to an administrator, instead describing system noise or other background activities that are not germane to the forensic investigation. To date, storing data provenance in perpetuity was a necessary concession in even the most advanced provenance tracking systems in order to ensure the completeness of the provenance record for future analyses. In this work, we overcome this obstacle by proposing a policy-based approach to provenance filtering, leveraging the confinement properties provided by Mandatory Access Control (MAC) systems in order to identify and isolate subdomains of system activity for which to collect provenance. We introduce the notion of minimal completeness for provenance graphs, and design and implement a system that provides this property by exclusively collecting provenance for the trusted computing base of a target application. In evaluation, we discover that, while the efficacy of our approach is domain dependent, storage costs can be reduced by as much as 89% in critical scenarios such as provenance tracking in cloud computing data centers. To the best of our knowledge, this is the first policy-based provenance monitor to appear in the literature.
The increasing ability of the earth sciences to sense the world around us is resulting in a growing need for data-driven applications that are under the control of data-centric workflows composed of grid and web services. The focus of our work is on provenance collection for these workflows, necessary to validate the workflow and to determine the quality of generated data products. The challenge we address is to record uniform and usable provenance metadata that meets the domain needs while minimizing the modification burden on the service authors and the performance overhead on the workflow engine and the services. The framework, based on a loosely-coupled publish-subscribe architecture for propagating provenance activities, satisfies the needs of detailed provenance collection, while a performance evaluation of a prototype finds a minimal performance overhead (in the range of 1% for an eight-service workflow using 271 data products).
As the world is increasingly networked and digitized, the data we store has more and more frequently been chopped, baked, diced and stewed. In consequence, there is an increasing need to store and manage provenance for each data item stored in a database, describing exactly where it came from, and what manipulations have been applied to it. Storage of the complete provenance of each data item can become prohibitively expensive. In this paper, we identify important properties of provenance that can be used to considerably reduce the amount of storage required. We identify three different techniques: a family of factorization processes and two methods based on inheritance, to decrease the amount of storage required for provenance. We have used the techniques described in this work to significantly reduce the provenance storage costs associated with constructing MiMI [22], a warehouse of data regarding protein interactions, as well as two provenance stores, Karma [31] and PReServ [20], produced through workflow execution. In these real provenance sets, we were able to reduce the size of the provenance by up to a factor of 20. Additionally, we show that this reduced store can be queried efficiently and further that incremental changes can be made inexpensively.
Protein interaction data exists in a number of repositories. Each repository has its own data format, molecule identifier and supplementary information. Michigan Molecular Interactions (MiMI) assists scientists searching through this overwhelming amount of protein interaction data. MiMI gathers data from well-known protein interaction databases and deep-merges the information. Utilizing an identity function, molecules that may have different identifiers but represent the same real-world object are merged. Thus, MiMI allows the users to retrieve information from many different databases at once, highlighting complementary and contradictory information. To help scientists judge the usefulness of a piece of data, MiMI tracks the provenance of all data. Finally, a simple yet powerful user interface aids users in their queries, and frees them from the onerous task of knowing the data format or learning a query language. MiMI allows scientists to query all data, whether corroborative or contradictory, and specify which sources to utilize. MiMI is part of the National Center for Integrative Biomedical Informatics (NCIBI) and is publicly available at:
Very large scale computations are now becoming routinely used as a methodology to undertake scientific research. In this context, 'provenance systems' are regarded as the equivalent of the scientist's logbook for in silico experimentation: provenance captures the documentation of the process that led to some result. Using a protein compressibility analysis application, we derive a set of generic use cases for a provenance system. In order to support these, we address the following fundamental questions: what is provenance? How to record it? What is the performance impact for grid execution? What is the performance of reasoning? In doing so, we define a technology-independent notion of provenance that captures interactions between components, internal component information and grouping of interactions, so as to allow us to analyze and reason about the execution of scientific processes. In order to support persistent provenance in heterogeneous applications, we introduce a separate provenance store, in which provenance documentation can be stored, archived and queried independently of the technology used to run the application. Through a series of practical tests, we evaluate the performance impact of such a provenance system. In summary, we demonstrate that provenance recording overhead of our prototype system remains under 10% of execution time, and we show that the recorded information successfully supports our use cases in a performant manner.
The Connectivity Server is a special-purpose database whose schema models the Web as a graph: a set of nodes (URLs) connected by directed edges (hyperlinks). The Link Database provides fast access to the hyperlinks. To support easy implementation of a wide range of graph algorithms, we have found it important to fit the Link Database into RAM. In the first version of the Link Database, we achieved this fit by using machines with lots of memory (8 GB) and storing each hyperlink in 32 bits. However, this approach was limited to roughly 100 million Web pages. This paper presents techniques to compress the links to accommodate larger graphs. Our techniques combine well-known compression methods with methods that depend on the properties of the Web graph. The first compression technique takes advantage of the fact that most hyperlinks on most Web pages point to other pages on the same host as the page itself. The second technique takes advantage of the fact that many pages on the same host share hyperlinks, that is, they tend to point to a common set of pages. Together, these techniques reduce space requirements to under 6 bits per link. While (de)compression adds latency to the hyperlink access time, we can still compute the strongly connected components of a 6 billion-edge graph in 22 minutes and run applications such as Kleinberg's HITS in real time. This paper describes our techniques for compressing the Link Database, and provides performance numbers for compression ratios and decompression speed.
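The same-host locality that the abstract above exploits is commonly captured by sorting each adjacency list and storing gaps between consecutive link targets rather than the targets themselves; nearby node IDs yield small gaps that a variable-length code can store in a few bits. A minimal sketch of this idea (the function names are ours for illustration, not the Link Database's actual API):

```python
def delta_encode(neighbors):
    """Encode an adjacency list as its first node ID followed by gaps.

    When most links point to pages with nearby IDs (e.g. the same
    host), the gaps are small integers that compress well under a
    variable-length code.
    """
    out, prev = [], 0
    for n in sorted(neighbors):
        out.append(n - prev)  # gap to the previous (smaller) target
        prev = n
    return out


def delta_decode(gaps):
    """Invert delta_encode by accumulating the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out
```

For example, the adjacency list `[1000, 1001, 1003, 2000]` becomes `[1000, 1, 2, 997]`: three of the four stored values now fit in far fewer bits than a raw 32-bit node ID.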
Workflow is playing an increasingly important role in conducting e-Science experiments, but most commercial systems lack the necessary support for the collection and management of provenance data. We argue that eScience provenance data should be automatically generated by the workflow enactment engine and managed over time by an underlying storage service. In this paper, we introduce a layered model for workflow execution provenance, which allows navigation from an abstract model of the experiment to instance data collected during a specific experiment run. We outline modest extensions to a commercial workflow engine so it will automatically capture this provenance data at runtime. We then present an approach to store this provenance data in a relational database engine. Finally, we identify important properties of provenance data captured by our model that can significantly reduce the amount of storage required, and demonstrate we can reduce the size of provenance data captured from an actual experiment to 0.4% of the original size, with modest performance overhead.
A Provenance-Aware Storage System (PASS) is a storage system that automatically collects and maintains provenance or lineage, the complete history or ancestry of an item. We discuss the advantages of treating provenance as metadata collected and maintained by the storage system, rather than as manual annotations stored in a separately administered database. We describe a PASS implementation, discussing the challenges it presents, the performance cost it incurs, and the new functionality it enables. We show that with reasonable overhead, we can provide useful functionality not available in today's file systems or provenance management systems.
A large amount of research has recently focused on the graph structure (or link structure) of the World Wide Web. This structure has proven to be extremely useful for improving the performance of search engines and other tools for navigating the Web. However, since the graphs in these scenarios involve hundreds of millions of nodes and even more edges, highly space-efficient data structures are needed to fit the data in memory. A first step in this direction was taken by the DEC Connectivity Server, which stores the graph in compressed form. We describe techniques for compressing the graph structure of the Web, and give experimental results of a prototype implementation. We attempt to exploit a variety of different sources of compressibility of these graphs and of the associated set of URLs in order to obtain good compression performance on a large Web graph.
We consider the problem of compressing graphs of the link structure of the World Wide Web. We provide efficient algorithms for such compression that are motivated by random graph models for describing the Web. The algorithms are based on reducing the compression problem to the problem of finding a minimum spanning tree in a directed graph related to the original link graph. The performance of the algorithms on graphs generated by the random graph models suggests that by taking advantage of the link structure of the Web, one may achieve significantly better compression than natural Huffman-based schemes. We also provide hardness results demonstrating limitations on natural extensions of our approach.
A universal algorithm for sequential data compression is presented. Its performance is investigated with respect to a nonprobabilistic model of constrained sources. The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
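The universal codes described above belong to the Lempel-Ziv family, which replaces repeated substrings with references to previously seen phrases without assuming any probabilistic model of the source. As a sketch of the general idea (this is the simpler dictionary-building LZ78 variant of the family, not the exact code analyzed in the paper, and the function names are ours):

```python
def lz78_compress(s):
    """Emit (phrase_index, next_char) pairs, LZ78-style.

    The dictionary of previously seen phrases is rebuilt identically
    by the decoder, so only the pairs need to be transmitted.
    """
    dictionary = {"": 0}   # phrase -> index; index 0 is the empty phrase
    phrase, out = "", []
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch   # extend the longest known phrase
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:             # flush a trailing phrase already in the dictionary
        out.append((dictionary[phrase], ""))
    return out


def lz78_decompress(pairs):
    """Rebuild the phrase dictionary and concatenate the phrases."""
    phrases, out = [""], []
    for idx, ch in pairs:
        p = phrases[idx] + ch
        out.append(p)
        phrases.append(p)
    return "".join(out)
```

On repetitive input the pair stream is shorter than the text: `"ababab"` compresses to four pairs, and longer inputs approach the source's compressibility bound, which is the universality property the abstract refers to.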