Compressing Provenance Graphs
Yulai Xie†‡, Kiran-Kumar Muniswamy-Reddy§, Darrell D. E. Long, Ahmed Amer, Dan Feng, Zhipeng Tan
†Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics; University of California, Santa Cruz; §Harvard University; Santa Clara University
Abstract
The provenance community has built a number of systems to collect provenance, most of which assume that provenance will be retained indefinitely. However, it is not cost-effective to retain provenance information inefficiently. Since provenance can be viewed as a graph, we note the similarities to web graphs and draw upon techniques from the web compression domain to provide our own novel and improved graph compression solutions for provenance graphs. Our preliminary results show that adapting web compression techniques results in a compression ratio of 2.12:1 to 2.71:1, which we can improve upon to reach ratios of up to 3.31:1.
1 Introduction
Provenance, though extremely valuable, can take up substantial storage space. For instance, in the PReServ [9] provenance store, the original data was 100 KB, while the provenance reached 1 MB. For MiMI [10], an online protein database, the provenance expands to 6 GB while the base data is only 270 MB. Similar results are observed in other systems [3, 11]. This makes provenance data an increasing overhead on the storage subsystem.
A provenance dataset can be represented as a provenance graph [12]. Thus, an efficient representation of provenance graphs can fundamentally speed up provenance queries, whereas inefficient multi-gigabyte provenance graphs will not fit in limited memory and will dramatically reduce query efficiency.
We propose to adapt web compression algorithms to compress provenance graphs. Our motivation comes from our observation that provenance graphs and web graphs have a similar structure and share some essential characteristics, i.e., similarity, locality and consecutiveness. We have further discovered that provenance graphs have their own special features, and we propose two new techniques that improve provenance graph compression.
We test our hypothesis by compressing the provenance graphs generated by the PASS [3] system. Our results show that our web compression algorithms can compress such graphs, and that our improved algorithms can further raise the compression ratio, up to 3.31:1.
2 Web & Provenance Graphs
A web graph is a directed graph in which each URL is represented as a node and a link from one page to another is a directed edge. There are some key properties exploited by current web graph compression algorithms:
Locality: Few links cross URL domains; the vast majority point to nearby pages.
Similarity: Pages that are not far from each other have common neighbors with high probability.
Consecutiveness: The successors of a page tend to have node numbers in sequential order.
Provenance graphs have an organizational structure and characteristics similar to those of web graphs. Figure 1 shows the conversion of a snapshot of a NetBSD provenance trace (generated by the PASS system [3]) into an adjacency list that represents the provenance graph.
The notation “A INPUT[ANC] B” in the provenance
trace means that B is an ancestor of A, indicating that
there exists a directed edge from A pointing to B. In this
way, a provenance graph is also a directed graph and each
node (e.g., node 2 or 3) has a series of out-neighbors.
Provenance nodes 2 and 3 are similar, as they have common successors in the form of nodes 4, 7, 9 and 14.
[Figure 1 area: a PASS provenance trace excerpt ("2.0 NAME /bin/cp" with ancestor records "2.1 INPUT [ANC] 4.0" through "2.1 INPUT [ANC] 15.0", and "3.0 NAME /disk/scripts/bulkbuild" with ancestor records for nodes 4, 7, 9, 14, 17, 18, 19, 20 and 21), the corresponding provenance graph, and the adjacency list it maps to (node 2: out-degree 12, successors 4-15; node 3: out-degree 9, successors 4, 7, 9, 14, 17-21).]
Figure 1: Mapping from the provenance trace to the adjacency list that represents the provenance graph. The expression "2.1 INPUT[ANC] 4.0" indicates that node 4 is an ancestor of node 2, resulting in a directed edge from node 2 pointing to node 4. The figure also shows that the provenance graph exhibits characteristics (i.e., similarity, locality and consecutiveness) similar to those of a web graph.
The reason for this is that many header files or library files, represented as nodes like 4, 7, 9 and 14, are repeatedly used as input by many processes (e.g., nodes 2 and 3). Nodes 2 and 3 also exhibit locality. The successors of provenance node 2 lie only between 4 and 15, and the successors of node 3 lie only between 4 and 21. This is because the header files or library files used as input by a process (e.g., the successors of nodes 2 and 3) are probably in the same directory, so their ordering (and therefore their assigned numbers) is usually close together. Consecutiveness is also clear in the successors of nodes 2 and 3, running from 4 to 15 and from 17 to 21 respectively. Consecutiveness arises because a process or file may be generated from many files that do not belong to a PASS volume, and such files are usually numbered sequentially.
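To make the mapping concrete, the following Python sketch shows one way a trace line such as "2.1 INPUT [ANC] 4.0" could be turned into an adjacency-list edge; the simplified line format and the parse_trace helper are illustrative assumptions for this paper's examples, not the actual PASS tooling.

# Minimal sketch: turning PASS-style trace lines into an adjacency list.
# The line layout handled here is a simplified assumption; real PASS records
# carry version numbers (e.g., "2.1") and additional record types.
from collections import defaultdict

def parse_trace(lines):
    """Map lines like '2.1 INPUT [ANC] 4.0' to an edge 2 -> 4."""
    adjacency = defaultdict(list)          # node -> ordered successor list
    for line in lines:
        fields = line.split()
        if len(fields) == 4 and fields[1] == "INPUT" and fields[2] == "[ANC]":
            child = int(float(fields[0]))  # drop the version part of "2.1"
            ancestor = int(float(fields[3]))
            adjacency[child].append(ancestor)
    return adjacency

trace = ["2.1 INPUT [ANC] 4.0", "2.1 INPUT [ANC] 5.0", "3.1 INPUT [ANC] 4.0"]
print(dict(parse_trace(trace)))            # {2: [4, 5], 3: [4]}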
3 Related Work
Barga et al. [1] presented a layered provenance model for workflow systems that can efficiently reduce provenance size by minimizing repeated operation information during provenance generation. This is similar to our approach of exploiting similarity between the neighbors of different nodes that represent the same manipulation. Chapman et al. [4] proposed two classes of algorithms to compress provenance graphs: provenance factorization and provenance inheritance. The former aims to find common subtrees between different nodes, and the latter focuses on finding similarities between data items that have ancestry relationships or belong to a particular type. These algorithms achieve good compression ratios, but can only be applied to a flat data model, i.e., one where each node has complete provenance, and therefore cannot be used to compress provenance generated by a provenance system such as PASS [3]. Our methods are more general and hence applicable to a wider range of systems.
There has been considerable work [5, 6, 7, 8] in the domain of web graph compression. Adler et al. [5] proposed to utilize reference compression to compress web graphs. Randall et al. [6] suggested a set of techniques, such as delta codes and variable-length bit encoding, to compress a database providing fast access to web hyperlinks. The critical web graph compression framework was presented by Boldi and Vigna [7], who obtained good compression performance by fully exploiting the locality and similarity of web pages. The algorithm we use is based on this framework.
There are also classical techniques such as LZ-based compression algorithms [2]. These techniques provide an upper bound on the compression that is possible. However, since they do not preserve the structure of the data, the resulting compressed graph would not be amenable to querying.
4 Compression Algorithms
Three critical ideas lie behind web compression algorithms [7]. First, encoding the successor list of one node by using the similar successor list of another node as a reference, which avoids encoding duplicate data. Second, encoding consecutive numbers by recording only the start number and the length of the run, reducing the number of successors to be encoded. Third, encoding the gaps between the successors of a node rather than the successors themselves, which typically requires fewer bits.
[Figure 2 area: the three encoding steps applied to node 15 (out-degree 5, successors 3, 11, 13, 14, 17) and node 16 (out-degree 7, successors 11, 14, 19, 20, 21, 31, 33). Reference compression: node 16 is encoded against node 15 with the bit list 01010 (marking the shared successors 11 and 14) plus the extra nodes 19, 20, 21, 31, 33. Finding consecutive numbers: node 15 yields left extreme 13 with length 2 and residuals 3, 11, 17; node 16 yields left extreme 19 with length 3 and residuals 31, 33. Encoding gaps: node 15's residuals become -12, 8, 6 and node 16's residuals become 15, 2.]
Figure 2: An example of the web compression algorithm
As we can see in Figure 1, a provenance graph can be represented by a set of provenance nodes, each of which has a series of ancestors as its successors. The provenance nodes are numbered from 0 to N-1, in order, during provenance generation. We use Out(x) to denote the successor list of node x. The web compression algorithm encodes this list as follows:
1. Reference compression: Find the node with the most similar successor list among the preceding W successor lists, where W is a window parameter. Let y be this reference node and Out(y) its successor list (called the reference list). The encoding of Out(x) is then divided into two parts: a bit list that identifies the common successors between Out(x) and Out(y), and a list of extra nodes that identifies the remaining successors in Out(x).
2. Find consecutive numbers: Separate the consecutive numbers from the extra nodes, and represent each run of consecutive numbers by its left extreme and its length.
3. Encode gaps: Let x_1, x_2, x_3, ..., x_k be the successors of x that have not been encoded after the above steps. If x_1 <= x_2 <= ... <= x_k, then encode them as x_1 - x, x_2 - x_1, ..., x_k - x_{k-1}.
Figure 2 shows an example of these three steps. In this example, node 15 is the reference for node 16 and has no reference itself. The case in Figure 1 is simpler: the successor list of node 2 can be encoded using step 2 alone, and the successor list of node 3 can be encoded using steps 1 and 2, with node 2's successor list serving as the reference for node 3.
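The following Python sketch walks the successor lists of nodes 15 and 16 from Figure 2 through the three steps; the encode_list and find_runs helpers and the output layout are our own simplified reading of the framework in [7], not the WebGraph implementation itself.

# Illustrative sketch of the three encoding steps applied to one successor list.
def find_runs(sorted_nodes, min_len=2):
    """Split a sorted list into (left_extreme, length) runs and leftover residuals."""
    runs, residuals, i = [], [], 0
    while i < len(sorted_nodes):
        j = i
        while j + 1 < len(sorted_nodes) and sorted_nodes[j + 1] == sorted_nodes[j] + 1:
            j += 1
        if j - i + 1 >= min_len:
            runs.append((sorted_nodes[i], j - i + 1))
        else:
            residuals.extend(sorted_nodes[i:j + 1])
        i = j + 1
    return runs, residuals

def encode_list(node, successors, reference=None):
    succ = sorted(successors)
    # Step 1: reference compression -- one bit per successor of the reference list.
    bits, extra = [], succ
    if reference:
        bits = [1 if s in set(succ) else 0 for s in reference]
        extra = [s for s in succ if s not in set(reference)]
    # Step 2: pull out consecutive runs as (left extreme, length) pairs.
    runs, residuals = find_runs(extra)
    # Step 3: gap-encode the residuals, starting from the node number itself.
    gaps, prev = [], node
    for s in residuals:
        gaps.append(s - prev)
        prev = s
    return {"bits": bits, "runs": runs, "gaps": gaps}

# Nodes 15 and 16 from Figure 2:
print(encode_list(15, [3, 11, 13, 14, 17]))
# -> {'bits': [], 'runs': [(13, 2)], 'gaps': [-12, 8, 6]}
print(encode_list(16, [11, 14, 19, 20, 21, 31, 33], reference=[3, 11, 13, 14, 17]))
# -> {'bits': [0, 1, 0, 1, 0], 'runs': [(19, 3)], 'gaps': [15, 2]}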
[Figure 3 area: an adjacency list with node names for nodes 15-25 (/bin/cp, /bin/bash, /bin/su, /usr/bin, /bin/hostname, /sbin/consoletype, /meminfo, /usr/bin/id, /usr/bin, /bin/sed, /usr/bin). The three nodes named /usr/bin (18, 23 and 25) have nearly identical successor lists (3, 11, 13, 14 plus one further node each), and the W=3 windows preceding nodes 18 and 23 are marked.]
Figure 3: Name-identified reference list
Our Improved Approach
We now describe two improvements beyond existing
web compression algorithms. These are motivated by
the observed properties of datasets generated by the
PASS [3] system.
(a) Name-identified Reference list:
Web compression algorithms find the most similar reference list among the preceding W nodes. The greater the similarity, the better the compression achieved. However, it is sometimes hard to find a very similar reference list with a small W, and while a larger W would produce a better compression ratio because it enlarges the scope of the possible reference lists, this comes at the expense of slower compression and decompression.
We have found that many provenance datasets record the names of provenance nodes, and nodes with the same name usually have a large set of common successors. For instance, in PASS provenance traces, a process that is represented as a provenance node may be scheduled to execute many times, and each time it is scheduled it will use many of the same header files or library files as input; these are exactly the common successors shared by the process nodes with the same name. We therefore propose to use the name as an indicator to help find similar reference lists.
Figure 3 shows how this technique works on an example. When node 18, with name /usr/bin, first appears, the algorithm finds a reference list among the preceding W nodes (here W=3). But when node 23 with name /usr/bin appears, i.e., the name occurs a second time, we take the successor list of node 18 as the reference list rather than searching the window. This algorithm only needs a hash table to identify nodes with the same name: each time we encode the successor list of a node, we simply check whether its name is already in the hash table, and we need not scan backwards through the window (which is especially beneficial when W is large). This incurs only a minimal time overhead on compression and decompression.
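A minimal Python sketch of this bookkeeping is shown below, under the assumption that each node carries a name; the choose_reference helper, the last_list_for_name table, and the overlap-based window fallback are illustrative simplifications, not part of PASS or the WebGraph framework.

# Sketch: prefer an earlier node with the same name as the reference list,
# falling back to the most similar list in the preceding W lists (the window).
last_list_for_name = {}   # name -> successor list of the last node with that name

def choose_reference(name, successors, window):
    """Return a reference successor list for the current node."""
    if name in last_list_for_name:
        ref = last_list_for_name[name]
    elif window:
        ref = max(window, key=lambda lst: len(set(lst) & set(successors)))
    else:
        ref = None
    last_list_for_name[name] = list(successors)   # remember for later same-name nodes
    return ref

# Node 18 (/usr/bin) searches its W=3 window; node 23 (/usr/bin) reuses node 18's list.
ref18 = choose_reference("/usr/bin", [3, 11, 13, 14, 17],
                         [[19, 21, 32], [4, 9, 13, 17], [19, 20, 23]])
ref23 = choose_reference("/usr/bin", [3, 11, 13, 14, 18],
                         [[4, 8, 9, 11], [4, 8, 11, 15], [5, 7, 11, 12]])
print(ref18)   # [4, 9, 13, 17], best overlap within the window
print(ref23)   # [3, 11, 13, 14, 17], node 18's list found via the name table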
Table 2: Performance of the web compression algorithm and of the improved algorithms that incorporate the two new techniques, with respect to different W, for the various trace workloads

        NetBSD                    am-utils                  blast-lite
W       web   web-name  web-gap   web   web-name  web-gap   web   web-name  web-gap
1       2.00  2.62      2.27      1.60  2.57      1.86      1.66  1.98      1.80
5       2.14  2.62      2.31      1.73  2.57      2.02      1.96  1.98      2.16
10      2.32  2.63      2.69      1.87  2.57      2.23      2.03  2.02      2.25
100     2.71  2.63      3.23      2.58  2.58      3.31      2.12  2.08      2.38
[Figure 4 area: three adjacency-list views of nodes 155-159 (out-degrees 1, 0, 1, 2, 1; successors 1325, none, 1326, {1329, 1331}, 1333). After conventional web compression only node 158's list changes, to the gaps 1171, 2. After node-crossing gap encoding, node 157's entry becomes 2, 1; node 158's stays 1171, 2; and node 159's becomes 2, 7.]
Figure 4: Node-crossing gap
Table 1: The properties of provenance traces

Trace        n        Size
NetBSD       140146   60.2 MB
am-utils     46100    24.9 MB
blast-lite   240      68 kB
(b) Node-crossing gap: Current web compression algorithms exploit graph locality by encoding the gaps between the successors of a node to improve the compression ratio. However, in PASS provenance traces we find that many provenance nodes have only one successor, and such successors cannot be encoded with the current web compression technique. We therefore propose to also exploit gaps between the successors of different nodes. Figure 4 compares the gap encoding used in current web compression algorithms with our node-crossing gap encoding. Current encodings can only encode the successors of node 158 (1329 - 158 = 1171, 1331 - 1329 = 2), while our node-crossing approach can further encode the successors of node 157 (157 - 155 = 2, 1326 - 1325 = 1) and node 159 (159 - 157 = 2, 1333 - 1326 = 7).
Note that the successor of node 155 is not encoded as 1325 - 155 = 1170 by the current web compression algorithm, which exploits only the locality among the successors of a single node, while our improved approach further exploits the locality between the successors of different nodes.
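The sketch below illustrates node-crossing gap encoding for a run of single-successor nodes, following the Figure 4 example; the (node gap, successor gap) pair format and the node_crossing_gaps helper are our own illustrative reading of the figure, not the exact encoding used in our implementation.

# Sketch: gap-encode single-successor nodes against the previous such node.
def node_crossing_gaps(nodes):
    """nodes: ordered list of (node_id, successors) where each node has one successor."""
    out, prev_id, prev_succ = [], None, None
    for node_id, (succ,) in nodes:
        if prev_id is None:
            out.append((node_id, succ))             # first node kept verbatim
        else:
            out.append((node_id - prev_id,          # gap between node ids
                        succ - prev_succ))          # gap between their successors
        prev_id, prev_succ = node_id, succ
    return out

# Nodes 155, 157 and 159 from Figure 4 (node 156 has no successor and is skipped):
print(node_crossing_gaps([(155, [1325]), (157, [1326]), (159, [1333])]))
# -> [(155, 1325), (2, 1), (2, 7)]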
5 Evaluation
Our experimental datasets are generated by the PASS system and drawn from different applications:
1. NetBSD trace: a build of several components of NetBSD.
2. Compilation workload trace: a compilation of am-utils.
3. Scientific trace: a simple instance of the blast biological workload.
Table 1 summarizes the number of nodes and the size of these provenance traces.
Table 2 shows the compression performance with respect to different W for the various trace workloads. "Web" denotes compressing the provenance graph with the current web compression algorithm, "web-name" denotes the web compression algorithm incorporating the name-identified reference list technique, and "web-gap" denotes the web compression algorithm incorporating the node-crossing gap technique.
For all three trace workloads, the web compression algorithm achieves better performance as W increases, because a bigger window increases the likelihood of finding similar reference lists. We also find that, for all workloads, the algorithm with name-identified reference lists exhibits very stable performance, e.g., 2.62-2.63 times for the NetBSD trace. With our name-identification approach, window size appears to have limited impact, and it performs considerably better than basic web compression, which requires a much larger window size to match its performance.
When we consider the algorithm with node-crossing gap encoding, we see that it also outperforms web compression, and it even outperforms our name-identification approach as the window size increases: for example, at W=10 and W=100 for the NetBSD trace, at W=100 for am-utils, and at W=5, W=10 and W=100 for blast-lite. We also note that our node-crossing gap encoding achieves the best compression ratios (at W=100) in all cases.
6 Summary and Future Work
We successfully compressed provenance graphs by adapting techniques from web graph compression, achieving compression ratios of 2.12 to 2.71 times. We also introduced two new techniques based on provenance graph properties, and demonstrated both increased compression ratios (by up to 28%) and improved computational efficiency.
In the future, we intend to evaluate the impact of web compression on provenance query performance. In addition, we plan to compare our web compression algorithms with LZ-based compression algorithms and with a hybrid scheme combining LZ-based and web-compression-based schemes.
Acknowledgments
This work was supported in part by the National Basic Research 973 Program of China under Grant No. 2011CB302300, 863 project 2009AA01A401/2, NSFC No. 61025008, 60933002, 60873028, and the Changjiang innovative group of Education of China No. IRT0725. This material is based upon work supported in part by: the Department of Energy under Award Number DE-FC02-10ER26017/DE-SC0005417, the Department of Energy's Petascale Data Storage Institute (PDSI) under Award Number DE-FC02-06ER25768, and the National Science Foundation under awards CCF-0937938 and IIP-0934401 (I/UCRC Center for Research in Intelligent Storage).
This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
References
[1] R. S. Barga and L. A. Digiampietri. Automatic capture and efficient storage of eScience experiment provenance. Concurrency and Computation: Practice and Experience, 2007.
[2] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. on Information Theory, 23(3):337-343, 1977.
[3] K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. I. Seltzer. Provenance-aware storage systems. Proc. USENIX Annual Tech. Conf., 2006.
[4] A. P. Chapman, H. V. Jagadish, and P. Ramanan. Efficient provenance storage. Proc. SIGMOD, 2008.
[5] M. Adler and M. Mitzenmacher. Towards compressing web graphs. Proc. IEEE Data Compression Conf., 2001.
[6] K. Randall, R. Wickremesinghe, and J. Wiener. The Link Database: Fast access to graphs of the Web. Research Report 175, Compaq Systems Research Center, Palo Alto, CA, 2001.
[7] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. Proc. 13th WWW, 2004.
[8] T. Suel and J. Yuan. Compressing the graph structure of the web. Proc. IEEE Data Compression Conf., 2001.
[9] P. Groth, S. Miles, W. Fang, S. C. Wong, K. Zauner, and L. Moreau. Recording and using provenance in a protein compressibility experiment. Proc. HPDC, 2005.
[10] M. Jayapandian, A. P. Chapman, V. G. Tarcea, C. Yu, A. Elkiss, A. Ianni, B. Liu, A. Nandi, C. Santos, P. Andrews, B. Athey, D. States, and H. V. Jagadish. Michigan Molecular Interactions (MiMI): Putting the jigsaw puzzle together. Nucleic Acids Research, 2007.
[11] Y. Simmhan, B. Plale, and D. Gannon. A framework for collecting provenance in data-centric scientific workflows. Proc. ICWS, 2006.
[12] D. A. Holland, U. Braun, D. Maclean, K.-K. Muniswamy-Reddy, and M. I. Seltzer. Choosing a data model and query language for provenance. Proc. 2nd Int'l Provenance and Annotation Workshop, 2008.