Skluma: A Statistical Learning Pipeline for Taming Unkempt
Data Repositories
Paul Beckman, Tyler J. Skluzacek, Kyle Chard, and Ian Foster
Computation Institute
University of Chicago and Argonne National Laboratory
Chicago, IL 60637
{pbeckman,skluzacek,chard}@uchicago.edu,foster@anl.gov
ABSTRACT
Scientists' capacity to make use of existing data is predicated on their ability to find and understand those data. While significant progress has been made with respect to data publication, and indeed one can point to a number of well organized and highly utilized data repositories, there remain many such repositories in which archived data are poorly described and thus impossible to use. We present Skluma—an automated system designed to process vast amounts of data and extract deeply embedded metadata, latent topics, relationships between data, and contextual metadata derived from related documents. We show that Skluma can be used to organize and index a large climate data collection that totals more than 500GB of data in over a half-million files.
CCS CONCEPTS
• Information systems → Data cleaning; Mediators and data integration;
KEYWORDS
data wrangling, statistical learning, metadata extraction, data integration
ACM Reference format:
Paul Beckman, Tyler J. Skluzacek, Kyle Chard, and Ian Foster. 2017. Skluma:
A Statistical Learning Pipeline for Taming Unkempt Data Repositories. In
Proceedings of SSDBM ’17, Chicago, IL, USA, June 27-29, 2017, 4 pages.
https://doi.org/10.1145/3085504.3091116
1 INTRODUCTION
Meaningless le names. Limited documentation. Unlabeled columns.
Numerically encoded null values. Multifarious le extensions. Sci-
entists live this nightmare daily as they seek to discover and use
publicly available data stored in heterogeneous data repositories.
As the rate of data production explodes (e.g., due to higher reso-
lution instruments and massive sensor networks), clear, uniform
documentation and organization of data are often neglected. Many
eorts have focused on standardizing data naming and organiza-
tion models within and across research groups [
17
,
20
]. We, and
others, have developed workow-oriented data publication sys-
tems that impose requirements on organization and metadata [
7
].
While there are clear success stories with respect to structured data
repositories [
4
,
9
], repositories often become dumping grounds for
poorly described data [
6
]. We postulate that new methods based in
statistical learning are needed to make sense of the vast amounts
of data already published to existing repositories. To this end, we
propose an automated pipeline (Skluma) and associated models and
methods that allows us to process and classify disorderly data, striv-
ing to provide the metadata necessary to enable complex querying
of previously incomprehensible scientic data repositories.
Skluma is organized around a three-stage pipeline: crawl heterogeneous data repositories, extract metadata on each file's content, and contextualize data in order to enrich, augment, and improve the accuracy of existing metadata. Throughout these stages, Skluma accumulates and refines metadata through a number of information extraction and statistical learning models.
To illustrate the value of Skluma we have used it to organize and index the contents of the United States Department of Energy's Carbon Dioxide Information Analysis Center (CDIAC) data store [19]. This filesystem-based repository contains over a half-million files of diverse types, structures, and sizes, distributed among twelve grab-bag 'pub' top-level directories. Many files are compressed, named according to undocumented systems, organized in arbitrary hierarchies, and stored in nonstandard formats. Furthermore, there are duplicate files both within and between directories, and scientifically useless files (e.g., Windows Installers, shortcuts, empty zipped directories). CDIAC contains more than 150 different file extensions, making the implementation of type- or format-specific metadata extractors infeasible. Figure 1 shows the distribution of file types. CDIAC is illustrative of a common problem in science: while researchers may work hard to address problems of verifiability and reproducibility, these considerations are easily obscured or lost by publishing disorganized and undocumented data. Skluma works to reclaim this missing context and restore the usefulness of disarrayed repositories.
The remainder of the paper is organized as follows: Section 2 provides a brief overview of past methods for database and repository cleaning. Section 3 outlines our pipeline and presents test results. Section 4 describes our planned demonstration of Skluma. Section 5 outlines future work and considerations. Finally, Section 6 summarizes Skluma and its contributions.
Figure 1: CDIAC le extension distribution: Counts for the
35 most common le extensions in CDIAC.
2 RELATED WORK
Skluma follows a long line of attempts to organize and gain insight into highly disorganized data. Related work on geospatial attribute recognition has paired simple rule-based analysis with the use of support vector machines (SVMs) in order to predict attributes in a broad range of geospatial datasets [2, 14]. Current methods focus on well-structured data with higher consistency than data sources like CDIAC.
Skluma is also not the rst to provide a scalable solution to collect
raw datasets and extract metadata from them. Pioneering research
on data lakes has developed methods for extracting standard meta-
data from nonstandard le types and formats [
18
]. Recently the data
lake has been adapted to standardize and extract metadata from
strictly-geospatial datasets [
16
]. Normally data lakes have some sort
of institution-specic target for which they are optimized, whether
they primarily input transactional, scientic, or networked data.
Skluma is optimized for data without any standardization guar-
antees, providing information on related, relevant data and their
attributes.
Finally, Skluma supplements existing work that cleans and labels data using context features. Data Civilizer [10] accounts for proximate files in data warehouses by building and interpreting linkage graphs. Others have used context as a means to determine how certain metadata might be leveraged to optimize performance in a specific application or API [15]. Skluma collects and analyzes context metadata in order to allow research scientists to find, query, and download related datasets that aid their scientific efforts.
3 PIPELINE
Skluma implements a three-stage pipeline: (1) crawling, (2) metadata extraction, and (3) contextualization.
3.1 Crawling
Skluma’s rst task is to catalog the data in order to understand its
organization and scope. While crawling, Skluma extracts general
le-level metadata, such as le name, path, size, a checksum, exten-
sion and MIME type [
13
] for each le. Given the wide variety of
data access protocols (e.g., HTTP, FTP, GridFTP) oered by data
repositories, Skluma is designed with a modular crawling architec-
ture in which dierent crawler implementations can be used. In the
case of CDIAC, we used FTP and HTTP crawlers. As a result of the
crawling phase Skluma creates a JSON document that stores basic
system metadata about each le. This JSON le is stored, modied,
and appended to throughout the Skluma pipeline.
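As a rough illustration of this crawling stage (a minimal sketch, not the authors' implementation; the JSON field names, the local walk over an already-mirrored copy, and the example mirror path are assumptions), a crawler might record per-file metadata as follows:

import hashlib
import json
import mimetypes
import os

def crawl(root):
    # Walk a mirrored repository and emit one metadata record per file:
    # name, path, size, checksum, extension, and MIME type.
    records = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                checksum = hashlib.md5(f.read()).hexdigest()  # chunked reads advisable for large files
            mime, _ = mimetypes.guess_type(path)
            records.append({
                "name": name,
                "path": path,
                "size": os.path.getsize(path),
                "checksum": checksum,
                "extension": os.path.splitext(name)[1].lstrip("."),
                "mime_type": mime or "application/octet-stream",
            })
    return records

# Later pipeline stages append extracted and contextual metadata to these records.
# "/mirror/cdiac" is a hypothetical mount point for the mirrored repository.
with open("crawl_metadata.json", "w") as out:
    json.dump(crawl("/mirror/cdiac"), out, indent=2)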
In order to extract metadata from within files, Skluma requires a mutable file system to execute decompression if necessary and to store the resulting files. As data repositories often do not provide such resources, the Skluma crawler mirrors data temporarily to a high-performance storage environment, where metadata extraction can be performed. In the CDIAC scenario, we use Globus [12] to transfer all CDIAC data temporarily to Petrel [1], a 1 PB storage system housed at Argonne National Laboratory.
3.2 Metadata Extraction
The second pipeline phase iterates over all files discovered during the crawling phase and extracts metadata from each file based on its content. We leverage a suite of modular extraction tools to obtain general, file, and domain-specific metadata. As these extractors can require significant compute resources, we use Jetstream as a scalable platform on which to execute arbitrary extraction tools on the data. Petrel serves as a performant storage system from which we can rapidly access data for processing on Jetstream.
The end goal of extracted metadata is to facilitate queries over a repository with field-specific predicates (e.g., "return all CDIAC files with temperature values greater than 20 C"). To this end, Skluma's metadata extractors address two main types of data: containerized and column-formatted. Containerized data formats like NetCDF already include much of the metadata necessary for query construction, accessible in standard formats and via standard interfaces. In this case, Skluma simply reformats this information and copies it into the metadata file. For any file that can be parsed as column-formatted, we calculate the min, max, and average for numerical columns. In addition, we collect all available headers.
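A minimal sketch of this column-formatted step, assuming a delimited text file and pandas for parsing (the actual extractor implementations are not specified at this level of detail in the paper), could be:

import pandas as pd

def column_aggregates(path, null_values=None):
    # Read a delimited file (delimiter sniffed), collect headers, and
    # compute min/max/mean for numeric columns. null_values is the
    # optional set of encoded nulls inferred by the model described below.
    df = pd.read_csv(path, sep=None, engine="python", na_values=null_values)
    metadata = {"headers": list(df.columns), "columns": {}}
    for col in df.select_dtypes(include="number"):
        metadata["columns"][col] = {
            "min": float(df[col].min()),
            "max": float(df[col].max()),
            "avg": float(df[col].mean()),
        }
    return metadata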
In order for these aggregates to have any meaning, however, encoded null values must be detected and (typically) skipped. Otherwise, to use a CDIAC example, we may find that a scientist records -999 to indicate that no temperature measurement was taken, leading to a file with an average temperature of -437 C. Thus Skluma employs a supervised learning model to infer null values and exclude them from aggregate calculations. We use a k-nearest neighbor classification algorithm, using the average, the three largest values and their differences, and the three smallest values and their differences as features for our model. By taking a classification rather than regression-based approach, Skluma selects from a preset list of null values, which avoids discounting real experimental outliers recorded in the data itself. Figure 2 provides a PCA visualization of the clustering of null values in the feature space.

Figure 2: PCA visualization of null values: Cutout shows dense region of the feature space at 2500x zoom.
When trained and tested by cross-validation on a labeled test set of 4682 columns from 335 unique files, our model achieved accuracy 0.991, precision 0.989, and recall 0.961, where precision and recall are calculated by macro-averaging over classifiers.
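The following sketch illustrates the feature construction and classification step just described. It is an assumption-laden outline rather than Skluma's actual code: the candidate null list, the choice of k, and the scikit-learn classifier are illustrative.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def column_features(values):
    # Feature vector described above: the average, the three largest values
    # and their differences, and the three smallest values and their differences.
    v = np.sort(np.asarray(values, dtype=float))
    top, bottom = v[-3:], v[:3]
    return np.concatenate([[v.mean()], top, np.diff(top), bottom, np.diff(bottom)])

def train_null_classifier(labeled_columns, labels, k=5):
    # labeled_columns: list of 1-D arrays of column values.
    # labels: index into a preset list of candidate null encodings
    # (e.g., [-999, -9999, -99.9]); a "no null" class could be added.
    X = np.vstack([column_features(c) for c in labeled_columns])
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X, labels)
    return clf

def strip_nulls(column, clf, null_candidates):
    # Predict the null encoding for an unlabeled column and drop it
    # before computing min/max/average aggregates.
    v = np.asarray(column, dtype=float)
    null_value = null_candidates[clf.predict([column_features(v)])[0]]
    return v[v != null_value]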
At this point in the pipeline the accuracy of the aggregate values has been improved to better reflect the existing data. However, if the column header is not provided in the file itself, these values provide very little information that can be used for query or discovery. This is a common occurrence in CDIAC data. Files often contain only numerical values, whereas column name information is in a separate free-text README file in a nearby directory. The problem of associating unlabelled data columns with headers is addressed by the third portion of the Skluma pipeline.
3.3 Contextualization
The relationships and similarities between files within and outside a data repository can provide valuable information regarding the nature of these files. Specifically, topic labels on free-text documents can serve as valuable context to describe nearby undocumented data. Skluma employs a topic mixture model based on Latent Dirichlet Allocation [5]. Our model is made up of three steps. First, we train our model on Web of Science (WoS) abstracts. Next, we use this model to generate the topic distribution of each free-text README or documentation file in the data repository, which is the finite mixture over an underlying set of topics derived from the model. Finally, we model all data files in the repository as themselves finite mixtures of the topic distributions of the surrounding labelled files. We calculate the topic mixture of a given file as a linear combination w_1 d_1 + w_2 d_2 + ... + w_n d_n of the topic distributions d_1, ..., d_n of all nearby tagged free-text documents within a distance threshold. The weights w_1, ..., w_n are inversely proportional to the distance within the repository from the data file to each tagged text document. The distance metric we use depends on the type of repository being considered. For directory-structured file systems like CDIAC, we use the number of directory changes that must be made in order to reach one file from another. A simple illustration of this model is shown in Figure 3. We then execute these steps by submitting a series of jobs to the Cloud Kotta [3] platform.
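A sketch of the contextualization weighting, under stated assumptions (the 1/(1+d) weighting, normalization, and distance threshold are plausible choices for illustration; the paper specifies only that weights are inversely proportional to distance):

import os
import numpy as np

def directory_distance(path_a, path_b):
    # Number of directory changes needed to reach one file from the other.
    a = os.path.dirname(path_a).split(os.sep)
    b = os.path.dirname(path_b).split(os.sep)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

def contextual_topic_mixture(data_file, readme_topics, max_dist=3):
    # readme_topics maps each tagged README path to its LDA topic distribution.
    # Returns the weighted combination w_1 d_1 + ... + w_n d_n for data_file,
    # or None if no tagged document lies within the distance threshold.
    weights, dists = [], []
    for path, dist_vec in readme_topics.items():
        d = directory_distance(data_file, path)
        if d <= max_dist:
            weights.append(1.0 / (1.0 + d))  # inverse-distance weighting
            dists.append(np.asarray(dist_vec))
    if not weights:
        return None
    w = np.array(weights) / sum(weights)  # normalize so the mixture sums to 1
    return w @ np.vstack(dists)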
This model serves three purposes. Firstly, it adds an additional queryable attribute to the metadata, enabling searches by probable topic. Secondly, it can act as a basis for selecting specialized metadata extraction tools for classified files. Finally, it may be used as a feature for a statistical learning model used to predict missing column headers. We discuss this final prospect in Section 5.
4 DEMO
The demonstration of Skluma involves executing our pipeline on a small subset of CDIAC. Specifically, we will demonstrate inference of attributes and null values, tagging files with topics via analysis of their proximate READMEs, and extracting metadata such that the files are queryable by both their content and topic-context.
Furthermore, we will query over Skluma's resulting metadata by using a simple, web-based search GUI built atop ElasticSearch. The takeaway from the demo should be as follows: despite the repository's disorganized structure and content, Skluma is able to provide informative metadata that can be used to facilitate data discovery.

Figure 3: LDA label mixture model: Files are colored according to the relative weight of the surrounding topics; red represents "Atmospheric Science" and blue "Oceanography."
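To make the field-predicate queries from Section 3.2 concrete, a query of this kind against an ElasticSearch index of the extracted metadata might look as follows. The index name, field names, and client usage are assumptions for illustration only; Skluma's actual schema and GUI backend are not described here.

from elasticsearch import Elasticsearch

# Hypothetical index and field names; endpoint assumed to be a local ElasticSearch instance.
es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="skluma-cdiac",
    query={
        "bool": {
            "must": [
                {"match": {"columns.header": "temperature"}},
                {"range": {"columns.max": {"gt": 20}}},  # "temperature values greater than 20 C"
            ]
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["path"])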
5 FUTURE WORK
For column-structured data, one important step towards facilitating data discovery is the inference of column headers in unlabelled data. At this juncture, Skluma can produce accurate aggregates by removing null values and provide topic-based context for headerless files. We intend to develop additional statistical learning models that leverage these informative features in conjunction with further natural language processing techniques. These approaches may allow us to use free-text documentation files to predict specific column headers in nearby files, which would greatly increase the amount of previously unsearchable data that can be indexed for querying and discovery.
Beyond column-formatted data, we will augment the metadata we collect from semi-structured and unstructured files. To do so, we have begun exploring variants of other schema-extraction paradigms [8, 11] as an initial step in the pipeline. Providing further insight into unstructured data files will broaden the coverage of Skluma's derived metadata, moving towards more comprehensive data discovery.
6 SUMMARY
Skluma's three-step pipeline supports crawling, metadata extraction, and contextualization, working to provide the metadata necessary for a data querying environment for scientists. We employ a number of statistical learning models in order to determine the characteristics of and relationships between files despite irregularities, missing fields, and haphazard organization. The development and implementation of this class of automated information extraction methods has the potential to greatly expand the quantity of usable scientific data. We continue to expand and refine Skluma in order to better convert unkempt data repositories into clear, searchable resources that can propel novel research and analysis.
REFERENCES
[1] Petrel Data Management and Sharing Pilot. https://www.petrel.alcf.anl.gov. Visited Feb. 28, 2017.
[2] Shilpi Ahuja, Mary Roth, Rashmi Gangadharaiah, Peter Schwarz, and Rafael Bastidas. 2016. Using Machine Learning to Accelerate Data Wrangling. In Proceedings of the 16th IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 343–349.
[3] Y. N. Babuji, K. Chard, A. Gerow, and E. Duede. 2016. Cloud Kotta: Enabling secure and scalable data analytics in the cloud. In Proceedings of the IEEE International Conference on Big Data (Big Data). 302–310. https://doi.org/10.1109/BigData.2016.7840616
[4] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. 2000. The Protein Data Bank. Nucleic Acids Research 28, 1 (2000), 235. http://dx.doi.org/10.1093/nar/28.1.235
[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
[6] Carl F. Cargill. 2011. Why standardization efforts fail. Journal of Electronic Publishing 14, 1 (2011).
[7] Kyle Chard, Jim Pruyne, Ben Blaiszik, Rachana Ananthakrishnan, Steven Tuecke, and Ian Foster. 2015. Globus Data Publication As a Service: Lowering Barriers to Reproducible Science. In Proceedings of the 2015 IEEE 11th International Conference on e-Science (E-SCIENCE '15). IEEE Computer Society, Washington, DC, USA, 401–410. http://dx.doi.org/10.1109/eScience.2015.68
[8] Cloudera. 2014. RecordBreaker. Cloudera RecordBreaker GitHub Repository (2014). https://github.com/cloudera/RecordBreaker/tree/master/src
[9] Timothy D. Crum, Ron L. Alberty, and Donald W. Burgess. 1993. Recording, Archiving, and Using WSR-88D Data. Bulletin of the American Meteorological Society 74, 4 (1993), 645–653. https://doi.org/10.1175/1520-0477(1993)074<0645:RAAUWD>2.0.CO;2
[10] Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang. 2017. The Data Civilizer System. In Proceedings of the 8th Biennial Conference on Innovative Data Systems Research (CIDR).
[11] Kathleen Fisher and David Walker. 2011. The PADS project: an overview. In Proceedings of the 14th International Conference on Database Theory. ACM, 11–17.
[12] Ian Foster. 2011. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing 15, 3 (2011), 70.
[13] N. Freed and N. Borenstein. 1996. Multipurpose Internet Mail Extensions (MIME). RFC 2045. IETF.
[14] Hui Han, C. Lee Giles, Eren Manavoglu, Hongyuan Zha, Zhenyue Zhang, and Edward A. Fox. 2003. Automatic Document Metadata Extraction Using Support Vector Machines. In Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '03). IEEE Computer Society, Washington, DC, USA, 37–48. http://dl.acm.org/citation.cfm?id=827140.827146
[15] Joseph M. Hellerstein, Vikram Sreekanti, Joseph E. Gonzalez, James Dalton, Akon Dey, Sreyashi Nag, Krishna Ramachandran, Sudhanshu Arora, Arka Bhattacharyya, Shirshanka Das, et al. 2017. Ground: A Data Context Service. In Proceedings of the 8th Biennial Conference on Innovative Data Systems Research (CIDR).
[16] Tyler J. Skluzacek, Kyle Chard, and Ian Foster. 2016. Klimatic: A Virtual Data Lake for Harvesting and Distribution of Geospatial Data. In Proceedings of the 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS '16). IEEE Press, Piscataway, NJ, USA, 31–36. https://doi.org/10.1109/PDSW-DISCS.2016.9
[17] Hiroyasu Sugano, Adrian Bateman, Wayne Carr, Jon Peterson, Shingo Fujimoto, and Graham Klyne. 2004. Presence Information Data Format (PIDF). RFC 3863. IETF.
[18] Ignacio Terrizzano, Peter M. Schwarz, Mary Roth, and John E. Colino. 2015. Data Wrangling: The Challenging Journey from the Wild to the Lake. In Conference on Innovative Data Systems Research (CIDR).
[19] U.S. Dept. of Energy. 2017. Carbon Dioxide Information Analysis Center. (Jan 2017). ftp://cdiac.ornl.gov
[20] John Wieczorek, David Bloom, Robert Guralnick, Stan Blum, Markus Döring, Renato Giovanni, Tim Robertson, and David Vieglais. 2012. Darwin Core: an evolving community-developed biodiversity data standard. PLoS ONE 7, 1 (2012), e29715.