Beth Plale

Indiana University Bloomington | IUB · School of Informatics, Computing, and Engineering

PhD computer science

About

284
Publications
65,877
Reads
5,691
Citations
Introduction
Professor Plale has broad research and governance interests in long-term preservation of and access to scientific data, data science, and big data. Her specific research interests are in tools for metadata and provenance capture, data repositories, cyberinfrastructure for large-scale data analysis, and workflow systems. Plale is deeply engaged in geoinformatics research and education and has substantive experience in developing stable and usable scientific cyberinfrastructure.
Additional affiliations
January 1999 - July 2001
Georgia Institute of Technology
Position
  • PostDoc Position
August 2001 - present
Indiana University Bloomington

Publications (284)
Article
Full-text available
Artificial intelligence (AI) has the potential for vast societal and economic gain; yet applications are developed in a largely ad hoc manner, lacking coherent, standardized, modular, and reusable infrastructures. The NSF‐funded Intelligent CyberInfrastructure with Computational Learning in the Environment AI Institute (“ICICLE”) aims to fundamenta...
Preprint
Full-text available
In the landscape of exascale computing, collaborative research campaigns are conducted as co-design activities of loosely coordinated experiments. But the higher-level context and the knowledge of individual experimental activity are lost over time. We undertook a knowledge capture and representation aid called Campaign Knowledge Network (CKN), a co-d...
Article
Full-text available
Openness and interdisciplinarity in research and data are among the challenges that are frequently discussed in the context of changing scientific and scholarly practices. Gradually, the visions of open and widely shared data are being reconciled with complex realities that stem from the disciplinary differences in data cultures. In this paper we d...
Preprint
A Persistent Identifier (PID) is a widely used long-term unique reference to a digital object. Handle, one of the main persistent identifier schemes in use, implements a central global registry to resolve PIDs. Handle values vary in size and type, with no restrictions imposed on the user side. However, widespread use of Handle raises chal...
Article
Full-text available
Background Rice molecular genetics, breeding, genetic diversity, and allied research (such as rice-pathogen interaction) have adopted sequencing technologies and high-density genotyping platforms for genome variation analysis and gene discovery. Germplasm collections representing rice diversity, improved varieties, and elite breeding materials are...
Article
Libraries are seeing growing numbers of digitized textual corpora that frequently come with restrictions on their content. Computational analysis corpora that are large, while of interest to scholars, can be cumbersome because of the combination of size, granularity of access, and access restrictions. Efficient management of such a collection for g...
Article
Full-text available
Open science is prompting wide efforts to make data from research available for broader use. However, sharing data is complicated by important protections on the data (e.g., protections of privacy and intellectual property). The spectrum of options existing between data needing to be fully open access and data that simply cannot be shared at all is...
Preprint
Libraries are seeing growing numbers of digitized textual corpora that frequently come with restrictions on their content. Computational analysis corpora that are large, while of interest to scholars, can be cumbersome because of the combination of size, granularity of access, and access restrictions. Efficient management of such a collection for g...
Article
Full-text available
A research agenda for intelligent systems that will result in fundamental new capabilities for understanding the Earth system.
Conference Paper
Full-text available
In the business and research landscape of today, data analysis consumes public and proprietary data from numerous sources, and utilizes any one or more of popular data-parallel frameworks such as Hadoop, Spark and Flink. In the Data Lake setting these frameworks co-exist. Our earlier work has shown that data provenance in Data Lakes can aid with bo...
Preprint
Full-text available
Background Rice molecular genetics, breeding, genetic diversity, and allied research (such as rice-pathogen interaction) have adopted sequencing technologies and high-density genotyping platforms for genome variation analysis and gene discovery. Germplasm collections representing rice diversity, improved varieties and elite breeding materials are a...
Chapter
Full-text available
As Intelligent Transportation Systems (ITS) technologies mature, we can envision many scenarios where intelligent agents provide adaptable, dynamic information needed to make decisions in real time. Such information depends on both real-time and historical data. Knowing what data to use and how to compare past and present measurements is essential...
Article
Scientists' ability to synthesize and reuse long-tail scientific data lags far behind their ability to collect and produce these data. Many Earth Science Cyberinfrastructures enable sharing and publishing their data over the web using metadata standards. While profiling data attributes advances the Linked Data approach, it has become clear that bui...
Conference Paper
Full-text available
The project "Workset Creation for Scholarly Analysis and Data Capsules" is building an infrastructure where researchers have access to text processing tools that can then be used on a copyrighted set of digital data. The infrastructure is built on (1) the HathiTrust Research Center (HTRC) Data Capsule services that can be used to access the HathiTr...
Poster
Full-text available
Prototype Overview: Raw data from environmental sensors in Taiwan (temperature, humidity, pressure, particulate matter) are collected and published with persistent IDs assigned. PID-enabled data is available for analysis in Microsoft Azure.
Article
Sensor networks deployed in lakes and reservoirs, when combined with simulation models and expert knowledge from the global community, are creating deeper understanding of the ecological dynamics of lakes. However, the amount of data and the complex patterns in the data demand substantial compute resources and efficient data mining algorithms, both...
Conference Paper
Full-text available
The volumes of data in Big Data, their variety and unstructured nature, have had researchers looking beyond the data warehouse. The data warehouse, among other features, requires mapping data to a schema upon ingest, an approach seen as inflexible for the massive variety of Big Data. The Data Lake is emerging as an alternate solution for storing da...
Chapter
Full-text available
Perspectives on the varied challenges posed by big data for health, science, law, commerce, and politics. Big data is ubiquitous but heterogeneous. Big data can be used to tally clicks and traffic on web pages, find patterns in stock trades, track consumer preferences, identify linguistic correlations in large corpuses of texts. This book examines...
Conference Paper
Full-text available
An Agent Based Model (ABM) is a powerful tool for its ability to represent heterogeneous agents which through their interactions can reveal emergent phenomena. For this to occur though, the set of agents in an ABM has to accurately model a real world population to reflect its heterogeneity. But when studying human behavior in less well developed se...
Conference Paper
Full-text available
We conjecture that meaningful analysis of large-scale provenance can be preserved by analyzing provenance data in limited memory while the data is still in motion; that the provenance need not be fully resident before analysis can occur. As a proof of concept, this paper defines a stream model for reasoning about provenance data in motion for Big...
Conference Paper
The Data Lake is emerging as a Big Data storage and management solution which can store any type of data at scale and execute data transformations for analysis. Higher flexibility in storage increases the risk of Data Lakes becoming data swamps. In this paper we show how provenance contributes to data management within a Data Lake infrastructure. W...
Article
Multi-tenancy in cloud hosted NoSQL data stores is favored by cloud providers as it allows more effective resource sharing amongst different tenants thus lowering operating costs. A NoSQL provider will often present to each tenant a dedicated view of the store but then behind the scenes consolidate tenant access into a shared instance. This multi-t...
Chapter
Big Data in the humanities is a new phenomenon that is expected to revolutionize the process of humanities research. The HathiTrust Research Center (HTRC) is a cyberinfrastructure to support humanities research on big humanities data. The HathiTrust Research Center has been designed to make the technology serve the researcher to make the content ea...
Article
Full-text available
To stay competitive in today's data driven economy, enterprises large and small are turning to stream processing platforms to process high volume, high velocity, and diverse streams of data (fast data) as they arrive. Low-level programming models provided by the popular systems of today suffer from lack of responsiveness to change: enhancements req...
Conference Paper
Full-text available
Cloud hosted NoSQL data stores are for economic reasons often shared amongst multiple tenants simultaneously. The NoSQL provider consolidates multiple tenants' access into a shared NoSQL instance and provides a dedicated view for each tenant. This multi-tenancy has tenants' data and workloads coexisting in the same node, which under certain conditio...
Conference Paper
Full-text available
When the effort to curate and preserve data is made at the end of a project, there is little opportunity to leverage ongoing research work to reduce curation costs or conversely, to leverage curation efforts to improve research productivity. In the Sustainable Environment Actionable Data (SEAD) project, we have envisioned a more active approach to...
Article
Full-text available
Provenance captured from E-Science experimentation is often large and complex, for instance, from agent-based simulations that have tens of thousands of heterogeneous components interacting over extended time periods. The subject of study of my dissertation is the use of E-Science provenance at scale. My initial research studied the visualization o...
Article
Full-text available
Large-scale distributed systems are difficult to debug in the event of failure. Yet rapid fault diagnosis that pinpoints failures to the component level is critical to fast recovery. We introduce a statistical approach to fault diagnosis that utilizes a dependency graph of execution to automatically discover the most probable fault cause(s) at a co...
Conference Paper
Full-text available
As open linked data gains traction, vastly more information becomes available and discoverable online. The SEAD project (Sustainable Environments Actionable Data) wants to take advantage of the rich linked data landscape. SEAD needs information, about its researchers in the research areas around sustainable science (“people”), about data sets that...
Article
Full-text available
Data provenance captured from scientific applications is a critical precursor to data sharing and reuse. For researchers wanting to repurpose data, it is a source of information about the lineage and attribution of the data and this is needed in order to establish trust in a data set. Komadu is a standalone provenance capture and visualization syst...
Conference Paper
Application benchmarks are critical to establishing the performance of a new system or library. But benchmarking a system can be tricky and reproducing a benchmark result even trickier. Provenance can help. Referencing benchmarks and their results on similar platforms for collective comparison and evaluation requires capturing provenance related to...
Article
Mining frequent subsequences of patterns, or sequential pattern mining, has wide application in customer shopping sequence analysis, web log stream analysis, multi-modal behavioral studies, to name a few. To detect unknown, anomalous, and unexpected patterns from large-scale interval-based temporal data without complete a priori knowledge is challe...
Book
This book constitutes the revised selected papers of the 5th International Provenance and Annotation Workshop, IPAW 2014, held in Cologne, Germany in June 2014. The 14 long papers, 20 short papers and 4 extended abstracts presented were carefully reviewed and selected from 53 submissions. The papers include tools that enable provenance capture from...
Article
Data provenance is the lineage of a digital artifact or object. Its capture in workflow-controlled distributed applications is well studied but less is known about quality of provenance captured solely through existing control infrastructures (i.e., middleware frameworks used for high throughput computing). We study completeness of provenance in ca...
Article
Clouds are increasingly being used for running data-intensive scientific applications. However, science applications need to contend with the I/O and network performance characteristics of cloud environments. Additionally, managing data effectively and efficiently over these cloud resources is challenging due to the myriad storage choices with diffe...
Chapter
Cloud computing services are becoming increasingly viable for scientific model execution. As a leased computational resource, cloud computing enables a computational modeler at a smaller university to carry out sporadic large-scale experiments, and allows others to pay for CPU cycles as needed, without incurring high maintenance costs of a large co...
Conference Paper
NoSQL data stores see considerable attention today in big data, cloud hosted environments because of their fault tolerance, distribution and high availability. Shared NoSQL data stores are preferred for their ability to serve multiple tenants simultaneously which can improve resource utilization and lower management costs. Fair share in this settin...
Article
As digital data sources grow in number and size, they pose an opportunity for computational investigation by means of text mining, NLP, and other text analysis techniques. R is a popular and powerful text analytics tool; however, it needs to run in parallel and requires special handling to protect copyrighted content against full access (consumptio...
Article
Full-text available
As digital data sources grow in number and size, they pose an opportunity for computational investigation by means of text mining, natural language processing (NLP), and other text analysis techniques. In this paper we propose a virtual machine (VM) framework and methodology for non-consumptive text analysis. Using a remote VM model, the VM is conf...
Article
The MapReduce programming model has proven useful for data-driven high throughput applications. However, the conventional MapReduce model limits itself to scheduling jobs within a single cluster. As job sizes become larger, single-cluster solutions grow increasingly inadequate. We present a hierarchical MapReduce framework that utilizes computation...
Article
Data provenance, a form of metadata describing the life cycle of a data product, is crucial in the sharing of research data. Research data, when shared over decades, requires recipients to make a determination of both use and trust. That is, can they use the data? More importantly, can they trust it? Knowing the data are of high quality is one fact...
Article
The Research Data Alliance (RDA) uses Working Groups and Interest Groups to carry out its work. Groups form when a concerned community develops around a topic for which there are well defined issues, common goals, and an opportunity to create a framework for timely action. One year in, RDA has 26 Working Groups and Interest Groups whose activities...
Article
Bibliographic metadata is essential for digital library resource description. Especially as the size and number of bibliographic entities grows, high-quality metadata enables richer forms of digital library access, search, and use. Metadata records can be enriched through automated techniques. For example, a digital humanities scholar might use the...
Conference Paper
Data provenance is the lineage of an artifact or object. Provenance can provide a basis upon which data can be regenerated, and can be used to determine the quality of both the process and provenance itself. Provenance capture from workflows is comprised of capturing data dependencies as and when a workflow executes. We propose a layered provenance...
Article
Full-text available
With the interdependencies that exist between data in a scientific processing pipeline, the ability to track the provenance of the scientific process through multiple stages is necessary to determining the usability of the resulting data product. In this paper we study the capture of provenance from an existing NASA instrument ingest pipeline. Sinc...
Article
Big Data poses challenges for text analysis and natural language processing due to its characteristics of volume, veracity, and velocity of the data. The sheer volume in terms of numbers of documents challenges traditional local repository and index systems for large-scale analysis and mining. Computation, storage and data representation must work...
Conference Paper
Full-text available
MapReduce is an effective programming model for large-scale text and data analysis. Traditional MapReduce implementations, e.g., Hadoop, have the restriction that before any analysis can take place, the entire input dataset must be loaded into the cluster. This can introduce sizable latency when the data set is large, and when it is not possible to l...
Conference Paper
Full-text available
Researchers who use agent-based models (ABM) to model social patterns often focus on the model's aggregate phenomena. However, aggregation of individuals complicates the understanding of agent interactions and the uniqueness of individuals. We develop a method for tracing and capturing the provenance of individuals and their interactions in the Net...
Conference Paper
HPC platforms show good success for predominantly compute-intensive jobs; however, data-intensive jobs still struggle on HPC platforms, as large amounts of concurrent data movement from I/O nodes to compute nodes can easily saturate the network links. MapReduce, the "moving computation to data" paradigm for many pleasingly parallel applications, assu...
Conference Paper
Full-text available
Geometric distortions are among the major challenging issues in the analysis of historical document images. Such distortions appear as arbitrary warping, folds and page curl, and have detrimental effects upon recognition (OCR) and readability. While there are many dewarping techniques discussed in the literature, there exists no standard method by...
Conference Paper
In this poster we will present the SEAD project [1] and its prototype software and describe how SEAD approaches long-term data preservation and access through multiple partnerships and how it supports sustainability science researchers in their data management, analysis and archival needs. SEAD's initial prototype system currently is being tested b...
Conference Paper
Digital repositories are grappling with an influx of scientific data brought about by the well publicized "data deluge" in science, business, and society. One particularly perplexing problem is the long-term archival and reuse of complex data sets. This paper presents an integrated approach to data discovery over heterogeneous data resources in soc...
Conference Paper
Full-text available
Academic libraries are increasingly looking to provide services that allow their users to work with digital collections in innovative ways, for example, to analyze large volumes of digitized collections. The HathiTrust Research Center (HTRC) is a large collaborative that provides an innovative research infrastructure for dealing with massive amount...
Conference Paper
Full-text available
Synthetic aperture radar Interferometry (InSAR) is a significant 3D imaging technique to generate a Digital Elevation Model (DEM). The phase difference between the complex SAR images displays an interference fringe pattern from which the elevation of any point in the imaged terrain can be determined. Phase unwrapping is the most critical step in th...
Conference Paper
The volume and complexity of data produced and analyzed in scientific collaborations is growing exponentially. It is important to track scientific data-intensive analysis workflows to provide context and reproducibility as data is transformed in these collaborations. Provenance addresses this need and aids scientists by providing the lineage or his...
Conference Paper
Full-text available
This is a position paper for the SEAD DataNet Prototype presented at the CASC Research Data Management Implementation Symposium held on March 13-14, 2013 in Arlington, VA.
Conference Paper
Cloud computing platforms are drawing increasing attention of the scientific research communities. By providing a framework to lease computation resources, cloud computing enables the scientists to carry out large-scale experiments in a cost-effective fashion without incurring high setup and maintenance costs of a large compute system. In this pape...
Conference Paper
As new data products of research increasingly become the product or output of complex processes, the lineage of the resulting products takes on greater importance as a description of the processes that contributed to the result. Without adequate description of data products, their reuse is lessened. The act of instrumenting an application for prove...
Conference Paper
Full-text available
Major research universities are grappling with their response to the deluge of scientific data emerging through research by their faculty. Many are looking to their libraries and the institutional repository as a solution. Scientific data introduces substantial challenges that the document-based institutional repository may not be suited to deal wi...
Article
Full-text available
Provenance of scientific data will play an increasingly critical role as scientists are encouraged by funding agencies and grand challenge problems to share and preserve scientific data. But it is foolhardy to believe that all human processes, particularly as varied as the scientific discovery process, will be fully automated by a workflow system....