Catharine van Ingen's research while affiliated with Lawrence Berkeley National Laboratory and other places

Publications (48)

Article
Full-text available
The following authors were omitted from the original version of this Data Descriptor: Markus Reichstein and Nicolas Vuichard. Both contributed to the code development and N. Vuichard contributed to the processing of the ERA-Interim data downscaling. Furthermore, the contribution of the co-author Frank Tiedemann was re-evaluated relative to the co...
Article
Full-text available
The FLUXNET2015 dataset provides ecosystem-scale data on CO2, water, and energy exchange between the biosphere and the atmosphere, and other meteorological and biological measurements, from 212 sites around the globe (over 1500 site-years, up to and including year 2014). These sites, independently managed and operated, voluntarily contributed their...
Article
The Internet, Web 2.0 and Social Networking technologies are enabling citizens to actively participate in ‘citizen science’ projects by contributing data to scientific programmes via the Web. However, the limited training, knowledge and expertise of contributors can lead to poor quality, misleading or even malicious data being submitted. Subsequent...
Conference Paper
Moderate Resolution Imaging Spectroradiometer (MODIS), the key instrument aboard NASA’s Terra and Aqua satellites, continuously generates data as the satellites cover the entire surface of the Earth every one to two days. These data are important to many scientific analyses; however, data procurement and processing can be challenging and cumbersome for u...
Book
Full-text available
Scientific questions of today are now more global than ever before. The answers to these questions are buried within multiple disciplines and across a diverse range of scientists and institutions. The expanse and complexity of data required by researchers often exceed the means of a single scientist. Data sharing in the form of its distributed coll...
Chapter
The Health-e-Waterways Project is a multi-disciplinary collaboration between the University of Queensland, Microsoft Research and the South East Queensland Healthy Waterways Partnership (SEQ-HWP). This project develops the underlying technological framework and set of services to enable streamlined access to the expanding collection of real-time, n...
Article
Full-text available
We propose the Breathing Earth System Simulator (BESS), an upscaling approach to quantify global gross primary productivity and evapotranspiration using MODIS with a spatial resolution of 1–5 km and a temporal resolution of 8 days. This effort is novel because it is the first system that harmonizes and utilizes MODIS Atmosphere and Land products...
Conference Paper
To perform computational experiments at greater scale and in less time, enterprises are increasingly looking to dynamically expand their computing capabilities through the temporary addition of cloud resources (aka "cloudbursting"). Computational infrastructure can be dismantled in minutes with no long-term capital investments. However, research is...
Article
The Sloan Digital Sky Survey established the use of relational databases for the scans and cone searches common to astronomy analyses. The Pan-STARRS project scales up SDSS by melding HPC clusters with hierarchical and spatially partitioned distributed databases to meet the challenge of near realtime handling of the multiple data surveys generated...
Article
We live in an era in which scientific discovery is increasingly driven by data exploration of massive datasets. Scientists today are envisioning diverse data analyses and computations that scale from the desktop to supercomputers, yet often have difficulty designing and constructing software architectures to accommodate the heterogeneous and often...
Article
Full-text available
The Health-e-Waterways Project is a multi-disciplinary collaboration between the University of Queensland, Microsoft Research and the South East Queensland Healthy Waterways Partnership (SEQ-HWP). This project develops the underlying technological framework and set of services to enable streamlined access to the expanding collection of real-time, n...
Conference Paper
Full-text available
Workflows are commonly used to model data intensive scientific analysis. As computational resource needs increase for eScience, emerging platforms like clouds present additional resource choices for scientists and policy makers. We introduce BReW, a tool that enables users to make rapid, high-level platform selections for their workflows using limited wor...
Article
The combination of low-cost in situ sensors, internet connectivity, and commodity computing is changing earth science research. This unprecedented data availability is enabling science synthesis: studies that span disciplines, bridge local field, modeling, and remote sensing methodologies, and/or span local, regional, and global scales. Data discover...
Article
Carbon-climate, like other environmental sciences, has been changing. Large-scale synthesis studies are becoming more common. These synthesis studies are often conducted by science teams that are geographically distributed and on data sets that are global in scale. A broad array of collaboration and data analytics tools are now available that could...
Article
It can be natural to believe that many of the traditional issues of scale have been eliminated or at least greatly reduced via cloud computing. That is, if one can create a seemingly well functioning cloud application that operates correctly on small or moderate-sized problems, then the very nature of cloud programming abstractions means that the s...
Conference Paper
The widely discussed scientific data deluge creates a need to computationally scale out eScience applications beyond the local desktop and cope with variable loads over time. Cloud computing offers a scalable, economic, on-demand model well matched to these needs. Yet cloud computing creates gaps that must be crossed to move existing science applic...
Conference Paper
The combination of low-cost sensors, low-cost commodity computing, and the Internet is enabling a new era of data-intensive science. The dramatic increase in this data availability has created a new challenge for scientists: how to process the data. Scientists today are envisioning scientific computations on large scale data but are having difficul...
Conference Paper
Full-text available
The growing amount of scientific data from sensors and field observations is posing a challenge to "data valets" responsible for managing them in data repositories. These repositories, built on commodity clusters, need to reliably ingest data continuously and ensure its availability to a wide user community. Workflows provide several benefits t...
Article
The combination of internet availability of remote sensing data and emerging cloud computing technologies is enabling new environmental science research. Data products from satellites such as MODIS or GRACE augment ground-based measurements from networks such as Fluxnet or WATERS. Cloud computing technologies such as Microsoft's Azure or Amazon's...
Conference Paper
Full-text available
Many of today's large-scale scientific projects attempt to collect data from a diverse set of sources. The traditional campaign-style approach to "synthesis" efforts gathers data through a single concentrated effort, and the data contributors know in advance exactly who will use their data and why. At even moderate scales, the cost and time require...
Article
Scientific workflows have gained popularity for modeling and executing in silico experiments by scientists for problem-solving. These workflows primarily engage in computation and data transformation tasks to perform scientific analysis in the Science Cloud. Increasingly, workflows are gaining use in managing scientific data as they arrive fro...
Article
Many environmental scientists today are attempting to assemble, use, share and save data from a diverse set of sources. These "synthesis" efforts are often interdisciplinary and blend data from ground-based sensors, satellites, field observations, and the literature. Today, many of these efforts are campaigns where the data are gathered, processed,...
Conference Paper
Full-text available
A sensor network data gathering and visualization infrastructure is demonstrated, comprising Global Sensor Networks (GSN) middleware and Microsoft SensorMap. Users are invited to actively participate in the process of monitoring real-world deployments and can inspect measured data in the form of contour plots overlaid onto a high resolution map...
Conference Paper
Full-text available
The Health-e-Waterways Project is a collaboration between the University of Queensland, Microsoft Research and the South East Queensland Healthy Waterways Partnership (SEQ-HWP) (a consortium of over 60 local government, state agency, universities, community and environmental organizations). The aim of the project is to develop a highly innovative f...
Article
A sensor network data gathering and visualization infrastructure is demonstrated, comprising Global Sensor Networks (GSN) middleware and Microsoft SensorMap. Users are invited to actively participate in the process of monitoring real-world deployments and can inspect measured data in the form of contour plots overlaid onto a high resolution map...
Conference Paper
Full-text available
Data intensive computing presents a significant challenge for traditional supercomputing architectures that maximize FLOPS, since CPU speed has surpassed the IO capabilities of HPC systems and Beowulf clusters. We present the architecture for a three-tier commodity component cluster designed for a range of data intensive computations operating on petasc...
Conference Paper
Big data presents new challenges to both cluster infrastructure software and parallel application design. We present a set of software services and design principles for data intensive computing with petabyte data sets, named GrayWulf. These services are intended for deployment on a cluster of commodity servers similar to the well-known Beowulf clu...
Conference Paper
Scientific workflows have become an archetype to model in silico experiments in the Cloud by scientists. There is a class of workflows that are used by "data valets" to prepare raw data from scientific instruments into a science-ready form for use by scientists. These share data-intensive traits with traditional scientific workflows, yet differ...
Article
The Health-e-Waterways Project is a three-way collaboration between the University of Queensland, Microsoft Research and the Healthy Waterways Partnership (SEQ-HWP) (over 60 local government, state agency, universities, community and environmental organizations). The project is developing a highly innovative framework and set of servic...
Article
A gap-filled, quality-assessed eddy covariance dataset has recently become available for the AmeriFlux network. This dataset uses standard processing and produces commonly used science variables. This shared dataset enables robust comparisons across different analyses. Of course, there are many remaining questions. One of those is how to define 'dur...
Article
The advertised SATA bit error rate of one error in 10 terabytes is frightening. We moved 2 PB through low-cost hardware and saw five disk read error events, several controller failures, and many system reboots caused by security patches. We conclude that SATA uncorrectable read errors are not yet a dominant system-fault source - they happen, but ar...
Article
Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation - one of the operational issues that can affect the performance and/or m...
Article
Fragmentation leads to unpredictable and degraded application performance. While these problems have been studied in detail for desktop filesystem workloads, this study examines newer systems such as scalable object stores and multimedia repositories. Such systems use a get/put interface to store objects. In principle, databases and filesystems can...
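The get/put interface this abstract refers to can be illustrated with a minimal sketch (an in-memory stand-in of my own, not the systems measured in the paper):

```python
class ObjectStore:
    """Minimal get/put object store of the kind the abstract describes.
    Real systems back this interface with a filesystem or database,
    which is where fragmentation effects arise."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, blob: bytes) -> None:
        # Whole-object write; a put for an existing key replaces the object.
        self._objects[key] = blob

    def get(self, key: str) -> bytes:
        # Whole-object read.
        return self._objects[key]
```

Fragmentation studies of this kind compare how filesystem-backed and database-backed implementations of such an interface degrade as objects are repeatedly overwritten with different sizes.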
Article
Disk capacity has grown remarkably over the last ten years, but disk accesses, particularly random accesses, have not kept pace. The combination means that disks deliver fewer accesses (input-output operations, or IOPs) per stored byte of application data. Either the full disk capacity cannot be used or the files on the disks must be cooler - re...
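The capacity-versus-access trend described in this abstract can be made concrete with a back-of-envelope calculation (the drive figures below are illustrative assumptions, not numbers from the paper):

```python
def iops_per_gb(capacity_gb: float, random_iops: float) -> float:
    """Random accesses per second available per stored gigabyte."""
    return random_iops / capacity_gb

# Illustrative figures: an older 70 GB drive vs. a newer 4 TB drive,
# each delivering roughly 100 random IOPs (seek times barely improved).
old_drive = iops_per_gb(70, 100)     # ~1.43 IOPs per stored GB
new_drive = iops_per_gb(4000, 100)   # ~0.025 IOPs per stored GB
print(old_drive / new_drive)         # accesses per stored byte fell ~57x
```

With these assumed numbers, filling the larger disk means each stored byte can be touched far less often, which is exactly why the data must be "cooler" or the capacity left unused.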
Article
Storage hardware costs have often been quoted as price per capacity ($/GB). A better metric is total cost per user as deployed. This balances performance, capacity and packaging constraints. That being said, it is important to recognize that hardware is only one part of the total cost of ownership for storage. Storage is now cheap and people remai...
Article
This paper investigates the most efficient way to read and write large sequential files using the Windows NT™ 4.0 File System. The study explores the performance of Intel Pentium Pro™ based memory and IO subsystems, including the processor bus, the PCI bus, the SCSI bus, the disk controllers, and the disk media. We provide details of the overhe...
Article
Scientific workflows increasingly rely on diverse resource platforms to satisfy computation and data needs. Scientific users are often not able to leverage emerging new platforms such as cloud computing and new distributed sites due to overheads and cost associated with resource platform decisions. Thus there is a need for rapid, high-level analys...
Article
Systems like SciScope and Data Access System for Hydrology (DASH) rely on data catalogs to facilitate data discovery. These catalogs describe several nation-wide data repositories that are important for scientists including US Geological Survey's National Water Information System (NWIS), Environmental Protection Agency's STOrage and RETrieval Syste...
Article
The widely discussed scientific data deluge creates not only a need to computationally scale an application from a local desktop or cluster to a supercomputer, but also the need to cope with variable data loads over time. Cloud computing offers a scalable, economic, on-demand model well matched to the evolving eScience needs. Yet cloud computing cr...
Article
The Fluxnet synthesis dataset originally compiled for the La Thuile workshop contained approximately 600 site years. Since the workshop, several additional site years have been added and the dataset now contains over 920 site years from over 240 sites. A data refresh update is expected to increase those numbers in the next few months. The ancillary...
Article
Hydrologic and other environmental scientists are beginning to use commercial database technologies to locate, assemble, analyze, and archive data. A data model that is both capable of handling the diversity of data and simple enough to be used by non-database professionals remains an open question. Over the past three years, we have been working i...
Article
High-speed sequential file access is important for bulk data operations typically found in utility, multimedia, data mining, and scientific applications. High-speed sequential IO is also important in the startup of interactive applications. Minimizing IO overhead and maximizing bandwidth frees power to process the data. The goals of this study were...
Article
The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) is the next generation of digital sky surveys that builds on the success of the Sloan Digital Sky Survey (SDSS). The Pan-STARRS consortium is centered at the University of Hawai'i, Institute for Astronomy, and includes nine other institutions worldwide. The next generation syst...
Article
Full-text available
The pervasive availability of scientific data from sensors and field observations is posing a challenge to data valets responsible for accumulating and managing them in data repositories. Science collaborations, big and small, are standing up repositories built on commodity clusters that need to reliably ingest data constantly and ensure its availabilit...
Article
Full-text available
“Time travel” in the storage system is accessing past stor...

Citations

... The locations of the selected sites are shown in Fig. 1. The FLUXNET data were all processed following standard protocols, including friction velocity filtering, gap filling and flux partitioning (Pastorello et al., 2020). We used the variable "GPP_NT_VUT_REF" in the FLUXNET 2015 dataset as the GPP data, which was produced based on the night-time respiration partitioning method (Reichstein et al., 2005). ...
... While differences in the amounts of available data likely add bias in favor of trends seen at sites with longer histories, as this network matures and the sites accumulate more data this problem will become less prominent. This is a common situation in cross-site eddy covariance analysis work, as most larger-scale studies require utilizing data from sites with a variety of histories and record lengths (Chu et al., 2021; Pastorello et al., 2020). The following crops were grown in sites included in the analysis, but not all crops were grown at all sites: maize (Zea mays L., 8 sites), soybean (Glycine max L., 5 sites), alfalfa (Medicago sativa L., 2 sites), garbanzo (Cicer arietinum L., 1 site). *AmeriFlux designations given in parentheses when available. ...
... First, as shown in Figure 1.2a, the surrounding environment of a wireless camera imposes great risks on the camera itself. Unpredictable weather conditions such as strong winds, avalanches, or rock slides can damage and destroy a camera. [Figure caption: the wireless sensor station at Davos [8], destroyed by an avalanche; before and after snow melt.] ...
... Quantitative evaluation can provide additional information on factors preventing citizen scientists from providing data that is fit for purpose. An example of quantitative evaluation is the study by Hunter et al. (2013), who investigated the trustworthiness of contributors to provide quality data through evaluation of various trust metrics. Hence, there is a need for a more systematic investigation of the relationship between citizen scientists' attributes and the quality of the rainfall data that they collect. ...
... MODIS data are widely used by the community [19][20][21][22][23]. In our case study, we focus on the solar radiation at the Earth's surface, which is computed based on selected MODIS products. The Earth's solar radiation is an intermediate product needed for computing evapotranspiration [3], an important part of the water cycle and responsible for cloud formation and precipitation. ...
... These approaches are based on a data warehouse database structure like those used, e.g., by Microsoft [12,13] or [14], providing generic data models that support extending the controlled vocabulary. In both cases many metadata fields are mandatory in the provided schema. ...
... This is made possible by the hard drive's DMA capability. The validity of this approach is shown by a small test program that reaches maximum rated drive throughput and by [11]. ...
... The number and duration of CO 2 flux observations from nonforest flux-tower stations throughout the world have grown exponentially during the last decade. Through the La Thuile synthesis process of the FLUXNET network (Agarwal et al. 2008; Baldocchi 2008b) and other cooperative initiatives, these observations are now available for comparative analysis and generalization. In this publication, we present a synthesis of results from tower CO 2 flux measurements at 118 tower sites representing grassland, cropland, shrubland, savanna, and wetland ecosystems of the world. ...
... We gained additional insight from those who had modified or extended ODM 1.1.1 to support additional types of data or new functionality. For example, Beran et al. (2008) adopted many of the concepts from ODM 1.1.1, but suggested a new structure with a core data model and a set of profiles that extended the core to provide functionality to support additional data types. ...
... Pan-STARRS (Kaiser et al. 2002; Burgett 2012) is designed to collect data at a rate of 3–10 terabytes (TB) per night. The observed data of Pan-STARRS are managed in a distributed relational Microsoft SQL Server database, which is spatially partitioned into slice databases using a hash function over the spatial location (RA and Dec) of each detection (Simmhan et al. 2011). The raw imaging data of LSST are expected to be about 15 TB per night (Ivezic et al. 2008). ...
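The spatial hash partitioning this excerpt describes can be sketched roughly as follows (a simplified illustration of the idea only; the actual Pan-STARRS zoning and hashing scheme in SQL Server is more involved, and the `n_slices` count and 1-degree cell size are assumptions of mine):

```python
def slice_for_detection(ra_deg: float, dec_deg: float, n_slices: int = 16) -> int:
    """Map a detection's sky position (RA, Dec in degrees) to the index
    of the slice database that stores it. Illustrative sketch only."""
    # Quantize to 1-degree cells so detections in the same patch of sky
    # land in the same slice, then hash the cell to pick a slice
    # deterministically.
    cell = (int(ra_deg // 1.0), int(dec_deg // 1.0))
    return hash(cell) % n_slices
```

A query for a sky region would compute the cells the region covers and route to just the slices those cells hash to, while the hashing spreads unrelated cells across slices for load balance.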