-
K.G. Helmer,
J.L. Ambite,
J. Ames,
R. Ananthakrishnan,
G. Burns,
A. Chervenak,
I. Foster,
L. Liming,
D.B. Keator,
F. Macciardi,
R. Madduri,
J-P. Navarro,
S.G. Potkin,
B. Rosen,
S. Ruffins,
R. Schuler,
J.A. Turner,
A. Toga,
S. Williams, C. Kesselman
Journal of the American Medical Informatics Association. 01/2011; 18:416-422.
-
ABSTRACT: Distributed computing systems employ replication to improve overall system robustness, scalability, and performance. A replica location service (RLS) offers a mechanism to maintain and provide information about physical locations of replicas. This paper defines a design framework for RLSs that supports a variety of deployment options. We describe the RLS implementation that is distributed with the Globus toolkit and is in production use in several grid deployments. Features of our modular implementation include the use of soft-state protocols to populate a distributed index and Bloom filter compression to reduce overheads for distribution of index information. Our performance evaluation demonstrates that the RLS implementation scales well for individual servers with millions of entries and up to 100 clients. We describe the characteristics of existing RLS deployments and discuss how RLS has been integrated with higher-level data management services.
IEEE Transactions on Parallel and Distributed Systems 10/2009; · 1.40 Impact Factor
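The abstract above mentions Bloom filter compression for distributing index information. As a rough illustration of that general technique only (not the Globus RLS implementation), the following minimal sketch summarizes a set of logical names in a fixed-size bit array. The class name, parameters, hashing scheme, and example names are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array summarizing a set of strings (illustrative only)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical logical names held by one local catalog.
local_names = ["lfn://example/run1.dat", "lfn://example/run2.dat"]
summary = BloomFilter()
for name in local_names:
    summary.add(name)
```

In this sketch, a catalog would send the 128-byte `summary.bits` array instead of the full list of names. The trade-off is that `might_contain` can return false positives but never false negatives.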
-
ABSTRACT: Resources in distributed infrastructure such as the grid are typically autonomously managed and shared across a distributed set of end users. These characteristics result in a fundamental conflict: resource providers optimize for throughput and utilization which coupled with a stochastic multiuser workload results in nondeterministic best effort service for any one application. This conflicts with the user who wants to optimize end-to-end application performance but is constrained by the best effort service offering. Resource reservations can be used to obtain more predictable application behaviors but they are generally not allowed due to perceived impact on the other users and overall resource utilization. In this paper, we examine two strategies for integrating reservations within the resource management fabric that address these concerns by either minimizing the adverse impact of a reservation on the other users or enabling a resource provider to recoup losses through a differentiated pricing mechanism. Correspondingly, we also present algorithms for optimizing the application performance when resources provide automated reservations using the previously developed strategies. These algorithms use a cost based model to identify the set of reservations to be made for the application in order to optimize performance while minimizing the cost for the reservations. The cost based model allows the users to do a tradeoff between the application performance and resulting resource costs. Using trace-based simulations and task graph structured applications, we compare the application performance and resource cost when it is executed using reservations to that when only best effort service is available. We show the approach incorporating reservations can provide superior performance for the application at a price that the user can predetermine. 
Also, the benefits of using the reservation-based approach become more pronounced when the resources are under high utilization and/or the applications have significant resource requirements.
IEEE Systems Journal 04/2009; · 0.92 Impact Factor
-
A Baranovski,
K Beattie,
S Bharathi,
J Boverhof,
J Bresnahan,
A Chervenak,
I Foster,
T Freeman,
D Gunter,
K Keahey, C Kesselman,
R Kettimuthu,
N Leroy,
M Link,
M Livny,
R Madduri,
G Oleynik,
L Pearlman,
R Schuler,
B Tierney
ABSTRACT: The Center for Enabling Distributed Petascale Science is developing services to enable researchers to manage large, distributed datasets. The center projects focus on three areas: tools for reliable placement of data, issues involving failure detection and failure diagnosis in distributed systems, and scalable services that process requests to access data.
Journal of Physics Conference Series 08/2008; 125(1):012068.
-
ABSTRACT: As a variety of science applications are integrated with large-scale HPDC (high performance distributed computing) technologies, timely resource allocation is revealed as a critical requirement to be considered. This paper introduces a new HPDC resource management paradigm named resource slot which defines a network of logical machines across time and space. A resource slot is not only a resource programming target but also a virtualized resource provisioning framework for a variety of resource management paradigms by encapsulating the resource management complexity. Especially, we present a resource provisioning technique named guided redundant submission (GRS), which probabilistically guarantees a timely resource slot allocation. Experimental results performed against 8 clusters in production show that about 5 redundant resources per slot can secure slot allocation with up to 36 logical machines, each cluster having an availability probability as low as 0.25 and the target success probability of slot allocation is 0.95.
Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on; 06/2008
-
ABSTRACT: Providing QoS (quality of service) in batch resources against the uncertainty of resource availability due to the space-sharing nature of scheduling policies is a critical capability required for high-performance computing. This paper introduces a technique called personal cluster which reserves a partition of batch resources on user's demand in a best-effort manner. A personal cluster provides a private cluster dedicated to the user during a user-specified time period by installing a user-level resource manager on the resource partition. This technique not only enables cost-effective resource utilization and efficient task management but also provides the user a uniform interface to heterogeneous resources regardless of local resource management software. A prototype implementation using a PBS batch resource manager and Globus Toolkits based on Web services shows that the overhead of instantiating a personal cluster of medium size is small, which is just about 1 minute for a personal cluster having 32 processors.
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on; 05/2008
-
ABSTRACT: Application scheduling studies on large-scale shared resources have advocated the use of resource provisioning in the form of advance reservations for providing predictable and deterministic quality of service to applications. Resource scheduling studies however have shown the adverse impact of advance reservations in the form of reduced utilization and increased response time of the resources. Thus, resource providers either disallow reservations or impose restrictions such as minimum notice periods and this reduces the effectiveness of reservations as the means of allocating desired resources at a desired time. In this paper, we suggest adaptive pricing as an alternative for allowing reservation of resources. The price charged for allowing a reservation is based directly on the impact that the reservation has on other users sharing the resource. Using trace-based simulations, we show that adaptive pricing allows users to make reservations at the desired time while making it more expensive than best effort service. Thus, users are induced to make the correct choice between reservations and best-effort service based on their real needs. Moreover, this pricing scheme is more cost effective and sensitive to the system load as compared to a flat pricing scheme and encourages load balancing across resources.
Grid Computing, 2007 8th IEEE/ACM International Conference on; 10/2007
-
ABSTRACT: In this paper, we present algorithms for Grid resource provisioning that employ agreement-based resource management. These algorithms allow user-level resource allocation and scheduling of applications that are structured as a precedence-constrained set of tasks. We present a provisioning model where the resource availability in the Grid can be enumerated as a set of slots. A slot is defined as a number of processors available from a certain start time for a certain duration at a certain cost. Using a cost model that combines the cost of resource allocation and the expected application runtime, we evaluate the performance of the Min-Min and of the Genetic algorithm (GA)-based heuristics for a range of synthetic applications. We show that the GA paired with a list scheduling algorithm can obtain significantly better solutions than the Min-Min heuristic alone.
e-Science and Grid Computing, 2006. e-Science '06. Second IEEE International Conference on; 01/2007
-
ABSTRACT: The study and creation of the infrastructure required to enable system-level science--the integration of diverse sources of knowledge about the constituent parts of a complex system with the goal of obtaining an understanding of the system's properties as a whole--is becoming increasingly important, spawning new knowledge in a variety of fields at a rapid pace.
Computer 12/2006; · 1.47 Impact Factor
-
D E Middleton,
D E Bernholdt,
D Brown,
M Chen,
A L Chervenak,
L Cinquini,
R Drach,
P Fox,
P Jones, C Kesselman,
I T Foster,
V Nefedova,
A Shoshani,
A Sim,
W G Strand,
D Williams
ABSTRACT: With support from the U.S. Department of Energy's Scientific Discovery through Advanced Computing (SciDAC) program, we have developed and deployed the Earth System Grid (ESG) to make climate simulation data easily accessible to the global climate modelling and analysis community. ESG currently has 2500 registered users and manages 160 TB of data in archives distributed around the nation. From this past year alone, more than 200 scientific journal articles have been published from analyses of data delivered by the ESG.
Journal of Physics Conference Series 09/2006; 46(1):510.
-
ABSTRACT: Scientific applications require sophisticated data management capabilities. We present the design and implementation of a data replication service (DRS), one of a planned set of higher-level data management services for Grids. The capabilities of the DRS are based on the publication capability of the lightweight data replicator (LDR) system developed for the LIGO Scientific Collaboration. We describe LIGO publication requirements and LDR functionality. We also describe the design and implementation of the DRS in the Globus Toolkit Version 4.0 environment and present performance results.
Grid Computing, 2005. The 6th IEEE/ACM International Workshop on; 12/2005
-
ABSTRACT: One of the criteria for the Grid infrastructure is the ability to share resources with nontrivial qualities of service. However, sharing resources in Grids is complicated in that it requires the ability to bridge the differing policy requirements of the resource owners to create a consistent cross-organizational policy domain that delivers the necessary capability to the end user while respecting the policy requirements of the resource owner. Further complicating the management of Grid resources is the need to coordinate resource usage, the diversity of resource types and the variety of different management modes that may be used. We present a unifying resource management framework in which we can address these issues. The fundamental underlying concept in this framework is the representation of various resource management activities in terms of an agreement. Agreements abstract local management policy by representing an underlying resource strictly in terms of policy terms which it is willing to assert, and in doing so provides the basis for building a variety of alternative Grid resource management strategies. We introduce the concepts of agreement based resource management. We present a general agreement model and examine current resource management systems in the context of this model. We then discuss how agreement based resource management is being used as the basis for standards activities and next generation resource management services.
Proceedings of the IEEE 04/2005; · 6.81 Impact Factor
-
D. Bernholdt,
S. Bharathi,
D. Brown,
K. Chanchio,
M. Chen,
A. Chervenak,
L. Cinquini,
B. Drach,
I. Foster,
P. Fox,
J. Garcia, C. Kesselman,
R. Markel,
D. Middleton,
V. Nefedova,
L. Pouchard,
A. Shoshani,
A. Sim,
G. Strand,
D. Williams
ABSTRACT: Understanding the Earth's climate system and how it might be changing is a preeminent scientific challenge. Global climate models are used to simulate past, present, and future climates, and experiments are executed continuously on an array of distributed supercomputers. The resulting data archive, spread over several sites, currently contains upwards of 100 TB of simulation data and is growing rapidly. Looking toward mid-decade and beyond, we must anticipate and prepare for distributed climate research data holdings of many petabytes. The Earth System Grid (ESG) is a collaborative interdisciplinary project aimed at addressing the challenge of enabling management, discovery, access, and analysis of these critically important datasets in a distributed and heterogeneous computational environment. The problem is fundamentally a Grid problem. Building upon the Globus toolkit and a variety of other technologies, ESG is developing an environment that addresses authentication, authorization for data access, large-scale data transport and management, services and abstractions for high-performance remote data access, mechanisms for scalable data replication, cataloging with rich semantic and syntactic information, data discovery, distributed monitoring, and Web-based portals for using the system.
Proceedings of the IEEE 04/2005; · 6.81 Impact Factor
-
ABSTRACT: We present the implementation and performance of a ReplicaSet service based on the specification developed in the OGSA Data Replication Services (OREP) Working Group of the Global Grid Forum. This standard is based on the open grid services infrastructure. The ReplicaSet service aggregates information about replicated data items and enforces policies for authorization, replica semantics and consistency.
Grid Computing, 2004. Proceedings. Fifth IEEE/ACM International Workshop on; 12/2004
-
ABSTRACT: Recent scientific and engineering advances increase the demands on tools for high performance interactive visual exploration of large-scale, multi-dimensional simulation and sensor-based datasets. For example, earthquake scientists can now study earthquake phenomena in detail via "first principle," physics-based, large-scale simulations in a time-volumetric space. Interactive visualization benefits the iterative scientific process to extract information from data and help scientists adapt their methods. Single-system visualization software running on high-end commodity machines can no longer sustain interactive browsing of these large science data due to their limited I/O and processing capabilities. A distributed and incremental approach is needed, to allow selective filtering of the parts of the data that the scientist wishes to view.
08/2004;
-
ABSTRACT: Data sets being managed in grid environments today are growing at a rapid rate, expected to reach 100s of petabytes in the near future. Managing such large data sets poses challenges for efficient data access, data publication and data discovery. In this paper we focus on the data publication and discovery process through the use of descriptive metadata. This metadata describe the properties of individual data items and collections. We discuss issues of metadata services in service rich environments, such as the grid. We describe the requirements and the architecture for such services in the context of grid and the available grid services. We present a data model that can capture the complexity of the data publication and discovery process. Based on that model we identify a set of interfaces and operations that need to be provided to support metadata management. We present a particular implementation of a grid metadata service, basing it on existing grid services technologies. Finally we examine alternative implementations of that service.
Scientific and Statistical Database Management, 2004. Proceedings. 16th International Conference on; 07/2004
-
ABSTRACT: We describe the implementation and evaluate the performance of a replica location service that is part of the Globus Toolkit Version 3.0. A replica location service (RLS) provides a mechanism for registering the existence of replicas and discovering them. Features of our implementation include the use of soft state update protocols to populate a distributed index and optional Bloom filter compression to reduce the size of these updates. Our results demonstrate that RLS performance scales well for individual servers with millions of entries and up to 100 requesting threads. We also show that the distributed RLS index scales well when using Bloom filter compression for wide area updates.
High Performance Distributed Computing, 2004. Proceedings. 13th IEEE International Symposium on; 07/2004
-
ABSTRACT: Earthquake engineers have traditionally investigated the behavior of structures with either computational simulations or physical experiments. Recently, a new hybrid approach has been proposed that allows tests to be decomposed into independent substructures that can be located at different test facilities, tested separately, and integrated via a computational simulation. We describe a grid-based architecture for performing such novel distributed hybrid computational/physical experiments. We discuss the requirements that underlie this extremely challenging application of grid technologies, describe our architecture and implementation, and discuss our experiences with the application of this architecture within an unprecedented earthquake engineering test that coupled large-scale physical experiments in Illinois and Colorado with a computational simulation. Our results point to the remarkable impacts that grid technologies can have on the practice of engineering, and also contribute to our understanding of how to build and deploy effective grid applications.
High Performance Distributed Computing, 2004. Proceedings. 13th IEEE International Symposium on; 07/2004
-
Foster I,
J. Gieraltowski,
S. Gose,
N Maltsev,
E May,
A Rodriguez,
D. Sulakhe,
A. Vaniachine,
J Shank,
S Youssef, [......],
D Bradley,
P. Couvares,
A. De Smet,
C. Kireyev,
E. Paulson,
A Roy,
S. Koranda,
B. Moe,
B Brown,
P Sheldon
ABSTRACT: The Grid2003 Project has deployed a multivirtual organization, application-driven grid laboratory ("Grid3") that has sustained for several months the production-level services required by physics experiments of the Large Hadron Collider at CERN (ATLAS and CMS), the Sloan Digital Sky Survey project, the gravitational wave search experiment LIGO, the BTeV experiment at Fermilab, as well as applications in molecular structure analysis and genome analysis, and computer science research projects in such areas as job and data scheduling. The deployed infrastructure has been operating since November 2003 with 27 sites, a peak of 2800 processors, work loads from 10 different applications exceeding 1300 simultaneous jobs, and data transfers among sites of greater than 2 TB/day. We describe the principles that have guided the development of this unique infrastructure and the practical experiences that have resulted from its creation and use. We discuss application requirements for grid services deployment and configuration, monitoring infrastructure, application performance, metrics, and operational experiences. We also summarize lessons learned.
High Performance Distributed Computing, 2004. Proceedings. 13th IEEE International Symposium on; 07/2004
-
Autonomous Agents and Multiagent Systems, 2004. AAMAS 2004. Proceedings of the Third International Joint Conference on; 02/2004