This paper proposes a fault tolerance service that satisfies QoS requirements in grid computing. The probability of failure in
grid computing is higher than in traditional parallel computing. Since a resource failure can be fatal to job execution,
a fault tolerance service is essential in grid computing. Grid services are also often expected to meet minimum levels of
quality of service (QoS) for desirable operation. However, the Globus Toolkit does not provide a fault tolerance service that
supports fault detection and fault management and satisfies QoS requirements. To provide such a service and
satisfy QoS requirements, we expand the definition of failure to include failures such as process failure, processor failure, and
network failure. We then propose a fault detection service and a fault management service and present simulation results.
"To this end, we define a reliable Grid environment as the one in which a job can, transparently to the user, continue its execution (at least from the beginning) in other resource when each one of the following conditions of failure and loss of QoS takes place  "
ABSTRACT: Reliability, in terms of Grid component fault tolerance and minimum quality of service, is an important aspect that has to be addressed to foster Grid technology adoption. Software reliability is critically important in today's integrated and distributed systems, as it is often the weak link in system performance. In general, reliability is difficult to measure, especially in Grid environments, where evaluation methodologies are novel and controversial matters. This paper describes a straightforward procedure to analyze the reliability of computational grids from the viewpoint of an end user. The procedure is illustrated in the evaluation of a research Grid infrastructure based on Globus basic services and the GridWay meta-scheduler. The GridWay support for fault tolerance is also demonstrated in a production-level environment. Results show that GridWay is a reliable workload management tool for dynamic and faulty Grid environments. Transparently to the end user, GridWay is able to detect and recover from any of the Grid element failure, outage and saturation conditions specified by the reliability analysis procedure. © 2006 Elsevier B.V. All rights reserved.
Journal of Systems Architecture 12/2006; 52:727-736. DOI:10.1016/j.sysarc.2006.04.003 · 0.44 Impact Factor
ABSTRACT: The efficient usage of current emerging Grid infrastructures can only be attained by defining a standard methodology for its
evaluation. This methodology should include an appropriate set of criteria and metrics, and a suitable family of Grid benchmarks,
reflecting representative workloads, to evaluate such criteria and metrics. The establishment of this methodology would be
useful to validate the middleware, to adjust its components and to estimate the achieved quality of service.
High Performance Computing and Communications, First International Conference, HPCC 2005, Sorrento, Italy, September 21-23, 2005, Proceedings; 01/2005