Conference Paper

A Fault Tolerance Service for QoS in Grid Computing.

DOI: 10.1007/3-540-44863-2_29
Conference: Computational Science - ICCS 2003, International Conference, Melbourne, Australia and St. Petersburg, Russia, June 2-4, 2003. Proceedings, Part III
Source: DBLP

ABSTRACT This paper proposes a fault tolerance service that satisfies QoS requirements in grid computing. The probability of
failure is higher in grid computing than in traditional parallel computing. Because a resource failure can be fatal to job
execution, a fault tolerance service is essential in grid computing. In addition, grid services are often expected to meet
minimum levels of quality of service (QoS) for desirable operation. However, the Globus toolkit does not provide a fault
tolerance service that supports fault detection and fault management and satisfies QoS requirements. To provide such a service,
we expand the definition of failure to include failures such as process failure, processor failure, and network failure. We
then propose a fault detection service and a fault management service and show simulation results.

  • Source
    ABSTRACT: Reliability, in terms of Grid component fault tolerance and minimum quality of service, is an important aspect that has to be addressed to foster Grid technology adoption. Software reliability is critically important in today's integrated and distributed systems, as it is often the weak link in system performance. In general, reliability is difficult to measure, especially in Grid environments, where evaluation methodologies are novel and controversial matters. This paper describes a straightforward procedure to analyze the reliability of computational grids from the viewpoint of an end user. The procedure is illustrated in the evaluation of a research Grid infrastructure based on Globus basic services and the GridWay meta-scheduler. The GridWay support for fault tolerance is also demonstrated in a production-level environment. Results show that GridWay is a reliable workload management tool for dynamic and faulty Grid environments. Transparently to the end user, GridWay is able to detect and recover from any of the Grid element failure, outage and saturation conditions specified by the reliability analysis procedure. © 2006 Elsevier B.V. All rights reserved.
    Journal of Systems Architecture 12/2006; 52:727-736. · 0.69 Impact Factor
  • Source
    ABSTRACT: The efficient usage of current emerging Grid infrastructures can only be attained by defining a standard methodology for its evaluation. This methodology should include an appropriate set of criteria and metrics, and a suitable family of Grid benchmarks, reflecting representative workloads, to evaluate such criteria and metrics. The establishment of this methodology would be useful to validate the middleware, to adjust its components and to estimate the achieved quality of service.
    High Performance Computing and Communications, First International Conference, HPCC 2005, Sorrento, Italy, September 21-23, 2005, Proceedings; 01/2005