Conference Paper

A Fault Tolerance Service for QoS in Grid Computing.

DOI: 10.1007/3-540-44863-2_29 Conference: Computational Science - ICCS 2003, International Conference, Melbourne, Australia and St. Petersburg, Russia, June 2-4, 2003. Proceedings, Part III
Source: DBLP


This paper proposes a fault tolerance service that satisfies QoS requirements in grid computing. The probability of failure in
grid computing is higher than in traditional parallel computing. Since the failure of resources can be fatal to job execution,
a fault tolerance service is essential in grid computing. Grid services are also often expected to meet some minimum level of
quality of service (QoS) for desirable operation. However, the Globus toolkit does not provide a fault tolerance service that supports
fault detection and fault management and satisfies QoS requirements. In order to provide a fault tolerance service
and satisfy QoS requirements, we expand the definition of failure to include cases such as process failure, processor failure, and network
failure. We then propose a fault detection service and a fault management service and show simulation results.

  • Source
    • "To this end, we define a reliable Grid environment as the one in which a job can, transparently to the user, continue its execution (at least from the beginning) in other resource when each one of the following conditions of failure and loss of QoS takes place [14] "
    ABSTRACT: Reliability, in terms of Grid component fault tolerance and minimum quality of service, is an important aspect that has to be addressed to foster Grid technology adoption. Software reliability is critically important in today's integrated and distributed systems, as it is often the weak link in system performance. In general, reliability is difficult to measure, especially in Grid environments, where evaluation methodologies are novel and controversial matters. This paper describes a straightforward procedure to analyze the reliability of computational grids from the viewpoint of an end user. The procedure is illustrated in the evaluation of a research Grid infrastructure based on Globus basic services and the GridWay meta-scheduler. The GridWay support for fault tolerance is also demonstrated in a production-level environment. Results show that GridWay is a reliable workload management tool for dynamic and faulty Grid environments. Transparently to the end user, GridWay is able to detect and recover from any of the Grid element failure, outage and saturation conditions specified by the reliability analysis procedure. © 2006 Elsevier B.V. All rights reserved.
    Full-text · Article · Dec 2006 · Journal of Systems Architecture
  • Source
    ABSTRACT: The efficient usage of current emerging Grid infrastructures can only be attained by defining a standard methodology for its evaluation. This methodology should include an appropriate set of criteria and metrics, and a suitable family of Grid benchmarks, reflecting representative workloads, to evaluate such criteria and metrics. The establishment of this methodology would be useful to validate the middleware, to adjust its components and to estimate the achieved quality of service.
    Full-text · Conference Paper · Sep 2005