Conference Paper

Dynamic Scheduling Mechanism for Result Certification in Peer to Peer Grid Computing

Abstract

In a peer-to-peer grid computing environment, volunteers have heterogeneous properties and dynamically join and leave during execution. Therefore, it is essential to adapt to an unstable and widely distributed environment. However, existing scheduling and result certification mechanisms do not adapt to such a dynamic environment. As a result, they suffer from high overhead, performance degradation, and scalability problems. To solve these problems, we propose a new scheduling mechanism for result certification. The proposed mechanism applies different scheduling and result certification algorithms to different volunteer groups that are classified on the basis of their properties, such as volunteering service time, availability, and credibility. It also exploits mobile agents in a distributed way in order to adapt to a dynamic peer-to-peer grid computing environment.
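To make the grouping idea concrete, here is a minimal sketch of how volunteers could be classified by volunteering service time, availability, and credibility before group-specific scheduling and result certification are applied. The thresholds, the four group labels, and the per-group policies are illustrative assumptions, not the parameters used in the paper.

```python
from dataclasses import dataclass

@dataclass
class Volunteer:
    name: str
    service_time: float   # expected volunteering time per session (hours)
    availability: float   # fraction of time the host is reachable (0..1)
    credibility: float    # estimated probability of returning a correct result (0..1)

# Illustrative thresholds -- real values would be tuned per deployment.
MIN_SERVICE_TIME = 2.0
MIN_AVAILABILITY = 0.7
MIN_CREDIBILITY = 0.95

def classify(v: Volunteer) -> str:
    """Assign a volunteer to a group so that each group can get its own
    scheduling and result certification policy (hypothetical policies)."""
    reliable = v.service_time >= MIN_SERVICE_TIME and v.availability >= MIN_AVAILABILITY
    trusted = v.credibility >= MIN_CREDIBILITY
    if reliable and trusted:
        return "A: light spot-checking, little redundancy"
    if reliable:
        return "B: spot-checking plus low-redundancy voting"
    if trusted:
        return "C: voting with re-assignment on early departure"
    return "D: heavy redundancy, keep away from critical tasks"

for v in [Volunteer("host1", 5.0, 0.9, 0.99), Volunteer("host2", 1.0, 0.4, 0.97)]:
    print(v.name, "->", classify(v))
```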

... To this end, existing Desktop Grid systems exploited result certification mechanisms such as voting and spot-checking [21-36]. However, (i) existing result certification mechanisms do not adapt to the distinct features resulting from the heterogeneous properties and volatility; (ii) there is no dynamic scheduling for result certification, although result certification is tightly related to scheduling in that both the special task for spot-checking and the redundant tasks for voting are allocated to volunteers in a scheduling procedure; (iii) existing Desktop Grid systems simply used the eager scheduling mechanism [10-13, 23], although the result certification and scheduling mechanisms are required to classify volunteers into groups that have similar properties, and then dynamically apply various scheduling mechanisms to each group. ...
... A few studies have been made on scheduling for result certification in a Desktop Grid computing environment [23, 30-32, 35, 36]. Bayanihan [23] proposed majority voting and spot-checking based on an eager scheduling algorithm. ...
Article
In Desktop Grids, volunteers (i.e., resource providers) have heterogeneous properties and dynamically join and leave during execution. Moreover, some volunteers may behave erratically or maliciously. Thus, it is important to detect and tolerate erroneous results (i.e., result certification) in order to guarantee reliable execution, considering volatility and heterogeneity in a scheduling procedure. However, existing result certification mechanisms do not adapt to such a dynamic environment. As a result, they undergo high overhead and performance degradation. To solve the problems, we propose a new Group-based Adaptive Result Certification Mechanism (GARCM). GARCM applies different result certification and scheduling algorithms to volunteer groups that are constructed according to their properties such as volunteering service time, availability, and credibility.
Article
In this paper, the grid computing system is viewed as an ecosystem. The goal of optimized resource management is to promote the balance and evolution of this computing ecosystem. The architecture of an ecosystem-model-based grid resource management system is presented, featuring self-aware and self-optimization mechanisms. The knowledge-discovery-based self-aware mechanism can reveal behavior-pattern knowledge hidden in the historical grid information system; the discovered knowledge can be used to predict resource requirements and to optimize resource allocation. An antigen-identification mechanism is studied that can identify the factors related to an unbalanced state of the computing ecosystem. With the self-optimization mechanism of the computing ecosystem, the resource allocation problem can be abstracted as a multi-objective optimization problem. Computation expectations, the ecosystem environment, and application characteristics are considered in designing a policy-based adaptive resource allocation and job scheduling algorithm.
Article
Desktop resources are attractive for running compute-intensive distributed applications. Several systems that aggregate these resources in desktop grids have been developed. While these systems have been successfully used for many high throughput applications there has been little insight into the detailed temporal structure of CPU availability of desktop grid resources. Yet, this structure is critical to characterize the utility of desktop grid platforms for both task parallel and even data parallel applications. We address the following questions: (i) What are the temporal characteristics of desktop CPU availability in an enterprise setting? (ii) How do these characteristics affect the utility of desktop grids? (iii) Based on these characteristics, can we construct a model of server "equivalents" for the desktop grids, which can be used to predict application performance? We present measurements of an enterprise desktop grid with over 220 hosts running the Entropia commercial desktop grid software. We utilize these measurements to characterize CPU availability and develop a performance model for desktop grid applications for various task granularities, showing that there is an optimal task size. We then use a cluster equivalence metric to quantify the utility of the desktop grid relative to that of a dedicated cluster.
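As a rough illustration of a cluster-equivalence style calculation (the numbers, the exponential uptime model, and the function below are assumptions, not the paper's measured model), one can estimate how many dedicated cluster nodes a pool of intermittently available desktops is worth for a given task length:

```python
import math

# Hypothetical illustration: estimate how many dedicated cluster nodes a set of
# desktops is "equivalent" to, given per-host availability, relative speed, and
# how likely a task of a given length is to finish before the host goes away.
desktops = [
    {"avail": 0.80, "speed": 1.0},   # fraction of time available, speed vs. a cluster node
    {"avail": 0.50, "speed": 0.7},
    {"avail": 0.95, "speed": 1.2},
]

def cluster_equivalents(hosts, task_hours=1.0, mean_uptime_hours=4.0):
    """Useful throughput of the desktop pool, in units of one dedicated cluster node."""
    total = 0.0
    for h in hosts:
        run_time = task_hours / h["speed"]
        # Chance the task completes within one availability interval
        # (exponential uptime assumption -- longer tasks lose more work).
        p_complete = math.exp(-run_time / mean_uptime_hours)
        total += h["avail"] * h["speed"] * p_complete
    return total

for t in (0.5, 1.0, 4.0):
    print(f"task length {t}h -> {cluster_equivalents(desktops, task_hours=t):.2f} cluster equivalents")
```

Because longer tasks are more likely to be interrupted before finishing, the equivalent cluster size shrinks as task granularity grows, which is the kind of granularity effect the paper quantifies from real availability traces.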
Conference Paper
It has been reported [25] that life holds but two certainties, death and taxes. And indeed, it does appear that any society, and in the context of this article, any large-scale distributed system, must address both death (failure) and the establishment and maintenance of infrastructure (which we assert is a major motivation for taxes, so as to justify our title!). Two supposedly new approaches to distributed computing have emerged in the past few years, both claiming to address the problem of organizing large-scale computational societies: peer-to-peer (P2P) [15, 36, 49] and Grid computing [21]. Both approaches have seen rapid evolution, widespread deployment, successful application, considerable hype, and a certain amount of (sometimes warranted) criticism. The two technologies appear to have the same final objective, the pooling and coordinated use of large sets of distributed resources, but are based in different communities and, at least in their current designs, focus on different requirements.
Article
The World Wide Web has the potential of being used as an inexpensive and convenient metacomputing resource. This brings forward new challenges and invalidates many of the assumptions made in offering the same functionality for a network of workstations. We have designed and implemented Charlotte which goes beyond providing a set of features commonly used for a network of workstations: (1) a user can execute a parallel program on a machine she does not have an account on; (2) neither a shared file system nor a copy of the program on the local file system is required; (3) local hardware is protected from programs written by "strangers"; (4) any machine on the Web can join or leave any running computation, thus utilizing the dynamic resources. Charlotte combines many complementary but isolated research efforts. It comprises a virtual machine model which isolates the program from the execution environment, and a runtime system which realizes this model on the Web. Load balancing and fault...
Conference Paper
Grid computing presents two major challenges for deploying large-scale applications across wide area networks gathering volunteer PCs and clusters/parallel computers as computational resources: security and fault tolerance. This paper presents a lightweight Grid solution for the deployment of multi-parameter applications on a set of clusters protected by firewalls. The system uses a hierarchical design based on Condor for managing each cluster locally and XtremWeb for enabling resource sharing among the clusters. We discuss the security and fault tolerance mechanisms used for this design and demonstrate the usefulness of the approach by measuring the performance of a multi-parameter bio-chemistry application deployed on two sites: University of Wisconsin/Madison and Paris South University. This experiment shows that we can efficiently and safely harness the computational power of about 200 PCs distributed over two geographic sites.
Conference Paper
Global computing achieves high throughput computing by harvesting a very large number of unused computing resources connected to the Internet. This parallel computing model targets a parallel architecture defined by a very high number of nodes, poor communication performance, and continuously varying resources. The unprecedented scale of the global computing architecture paradigm requires us to revisit many basic issues related to parallel architecture programming models, performance models, and classes of applications or algorithms suitable for this architecture. XtremWeb is an experimental global computing platform dedicated to providing a tool for such studies. The paper presents the design of XtremWeb. Two essential features of this design are multi-application support and high performance. Accepting multiple applications allows institutions or enterprises to set up their own global computing applications or experiments. High performance is ensured by scalability, fault tolerance, efficient scheduling, and a large base of volunteer PCs. We also present an implementation of the first global application running on XtremWeb.
Conference Paper
Dynamic mapping (matching and scheduling) heuristics for a class of independent tasks using heterogeneous distributed computing systems are studied. Two types of mapping heuristics are considered: on-line and batch-mode heuristics. Three new heuristics, one for batch mode and two for on-line mode, are introduced as part of this research. Simulation studies are performed to compare these heuristics with some existing ones. In total, five on-line heuristics and three batch heuristics are examined. The on-line heuristics consider, to varying degrees and in different ways, task affinity for different machines and machine ready times. The batch heuristics consider these factors, as well as the aging of tasks waiting to execute. The simulation results reveal that the choice of mapping heuristic depends on parameters such as: (a) the structure of the heterogeneity among tasks and machines, (b) the optimization requirements, and (c) the arrival rate of the tasks.
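For illustration, a minimal on-line mapping heuristic in the spirit of Minimum Completion Time (MCT) is sketched below; the expected-time-to-compute values and machine names are made up, and this is only one of the heuristic families studied.

```python
# Minimal on-line Minimum Completion Time (MCT) mapping sketch.
# Each arriving task goes to the machine that would finish it earliest,
# considering both machine ready time and task/machine affinity.

# etc[task][machine]: expected time to compute the task on the machine (made up)
etc = {
    "t1": {"m1": 4.0, "m2": 9.0},
    "t2": {"m1": 6.0, "m2": 3.0},
    "t3": {"m1": 5.0, "m2": 5.0},
}
ready = {"m1": 0.0, "m2": 0.0}   # machine ready times

def mct_assign(task: str) -> str:
    # Completion time = machine ready time + expected execution time.
    machine = min(ready, key=lambda m: ready[m] + etc[task][m])
    ready[machine] += etc[task][machine]
    return machine

for t in ["t1", "t2", "t3"]:        # tasks are mapped in arrival order (on-line mode)
    print(t, "->", mct_assign(t), "| ready times:", ready)
```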
Article
The term "peer-to-peer" (P2P) refers to a class of systems and applications that employ distributed resources to perform a critical function in a decentralized manner. With the pervasive deployment of computers, P2P is increasingly receiving attention in research, product development, and investment circles. This interest ranges from enthusiasm, through hype, to disbelief in its potential. Some of the benefits of a P2P approach include: improving scalability by avoiding dependency on centralized points; eliminating the need for costly infrastructure by enabling direct communication among clients; and enabling resource aggregation.
Article
We consider the problem of a collegial decision on the basis of the individual opinions of n experts deciding independently, the probability of a correct decision by the i-th expert being equal to $p_i$, where $1/2 < m \le p_i \le M < 1$, $i = 1, 2, \ldots, n$. It is shown that, for the error probability of the optimal collegial decision, the estimates
$$\frac{1-M}{M}\binom{n}{\lfloor n/2\rfloor}\prod_{i=1}^{n}\sqrt{p_i(1-p_i)} \;\le\; P_{\mathrm{err}}^{\mathrm{opt}} \;\le\; \frac{m}{2m-1}\binom{n}{\lfloor n/2\rfloor}\prod_{i=1}^{n}\sqrt{p_i(1-p_i)}$$
are valid.
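A small numeric check of these bounds, with arbitrarily chosen expert accuracies, can be written as follows (the p values are illustrative only):

```python
import math

# Evaluate the lower and upper bounds on the optimal collegial decision's
# error probability for some example expert accuracies (illustrative values).
p = [0.7, 0.8, 0.75, 0.9, 0.65]
n = len(p)
m, M = min(p), max(p)

central = math.comb(n, n // 2) * math.prod(math.sqrt(pi * (1 - pi)) for pi in p)
lower = (1 - M) / M * central
upper = m / (2 * m - 1) * central

print(f"{lower:.4f} <= P_err_opt <= {upper:.4f}")
```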
Article
In this paper, we address the new problem of protecting volunteer computing systems from malicious volunteers who submit erroneous results, by presenting sabotage-tolerance mechanisms that work without depending on checksums or cryptographic techniques. We first analyze the traditional technique of voting and show how it reduces error rates exponentially with redundancy, but requires all work to be done several times, and does not work well when there are many saboteurs. We then present a new technique called spot-checking which reduces the error rate linearly (i.e. inversely) with the amount of work to be done, while only costing an extra fraction of the original time. Integrating these mechanisms, we then present the new idea of credibility-based fault-tolerance, wherein we estimate the conditional probability of results and workers being correct, based on the results of using voting, spot-checking and other techniques, and then use these probability estimates to direct the use of further redundancy. Using this technique, we are able to attain mathematically guaranteeable levels of correctness, and do so with much smaller slowdown than possible with voting or spot-checking alone. Finally, we validate these new ideas with Monte Carlo simulations, and discuss other possible variations of these techniques.
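A toy sketch of how voting, spot-checking, and a credibility estimate can interact is given below; the credibility formula and the worker model are simplified stand-ins, not the exact estimators from the paper.

```python
import random
from collections import Counter

random.seed(1)

class Worker:
    def __init__(self, name, saboteur=False, error_rate=0.3):
        self.name = name
        self.saboteur = saboteur
        self.error_rate = error_rate
        self.spot_checks_passed = 0

    def compute(self, x):
        correct = x * x                       # the "real" work: square a number
        if self.saboteur and random.random() < self.error_rate:
            return correct + 1                # sabotaged result
        return correct

    def credibility(self, base_error=0.05):
        # Simplified credibility: each passed spot-check shrinks the assumed
        # error probability (stand-in for the paper's estimator).
        return 1.0 - base_error / (self.spot_checks_passed + 1)

def spot_check(worker, x=7):
    ok = worker.compute(x) == x * x           # spotter task with a known answer
    if ok:
        worker.spot_checks_passed += 1
    return ok

def majority_vote(workers, x):
    # Redundant execution: the most common result wins.
    return Counter(w.compute(x) for w in workers).most_common(1)[0][0]

workers = [Worker("w1"), Worker("w2"), Worker("w3", saboteur=True)]
for w in workers:
    spot_check(w)
print("voted result for x=5:", majority_vote(workers, 5))
print("credibilities:", {w.name: round(w.credibility(), 3) for w in workers})
```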
Conference Paper
Javelin 3 is a software system for developing large-scale, fault-tolerant, adaptively parallel applications. When all or part of their application can be cast as a master-worker or branch-and-bound computation, Javelin 3 frees application developers from concerns about inter-processor communication and fault tolerance among networked hosts, allowing them to focus on the underlying application. The paper describes a fault-tolerant task scheduler and its performance analysis. The task scheduler integrates work stealing with an advanced form of eager scheduling. It enables dynamic task decomposition, which improves host load-balancing in the presence of tasks whose non-uniform computational load is evident only at execution time. Speedup measurements are presented of actual performance on up to 1,000 hosts. We analyze the expected performance degradation due to unresponsive hosts, and measure actual performance degradation due to unresponsive hosts.
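A minimal illustration of eager scheduling combined with work stealing follows: unfinished tasks are simply re-issued each round, and an idle host steals from the busiest queue. The two-host setup, failure probabilities, and round structure are assumptions for the sketch, not Javelin 3's actual scheduler.

```python
import random
from collections import deque

random.seed(2)

# Toy combination of eager scheduling and work stealing: tasks stay eligible
# for re-assignment until a result is recorded, so work lost on slow or failed
# hosts is simply issued again in the next round; an idle host steals work.
tasks = {f"task{i}": None for i in range(6)}      # task -> result (None = unfinished)
queues = {"hostA": deque(), "hostB": deque()}     # per-host work queues

def eager_refill():
    """Hand every still-unfinished task to hostA; hostB will steal its share."""
    queues["hostA"].extend(t for t, r in tasks.items() if r is None)

def step(host, success_prob):
    """Run one task on `host`, stealing from the busiest queue when idle."""
    if not queues[host]:
        victim = max(queues, key=lambda h: len(queues[h]))
        if queues[victim]:
            queues[host].append(queues[victim].pop())   # work stealing
    if queues[host]:
        t = queues[host].popleft()
        if random.random() < success_prob:              # host may fail or leave mid-task
            tasks[t] = f"done by {host}"

while any(r is None for r in tasks.values()):
    for q in queues.values():
        q.clear()
    eager_refill()
    for _ in range(3):                                  # a few scheduling steps per round
        step("hostA", success_prob=0.6)                 # flaky host
        step("hostB", success_prob=0.9)                 # more reliable host

print(tasks)
```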
Conference Paper
Javelin is a Java-based infrastructure for global computing. This paper presents Javelin++, an extension of Javelin, intended to support a much larger set of computational hosts. First, Javelin++'s switch from Java applets to Java applications is explained. Then, two scheduling schemes are presented: a probabilistic work-stealing scheduler and a deterministic scheduler. The deterministic scheduler also implements eager scheduling, as well as another fault-tolerance mechanism for hosts that have failed or retreated. A Javelin++ API is sketched, then illustrated on a raytracing application. Performance results for the two schedulers are reported, indicating that Javelin++, with its broker network, scales better than the original Javelin. 1 Introduction Our goal is to harness the Internet's vast, growing, computational capacity for ultra-large, coarse-grained parallel applications. Some other research projects based on a similar vision include CONDOR [21, 13], Legion [18], and GLOBUS [14] ...
Conference Paper
Global Computing is a particular modality of Grid Computing targeting massive parallelism, Internet computing and cycle-stealing. This new computing infrastructure has been shown to be exposed to a new type of attacks, where authentication is not relevant, network security techniques are not sufficient, and result-checking algorithms may be unavailable. The behavior of a Global Computing System, which is likely to be bimodal, nevertheless offers an opportunity for a probabilistic verification process that is efficient in the most frequent cases, and degrades gracefully as the problem becomes more difficult. For the two cases of a system based on anonymous volunteers, and a better controlled system, we propose probabilistic tests which self-adapt to the behavior of the computing entities.
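One way to picture such a self-adapting probabilistic test is to raise the checking rate for entities that have recently failed a check and let it decay for those with a long clean history; the update rule and constants below are purely illustrative.

```python
# Illustrative self-adapting check rate: a participant starts out heavily
# audited; the rate decays with consecutive passed checks and snaps back up
# after any failure. The constants are assumptions, not the paper's tests.
class CheckPolicy:
    def __init__(self, initial=0.5, floor=0.05, decay=0.8):
        self.rate = initial    # probability that the next result is verified
        self.floor = floor
        self.decay = decay

    def record(self, passed: bool) -> None:
        if passed:
            self.rate = max(self.floor, self.rate * self.decay)
        else:
            self.rate = 1.0    # distrust: verify everything again for a while

policy = CheckPolicy()
for outcome in [True, True, True, False, True]:
    policy.record(outcome)
    print(f"check rate after {'pass' if outcome else 'fail'}: {policy.rate:.2f}")
```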
Article
Since 1984, the Condor project has enabled ordinary users to do extraordinary computing. Today, the project continues to explore the social and technical problems of cooperative computing on scales ranging from the desktop to the world-wide computational grid. In this chapter, we provide the history and philosophy of the Condor project and describe how it has interacted with other projects and evolved along with the field of distributed computing. We outline the core components of the Condor system and describe how the technology of computing must correspond to social structures. Throughout, we reflect on the lessons of experience and chart the course traveled by research ideas as they grow into production systems.
Article
Javelin 3 is a software system for developing large-scale, fault-tolerant, adaptively parallel applications. When all or part of their application can be cast as a master–worker or branch-and-bound computation, Javelin 3 frees application developers from concerns about inter-processor communication and fault tolerance among networked hosts, allowing them to focus on the underlying application. The paper describes a fault-tolerant task scheduler and its performance analysis. The task scheduler integrates work stealing with an advanced form of eager scheduling. It enables dynamic task decomposition, which improves host load-balancing in the presence of tasks whose non-uniform computational load is evident only at execution time. Speedup measurements are presented of actual performance on up to 1000 hosts. We analyze the expected performance degradation due to unresponsive hosts, and measure actual performance degradation due to unresponsive hosts. Copyright © 2005 John Wiley & Sons, Ltd.
Conference Paper
BOINC (Berkeley Open Infrastructure for Network Computing) is a software system that makes it easy for scientists to create and operate public-resource computing projects. It supports diverse applications, including those with large storage or communication requirements. PC owners can participate in multiple BOINC projects, and can specify how their resources are allocated among these projects. We describe the goals of BOINC, the design issues that we confronted, and our solutions to these problems.
Conference Paper
Fault tolerance is essential to the further development of desktop grid computing systems in order to guarantee continuous and reliable execution of tasks in spite of failures. In a desktop grid computing environment, volunteers are often susceptible to volunteer autonomy failures, such as volatility failures and interference failures, in the middle of executing tasks because desktop grid computing maximally respects the autonomy of volunteers. These failures result in an independent livelock problem (i.e., the delay and blocking of the entire execution of a job). Therefore, the failures should be considered in a scheduling mechanism. In this work, in order to tolerate volunteer autonomy failures, we propose a new fault-tolerant scheduling mechanism. First, we specify volunteer autonomy failures and the independent livelock problem. Then, we propose a volunteer availability metric that reflects the degree of volunteer autonomy failures. Finally, we propose a fault-tolerant scheduling mechanism based on volunteer availability (called VAFTSM).
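A simplified sketch of an availability estimate computed from a volunteer's past sessions, and of allocating tasks to the most available volunteers first, is shown below; the availability definition and session data are stand-ins for the metric actually proposed.

```python
# Sketch: estimate volunteer availability from past sessions and hand tasks
# to the most available volunteers first, so tasks are less likely to hit a
# volunteer autonomy failure. The data and definition are illustrative.

history = {
    # volunteer -> list of (volunteered_minutes, observed_window_minutes)
    "v1": [(50, 60), (55, 60), (40, 60)],
    "v2": [(10, 60), (20, 60), (60, 60)],
    "v3": [(60, 60), (58, 60), (59, 60)],
}

def availability(sessions):
    """Fraction of observed time the volunteer actually kept working."""
    up = sum(v for v, _ in sessions)
    total = sum(w for _, w in sessions)
    return up / total if total else 0.0

def allocate(tasks, volunteers):
    """Assign tasks round-robin over volunteers ranked by availability."""
    ranked = sorted(volunteers, key=lambda v: availability(history[v]), reverse=True)
    return {t: ranked[i % len(ranked)] for i, t in enumerate(tasks)}

print({v: round(availability(s), 2) for v, s in history.items()})
print(allocate(["t1", "t2", "t3", "t4"], list(history)))
```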
Conference Paper
A location management and message delivery protocol is fundamental to the further development of mobile agent systems in a multi-region mobile agent computing environment, in order to control mobile agents and guarantee message delivery between them. However, previous works have some problems when applied to a multi-region mobile agent computing environment. First, the cost of location management and message delivery increases considerably. Second, a following problem arises (messages keep chasing an agent that migrates during delivery). Finally, cloned mobile agents and parent-and-child mobile agents are not handled with respect to location management and message delivery. We present the HB (home-blackboard) protocol, a new location management and message delivery protocol for mobile agents in a multi-region mobile agent computing environment. We have implemented the HB protocol. The HB protocol decreases the cost of location management and message delivery and solves the following problem with low communication cost. In addition, the HB protocol handles the location management and message delivery of cloned and parent-and-child mobile agents, so that it guarantees message delivery for these mobile agents.
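The following sketch shows one possible shape of a home-blackboard style registry that tracks agent locations and buffers messages for agents in transit; it is an interpretation of the idea for illustration, not the protocol's actual message flow.

```python
# Rough sketch of a home-blackboard style registry: a region's home node
# records the current location of its agents and buffers messages for agents
# that are migrating, so senders do not have to chase a moving agent.
class HomeBlackboard:
    def __init__(self):
        self.location = {}     # agent id -> current host (None while migrating)
        self.pending = {}      # agent id -> buffered messages

    def register_move(self, agent, host):
        self.location[agent] = host
        # Deliver anything buffered while the agent was migrating.
        for msg in self.pending.pop(agent, []):
            self.deliver(agent, msg)

    def start_migration(self, agent):
        self.location[agent] = None

    def send(self, agent, msg):
        if self.location.get(agent) is None:
            self.pending.setdefault(agent, []).append(msg)   # hold until it lands
        else:
            self.deliver(agent, msg)

    def deliver(self, agent, msg):
        print(f"deliver to {agent}@{self.location[agent]}: {msg}")

hb = HomeBlackboard()
hb.register_move("agent1", "hostA")
hb.send("agent1", "hello")
hb.start_migration("agent1")
hb.send("agent1", "are you there?")      # buffered instead of chasing the agent
hb.register_move("agent1", "hostB")      # buffered message is delivered on arrival
```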
Conference Paper
Projects like SETI@home have demonstrated the tremendous capabilities of Internet-connected commodity resources. The rapid improvement of commodity components makes the global computing platform increasingly viable for other large-scale data- and compute-intensive applications. In this paper, we study how global computing can accommodate new types of applications. We describe a global computing model that captures resource characteristics and instantiate this model with data from several surveys and studies. We propose performance metrics for global computing applications and evaluate two scheduling mechanisms in simulation. We then draw conclusions concerning the development and enhancement of global computing systems.
Conference Paper
We address the new problem of protecting volunteer computing systems from malicious volunteers who submit erroneous results by presenting sabotage-tolerance mechanisms that work without depending on checksums or cryptographic techniques. We first analyze the traditional technique of voting, and show how it reduces error rates exponentially with redundancy, but requires all work to be done at least twice, and does not work well when there are many saboteurs. We then present a new technique called spot-checking which reduces the error rate linearly (i.e., inversely) with the amount of work to be done, while only costing an extra fraction of the original time. We then integrate these mechanisms by presenting the new idea of credibility-based fault-tolerance, which uses probability estimates to efficiently limit and direct the use of redundancy. By using voting and spot-checking together, credibility-based fault-tolerance effectively allows us to exponentially shrink an already linearly-reduced error rate, and thus achieve error rates that are orders of magnitude smaller than those offered by voting or spot-checking alone. We validate this new idea with Monte Carlo simulations, and discuss how credibility-based fault tolerance can be used with other mechanisms and in other applications.
Article
Project Bayanihan is developing the idea of volunteer computing, which seeks to enable people to form very large parallel computing networks very quickly by using ubiquitous and easy-to-use technologies such as web browsers and Java. By utilizing Java's object-oriented features, we have built a flexible software framework that makes it easy for programmers to write different volunteer computing applications, while allowing researchers to study and develop the underlying mechanisms behind them. In this paper, we show how we have used this framework to write master-worker style applications, and to develop approaches to the problems of programming interface, adaptive parallelism, fault-tolerance, computational security, scalability, and user interface design. Key words: metacomputing, parallel and distributed computing, network of workstations, heterogeneous computing, Java. 1 Introduction Bayanihan (pronounced "buy-uh-nee-hun") is the name of an old Filipino countryside tradition wh...
Adaptive Group Computation Approach in the Peer-to-peer Grid Computing Systems
  • M Baik
  • S Choi
  • C Hwang
  • J Gil
  • H Yu
Charlotte: Metacomputing on the Web
  • A Baratloo
  • M Karaul
  • Z Kedem
  • P Wyckoff