Conference Paper

Dynamic Virtual Clustering


Abstract

Multiple clusters co-existing on a single research campus have become commonplace at many university and government labs, but effectively leveraging those resources is difficult. Intelligently forwarding and spanning jobs across clusters can increase throughput, decrease turnaround time, and improve overall utilization. Dynamic Virtual Clustering (DVC) is a system of virtual machines, deployed in a single- or multi-cluster environment, that increases cluster utilization by enabling job forwarding and spanning, flexibly accommodates software environment changes, and effectively sandboxes users and processes from each other and from the system. This paper presents both the initial implementation of DVC and performance results from synthetic workloads executed under DVC.
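
The forwarding and spanning decision at the heart of DVC can be pictured with a short sketch. The placement routine below is purely illustrative: the Cluster fields and the greedy spanning policy are assumptions for exposition, not the scheduler logic of the actual implementation.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Cluster:
    name: str
    free_nodes: int
    queued_jobs: int

def place_job(job_nodes: int, clusters: List[Cluster]) -> Tuple[str, List[str]]:
    # Forwarding: a single cluster with enough free nodes takes the job;
    # prefer the one with the shortest queue.
    fits = [c for c in clusters if c.free_nodes >= job_nodes]
    if fits:
        best = min(fits, key=lambda c: c.queued_jobs)
        return "forward", [best.name]
    # Spanning: greedily combine free nodes from several clusters.
    chosen, remaining = [], job_nodes
    for c in sorted(clusters, key=lambda c: c.free_nodes, reverse=True):
        if remaining <= 0:
            break
        if c.free_nodes > 0:
            chosen.append(c.name)
            remaining -= c.free_nodes
    return ("span", chosen) if remaining <= 0 else ("queue", [])

# Example: a 12-node job on two partially busy clusters spans both.
print(place_job(12, [Cluster("alpha", 8, 3), Cluster("beta", 6, 1)]))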


... However, it does little to ensure application portability or software environment homogeneity between clusters [1]. Even though many clusters employ solutions such as environment modules to manage multiple software installations [2], a fundamental incompatibility with the OS version may be impossible to resolve; for example, an application may require a newer version of the GNU C library. ...
... More recently, Dynamic Virtual Clustering (DVC) suggests that VMs can be used to improve cluster utilisation and throughput within a smaller scale campus grid [1]. The clusters that comprise a campus grid may represent a significant investment by the institution. ...
... Even where there is the potential for an application to run with a large overhead inside a VCC utilising a Software Defined Network as the interconnect, previous work suggests that the advantages of increased flexibility and dynamic features potentially open up a wider range of systems that are available to execute jobs, outweighing the individual job performance limitation [1]. ...
Article
Global software stacks on scientific cluster computing resources are required to provide a homogeneous software environment, which is typically inflexible. Efforts to integrate Virtual Machines (VMs), in order to abstract the software environment of various scientific applications, suffer from performance limitations and require systems administration expertise to maintain. However, the motivation is clear: in addition to increasing resource utilization, the burden of supporting new software installations on existing systems can be reduced. In this paper, we introduce the Virtual Container Cluster (VCC), which encapsulates a typical HPC software environment within Docker containers. The novel component, cluster-watcher, enables context-aware discovery and configuration of the virtual cluster. Containers offer a lightweight alternative to VMs that more closely matches native performance and presents a solution more accessible to customization by the average user. Combined with a Software Defined Networking (SDN) technology, the VCC enables dynamic features such as transparent scaling and spanning across multiple physical resources. Although SDN introduces an additional performance limitation, within the context of a parallel communication network the benchmarking demonstrates that this cost is application dependent. Linpack benchmarking shows that performance under container virtualization and the SDN interconnect is comparable to native execution.
... Virtual Machines (VMs) have been identified as a mechanism for constructing computing clusters with homogeneous software environments [2]–[5]. By allowing each VO to provide its own VMs, it is possible to provide each VO with a dedicated cluster of machines that execute its jobs exclusively. Clusters constructed in this way are termed Virtual Organization Clusters or VOCs. ...
... In addition, VMPlants and Virtual Clusters on the Fly had explicit one-to-one mappings between VM instances and VM disk images. Dynamic Virtual Clustering (DVC) [5] implemented the scheduling of VMs on existing physical cluster nodes within a campus setting. The motivation for this work was to improve the usage of disparate cluster computing systems located within a single entity. ...
... Virtual Organization Clusters using this model are formed from VMs that are either created at the physical site where they are to be used, created by grid middleware (for example, In-VIGO [8]), or manually transmitted to the physical site by a VO administrator. Unlike systems that utilize one virtual disk image per VM instance [3], [5], [9], VOCs are explicitly designed to work with either one or two disk image files: a single image representing the configuration of a compute node, and an optional second image that contains a cluster head node. All compute node VMs are spawned from the single compute node image using a copy-on-write mechanism that allows the original image file to be read-only. ...
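
The copy-on-write spawning this excerpt describes can be approximated with QEMU backing files. The sketch below is a minimal illustration under assumed names: the qcow2 format choice and the image file names are hypothetical, and the actual VOC tooling may use different options.

import subprocess

BASE = "compute-node.qcow2"   # hypothetical read-only base image

def spawn_overlay(node_name: str) -> str:
    # Create a copy-on-write overlay; writes go to the overlay while the
    # shared base image stays read-only, as the VOC design requires.
    overlay = f"{node_name}.qcow2"
    subprocess.run(
        ["qemu-img", "create", "-f", "qcow2", "-F", "qcow2",
         "-b", BASE, overlay],
        check=True,
    )
    return overlay

# One overlay per compute-node VM, all backed by the same image.
images = [spawn_overlay(f"vnode{i:02d}") for i in range(8)]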
... By encapsulating applications, many application types can be executed on the same physical infrastructure, as each application can have its own software stack, from operating system to libraries, independent of the physical platform it is running on. Also, applications can be executed on sites that do not necessarily have the software configuration required by the application [60]. Because virtual environments are isolated from one another, users can be allowed to execute privileged operations without getting special privileges from system administrators, and malicious user behavior can be restrained. ...
... In this case, using virtualization overcomes the need to install operating system kernel modules or link the application to special libraries. Moreover, techniques were proposed to suspend the virtual machines in which the application is running [60], by executing a "vm save" command on each host in synchronized fashion. For synchronization, an NTP server was used. ...
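
A sketch of that synchronized suspension, assuming Xen's xm save command, SSH access to the hosts, and NTP-aligned clocks as in the cited work; the host names, domain name, and checkpoint path are hypothetical.

import subprocess
import time

HOSTS = ["node01", "node02", "node03"]   # hypothetical cluster hosts
DOMAIN = "guest-vm"                      # hypothetical Xen domain name
SUSPEND_AT = time.time() + 30            # agreed wall-clock instant

def suspend_all():
    # Every host sleeps until the shared instant (clocks kept close by
    # NTP), then checkpoints its VM with Xen's save command.
    remote = (
        f"python3 -c 'import time; "
        f"time.sleep(max(0, {SUSPEND_AT} - time.time()))' "
        f"&& xm save {DOMAIN} /var/tmp/{DOMAIN}.chk"
    )
    procs = [subprocess.Popen(["ssh", h, remote]) for h in HOSTS]
    for p in procs:
        p.wait()

suspend_all()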
... Besides tuning the resource allocation in a fine-grained manner, virtualization can be used to build distributed virtual infrastructures, like clusters, spread over multiple physical clusters [60], possibly placed at different geographical locations [121]. Priority and load balancing policies can be designed for these infrastructures, by changing on-the-fly their capacity both in terms of virtual machines and resource allocated to each virtual machine [96,140,145,75]. ...
Article
Organizations owning HPC infrastructures are facing difficulties in managing their resources. These difficulties come from the need to provide concurrent resource access to different application types while considering that users might have different performance objectives for their applications. Cloud computing brings more flexibility and better resource control, promising to improve the user's satisfaction in terms of perceived Quality of Service. Nevertheless, current cloud solutions provide limited support for users to express or use various resource management policies, and they do not provide any support for application performance objectives. In this thesis, we present an approach that addresses this challenge in a unique way. Our approach provides fully decentralized resource control by allocating resources through a proportional-share market, while applications run in autonomous virtual environments capable of scaling the application demand according to user performance objectives. The combination of currency distribution and dynamic resource pricing ensures fair resource utilization. We evaluated our approach in simulation and on the Grid'5000 testbed. Our results show that our approach can enable the co-habitation of different resource usage policies on the infrastructure, improving resource utilization.
... Virtualization of entire clusters was first proposed in [4], and it has been subsequently realized using different models, such as Globus Virtual Workspaces [5], [6], In-VIGO [7], Dynamic Virtual Clustering [8], VMPlants [9], and Virtual Clusters on the Fly [10]. Rapid re-provisioning of physical systems, as an alternative to using virtualization technologies, has been studied in the Cluster-On-Demand (COD) system [11]. ...
... The VOC Model provides a means by which each VO can have its own dedicated cluster and associated Virtual Administrative Domain (VAD), as illustrated in figure 3. Virtual Organization Clusters using this model are formed from VMs that are either created at the physical site where they are to be used, created by grid middleware (for example, In-VIGO [7]), or manually transmitted to the physical site by a VO administrator. Unlike systems that utilize one virtual disk image per VM instance [5], [8], [10], VOCs are explicitly designed to work with either one or two disk image files: a single image representing the configuration of a compute node, and an optional second image that contains a cluster head node. All compute node VMs are spawned from the single compute node image using a copy-on-write mechanism that allows the original image file to be read-only. ...
Conference Paper
Full-text available
Cloud services that provide virtualized computational clusters present a dichotomy of systems management challenges, as the virtual clusters may be owned and administered by one entity, while the underlying physical fabric may belong to a different entity. On the physical fabric, scalable tools that "push" configuration changes and software updates to the compute nodes are effective, since the physical system administrators have complete system access. However, virtual clusters executing atop federated Grid sites may not be directly reachable for management purposes, as network or policy limitations may prevent unsolicited connections to virtual compute nodes. For these systems, a distributed middleware solution could permit the compute nodes to "pull" updates from a centralized server, thereby permitting the management of virtual compute nodes that are inaccessible to the system administrator. This paper compares both models of system administration and describes emerging software utilities for managing both the physical fabric and the virtual clusters.
... These clusters comprise a Campus Area Grid (CAG), which is defined as "a group of clusters in a small geographic area . . . connected by a private, high-speed network" [30]. Latency and bandwidth properties of the private network are considered to be favorable, thereby allowing a combination of spanned clusters to function as a single high-performance cluster for job execution. ...
... However, the software configurations of the different component clusters may differ, as the component clusters may belong to different entities with different management. DVC permits Xen virtual machines with homogeneous software to be run on federated clusters on the same CAG whenever the target cluster is not in use by its owner, thereby allowing research groups to increase the sizes of their clusters temporarily [30]. Another approach to providing virtualization services at various levels in a grid system is to modify the scheduling system, the application itself, or both. ...
Article
Virtual Organization Clusters (VOCs) are a novel mechanism for overlaying dedicated private cluster systems on existing grid infrastructures. VOCs provide customized, homogeneous execution environments on a per-Virtual Organization basis, without the cost of physical cluster construction or the overhead of per-job containers. Administrative access and overlay network capabilities are granted to Virtual Organizations (VOs) that choose to implement VOC technology, while the system remains completely transparent to end users and non-participating VOs. Unlike existing systems that require explicit leases, VOCs are autonomically self-provisioned and self-managed according to configurable usage policies. The work presented here contains two parts: a technology-agnostic formal model that describes the properties of VOCs and a prototype implementation of a physical cluster with hosted VOCs, based on the Kernel-based Virtual Machine (KVM) hypervisor. Test results demonstrate the feasibility of VOCs for use with high-throughput grid computing jobs. With the addition of a "watchdog" daemon for monitoring scheduler queues and adjusting VOC size, the results also demonstrate that cloud computing environments can be autonomically self-provisioned in response to changing workload conditions.
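
A minimal sketch of such a watchdog loop is shown below. The helper callables (queue_depth, boot_vm, kill_vm) are hypothetical stand-ins for scheduler and hypervisor integration, and the one-job-per-VM sizing rule is an illustrative assumption, not the paper's policy.

import time

def autoscale(queue_depth, boot_vm, kill_vm,
              poll_seconds=30, jobs_per_vm=1, min_vms=0, max_vms=64):
    # Poll the private scheduler queue and grow or shrink the pool of
    # compute-node VMs to match demand, within configured bounds.
    vms = min_vms
    while True:
        want = min(max(queue_depth() // jobs_per_vm, min_vms), max_vms)
        while vms < want:
            boot_vm()
            vms += 1
        while vms > want:
            kill_vm()
            vms -= 1
        time.sleep(poll_seconds)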
... Our approach has been built upon the existing experience and know-how of infrastructure such as the LHC worldwide grid for the physics and scientific community, so that existing deployment frameworks such as the PanDA Pilot could be slightly modified to enable on-demand virtual machine based job execution on the Grid. This approach is different from projects like In-VIGO [19] and DVC [5, 7], which focus on creating ad-hoc "virtual computing grids" and presenting them to the end user as grid sessions. In the modified PanDA Pilot, virtual machine based execution of the job stays transparent to the end user. ...
Article
Full-text available
The primary motivations for the uptake of virtualization have been resource isolation, capacity management and resource customization: isolation and capacity management allow providers to isolate users from the site and control their resource usage, while customization allows end-users to easily project the required environment onto a variety of sites. Various approaches have been taken to integrate virtualization with Grid technologies. In this paper, we propose an approach that combines virtualization with existing software infrastructure such as Pilot Jobs, with minimum change on the part of resource providers.
... The term "virtual cluster" has been employed by many projects [8,10,24] with semantics differing from SnowFlock's IC. The focus has been almost exclusively on the resource provisioning and management aspects. ...
Article
Full-text available
We introduce Impromptu Clusters (ICs), a new abstraction that makes it possible to leverage cloud-based clusters to execute short-lived parallel tasks, for example Internet services that use parallelism to deliver near-interactive responses. ICs are particularly relevant for resource-intensive web applications in areas such as bioinformatics, graphics rendering, computational finance, and search. In an IC, an application encapsulated inside a virtual machine (VM) is swiftly forked into multiple copies that execute on different physical hosts, and then disappear when the computation ends. SnowFlock, our IC prototype, offers near-interactive response times for many highly-parallelizable workloads, achieves sub-second parallel VM clone times, and has negligible runtime overhead.
... The ability to dynamically move, suspend, and resume VMs allows for aggressive multiplexing of VMs on server farms, and as a consequence several groups have put forward variations on the notion of a "virtual cluster" [Emeneker and Stanzione 2007; Ruth et al. 2005]. The focus of these lines of work has been almost exclusively on resource provisioning and management aspects; one salient commercial instance of this is VMware's DRS [VMware]. ...
Article
Full-text available
A basic building block of cloud computing is virtualization. Virtual machines (VMs) encapsulate a user’s computing environment and efficiently isolate it from that of other users. VMs, however, are large entities, and no clear APIs exist yet to provide users with programmatic, fine-grained control on short time scales. We present SnowFlock, a paradigm and system for cloud computing that introduces VM cloning as a first-class cloud abstraction. VM cloning exploits the well-understood and effective semantics of UNIX fork. We demonstrate multiple usage models of VM cloning: users can incorporate the primitive in their code, can wrap around existing toolchains via scripting, can encapsulate the API within a parallel programming framework, or can use it to load-balance and self-scale clustered servers. VM cloning needs to be efficient to be usable. It must efficiently transmit VM state in order to avoid cloud I/O bottlenecks. We demonstrate how the semantics of cloning aid us in realizing its efficiency: state is propagated in parallel to multiple VM clones, and is transmitted during runtime, allowing for optimizations that substantially reduce the I/O load. We show detailed microbenchmark results highlighting the efficiency of our optimizations, and macrobenchmark numbers demonstrating the effectiveness of the different usage models of SnowFlock.
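
The fork semantics described here can be mimicked locally with process fork to make the control flow concrete. The sketch below is only a simulation: os.fork stands in for the VM-cloning primitive, whose real API the abstract does not spell out and which is not reproduced here.

import os

def vm_fork(n: int) -> int:
    # Stand-in for the cloning primitive: returns a clone id (0..n-1)
    # in each child and -1 in the parent, mirroring UNIX fork semantics.
    for i in range(n):
        if os.fork() == 0:
            return i
    return -1

me = vm_fork(4)
if me >= 0:
    print(f"clone {me}: computing slice {me}")
    os._exit(0)          # the clone disappears when its work ends
else:
    for _ in range(4):   # the parent collects its clones
        os.wait()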
... The interoperability between Grids and clouds has been under investigation for some time. The beginnings of the research can be seen in the work of adding the ability to make batch systems "dynamic"-that is, adding and removing (often virtualised) Worker Nodes from a batch system queue based on certain events [17][18][19][20][21][22]. This has resulted in such modern projects as INFN's Worker Node on Demand [23], which deploys virtualized resources at the WLCG Tier1 in Italy. ...
Article
Full-text available
The Distributed Infrastructure with Remote Agent Control (DIRAC) software framework allows a user community to manage computing activities in a distributed environment. DIRAC has been developed within the Large Hadron Collider Beauty (LHCb) collaboration. After successful usage over several years, it is the final solution adopted by the experiment. The Belle experiment at the Japanese High Energy Accelerator Research Organization (KEK) has the purpose of studying matter/anti-matter asymmetries using B mesons. During its lifetime, the Belle detector has collected about 5,000 terabytes of real and simulated data. The analysis of this data requires an enormous amount of computing-intensive Monte Carlo simulation. The Belle II experiment, which recently published its technical design report, will produce 50 times more data. It is therefore of interest to determine whether commercial computing clouds can reduce the total cost of the experiment’s computing solution. This paper describes the setup prepared to evaluate the performance and cost of this approach using real 2010 simulation tasks of the Belle experiment. The setup has been developed using DIRAC as the overall management tool to control both the tasks to be executed and the deployment of virtual machines using the Amazon Elastic Compute Cloud as service provider. At the same time, DIRAC is also used to monitor the execution, collect the necessary statistical data, and finally upload the results of the simulation to Belle resources on the Grid. The results of a first test using over 2000 days of CPU time show that over 90% efficiency in the use of the resources can easily be achieved.
... The inclusion of the IPOP overlay network and the pilot jobs (used by the watchdog to send requests to multiple grid sites) enables Virtual Organizations to allocate resources and schedule jobs privately. With the use of more IPOP bootstrap nodes, the system is highly scalable. It has also been shown that the VOC model can be extended efficiently to leasing models such as Amazon EC2 [34], Shirako [37] and DVC [38]. The current watchdog uses a greedy implementation and might increase the cost of the system. ...
Conference Paper
Full-text available
Virtual Organizations are dynamic entities that consist of individuals and/or institutions established around a set of resource-sharing rules and conditions. The VO may require the use of on-site (local) and off-site (public) compute resources that can be leased or autonomically provisioned, based on workload and site policies. Virtual Organization Clusters provide the necessary computing infrastructure by building upon existing physical grid sites without disrupting the existing infrastructure or requiring any engagement from end users. VOCs also separate the physical and virtual administrative domains and thus encourage more sites to participate in the resource sharing and hosting. The VO can relinquish the compute resources based on job completion or other operational parameters such as cost. This paper expands on previous work with the Virtual Organization Cluster Model by demonstrating its scalability across multiple grid sites with the use of a structured peer-to-peer overlay networking system. A novel approach by which the model is extended to lease-based systems, such as the Amazon Elastic Compute Cloud (EC2), is introduced.
... Middleware has been developed to facilitate construction of virtual machine clusters. Several middleware-oriented projects have been undertaken, including In-VIGO [5], [7], [8], VMPlants [6], DVC [9], virtual disk caching [10], and the Globus Virtual Workspace [11], [12]. ...
Conference Paper
Full-text available
Sharing traditional clusters based on multiprogramming systems among different Virtual Organizations (VOs) can lead to complex situations resulting from the differing software requirements of each VO. This complexity could be eliminated if each cluster computing system supported only a single VO, thereby permitting the VO to customize the operating system and software selection available on its private cluster. While dedicating entire physical clusters on the Grid to single VOs is not practical in terms of cost and scale, an equivalent separation of VOs may be accomplished by deploying clusters of Virtual Machines (VMs) in a manner that gives each VO its own virtual cluster. Such Virtual Organization Clusters (VOCs) can have numerous benefits, including isolation of VOs from one another, independence of each VOC from the underlying hardware, allocation of physical resources on a per-VO basis, and clear separation of administrative responsibilities between the physical fabric provider and the VO itself.
... Finally, W. Emeneker et al. [45] propose to use virtualization in clusters for job forwarding and spanning. However, none of the above cited articles use Dynamite, the load balancing tool employed in this work. ...
Chapter
Full-text available
This chapter reviews the application of a biologically inspired heuristic technique, Cellular Automata (CA), for developing high performance simulations of a well-known complex system: the laser. CA can be described as a class of mathematical systems. They were introduced several decades ago and are well suited to model spatio-temporal phenomena. On the other hand, CA can be implemented very efficiently on parallel platforms, given both their intrinsic parallel nature, with all the components usually working in a synchronized way, and the discreteness of the individual components, which use the same behavior rules. We therefore make use of this feature, and consider the problem of running parallel CA simulations on non-dedicated clusters of workstations. We thus present results of laser dynamics simulations, traditionally modeled using differential equations.
... Each one of these steps adds its own overhead, which can differ depending on the workload type. This was shown in [10]. The authors wanted to improve resource utilization by running jobs in virtual machines across multiple clusters when resources from one would not have been sufficient. ...
Article
Full-text available
Efficiently sharing resources between multiple applications that run on the same distributed infrastructure is challenging. These applications have different and possibly time-varying requirements and, at the same time, users demand different quality of service guarantees. However, current resource management systems still make it difficult to specify and meet these requirements. Recently, virtualization technologies have been seen as a useful tool for managing distributed infrastructures. This led us to survey the use cases of virtualization in distributed computing. We start by giving an overview of virtualization techniques together with the advantages and concerns that may come from their use. Then we introduce the issues that may come from managing a distributed virtualized infrastructure, and we present a taxonomy of representative works that used virtualization as a tool for resource management. Finally, we conclude with an analysis of the advancements made in resource management with the use of virtualization.
... The interoperability between grids and clouds has been under investigation for some time. The beginnings of the research can be seen in the work of adding the ability to make batch systems 'dynamic', that is, adding and removing (often virtualised) Worker Nodes from a batch system queue based on certain events [12, 13, 14, 15, 16, 17]. This has resulted in such modern projects as INFN's Worker Node on Demand [20], which deploys virtualized resources at the WLCG Tier 1 in Italy. ...
Article
Full-text available
Grid computing was developed to provide users with uniform access to large-scale distributed resources. This has worked well; however, there are significant resources available to the scientific community that do not follow this paradigm, namely those on cloud infrastructure providers, HPC supercomputers, or local clusters. DIRAC (Distributed Infrastructure with Remote Agent Control) was originally designed to support direct submission to the Local Resource Management Systems (LRMS) of such clusters for LHCb, matured to support grid workflows, and has recently been updated to support Amazon's Elastic Compute Cloud. This raises a number of new possibilities: by opening avenues to new resources, virtual organisations can change their resources with usage patterns and use these dedicated facilities for a given time. For example, user communities such as High Energy Physics experiments have computing tasks with a wide variety of requirements in terms of CPU, data access or memory consumption, and their usage profile is never constant throughout the year. Having the possibility to transparently absorb peaks in the demand for these kinds of tasks using Cloud resources could allow a reduction in the overall cost of the system. This paper investigates interoperability by following a recent large-scale production exercise utilising resources from these three different paradigms during the 2010 Belle Monte Carlo run. Through this, it discusses the challenges and opportunities of such a model.
... Work focusing on multiplexing a set of VMs on a physical cluster has typically resorted to legacy techniques such as migration [9] or suspend/resume, without providing the performance capabilities or the convenient programming model of SnowFlock. The term "virtual cluster" is used by many projects [13,16,8] focusing on resource provisioning and management. ...
Article
Full-text available
Cloud computing promises to provide researchers with the ability to perform parallel computations using large pools of virtual machines (VMs), without facing the burden of owning or maintaining physical infrastructure. However, with ease of access to hundreds of VMs comes also an increased management burden. Cloud users today must manually instantiate, configure and maintain the virtual hosts in their cluster. They must learn new cloud APIs that are not germane to the problem of parallel processing. Those APIs usually take several minutes to perform their VM-management tasks, forcing users to keep VMs idling and pay for unused processing time, rather than shut VMs down and power them on as needed. Furthermore, users must still configure their cluster management framework to launch their parallel jobs. In this paper we show that all this management pain is unnecessary. We show how to combine a cloud API (SnowFlock) and a parallel processing framework (MPI) to truly realize the potential of the cloud. SnowFlock allows users to fork VMs as if they were processes, occupying multiple physical hosts in sub-second time. We exploit the synergy between this paradigm and MPI's job management to completely hide all details of cloud management from the user. Maintaining a single VM and starting unmodified applications with familiar MPI commands, a user can instantaneously leverage hundreds of processors to perform a parallel computation. Besides making the use of cloud resources trivial, we also eliminate the cost of idling: VMs exist only for as long as they are involved in computation.
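
One way such a wrapper might glue cloning to an MPI launcher is sketched below. fork_workers is a hypothetical stand-in for the VM-cloning call, and the addresses, hostfile name, and binary path are illustrative assumptions, not SnowFlock's actual interface.

import subprocess

def fork_workers(n: int):
    # Hypothetical stand-in for the VM-cloning call; it would return the
    # addresses of n freshly cloned worker VMs in sub-second time.
    return [f"10.0.0.{i + 2}" for i in range(n)]

def mpirun_on_clones(n: int, binary: str):
    hosts = fork_workers(n)
    with open("hostfile", "w") as f:
        f.writelines(h + "\n" for h in hosts)
    # The familiar MPI front-end; cloning replaces manual instantiation
    # and configuration of cloud VMs.
    subprocess.run(
        ["mpirun", "-np", str(n), "--hostfile", "hostfile", binary],
        check=True,
    )

mpirun_on_clones(4, "/opt/app/solver")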
... The ability to dynamically move, suspend, and resume VMs allows for aggressive multiplexing of VMs on server farms, and as a consequence several groups have put forward variations on the notion of a "virtual cluster" [51,161]. The focus of these lines of work has been almost exclusively on resource provisioning and management aspects. ...
... Moreover, VM cloning technology is also widely used in clusters. Emeneker [7] proposes dynamic clustering to provide and manage resources. Emulab [8] initializes dozens of nodes with virtualization technology involved, and multi-cast protocols defined by Frisbee [9] are used to distribute disk information to every single node. ...
Article
Full-text available
Cloud computing has become more and more popular, and the technology of rapid virtual machine (VM) cloning is widely used in cloud computing environments to implement the fast deployment of VMs to meet the need of bursting computational requests. However, the security of VM cloning technology is not considered thoroughly in current approaches. This paper proposes a virtual machine cloning approach based on trusted computing which deals with the memory and disk of a VM separately. This approach resolves three problems: the identity verification of the involved servers; the attestation of the source VM and destination VM; and the protection of the integrity of transmitted data.
... Virtualisation has previously been explored in order to abstract the software environment of a job [1]. It is demonstrated that the ability to adapt HPC resources in this way also benefits resource management, by increasing the number of systems that are suitable to run a job. ...
... The flexibility of the cloud allows one to create multiple clusters where each individual virtual cluster can have its own customised software environment. This draws parallels with the development of the now defunct OSCAR-V middleware [17] and Dynamic Virtual Clustering [11] concepts, which are able to provision a virtual cluster based on the requirements of a job at the time of submission. These systems still incur the overhead of running a VM as the job execution environment and inherit the associated performance limitations. ...
Conference Paper
Full-text available
Linux container technology has more than proved itself useful in cloud computing as a lightweight alternative to virtualisation, whilst still offering good enough resource isolation. Docker is emerging as a popular runtime for managing Linux containers, providing both management tools and a simple file format. Research into the performance of containers compared to traditional Virtual Machines and bare metal shows that containers can achieve near native speeds in processing, memory and network throughput. A technology born in the cloud, it is making inroads into scientific computing, both as a format for sharing experimental applications and as a paradigm for cloud-based execution. However, it has unexplored uses in traditional cluster and grid computing. It provides a runtime environment in which there is an opportunity for typical cluster and parallel applications to execute at native speeds, whilst being bundled with their own specific (or legacy) library versions and support software. This offers a solution to the Achilles' heel of cluster and grid computing: the requirement that the user hold intimate knowledge of the local software infrastructure. Using Docker brings us a step closer to more effective job and resource management within the cluster by providing both a common definition format and a repeatable execution environment. In this paper we present the results of our work in deploying Docker containers in the cluster environment and an evaluation of its suitability as a runtime for high performance parallel execution. Our findings suggest that containers can be used to tailor the runtime environment for an MPI application without compromising performance, and would provide better Quality of Service for users of scientific computing.
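
A minimal sketch of the deployment pattern the paper evaluates, with hypothetical host and image names. Sharing the host network stack is one way to keep MPI traffic near native speed, though the paper's exact configuration is not assumed here.

import subprocess

HOSTS = ["node01", "node02"]   # hypothetical cluster nodes
IMAGE = "mpi-app:latest"       # hypothetical image bundling app + libraries

def launch(host: str):
    # One container per node; sharing the host network stack avoids
    # overlay-network overhead on the MPI traffic.
    return subprocess.Popen(
        ["ssh", host, "docker", "run", "--rm", "--net=host", IMAGE])

procs = [launch(h) for h in HOSTS]
for p in procs:
    p.wait()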
... Dynamic Virtual Clustering (DVC) [15] implemented the scheduling of VMs on existing physical cluster nodes within a campus setting. The motivation for this work was to improve the usage of disparate cluster computing systems located within a single entity. ...
Chapter
Nowadays supercomputer centers strive to provide their computational resources as services; however, present infrastructure is not particularly suited for such use. First, there are standard application programming interfaces to launch computational jobs via the command line or a web service, which work well for a program but turn out to be too complex for scientists: they want applications to be delivered to them from a remote server and prefer to interact with them via a graphical interface. Second, certain applications depend on older versions of operating systems and libraries, and it is either impractical to install those old systems on a cluster or there exists some conflict between these dependencies. Virtualization technologies can solve this problem, but they are not too popular in scientific computing due to the overheads they introduce. Finally, it is difficult to automatically estimate the optimal resource pool size for a particular task, so it often gets done manually by a user. If a large resource pool is requested for a minor task, efficiency degrades. Moreover, cluster schedulers depend on estimated wall time to execute the jobs, and since it cannot be reliably predicted by a human or a machine, their efficiency suffers as well. Application delivery, efficient operating system virtualization, and dynamic sizing of an application's resource pool address the two problems of scientific computing, complex application interfaces and inefficient use of the available resources, and a virtual supercomputer is the way to solve them. The research shows that there are ways to make virtualization technologies efficient for scientific computing: the use of lightweight application containers and the dynamic creation of these containers for a particular job are both fast and transparent for a user. There are universal ways to deliver application output to a front-end by executing a job on a cluster and presenting its results in graphical form. Finally, an application framework can be developed to decompose a parallel application into small independent parts with easily predictable execution times, to simplify scheduling via existing algorithms. The aim of this chapter is to promote the key idea of a virtual supercomputer: to harness all available HPC resources and provide users with convenient access to them. Such a challenge can be effectively faced using contemporary virtualization technologies. They can materialize the long-term dream of having a supercomputer at your own desk.
... Dynamic Virtual Clustering (DVC) [7] implemented the scheduling of VMs on existing physical cluster nodes within a campus setting. The motivation for this work was to improve the usage of disparate cluster computing systems located within a single entity. ...
Conference Paper
Full-text available
An efficient way to conduct experiments on HPC platforms is to create custom virtual computing environments tailored to the requirements of users and their applications. In this paper we investigate the virtual private supercomputer, an approach based on virtualization, data consolidation, and cloud technologies. Virtualization is used to abstract applications from the underlying hardware and operating system, while data consolidation is applied to store data in a distributed storage system. Both the virtualization and data consolidation layers offer APIs for distributed computations and data processing. Combined, these APIs shift the focus from supercomputing technologies to the problems being solved. Based on these concepts, we propose an approach to construct virtual clusters with the help of cloud computing technologies, to be used as on-demand private supercomputers, and evaluate the performance of this solution.
... Not only by extending the classical benefits of VMs for constructing clusters, e.g. consolidation or rapid provisioning of resources [33], [16], but also grid-specific benefits, e.g. support for multiple VOs, isolation of workloads and the encapsulation of services, as has been published by SixSq [7]. ...
Article
This document presents the architecture for StratusLab v1.0. The architecture consists of a set of services, components and tools, integrated into a coherent whole and a single distribution, providing a turnkey solution for creating a private cloud, on which a grid site can be deployed. The constituents of the distribution are integrated and packaged such that they can easily be installed and configured, using both manual and automated methods. A reference architecture is presented, fulfilling the requirements and use cases identified by D2.1 and further analysed in this document. The selection of the elements composing the distribution was made following an updated analysis of the state-of-the-art in cloud and a gap analysis of grid and cloud related technologies, services, libraries and tools. In order to enable cloud interoperability, several interfaces have also been identified, such as contextualisation, remote access to computing and networking, as well as image formats. The result of this work will be StratusLab v1.0, a complete and robust solution enabling the deployment of grid services on a solid cloud layer. The architecture will be updated at project month 15 in Deliverable D4.4.
Article
Full-text available
Virtualization is a very important technology in the IaaS layer of cloud computing. Users consume computing resources as virtual machines (VMs) provided by the system provider. A VM's performance depends on its physical machine, and a VM must be deployed with all required resources when it is created. If no more resources can be deployed, the VM must be moved to another physical machine to obtain higher performance, using live migration. The overhead of a VM live migration is 30 to 90 seconds, so if many virtual machines need migration, the total cost is substantial. This paper presents how a cluster computing architecture can be used to improve VM performance, yielding a 15% performance improvement compared with VM live migration.
Conference Paper
In recent years, cloud computing has become more and more popular, and the technology of rapidly cloning VMs is widely used to implement the fast instantiation of VMs to meet the need of bursting computational requests. However, the security of VM cloning technology is not considered in current approaches. This paper proposes TVMCM, a Trusted VM Clone Model. TVMCM consists of two individual models, TCVMM and TCVDM, where TCVMM ensures security when cloning the memory of a VM and TCVDM ensures security when cloning the disk of a VM. TVMCM resolves three problems: the identity verification of the involved servers; the attestation of the source VM and destination VM; and the protection of the integrity of transmitted data.
Article
Cloud computing is being built on top of established grid technology concepts. On the other hand, it is also true that cloud computing has much to offer to grid infrastructures. The aim of this paper is to provide the ability to build arbitrarily complex grid infrastructures able to sustain the demand required by any given service, taking advantage of the pay-per-use model and the seemingly unlimited capacity of the cloud computing paradigm. It addresses mechanisms that can potentially be used to meet a given quality of service or satisfy peak demands this service may have. These mechanisms imply the elastic growth of the grid infrastructure making use of cloud providers, regardless of whether they are commercial, like Amazon EC2 and GoGrid, or scientific, like Globus Nimbus. This technology of dynamic provisioning is demonstrated in an experiment aimed at showing the overheads caused in the process of offloading jobs to resources created in the cloud.
Conference Paper
Grid computing can benefit from improvements in cloud computing by deploying cloud techniques on grid-enabled resources. The use of virtualization technology in the cloud improves the management and reliability of those resources. The aim of this paper is to study the benefit of applying cloud computing by adding a service manager to grid infrastructures. This manager should help bear the demand from distributed services and pass it to an external partner (the cloud), exploiting the pay-per-use concept and the powerful storage facilities of cloud computing technology. This will improve the quality of service (QoS) and meet peak demand within the grid environment. This paper identifies the relevant work and specifies the problem to be solved. Finally, an architecture that includes the service manager is proposed.
Conference Paper
Virtualization technologies enable flexible ways to configure computing environment according to the needs of particular applications. Combined with software defined networking technologies (SDN), operating system-level virtualization of computing resources can be used to model and tune the computing infrastructure to optimize application performance and optimally distribute virtualized physical resources between concurrent applications. We investigate capabilities provided by several modern tools (Docker, Mesos, Mininet) to model and build virtualized computational infrastructure, investigate configuration management in the integrated environment and evaluate performance of the infrastructure tuned to a particular test application.
Article
Computational clouds constructed on top of existing Grid infrastructure have the capability to provide different entities with customized execution environments and private scheduling overlays. By designing these clouds to be autonomically self-provisioned and adaptable to changing user demands, user-transparent resource flexibility can be achieved without substantially affecting average job sojourn time. In addition, the overlay environment and physical Grid sites represent disjoint administrative and policy domains, permitting cloud systems to be deployed non-disruptively on an existing production Grid. Private overlay clouds administered by, and dedicated to the exclusive use of, individual Virtual Organizations are termed Virtual Organization Clusters. A prototype autonomic cloud adaptation mechanism for Virtual Organization Clusters demonstrates the feasibility of overlay scheduling in dynamically changing environments. Commodity Grid resources are autonomically leased in response to changing private scheduler loads, resulting in the creation of virtual private compute nodes. These nodes join a decentralized private overlay network system called IPOP (IP Over P2P), enabling the scheduling and execution of end user jobs in the private environment. Negligible overhead results from the addition of the overlay, although the use of virtualization technologies at the compute nodes adds modest service time overhead (under 10%) to computationally-bound Grid jobs. By leasing additional Grid resources, a substantial decrease (over 90%) in average job queuing time occurs, offsetting the service time overhead.
Article
Cloud Computing is a model of service delivery and access where dynamically scalable and virtualized resources are provided as a service over the Internet. This model creates a new horizon of opportunity for enterprises. It introduces new operating and business models that allow customers to pay for the resources they effectively use, instead of making heavy upfront investments. The biggest challenge in Cloud Computing is the lack of a de facto standard or single architectural method, which can meet the requirements of an enterprise cloud approach. In this paper, we explore the architectural features of Cloud Computing and classify them according to the requirements of end-users, enterprises that use the cloud as a platform, and cloud providers themselves. We show that several architectural features will play a major role in the adoption of the Cloud Computing paradigm as a mainstream commodity in the enterprise world. This paper also provides key guidelines to software architects and Cloud Computing application developers for creating future architectures.
Conference Paper
Full-text available
As the use of virtual machines (VMs) for scientific applications becomes more common, we encounter the need to integrate VM provisioning models into the existing resource management infrastructure as seamlessly as possible. To address such requirements, we describe an approach to VM management that uses multi-level scheduling to integrate VM provisioning into batch schedulers such as PBS. We then evaluate our approach on the TeraPort cluster at the University of Chicago.
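
One way to picture multi-level scheduling of this kind is a batch job that provisions a VM for its own lifetime: the outer scheduler (PBS) allocates nodes as usual, while the job boots a guest, runs the payload inside it, and tears it down. The script contents, domain name, and payload path below are hypothetical, not the paper's implementation.

import subprocess
import textwrap

# Hypothetical wrapper job: PBS allocates the node as usual, and the job
# itself boots a guest, runs the payload inside it, then tears it down.
script = textwrap.dedent("""\
    #!/bin/sh
    #PBS -l nodes=1,walltime=01:00:00
    virsh start worker-vm            # hypothetical libvirt domain
    ssh worker-vm /opt/app/run       # payload executes inside the VM
    virsh shutdown worker-vm
""")

with open("vm_job.sh", "w") as f:
    f.write(script)
subprocess.run(["qsub", "vm_job.sh"], check=True)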
Conference Paper
Virtual machine (VM) technology applied to HPC has been shown to improve system throughput and turnaround time [2, 3]. VMs provide additional compute power to backlogged clusters by abstracting the software environment of compute resources that are available on other clusters. Borrowing and provisioning resources from other clusters requires some way to efficiently manage VMs across multiple independent hosts [4].
Article
This paper investigates the impact of dynamic clustering and the use of hardware support for distinct parallel programming models in an NoC-based MPSoC environment. Using a dynamically adaptable hardware, the platform provides clusters that implement either a shared memory organization or a distributed memory organization in order to meet applications' requirements without any computational overhead. The entire process is completely transparent for the programmer. In addition, a scheduler is used to take advantage of changes on the degree of parallelism of an application to improve workload balancing. Experimental results show that dynamic clustering can improve performance up to 77% (54% in average) and can provide energy savings up to 58% (42% in average).
Article
Full-text available
Virtualization technology has enabled applications to be decoupled from the underlying hardware, providing the benefits of portability, better control over the execution environment and isolation. It has been widely adopted in scientific grids and commercial clouds. Virtualization, despite its benefits, incurs a performance penalty, which can be significant for systems dealing with uncertainty, such as High Performance Computing (HPC) applications where jobs have tight deadlines and depend on other jobs before they can run. The major obstacle lies in bridging the gap between the performance requirements of a job and the performance offered by the virtualization technology if the jobs were to be executed in virtual machines. In this paper, we present a novel approach to optimize job deadlines when run in virtual machines by developing a deadline-aware algorithm that responds to job execution delays in real time and dynamically optimizes jobs to meet their deadline obligations. Our approaches borrow concepts from both signal processing and statistical techniques; their comparative performance results are presented later in the paper, including the impact on the utilization rate of the hardware resources.
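
A toy version of such a controller, using exponential smoothing to estimate the progress rate and projecting the finish time against the deadline. The smoothing constant and the scaling rule are illustrative assumptions, not the paper's algorithm.

class DeadlineController:
    # Exponentially smoothed progress rate; report() returns the factor
    # by which the job's resource share should grow to meet its deadline.
    def __init__(self, deadline: float, alpha: float = 0.5):
        self.deadline, self.alpha = deadline, alpha
        self.rate = None
        self.last_t = None
        self.last_p = None

    def report(self, t: float, progress: float) -> float:
        if self.last_t is not None and t > self.last_t:
            inst = (progress - self.last_p) / (t - self.last_t)
            self.rate = inst if self.rate is None else (
                self.alpha * inst + (1 - self.alpha) * self.rate)
        self.last_t, self.last_p = t, progress
        if not self.rate or self.rate <= 0:
            return 1.0
        needed = (1.0 - progress) / self.rate        # time at current rate
        available = max(self.deadline - t, 1e-9)     # time left to deadline
        return max(1.0, needed / available)

# Example: halfway done at t=60 with a deadline of 100 -> scale up 1.5x.
c = DeadlineController(deadline=100.0)
c.report(0.0, 0.0)
print(c.report(60.0, 0.5))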
Conference Paper
Advances in industry are gaining attention from the scientific computing community. Programming models and frameworks such as MapReduce and Hadoop are attracting scientists due to their easy-to-use interfaces and autonomic parallel processing abilities. Compared with conventional HPC execution environments, clouds possess desirable features such as administrative privilege and customizable software environments, and are more capable of supporting new frameworks and boosting inter-organization collaborations. Isolated private clouds and public clouds are used by scientists. However, extra administrative and monetary costs are inevitable in both cases. In this paper, we propose My Cloud, a resource management framework integrating cloud-based techniques into HPC systems. My Cloud supports both conventional HPC jobs and cloud-like on-demand virtual cluster provisioning on the same HPC cluster simultaneously. Dynamic resource sharing between the two environments is beneficial for cluster utilization and for handling bursts in usage. The design of My Cloud is friendly to state-of-the-art Software Defined Networking technologies and opens opportunities for fine-grained application-aware network engineering.
Conference Paper
As multi-core processors become increasingly mainstream, people have likewise become more interested in how best to make use of the computing capacity of the CPU. Although many methods, such as running multi-threaded applications, have been adopted to increase CPU utilization, most CPU cycles of multi-core PCs and workstations are idle, even during peak hours. An efficient solution is therefore to help a personal user build a small non-dedicated cluster by collecting idle PCs on a LAN. In order to improve the utilization of multi-core processors and shield the heterogeneity of different platforms, virtual machine (VM) technology can be applied to partition the resources of each computer, changing a physical node into several homogeneous virtual nodes. This personal virtual cluster (PVC) can be created, managed, and released by a personal user, and can run computationally intensive parallel programs, such as MPI applications, for a temporary period. In this paper we present a prototype of PVC built on the popular, open-source Xen virtualization system, and investigate the performance of the typical parallel programming paradigm, MPI, in PVC. Experimental results show that PVC is a helpful computing mode for a personal user on a LAN, and that MPI applications without much communication between processes can achieve good performance in PVC.
Article
Virtual machine acceptance and deployment has exploded with the advent of cloud computing. Unfortunately, virtual machines negatively impact application performance. For parallel scientific codes, any negative performance impact is undesirable. This paper presents an initial investigation of performance-impacting machine-level events, comparing the Xen virtual machine to native Linux, and using knowledge of the underlying CPU cache architecture to improve relevant cache behaviour. Several machine-level events are gathered, including translation lookaside buffer misses and cache misses. Results from the experiments show that cache-aware virtual machine placement has a significant impact on scientific applications.
Conference Paper
Virtual machine prevalence in datacenters introduces a range of potential efficiency issues. It is well known that virtual machines negatively impact application performance when compared to native execution. Multi-core architectures present new opportunities for limiting the performance impact of virtualized execution. The research presented in this paper examines the effects of multi-core cache structure on scientific applications running inside Xen virtual machines. Multiple strategies for assigning virtual machines to physical CPUs are detailed for cases where one or more virtual machines reside on a single node. The results show that placing virtual machines in caches generally improves performance when compared to the default scheduling scheme.
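
A sketch of cache-aware placement using libvirt's virsh vcpupin. The paper's experiments used Xen, so the tool, the core topology, and the domain names here are assumptions for illustration: each VM's vCPUs are confined to one group of cores that share a cache.

import subprocess

# Hypothetical topology: cores 0-3 share one last-level cache, 4-7 another.
CACHE_GROUPS = [[0, 1, 2, 3], [4, 5, 6, 7]]

def pin_vm(domain: str, group: int):
    cores = ",".join(str(c) for c in CACHE_GROUPS[group])
    # Confine every vCPU of the domain (assumed to have as many vCPUs as
    # the group has cores) to one cache domain, so its working set is not
    # bounced between caches by the default scheduler.
    for vcpu in range(len(CACHE_GROUPS[group])):
        subprocess.run(["virsh", "vcpupin", domain, str(vcpu), cores],
                       check=True)

pin_vm("guest-a", 0)
pin_vm("guest-b", 1)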
Chapter
Grid and Cloud Computing models pursue the same objective of constructing large-scale distributed infrastructures, although focusing on complementary aspects. While grid focuses on federating resources and fostering collaboration, cloud focuses on flexibility and on-demand provisioning of virtualized resources. Due to their complementarity, it is clear that both models, or at least some of their concepts and techniques, will coexist and cooperate in existing and future e-infrastructures. This chapter shows how Cloud Computing will help both to overcome many of the barriers to grid adoption and to enhance the management, functionality, suitability, energy efficiency, and utilization of production grid infrastructures.
Article
Full-text available
In this paper, we present a bandwidth-centric job communication model that captures the interaction and impact of simultaneously co-allocating jobs across multiple clusters. We compare our dynamic model with previous research that utilizes a fixed execution time penalty for co-allocated jobs. We explore the interaction of simultaneously co-allocated jobs and the contention they often create in the network infrastructure of a dedicated computational multi-cluster. We also present several bandwidth-aware co-allocating meta-schedulers. These schedulers take inter-cluster network utilization into account as a means by which to mitigate degraded job run-time performance. We make use of a bandwidth-centric parallel job communication model that captures the time-varying utilization of shared inter-cluster network resources. By doing so, we are able to evaluate the performance of multi-cluster scheduling algorithms that focus not only on node resource allocation, but also on shared inter-cluster network bandwidth.
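
A toy instance of a bandwidth-centric penalty: when co-allocated jobs oversubscribe the shared inter-cluster link, communication phases stretch by the oversubscription factor. The fair-share assumption and the constants below are illustrative, not the paper's calibrated, time-varying model.

def coallocation_slowdown(job_bw: float, link_bw: float, n_jobs: int) -> float:
    # Fair share of the shared inter-cluster link per co-allocated job;
    # communication stretches by the oversubscription factor.
    share = link_bw / max(n_jobs, 1)
    return max(1.0, job_bw / share)

# Three co-allocated jobs each wanting 400 Mb/s of a 1 Gb/s link see
# their communication slowed by a factor of 400 / (1000/3) = 1.2.
print(coallocation_slowdown(400.0, 1000.0, 3))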
Conference Paper
Full-text available
This paper examines the suitability of different virtualization techniques in a high performance cluster environment. A survey of virtualization techniques is presented. Two representative technologies (Xen and User Mode Linux) are selected for an in-depth analysis of cluster readiness in terms of their performance, reliability, and overall impact on the complexity of cluster administration.
Conference Paper
Full-text available
Although there is wide agreement that backfilling produces significant benefits in scheduling of parallel jobs, there is no clear consensus on which backfilling strategy is preferable: conservative backfilling, or the more aggressive EASY backfilling scheme. Using trace-based simulation, we show that if performance is viewed within various job categories based on their width (processor request size) and length (job duration), some consistent trends may be observed. Using insights gleaned from the characterization, we develop a selective reservation strategy for backfill scheduling. We demonstrate that the new scheme is better than both conservative and aggressive backfilling. We also consider the issue of fairness in job scheduling and develop a new quantitative approach to its characterization. We show that the newly proposed schemes are also comparable to or better than aggressive backfilling with respect to the fairness criterion.
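For reference, the EASY rule at the center of that comparison can be stated in a few lines; the sketch below implements only the finish-before-reservation condition (EASY's second condition, backfilling onto nodes the head job will not need, is omitted), and all field names are ours.

# Simplified EASY backfilling: a waiting job may start out of order only if
# it cannot delay the reservation held by the job at the head of the queue.
# Conservative backfilling would keep a reservation for *every* waiting job.
from collections import namedtuple

Job = namedtuple("Job", "name nodes runtime")

def easy_backfill(queue, free_nodes, head_start_time, now):
    """Pick jobs from queue[1:] that fit now and finish before the head
    job's reserved start time."""
    started = []
    for job in queue[1:]:
        if job.nodes <= free_nodes and now + job.runtime <= head_start_time:
            started.append(job)
            free_nodes -= job.nodes
    return started

queue = [Job("head", 64, 3600), Job("a", 8, 600), Job("b", 16, 7200)]
print(easy_backfill(queue, free_nodes=16, head_start_time=1000, now=0))
# Only "a" backfills; "b" would still be running at the reservation.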
Conference Paper
Full-text available
Virtual Machine (VM) environments (e.g., VMware and Xen) are experiencing a resurgence of interest for diverse uses including server consolidation and shared hosting. An application's performance in a virtual machine environment can differ markedly from its performance in a non-virtualized environment because of interactions with the underlying virtual machine monitor and other virtual machines. However, few tools are currently available to help debug performance problems in virtual machine environments. In this paper, we present Xenoprof, a system-wide statistical profiling toolkit implemented for the Xen virtual machine environment. The toolkit enables coordinated profiling of multiple VMs in a system to obtain the distribution of hardware events such as clock cycles and cache and TLB misses. The toolkit will facilitate a better understanding of the performance characteristics of Xen's mechanisms, allowing the community to optimize the Xen implementation. We use our toolkit to analyze performance overheads incurred by networking applications running in Xen VMs. We focus on networking applications since virtualizing network I/O devices is relatively expensive. Our experimental results quantify Xen's performance overheads for network I/O device virtualization in uni- and multi-processor systems. With certain Xen configurations, networking workloads in the Xen environment can suffer significant performance degradation. Our results identify the main sources of this overhead, which should be the focus of Xen optimization efforts. We also show how our profiling toolkit was used to uncover and resolve performance bugs, encountered in our experiments, that caused unexpected application behavior.
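Xenoprof itself coordinates OProfile-style sampling across Xen domains; as a loose, non-Xen analogue of gathering the same hardware events, one can drive Linux perf from a script (the event names and workload below are illustrative assumptions):

# Collect cycle, cache-miss, and TLB-miss counts for a workload, the kind
# of machine-level events Xenoprof gathers across VMs. This uses plain
# Linux `perf stat` on the host, not Xenoprof itself.
import subprocess

subprocess.run(
    ["perf", "stat", "-e", "cycles,cache-misses,dTLB-load-misses", "--",
     "python3", "-c", "sum(i * i for i in range(10**7))"],
    check=True,
)
# perf prints the per-event counts to stderr when the workload exits.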
Conference Paper
Full-text available
Two different approaches have been commonly used to address problems associated with space-sharing scheduling strategies: (a) augmenting space sharing with backfilling, which performs out-of-order job scheduling; and (b) augmenting space sharing with time sharing, using a technique called coscheduling or gang scheduling. With three important experimental results (the impact of priority queue order on backfilling, the impact of overestimation of job execution times, and a comparison of scheduling techniques), this paper presents an integrated strategy that combines backfilling with gang scheduling. Using extensive simulations based on detailed models of realistic workloads, the benefits of combining backfilling and gang scheduling are clearly demonstrated over a spectrum of performance criteria.
Conference Paper
Full-text available
Xen is an x86 virtual machine monitor produced by the University of Cambridge Computer Laboratory and released under the GNU General Public License. Performance results comparing XenoLinux (Linux running in a Xen virtual machine) to native Linux as well as to other virtualization tools such as User Mode Linux (UML) were recently published in the paper "Xen and the Art of Virtualization" at the Symposium on Operating Systems Principles (October 2003). In this study, we repeat this performance analysis of Xen. We also extend the analysis in several ways, including comparing XenoLinux on x86 to an IBM zServer. We use this study as an example of repeated research. We argue that this model of research, which is enabled by open source software, is an important step in transferring the results of computer science research into production environments.
Conference Paper
Full-text available
This poster describes initial performance results comparing a tuned lightweight Linux environment to the standard Catamount lightweight kernel environment on compute nodes of the Cray XT3 system. We have created a lightweight Linux environment that consumes less memory and outperforms Catamount for the selfish micro-benchmark that measures operating system interference. In spite of this, Catamount significantly outperforms our lightweight Linux environment for all network performance micro-benchmarks. Latency and bandwidth performance are more than 20% worse for Linux and 16-byte allreduce performance is 2.5 times worse, even at small numbers of nodes. These results indicate that even a properly configured and tuned Linux environment can still suffer from performance and scalability issues on a highly balanced platform like the XT3. This poster provides a detailed description of our lightweight Linux environment, shows relevant performance results, and describes the important issues that allow Catamount to achieve superior performance.
Article
Full-text available
The specific demands of high-performance computing (HPC) often mismatch the assumptions and algorithms provided by legacy operating systems (OS) for common workload mixes. While feature- and application-rich OSes allow for flexible and low-cost hardware configurations, rapid development, and flexible testing and debugging, the mismatch comes at the cost of (oftentimes significant) performance degradation for HPC applications. The ubiquitous availability of virtualization support in all relevant hardware architectures enables new programming and execution models for HPC applications without losing the comfort and support of existing OS and application environments. In this paper we discuss the trends, motivations, and issues in hardware virtualization with emphasis on their value in HPC environments.
Article
Full-text available
Focuses on the architectures of high-performance scientific computers in the United States. Includes a definition of a computer cluster, information on the Beowulf Project, and the expectation of a revolutionary technology.
Conference Paper
Full-text available
The interaction of simultaneously co-allocated jobs can often create contention in the network infrastructure of a dedicated computational grid. This contention can lead to degraded job run-time performance. We present several bandwidth-aware co-allocating meta-schedulers. These schedulers take into account inter-cluster network utilization as a means by which to mitigate this impact. We make use of a bandwidth-centric parallel job communication model that captures the time-varying utilization of shared inter-cluster network resources. By doing so, we are able to evaluate the performance of grid scheduling algorithms that focus not only on node resource allocation, but also on shared inter-cluster network bandwidth.
Article
Full-text available
The design and implementation of a national computing system and data grid has become a reachable goal from both the computer science and computational science point of view. A distributed infrastructure capable of sophisticated computational functions can bring many benefits to scientific work, but poses many challenges, both technical and socio-political. Technical challenges include having basic software tools, higher-level services, functioning and pervasive security, and standards, while socio-political issues include building a user community, adding incentives for sites to be part of a user-centric environment, and educating funding sources about the needs of this community. This paper details the areas relating to Grid research that we feel still need to be addressed to fully leverage the advantages of the Grid. Keywords: Grid computing, survey
Article
In this paper we develop a model which represents the addressing of resources by processes executing on a virtual machine. The model distinguishes two maps: the φ-map, which represents the map visible to the operating system software running on the virtual machine, and the f-map, which is invisible to that software but which is manipulated by the virtual machine monitor running on the real machine. The φ-map maps process names into resource names and the f-map maps virtual resource names into real resource names. Thus, a process running on a virtual machine addresses its resources under the composed map f ∘ φ. In recursive operation, f maps from one virtual machine level to another and we have f ∘ f ∘ ... ∘ f ∘ φ. The model is used to describe and characterize previous virtual machine designs. We also introduce and illustrate a general approach for implementing virtual machines which follows directly from the model. This design, the Hardware Virtualizer, handles all process exceptions directly within the executing virtual machine without software intervention. All resource faults (VM-faults) generated by a virtual machine are directed to the appropriate virtual machine monitor without the knowledge of processes on the virtual machine (regardless of the level of recursion).
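Written out, the two maps and their composition are (a direct transcription of the abstract's definitions into standard notation, with \varphi for the paper's φ):

\varphi : \text{process names} \to \text{virtual resource names} \quad (\text{visible to the guest OS})

f : \text{virtual resource names} \to \text{real resource names} \quad (\text{held by the VMM})

\text{process addressing: } f \circ \varphi, \qquad \text{with } n \text{ levels of recursion: } \underbrace{f \circ f \circ \cdots \circ f}_{n} \circ \varphi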
Conference Paper
Virtual machine (VM) technologies are experiencing a resurgence in both industry and research communities. VMs offer many desirable features such as security, ease of management, OS customization, performance isolation, checkpointing, and migration, which can be very beneficial to the performance and the manageability of high performance computing (HPC) applications. However, very few HPC applications are currently running in a virtualized environment due to the performance overhead of virtualization. Further, using VMs for HPC also introduces additional challenges such as management and distribution of OS images. In this paper we present a case for HPC with virtual machines by introducing a framework which addresses the performance and management overhead associated with VM-based computing. Two key ideas in our design are: Virtual Machine Monitor (VMM) bypass I/O and scalable VM image management. VMM-bypass I/O achieves high communication performance for VMs by exploiting the OS-bypass feature of modern high speed interconnects such as InfiniBand. Scalable VM image management significantly reduces the overhead of distributing and managing VMs in large scale clusters. Our current implementation is based on the Xen VM environment and InfiniBand. However, many of our ideas are readily applicable to other VM environments and high speed interconnects. We carry out detailed analysis on the performance and management overhead of our VM-based HPC framework. Our evaluation shows that HPC applications can achieve almost the same performance as those running in a native, non-virtualized environment. Therefore, our approach holds promise to bring the benefits of VMs to HPC applications with very little degradation in performance.
Conference Paper
As larger and larger commodity clusters for high performance computing proliferate at research institutions around the world, the challenges in maintaining effective use of these systems also continue to increase. Among the many challenges are maintaining the appropriate software stack for a broad array of applications, and sharing workload across clusters. The Dynamic Virtual Clustering (DVC) system integrates the Xen virtual machine with the Moab scheduler to allow for creation of virtual clusters on a per-job basis. These virtual clusters can provide a unique software environment for a particular application, or can provide a consistent software environment across multiple heterogeneous clusters. In this paper, the overhead of Xen-based DVC vs. native cluster performance is examined for workloads consisting of both serial and MPI-based parallel jobs.
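The per-job boot step such a system performs can be sketched as follows; the guest config fields, image paths, and the use of the modern xl toolstack (DVC predates it) are all illustrative assumptions, and the real system drives this from the Moab scheduler rather than a standalone script.

# Sketch of booting a per-job virtual cluster: write one Xen guest config
# per allocated node and start it with `xl create`. Paths, sizing, and
# names are hypothetical.
import subprocess

CONFIG_TEMPLATE = """\
name = "{name}"
memory = 1024
vcpus = 1
disk = ["file:/images/{image},xvda,w"]
vif = ["bridge=xenbr0"]
"""

def boot_virtual_cluster(job_id, image, node_count):
    for i in range(node_count):
        name = f"dvc-{job_id}-{i}"
        cfg_path = f"/tmp/{name}.cfg"
        with open(cfg_path, "w") as cfg:
            cfg.write(CONFIG_TEMPLATE.format(name=name, image=image))
        subprocess.run(["xl", "create", cfg_path], check=True)

boot_virtual_cluster(job_id=42, image="mpi-env.img", node_count=4)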
Article
Since 1984, the Condor project has enabled ordinary users to do extraordinary computing. Today, the project continues to explore the social and technical problems of cooperative computing on scales ranging from the desktop to the world-wide computational grid. In this chapter, we provide the history and philosophy of the Condor project and describe how it has interacted with other projects and evolved along with the field of distributed computing. We outline the core components of the Condor system and describe how the technology of computing must correspond to social structures. Throughout, we reflect on the lessons of experience and chart the course traveled by research ideas as they grow into production systems.
Conference Paper
The Globus project is a multi-institutional research effort that seeks to enable the construction of computational grids providing pervasive, dependable, and consistent access to high-performance computational resources, despite geographical distribution of both resources and users. Computational grid technology is being viewed as a critical element of future high-performance computing environments that will enable entirely new classes of computation-oriented applications, much as the World Wide Web fostered the development of new classes of information-oriented applications. The authors report on the status of the Globus project as of early 1998. They describe the progress that has been achieved to date in the development of the Globus toolkit, a set of core services for constructing grid tools and applications. They also discuss the Globus Ubiquitous Supercomputing Testbed (GUSTO) that they have constructed to enable large-scale evaluation of Globus technologies, and they review early experiences with the development of large-scale grid applications on the GUSTO testbed
Conference Paper
Applications that use high-speed networks to connect geographically distributed supercomputers, databases, and scientific instruments may operate over open networks and access valuable resources. Hence, they can require mechanisms for ensuring integrity and confidentiality of communications and for authenticating both users and resources. Security solutions developed for traditional client-server applications do not provide direct support for the program structures, programming tools, and performance requirements encountered in these applications. We address these requirements via a security-enhanced version of the Nexus communication library, which we use to provide secure versions of parallel libraries and languages, including the Message Passing Interface. These tools permit a fine degree of control over what, where, and when security mechanisms are applied. In particular, a single application can mix secure and nonsecure communication, allowing the programmer to make fine-grained security/performance tradeoffs. We present performance results that quantify the performance of our infrastructure.
Article
Computing often recycles the old as new, and this holds true for strict virtual machines. Perhaps strict virtual architectures will soon be pressed into use for multicore chips. Programmers inhabit the next level up. Virtual programmers would be end users who could routinely put together sequences of high-level statements specific to their problems as users. These programmers could put basic operations together in their own sequences, subject to their own conditioning. This virtual programming could be done through a command and scripting/macro interface.
Article
As high-speed networks make it easier to use distributed resources, it becomes increasingly common that applications and their data are not colocated. Users have traditionally addressed this problem by manually staging data to and from remote computers. We argue instead for a new remote I/O paradigm in which programs use familiar parallel I/O interfaces to access remote filesystems. In addition to simplifying remote execution, remote I/O can improve performance relative to staging by overlapping computation and data transfer or by reducing communication requirements. However, remote I/O also introduces new technical challenges in the areas of portability, performance, and integration with distributed computing systems. We propose techniques designed to address these challenges and describe a remote I/O library called RIO that we have developed to evaluate the effectiveness of these techniques. RIO addresses issues of portability by adopting the quasi-standard MPI-IO interface and by de...
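Because RIO adopts the MPI-IO interface, application code looks like ordinary parallel I/O; a minimal collective-read sketch with mpi4py follows (the file path is hypothetical, and this runs against a local file rather than RIO's remote transport):

# Each rank reads its own block of a shared file through MPI-IO. RIO's
# point is that the same interface can transparently target a remote
# filesystem; this sketch uses a local file for illustration.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
BLOCK = 4096

fh = MPI.File.Open(comm, "/tmp/shared.dat", MPI.MODE_RDONLY)
buf = bytearray(BLOCK)
fh.Read_at_all(rank * BLOCK, buf)  # collective read at per-rank offsets
fh.Close()
print(f"rank {rank} read {len(buf)} bytes")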
Article
Users frequently have to choose between functionality and security. When running popular Web browsers or email clients, they frequently find themselves turning off features such as JavaScript, only to switch them back on in order to view a certain site or read a particular message. Users of Unix (or similar) systems can construct a sandbox where such programs execute in a restricted environment. Creating such a sandbox is not trivial; one has to determine what files or services to place within the sandbox to facilitate the execution of the application. In this paper we describe a portable system that tracks the file requests made by applications, creating an access log. The same system can then use the access log as a template to regulate file access requests made by sandboxed applications. We present an example of how this system was used to place Netscape Navigator in a sandbox.
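The first half of that workflow, recording which files an application opens, can be approximated with standard tracing tools; the sketch below uses strace as a stand-in for the paper's own interposition mechanism (the traced command is an arbitrary example):

# Build the kind of file-access log the sandbox template is derived from:
# run a program under strace and collect every path it opens.
import re
import subprocess

trace = subprocess.run(
    ["strace", "-f", "-e", "trace=openat", "/bin/ls", "/etc"],
    capture_output=True, text=True,
).stderr  # strace writes its trace to stderr by default

accessed = sorted(set(re.findall(r'openat\([^,]*, "([^"]+)"', trace)))
for path in accessed:
    print(path)  # candidate paths to allow inside the sandbox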
Article
State-of-the-art and emerging scientific applications require fast access to large quantities of data and commensurately fast computational resources. Both resources and data are often distributed in a wide-area network with components administered locally and independently. Computations may involve hundreds of processes that must be able to acquire resources dynamically and communicate efficiently. This paper analyzes the unique security requirements of large-scale distributed (grid) computing and develops a security policy and a corresponding security architecture. An implementation of the architecture within the Globus metacomputing toolkit is discussed.
Article
Emerging high-performance applications require the ability to exploit diverse, geographically distributed resources. These applications use high-speed networks to integrate supercomputers, large databases, archival storage devices, advanced visualization devices, and/or scientific instruments to form networked virtual supercomputers or metacomputers. While the physical infrastructure to build such systems is becoming widespread, the heterogeneous and dynamic nature of the metacomputing environment poses new challenges for developers of system software, parallel tools, and applications. In this article, we introduce Globus, a system that we are developing to address these challenges. The Globus system is intended to achieve a vertically integrated treatment of application, middleware, and network. A low-level toolkit provides basic mechanisms such as communication, authentication, network information, and data access. These mechanisms are used to construct various higher-leve...
Article
Cluster computing is not a new area of computing. It is, however, evident that there is a growing interest in its usage in all areas where applications have traditionally used parallel or distributed computing platforms. The growing interest has been fuelled in part by the availability of powerful microprocessors and high-speed networks as off-the-shelf commodity components, and in part by the rapidly maturing software components available to support high performance and high availability applications. This White Paper has been broken down into eleven sections, each of which has been put together by academics and industrial researchers who are both experts in their fields and were willing to volunteer their time and effort to put together this White Paper. The status of this paper is draft and we are at the stage of publicizing its presence and making a Request For Comments (RFC).
What is the Open Science Grid?
Open Science Grid, "What is the Open Science Grid?" 2007. [Online]. Available: http://www.opensciencegrid.org/About/What is the Open Science Grid
VM and the VM Community: Past, Present, and Future
M. Varian, "VM and the VM Community: Past, Present, and Future," 1997. [Online]. Available: http://www.os.nctu.edu.tw/vm/pdf/VM and the VM Community Past Present and Future.pdf