Cluster Computing (CLUSTER COMPUT)

Publisher Springer Verlag

Description

Cluster Computing: The Journal of Networks, Software Tools and Applications will provide a forum for presenting the latest research and technology that unify the fields of parallel processing, distributed computing systems, and computer networks. Current advances in processing and networking technology and software have spurred considerable research interest in network computing, as demonstrated by the federal High Performance Computing and Communications (HPCC) and National Information Infrastructure (NII) initiatives. In the last few years there has been increased interest in developing applications, software tools, communication protocols, and computer networks to capitalize on these advances and initiatives. Publications about these developments currently appear in several journals that focus either on the communications field or on parallel and distributed computing, with a strong emphasis on parallel computing. Cluster Computing will uniquely address the latest results in these three fields that support high performance distributed computing over a computer network. The journal will be an important source of information for the growing number of researchers, developers, and users of High Performance Distributed Computing (HPDC) environments. In HPDC environments, parallel and/or distributed computing techniques are applied to the solution of computationally intensive applications across networks of computers.

  • Impact factor
    0.52
  • Website
    Cluster Computing website
  • Other titles
    Cluster computing (Online)
  • ISSN
    1386-7857
  • OCLC
    43078361
  • Material type
    Document, Periodical, Internet resource
  • Document type
    Internet Resource, Computer File, Journal / Magazine / Newspaper

Publisher details

Springer Verlag

  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author can archive a post-print version
  • Conditions
    • Authors own final version only can be archived
    • Publisher's version/PDF cannot be used
    • On author's website or institutional repository
    • On funder's designated website/repository after 12 months, at the funder's request or as a result of legal obligation
    • Published source must be acknowledged
    • Must link to publisher version
    • Set phrase to accompany link to published version (The original publication is available at www.springerlink.com)
    • Articles in some journals can be made Open Access on payment of additional charge
  • Classification
    green

Publications in this journal

  • Article: Optimizing tensor contraction expressions for hybrid CPU-GPU execution
    Cluster Computing 01/2013;
  • Article: Lifetime Extension Based on Residual Energy for Receiver-driven Multi-hop Wireless Network
    ABSTRACT: An important research topic in wireless sensor networking is the extension of operating time by controlling the power consumption of individual nodes. In a receiver-driven communication protocol, a receiver node periodically transmits its ID to the sender node, and in response the sender node sends an acknowledgment, after which data transmission starts. By applying such a receiver-driven protocol to wireless sensor networks, the average power consumption of the network can be controlled, but there still remains the problem of unbalanced load distribution among nodes. Therefore, part of the network shuts down when the battery of the node that consumes the most power is completely discharged. To extend the network lifetime, we propose a method where information about the residual energy level is exchanged through ID packets in order to balance power consumption. Simulation results show that the network lifetime can be extended by about 70–100 % while maintaining high network performance in terms of packet collection ratio and delay.
    Cluster Computing 05/2012;
  • Article: Explicit coordination to prevent congestion in data center networks
    ABSTRACT: Large cluster-based cloud computing platforms increasingly use commodity Ethernet technologies, such as Gigabit Ethernet, 10GigE, and Fibre Channel over Ethernet (FCoE), for intra-cluster communication. Traffic congestion can become a performance concern in the Ethernet due to consolidation of data, storage, and control traffic over a common layer-2 fabric, as well as consolidation of multiple virtual machines (VMs) over less physical hardware. Even as networking vendors race to develop switch-level hardware support for congestion management, we make the case that virtualization has opened up a complementary set of opportunities to reduce or even eliminate network congestion in cloud computing clusters. We present the design, implementation, and evaluation of a system called XCo, that performs explicit coordination of network transmissions over a shared Ethernet fabric to proactively prevent network congestion. XCo is a software-only distributed solution executing only in the end-nodes. A central controller uses explicit permissions to temporally separate (at millisecond granularity) the transmissions from competing senders through congested links. XCo is fully transparent to applications, presently deployable, and independent of any switch-level hardware support. We present a detailed evaluation of our XCo prototype across a number of network congestion scenarios, and demonstrate that XCo significantly improves network performance during periods of congestion. We also evaluate the behavior of XCo for large topologies using NS3 simulations. Keywords: Congestion, Ethernet, Virtualization
    Cluster Computing 04/2012;
  • Article: Service control with the preemptive parallel job scheduler Scojo-PECT
    ABSTRACT: User satisfaction and scheduling on grids makes predictability of response times and quality-of-service highly desirable. However, existing approaches for response-time prediction still show significant prediction errors, mostly due to problems in dynamic arrival of jobs with potentially higher priority and hard-to-anticipate packing and backfilling effects. The same problems imply that quality-of-service cannot be solved with standard approaches from communication systems. Thus, this paper presents a scheduling approach which provides a more suitable framework for service guarantees and predictability. The approach is based on coarse-grain preemption, combined with an innovative separation of job classes. Resource shares can be determined as necessary to meet target service levels. A further extension permits limited dynamic resource allocation to adapt to variations in machine load and job mixes. The feasibility of service control is demonstrated with various workloads. Keywords: Job scheduling, Gang scheduling, Preemption, Quality-of-service, Prediction
    Cluster Computing 04/2012; 14(2):165-182.
  • Article: Sensor scheduling for p-percent coverage in wireless sensor networks
    ABSTRACT: We study sensor scheduling for p-percent coverage and propose two scheduling algorithms that prolong network lifetime, motivated by the fact that some applications do not need full coverage and that different subareas of the monitored area may have different coverage requirements. The Centralized p-Percent Coverage Algorithm (CPCA) is a centralized algorithm that selects the least number of nodes to monitor p-percent of the monitored area. The Distributed p-Percent Coverage Protocol (DPCP) is a distributed algorithm that determines a set of nodes in a distributed manner to cover p-percent of the monitored area. Both algorithms guarantee network connectivity. The simulation results show that our algorithms can remarkably prolong network lifetime, have less than 5% un-required coverage for large networks, and employ nodes fairly in most cases. Keywords: Wireless sensor networks, p-coverage, Sensor schedule
    Cluster Computing 04/2012; 14(1):27-40.
  • Article: A stochastic approach to estimating earliest start times of nodes for scheduling DAGs on heterogeneous distributed computing systems
    ABSTRACT: Previously, DAG scheduling schemes used the mean (average) of computation or communication time in dealing with temporal heterogeneity. However, it is not optimal to consider only the means of computation and communication times in DAG scheduling on a temporally (and spatially) heterogeneous distributed computing system. In this paper, it is proposed that the second order moments of computation and communication times, such as the standard deviations, be taken into account in addition to their means, in scheduling “stochastic” DAGs. An effective scheduling approach which accurately estimates the earliest start time of each node and derives a schedule leading to a shorter average parallel execution time has been developed. Through an extensive computer simulation, it has been shown that a significant improvement (reduction) in the average parallel execution times of stochastic DAGs can be achieved by the proposed approach. Keywords: Average parallel execution time, Competing situation, Scheduling, Spatial heterogeneity, Stochastic DAG, Temporal heterogeneity
    Cluster Computing 04/2012; 14(4):377-395.
  • Article: A high performance integrated web data warehousing
    ABSTRACT: Over the years, we have seen a significant number of integration techniques for data warehouses to support web integrated data. However, the existing works focus extensively on the design concept. In this paper, we focus on the performance of a web database application such as an integrated web data warehousing using a well-defined and uniform structure to deal with web information sources including semi-structured data such as XML data, and documents such as HTML in a web data warehouse system. By using a case study, our implementation of the prototype is a web manipulation concept for both incoming sources and result outputs. Thus, the system not only can be operated through the web, it can also handle the integration of web data sources and structured data sources. Our main contribution is the performance evaluation of an integrated web data warehouse application which includes two tasks. Task one is to perform a verification of the correctness of integrated data based on the result set that is retrieved from the web integrated data warehouse system using complex and OLAP queries. The result set is checked against the result set that is retrieved from the existing independent data source systems. Task two is to measure the performance of OLAP or complex query by investigating source operation functions used by these queries to retrieve the data. The information of source operation functions used by each query is obtained using the TKPROF utility.
    Cluster Computing 04/2012; 10(1):95-109.
  • Article: Performance analysis of ALOHA and p-persistent ALOHA for multi-hop underwater acoustic sensor networks
    ABSTRACT: The extreme conditions under which multi-hop underwater acoustic sensor networks (UASNs) operate constrain the performance of medium access control (MAC) protocols. The MAC protocol employed significantly impacts the operation of the network supported, and such impacts must be carefully considered when developing protocols for networks constrained by both bandwidth and propagation delay. Time-based coordination, such as TDMA, has limited applicability due to the dynamic nature of the water channel used to propagate the sound signals, as well as the significant effect of relatively small changes in propagation distance on the propagation time. These effects cause inaccurate time synchronization and therefore make time-based access protocols less viable. The large propagation delays also diminish the effectiveness of carrier sense protocols, as they do not predict with any certainty the status of the intended recipients at the point when the traffic would arrive. Thus, CSMA protocols do not perform well in UASNs, either. Reservation-based protocols have seldom been successful in commercial products over the past 50 years due to many drawbacks, such as limited scalability and relatively low robustness. In particular, the impact of propagation delays in UASNs and other such constrained networks obfuscates the operation of the reservation protocols and diminishes, if not completely negates, the benefit of reservations. The efficacy of the well-known RTS-CTS scheme, as a reservation-based enhancement to the CSMA protocol, is also adversely impacted by long propagation delays. An alternative to these MAC protocols is the much less complex ALOHA protocol, or one of its variants. However, the performance of such protocols within the context of multi-hop networks is not well studied.
    In this paper we identify the challenges of modeling contention-based MAC protocols and present models for analyzing ALOHA and p-persistent ALOHA variants for a simple string topology. As expected, an application of the model suggests that ALOHA variants are very sensitive to traffic loads. Indeed, when the traffic load is small, utilization becomes insensitive to p values. A key finding, though, is the significance of the network size on the protocols’ performance, in terms of successful delivery of traffic from outlying nodes, indicating that such protocols are only appropriate for very small networks, as measured by hop count. Keywords: Underwater acoustic sensor networks, MAC, ALOHA, p-persistent ALOHA, Multi-hop
    Cluster Computing 04/2012; 14(1):65-80.
  • Article: Integrated parallel performance views
    ABSTRACT: The influences of the operating system and system-specific effects on application performance are increasingly important considerations in high performance computing. OS kernel measurement is key to understanding the performance influences and the interrelationship of system and user-level performance factors. The KTAU (Kernel TAU) methodology and Linux-based framework provide parallel kernel performance measurement from both a kernel-wide and a process-centric perspective. The first characterizes overall aggregate kernel performance for the entire system. The second characterizes kernel performance when it runs in the context of a particular process. KTAU extends the TAU performance system with kernel-level monitoring, while leveraging TAU’s measurement and analysis capabilities. We explain the rationale and motivations behind our approach, describe the KTAU design and implementation, and show working examples on multiple platforms demonstrating the versatility of KTAU in integrated system/application monitoring.
    Cluster Computing 04/2012; 11(1):57-73.
  • Article: Ranking the importance of alerts for problem determination in large computer systems
    ABSTRACT: The complexity of large computer systems has raised unprecedented challenges for system management. In practice, operators often collect large volumes of monitoring data from system components and set up many rules to check data and trigger alerts. However, the alerts from various rules usually have different problem reporting accuracy because their thresholds are often manually set based on operators’ experience and intuition. Meanwhile, due to system dependencies, a single problem may trigger many alerts at the same time in large systems, and the critical question is which alert should be analyzed first in the following problem determination process. In this paper, we propose a novel peer review mechanism to rank the importance of alerts, so that the top ranked alerts are more likely to be true positives. After comparing a metric value against its threshold to generate alerts, we also compare the value with the equivalent thresholds from many other rules to determine the importance of alerts. Our approach is evaluated with a real test bed system, and experimental results are included to demonstrate its effectiveness. Keywords: Fault management, Rule management, Alert ranking, Fault model, Invariant network, Peer review
    Cluster Computing 04/2012; 14(3):213-227.
  • Article: FRASystem: fault tolerant system using agents in distributed computing systems
    ABSTRACT: In this paper, we present a fault tolerant and recovery system called FRASystem (Fault Tolerant & Recovery Agent System) using multiple agents in distributed computing systems. Previous rollback-recovery protocols were dependent on an inherent communication and an underlying operating system, which caused a decline of computing performance. We propose a rollback-recovery protocol that works independently of the operating system and leads to increased portability and extensibility. We define four types of agents: (1) a recovery agent performs a rollback-recovery protocol after a failure, (2) an information agent constructs domain knowledge as a rule of fault tolerance and information during a failure-free operation, (3) a facilitator agent controls the communication between agents, (4) a garbage collection agent performs garbage collection of the useless fault tolerance information. Since agent failures may lead to inconsistent states of a system and a domino effect, we propose an agent recovery algorithm. A garbage collection protocol addresses the performance degradation caused by the increment of saved fault tolerance information in a stable storage. We implemented a prototype of FRASystem using Java and CORBA and experimented with the proposed rollback-recovery protocol. The simulation results indicate that the performance of our protocol is better than previous rollback-recovery protocols which use independent checkpointing and pessimistic message logging without using agents. Our contributions are as follows: (1) this is the first rollback-recovery protocol using agents, (2) FRASystem is not dependent on an operating system, and (3) FRASystem provides portability and extensibility. Keywords: Fault tolerance, Multi-agent system, Distributed computing system, Rollback-recovery, Garbage-collection
    Cluster Computing 04/2012; 14(1):15-25.
  • Article: Stochastic bounds for composite Web services response times
    ABSTRACT: In this paper, we propose bounding models, which provide upper and lower bounds on response time in a composite Web service model, for alleviating the state explosion problem. The considered models have heterogeneous servers, and the number of elementary Web services can be very large. More precisely, we study two types of composite Web services. First, we investigate the performance of a single composite Web service execution instance. Second, this assumption is relaxed (i.e., multiple composite Web service execution instances are considered). These models allow a trade-off to be found between the accuracy of the bounds and the computational complexity. Keywords: Web services, Process coupling, Response time, Markov Chain
    Cluster Computing 04/2012;
  • Article: A new degree of freedom for memory allocation in clusters
    ABSTRACT: Improvements in parallel computing hardware usually involve increments in the number of available resources for a given application such as the number of computing cores and the amount of memory. In the case of shared-memory computers, the increase in computing resources and available memory is usually constrained by the coherency protocol, whose overhead rises with system size, limiting the scalability of the final system. In this paper we propose an efficient and cost-effective way to increase the memory available for a given application by leveraging free memory in other computers in the cluster. Our proposal is based on the observation that many applications benefit from having more memory resources but do not require more computing cores, thus reducing the requirements for cache coherency and allowing a simpler implementation and better scalability. Simulation results show that, when additional mechanisms intended to hide remote memory latency are used, execution time of applications that use our proposal is similar to the time required to execute them in a computer populated with enough local memory, thus validating the feasibility of our proposal. We are currently building a prototype that implements our ideas. The first results from real executions in this prototype demonstrate not only that our proposal works but also that it can efficiently execute applications that make use of remote memory resources. Keywords: Cluster, Memory aggregation, HyperTransport
    Cluster Computing 04/2012;
  • Article: DataStager: scalable data staging services for petascale applications
    ABSTRACT: Known challenges for petascale machines are that (1) the costs of I/O for high performance applications can be substantial, especially for output tasks like checkpointing, and (2) noise from I/O actions can inject undesirable delays into the runtimes of such codes on individual compute nodes. This paper introduces the flexible ‘DataStager’ framework for data staging and alternative services within that jointly address (1) and (2). Data staging services moving output data from compute nodes to staging or I/O nodes prior to storage are used to reduce I/O overheads on applications’ total processing times, and explicit management of data staging offers reduced perturbation when extracting output data from a petascale machine’s compute partition. Experimental evaluations of DataStager on the Cray XT machine at Oak Ridge National Laboratory establish both the necessity of intelligent data staging and the high performance of our approach, using the GTC fusion modeling code and benchmarks running on 1000+ processors. Keywords: I/O, WARP, GTC, XT3, Datatap, XT4, Staging, Data services
    Cluster Computing 04/2012; 13(3):277-290.

Keywords

Computer networks
 
Computernetwerken
 
Electronic data processing
 
Parallel processing (Electronic computers)
 
