Derrick Kondo’s research while affiliated with PX'Therapeutics, SA, Grenoble, France and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (76)


Optimization of Composite Cloud Service Processing with Virtual Machines
  • Article

June 2015 · 27 Reads · 29 Citations · IEEE Transactions on Computers

Derrick Kondo
By leveraging virtual machine (VM) technology, we optimize cloud system performance through refined resource allocation when processing user requests composed of composite services. Our contribution is three-fold. (1) We devise a VM resource allocation scheme that minimizes the processing overhead of task execution. (2) We comprehensively investigate the best-suited task scheduling policy under different design parameters. (3) We also explore the best-suited resource sharing scheme, with divisible resource fractions adjusted on running tasks under a proportional-share model (PSM), which can be split into an absolute mode (AAPSM) and a relative mode (RAPSM). We implement a prototype system over a cluster environment deployed with 56 real VM instances, and summarize valuable experience from our evaluation. When the system is short of supply, lightest workload first (LWF) is mostly recommended because it minimizes the overall response extension ratio (RER) for both sequential-mode and parallel-mode tasks. In a competitive situation with over-commitment of resources, the best policy combines LWF with both AAPSM and RAPSM. It outperforms other solutions in the competitive situation by 16+ % w.r.t. worst-case response time and by 7.4+ % w.r.t. fairness.
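The proportional-share model the abstract refers to divides a resource among competing tasks in proportion to per-task weights. A minimal sketch of the basic split (the paper's absolute and relative adjustment variants, AAPSM/RAPSM, are not modeled here; function and weights are illustrative):

```python
def proportional_share(total_capacity, weights):
    """Split a divisible resource (e.g., one CPU) among competing tasks
    in proportion to their weights -- the basic proportional-share model.
    The absolute/relative adjusted variants from the paper are omitted."""
    total = sum(weights)
    return [total_capacity * w / total for w in weights]

# Three tasks competing for one CPU with weights 2, 1, 1:
shares = proportional_share(1.0, [2, 1, 1])  # -> [0.5, 0.25, 0.25]
```

Under over-commitment, adjusting these weights per task is what distinguishes the AAPSM and RAPSM variants studied in the paper.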


Characterizing and modeling cloud applications/jobs on a Google data center

July 2014 · 73 Reads · 58 Citations · The Journal of Supercomputing

In this paper, we characterize and model Google applications and jobs, based on a 1-month Google trace from a large-scale Google data center. We make four contributions: (1) we compute valuable statistics about task events and resource utilization for Google applications, based on various types of resources and execution types; (2) we analyze the classification of applications via a K-means clustering algorithm with an optimized number of sets, based on task events and resource usage; (3) we study the correlation of Google application properties and running features (e.g., job priority and scheduling class); (4) we finally build a model that can simulate Google jobs/tasks and dynamic events, in accordance with the Google trace. Experiments show that the tasks simulated based on our model exhibit fairly analogous features to those in the Google trace; 95+ % of tasks’ simulation errors are …
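Contribution (2) clusters applications by their task-event and resource-usage features. A plain k-means sketch over per-application feature vectors (the paper additionally optimizes the number of clusters k, which this sketch does not; the feature choice here is hypothetical):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over feature vectors (e.g., per-application mean CPU
    and mean memory usage). Returns the final cluster centers. Choosing
    an optimized k, as the paper does, is out of scope for this sketch."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared distance).
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep old
        # center if a cluster ends up empty).
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
                   else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

# Two well-separated application groups in (CPU, memory) feature space:
centers = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```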


Fault-tolerant scheduling on parallel systems with non-memoryless failure distributions

May 2014 · 25 Reads · 6 Citations · Journal of Parallel and Distributed Computing

As large parallel systems increase in size and complexity, failures are inevitable and exhibit complex space and time dynamics. Most often, in real systems, failure rates are increasing or decreasing over time. Considering non-memoryless failure distributions, we study a bi-objective scheduling problem of optimizing application makespan and reliability. In particular, we determine whether one can optimize both makespan and reliability simultaneously, or whether one metric must be degraded in order to improve the other. We also devise scheduling algorithms for achieving (approximately) optimal makespan or reliability. When failure rates decrease, we prove that makespan and reliability are opposing metrics. In contrast, when failure rates increase, we prove that one can optimize both makespan and reliability simultaneously. Moreover, we show that the largest processing time (LPT) list scheduling algorithm achieves good performance when processors are of uniform speed. The implications of our findings are the accelerated completion and improved reliability of parallel jobs executed across large distributed systems. Finally, we conduct simulations to investigate the impact of failures on performance, using an actual application of biological sequence comparison.
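The LPT rule mentioned above is simple to state: sort jobs longest-first and greedily place each on the currently least-loaded processor. A minimal sketch on uniform-speed processors (job times are illustrative):

```python
def lpt_makespan(job_times, m):
    """List scheduling with the Largest Processing Time rule: sort jobs
    longest-first, then place each on the least-loaded of m uniform-speed
    processors. Returns the resulting makespan."""
    loads = [0.0] * m
    for t in sorted(job_times, reverse=True):
        i = loads.index(min(loads))  # least-loaded processor
        loads[i] += t
    return max(loads)

# Five jobs on two processors: LPT yields makespan 10, while the optimum
# here is 9 ({5,4} vs {3,3,3}) -- within LPT's classic (4/3 - 1/(3m)) bound.
lpt_makespan([5, 4, 3, 3, 3], 2)  # -> 10.0
```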


Fig. 1 Example of BoT execution with noteworthy values.
Fig. 2 Profiling execution of BoTs in BE-DCIs: Tail Slowdown is the BoT completion time divided by the ideal completion time (i.e., determined by assuming a constant completion rate). The cumulative distribution function of observed slowdowns is represented.
Table 3 Availability and unavailability of Best Effort DCI nodes. Av. quartiles and Unav. quartiles are the node availability and unavailability duration quartiles, in seconds.
Table 4 Computing power of Best Effort DCI nodes. Avg. power and …
Table 6 Characteristics of the BoT workload: size is the number of tasks in the BoT, nops/task is the number of instructions per task, and arrival is the distribution of task arrival times. weib is the Weibull distribution and norm the Normal distribution.

SpeQuloS: A QoS service for hybrid and elastic computing infrastructures
  • Article
  • Full-text available

March 2014 · 125 Reads · 9 Citations · Cluster Computing

The large choice of Distributed Computing Infrastructures (DCIs) available allows users to select and combine their preferred architectures amongst Clusters, Grids, Clouds, Desktop Grids and more. In these hybrid DCIs, elasticity is emerging as a key property. In elastic infrastructures, the resources available to execute an application vary continuously, either because of application requirements or because of constraints on the infrastructure, such as node volatility. In the latter case, there is no guarantee that the computing resources will remain available during the entire execution of an application. In this paper, we show that Bag-of-Tasks (BoT) execution on these “Best-Effort” infrastructures suffers from a drop in the task completion rate at the end of the execution. The SpeQuloS service presented in this paper improves the Quality of Service (QoS) of BoT applications executed on hybrid and elastic infrastructures. SpeQuloS monitors the execution of the BoT and dynamically supplies fast and reliable Cloud resources when the critical part of the BoT is executed. SpeQuloS offers several features to hybrid DCI users, such as estimating completion time and execution speedup. Performance evaluation shows that BoT executions can be accelerated by a factor of 2, while offloading less than 2.5 % of the workload to the Cloud. We report on several scenarios where SpeQuloS is deployed on hybrid infrastructures featuring a large variety of infrastructure combinations. In the context of the European Desktop Grid Initiative (EDGI), SpeQuloS is operated to improve the QoS of Desktop Grids using resources from private Clouds. We present a use case where SpeQuloS uses both EC2 regular and spot instances to decrease the cost of computation while preserving a similar QoS level. Finally, in the last scenario, SpeQuloS optimizes the utilization of Grid5000 resources.
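The tail-slowdown metric from the figure captions above (actual BoT completion time divided by the ideal time under a constant completion rate) can be sketched as follows; the cut-off fraction used to estimate the constant rate is an assumption of this sketch, not a value from the paper:

```python
def tail_slowdown(completion_times, frac=0.9):
    """Tail slowdown of a Bag-of-Tasks: completion time of the last task
    divided by the ideal completion time extrapolated from a constant
    completion rate. The rate is estimated from the first `frac` of
    completed tasks (frac=0.9 is an assumed cut, not from the paper)."""
    ts = sorted(completion_times)
    k = max(1, int(frac * len(ts)))
    ideal = ts[k - 1] / (k / len(ts))  # constant-rate extrapolation
    return ts[-1] / ideal

# Nine tasks finish at a steady rate, the last one straggles to t=100:
tail_slowdown([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # -> ~10x slowdown
```

A BoT whose completion rate really is constant yields a slowdown of 1; the last-percent stragglers on best-effort nodes are exactly what SpeQuloS offloads to the Cloud.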


Google hostload prediction based on Bayesian model with optimized feature combination

January 2014 · 94 Reads · 49 Citations · Journal of Parallel and Distributed Computing

We design a novel prediction method with a Bayes model to predict load fluctuation patterns over a long-term interval, in the context of Google data centers. We exploit a set of features that capture the expectation, trend, stability and patterns of recent host loads. We also investigate the correlations among these features and explore the most effective combinations of features with various training periods. All of the prediction methods are evaluated using a Google trace with 10,000+ heterogeneous hosts. Experiments show that our Bayes method improves long-term load prediction accuracy by 5.6%–50%, compared to other state-of-the-art methods based on moving averages, auto-regression, and/or noise filters. The mean squared error of pattern prediction with the Bayes method is approximately bounded within [10^-8, 10^-5]. Through a load-balancing scenario, we confirm that the precision of pattern prediction in finding a set of the idlest/busiest hosts from among 10,000+ hosts can be improved by about 7% on average.
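The features the abstract lists (expectation, trend, stability of a recent load window) can be computed cheaply per host. A sketch of one plausible definition of each; the exact feature definitions and the stability threshold here are illustrative, not the paper's:

```python
def hostload_features(window):
    """Features of a recent host-load window, in the spirit of the paper:
    expectation (mean load), trend (sign of last minus first sample), and
    stability (fraction of successive changes below a small threshold).
    These concrete definitions are assumptions of this sketch."""
    mean = sum(window) / len(window)
    trend = 1 if window[-1] > window[0] else (-1 if window[-1] < window[0] else 0)
    eps = 0.05  # assumed stability threshold on load change
    small = sum(1 for a, b in zip(window, window[1:]) if abs(b - a) < eps)
    stability = small / (len(window) - 1)
    return mean, trend, stability

# A slowly rising, mostly stable load window:
mean, trend, stability = hostload_features([0.2, 0.2, 0.3, 0.3])
```

Feature vectors like this one are what the Bayes classifier is trained on, with the paper searching over feature combinations and training-period lengths.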


Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism

November 2013 · 322 Reads · 62 Citations
In this paper, we aim at optimizing fault-tolerance techniques based on a checkpoint/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs under varied distributions of failure events. Our analysis is not only generic, with no assumption on the failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing with regard to various costs, such as checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and the Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is well suited to Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
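Young's formula, the baseline the paper compares against, gives the first-order optimal checkpoint interval under exponentially distributed failures; the paper's own formula drops that distributional assumption. A sketch of the baseline (the example parameter values are illustrative):

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order formula for the optimal checkpoint interval:
    T_opt = sqrt(2 * C * MTBF), where C is the checkpointing overhead in
    seconds and MTBF the mean time between failures. Assumes exponential
    failures -- the assumption the paper's generic formula removes."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# e.g., a 10 s checkpoint cost and a 20,000 s MTBF give roughly 632 s
# between checkpoints.
young_interval(10, 20000)
```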


Characterizing Cloud Applications on a Google Data Center

October 2013 · 150 Reads · 118 Citations

In this paper, we characterize Google applications, based on a one-month Google trace with over 650k jobs running across over 12,000 heterogeneous hosts from a Google data center. On one hand, we carefully compute valuable statistics about task events and resource utilization for Google applications, based on various types of resources (such as CPU and memory) and execution types (e.g., whether they can run batch tasks or not). Resource utilization per application closely follows the Pareto principle. On the other hand, we classify applications via a K-means clustering algorithm with an optimized number of sets, based on task events and resource usage. The number of applications in the K-means clustering sets follows a Pareto-like distribution. We believe our work is valuable for the further investigation of Cloud environments.
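The Pareto-principle observation (a small fraction of applications dominating resource usage) is easy to check on per-application totals. A quick sketch; the 20 % cut and the sample values are illustrative:

```python
def top_share(values, frac=0.2):
    """Fraction of total resource usage consumed by the top `frac` of
    applications -- a quick check of the Pareto ("80/20") pattern the
    paper observes in per-application utilization."""
    vals = sorted(values, reverse=True)
    k = max(1, int(frac * len(vals)))
    return sum(vals[:k]) / sum(vals)

# One application out of five accounts for 80 % of usage:
top_share([80, 5, 5, 5, 5])  # -> 0.8
```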


The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems

August 2013 · 76 Reads · 91 Citations · Journal of Parallel and Distributed Computing

With the increasing presence, scale, and complexity of distributed systems, resource failures are becoming an important and practical topic of computer science research. While numerous failure models and failure-aware algorithms exist, their comparison has been hampered by the lack of public failure data sets and data processing tools. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA)—an online, public repository of failure traces collected from diverse parallel and distributed systems. In this work, we first describe the design of the archive, in particular of the standard FTA data format, and the design of a toolbox that facilitates automated analysis of trace data sets. We also discuss the use of the FTA for various current and future purposes. Second, after applying the toolbox to nine failure traces collected from distributed systems used in various application domains (e.g., HPC, Internet operation, and various online applications), we present a comparative analysis of failures in various distributed systems. Our analysis presents various statistical insights and typical statistical modeling results for the availability of individual resources in various distributed systems. The analysis results underline the need for public availability of trace data from different distributed systems. Last, we show how different interpretations of the meaning of failure data can result in different conclusions for failure modeling and job scheduling in distributed systems. Our results for different interpretations show evidence that there may be a need for further revisiting existing failure-aware algorithms, when applied for general rather than for domain-specific distributed systems.
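The statistical insights the FTA toolbox produces start from per-resource availability intervals. A simplified sketch of computing MTBF, MTTR and availability from a trace given as alternating up/down durations (the tuple format here is a simplification, not the actual FTA data format):

```python
def availability_stats(intervals):
    """Per-resource availability statistics from a failure trace given as
    (state, duration_s) pairs with state in {"up", "down"} -- a simplified
    stand-in for the FTA format. Returns (mtbf_s, mttr_s, availability)."""
    up = [d for s, d in intervals if s == "up"]
    down = [d for s, d in intervals if s == "down"]
    mtbf = sum(up) / len(up)      # mean time between failures
    mttr = sum(down) / len(down)  # mean time to repair
    availability = sum(up) / (sum(up) + sum(down))
    return mtbf, mttr, availability

# A node that alternates ~1000 s of uptime with ~100 s outages:
trace = [("up", 900), ("down", 100), ("up", 1100), ("down", 100)]
mtbf, mttr, av = availability_stats(trace)
```

Note that, as the abstract's last point stresses, what counts as "down" (the interpretation of failure data) changes these numbers and hence any modeling conclusions drawn from them.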


Optimization and stabilization of composite service processing in a cloud system

June 2013 · 26 Reads · 3 Citations

Using virtual machines (VMs), we design a cloud system that optimizes overall performance when processing user requests made up of composite services. We make three contributions. (1) We optimize VM resource allocation with a minimized processing overhead, subject to each task's payment budget. (2) To maximize fairness of treatment in a competitive situation, we investigate the best-suited scheduling policy. (3) We devise a resource sharing scheme adjusted from the Proportional-Share model, further mitigating resource contention. Experiments confirm two points: (1) mean task response time approaches the theoretically optimal value in a non-competitive situation; (2) when the system is short of resources, each request is still processed efficiently relative to its ideal result. Combining the Lightest Workload First (LWF) policy with the Adjusted Proportional-Share Model (LWF+APSM) exhibits the best performance. It outperforms the others in a competitive situation by 38% w.r.t. worst-case response time and by 12% w.r.t. fairness of treatment.


Towards Payment-Bound Analysis in Cloud Systems with Task-Prediction Errors

June 2013 · 6 Reads · 1 Citation

In modern cloud systems, how to optimize user service level based on virtual resources customized on demand is a critical issue. In this paper, we comprehensively analyze the payment bound under a cloud model with virtual machines (VMs), taking into account that a task's workload may be predicted with errors. The analysis is based on an optimized resource allocation algorithm with polynomial time complexity. We theoretically derive the upper bound of task payment for a given margin of workload prediction error. We also extend the payment-minimization algorithm to adapt to dynamic changes of host availability over time, and perform the evaluation in a real cluster environment with 56 VMs deployed. Experiments confirm the correctness of our theoretical inference and show that our payment-minimization solution keeps 95% of user payments below 1.15 times the theoretical values of the ideal payment with hypothetically accurate information. The ratio for the remaining user payments is limited to about 1.5 in the worst case.


Citations (62)


... Approaches dealing with non functional properties in the volunteer computing context exist, mainly addressing reliability and availability aspects. For example, [4] faces the problem of availability prediction and availability guarantees of non-dedicated resources. Few of them also take into account other QoS and performance metrics. ...

Reference:

QoS Assessment of Mobile Crowdsensing Services
Modeling and Optimizing Availability of Non-Dedicated Resources
  • Citing Chapter
  • June 2012

... There has been a lot of research toward improving its performance and productivity. [29][30][31] The Berkeley Open Infrastructure for Network Computing (BOINC) 4 is a middleware system widely used for volunteer computing. Condor, 32 is a workload management system that can effectively harness wasted CPU power. ...

Preface to the special issue on volunteer computing and desktop grids
  • Citing Article
  • June 2012

Future Generation Computer Systems

... Recently Al-Azzoni and Kondo in their paper [8] have performed the Mean Value Analysis (MVA) to predict performance of multi-tier web applications running on multiple heterogeneous virtual machines over a public cloud (Amazon EC2). Their approach was shown to produce good performance predictions. ...

Cost-Aware Performance Modeling of Multi-tier Web Applications in the Cloud
  • Citing Conference Paper
  • April 2012

Communications in Computer and Information Science

... Such a resource management service is often implemented by combining cloud middleware and resource virtualisation technology, which is responsible for managing virtual machine (VM) instances based on per-user fashion (Lagar-Cavilla et al., 2011;Canali and Lancellotti, 2014). As VM instances are often deployed across distributed resources, which means that a lot of large VM state data (e.g., disk, RAM, vCPU) needs to be transferred (Di et al., 2015). Besides, as more and more data-intensive applications are deployed in cloud environments, a reliable and secure storage platform plays a key role to execute these applications (Shamsi et al., 2013;Song et al., 2013;Xiao and Han, 2014). ...

Optimization of Composite Cloud Service Processing with Virtual Machines
  • Citing Article
  • June 2015

IEEE Transactions on Computers

... All published studies related to this dataset so far [9] include data analysis. In [10], also the researchers focused on the analysis of data and concluded their conclusions regarding machine availability, as in [11], [12], jobs and tasks, as in [12], [1] [13] and resource usage, as in [5,[14][15][16]. In [17], the author proposed the DMMM framework to work on analyzing the workloads related to different types of setup and anticipate the consumption of resources in the cloud environment. ...

Characterization and Comparison of Google Cloud Load versus Grids
  • Citing Article
  • March 2012

... In [6], the authors proposed a model for calculating the optimal number of checkpoints for Cloud tasks using different probability distributions for failure events. Unlike works such as Young's work [7] or Daly's work [8], which assume an exponential distribution for the probability of failure, the proposed faulttolerant mechanism is not dependent on a specific failure distribution. ...

Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism

... Many studies have characterized the performance and resource utilization of HPC and cloud systems using monitoring data. Most studies have primarily focused on the CPU and memory utilization to characterize workloads [5,8,10,11,20,28,29,31]. For example, Peng et al. analyzed memory utilization on two HPC systems. ...

Characterization and Comparison of Cloud versus Grid Workloads

... Di et al. [29] developed a host load prediction model using a Bayesian approach for Google's compute cloud. This model uses Bayesian inference to predict future host loads, providing a probabilistic framework that accounts for uncertainty and variability in load patterns. ...

Host Load Prediction in a Google Compute Cloud with a Bayesian Model
  • Citing Conference Paper
  • November 2012