Ishai Menache

Microsoft, Redmond, Washington, United States

Publications (39) · 17.05 Total impact

  • ABSTRACT: Datacenter WAN traffic consists of high-priority transfers that have to be carried as soon as they arrive, alongside large transfers with pre-assigned deadlines on their completion (ranging from minutes to hours). The ability to offer guarantees to large transfers is crucial for business needs and impacts overall cost-of-business. State-of-the-art traffic engineering solutions only consider the current time epoch and hence cannot provide pre-facto promises for long-lived transfers. We present Tempus, an online traffic engineering scheme that exploits information on transfer size and deadlines to appropriately pack long-running transfers across network paths and time, thereby leaving enough capacity slack for future high-priority requests. Tempus builds on a tailored approximate solution to a mixed packing-covering linear program, which is parallelizable and scales well in both running time and memory usage. Consequently, Tempus is able to quickly and effectively update its solution when new transfers arrive or unexpected changes happen. These updates involve only small edits to existing transfers. Therefore, as experiments on traces from a large production WAN show, Tempus can offer and keep promises to long-lived transfers well in advance of their actual deadline; the promise on minimal transfer size is comparable with an offline optimal solution and outperforms state-of-the-art solutions by 2-3X.
    08/2014;
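The core idea behind Tempus, packing deadline-bound transfers into lightly loaded time slots while reserving slack for future high-priority traffic, can be illustrated with a toy single-link scheduler. This is a hedged sketch rather than Tempus itself (Tempus solves a mixed packing-covering linear program over paths and time); the function and parameter names are illustrative.

```python
def schedule_transfer(volume, deadline_slots, capacity, load, headroom=0.2):
    """Spread `volume` units over slots 0..deadline_slots-1 on one link.

    load[t] is the capacity already committed in slot t; at most
    (1 - headroom) * capacity of each slot may be used in total,
    keeping slack for future high-priority arrivals.
    Returns {slot: allocation} or None if the deadline is infeasible.
    """
    usable = (1.0 - headroom) * capacity
    remaining = volume
    alloc = {}
    # Fill the least-loaded slots first, mimicking temporal packing.
    for t in sorted(range(deadline_slots), key=lambda t: load[t]):
        if remaining <= 0:
            break
        take = min(max(0.0, usable - load[t]), remaining)
        if take > 0:
            alloc[t] = take
            remaining -= take
    return alloc if remaining <= 1e-9 else None

# 25 units due before slot 4 on a 10-unit/slot link, no reserved headroom:
plan = schedule_transfer(25, 4, 10, load=[2, 8, 0, 1], headroom=0.0)
```

The transfer lands mostly in the empty slots (slots 2 and 3, then 0), leaving the congested slot 1 untouched for higher-priority traffic.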
  • ABSTRACT: We found that interactive services at Bing have highly variable datacenter-side processing latencies because their processing consists of many sequential stages, parallelization across 10s-1000s of servers and aggregation of responses across the network. To improve the tail latency of such services, we use a few building blocks: reissuing laggards elsewhere in the cluster, new policies to return incomplete results and speeding up laggards by giving them more resources. Combining these building blocks to reduce the overall latency is non-trivial because for the same amount of resource (e.g., number of reissues), different stages improve their latency by different amounts. We present Kwiken, a framework that takes an end-to-end view of latency improvements and costs. It decomposes the problem of minimizing latency over a general processing DAG into a manageable optimization over individual stages. Through simulations with production traces, we show sizable gains; the 99th percentile of latency improves by over 50% when just 0.1% of the responses are allowed to have partial results and by over 40% for 25% of the services when just 5% extra resources are used for reissues.
    Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM; 09/2013
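One of the building blocks above, reissuing laggards and keeping the first response, can be demonstrated with a toy simulation. This illustrates the general technique only, not Kwiken's actual policy; the latency distribution below is invented.

```python
import random

random.seed(0)

def stage_latency():
    # Invented heavy-tailed distribution: usually fast, occasionally very slow.
    return random.expovariate(1.0) + (50.0 if random.random() < 0.02 else 0.0)

def request_latency(reissue):
    t = stage_latency()
    if reissue:
        # Duplicate the request and keep the faster of the two responses.
        t = min(t, stage_latency())
    return t

def p99(samples):
    return sorted(samples)[int(0.99 * len(samples))]

base = p99([request_latency(False) for _ in range(20000)])
dup = p99([request_latency(True) for _ in range(20000)])
```

The 99th percentile drops sharply because a reissued request is slow only when both copies hit the tail; the cost is extra load, which is why a framework like Kwiken budgets reissues across stages.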
  • Source
    ABSTRACT: Datacenter networks have been designed to tolerate failures of network equipment and provide sufficient bandwidth. In practice, however, failures and maintenance of networking and power equipment often make tens to thousands of servers unavailable, and network congestion can increase service latency. Unfortunately, there exists an inherent tradeoff between achieving high fault tolerance and reducing bandwidth usage in the network core; spreading servers across fault domains improves fault tolerance, but requires additional bandwidth, while deploying servers together reduces bandwidth usage, but also decreases fault tolerance. We present a detailed analysis of a large-scale Web application and its communication patterns. Based on that, we propose and evaluate a novel optimization framework that achieves both high fault tolerance and significantly reduces bandwidth usage in the network core by exploiting the skewness in the observed communication patterns.
    ACM SIGCOMM Computer Communication Review 09/2012; 42(4). · 0.91 Impact Factor
  • Source
    ABSTRACT: Virtualization can deliver significant benefits for cloud computing by enabling VM migration to improve utilization, balance load and alleviate hotspots. While several mechanisms exist to migrate VMs, few efforts have focused on optimizing migration policies in a multirooted tree datacenter network. The general problem has multiple facets, two of which map to generalizations of well-studied problems: (1) Migration of VMs in a bandwidth-oversubscribed tree network generalizes the maximum multicommodity flow problem in a tree, and (2) Migrations must meet load constraints at the servers, mapping to variants of the matching problem – generalized assignment and demand matching. While these problems have been individually studied, a new fundamental challenge is to simultaneously handle the packing constraints of server load and tree edge capacities. We give approximation algorithms for several versions of this problem, where the objective is to alleviate a maximal number of hot servers. We empirically demonstrate the effectiveness of these algorithms through large scale simulations on real data.
    01/2012;
  • Source
    ABSTRACT: We consider a market-based resource allocation model for batch jobs in cloud computing clusters. In our model, we incorporate the importance of the due date of a job rather than the number of servers allocated to it at any given time. Each batch job is characterized by the work volume of total computing units (e.g., CPU hours) along with a bound on the maximum degree of parallelism. Users specify, along with these job characteristics, their desired due date and a value for finishing the job by its deadline. Given this specification, the primary goal is to determine the scheduling of cloud computing instances under capacity constraints in order to maximize the social welfare (i.e., sum of values gained by allocated users). Our main result is a new ( C/(C-k) ⋅ s/(s-1))-approximation algorithm for this objective, where C denotes cloud capacity, k is the maximal bound on parallelized execution (in practical settings, k ≪ C) and s is the slackness on the job completion time, i.e., the minimal ratio between a specified deadline and the earliest finish time of a job. Our algorithm is based on utilizing dual fitting arguments over a strengthened linear program for the problem. Based on the new approximation algorithm, we construct truthful allocation and pricing mechanisms, in which reporting a job's true value and properties (deadline, work volume and the parallelism bound) is a dominant strategy for all users. To that end, we provide a general framework for transforming allocation algorithms into truthful mechanisms in single-value, multi-property domains. We then show that the basic mechanism can be extended under proper Bayesian assumptions to the objective of maximizing revenues, which is important for public clouds. We empirically evaluate the benefits of our approach through simulations on data-center job traces, and show that the revenues obtained under our mechanism are comparable with an ideal fixed-price mechanism, which sets an on-demand price using oracle knowledge of users' valuations. Finally, we discuss how our model can be extended to accommodate uncertainties in job work volumes, which is a practical challenge in cloud settings.
    01/2012;
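For concreteness, the approximation guarantee stated in the abstract can be evaluated numerically; the parameter values below are hypothetical.

```python
def approx_factor(C, k, s):
    """C/(C-k) * s/(s-1): C = cloud capacity, k = parallelism bound
    (k << C in practice), s = slackness (deadline / earliest finish time)."""
    assert C > k and s > 1.0, "the bound requires k < C and slack s > 1"
    return (C / (C - k)) * (s / (s - 1.0))

# e.g. 1000 servers, jobs parallelizable up to 10-way, slackness 3:
factor = approx_factor(1000, 10, 3.0)
```

With k ≪ C and generous slack, the factor stays close to s/(s-1); here it is about 1.52.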
  • Ishai Menache, Asuman Ozdaglar, Nahum Shimkin
    ABSTRACT: The cloud computing paradigm offers easily accessible computing resources of variable size and capabilities. We consider a cloud-computing facility that provides simultaneous service to a heterogeneous, time-varying population of users, each associated with a distinct job. Both the completion time, as well as the user's utility, may depend on the amount of computing resources applied to the job. In this paper, we focus on the objective of maximizing the long-term social surplus, which comprises the aggregate utility of executed jobs minus load-dependent operating expenses. Our problem formulation relies on basic notions of welfare economics, augmented by relevant queueing aspects. We first analyze the centralized setting, where an omniscient controller may regulate admission and resource allocation to each arriving job based on its individual type. Under appropriate convexity assumptions on the operating costs and individual utilities, we establish existence and uniqueness of the social optimum. We proceed to show that the social optimum may be induced by a single per-unit price, which charges a fixed amount per unit time and resource from all users.
    Proceedings of the 5th International ICST Conference on Performance Evaluation Methodologies and Tools; 05/2011
  • Source
    ABSTRACT: We consider scheduling over a wireless system, where the channel state information is not available a priori to the scheduler, but can be inferred from the past. Specifically, the wireless system is modeled as a network of parallel queues. We assume that the channel state of each queue evolves stochastically as an ON/OFF Markov chain. The scheduler, which is aware of the queue lengths but is oblivious of the channel states, has to choose one queue at a time for transmission. The scheduler has no information regarding the current channel states, but can estimate them by using the acknowledgment history. We first characterize the capacity region of the system using tools from Markov Decision Processes (MDP) theory. Specifically, we prove that the capacity region boundary is the uniform limit of a sequence of Linear Programming (LP) solutions. Next, we combine the LP solution with a queue length based scheduling mechanism that operates over long "frames," to obtain a throughput optimal policy for the system. By incorporating results from MDP theory within the Lyapunov-stability framework, we show that our frame-based policy stabilizes the system for all arrival rates that lie in the interior of the capacity region.
    INFOCOM, 2011 Proceedings IEEE; 05/2011
  • Source
    ABSTRACT: The popularity of Peer-to-Peer (P2P) file sharing has resulted in large flows between different ISPs, which imposes significant transit fees on the ISPs in whose domains the communicating peers are located. The fundamental tradeoff faced by a peer-swarm is between free, yet delayed content exchange between intra-domain peers, and inter-domain communication of content, which results in transit fees. This dilemma is complex, since peers who possess the content dynamically increase the content capacity of the ISP domain to which they belong. In this paper, we study the decision problem faced by peer swarms as a routing-in-time problem with time-varying capacity. We begin with a system of two swarms, each belonging to a different ISP: One swarm that has excess service capacity (a steady-state swarm) and one that does not (a transient swarm). We propose an asymptotically accurate fluid-approximation for the stochastic system, and explicitly obtain the optimal policy for the transient swarm in the fluid regime. We then consider the more complex case where multiple transient swarms compete for service from a single steady-state swarm. We utilize a proportional-fairness mechanism for allocating capacity between swarms, and study its performance as a non-cooperative game. We characterize the resulting Nash equilibrium, and study its efficiency both analytically and numerically. Our results indicate that while efficiency loss is incurred due to selfish decision-making, the actual Price of Anarchy (PoA) remains bounded even for a large number of competing swarms.
    01/2011;
  • Source
    ABSTRACT: We consider a dynamic spectrum access system in which Secondary Users (SUs) choose to either acquire dedicated spectrum or to use spectrum-holes (white spaces) which belong to Primary Users (PUs). The tradeoff incorporated in this decision is between immediate yet costly transmission and free but delayed transmission (a consequence of both the possible appearance of PUs and sharing the spectrum holes with multiple SUs). We first consider a system with a single PU band, in which the SU decisions are fixed. Employing queueing-theoretic methods, we obtain explicit expressions for the expected delays associated with using the PU band. Based on that, we then consider self-interested SUs and study the interaction between them as a noncooperative game. We prove the existence and uniqueness of a symmetric Nash equilibrium, and characterize the equilibrium behavior explicitly. Using our equilibrium results, we show how to maximize revenue from renting dedicated bands to SUs. Finally, we extend the scope to a scenario with multiple PUs, show that the band-pricing analysis can be applied to some special cases, and provide numerical examples.
    Proceedings of the 12th ACM Interational Symposium on Mobile Ad Hoc Networking and Computing, MobiHoc 2011, Paris, France, May 16-20, 2011; 01/2011
  • Source
    ABSTRACT: We introduce a novel pricing and resource allocation approach for batch jobs on cloud systems. In our economic model, users submit jobs with a value function that specifies willingness to pay as a function of job due dates. The cloud provider in response allocates a subset of these jobs, taking advantage of the flexibility of allocating resources to jobs in the cloud environment. Focusing on social-welfare as the system objective (especially relevant for private or in-house clouds), we construct a resource allocation algorithm which provides a small approximation factor that approaches 2 as the number of servers increases. An appealing property of our scheme is that jobs are allocated nonpreemptively, i.e., jobs run in one shot without interruption. This property has practical significance, as it avoids spending significant network and storage resources on checkpointing. Based on this algorithm, we then design an efficient truthful-in-expectation mechanism, which significantly improves the running complexity of black-box reduction mechanisms that can be applied to the problem, thereby facilitating its implementation in real systems.
    Algorithmic Game Theory, 4th International Symposium, SAGT 2011, Amalfi, Italy, October 17-19, 2011. Proceedings; 01/2011
  • Source
    Ishai Menache, Nahum Shimkin
    ABSTRACT: We consider an uplink wireless collision channel, shared by multiple mobile users. As part of the medium access protocol, channel reservation is carried out by using request-to-send (RTS) and clear-to-send (CTS) control packets. Consequently, collisions are reduced to the relatively short periods where mobiles request channel use. In our model, users are free to schedule their individual channel requests, while the objective of each user is to minimize its own power investment subject to a minimum-throughput demand. Our analysis reveals that for feasible throughput demands, there exist exactly two Nash equilibrium points in stationary strategies, with one being superior to the other uniformly for all users. We then show how this better equilibrium point can be obtained through a distributed mechanism. Finally, we discuss the optimal design of the reservation periods, while considering capacity, power and delay tradeoffs.
    Telecommunication Systems 01/2011; 47:95-108. · 1.03 Impact Factor
  • Ishai Menache, Asuman E. Ozdaglar
    ABSTRACT: Preliminary review / Publisher’s description: Traditional network optimization focuses on a single control objective in a network populated by obedient users and limited dispersion of information. However, most of today’s networks are large-scale, lack access to centralized information, consist of users with diverse requirements, and are subject to dynamic changes. These factors naturally motivate a new distributed control paradigm, where the network infrastructure is kept simple and the network control functions are delegated to individual agents that make their decisions independently (“selfishly”). The interaction of multiple independent decision-makers necessitates the use of game theory, including economic notions related to markets and incentives. This monograph studies game theoretic models of resource allocation among selfish agents in networks. The first part of the monograph introduces fundamental game theoretic topics. Emphasis is given to the analysis of dynamics in game theoretic situations, which is crucial for design and control of networked systems. The second part of the monograph applies the game theoretic tools for the analysis of resource allocation in communication networks. We set up a general model of routing in wireline networks, emphasizing the congestion problems caused by delay and packet loss. In particular, we develop a systematic approach to characterizing the inefficiencies of network equilibria, and highlight the effect of autonomous service providers on network performance. We then turn to examining distributed power control in wireless networks. We show that the resulting Nash equilibria can be efficient if the degree of freedom given to end-users is properly designed. Table of Contents: Static Games and Solution Concepts / Game Theory Dynamics / Wireline Network Games / Wireless Network Games / Future Perspectives
    01/2011; Morgan & Claypool Publishers.
  • Source
    Niv Buchbinder, Navendu Jain, Ishai Menache
    ABSTRACT: Energy costs are becoming the fastest-growing element in datacenter operation costs. One basic approach to reduce these costs is to exploit the spatiotemporal variation in electricity prices by moving computation to datacenters in which energy is available at a cheaper price. However, injudicious job migration between datacenters might increase the overall operation cost due to the bandwidth costs of transferring application state and data over the wide-area network. To address this challenge, we propose novel online algorithms for migrating batch jobs between datacenters, which handle the fundamental tradeoff between energy and bandwidth costs. A distinctive feature of our algorithms is that they consider not only the current availability and cost of (possibly multiple) energy sources, but also the future variability and uncertainty thereof. Using the framework of competitive analysis, we establish worst-case performance bounds for our basic online algorithm. We then propose a practical, easy-to-implement version of the basic algorithm, and evaluate it through simulations on real electricity pricing and job workload data. The simulation results indicate that our algorithm outperforms plausible greedy algorithms that ignore future outcomes. Notably, the actual performance of our approach is significantly better than the theoretical guarantees, within 6% of the optimal offline solution.
    NETWORKING 2011 - 10th International IFIP TC 6 Networking Conference, Valencia, Spain, May 9-13, 2011, Proceedings, Part I; 01/2011
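The energy-versus-bandwidth tradeoff has the flavor of the classic ski-rental problem. The sketch below is a standard break-even rule in that spirit, not the paper's algorithm, and the prices are invented: stay put until the accumulated energy overpayment equals the one-time migration (bandwidth) cost, then migrate.

```python
def online_cost(price_a, price_b, migration_cost):
    """Run a job at datacenter A; migrate to B once the accumulated
    per-step price gap reaches the migration cost (break-even rule)."""
    cost, excess, migrated = 0.0, 0.0, False
    for pa, pb in zip(price_a, price_b):
        if migrated:
            cost += pb
            continue
        cost += pa
        excess += max(0.0, pa - pb)
        if excess >= migration_cost:
            # Accumulated overpayment equals the migration cost: move now.
            cost += migration_cost
            migrated = True
    return cost

# B is 1 unit/step cheaper for 20 steps; migrating costs 5:
cost = online_cost([3.0] * 20, [2.0] * 20, migration_cost=5.0)
```

Here the rule pays 50 versus an offline optimum of 45 (migrate immediately: 5 + 20 × 2), within the factor-2 worst-case guarantee such break-even rules enjoy in the simplest settings.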
  • Source
    ABSTRACT: Summary form only given. Potential games are a class of games for which many simple user dynamics, such as best response dynamics and fictitious play, converge to a Nash equilibrium. The objective of this paper is to examine whether the convergence properties of dynamics in potential games can be extended to games that are "close" to potential games, thereby enhancing the predictability of user interactions. Intuitively, dynamics in potential games and dynamics in games that are close (in terms of the payoffs of the players) to potential games should be related. In this paper, we make this intuition precise by establishing convergence of dynamics to "approximate" equilibrium sets in near-potential games.
    Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on; 11/2010
  • Source
    ABSTRACT: In this paper we introduce a novel flow representation for finite games in strategic form. This representation allows us to develop a canonical direct sum decomposition of an arbitrary game into three components, which we refer to as the potential, harmonic and nonstrategic components. We analyze natural classes of games that are induced by this decomposition, and in particular, focus on games with no harmonic component and games with no potential component. We show that the first class corresponds to the well-known potential games. We refer to the second class of games as harmonic games, and study the structural and equilibrium properties of this new class of games. Intuitively, the potential component of a game captures interactions that can equivalently be represented as a common interest game, while the harmonic part represents the conflicts between the interests of the players. We make this intuition precise, by studying the properties of these two classes, and show that indeed they have quite distinct and remarkable characteristics. For instance, while finite potential games always have pure Nash equilibria, harmonic games generically never do. Moreover, we show that the nonstrategic component does not affect the equilibria of a game, but plays a fundamental role in their efficiency properties, thus decoupling the location of equilibria and their payoff-related properties. Exploiting the properties of the decomposition framework, we obtain explicit expressions for the projections of games onto the subspaces of potential and harmonic games. This enables an extension of the properties of potential and harmonic games to "nearby" games. We exemplify this point by showing that the set of approximate equilibria of an arbitrary game can be characterized through the equilibria of its projection onto the set of potential games.
    Mathematics of Operations Research 05/2010; · 0.90 Impact Factor
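The potential/harmonic dichotomy can be seen in the smallest possible setting. For a 2x2 game, exactness of a potential reduces to the condition that payoff changes sum to zero around the single 4-cycle of unilateral deviations (the Monderer-Shapley cycle condition). The check below is an illustration of that dichotomy, not the paper's projection machinery.

```python
def is_exact_potential_2x2(U1, U2):
    """U1[i][j], U2[i][j]: payoffs when player 1 plays i and player 2 plays j.
    A 2x2 game admits an exact potential iff the deviation gains around the
    cycle (0,0) -> (1,0) -> (1,1) -> (0,1) -> (0,0) sum to zero."""
    return ((U1[1][0] - U1[0][0]) + (U2[1][1] - U2[1][0])
            + (U1[0][1] - U1[1][1]) + (U2[0][0] - U2[0][1])) == 0

# Prisoner's dilemma: a potential game, hence it has a pure Nash equilibrium.
pd = is_exact_potential_2x2([[3, 0], [5, 1]], [[3, 5], [0, 1]])
# Matching pennies: pure conflict ("harmonic"), no pure equilibrium exists.
mp = is_exact_potential_2x2([[1, -1], [-1, 1]], [[-1, 1], [1, -1]])
```

This matches the abstract's claim: the potential component captures common interest (pure equilibria exist), while purely harmonic games like matching pennies generically have none.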
  • Source
    ABSTRACT: We consider the power control problem in a time-slotted wireless channel, shared by a finite number of mobiles that transmit to a common base station. The channel between each mobile and the base station is time varying, and the system objective is to maximize the overall data throughput. It is assumed that each transmitter has a limited power budget, to be sequentially divided during the lifetime of the battery. We deviate from the classic work in this area, by considering a realistic scenario where the channel quality of each mobile changes arbitrarily from one transmission to the other. Assuming first that each mobile is aware of the channel quality of all other mobiles, we propose an online power-allocation algorithm, and prove its optimality under mild assumptions. We then indicate how to implement the algorithm when only local state information is available, requiring minimal communication overhead. Notably, the competitive ratio of our algorithm (nearly) matches the one we previously obtained for the (much simpler) single-transmitter case [BLMNO09], albeit requiring significantly different algorithmic solutions.
    INFOCOM, 2010 Proceedings IEEE; 04/2010
  • Source
    ABSTRACT: We study power control in a multi-cell CDMA wireless system whereby self-interested users share a common spectrum and interfere with each other. Our objective is to design a power control scheme that achieves a (near) optimal power allocation with respect to any predetermined network objective (such as the maximization of sum-rate, or some fairness criterion). To obtain this, we introduce the potential-game approach that relies on approximating the underlying noncooperative game with a "close" potential game, for which prices that induce an optimal power allocation can be derived. We use the proximity of the original game with the approximate game to establish through Lyapunov-based analysis that natural user-update schemes (applied to the original game) converge within a neighborhood of the desired operating point, thereby inducing near-optimal performance in a dynamical sense. Additionally, we demonstrate through simulations that the actual performance can in practice be very close to optimal, even when the approximation is inaccurate. As a concrete example, we focus on the sum-rate objective, and evaluate our approach both theoretically and empirically.
    INFOCOM 2010. 29th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 15-19 March 2010, San Diego, CA, USA; 01/2010
  • Source
    ABSTRACT: We consider an uplink power control problem where each mobile wishes to maximize its throughput (which depends on the transmission powers of all mobiles) but has a constraint on the average power consumption. A finite number of power levels are available to each mobile. The decision of a mobile to select a particular power level may depend on its channel state. We consider two frameworks concerning the state information of the channels of other mobiles: i) the case of full state information and ii) the case of local state information. In each of the two frameworks, we consider both cooperative as well as non-cooperative power control. We manage to characterize the structure of equilibria policies and, more generally, of best-response policies in the non-cooperative case. We present an algorithm to compute equilibria policies in the case of two non-cooperative players. Finally, we study the case where a malicious mobile, which also has average power constraints, tries to jam the communication of another mobile. Our results are illustrated and validated through various numerical examples.
    IEEE Transactions on Automatic Control 11/2009; · 2.72 Impact Factor
  • Source
    E. Altman, T. Basar, I. Menache, H. Tembine
    ABSTRACT: We study a dynamic random access game with a finite number of opportunities for transmission and with energy constraints. We provide sufficient conditions for feasible strategies and for existence of Nash-Pareto solutions and show that finding Nash-Pareto policies of the dynamic random access game is equivalent to partitioning the set of time slot opportunities with constraints into a set of terminals. We further derive upper bounds for pure Nash-Pareto policies, and extend the study to non-integer energy constraints and unknown termination time, where time division multiplexing policies can be suboptimal. We show that the dynamic random access game has several strong equilibria (resilient to coalition of any size), and we compute them explicitly. We introduce the (strong) price of anarchy concept to measure the gap between the payoff under strong equilibria and the social optimum.
    Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks, 2009. WiOPT 2009. 7th International Symposium on; 07/2009
  • Source
    ABSTRACT: We consider a wireless collision channel, shared by a finite number of mobile users who transmit to a common base station. Each user wishes to optimize its individual network utility that incorporates a natural tradeoff between throughput and power. The channel quality of every user is affected by global and time-varying conditions at the base station, which are manifested to all users in the form of a common channel state. Assuming that all users employ stationary, state-dependent transmission strategies, we investigate the properties of the Nash equilibrium of the resulting game between users. While the equilibrium performance can be arbitrarily bad (in terms of aggregate utility), we bound the efficiency loss at the best equilibrium as a function of a technology-related parameter. Under further assumptions, we show that sequential best-response dynamics converge to an equilibrium point in finite time, and discuss how to exploit this property for better network usage.
    Game Theory for Networks, 2009. GameNets '09. International Conference on; 06/2009
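The finite-time convergence of sequential best-response dynamics claimed above is easy to visualize in a miniature example: a generic 2x2 common-interest game (which is a potential game), not the paper's collision-channel model.

```python
# Payoffs of a 2x2 common-interest game; both players prefer to
# coordinate, and coordinating on action 0 pays more.
U = [[2, 0], [0, 1]]

def best_response(opponent_action):
    # U is symmetric and shared, so both players' best responses coincide.
    return max((0, 1), key=lambda a: U[a][opponent_action])

a1, a2 = 1, 0  # start from a miscoordinated profile
history = [(a1, a2)]
while True:
    a1 = best_response(a2)   # player 1 moves first
    a2 = best_response(a1)   # then player 2 responds
    if (a1, a2) == history[-1]:
        break                # no player wants to deviate: a pure NE
    history.append((a1, a2))
profile = (a1, a2)
```

Sequential best responses strictly increase the common payoff (the potential), so the dynamics cannot cycle and must stop at a pure equilibrium, here (0, 0), after finitely many rounds.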