Conference Paper

Service Chain Composition with Failures in NFV Systems: A Game-Theoretic Perspective

... In [12] an integration-based method is proposed to estimate and pool the weighted priors; it utilizes Bayes' theorem to integrate heterogeneous priors. In addition, in the service composition setting, where nodes are traversed in a given order to provide a service, several related techniques have been applied to solve the resulting reliability issues (e.g., Markov chains [13], Stochastic Reward Networks [14], game theory [15], and Universal Generating Functions [16]). In the case of a complex system comprising a large number of subsystems and components in the field of reliability engineering, prior knowledge of system reliability can provide information at both the subsystem and system levels. ...
... Bian, X. Huang, Z. Shao, X. Gao and Y. Yang [15], 2015 ...
... Therefore, during reliability assessment of web services, substituting the distribution function f(x|θ) of the number of failures and the prior probability density function π(θ) of the failure probabilities into (15), the risk function of the decision function δ(x) can be obtained as follows. ...
Article
Full-text available
Web service composition is the process of combining and reusing existing web services to create new business processes that satisfy specific user requirements. Reliability plays an important role in ensuring the quality of web service composition. However, owing to the flexibility and complexity of such architectures, sufficient estimation of reliability is difficult. In this paper, the authors propose a method to estimate the reliability of web service compositions based on Bayes reliability assessment, treating it as a decision-making problem. This improves the testing efficiency and accuracy of such methods. To this end, the authors focus on fully utilizing prior information of web services to increase the accuracy of prior distributions, and construct a Markov model in terms of the reliabilities of the web composition and each web service to integrate the limited test data. The authors further propose a method of minimum risk (MMR) to calculate the initial values of hyperparameters satisfying the constraint of minimal risk of a wrong decision. Experiments demonstrate that, compared with the Bayesian Monte Carlo method (BMCM) and the expert scoring method (ESM), the proposed method efficiently utilizes prior module-level failure information: as the number of failures increases from 0 to 5, it reduces the required number of test cases by 19.8%–28.9% and 6.1%–14.1%, respectively, improving the reliability assessment of web service compositions and reducing the expenses incurred by system-level reliability testing and demonstration.
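The Bayesian updating step at the heart of such reliability assessment can be sketched with a conjugate Beta-Binomial model. This is a minimal illustration of the idea, not the paper's MMR procedure; the hyperparameters and test counts below are hypothetical.

```python
# Hedged sketch of a conjugate Bayes reliability update (Beta-Binomial);
# illustrates the general idea, not the paper's exact MMR method.
def posterior_reliability(a, b, tests, failures):
    """Update a Beta(a, b) prior on the failure probability with
    `failures` observed in `tests` test cases; return the posterior
    mean reliability, i.e., 1 - E[failure probability]."""
    a_post = a + failures            # pseudo-failures + observed failures
    b_post = b + (tests - failures)  # pseudo-successes + observed successes
    return 1.0 - a_post / (a_post + b_post)

# Hypothetical hyperparameters (the kind of values an MMR-style method
# would initialize) and test data: a weak Beta(1, 9) prior, 20 tests, 1 failure.
r = posterior_reliability(1, 9, 20, 1)
```

The conjugacy is what makes prior module-level failure information cheap to integrate: each additional test case updates the two hyperparameters in constant time.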
... It can flexibly meet the special demands of IoT business and the changing network conditions. However, with a large number of IoT applications, how to embed multiple SFC requirements into an NFV-enabled IoT network has become a pivotal challenge [10]. ...
... However, this solution still requires a lot of computing and communication resources. In [10], Bian et al. considered the situation of both user and resource failures and designed a distributed and low-complexity algorithm that can reduce delay and congestion. ...
Article
Full-text available
Nowadays, the compelling applications of the Internet of Things (IoT) bring unexpected economic benefits to our daily life. However, with the explosion of IoT devices and various applications, it poses a huge challenge for service providers with rigid networks. Recently, the network functions virtualization (NFV) enabled network is considered a promising solution to these problems. In an NFV-enabled architecture, network services are implemented as an ordered set of virtual network functions (VNFs) that rely on the business logic, named service function chains (SFCs). However, with the explosion of IoT applications, how to embed multiple SFCs in an NFV-enabled network becomes a challenging problem. Traditional centralized solutions suffer from scalability and privacy issues, while distributed algorithms suffer from serious non-convergence problems. In this paper, we propose a hybrid intelligent control architecture that adopts the centralized-training, distributed-execution paradigm to deal with this problem. Considering the competitive behavior of users and the limitation of network resources, we formulate the problem as a multiuser competition game model. Based on this framework, we propose an actor-critic-based multi-agent reinforcement learning algorithm for SFC deployment.
... The main challenge of providing a resilient service in a VCAV system is to optimize the placement and routing of VNFs in response to a failure in an NFV infrastructure (NFVI). Previous studies have considered various techniques to address several aspects of the resilient service problem [4][5][6][7][8][9]. However, none of these approaches can be applied to a VCAV system due to the high dynamics of VCAV data traffic. ...
... Equation (7) guarantees that node v supplies demand d with u_{di} if and only if u_{d(i−1)} is fulfilled by either node v or a preceding node on demand d's path. Equation (8) guarantees that y^{σ2}_{vdi} = 1 if and only if u_{di} is delivered by a node between s_d and v that belongs to the path realizing demand d. Note that the right-hand side of Equation (8) sums y^{σ2}_{edi} because node v may have several incoming links. Equations (9) and (10) ensure that ȳ^{σ2}_{edi} = 1 if and only if link e belongs to the path realizing demand d and VNF u_{di} is deployed at i_e or a preceding node on demand d's path. ...
Article
Full-text available
The massive amount of data generated daily by various sensors equipped with connected autonomous vehicles (CAVs) can lead to a significant performance issue of data processing and transfer. Network Function Virtualization (NFV) is a promising approach to improving the performance of a CAV system. In an NFV framework, Virtual Network Function (VNF) instances can be placed in edge and cloud servers and connected together to enable a flexible CAV service with low latency. However, protecting a service function chain composed of several VNFs from a failure is challenging in an NFV-based CAV system (VCAV). We propose an integer linear programming (ILP) model and two approximation algorithms for resilient services to minimize the service disruption cost in a VCAV system when a failure occurs. The ILP model, referred to as TERO, allows us to obtain the optimal solution for traffic engineering, including the VNF placement and routing for resilient services with regard to dynamic routing. Our proposed algorithms based on heuristics (i.e., TERH) and reinforcement learning (i.e., TERA) provide an approximate solution for resilient services in a large-scale VCAV system. Evaluation results with real datasets and generated network topologies show that TERH and TERA can provide a solution close to the optimal result. The results also suggest that TERA should be used in a highly dynamic VCAV system.
... SFC provides a flexible and economical alternative for IoT application service providers to replace today's rigid network environment. However, due to a large number of IoT applications, how to embed multiple SFC requirements into an NFV-enabled IoT infrastructure has become a pivotal challenge [7]. ...
... However, this solution still requires a lot of computing and communication resources. In [7], Bian et al. considered the situation of both user and resource failures and designed a distributed and low-complexity algorithm that can reduce delay and congestion. ...
Preprint
Nowadays, the compelling applications of the Internet of Things (IoT) bring unexpected economic benefits to our daily life. However, with the explosion of IoT devices and various applications, it poses a huge challenge for service providers with rigid networks. Recently, the network functions virtualization (NFV) enabled network is considered a promising solution to these problems. NFV abstracts network functions from dedicated hardware and deploys them on virtual servers, thereby reducing costs and accelerating service deployment for network operators. In an NFV-enabled architecture, network services are implemented as an ordered set of virtual network functions (VNFs) that rely on the business logic, named Service Function Chains (SFCs). However, with the explosion of IoT applications, how to embed multiple SFCs in an NFV-enabled network becomes a challenging problem. Traditional centralized solutions suffer from scalability and privacy issues. Therefore, in this paper, we design a distributed reinforcement learning-based SFC deployment policy. We introduce a centralized training and distributed implementation framework in an IoT network. A centralized platform simplifies the learning process with global network information, whilst agents can make decisions based on their local observations in a distributed environment. Based on this framework, we propose an actor-critic-based multi-agent reinforcement learning algorithm for SFC deployment.
... The goal of the partition game is that the VNFs can be placed in appropriate cloud sites, while minimizing deployment cost. Bian et al. [107] propose a distributed and low-complexity algorithm that is inspired by game theory, where the players are the users who behave selfishly until they reach a NE. Each user subscribes to a specific network service which is denoted by an ordered VNFs chain. ...
... The utility function is the sum of the latency and congestion. The innovation of [107] is to balance latency and congestion by considering failures due to user/resource unavailability in the model. Chen et al. [108] present a mixed strategy non-cooperative game, where servers compete for the optimal VNFs placement strategies and distribution due to revenue and QoS incentives. ...
Article
Full-text available
Network Function Virtualization (NFV) has been emerging as an appealing solution that transforms complex network functions from dedicated hardware implementations to software instances running in a virtualized environment. In this survey, we provide an overview of recent advances of resource allocation in NFV. We generalize and analyze four representative resource allocation problems, namely, (1) the VNF Placement and Traffic Routing problem, (2) the VNF Placement problem, (3) the Traffic Routing problem in NFV, and (4) the VNF Redeployment and Consolidation problem. After that, we study the delay calculation models and VNF protection (availability) models in NFV resource allocation, which are two important Quality of Service (QoS) parameters. Subsequently, we classify and summarize the representative work for solving the VPTR problem and the VRC problem by considering various QoS parameters (e.g., cost, delay, reliability and energy) and different scenarios (e.g., edge cloud, online provisioning and distributed provisioning). Finally, we conclude our survey with a short discussion on the state of the art of the literature and emerging topics in the related field, and highlight areas where we expect high potential for future research.
Article
Compute node failures are becoming a normal event for many long-running and scalable MPI applications. Keeping within the MPI standards and applying some of the methods developed so far in terms of fault tolerance, we developed a methodology that allows applications to tolerate failures through the creation of semi-coordinated checkpoints within the RADIC architecture. To do this, we developed the ULSC²-RADIC middleware, which divides the application into independent MPI worlds, where each MPI world corresponds to a compute node, and makes use of the DMTCP checkpoint library in a semi-coordinated environment. We present experimental results using scientific applications and the NAS Parallel Benchmarks to assess the overhead, as well as the functionality in case of a node failure. We evaluated the computational cost of the semi-coordinated checkpoints compared with coordinated checkpoints.
... So far, there has been little discussion about the design of a robust service composition scheme for NFV [23,24,25,26,27]. Marotta and Kassler propose a novel mathematical model for the problem of robust VNF placement in which they focus on minimizing the power consumption and protecting a virtualized Evolved Packet Core from severe deviations of the input parameters [23]. ...
... However, their solution does not provide a mechanism for routing-path calculation in the controller. In [27], Bian et al. study the problem of robust service composition as a non-cooperative game where a strategy ... V is a set of nodes V = V1 ∪ V2, where V1 is a set of nodes in the NFV infrastructure (NFVI) (i.e., NFVI nodes) and V2 is a set of end nodes; E is a set of links E = E1 ∪ E2, where E1 is a set of NFVI links among NFVI nodes and E2 is a set of links between an NFVI node and an end node. ...
Article
Full-text available
Fault tolerance is critical for constructing a reliable service in Network Functions Virtualization (NFV). In this paper, we propose novel models and algorithms that provide the resilience of NFV services from multiple node and link failures. We first design an optimization model and the PAR protection algorithm that can efficiently protect an NFV service demand from network failures without any action from a controller due to the diversity of flow assignment. We then develop an optimization model for total demand protection with a guarantee of recovering the whole demand volume. Further, a new restoration algorithm, namely UNIT, is proposed for the design of large survivable NFV-based networks with the recovery of the affected bandwidth under the uncertainty of multiple network failures. We analytically prove the performance guarantee of UNIT in comparison with the optimal static solution. The results of our experimental study in a Mininet-based environment with the Ryu controller show that a combination of PAR and UNIT efficiently protects NFV-based networks from failures in terms of both resource restoration and recovery time.
... In addition, with regard to the unreliability of data transmission or service caused by network failures, some strategies related to the redundancy of network components (e.g., virtual machines, router, etc.) [37][38][39] have been proposed to protect network components from network failures. ...
Article
Full-text available
Reliable multicast distribution is essential for some applications such as Internet of Things (IoT) alarm information and important file distribution. Traditional IP reliable multicast usually relies on multicast source retransmission for recovery losses, causing huge recovery delay and redundancy. Moreover, feedback implosion tends to occur towards multicast source as the number of receivers grows. Information-Centric Networking (ICN) is an emerging network architecture that is efficient in content distribution by supporting multicast and in-network caching. Although ubiquitous in-network caching provides nearby retransmission, the design of cache strategy greatly affects the performance of loss recovery. Therefore, how to recover losses efficiently and quickly is an urgent problem to be solved in ICN reliable multicast. In this paper, we first propose an overview architecture of ICN-based reliable multicast and formulate a problem using recovery delay as the optimization target. Based on the architecture, we present a Congestion-Aware Probabilistic Cache (CAPC) strategy to reduce recovery delay by caching recently transmitted chunks during multicast transmission. Then, we propose NACK feedback aggregation and recovery isolation scheme to decrease recovery overhead. Finally, experimental results show that our proposal can achieve fully reliable multicast and outperforms other approaches in recovery delay, cache hit ratio, transmission completion time, and overhead.
Article
Service Function Chaining (SFC) is a trending paradigm. It has attracted significant attention from both industry and academia because of its potential to significantly improve dynamicity and flexibility in service chain provisioning, making it easier and more convenient to compose on-demand service chains tailored to application-specific needs. In addition to SFC, Network Functions Virtualization (NFV) and Software-Defined Networking (SDN) are two technology enablers driving software-based service chain solutions. In particular, SFC leverages NFV for flexible deployment and placement of virtual resources and Virtual Network Functions (VNFs), and employs SDN to provide traffic steering and network connectivity between the deployed VNF instances to form an application-specific service chain. Although SFC brings many promising advantages, security is an important concern and a potential barrier to the broad adoption of SFC technology. Because SFC relies on NFV and SDN, the integration of these technologies introduces a wide variety of security risks at different levels of the SFC stack, resulting in a greater attack surface. Therefore, this survey is intended to conduct a comprehensive analysis of SFC from a security perspective. To gain a clear understanding of SFC, we first examine its architecture in detail, including the design principles and the relationships among other functional components. The significant enhancement brought by the adoption of SFC is highlighted. We also exemplify its deployment in several realistic use cases. Second, based on the SFC layering model, we analyze security threats with the purpose of identifying all possible risk exposures, and establish a layer-specific threat taxonomy accordingly. Third, using this established threat taxonomy, we systematically analyze the existing defensive solutions and propose a set of security recommendations to secure an SFC-enabled domain. Our goal is to help network operators deploy cost-effective security hardening based on their particular needs. Finally, several open research challenges and future directions for SFC are discussed.
Article
Satellite edge computing has become a promising way to provide computing services for Internet of Things (IoT) users in remote areas that are out of the coverage of terrestrial networks. Nevertheless, it is not suitable for large-scale IoT users due to the resource limitations of satellites. Cloud computing can provide sufficient available resources for IoT users, but it cannot support delay-sensitive services because of its high network latency. Satellite edge clouds can facilitate flexible service provisioning for numerous IoT users by incorporating the advantages of edge computing and cloud computing. In this paper, we investigate the dynamic resource allocation problem for virtual network function (VNF) placement in satellite edge clouds. The aim is to jointly minimize the network bandwidth cost and the service end-to-end delay. We formulate the VNF placement problem as an integer non-linear programming problem and then propose a distributed VNF placement (D-VNFP) algorithm to address it. Experiments are conducted to evaluate the performance of the proposed D-VNFP algorithm, with Viterbi and Game theory as the baseline algorithms. The results show that the proposed D-VNFP algorithm is effective and efficient for solving the VNF placement problem in satellite edge clouds.
Article
The Softwarization of networks is enabled by the SDN (Software-Defined Networking), NV (Network Virtualization), and NFV (Network Function Virtualization) paradigms, and offers many advantages for network operators, service providers and data-center providers. Given the strong interest in both industry and academia in the softwarization of telecommunication networks and cloud computing infrastructures, a series of special issues was established in IEEE Transactions on Network and Service Management, which aims at the timely publication of recent innovative research results on the management of softwarized networks.
Article
Through network function virtualization (NFV), virtual network functions (VNFs) can be mapped onto substrate networks as service function chains (SFC) to provide customized services with guaranteed quality-of-service (QoS). In this paper, we solve a multi-SFC embedding problem by a game-theoretical approach considering the heterogeneity of NFV nodes, the effect of processing-resource sharing among various VNFs, and the capacity constraints of NFV nodes. Specifically, each SFC is treated as a player whose objective is to minimize the overall latency experienced by the supported service flow, while satisfying the capacity constraints of all NFV nodes. Due to processing-resource sharing, additional delay is incurred and incorporated into the overall latency for each SFC. The capacity constraints of NFV nodes are considered by adding a penalty term into the cost function of each player, and are guaranteed by a prioritized admission control mechanism. We prove that the formulated resource constrained multi-SFC embedding game (RC-MSEG) is an exact potential game admitting at least one pure Nash equilibrium (NE) and has the finite improvement property (FIP). Two iterative algorithms are developed, namely, the best response (BR) algorithm with fast convergence and the spatial adaptive play (SAP) algorithm with great potential to obtain the best NE. Simulations are conducted to demonstrate the effectiveness of the proposed game-theoretical approach.
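The best response (BR) dynamics used in such potential games can be sketched on a toy singleton congestion game, which is an exact potential game and therefore guaranteed to converge. This is a simplified illustration, not the RC-MSEG algorithms: the players, resources, and linear cost model below are hypothetical.

```python
# Hedged sketch of best-response (BR) dynamics in a simple singleton
# congestion game; being an exact potential game, BR provably converges
# to a pure Nash equilibrium in finitely many steps.
def best_response_dynamics(n_players, resource_weights, max_rounds=100):
    choice = [0] * n_players  # every player starts on resource 0
    for _ in range(max_rounds):
        changed = False
        for p in range(n_players):
            load = [choice.count(r) for r in range(len(resource_weights))]

            def cost(r):
                # cost of resource r after p (hypothetically) moves there:
                # linear in the resulting load, scaled by the resource weight
                l = load[r] + (0 if choice[p] == r else 1)
                return resource_weights[r] * l

            best = min(range(len(resource_weights)), key=cost)
            if cost(best) < cost(choice[p]):  # strict improvement only
                choice[p] = best
                changed = True
        if not changed:  # no player can improve: pure Nash equilibrium
            return choice
    return choice

eq = best_response_dynamics(4, [1.0, 1.0])
```

With four players and two identical resources, the dynamics settle into a balanced 2-2 split, which is a pure NE of this toy game.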
Article
For systems that are based on network function virtualization (NFV), it remains a key challenge to conduct effective service chain composition with the lowest request latency and the minimum network congestion. In such an NFV system, users are usually non-cooperative, i.e., they compete with each other to optimize their own benefits. However, existing solutions often ignore such non-cooperative behaviors of users. What is more, they may fall short in the face of unexpected resource failures such as breakdown of virtual machines and loss of connections to users. In this article, we formulate the service chain composition problem with resource failures in NFV systems as a non-cooperative game, and show that such a game is a weighted potential game, aiming to search for the optimal Nash equilibrium (NE). By adopting Markov approximation techniques, we devise a distributed scheme called MH-SCCA, which achieves a provably near-optimal NE and adapts to resource failures in a timely manner. For comparison, we also propose two baseline schemes (DRL-SCCA and MCTS-SCCA) for centralized service chain composition that are based on deep reinforcement learning (DRL) and Monte Carlo tree search (MCTS) techniques, respectively. Our simulation results demonstrate the effectiveness of the three proposed schemes in terms of both latency reduction and congestion mitigation, as well as the adaptivity of MH-SCCA when faced with resource failures.
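The Markov approximation technique behind schemes like MH-SCCA designs a Markov chain over configurations whose stationary distribution is a Gibbs distribution that concentrates on low-cost configurations. A minimal sketch of that stationary distribution (the configuration costs and the parameter beta below are hypothetical, not values from the paper):

```python
import math

# Hedged sketch of the Gibbs (Boltzmann) distribution targeted by
# Markov-approximation schemes; not the MH-SCCA algorithm itself.
def gibbs_weights(costs, beta):
    """Stationary probability p_f proportional to exp(-beta * c_f).
    As beta grows, the probability mass concentrates on the
    minimum-cost configuration, which is how sampling from the chain
    approximates combinatorial minimization."""
    w = [math.exp(-beta * c) for c in costs]
    total = sum(w)
    return [x / total for x in w]

costs = [3.0, 1.0, 2.0]  # hypothetical per-configuration costs
p = gibbs_weights(costs, beta=5.0)
```

The distributed aspect comes from implementing transitions of this chain as local user decisions; the sketch above only shows the distribution those transitions are engineered to reach.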
Article
Full-text available
Motivated by practical scenarios, we study congestion games with failures. We investigate two models. The first is congestion games with both resource and agent failures (CGRAF), where each agent chooses the same number of resources with the minimum expected cost. We prove that the game is a potential game and hence admits at least one pure-strategy Nash equilibrium (pure-NE). We also show that the Price of Anarchy and the Price of Stability are bounded (equal to 1 in some cases). The second model is congestion games with only resource failures (CG-CRF), where resources are provided in packages and their failures can be correlated with each other. Each agent can choose multiple packages for reliability's sake and utilize the surviving one having the minimum cost. CG-CRF is shown not to be a potential game. We prove that it admits at least one pure-NE by constructing one efficiently. Finally, we discuss various applications of these two games in the networking field. To the best of our knowledge, this is the first work studying congestion games with the coexistence of resource and agent failures, and we also give the first proof of the existence of a pure-NE in congestion games with correlated package failures.
Article
Full-text available
The Network Function Virtualization (NFV) paradigm has gained increasing interest in both academia and industry as it promises scalable and flexible network management and orchestration. In NFV networks, network services are provided as chains of different Virtual Network Functions (VNFs), which are instantiated and executed on dedicated VNF-compliant servers. The problem of composing those chains is referred to as the Service Chain Composition problem. In contrast to centralized solutions that suffer from scalability and privacy issues, in this paper we leverage non-cooperative game theory to achieve a low-complexity distributed solution to the above problem. Specifically, to account for selfish and competitive behavior of users, we formulate the service chain composition problem as an atomic weighted congestion game with unsplittable flows and player-specific cost functions. We show that the game possesses a weighted potential function and admits a Nash Equilibrium (NE). We prove that the price of anarchy (PoA) is upper-bounded, and also propose a distributed and privacy-preserving algorithm which provably converges towards a NE of the game in polynomial time. Finally, through extensive numerical results, we assess the performance of the proposed distributed solution to the service chain composition problem.
Conference Paper
Full-text available
Network Function Virtualization (NFV) is an emerging solution that aims at improving the flexibility, the efficiency and the manageability of networks, by leveraging virtualization and cloud computing technologies to run network appliances in software. Nevertheless, the “softwarization” of network functions imposes software reliability concerns on future networks, which will be exposed to software issues arising from virtualization technologies. In this paper, we discuss the challenges for reliability in NFVIs, and present an industrial research project on their reliability assurance, which aims at developing novel fault injection technologies and systematic guidelines for this purpose.
Article
Full-text available
We propose a natural model for agent failures in congestion games. In our model, each of the agents may fail to participate in the game, introducing uncertainty regarding the set of active agents. We examine how such uncertainty may change the Nash equilibria (NE) of the game. We prove that although the perturbed game induced by the failure model is not always a congestion game, it still admits at least one pure Nash equilibrium. Then, we turn to examine the effect of failures on the maximal social cost in any NE of the perturbed game. We show that in the limit case where failure probability is negligible new equilibria never emerge, and that the social cost may decrease but it never increases. For the case of non-negligible failure probabilities, we provide a full characterization of the maximal impact of failures on the social cost under worst-case equilibrium outcomes.
Book
Service Level Agreements for Cloud Computing provides a unique combination of business-driven application scenarios and advanced research in the area of service-level agreements for Clouds and service-oriented infrastructures. Current state-of-the-art research findings are presented in this book, as well as business-ready solutions applicable to Cloud infrastructures or ERP (Enterprise Resource Planning) environments. Service Level Agreements for Cloud Computing contributes to the various levels of service-level management from the infrastructure over the software to the business layer, including horizontal aspects like service monitoring. This book provides readers with essential information on how to deploy and manage Cloud infrastructures. Case studies are presented at the end of most chapters. Service Level Agreements for Cloud Computing is designed as a reference book for high-end practitioners working in cloud computing, distributed systems and IT services. Advanced-level students focused on computer science will also find this book valuable as a secondary text book or reference.
Article
We introduce a new class of games, congestion games with failures (CGFs), which allows for resource failures in congestion games. In a CGF, players share a common set of resources (service providers), where each service provider (SP) may fail with some known probability (which may be constant or depend on the congestion on the resource). For reliability reasons, a player may choose a subset of the SPs in order to try to perform his task. The cost of a player for utilizing any SP is a function of the total number of players using this SP. A main feature of this setting is that the cost for a player for successful completion of his task is the minimum of the costs of his successful attempts. We show that although CGFs do not, in general, admit a (generalized ordinal) potential function and the finite improvement property (and thus are not isomorphic to congestion games), they always possess a pure strategy Nash equilibrium. Moreover, every best reply dynamics converges to an equilibrium in any given CGF, and the SPs' congestion experienced in different equilibria is (almost) unique. Furthermore, we provide an efficient procedure for computing a pure strategy equilibrium in CGFs and show that every best equilibrium (one minimizing the sum of the players' disutilities) is semi-strong. Finally, for the subclass of symmetric CGFs we give a constructive characterization of best and worst equilibria.
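The distinctive cost rule of CGFs, where a player pays the minimum cost among its successful attempts, can be made concrete with a small expected-cost computation over independent SP failures. All numbers below are hypothetical; the all-fail penalty merely stands in for the disutility of an uncompleted task and is not part of the paper's model.

```python
from itertools import product

# Hedged illustration of the CGF cost structure: a player's realized
# cost is the minimum cost among its successful attempts.
def expected_min_cost(costs, fail_probs, penalty):
    """Enumerate failure outcomes of the chosen SPs (independent
    failures assumed); if all chosen SPs fail, the player pays
    `penalty` for the uncompleted task."""
    exp_cost = 0.0
    n = len(costs)
    for outcome in product([True, False], repeat=n):  # True = SP survives
        p = 1.0
        for ok, q in zip(outcome, fail_probs):
            p *= (1 - q) if ok else q
        survived = [c for c, ok in zip(costs, outcome) if ok]
        exp_cost += p * (min(survived) if survived else penalty)
    return exp_cost
```

For costs [2, 5], failure probabilities [0.5, 0.5], and penalty 10, the four equally likely outcomes average to 0.25 · (2 + 2 + 5 + 10) = 4.75, showing how adding a costlier backup SP can still lower a player's expected cost.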
Article
Network Functions Virtualization is an emerging initiative where standard IT virtualization evolves to consolidate network functions onto high volume servers, switches and storage that can be located anywhere in the network. In NFV, network services are built by chaining a set of Virtual Network Functions (VNFs) that must be allocated on top of the physical network infrastructure (commodity hardware). This challenge is commonly known as the NFV resource allocation problem, which is divided into two stages: 1) service chain composition and 2) service chain embedding. Up to now, existing approaches do not scale with regard to problem size. In this paper, we address this problem and propose CoordVNF, a heuristic method to coordinate the composition of VNF chains and their embedding into the substrate network. Evaluation results show that the heuristic is able to quickly solve the allocation problem even in substrate network topologies with hundreds of nodes.
Conference Paper
In today's commercial data centers, computation density grows continuously as the number of hardware components and of workloads, in units of virtual machines, increases. The service availability guaranteed by data centers depends heavily on the reliability of the physical and virtual servers. In this study, we conduct an analysis of 10K virtual and physical machines hosted in five commercial data centers over an observation period of one year. Our objective is to establish a sound understanding of the differences and similarities between failures of physical and virtual machines. We first capture their failure patterns, i.e., the failure rates, the distributions of times between failures and of repair times, as well as the time and space dependency of failures. Moreover, we correlate failures with resource capacity and run-time usage to identify the characteristics of failing servers. Finally, we discuss how virtual machine management actions, i.e., consolidation and on/off frequency, impact virtual machine failures.
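The basic failure-pattern metrics named above (failure rate, time between failures, repair time) can be computed from per-machine event logs. The helper below is a hedged sketch with assumed units and field names, not the study's methodology:

```python
from statistics import mean

def failure_stats(failure_times, repair_durations, window_hours):
    """Summarize one machine's failure pattern over an observation window.

    failure_times    -- sorted failure timestamps, in hours from window start
    repair_durations -- repair time in hours for each failure event
    window_hours     -- length of the observation period in hours
    """
    gaps = [b - a for a, b in zip(failure_times, failure_times[1:])]
    return {
        "failures_per_year": len(failure_times) / window_hours * 24 * 365,
        "mean_time_between_failures": mean(gaps) if gaps else None,
        "mean_time_to_repair": mean(repair_durations) if repair_durations else None,
    }
```

Aggregating such per-machine summaries separately over physical and virtual hosts is one simple way to surface the differences the study investigates.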
Article
We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences, with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.
Conference Paper
Game-theoretic modeling and equilibrium analysis of congestion games have provided insights into the performance of Internet congestion control, road transportation networks, etc. Despite this long history, very little is known about their transient (non-equilibrium) performance. In this paper, we are motivated to seek answers to questions such as how long it takes to reach equilibrium, and whether the system operates near equilibrium in the presence of dynamics, e.g., nodes joining or leaving. In this pursuit, we provide three contributions. First, a novel probabilistic model to capture realistic behaviors of agents, allowing for the possibility of arbitrariness in conjunction with rationality. Second, evaluation of (a) the time to converge to equilibrium under this behavior model and (b) the distance to Nash equilibrium. Finally, determination of the trade-off between the rate of dynamics and the quality of performance (distance to equilibrium), which leads to an interesting uncertainty principle. The novel technical ingredients involve analysis of the logarithmic Sobolev constant of a Markov process with a time-varying state space, and methodically this should be of broader interest in the context of dynamical systems.
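The rational-but-occasionally-arbitrary agent model can be mimicked in a toy singleton congestion game: each round one agent either picks a uniformly random resource (with probability eps) or best-responds to current congestion, and we track the distance to the balanced equilibrium load vector. All parameters and the distance measure below are illustrative assumptions, not the paper's model.

```python
import random

def noisy_best_response(n_players, n_resources, eps, rounds, seed=0):
    """Each round one agent is picked at random; with probability `eps` it
    acts arbitrarily (uniform random resource), otherwise it best-responds
    to the congestion caused by the other agents. Returns the per-round
    distance (max load deviation) from the balanced equilibrium profile."""
    rng = random.Random(seed)
    choice = [rng.randrange(n_resources) for _ in range(n_players)]
    target = n_players / n_resources            # balanced load per resource
    trajectory = []
    for _ in range(rounds):
        i = rng.randrange(n_players)
        loads = [0] * n_resources
        for c in choice:
            loads[c] += 1
        loads[choice[i]] -= 1                   # congestion seen by agent i
        if rng.random() < eps:
            choice[i] = rng.randrange(n_resources)      # arbitrary move
        else:
            choice[i] = min(range(n_resources), key=loads.__getitem__)
        final = [0] * n_resources
        for c in choice:
            final[c] += 1
        trajectory.append(max(abs(l - target) for l in final))
    return trajectory
```

With eps = 0 the trajectory settles at the balanced profile; raising eps keeps the system hovering at some positive distance from equilibrium, which is the rate-versus-quality trade-off the paper quantifies.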
Conference Paper
We introduce a new class of games, congestion games with failures (CGFs), which extends the class of congestion games to allow for facility failures. In a basic CGF (BCGF), agents share a common set of facilities (service providers), where each service provider (SP) may fail with some known probability. For reliability reasons, an agent may choose a subset of the SPs in order to try and perform his task. The cost of an agent for utilizing any SP is a function of the total number of agents using this SP. A main feature of this setting is that the cost for an agent for successful completion of his task is the minimum of the costs of his successful attempts. We show that although BCGFs do not admit a potential function, and thus are not isomorphic to classic congestion games, they always possess a pure-strategy Nash equilibrium. We also show that the SPs' congestion experienced in different Nash equilibria is (almost) unique. For the subclass of symmetric BCGFs we give a characterization of best and worst Nash equilibria. We extend the basic model by making task submission costly and define a model for taxed CGFs (TCGFs). We prove the existence of a pure-strategy Nash equilibrium for quasi-symmetric TCGFs, and present an efficient algorithm for constructing such a Nash equilibrium in symmetric TCGFs.
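The tax on task submission creates a reliability/cost trade-off that is easy to see in a symmetric toy instance. The sketch below is a hedged simplification, not the paper's exact model: identical SP costs, an independent constant failure probability, a per-attempt tax, and a penalty for total failure.

```python
def best_attempt_count(cost, fail_p, tax, penalty, max_k):
    """Each attempt costs `tax` up front and fails independently with
    probability `fail_p`; any success costs `cost`; if all k attempts fail
    the agent pays `penalty`. Returns the k in 1..max_k minimizing the
    agent's expected disutility."""
    def expected(k):
        p_all_fail = fail_p ** k
        return k * tax + (1.0 - p_all_fail) * cost + p_all_fail * penalty
    return min(range(1, max_k + 1), key=expected)
```

Adding attempts buys reliability (the penalty term shrinks geometrically) but each attempt is taxed linearly, so the optimum is interior rather than "use every SP", which is the tension TCGFs formalize.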
Problem statement for service function chaining
  • P Quinn
  • T Nadeau
Network function virtualization: Challenges and directions for reliability assurance
  • D Cotroneo
  • L Simone
  • A K Iannillo
  • A Lanzaro
  • R Natella
  • J Fan
Markov approximation for combinatorial network optimization
  • M Chen
  • S C Liew
  • Z Shao
  • C Kai