Conference Paper

On the Design of Overlay Networks for IP Links Fault Verification

DOI: 10.1109/GLOCOM.2008.ECP.468 Conference: Global Telecommunications Conference, 2008. IEEE GLOBECOM 2008. IEEE
Source: IEEE Xplore


Accurate fault detection and localization are essential to the efficient and economical operation of ISP networks, and they affect the performance of Internet applications such as VoIP and online gaming. Fault detection algorithms typically rely on spatial correlation to produce a set of fault hypotheses, whose size grows with lost and spurious symptoms and with the overlap among network paths. The network administrator is left to locate and verify these fault scenarios accurately, a tedious and time-consuming task. In this paper, we formulate the problem of designing infrastructure overlay networks for verifying the location of IP link faults, taking into account the cost of the debugging paths and the stress on the underlying IP links. We map the problem to an integer generalized flow problem and prove that it is NP-hard. We then relax the link stress constraint and formulate the resulting problem as a minimum cost circulation problem, which can be solved in polynomial time. We evaluate the fault verification and IP link coverage capabilities of overlay networks of various sizes and topologies using real-life Internet topologies. Finally, we identify some interesting research problems in this context.
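The ideas in the abstract (fault hypotheses from spatial correlation, verification by debugging paths, and stress on IP links) can be illustrated with a small Python sketch. This is not the paper's algorithm or its flow formulation. The toy topology, the path names, and the helper functions `fault_hypotheses`, `verifying_paths`, and `link_stress` are all hypothetical, and link stress is assumed here to mean the number of overlay paths that traverse an IP link.

```python
from collections import Counter

# Hypothetical toy data: each overlay (debugging) path is the set of
# IP links it traverses. Names and topology are illustrative only.
paths = {
    "p1": {"a-b", "b-c"},
    "p2": {"b-c", "c-d"},
    "p3": {"a-b", "b-d"},
    "p4": {"c-d"},
}


def fault_hypotheses(paths, failed, working):
    """Spatial correlation by elimination: a link is suspect if it lies on
    at least one failed path and on no path observed to be working."""
    suspect = set().union(*(paths[p] for p in failed))
    cleared = set().union(*(paths[p] for p in working))
    return suspect - cleared


def verifying_paths(paths, hypotheses):
    """For each suspect link, list the paths that cross it and no other
    suspect link; probing such a path tests that link in isolation."""
    return {
        link: sorted(p for p, links in paths.items()
                     if links & hypotheses == {link})
        for link in hypotheses
    }


def link_stress(paths):
    """Assumed definition: number of overlay paths using each IP link."""
    return Counter(link for links in paths.values() for link in links)


# Only p1 and p2 have reported symptoms; p3 and p4 are not yet probed.
hyp = fault_hypotheses(paths, failed={"p1", "p2"}, working=set())
probes = verifying_paths(paths, hyp)
stress = link_stress(paths)
```

In this toy case the three suspect links shrink to one once `p3` and `p4` are probed and found working, while no single path isolates `b-c`. That gap is the kind of coverage question an overlay design has to address, subject to path cost and link stress.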



Available from: Manimaran Govindarasu, Dec 04, 2014
  • ABSTRACT: In this paper we consider the problem of inferring link-level loss rates from end-to-end multicast measurements taken from a collection of trees. We give conditions under which loss rates are identifiable on a specified set of links. Two algorithms are presented to perform the link-level inferences for those links on which losses can be identified. The first, the minimum variance weighted average (MVWA) algorithm, treats the trees separately and then averages the results. The second, based on expectation-maximization (EM), merges all of the measurements into one computation. Simulations show that EM is slightly more accurate than MVWA, most likely due to its more efficient use of the measurements. We also describe extensions to the inference of link-level delay, inference from end-to-end unicast measurements, and inference when some measurements are missing.
    ACM SIGMETRICS Performance Evaluation Review 05/2002; 30(1). DOI:10.1145/511334.511338
  • ABSTRACT: Fault localization is a core element of fault management. Many fault reasoning techniques use a deterministic or probabilistic symptom-fault causality model for fault diagnosis and localization. A symptom-fault map is commonly used to describe symptom-fault causality in fault reasoning. However, fault reasoning systems that passively collect symptoms suffer from lost and spurious symptoms, which can significantly degrade the performance and accuracy of fault localization. In this paper, we propose an extended symptom-fault-action model that incorporates actions into the fault reasoning process to tackle this problem. This technique is called active integrated fault reasoning (AIR), and it contains three modules: fault reasoning, fidelity evaluation, and action selection. The corresponding fault reasoning and action selection algorithms are elaborated. A simulation study shows that both the performance and the accuracy of fault reasoning can be greatly improved by taking actions, especially when the rate of spurious and lost symptoms is high.
    Integrated Network Management, 2005. IM 2005. 2005 9th IFIP/IEEE International Symposium on; 06/2005
  • ABSTRACT: Today's IP backbones are provisioned to provide excellent performance in terms of loss, delay and availability. However, performance degradation and service disruption are likely in the case of failures such as fiber cuts and router crashes. In this paper, we investigate the occurrence of failures in Sprint's IP backbone and their potential impact on emerging services such as Voice-over-IP (VoIP). We first examine the frequency and duration of failure events derived from IS-IS routing updates collected from three different points in the Sprint IP backbone. We observe that link failures occur as part of everyday operation, and the majority of them are short-lived (less than 10 minutes). We also discuss various statistics, such as the distribution of inter-failure times and the distribution of link failure durations, which are essential for constructing a realistic link failure model. Next, we present an analysis of routing and service reconvergence time during a controlled link failure scenario in our backbone. Our results indicate that disruption to packet forwarding after link failures depends not only on routing protocol dynamics, but also on the design of routers' architectures and control planes. Thus our results offer insights into two basic components for defining network-wide availability, which we consider a more appropriate metric for service-level agreements to support emerging applications.