Conference Paper

A Fundamental Trade-Off between the Download Cost and Repair Bandwidth in Distributed Storage Systems

Dept. of Electr. Eng., Shahed Univ., Tehran, Iran
DOI: 10.1109/NETCOD.2010.5487685 Conference: Network Coding (NetCod), 2010 IEEE International Symposium on
Source: IEEE Xplore

ABSTRACT Distributed storage systems are mainly justified due to the limited amount of storage capacity and improving the reliability through distributing data over multiple storage nodes. However, it may happen the data is stored in unreliable nodes, while it is desirable the end user to have a reliable access to the stored data. So, in an event that a node is damaged, to prevent the system reliability to regress, it is necessary to regenerate a new node with the same amount of stored data as the damaged node to retain the number of storage nodes, thereby having the previous reliability. This requires the new node to connect to some of existing nodes, and downloads the required information, thereby occupying some bandwidth, called the repair bandwidth. On the other hand, it is more likely the cost of downloading varies across different nodes. This paper aims at investigating the fundamental trade-off between the download cost and repair bandwidth, and more importantly, it is shown any point on this curve can be achieved through the use of the so called generalized regenerating codes which is an enhancement to the regenerating codes introduced by Dimakis et al. in.

  • [Show abstract] [Hide abstract]
    ABSTRACT: Erasure code based distributed storage systems provide data robustness by storing encoded-fragments over servers. To maintain data robustness, a repair mechanism recovers a storage system from server failures by repairing encoded-fragments. For decentralized erasure code based storage systems, we propose a decentralized repair mechanism. Our mechanism has the following features. Firstly, an encoded-fragment is replenished by a combination of a number u of encoded-fragments that are randomly chosen. Secondly, the number u depends on the number of the available encoded-fragments and is independent of the pattern of missing encoded-fragments. Thirdly, multiple encoded-fragments are simultaneously replenished in parallel. We measure the communication cost in terms of the number u of required network connections for replenishing an encoded-fragment. We then conducted a numerical analysis by using traces of real systems. We find that our requirement on u is smaller than that from existing methods. Both theoretical and numerical results show that our decentralized repair mechanism outperforms existing ones in terms of the communication cost under the same consideration of efficiency cost for storage.
  • [Show abstract] [Hide abstract]
    ABSTRACT: Erasure codes are applied in distributed storage systems to provide data robustness against server failures by storing data redundancy among many storage servers. A (n, k) erasure code encodes a data object, which is represented as k elements, into a codeword of n elements such that any k out of these n codeword elements can recover the data object back. Decentralized erasure codes are proposed for distributed storage systems without a central authority. The characteristic of decentralization makes resulting storage systems more scalable and suitable for loosely-organized networking environments. However, different from conventional erasure codes, decentralized erasure codes trade some probability of a successful data retrieval for decentralization. Although theoretical lower bounds on the probability are overwhelming from a theoretical aspect, it is essential to know what the data retrievability is in real applications from a practical aspect. We focus on decentralized erasure code based storage systems and investigate data retrievability from both theoretical and practical aspects. We conduct simulation for random processes of storage systems to evaluate data retrievability. Then we compare simulation results and analytical values from theoretical bounds. By our comparison, we find that data retrievability is underestimated by those bounds. Data retrievability is over 99% in most cases in our simulations, where the order of the used finite field is an 8-bit prime. Data retrievability can be enlarged by using a larger finite field. We believe that data retrievability of decentralized erasure code based storage systems is acceptable for real applications.
    Software Security and Reliability (SERE), 2013 IEEE 7th International Conference on; 01/2013
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Regenerating codes are a class of recently developed codes for distributed storage that, like Reed-Solomon codes, permit data recovery from any subset of nodes within the -node network. However, regenerating codes possess in addition, the ability to repair a failed node by connecting to an arbitrary subset of nodes. It has been shown that for the case of functional repair, there is a tradeoff between the amount of data stored per node and the bandwidth required to repair a failed node. A special case of functional repair is exact repair where the replacement node is required to store data identical to that in the failed node. Exact repair is of interest as it greatly simplifies system implementation. The first result of this paper is an explicit, exact-repair code for the point on the storage-bandwidth tradeoff corresponding to the minimum possible repair bandwidth, for the case when . This code has a particularly simple graphical description, and most interestingly has the ability to carry out exact repair without any need to perform arithmetic operations. We term this ability of the code to perform repair through mere transfer of data as repair by transfer. The second result of this paper shows that the interior points on the storage-bandwidth tradeoff cannot be achieved under exact repair, thus pointing to the existence of a separate tradeoff under exact repair. Specifically, we identify a set of scenarios which we term as “helper node pooling,” and show that it is the necessity to satisfy such scenarios that overconstrains the system.
    IEEE Transactions on Information Theory 01/2012; 58(3):1837-1852. · 2.65 Impact Factor