A Fundamental Trade-Off between the Download Cost and Repair Bandwidth in Distributed Storage Systems
ABSTRACT Distributed storage systems are mainly justified due to the limited amount of storage capacity and improving the reliability through distributing data over multiple storage nodes. However, it may happen the data is stored in unreliable nodes, while it is desirable the end user to have a reliable access to the stored data. So, in an event that a node is damaged, to prevent the system reliability to regress, it is necessary to regenerate a new node with the same amount of stored data as the damaged node to retain the number of storage nodes, thereby having the previous reliability. This requires the new node to connect to some of existing nodes, and downloads the required information, thereby occupying some bandwidth, called the repair bandwidth. On the other hand, it is more likely the cost of downloading varies across different nodes. This paper aims at investigating the fundamental trade-off between the download cost and repair bandwidth, and more importantly, it is shown any point on this curve can be achieved through the use of the so called generalized regenerating codes which is an enhancement to the regenerating codes introduced by Dimakis et al. in.
- [Show abstract] [Hide abstract]
ABSTRACT: Erasure codes are applied in distributed storage systems to provide data robustness against server failures by storing data redundancy among many storage servers. A (n, k) erasure code encodes a data object, which is represented as k elements, into a codeword of n elements such that any k out of these n codeword elements can recover the data object back. Decentralized erasure codes are proposed for distributed storage systems without a central authority. The characteristic of decentralization makes resulting storage systems more scalable and suitable for loosely-organized networking environments. However, different from conventional erasure codes, decentralized erasure codes trade some probability of a successful data retrieval for decentralization. Although theoretical lower bounds on the probability are overwhelming from a theoretical aspect, it is essential to know what the data retrievability is in real applications from a practical aspect. We focus on decentralized erasure code based storage systems and investigate data retrievability from both theoretical and practical aspects. We conduct simulation for random processes of storage systems to evaluate data retrievability. Then we compare simulation results and analytical values from theoretical bounds. By our comparison, we find that data retrievability is underestimated by those bounds. Data retrievability is over 99% in most cases in our simulations, where the order of the used finite field is an 8-bit prime. Data retrievability can be enlarged by using a larger finite field. We believe that data retrievability of decentralized erasure code based storage systems is acceptable for real applications.Software Security and Reliability (SERE), 2013 IEEE 7th International Conference on; 01/2013
- [Show abstract] [Hide abstract]
ABSTRACT: Regenerating codes are a class of recently developed codes for distributed storage that, like Reed-Solomon codes, permit data recovery from any subset of nodes within the -node network. However, regenerating codes possess in addition, the ability to repair a failed node by connecting to an arbitrary subset of nodes. It has been shown that for the case of functional repair, there is a tradeoff between the amount of data stored per node and the bandwidth required to repair a failed node. A special case of functional repair is exact repair where the replacement node is required to store data identical to that in the failed node. Exact repair is of interest as it greatly simplifies system implementation. The first result of this paper is an explicit, exact-repair code for the point on the storage-bandwidth tradeoff corresponding to the minimum possible repair bandwidth, for the case when . This code has a particularly simple graphical description, and most interestingly has the ability to carry out exact repair without any need to perform arithmetic operations. We term this ability of the code to perform repair through mere transfer of data as repair by transfer. The second result of this paper shows that the interior points on the storage-bandwidth tradeoff cannot be achieved under exact repair, thus pointing to the existence of a separate tradeoff under exact repair. Specifically, we identify a set of scenarios which we term as “helper node pooling,” and show that it is the necessity to satisfy such scenarios that overconstrains the system.IEEE Transactions on Information Theory 01/2012; 58(3):1837-1852. · 2.62 Impact Factor
Conference Paper: Enabling node repair in any erasure code for distributed storage[Show abstract] [Hide abstract]
ABSTRACT: Erasure codes are an efficient means of storing data across a network in comparison to data replication, as they tend to reduce the amount of data stored in the network and offer increased resilience in the presence of node failures. The codes perform poorly though, when repair of a failed node is called for, as they typically require the entire file to be downloaded to repair a failed node. A new class of erasure codes, termed as regenerating codes were recently introduced, that do much better in this respect. However, given the variety of efficient erasure codes available in the literature, there is considerable interest in the construction of coding schemes that would enable traditional erasure codes to be used, while retaining the feature that only a fraction of the data need be downloaded for node repair. In this paper, we present a simple, yet powerful, framework that does precisely this. Under this framework, the nodes are partitioned into two types and encoded using two codes in a manner that reduces the problem of node-repair to that of erasure-decoding of the constituent codes. Depending upon the choice of the two codes, the framework can be used to avail one or more of the following advantages: simultaneous minimization of storage space and repair-bandwidth, low complexity of operation, fewer disk reads at helper nodes during repair, and error detection and correction.Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on; 09/2011